
[PLAT-440] roll up routing in metrics view #9180

Open
pjain1 wants to merge 12 commits into main from rollup_mv

Conversation

Member

@pjain1 pjain1 commented Apr 3, 2026

  • Add rollup table config to metrics view proto and YAML parser
  • Implement query routing: eligible rollups are selected based on grain derivability, dimension/measure coverage, timezone match, time range alignment, and time coverage
  • Prefer coarsest grain among eligible rollups; break ties by smallest data range
  • For no-time-range queries ("all data"), verify the rollup covers the base table's full range rather than skipping coverage checks

Checklist:

  • Covered by tests
  • Ran it and it works as intended
  • Reviewed the diff before requesting a review
  • Checked for unhandled edge cases
  • Linked the issues it closes
  • Checked if the docs need to be updated. If so, create a separate Linear DOCS issue
  • Intend to cherry-pick into the release branch
  • I'm proud of this work!

@pjain1 pjain1 closed this Apr 3, 2026
@pjain1 pjain1 reopened this Apr 3, 2026
@pjain1 pjain1 changed the title from "roll up routing in metrics view" to "[PLAT-440] roll up routing in metrics view" Apr 5, 2026
@pjain1 pjain1 requested a review from begelundmuller April 6, 2026 05:32
return wm.min, wm.max, true
}

mn, mx, err := e.fetchTimestamps(ctx, rollup.Database, rollup.DatabaseSchema, rollup.Table)
Member Author

@pjain1 pjain1 Apr 6, 2026

Instead of directly querying for watermarks here, I explored using the metrics_time_range resolver approach, so that we can rely on the user-defined cache_key_ttl and cache_key_sql to avoid fetching the watermark until needed. It also helps with automatically invalidating the cache when rollups are Rill-managed, since the resolver cache key will rely on the metrics view's status updated-on timestamp.

However, the one issue I see is that for external OLAP (the majority of cases), the metrics view cache is disabled by default, so the resolver ends up querying watermarks for all eligible rollups on every single query (contrast with the current behaviour, where the time range is only queried once for the time picker). For that case I was thinking of an L1 cache in this file with a simple time-based TTL of, say, 1 or 5 minutes, and only calling the metrics_time_range resolver once it expires. I already have the changes locally if needed. Thoughts?

Contributor

This sounds alright to me. I'm keen that we keep the time range caching as simple/standalone/re-usable as possible (and I'm worried about the implementation diverging too much from that in Timestamps / BindQuery). See my comments below and also in Slack.

Member Author

Actually, I already pushed this change.

Comment on lines +357 to +358
// IANA timezone the rollup was aggregated in; defaults to UTC
string timezone = 9;
Contributor

nits:

  1. in most other places we call it time_zone, not timezone
  2. move it up before/after the time_grain field to group the time-related fields

syntax = "proto3";
package rill.runtime.v1;

// note - if adding new grain, also update it in executor_rewrite_rollup.go and rollup.go
Contributor

There are many other places in the code besides these that also need to be updated if a new grain is added. I'm not sure it's worth calling out all the exact files; I think it's implicit that if you refactor an enum, you have to check all the code that uses it.

Database string `yaml:"database"`
DatabaseSchema string `yaml:"database_schema"`
TimeGrain string `yaml:"time_grain"`
Timezone string `yaml:"timezone"`
Contributor

nit: time_zone instead of timezone for consistency with other props

Dimensions *FieldSelectorYAML `yaml:"dimensions"`
Measures *FieldSelectorYAML `yaml:"measures"`
} `yaml:"rollups"`
WatermarkCacheTTL string `yaml:"watermark_cache_ttl"`
Contributor

Since we have a cache: key, it feels a little weird for this property not to be part of it. I understand it's different; it just feels a little weird. Mentioning it in case you have any better ideas.

Comment on lines +312 to +320
// Check time dimension column exists
if mv.TimeDimension != "" {
if !cols[strings.ToLower(mv.TimeDimension)] {
res.OtherErrs = append(res.OtherErrs, fmt.Errorf("rollup[%d]: time dimension column %q not found in table %q", i, mv.TimeDimension, rollup.Table))
}
}

// Check dimension columns exist
for _, dim := range rollup.Dimensions {
Contributor

  1. Since we add the default time dimension to the list of dimensions in the spec, doesn't the normal dimension check cover the time dimension as well?
  2. What about time dimensions that use a custom expression?

Comment on lines +319 to +333
// Check dimension columns exist
for _, dim := range rollup.Dimensions {
colName := dim
for _, d := range mv.Dimensions {
if strings.EqualFold(d.Name, dim) {
if d.Column != "" {
colName = d.Column
}
break
}
}
if !cols[strings.ToLower(colName)] {
res.OtherErrs = append(res.OtherErrs, fmt.Errorf("rollup[%d]: dimension column %q not found in table %q", i, colName, rollup.Table))
}
}
Contributor

What about dimensions that use expressions instead of a fixed column name?
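One way to make the check expression-aware, as a rough sketch. The `Dim` type, its fields, and `resolveDim` are hypothetical stand-ins for the spec types, not the PR's actual code:

```go
package main

import "fmt"

// Dim is a hypothetical stand-in for the spec's dimension type; the real
// type in this codebase carries more settings than shown here.
type Dim struct {
	Name       string
	Column     string
	Expression string
}

// resolveDim returns what the existence check should look at: a plain
// column name that can be compared against the table's column metadata,
// or an expression that can only be validated by probing the table.
func resolveDim(d Dim) (target string, isExpr bool) {
	if d.Expression != "" {
		return d.Expression, true // expressions can't be checked via metadata
	}
	if d.Column != "" {
		return d.Column, false
	}
	return d.Name, false // by default the dimension name doubles as the column
}

func main() {
	fmt.Println(resolveDim(Dim{Name: "country", Column: "country_code"}))    // country_code false
	fmt.Println(resolveDim(Dim{Name: "domain", Expression: "lower(domain)"})) // lower(domain) true
}
```

The expression branch could then feed into the same kind of probe query used for measures, instead of the column-lookup path.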

Comment on lines +350 to +352
if len(measureExprs) > 0 {
query := fmt.Sprintf(
"SELECT 1, %s FROM %s GROUP BY 1",
Contributor

Did you consider refactoring/re-using validateAllDimensionsAndMeasures and validateIndividualDimensionsAndMeasures to enable checking rollup tables as well?

Seems like it might be possible by passing a different table name to those functions, and passing an optional dimension/measure selector.

It would be nice to not have validation logic duplicated/diverge.
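For reference, the probe in the diff above works as a dry-run type check: selecting every measure expression under a GROUP BY makes the OLAP engine validate each aggregate against the rollup table. A sketch of assembling it; the WHERE 1=2 guard is my own addition to suggest keeping the probe from scanning rows, and may not match the PR:

```go
package main

import (
	"fmt"
	"strings"
)

// buildProbeQuery assembles a validation query in the same shape as the
// diff above. The "WHERE 1=2" clause is an assumption, not from the PR:
// it lets the engine type-check the expressions without reading data.
func buildProbeQuery(table string, measureExprs []string) string {
	return fmt.Sprintf(
		"SELECT 1, %s FROM %s WHERE 1=2 GROUP BY 1",
		strings.Join(measureExprs, ", "),
		table,
	)
}

func main() {
	fmt.Println(buildProbeQuery("events_daily", []string{"SUM(revenue)", "COUNT(*)"}))
	// → SELECT 1, SUM(revenue), COUNT(*) FROM events_daily WHERE 1=2 GROUP BY 1
}
```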

Comment on lines +13 to +17
type watermarkEntry struct {
min time.Time
max time.Time
fetchedAt time.Time
}
Contributor

I find the word "watermark" confusing here since it's actually a time range / timestamp set. In other places, the watermark is a single timestamp that defaults to MAX(<time dimension>), not a range.

In other places in this package, we call it "timestamps" (see Executor.Timestamps and metricsview.TimestampsResult).

Member Author

Hmm ok can use that.

Comment on lines +19 to +22
var watermarkCache = struct {
mu sync.Mutex
items map[string]watermarkEntry
}{items: make(map[string]watermarkEntry)}
Contributor

I would prefer we avoid having a global variable that caches data like this. The entire executor package is currently stateless, which is a very nice guarantee.

We have had a similar problem of needing to cache time ranges previously, which we solved with caching outside the package and optional binding – see calls to BindQuery for an example. Maybe something similar can be applied here?

It's also worth considering if/how this could be leveraged in the Timestamps function to ensure a consistent treatment of time ranges across the package.

Member Author

@pjain1 pjain1 Apr 7, 2026

Agree, and this has been entirely removed in favor of using the resolver, so it now uses the global resolver cache.
