Late materialization of dimension fields in time-series #135961
base: main
Conversation
```java
segments.set(groupId, docVector.segments().getInt(valuePosition));
docIds = bigArrays.grow(docIds, groupId + 1);
docIds.set(groupId, docVector.docs().getInt(valuePosition));
contextRefs.computeIfAbsent(shard, s -> {
```
Nit: Felix noticed separately that this is too slow, mind replacing with a check and insert?
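A minimal sketch of the suggested check-and-insert pattern, assuming a plain map keyed by shard ID (the types here are hypothetical stand-ins for the ones in the diff, not the PR's actual code):

```java
import java.util.HashMap;
import java.util.Map;

class CheckAndInsertSketch {
    // Stand-in for the per-shard context reference held in the real map.
    record ContextRef(int shard) {}

    private final Map<Integer, ContextRef> contextRefs = new HashMap<>();

    ContextRef getOrCreate(int shard) {
        // Plain check-and-insert: look the key up first and only insert on a
        // miss, instead of invoking computeIfAbsent on every call, which was
        // reported to be too slow on this hot path.
        ContextRef ref = contextRefs.get(shard);
        if (ref == null) {
            ref = new ContextRef(shard);
            contextRefs.put(shard, ref);
        }
        return ref;
    }
}
```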
sure, I pushed e333cdc
```java
try {
    blocks[offset] = new DocVector(shardRefs::get, shardVector, segmentVector, docVector, null).asBlock();
} catch (Exception e) {
    throw e;
}
```
What do we get by catching and rethrowing?
sorry, leftover
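For reference, with the leftover wrapper removed, the quoted line stands on its own:

```java
blocks[offset] = new DocVector(shardRefs::get, shardVector, segmentVector, docVector, null).asBlock();
```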
So awesome.
I assume this doesn't apply if there are functions applied to dimensions in the grouping, e.g. `TS metrics | STATS sum(rate(reqs)) BY substr(host, 3)`
This change adds an optimization rule for time-series queries that moves reading dimension fields from before the time-series operator to after it, so that each dimension field is read once per group instead of once per document. This is possible because dimension field values are identical across all documents that share the same `_tsid`, i.e., all documents in the same time series. For example:
Without this rule:
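A rough sketch of the shape (illustrative only: `host` and `pod` are assumed dimension fields, and the operator names approximate the physical plan rather than reproduce the PR's actual output):

```
TimeSeriesAggregate[... BY _tsid, ...]
  \_ ValuesSourceReader[host, pod]      <- dimensions read once per document
       \_ TimeSeriesSource[_tsid, ...]
```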
With this rule:
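Under the same assumptions, the rule hoists the dimension reads above the aggregation so they run once per group:

```
ValuesSourceReader[host, pod]           <- dimensions read once per group
  \_ TimeSeriesAggregate[... BY _tsid, ...]
       \_ TimeSeriesSource[_tsid, ...]
```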
Ideally, dimension fields should be read once per `_tsid` in the final result, similar to the fetch phase. Currently, dimension fields are read once per group key in each pipeline; if there are multiple time buckets, dimensions for the same `_tsid` are read multiple times. This can be avoided by extending `ValuesSourceReaderOperator` to understand the ordinals of `_tsid`. I will follow up with this improvement later to keep this PR small.
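For instance, in a query like the following (an illustrative example with assumed field names), each `host`'s dimension values are read once per time bucket rather than once overall:

```
TS metrics | STATS sum(rate(reqs)) BY host, bucket(@timestamp, 1 hour)
```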