Query performance optimization #1363

v0y4g3r · 2023-04-04T10:04:22Z

v0y4g3r
Apr 4, 2023
Maintainer

What type of enhancement is this?

Performance

What does the enhancement do?

Optimization rules

Top-N on timestamp DESC (according to stats info)
- split into time windows maintained in stats
Cost-based optimization
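
To make the top-N idea concrete, here is a minimal sketch of how time-window stats could let a scan stop early. It is only an illustration: the `Window` type, its fields and `top_n_latest` are hypothetical names, and it assumes the windows do not overlap in time.

```python
import heapq
from dataclasses import dataclass


@dataclass
class Window:
    # Hypothetical per-window stats: the time range plus the rows it holds.
    start: int  # inclusive
    end: int    # exclusive
    rows: list  # list of (ts, value) tuples


def top_n_latest(windows, n):
    """Return (the n newest rows, number of windows actually scanned)."""
    result = []
    scanned = 0
    # Order windows from newest to oldest using the stats only.
    for w in sorted(windows, key=lambda w: w.end, reverse=True):
        if len(result) >= n:
            break  # non-overlapping older windows cannot hold newer rows
        scanned += 1
        result = heapq.nlargest(n, result + w.rows, key=lambda r: r[0])
    return result, scanned
```

With three windows and `n = 2`, the oldest window is never read as long as the two newest windows together hold at least two rows.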

Data statistics & index

SST stats info collection on flush/compaction
- should the memtable also maintain stats during insertion?
Datanode-, region- and table-level stats based on SSTs
Index
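
A rough sketch of the stats flow described above, assuming the memtable does keep stats during insertion and hands them over at flush time. `Memtable`, `flush` and `table_stats` are hypothetical names for illustration, not existing APIs, and the stats are limited to min/max timestamp and row count.

```python
class Memtable:
    def __init__(self):
        self.rows = []
        self.min_ts = None
        self.max_ts = None

    def insert(self, ts, value):
        self.rows.append((ts, value))
        # Maintain stats incrementally so flush does not need another pass.
        self.min_ts = ts if self.min_ts is None else min(self.min_ts, ts)
        self.max_ts = ts if self.max_ts is None else max(self.max_ts, ts)

    def flush(self):
        """Return (sorted rows, stats) as a stand-in for writing one SST."""
        stats = {
            "min_ts": self.min_ts,
            "max_ts": self.max_ts,
            "num_rows": len(self.rows),
        }
        return sorted(self.rows), stats


def table_stats(sst_stats):
    """Merge per-SST stats into a region/table-level summary."""
    return {
        "min_ts": min((s["min_ts"] for s in sst_stats), default=None),
        "max_ts": max((s["max_ts"] for s in sst_stats), default=None),
        "num_rows": sum(s["num_rows"] for s in sst_stats),
    }
```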

Storage format

List/object data type
Time-series support in memtable and SST
- existing operators support composite data types

Implementation challenges

No response

evenyag · 2023-04-07T09:49:01Z

evenyag
Apr 7, 2023
Maintainer

Some Ideas

Normal

Write-through cache for object stores
Provide different scan modes
- Based on query type
- Parallel scan
- Skip merge and/or dedup if users can tolerate it
Make use of time window
- Stop scan earlier
- Cache
Improve compaction
Borrow optimization rules from other DBs
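
As an illustration of the scan-mode idea, the sketch below merges several sorted runs and lets the caller opt out of merge and/or dedup. The row shape `(key, seq, value)` and the `scan` function are assumptions made for this example; each run is assumed to be sorted by key, and a larger `seq` means a newer write.

```python
import heapq
from itertools import chain


def scan(runs, merge=True, dedup=True):
    """Scan sorted runs of (key, seq, value) rows in the requested mode."""
    if not merge:
        # Cheapest mode: no ordering or dedup guarantees at all.
        return list(chain.from_iterable(runs))
    # Order by key, newest sequence first within the same key.
    merged = heapq.merge(*runs, key=lambda r: (r[0], -r[1]))
    if not dedup:
        return list(merged)
    out = []
    for row in merged:
        # Keep only the first (newest) row for each key.
        if not out or out[-1][0] != row[0]:
            out.append(row)
    return out
```

A query that can tolerate duplicates or unordered output would pick the cheaper modes, while the default keeps the merge-and-dedup semantics.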

TopK

Latest points for each time-series
histogram statistics
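
For the "latest points" case, a minimal sketch of the computation itself, assuming rows arrive as `(tags, ts, value)` tuples in arbitrary order (the row shape and the function name are hypothetical):

```python
def latest_points(rows):
    """Return {tags: (ts, value)} holding the newest point of each series."""
    latest = {}
    for tags, ts, value in rows:
        current = latest.get(tags)
        if current is None or ts > current[0]:
            latest[tags] = (ts, value)
    return latest
```

This is the baseline that needs a full scan; a dedicated optimization would aim to avoid touching older data for series whose latest point is already known.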


waynexia · 2023-04-10T17:11:21Z

waynexia
Apr 10, 2023
Maintainer

Time-series format

Unleash the power unique to time-series data

Semantic type

To repeat it one more time: TIME INDEX / timestamp -> time index, PRIMARY KEY -> tag, and all other columns -> field.
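
The mapping can be written down as a small sketch. `SemanticType` and `classify` are hypothetical names used only for illustration:

```python
from enum import Enum


class SemanticType(Enum):
    TIME_INDEX = "time index"
    TAG = "tag"
    FIELD = "field"


def classify(columns, time_index, primary_key):
    """Map each column name to its semantic type."""
    result = {}
    for name in columns:
        if name == time_index:
            result[name] = SemanticType.TIME_INDEX
        elif name in primary_key:
            result[name] = SemanticType.TAG
        else:
            result[name] = SemanticType.FIELD
    return result
```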

Storage Format

In our current model, data is stored in the Parquet file plainly, i.e., row to row and column to column:

row_num  ts  tag1  tag2  field1  field2
0        x   x     x     x       x
1        x   x     x     x       x
2        x   x     x     x       x
...
8        x   x     x     x       x

Thanks to the powerful Parquet format, this method works fine in terms of overall disk size and has good write performance. However, it falls short on read performance. Queries like where tag1=A or order by tag1, tag2 (which are frequent in a TSDB) always come with complaints from users and the boss due to excessively long execution times.
A common way to optimize read performance is an inverted index. It enables quick lookup of time-series, but it is also very resource-consuming.
What we want, and what an inverted index provides, is the ability to locate one or more time-series within a high-cardinality dataset. The current format scatters the data of one time-series across several rows, so locating and extracting them always leads to a full-table scan (or a full row-group scan in Parquet, whatever). Thus, naturally, we think of preserving the series structure:

row_num  ts       tag1  tag2  field1   field2
0        [x,x,x]  x     x     [x,x,x]  [x,x,x]
1        [x,x,x]  x     x     [x,x,x]  [x,x,x]
2        [x,x,x]  x     x     [x,x,x]  [x,x,x]

The proposed format collapses ts (time index) and fields into lists while keeping tags unchanged. Reconsider the queries above: where tag1=A only needs to examine 3 rows, and the order by might be omitted when the data is persisted in the order of tag1, tag2. And we do not need to reassemble the time-series, since the data is still organized in series.
BTW, this can also reduce repeated tags, but considering Parquet might already use dictionary encoding, the gain might not be that considerable.
And there is one more bonus. Parquet is designed for storing nested types. Changing a field from a plain column to a list column brings no change to the generated Parquet data (if you ignore the repetition level, they are indistinguishable).
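
The conversion from the plain layout to the series layout can be sketched in a few lines. This is an in-memory illustration over plain dicts rather than Parquet, and `collapse` and its parameters are hypothetical names:

```python
from itertools import groupby


def collapse(rows, tag_cols, ts_col, field_cols):
    """Turn one-row-per-point data into one-row-per-series data."""

    def tags_of(row):
        return tuple(row[c] for c in tag_cols)

    # Sort by tags first, then by time, so each series is contiguous.
    ordered = sorted(rows, key=lambda r: (tags_of(r), r[ts_col]))
    out = []
    for tags, group in groupby(ordered, key=tags_of):
        points = list(group)
        series = dict(zip(tag_cols, tags))            # tags stay scalar
        series[ts_col] = [p[ts_col] for p in points]  # ts becomes a list
        for f in field_cols:
            series[f] = [p[f] for p in points]        # so does each field
        out.append(series)
    return out
```

Because the output is sorted by tags, a filter on a tag only has to look at one row per series, which is the property the proposal relies on.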

Memory Format

The next part is the memory format, which is related to query execution.
Some compute logic requires data to be delivered in time-series (that's why the order by tag1, tag2 clause is needed). And again, in our current model, data is represented plainly as RecordBatch. The illustration is the same as the previous one.
When one record batch contains only one series, the tags are exactly the same in every row. We have two ways to resolve this:

(a) wrap a DictionaryArray over the tag arrays, and
(b) inherit from the previously proposed storage format and wrap a ListArray over the value arrays.

Both of them can alleviate the repeated tags. But both of them require new features in the existing query engine, as these nested types are always second-class citizens. And (b) ListArray may require some changes to the logic: an operator cannot treat its input as is, but needs to extract and operate on the underlying data.
To gain the benefit without paying too much, I wanna propose a wilder third way. The problem here is that we usually only want to do computation on values, and then group by tags. That is to say, tags are sort of an "identifier" for the values. Then why don't we process them separately?

In short, this proposes a new divide plan that divides data into tags and values and connects them by an ID. Values are then calculated as before, and a join is inserted to convert the data back. I just came up with this, so I can only offer some qualitative analysis.

Is this viable?
- Likely. The divide plan needs to separate one input into two outputs. We can inject some marker to let it know which one to output when polled. And the join plan is a simple inner join. The other part is a new optimizer rule which can insert these new nodes.
What about the performance?
- The hash join probe should be cheap when the ID set is small. We can fall back to the current way if it's too large. This is a memory-time trade-off after all.
Then memory efficiency?
- It should be smaller than with the current method. There are no repeated time-series tags (before the final result).
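
The divide/compute/join idea can be sketched with plain Python structures. Everything here is an assumption made for illustration: the function names, the fixed `ts` and `value` column names, and the choice of an average as the example computation.

```python
def divide(rows, tag_cols):
    """Split rows into a tags side (id -> tags) and a values side."""
    ids = {}          # tags tuple -> series id
    tags_side = {}    # series id -> tags tuple
    values_side = []  # (series id, ts, value), carries no tags
    for r in rows:
        tags = tuple(r[c] for c in tag_cols)
        sid = ids.setdefault(tags, len(ids))
        tags_side[sid] = tags
        values_side.append((sid, r["ts"], r["value"]))
    return tags_side, values_side


def avg_by_id(values_side):
    """Example computation that only ever sees ids and values."""
    sums = {}
    for sid, _ts, v in values_side:
        s, n = sums.get(sid, (0.0, 0))
        sums[sid] = (s + v, n + 1)
    return {sid: s / n for sid, (s, n) in sums.items()}


def join_back(tags_side, agg, tag_cols):
    """Inner join on id; tags_side plays the hash-join build side."""
    return [
        dict(zip(tag_cols, tags_side[sid]), avg=a)
        for sid, a in agg.items()
        if sid in tags_side
    ]
```

The tags are materialized once per series on the build side and only reattached to the final result, which is where the memory saving would come from.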

Other notes

What to include in a single record batch?

I prefer to include only one time-series per record batch if the execution plan is based on time-series logic. This may increase the number of record batches processed and the number of function calls. But this is also a great place to apply pipeline execution (the data domain is very clear, and a pipeline breaker can help to concat batches), though we don't have one for now.
And when a plan requires its inputs to be time-series, feeding multiple series to it doesn't help a lot (except for reducing function calls).
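
A small sketch of why one series per batch keeps series-based operators simple: the operator below never has to check for series boundaries inside a batch. The names are hypothetical, and the input is assumed to be already sorted by tags and then by time.

```python
from itertools import groupby


def series_batches(rows, tag_cols):
    """Yield (tags, batch) pairs with exactly one series per batch."""

    def tags_of(row):
        return tuple(row[c] for c in tag_cols)

    for tags, group in groupby(rows, key=tags_of):
        yield tags, list(group)


def deltas(batch):
    """Per-series operator: difference between consecutive values."""
    return [b["value"] - a["value"] for a, b in zip(batch, batch[1:])]
```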

Is it okay to change our persist format?

I think it's acceptable. We don't change too much, it's easy to convert back, and the improvement it brings is worth it.
