
compaction: skip output level files with no data overlap #6021

Closed · huachaohuang wants to merge 1 commit

Conversation

huachaohuang (Contributor) commented Nov 9, 2019

The idea is to skip output level files that do not overlap with the data of the start level files during compaction. By an output level file overlapping with the data of a start level file, I mean that there is at least one key in the start level file that falls inside the range of the output level file. For example, suppose an output level file O has range ["e", "f"] with keys "e" and "f", and a start level file S has range ["a", "z"] with keys "a" and "z". Although the range of file O overlaps with the range of file S, file O does not overlap with the data of file S.
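
To make the definition concrete, here is a minimal sketch of the data-overlap test, assuming plain string keys and hypothetical helper names (DataOverlaps, FileRange) that are not part of this PR:

```cpp
#include <string>
#include <vector>

// Key range of an output level file, e.g. {"e", "f"} for file O above.
struct FileRange {
  std::string smallest;
  std::string largest;
};

// Returns true only if at least one real key from the start level files
// falls inside the output file's range. Range overlap alone (as with files
// S and O above) is not enough.
bool DataOverlaps(const std::vector<std::string>& start_level_keys,
                  const FileRange& output_file) {
  for (const auto& key : start_level_keys) {
    if (key >= output_file.smallest && key <= output_file.largest) {
      return true;
    }
  }
  return false;
}
```

For the example above, DataOverlaps({"a", "z"}, {"e", "f"}) returns false even though the two key ranges overlap.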

So when is this idea useful? We know that when we do sequential writes, all
generated SST files don't overlap with each other and all compactions are just
trivial moves, which is perfect. However, if we do concurrent sequential writes in
multiple ranges, life gets hard.

Take a relational database as an example. A common construction of the record keys is a table ID prefix concatenated with an auto-increment record ID (e.g.
"1_1" means table 1, record 1). Now let's see what happens if we insert records
into two tables (table 1 and table 2) in this order: "1_1", "2_1", "1_2",
"2_2", "1_3", "2_3", "1_4", "2_4" ...

Assume that RocksDB uses level compaction and each memtable and SST file
contains at most two keys. After putting eight keys, we get four level 0 files:

L0: ["1_1", "2_1"], ["1_2", "2_2"], ["1_3", "2_3"], ["1_4", "2_4"]
L1:

Then a level 0 compaction is triggered and we get this:

L0:
L1: ["1_1", "1_2"], ["1_3", "1_4"], ["2_1", "2_2"], ["2_3", "2_4"]

Then after putting four more keys:

L0: ["1_5", "2_5"], ["1_6", "2_6"], ["1_7", "2_7"], ["1_8", "2_8"]
L1: ["1_1", "1_2"], ["1_3", "1_4"], ["2_1", "2_2"], ["2_3", "2_4"]

Now if a level 0 compaction is triggered, according to the current
implementation, the start level inputs will be all files in level 0, which cover
range ["1_5", "2_8"], and the output level inputs will be ["2_1", "2_2"] and
["2_3", "2_4"] because these two files overlap with the range of the start
level. However, files ["2_1", "2_2"] and ["2_3", "2_4"] don't overlap with the data
of the start level inputs at all. So can we compact the start level inputs
without rewriting these two output level files? The answer is yes, as long as we
ensure that newly generated files don't overlap with existing files in the
output level. We can use the ranges of skipped output level files as split
points for the compaction output files. For this compaction, "2_1" will be a
split point, which prevents the compaction from generating a file like
["1_8", "2_5"]. With this optimization, we reduce two file reads and writes,
which is 1/3 of the IO in this compaction.
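
As a rough sketch of the split-point idea under the same assumptions (hypothetical names, not the code in this PR), each skipped output level file contributes its smallest key as a split point, so a newly written file is always closed before it could straddle a skipped file:

```cpp
#include <string>
#include <vector>

// Key range of an output level file that we decided to skip, e.g. the L1
// file ["2_1", "2_2"] in the example above.
struct SkippedFile {
  std::string smallest;
  std::string largest;
};

// Use the smallest key of every skipped file as a split point. The compaction
// closes its current output file before writing a key >= a split point, so it
// can never produce a file like ["1_8", "2_5"] that overlaps a skipped file.
std::vector<std::string> PickSplitPoints(
    const std::vector<SkippedFile>& skipped_files) {
  std::vector<std::string> split_points;
  for (const auto& f : skipped_files) {
    split_points.push_back(f.smallest);
  }
  return split_points;
}
```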

While the above example seems a bit artificial, I also experimented on a
real-world database with this idea. A simple sysbench insert benchmark on TiDB
shows more than 30% compaction IO reduction in some cases. I think other similar
databases can benefit from this optimization too.

Note that the current change is ugly, so just consider it as a proof of concept
implementation for now.

Possibly related: #5201 #6016 @yiwu-arbug @zhangjinpeng1987 @matthewvon

matthewvon (Contributor)

@huachaohuang I considered this technique. It appeared to have bad interactions with range deletes. I did not prove or disprove a range delete problem. I simply took a different approach. Simply suggesting you consider that potential problem.

zhangjinpeng87 (Contributor)

@huachaohuang this idea is very similar to my proposal tikv/rust-rocksdb#375

huachaohuang (Contributor, Author)

> It appeared to have bad interactions with range deletes.

@matthewvon can you give more details about the problem?

huachaohuang (Contributor, Author)

@zhangjinpeng1987 cool, I didn't notice that before.

Little-Wallace (Contributor)

This idea is cool. But PickCompaction must be called while holding the DB mutex. If you create an iterator and call Seek in PickCompaction, it will block all Get and Write requests.
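
For illustration of this concern, here is a generic sketch (not RocksDB code; the names are made up) of why doing I/O inside a function that runs under the DB mutex stalls other operations:

```cpp
#include <mutex>

std::mutex db_mutex;  // stands in for the DB-wide mutex held in PickCompaction

// Imagine this builds a table iterator and calls Seek(), i.e. it hits disk.
bool SlowOverlapCheck() { return true; }

void PickCompactionSketch() {
  std::lock_guard<std::mutex> lock(db_mutex);
  // Problem: disk I/O while the mutex is held. Any Get or Write that needs
  // db_mutex blocks until SlowOverlapCheck() finishes. The overlap check
  // would need to run outside the lock, or rely only on in-memory metadata.
  SlowOverlapCheck();
}
```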

huachaohuang (Contributor, Author)

@Little-Wallace that's a good point, just consider it an easy hack for now :)

matthewvon (Contributor)

@huachaohuang Range deletes for compactions are written within FinishCompactionOutputFiles(). There is no equivalent for flushes (I am rewriting that function to work with both flush and compaction). FinishCompactionOutputFiles() takes care of making sure range delete objects appropriately cover the key range of each .sst file being finished. My read of the code suggests that omitting files from the middle of a large compaction could leave the key range omitted without range delete coverage. Hence, the range delete is "lost" for those files simply removed from the large compaction.

zhangjinpeng87 (Contributor)

> @huachaohuang Range deletes for compactions are written within FinishCompactionOutputFiles(). There is no equivalent for flushes (I am rewriting that function to work with both flush and compaction). FinishCompactionOutputFiles() takes care of making sure range delete objects appropriately cover the key range of each .sst file being finished. My read of the code suggests that omitting files from the middle of a large compaction could leave the key range omitted without range delete coverage. Hence, the range delete is "lost" for those files simply removed from the large compaction.

How about disabling this optimization when there is a range deletion in the range?
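
A minimal sketch of that guard, assuming a hypothetical per-file num_range_deletions count (RocksDB tracks range tombstone counts in table properties, but the exact field and wiring here are assumed):

```cpp
#include <cstdint>
#include <vector>

struct InputFileInfo {
  uint64_t num_range_deletions;  // assumed per-file range tombstone count
};

// Only apply the skip optimization when no input file carries range
// deletions; otherwise fall back to the normal compaction path so that
// range tombstone coverage is not lost.
bool CanSkipNonOverlappingOutputs(const std::vector<InputFileInfo>& inputs) {
  for (const auto& f : inputs) {
    if (f.num_range_deletions > 0) {
      return false;
    }
  }
  return true;
}
```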

siying (Contributor) commented Nov 11, 2019

I had this PR: #1963, which I believe solves a similar problem with a different approach. I hesitated to push it through because I worried that in some special cases we would create a lot of small files that may never eventually be compacted together. I think this PR may have the same risk. I think the risk can be mitigated by looking at the size of the current output file: if the file is too small, we skip this optimization.
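
A rough sketch of that mitigation, with an assumed threshold (not taken from #1963 or this PR): only honor a split point if the output file being closed has already reached a reasonable size.

```cpp
#include <cstdint>

// Decide whether to cut the current output file at a split point. If the
// file is still too small relative to the target file size, keep writing
// (i.e. skip the optimization here) to avoid piling up tiny files.
bool ShouldCutAtSplitPoint(uint64_t current_output_bytes,
                           uint64_t target_file_size) {
  return current_output_bytes >= target_file_size / 2;  // threshold assumed
}
```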

huachaohuang (Contributor, Author)

So, I think we all understand the problem we want to solve here, and we have four different PRs to do it now. Let me put them together and see how we should proceed:

There are actually two paths here.

The first path is to provide some options that enable users to decide how to cut their compaction files. Since users have more application knowledge, they can do better than RocksDB can internally, but this also relies on users understanding their access pattern and getting it right. As for the implementation, #5201 is more flexible and #6016 is more convenient; I think we can do both. #5201 is a mechanism and #6016 is a strategy, like Comparator and BytewiseComparator.

The second path is to let RocksDB handle the problem internally so that users don't need to worry about it. As for the implementation, #1963 seems more straightforward since RocksDB already checks overlapped bytes at the grandparent level, while #6021 needs to build an iterator over input level files somewhere.

Both paths can result in creating a lot of small files, depending on the data pattern. IMO, the problems with small files are:

  • There is some overhead in managing a lot of files, whether in RocksDB or in the operating system. I'm not sure about the cost.
  • Sorted data scattered across many small files causes a lot of random IO. This may not be a problem for some applications. For example, if an application never does range scans across prefixes, cutting files according to prefixes doesn't hurt read performance.

OK, those are my opinions so far. I'm just trying to clear my mind here, but I actually have no direct stake in this problem, so I'm not going to work on it right now.

yiwu-arbug mentioned this pull request Nov 25, 2019