
compaction: skip output level files with no data overlap #6021

Closed · huachaohuang wants to merge 1 commit

Conversation

huachaohuang (Contributor) commented Nov 9, 2019

The idea is to skip output level files that do not overlap with the data of the start level files during compaction. By an output level file overlapping with the data of a start level file, I mean that there is at least one key in the start level file that falls inside the range of the output level file. For example, suppose an output level file O has range ["e", "f"] with keys "e" and "f", and a start level file S has range ["a", "z"] with keys "a" and "z". Although the range of file O overlaps with the range of file S, file O does not overlap with the data of file S.
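
To make the definition concrete, here is a minimal sketch of the data-overlap test, assuming plain string keys and hypothetical helper names (DataOverlaps, FileRange) that are not part of this PR:

```cpp
#include <string>
#include <vector>

// Key range of an output level file, e.g. {"e", "f"} for file O above.
struct FileRange {
  std::string smallest;
  std::string largest;
};

// Returns true only if at least one real key from the start level files
// falls inside the output file's range. Range overlap alone (as with files
// S and O above) is not enough.
bool DataOverlaps(const std::vector<std::string>& start_level_keys,
                  const FileRange& output_file) {
  for (const auto& key : start_level_keys) {
    if (key >= output_file.smallest && key <= output_file.largest) {
      return true;
    }
  }
  return false;
}
```

For the example above, DataOverlaps({"a", "z"}, {"e", "f"}) returns false even though the two key ranges overlap.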

So when is this idea useful? We know that when we do sequential writes, all
generated SST files don't overlap with each other and all compactions are just
trivial moves, which is perfect. However, if we do concurrent sequential writes in
multiple ranges, life gets hard.

Take a relational database as an example. A common construction of the record keys is a table ID prefix concatenated with an auto-increment record ID (e.g.
"1_1" means table 1, record 1). Now let's see what happens if we insert records
into two tables (table 1 and table 2) in this order: "1_1", "2_1", "1_2",
"2_2", "1_3", "2_3", "1_4", "2_4" ...

Assume that RocksDB uses level compaction and each memtable and SST file
contains at most two keys. After putting eight keys, we get four level 0 files:

L0: ["1_1", "2_1"], ["1_2", "2_2"], ["1_3", "2_3"], ["1_4", "2_4"]
L1:

Then a level 0 compaction is triggered and we get this:

L0:
L1: ["1_1", "1_2"], ["1_3", "1_4"], ["2_1", "2_2"], ["2_3", "2_4"]

Then after putting four more keys:

L0: ["1_5", "2_5"], ["1_6", "2_6"], ["1_7", "2_7"], ["1_8", "2_8"]
L1: ["1_1", "1_2"], ["1_3", "1_4"], ["2_1", "2_2"], ["2_3", "2_4"]

Now if a level 0 compaction is triggered, according to the current
implementation, the start level inputs will be all files in level 0, which cover
range ["1_5", "2_8"], and the output level inputs will be ["2_1", "2_2"] and
["2_3", "2_4"] because these two files overlap with the range of the start
level. However, files ["2_1", "2_2"] and ["2_3", "2_4"] don't overlap with the data
of the start level inputs at all. So can we compact the start level inputs
without rewriting these two output level files? The answer is yes, as long as we
ensure that newly generated files don't overlap with existing files in the
output level. We can use the ranges of skipped output level files as split
points for the compaction output files. For this compaction, "2_1" will be a
split point, which prevents the compaction from generating a file like
["1_8", "2_5"]. With this optimization, we reduce two file reads and writes,
which is 1/3 of the IO in this compaction.
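
As a rough sketch of the split-point idea under the same assumptions (hypothetical names, not the code in this PR), each skipped output level file contributes its smallest key as a split point, so a newly written file is always closed before it could straddle a skipped file:

```cpp
#include <string>
#include <vector>

// Key range of an output level file that we decided to skip, e.g. the L1
// file ["2_1", "2_2"] in the example above.
struct SkippedFile {
  std::string smallest;
  std::string largest;
};

// Use the smallest key of every skipped file as a split point. The compaction
// closes its current output file before writing a key >= a split point, so it
// can never produce a file like ["1_8", "2_5"] that overlaps a skipped file.
std::vector<std::string> PickSplitPoints(
    const std::vector<SkippedFile>& skipped_files) {
  std::vector<std::string> split_points;
  for (const auto& f : skipped_files) {
    split_points.push_back(f.smallest);
  }
  return split_points;
}
```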

While the above example seems a bit artificial, I also experimented on a
real-world database with this idea. A simple sysbench insert benchmark on TiDB
shows more than 30% compaction IO reduction in some cases. I think other similar
databases can benefit from this optimization too.

Note that the current change is ugly, so just consider it as a proof of concept
implementation for now.

Possibly related: #5201 #6016 @yiwu-arbug @zhangjinpeng1987 @matthewvon

matthewvon (Contributor)

@huachaohuang I considered this technique. It appeared to have bad interactions with range deletes. I did not prove or disprove a range delete problem. I simply took a different approach. Simply suggesting you consider that potential problem.

zhangjinpeng87 (Contributor)

@huachaohuang this idea is very similar to my proposal tikv/rust-rocksdb#375

huachaohuang (Contributor, Author)

> It appeared to have bad interactions with range deletes.

@matthewvon can you give more details about the problem?

huachaohuang (Contributor, Author)

@zhangjinpeng1987 cool, I didn't notice that before.

Little-Wallace (Contributor)

This idea is cool. But PickCompaction must be called while holding the DB mutex. If you create an iterator and call Seek in PickCompaction, it will block all Get and Write requests.
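
For illustration of this concern, here is a generic sketch (not RocksDB code; the names are made up) of why doing I/O inside a function that runs under the DB mutex stalls other operations:

```cpp
#include <mutex>

std::mutex db_mutex;  // stands in for the DB-wide mutex held in PickCompaction

// Imagine this builds a table iterator and calls Seek(), i.e. it hits disk.
bool SlowOverlapCheck() { return true; }

void PickCompactionSketch() {
  std::lock_guard<std::mutex> lock(db_mutex);
  // Problem: disk I/O while the mutex is held. Any Get or Write that needs
  // db_mutex blocks until SlowOverlapCheck() finishes. The overlap check
  // would need to run outside the lock, or rely only on in-memory metadata.
  SlowOverlapCheck();
}
```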

huachaohuang (Contributor, Author)

@Little-Wallace that's a good point, just consider it an easy hack for now :)

matthewvon (Contributor)

@huachaohuang Range deletes for compactions are written within FinishCompactionOutputFiles(). There is no equivalent for flushes (I am rewriting that function to work with both flush and compaction). FinishCompactionOutputFiles() takes care of making sure range delete objects appropriately cover the key range of each .sst file being finished. My read of the code suggests that omitting files from the middle of a large compaction could leave the key range omitted without range delete coverage. Hence, the range delete is "lost" for those files simply removed from the large compaction.

zhangjinpeng87 (Contributor)

> @huachaohuang Range deletes for compactions are written within FinishCompactionOutputFiles(). There is no equivalent for flushes (I am rewriting that function to work with both flush and compaction). FinishCompactionOutputFiles() takes care of making sure range delete objects appropriately cover the key range of each .sst file being finished. My read of the code suggests that omitting files from the middle of a large compaction could leave the key range omitted without range delete coverage. Hence, the range delete is "lost" for those files simply removed from the large compaction.

How about disabling this optimization when there is a range deletion in the range?
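
A minimal sketch of that guard, assuming a hypothetical per-file num_range_deletions count (RocksDB tracks range tombstone counts in table properties, but the exact field and wiring here are assumed):

```cpp
#include <cstdint>
#include <vector>

struct InputFileInfo {
  uint64_t num_range_deletions;  // assumed per-file range tombstone count
};

// Only apply the skip optimization when no input file carries range
// deletions; otherwise fall back to the normal compaction path so that
// range tombstone coverage is not lost.
bool CanSkipNonOverlappingOutputs(const std::vector<InputFileInfo>& inputs) {
  for (const auto& f : inputs) {
    if (f.num_range_deletions > 0) {
      return false;
    }
  }
  return true;
}
```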

siying (Contributor) commented Nov 11, 2019

I had this PR: #1963, which I believe solves a similar problem with a different approach. I hesitated to push it through because I worried that in some special cases we would create a lot of small files that may never eventually be compacted together. I think this PR may have the same risk. I think the risk can be mitigated by looking at the size of the current output file: if the file is too small, we skip this optimization.
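
A rough sketch of that mitigation, with an assumed threshold (not taken from #1963 or this PR): only honor a split point if the output file being closed has already reached a reasonable size.

```cpp
#include <cstdint>

// Decide whether to cut the current output file at a split point. If the
// file is still too small relative to the target file size, keep writing
// (i.e. skip the optimization here) to avoid piling up tiny files.
bool ShouldCutAtSplitPoint(uint64_t current_output_bytes,
                           uint64_t target_file_size) {
  return current_output_bytes >= target_file_size / 2;  // threshold assumed
}
```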

huachaohuang (Contributor, Author)

So, I think we all understand the problem we want to solve here, and we have four different PRs to do it now. Let me put them together and see how we should proceed:

There are actually two paths here.

The first path is to provide some options that enable users to decide how to cut their compaction files. Since users have more application knowledge, they can do better than RocksDB can internally, but this also relies on users understanding their access pattern and getting it right. As for the implementation, #5201 is more flexible and #6016 is more convenient; I think we can do both. #5201 is a mechanism and #6016 is a strategy, like Comparator and BytewiseComparator.

The second path is to let RocksDB handle the problem internally so that users don't need to worry about it. As for the implementation, #1963 seems more straightforward since RocksDB already checks overlapped bytes at the grandparent level, while #6021 needs to build an iterator over input level files somewhere.

Both paths can result in creating a lot of small files, depending on the data pattern. IMO, the problems with small files are:

  • There is some overhead in managing a lot of files, whether in RocksDB or in the operating system. I'm not sure about the cost.
  • Sorted data scattered across many small files causes a lot of random IO. This may not be a problem for some applications. For example, if an application never does range scans across prefixes, cutting files according to prefixes doesn't hurt read performance.

OK, those are my opinions so far. I'm just trying to clear my mind here, but I actually have no direct stake in this problem, so I'm not going to work on it right now.

yiwu-arbug mentioned this pull request Nov 25, 2019