Add some per key optimization for UDT in memtable only feature #13031

Closed
wants to merge 2 commits into from

@jowlyzhang jowlyzhang commented Sep 24, 2024

This PR adds some optimizations to the per-key handling of SST files for the user-defined timestamps in memtable only feature. CPU profiling shows this part is a big contributor to the regression. The optimization saves some string construction/destruction/appending/copying, as well as vector operations like reserve/emplace_back.

When iterating keys in a block, we need to copy some shared bytes from the previous key, put them together with the non-shared bytes, and find the right location to pad the min timestamp. Previously, we created a temporary local string buffer to first construct the key from its pieces, and then copied this local string's content into IterKey's buffer. To avoid this local string and the extra copy, instead of piecing together the key in a local string first, we now just track all the pieces that make up the key in a reused Slice array, and then copy the pieces in order into IterKey's buffer. Since the previous key must be kept intact while we are copying some shared bytes from it, we added a secondary buffer to IterKey and alternate between the primary and secondary buffers.
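
The approach described above can be illustrated with a small standalone sketch. It is not the RocksDB implementation: `TwoBufferKey`, `SetFromParts`, `SetNextKey`, and `kMinTs` are hypothetical names, `std::string_view` stands in for `Slice`, and the key layout is simplified to a user key followed by an 8-byte min timestamp (no internal key footer). It assumes the shared byte count never exceeds the previous key's length.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical stand-in for IterKey: two buffers, one of which holds the
// current key while the other receives the next one.
class TwoBufferKey {
 public:
  static constexpr std::size_t kMaxParts = 5;
  using Parts = std::array<std::string_view, kMaxParts>;

  std::string_view key() const { return bufs_[active_]; }

  // Copies the first `num_parts` pieces, in order, into the inactive buffer
  // and then makes that buffer the current key. Pieces may point into the
  // current key, which is not modified while copying.
  void SetFromParts(const Parts& parts, std::size_t num_parts) {
    assert(num_parts <= parts.size());
    std::string& dst = bufs_[1 - active_];
    dst.clear();  // typically keeps the capacity, so the buffer is reused
    for (std::size_t i = 0; i < num_parts; ++i) {
      dst.append(parts[i]);
    }
    active_ = 1 - active_;
  }

 private:
  std::string bufs_[2];
  std::size_t active_ = 0;
};

// Assumed 8-byte minimum timestamp (all zero bytes).
constexpr std::string_view kMinTs("\0\0\0\0\0\0\0\0", 8);

// Rebuilds a delta-encoded key: `shared` bytes from the previous key, then
// the non-shared bytes, then the padded min timestamp. No temporary string
// is built; the pieces are only tracked and copied once.
inline void SetNextKey(TwoBufferKey& k, std::size_t shared,
                       std::string_view non_shared) {
  const std::string_view prev = k.key();
  assert(shared <= prev.size());
  TwoBufferKey::Parts parts{};
  std::size_t n = 0;
  parts[n++] = prev.substr(0, shared);
  parts[n++] = non_shared;
  parts[n++] = kMinTs;
  k.SetFromParts(parts, n);
}
```

In this sketch, calling `SetNextKey(k, 0, "apple")` followed by `SetNextKey(k, 3, "ly")` leaves `k.key()` equal to `"apply"` plus eight zero bytes. The second call reads `"app"` from the buffer that holds the previous key while writing into the other buffer, which is why the two buffers alternate.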

Test plan:
Existing tests.

@jowlyzhang jowlyzhang marked this pull request as ready for review September 25, 2024 21:27
@facebook-github-bot
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor

@ltamasi ltamasi left a comment

Thanks a lot for improving this @jowlyzhang !

db/dbformat.cc Outdated

void IterKey::EnlargeSecondaryBufferIfNeeded(size_t key_size) {
// If size is smaller than buffer size, continue using current buffer,
// or the static allocated one, as default
Contributor

Very minor but seems to me the buffers are not actually statically allocated; maybe call them something like "fixed-size" or "inline"

@@ -562,18 +562,25 @@ inline uint64_t GetInternalKeySeqno(const Slice& internal_key) {
// allocation for smaller keys.
// 3. It tracks user key or internal key, and allow conversion between them.
class IterKey {
static constexpr char kTsMin[] = "\x00\x00\x00\x00\x00\x00\x00\x00";
Contributor

I think it would be nice to add the usual comment here about only 64-bit timestamps being supported currently.
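
One way to make the 64-bit assumption explicit is to pair the comment with a compile-time check. This is only a sketch: `kTimestampSize` and `MinTimestamp` are hypothetical names, and `std::string_view` is used in place of `Slice`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string_view>

// Only 64-bit (8-byte) timestamps are assumed in this sketch.
static constexpr char kTsMin[] = "\x00\x00\x00\x00\x00\x00\x00\x00";
constexpr std::size_t kTimestampSize = sizeof(std::uint64_t);

// The string literal carries a trailing '\0', so the array holds 9 chars.
static_assert(sizeof(kTsMin) - 1 == kTimestampSize,
              "kTsMin must hold exactly one 64-bit timestamp");

// Returns a view over the 8 timestamp bytes, excluding the terminator.
inline std::string_view MinTimestamp() {
  return std::string_view(kTsMin, kTimestampSize);
}
```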

db/dbformat.h Outdated
Comment on lines 839 to 840
char* secondary_buf_;
char space_for_secondary_buf_[39]; // Avoid allocation for short keys
Contributor

This probably wouldn't cause any issues in practice but since secondary_buf_ can potentially point to space_for_secondary_buf_, it would be nice to have these two ordered the other way around. (Technically, secondary_buf_ currently gets constructed before and destroyed after space_for_secondary_buf_.) Also, we could introduce a named constant for the size of the inline buffers (39).

Contributor Author

That's a good point, thank you for the suggestion!
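
For context, C++ initializes non-static data members in declaration order and destroys them in reverse order, so declaring the inline storage before the pointer that may refer to it keeps the storage alive for the pointer's whole lifetime. A minimal sketch of that ordering with a named size constant follows; `InlineBufferOwner` and `kInlineBufferSize` are hypothetical names, not the ones used in the PR.

```cpp
#include <cassert>
#include <cstddef>

class InlineBufferOwner {
 public:
  // Named constant instead of a bare 39.
  static constexpr std::size_t kInlineBufferSize = 39;

  InlineBufferOwner() : buf_(space_), buf_size_(sizeof(space_)) {}

  // Copying would leave buf_ pointing at another object's inline storage.
  InlineBufferOwner(const InlineBufferOwner&) = delete;
  InlineBufferOwner& operator=(const InlineBufferOwner&) = delete;

  bool UsesInlineBuffer() const { return buf_ == space_; }
  std::size_t capacity() const { return buf_size_; }

 private:
  // Declared first: constructed before, and destroyed after, buf_.
  char space_[kInlineBufferSize];
  char* buf_;
  std::size_t buf_size_;
};
```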

db/dbformat.h Outdated
// Use to track the pieces that together make the whole key. We then copy
// these pieces in order either into buf_ or secondary_buf_ depending on where
// the previous key is held.
Slice key_slices_[5];
Contributor

We could consider using std::array instead of a C-style array

db/dbformat.h Outdated
secondary_buf_ = space_for_secondary_buf_;
}
secondary_buf_size_ = sizeof(space_for_secondary_buf_);
key_size_ = 0;
Contributor

Would it make sense to clear key_size_ iff key_ points to the secondary buffer?

Contributor Author

Good catch! This is only supposed to be called when key_ points to secondary buffer, or during destruction. It's good to make a check for this.
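
The check discussed here can be sketched in isolation. This is an illustration under simplifying assumptions, not the code in `db/dbformat.h`: `KeyHolder` is a hypothetical class, the primary buffer is inline-only, and after a reset the key pointer is redirected to the inline secondary storage so it never dangles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

class KeyHolder {
 public:
  static constexpr std::size_t kInlineSize = 39;

  KeyHolder() = default;
  KeyHolder(const KeyHolder&) = delete;
  KeyHolder& operator=(const KeyHolder&) = delete;
  ~KeyHolder() { ResetSecondaryBuffer(); }

  std::size_t size() const { return key_size_; }

  // Stores a short key in the (inline-only) primary buffer.
  void SetInPrimary(const char* data, std::size_t n) {
    assert(n <= kInlineSize);
    std::memcpy(primary_, data, n);
    key_ = primary_;
    key_size_ = n;
  }

  // Stores a key in the secondary buffer, growing it on the heap if needed.
  void SetInSecondary(const char* data, std::size_t n) {
    if (n > secondary_size_) {
      ResetSecondaryBuffer();
      secondary_buf_ = new char[n];
      secondary_size_ = n;
    }
    std::memcpy(secondary_buf_, data, n);
    key_ = secondary_buf_;
    key_size_ = n;
  }

  // Frees any heap-backed secondary buffer. The key is cleared only if it
  // currently lives in the secondary buffer.
  void ResetSecondaryBuffer() {
    const bool key_in_secondary = (key_ == secondary_buf_);
    if (secondary_buf_ != secondary_space_) {
      delete[] secondary_buf_;
      secondary_buf_ = secondary_space_;
    }
    secondary_size_ = kInlineSize;
    if (key_in_secondary) {
      key_ = secondary_buf_;
      key_size_ = 0;
    }
  }

 private:
  char primary_[kInlineSize] = {};
  char secondary_space_[kInlineSize] = {};
  char* secondary_buf_ = secondary_space_;
  std::size_t secondary_size_ = kInlineSize;
  const char* key_ = primary_;
  std::size_t key_size_ = 0;
};
```

With this shape, resetting the secondary buffer while the key lives in the primary buffer leaves the key size untouched, while resetting it when the key lives in the secondary buffer clears the size to zero.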

db/dbformat.h Outdated
size_t actual_total_bytes = 0;
#endif // NDEBUG
for (size_t i = 0; i < num_key_slices; i++) {
size_t key_size = key_slices_[i].size();
Contributor

key_size might not be the best name for this variable; how about something like key_slice_size or slice_size?

Contributor Author

Good catch, the name is indeed confusing.

key_parts.emplace_back(slice_data, left_sz);
key_parts.emplace_back(min_timestamp);
key_parts.emplace_back(slice_data + left_sz, slice_sz - left_sz);
key_slices_[(*next_key_slice_idx)++] = Slice(slice_data, left_sz);
Contributor

We could assert that next_key_slice_idx is not null and that we don't overrun the key_slices_ buffer (i.e. that we don't end up with more than 5 parts)
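
A hedged sketch of what such assertions might look like, using `std::array` so the capacity is available through `size()`. The names `KeyParts`, `kMaxKeyParts`, and `AppendWithMinTimestamp` are hypothetical, and `std::string_view` stands in for `Slice`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string_view>

constexpr std::size_t kMaxKeyParts = 5;  // assumed upper bound on pieces
using KeyParts = std::array<std::string_view, kMaxKeyParts>;

// Splits `slice` at `left_sz` and records three pieces: the left part, the
// min timestamp, and the right part.
inline void AppendWithMinTimestamp(std::string_view slice, std::size_t left_sz,
                                   std::string_view min_timestamp,
                                   KeyParts* parts, std::size_t* next_idx) {
  assert(parts != nullptr);
  assert(next_idx != nullptr);
  assert(left_sz <= slice.size());
  // Three pieces are about to be written; make sure they all fit.
  assert(*next_idx + 3 <= parts->size());
  (*parts)[(*next_idx)++] = slice.substr(0, left_sz);
  (*parts)[(*next_idx)++] = min_timestamp;
  (*parts)[(*next_idx)++] = slice.substr(left_sz);
}
```

The assertions compile away under `NDEBUG`, so they add no per-key cost in release builds.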

@facebook-github-bot
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

Contributor

@ltamasi ltamasi left a comment

LGTM, thanks @jowlyzhang !

Comment on lines +911 to +915
if (key_ == secondary_buf_) {
key_size_ = 0;
}
Contributor

Should we have a similar check in ResetBuffer too (with buf_)?

@facebook-github-bot
Contributor

@jowlyzhang has updated the pull request. You must reimport the pull request before landing.

@facebook-github-bot
Contributor

@jowlyzhang has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@jowlyzhang merged this pull request in 32dd657.

@jowlyzhang jowlyzhang deleted the per_key_optimization branch October 4, 2024 16:51
jowlyzhang added a commit that referenced this pull request Oct 16, 2024
Summary:
This PR adds some optimizations to the per-key handling of SST files for the user-defined timestamps in memtable only feature. CPU profiling shows this part is a big contributor to the regression. The optimization saves some string construction/destruction/appending/copying, as well as vector operations like reserve/emplace_back.

When iterating keys in a block, we need to copy some shared bytes from the previous key, put them together with the non-shared bytes, and find the right location to pad the min timestamp. Previously, we created a temporary local string buffer to first construct the key from its pieces, and then copied this local string's content into `IterKey`'s buffer. To avoid this local string and the extra copy, instead of piecing together the key in a local string first, we now just track all the pieces that make up the key in a reused Slice array, and then copy the pieces in order into `IterKey`'s buffer. Since the previous key must be kept intact while we are copying some shared bytes from it, we added a secondary buffer to `IterKey` and alternate between the primary and secondary buffers.

Pull Request resolved: #13031

Test Plan: Existing tests.

Reviewed By: ltamasi

Differential Revision: D63416531

Pulled By: jowlyzhang

fbshipit-source-id: 9819b0e02301a2dbc90621b2fe4f651bc912113c
3 participants