Epic: scalable metadata key-value store in pageserver #7290
Comments
After a short discussion with John, I have two more design choices in mind:
An early draft of the proposal: https://www.notion.so/neondatabase/DRAFT-Generic-Key-Value-Storage-on-Page-Server-fbe03ae11d4a4eb7b60ef800f89f0fa3
The aux file part is:
Dropping the issue from my list because I don't think this will be implemented in the near future; feel free to reassign in the future :)
We can close this: the path we took for aux files makes sense and provides the primitives we would need in the future (sparse key spaces).
The pageserver is good at storing pages referenced by a block number, but not so good at storing generic key-value data.
There are two places we have large key-value maps without a good place to store them:
rel_size_to_key
A more scalable store will enable:
This epic assumes:
Goals:
Possible implementation
We could write a hash table that uses a page for each slot.
Scale: 1 million relation sizes at ~16 bytes per relation would use about 2,000 8 KiB pages.
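To make the sizing concrete, here is a minimal sketch of the arithmetic and of mapping a key to a slot page by hashing. All names (`pages_needed`, `page_for_key`, the 16-byte entry size) are illustrative assumptions, not the pageserver's actual types or format:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const PAGE_SIZE: usize = 8192; // pageserver block size
const ENTRY_SIZE: usize = 16;  // assumed ~16 bytes per relation-size entry

/// Pages needed to hold `n_entries` entries, ignoring per-slot overhead.
fn pages_needed(n_entries: usize) -> usize {
    let bytes = n_entries * ENTRY_SIZE;
    (bytes + PAGE_SIZE - 1) / PAGE_SIZE // round up
}

/// Pick the slot page holding a given key by hashing it.
fn page_for_key<K: Hash>(key: &K, n_pages: u64) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish() % n_pages
}

fn main() {
    // 1_000_000 * 16 bytes = 16 MB -> 1954 pages, i.e. "about 2000".
    let pages = pages_needed(1_000_000);
    assert_eq!(pages, 1954);
    let slot = page_for_key(&"rel_1663_16384", pages as u64);
    assert!(slot < pages as u64);
}
```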
Writes to pages will typically be repetitive: e.g. the logical size of a relation may be re-written very many times.
A single page per table can be used as a "superblock" that describes the number of slots that exist and the range of block numbers which contain them. This may be used to implement re-sharding of the table, as an optimization to avoid using a large page count for small databases (a very common case).
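The superblock idea above might look roughly like the following sketch: one fixed page records how many slot pages exist and where they live, so a small database can start with few pages and re-shard as it grows. Field and method names here are assumptions for illustration, not the actual on-disk format:

```rust
/// Hypothetical "superblock" for the per-timeline hash table.
#[derive(Debug, Clone, PartialEq)]
struct Superblock {
    version: u32,
    /// Number of slot pages currently in use.
    n_slots: u32,
    /// First block number of the range containing the slot pages.
    first_block: u32,
}

impl Superblock {
    /// Double the slot count, e.g. when average page occupancy gets too
    /// high. A real implementation would also rehash existing entries
    /// into the new, larger range of slot pages.
    fn reshard(&self) -> Superblock {
        Superblock {
            n_slots: self.n_slots * 2,
            ..self.clone()
        }
    }
}

fn main() {
    // A small database starts with a handful of pages (the common case),
    // avoiding a large fixed page count.
    let sb = Superblock { version: 1, n_slots: 8, first_block: 1 };
    let bigger = sb.reshard();
    assert_eq!(bigger.n_slots, 16);
    assert_eq!(bigger.first_block, sb.first_block);
}
```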
For storage, hash table slots will have a delta and value format. Runtime state will be used to bound the depth of deltas for each slot: the image value for the slot will be periodically written to enforce this bound. This will result in some I/O amplification for logical sizes compared with the current scheme of simply writing each size as an image every time, to a different page. We can control this I/O amplification by choosing the page count.
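The delta/image scheme described above can be sketched as follows: each slot accumulates deltas, and once the chain exceeds a bound we materialize a full image, trading some write amplification for a bounded read cost. The bound, record layout, and names (`MAX_DELTA_DEPTH`, `SlotRecord`) are illustrative assumptions:

```rust
use std::collections::BTreeMap;

const MAX_DELTA_DEPTH: usize = 4; // tunable bound on deltas per slot

#[derive(Debug)]
enum SlotRecord {
    /// Full image of all key/value pairs in this slot.
    Image(Vec<(String, u64)>),
    /// Incremental update: a single key/value write.
    Delta(String, u64),
}

struct Slot {
    records: Vec<SlotRecord>, // newest last; an Image resets the chain
}

impl Slot {
    fn write(&mut self, key: String, value: u64) {
        // Count deltas since the most recent image.
        let depth = self
            .records
            .iter()
            .rev()
            .take_while(|r| matches!(r, SlotRecord::Delta(..)))
            .count();
        if depth + 1 > MAX_DELTA_DEPTH {
            // Enforce the bound: replay and write a fresh image.
            let image = self.materialize_with(key, value);
            self.records.push(SlotRecord::Image(image));
        } else {
            self.records.push(SlotRecord::Delta(key, value));
        }
    }

    /// Replay the latest image plus subsequent deltas, applying one more
    /// write, to build a fresh image of the slot.
    fn materialize_with(&self, key: String, value: u64) -> Vec<(String, u64)> {
        let mut map = BTreeMap::new();
        for rec in &self.records {
            match rec {
                SlotRecord::Image(kvs) => map = kvs.iter().cloned().collect(),
                SlotRecord::Delta(k, v) => {
                    map.insert(k.clone(), *v);
                }
            }
        }
        map.insert(key, value);
        map.into_iter().collect()
    }
}

fn main() {
    // A logical size rewritten many times: the common repetitive case.
    let mut slot = Slot { records: vec![] };
    for i in 0..10 {
        slot.write("rel_1663_16384".to_string(), i);
    }
    // The trailing delta chain never exceeds the configured bound.
    let tail_deltas = slot
        .records
        .iter()
        .rev()
        .take_while(|r| matches!(r, SlotRecord::Delta(..)))
        .count();
    assert!(tail_deltas <= MAX_DELTA_DEPTH);
}
```

Choosing `MAX_DELTA_DEPTH` is the knob mentioned above: a deeper chain means fewer images (less write amplification) but more records to replay on read.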
Where a particular KV collection requires larger values than is typical (e.g. pg_stat can be multiple megabytes), we should use separate KV collections for the "big" values, to avoid incurring heavy write amplification from mixing them with more frequently updated smaller values.

So within each timeline, we would have three instances of this new hash table: