
Epic: scalable metadata key-value store in pageserver #7290

Closed
jcsp opened this issue Apr 2, 2024 · 5 comments
Labels: c/storage/pageserver (Component: storage: pageserver), t/feature (Issue type: feature, for new features or requests)

Comments

jcsp (Collaborator) commented Apr 2, 2024

The pageserver is good at storing pages referenced by a block number, but not so good at generic key-value data.

There are two places we have large key-value maps without a good place to store them:

  • "aux files", which store data that in vanilla postgres is stored as small files on local disk (see Aux files rfc #6613, which discusses a broader range of possible solutions for storing these)
  • Relation metadata, especially logical size, which we currently store in one page per relation (see rel_size_to_key)

A more scalable store will enable:

  • Robust deployment of logical replication features that require potentially huge numbers of aux files (100k+)
  • Support for very high relation counts, e.g. more than 1 million.

This epic assumes:

  • that the same underlying data structure will serve both purposes and be reusable in future: for example, if we wanted an LSN-to-timestamp index or some other pageserver-internal index, we could use it there too.
  • that our KV store will be layered on top of our existing page store, rather than inventing some totally separate storage stack.
  • that the values in the KV store will be of bounded size that reasonably fits inside a page (i.e. we will not have a layer that spreads values across pages)
  • that while the KV store might be large in terms of value counts, it will be small enough to fit in memory.

Goals:

  • Calculating logical size for N relations should not require reading N pages
  • Writing N aux files should not require writing N pages
  • Writing N aux files should not scale the size of a single page as O(N)

Possible implementation

We could write a hash table that uses a page for each slot.

Scale: 1 million relation sizes at ~16 bytes per relation is ~16 MB, or about 2000 8 KiB pages.
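
To make the sizing and slot mapping concrete, here is a minimal sketch assuming 8 KiB pages and a standard hasher; none of these names are existing pageserver APIs:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const PAGE_SIZE: usize = 8192; // assumed 8 KiB pages
const ENTRY_SIZE: usize = 16;  // ~16 bytes per relation-size entry

/// Pages needed for `n_entries` entries:
/// 1_000_000 * 16 B = 16 MB, i.e. roughly 2000 pages.
fn pages_needed(n_entries: usize) -> usize {
    (n_entries * ENTRY_SIZE + PAGE_SIZE - 1) / PAGE_SIZE
}

/// Pick the slot page that holds a given key.
fn page_for_key<K: Hash>(key: &K, n_pages: u64) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish() % n_pages
}

fn main() {
    let n_pages = pages_needed(1_000_000) as u64;
    println!("pages needed: {n_pages}"); // ~2000
    println!("slot page for rel 42: {}", page_for_key(&42u32, n_pages));
}
```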

Writes to these pages will typically be repetitive: e.g. the logical size of a relation may be rewritten many times.

A single page per table can be used as a "superblock" that describes the number of slots that exist and the range of block numbers which contain them. This may be used to implement re-sharding of the table, as an optimization to avoid using a large page count for small databases (a very common case).
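
A hypothetical superblock layout with the fields described above (the names are illustrative, not an existing pageserver structure):

```rust
/// One page per table acting as the "superblock".
struct Superblock {
    /// How many hash slots (pages) this table currently has.
    slot_count: u32,
    /// First block number of the contiguous range holding the slots.
    first_slot_blkno: u32,
    /// Bumped on every re-shard so readers can detect a resize.
    generation: u32,
}

/// Lookups read the superblock first, then hash the key into one of
/// `slot_count` pages starting at `first_slot_blkno`.
fn slot_blkno(sb: &Superblock, key_hash: u64) -> u32 {
    sb.first_slot_blkno + (key_hash % sb.slot_count as u64) as u32
}
```

With this, a re-shard could allocate a new slot range, rewrite the superblock with the new slot count and starting block, and bump the generation.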

For storage, hash table slots will have a delta and value format. Runtime state will be used to bound the depth of deltas for each slot: the image value for the slot will be periodically written to enforce this bound. This will result in some I/O amplification for logical sizes compared with the current scheme of simply writing each size as an image every time, to a different page. We can control this I/O amplification by choosing the page count.
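
A minimal sketch of how the delta depth could be bounded per slot, assuming a simple per-slot counter kept in runtime state (the type and constant names are assumptions, not pageserver code):

```rust
const MAX_DELTAS: u32 = 8; // assumed bound on delta-chain depth per slot

/// What gets written for one update of a slot.
enum SlotWrite {
    Delta(Vec<u8>), // small incremental record
    Image(Vec<u8>), // full materialized slot contents
}

/// Runtime-only bookkeeping for one slot.
struct SlotState {
    deltas_since_image: u32,
}

impl SlotState {
    /// Decide whether this update is stored as a delta or forces an image.
    fn record_write(&mut self, full_value: Vec<u8>, delta: Vec<u8>) -> SlotWrite {
        if self.deltas_since_image >= MAX_DELTAS {
            self.deltas_since_image = 0;
            SlotWrite::Image(full_value) // pays the I/O amplification mentioned above
        } else {
            self.deltas_since_image += 1;
            SlotWrite::Delta(delta)
        }
    }
}
```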

Where a particular KV collection requires larger values than typical (e.g. pg_stat can be multiple megabytes), we should use separate KV collections for the "big" values to avoid incurring heavy write amplification from mixing them with more frequently updated smaller values.

So within each timeline, we would have three instances of this new hash table (sketched below):

  • Relation metadata
  • Small Aux Files
  • Large Aux Files
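
A rough sketch of how the three collections could be kept disjoint, e.g. by giving each its own key prefix within the page store (purely illustrative; not an existing type):

```rust
/// The three per-timeline hash tables listed above.
enum KvCollection {
    RelMetadata,
    SmallAuxFiles,
    LargeAuxFiles,
}

/// Give each collection a disjoint key prefix so large, rarely updated
/// values never share pages with small, frequently updated ones.
fn key_prefix(c: &KvCollection) -> u8 {
    match c {
        KvCollection::RelMetadata => 0x01,
        KvCollection::SmallAuxFiles => 0x02,
        KvCollection::LargeAuxFiles => 0x03,
    }
}
```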
jcsp added the t/feature and c/storage/pageserver labels on Apr 2, 2024
skyzh (Member) commented Apr 2, 2024

After a short discussion with John, two more design choices come to mind:

  • Reuse the current architecture: maybe we can have a third value type for storing AUX values (we have delta + image today); an alternative design is a separate LSM tree. A possible shape of that third value type is sketched below.
  • Keep the current write path: for all data we want to write to the pageserver, the easiest design is to keep writing it through the safekeeper and applying it at the pageserver; an alternative design is to allow the compute node to write directly to the pageserver.
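
A rough sketch of what such a third value type could look like, purely as an illustration; the type and variant names are assumptions and do not correspond to existing pageserver code:

```rust
/// Existing record kinds plus a hypothetical third one for aux key-value data.
enum RecordKind {
    Image(Vec<u8>), // full page image (exists today)
    Delta(Vec<u8>), // delta record against a previous image (exists today)
    AuxKv {
        key: Vec<u8>,
        value: Option<Vec<u8>>, // None could represent a deletion
    },
}
```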

skyzh (Member) commented Apr 4, 2024

jcsp (Collaborator, Author) commented Apr 22, 2024

Aux file part is:

skyzh (Member) commented Oct 20, 2024

Dropping this issue from my list because I don't think it will be implemented in the near future; feel free to reassign it later :)

skyzh removed their assignment on Oct 20, 2024
jcsp (Collaborator, Author) commented Oct 21, 2024

We can close this: the path we took for aux files makes sense and provides the primitives we would need in future (sparse keyspaces).

jcsp closed this as completed on Oct 21, 2024