
Epic: scalable metadata key-value store in pageserver #7290

Closed
jcsp opened this issue Apr 2, 2024 · 5 comments
Labels: c/storage/pageserver (Component: storage: pageserver), t/feature (Issue type: feature, for new features or requests)

Comments

jcsp (Collaborator) commented Apr 2, 2024

The pageserver is good at storing pages referenced by a block number, but not so good at generic key-value data.

There are two places we have large key-value maps without a good place to store them:

  • "aux files", which store data that in vanilla postgres is stored as small files on local disk (see Aux files rfc #6613, which discusses a broader range of possible solutions for storing these)
  • Relation metadata, especially logical size, which we currently store in one page per relation (see rel_size_to_key)

A more scalable store will enable:

  • Robust deployment of logical replication features that require potentially huge numbers of aux files (100k+)
  • Support for very high relation counts, e.g. more than 1 million.

This epic assumes:

  • that the same underlying data structure will serve both purposes and be reusable in future: for example, if we wanted an LSN-to-timestamp index or some other pageserver-internal index, we could use it there too.
  • that our KV store will be layered on top of our existing page store, rather than inventing some totally separate storage stack.
  • that the values in the KV store will be of bounded size that reasonably fits inside a page (i.e. we will not have a layer that spreads values across pages)
  • that while the KV store might be large in terms of value counts, it will be small enough to fit in memory.

Goals:

  • Calculating logical size for N relations should not require reading N pages
  • Writing N aux files should not require writing N pages
  • Writing N aux files should not scale the size of a single page as O(N)

Possible implementation

We could write a hash table that uses a page for each slot.

Scale: 1 million relation sizes at ~16 bytes per relation is ~16 MB, or about 2000 8 KiB pages.
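
To make the sizing and slot mapping concrete, here is a minimal sketch assuming 8 KiB pages and a standard hasher; none of these names are existing pageserver APIs:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

const PAGE_SIZE: usize = 8192; // assumed 8 KiB pages
const ENTRY_SIZE: usize = 16;  // ~16 bytes per relation-size entry

/// Pages needed for `n_entries` entries:
/// 1_000_000 * 16 B = 16 MB, i.e. roughly 2000 pages.
fn pages_needed(n_entries: usize) -> usize {
    (n_entries * ENTRY_SIZE + PAGE_SIZE - 1) / PAGE_SIZE
}

/// Pick the slot page that holds a given key.
fn page_for_key<K: Hash>(key: &K, n_pages: u64) -> u64 {
    let mut h = DefaultHasher::new();
    key.hash(&mut h);
    h.finish() % n_pages
}

fn main() {
    let n_pages = pages_needed(1_000_000) as u64;
    println!("pages needed: {n_pages}"); // ~2000
    println!("slot page for rel 42: {}", page_for_key(&42u32, n_pages));
}
```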

Writes to these pages will typically be repetitive: e.g. the logical size of a relation may be rewritten many times.

A single page per table can be used as a "superblock" that describes the number of slots that exist and the range of block numbers which contain them. This may be used to implement re-sharding of the table, as an optimization to avoid using a large page count for small databases (a very common case).
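
A hypothetical superblock layout with the fields described above (the names are illustrative, not an existing pageserver structure):

```rust
/// One page per table acting as the "superblock".
struct Superblock {
    /// How many hash slots (pages) this table currently has.
    slot_count: u32,
    /// First block number of the contiguous range holding the slots.
    first_slot_blkno: u32,
    /// Bumped on every re-shard so readers can detect a resize.
    generation: u32,
}

/// Lookups read the superblock first, then hash the key into one of
/// `slot_count` pages starting at `first_slot_blkno`.
fn slot_blkno(sb: &Superblock, key_hash: u64) -> u32 {
    sb.first_slot_blkno + (key_hash % sb.slot_count as u64) as u32
}
```

With this, a re-shard could allocate a new slot range, rewrite the superblock with the new slot count and starting block, and bump the generation.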

For storage, hash table slots will have a delta and value format. Runtime state will be used to bound the depth of deltas for each slot: the image value for the slot will be periodically written to enforce this bound. This will result in some I/O amplification for logical sizes compared with the current scheme of simply writing each size as an image every time, to a different page. We can control this I/O amplification by choosing the page count.
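
A minimal sketch of how the delta depth could be bounded per slot, assuming a simple per-slot counter kept in runtime state (the type and constant names are assumptions, not pageserver code):

```rust
const MAX_DELTAS: u32 = 8; // assumed bound on delta-chain depth per slot

/// What gets written for one update of a slot.
enum SlotWrite {
    Delta(Vec<u8>), // small incremental record
    Image(Vec<u8>), // full materialized slot contents
}

/// Runtime-only bookkeeping for one slot.
struct SlotState {
    deltas_since_image: u32,
}

impl SlotState {
    /// Decide whether this update is stored as a delta or forces an image.
    fn record_write(&mut self, full_value: Vec<u8>, delta: Vec<u8>) -> SlotWrite {
        if self.deltas_since_image >= MAX_DELTAS {
            self.deltas_since_image = 0;
            SlotWrite::Image(full_value) // pays the I/O amplification mentioned above
        } else {
            self.deltas_since_image += 1;
            SlotWrite::Delta(delta)
        }
    }
}
```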

Where a particular KV collection requires larger values than typical (e.g. pg_stat can be multiple megabytes), we should use separate KV collections for the "big" values to avoid incurring heavy write amplification from mixing them with more frequently updated smaller values.

So within each timeline, we would have three instances of this new hash table (sketched below):

  • Relation metadata
  • Small Aux Files
  • Large Aux Files
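
A rough sketch of how the three collections could be kept disjoint, e.g. by giving each its own key prefix within the page store (purely illustrative; not an existing type):

```rust
/// The three per-timeline hash tables listed above.
enum KvCollection {
    RelMetadata,
    SmallAuxFiles,
    LargeAuxFiles,
}

/// Give each collection a disjoint key prefix so large, rarely updated
/// values never share pages with small, frequently updated ones.
fn key_prefix(c: &KvCollection) -> u8 {
    match c {
        KvCollection::RelMetadata => 0x01,
        KvCollection::SmallAuxFiles => 0x02,
        KvCollection::LargeAuxFiles => 0x03,
    }
}
```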
jcsp added the t/feature and c/storage/pageserver labels on Apr 2, 2024
skyzh (Member) commented Apr 2, 2024

After a short discussion with John, two more design choices come to mind:

  • Reuse the current architecture: maybe we can have a third value type for storing AUX values (we have delta + image today); an alternative design is a separate LSM tree. A possible shape of that third value type is sketched below.
  • Keep the current write path: for all data we want to write to the pageserver, the easiest design is to keep writing it through the safekeeper and applying it at the pageserver; an alternative design is to allow the compute node to write directly to the pageserver.
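
A rough sketch of what such a third value type could look like, purely as an illustration; the type and variant names are assumptions and do not correspond to existing pageserver code:

```rust
/// Existing record kinds plus a hypothetical third one for aux key-value data.
enum RecordKind {
    Image(Vec<u8>), // full page image (exists today)
    Delta(Vec<u8>), // delta record against a previous image (exists today)
    AuxKv {
        key: Vec<u8>,
        value: Option<Vec<u8>>, // None could represent a deletion
    },
}
```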

skyzh (Member) commented Apr 4, 2024

jcsp (Collaborator, Author) commented Apr 22, 2024

Aux file part is:

skyzh (Member) commented Oct 20, 2024

Dropping this issue from my list because I don't think it will be implemented in the near future; feel free to reassign it later :)

skyzh removed their assignment on Oct 20, 2024
jcsp (Collaborator, Author) commented Oct 21, 2024

We can close this: the path we took for aux files makes sense and provides the primitives we would need in future (sparse keyspaces).

jcsp closed this as completed on Oct 21, 2024