List / Sub-Table Friendly Operations #22

victorstewart · 2020-11-18T21:04:19Z

victorstewart
Nov 18, 2020

so hse_kvs_prefix_delete is pfx_len restricted, and that's great when a given subset of data is important enough to warrant it's own KVS. but in the general case of many lists all of varying prefix length, spanning multiple keys, as pressured by the HSE_KVS_VLEN_MAX value, it would be great to have a way to just prefix delete an entire list.

or maybe that's not technically practical?

i'd make the same point about a "variable-prefix read" (or cursor), but also assume there's a reason these abilities don't exist.

regardless thought it warranted a discussion?

tristan957 · 2020-11-18T21:18:08Z

tristan957
Nov 18, 2020
Maintainer

Hey again Victor,

What do you mean by list is this case?

I am also having trouble understanding what you mean here:

i'd make the same point about a "variable-prefix read" (or cursor), but also assume there's a reason these abilities don't exist.

0 replies

victorstewart · 2020-11-18T23:35:34Z

victorstewart
Nov 18, 2020
Author

basically what I was getting it is some "list friendly" operations.

so if you want to build a list datatype overtop the key/value interface, you're going to have to spread out the values in that list over many key/value pairs as it grows beyond the HSE_KVS_VLEN_MAX limit of one. I write a list blueprint at the key for the list, "listkeyname", and then grow list entries into segments like "listkeyname.0", "listkeyname.1", "listkeyname.2" etc

and just some general observations I had while implementing this.

one super convenient operation would be renaming keys in a KVS (with overwriting allowed).

take the situation where I've now deleted all the values in "listkeyname.1", and want to move "listkeyname.2" into "listkeyname.1". at present I have to pull the value of "listkeyname.2" out, delete it, and then write it into "listkeyname.1". but theoretically I could just do an overwriting rename operation.

another as I mentioned in the first posting, would be the ability to delete by prefix in a KVS of pfx_len == 0. as hse_kvs_prefix_delete reads...

passes a filter with a length equal to the KVS's key prefix length

I do have some important lists and subtables (again, logical objects i've built over the key/value interface) in hot paths, that I've put into their own KVSs, but I have tons more lists that reside in a general purpose KVS of pfx_len == 0. And at the moment to delete them I have to do that key by key.

if any others come to mind i'll share them.

0 replies

tristan957 · 2020-11-19T00:46:59Z

tristan957
Nov 19, 2020
Maintainer

If I am understanding you correctly @victorstewart, HSE_KVS_VLEN_MAX is not the maximum number of kv pairs. It is the maximum length that a value in a kv pair could be. Maybe I misunderstood your first paragraph.

My colleagues might have more to add on this subject, but at present and probably for always a KV tuple is immutable. The key cannot change and the value cannot change for that specific tuple. You can call hse_kvs_put() with the same key and a different value, but that will exist as a different tuple. When you call something like hse_kvs_delete(), the KV tuple is given a special value called a tombstone. This tombstone is a marker that a value has been deleted, so when compaction occurs, the compaction process knows it can drop the old value if nothing has a view of the old value. So given this logic, I don't think a rename operation could be done with better performance than the process you described above, and I am not so sure that there would be a lot of use-cases for such an API in HSE.

You seem really familiar with KV stores though. Do you have any APIs for other KV stores that implement this kind of idea or other list-like APIs in general?

As for your last idea of would be the ability to delete by prefix in a KVS of pfx_len == 0, I am not familiar with the the implementation of prefixes. I will ask around tomorrow, and get back to you. I am thinking that list-like data in a KVS with pfx_len=0 is probably an antipattern, but to be truthful, I have not modeled any data within HSE yet. Maybe it would help if you could provide your reasoning for storing your data like this.

Have you read the KVS/prefix section in our best practices section of the wiki?

https://github.com/hse-project/hse/wiki/Best-Practices#keys-and-kvs

Are you hitting the limit of maximum amount of KVSs allowed in a KVDB by chance?

0 replies

davidboles · 2020-11-19T03:03:36Z

davidboles
Nov 19, 2020
Collaborator

@victorstewart - I think it would be helpful to rewind to the problem you are actually looking to solve. The combination of that problem and the form of the HSE API may well have sent you down one path while another might better address your need. If that's not the case, then we can help with either using what facilities are there or take into consideration extensions.

What can you say about the following:

Expected # of elements of lists (multiple values if list length has a multi-modal distribution)
Size of the elements of the list (the application-level values)
Expected number of lists
Lifetime of a particular list
Insert behavior (FIFO pattern?, LIFO pattern?, random insert/delete, add/delete from the tail/head?)
Query behavior for lists (head/tail only?, n-th?, etc.)

Answers to these things, and related items that the above questions bring to mind for you, will help.

0 replies

victorstewart · 2020-11-19T12:51:40Z

victorstewart
Nov 19, 2020
Author

@tristan957

right I know that HSE_KVS_VLEN_MAX is the maximum length of a value stored at a key. so I guess an overwriting hse_kvs_put() also generates a tombstone? okay if a rename theoretically couldn't be done with better performance, that was the only question I was posing.

on the prefixes, I assume that when no prefix is set, that the hashed key values are uniformly distributed about the keyspace, but a prefix would provide no better performance than individually getting then putting each value at a new key. again just thought it worth mentioning.

so the reason I built the list like this is because practically all items will be below the HSE_KVS_VLEN_MAX threshold. so one specific example of a list i have, i can store 43K aligned values inside of one HSE_KVS_VLEN_MAX. thus, compacting them into a single value does a lot to avoid practical keyspace exhaustion. Other situations maybe I'll get 10K in one, still much better than individuals keys for each list value.

and for the time being I artificially limited all values stored in my database to HSE_KVS_VLEN_MAX for ease. but could similarly span them over multiples keys in the future.

at the moment I'm just using one KVDB, but of course could grow to 8, and 16 KVSs per. and i'm not hitting any limits yet.

as i stated i have some very important hotpath lists and subtables that i've created custom KVSs for, to exploit the prefix performance benefits on access. but in the general case i can't be creating custom KVSs per list or subtable.

that said the performance with the current API is fantastic regardless! just thought it worth mentioning the subject of creating logic data structures on top of key/value pairs.

0 replies

victorstewart · 2020-11-19T13:49:06Z

victorstewart
Nov 19, 2020
Author

@davidboles

important to note firstly that, in probably MOST cases, every hot list not important enough for it's own KVS will be able to fit all its items into a single HSE_KVS_VLEN_MAX, thus this discussion was a bit more in the abstract. (but i still of course needed to implement logic to handle those unlikely cases where they don't for safety.. and that's where this thoughts came to mind). so i'm 100% good with the current interface and face no issues with it.

and all these points almost equally apply to the concept of subtables, where you must spread each sub entry onto its own key, thus the logical object grows over many more keys than than a list. but in those cases really the only "missing" abilities would be delete all or get all, (aka prefixed based operations when a KVS has not defined prefix). (and of course again I have some very important ones that get their own KVSs). (and obviously a "get all" isn't as practical as a delete all given the needed receive buffer size is unknown). both delete all and get all currently require a slow cursor iteration. but again, this is fine because if any where in a hot enough path you just upgrade them to their own KVS.

so i only provide the below responses because maybe they trigger some productive thoughts, but as far as I can see, all are satisfied by the current interface.

Expected # of elements of lists (multiple values if list length has a multi-modal distribution)

the larger they are the colder they are. so i have some lists that will grow by maybe +100K items every 3 years. fitting about 10K values per HSE_KVS_VLEN_MAX. these are essentially "append to tail only". other cold ones maybe +100 every 3 years fitting maybe 500 per HSE_KVS_VLEN_MAX.

other lists and subtables that are probed often to semi-often (but not frequently) are no more than dozens to 100s of items.

Size of the elements of the list (the application-level values)

either in the range of 100B-200B, or 2KB-5KB.

Expected number of lists

a dozen or so application wide lists.

ballpark of 15 (lists + subtables) per user. plus another batch with variable scaling (will begin around 300 per user approaching 1 per user over infinite time).

Lifetime of a particular list

all either permanent (95%+) or semi-permanent.

Insert behavior (FIFO pattern?, LIFO pattern?, random insert/delete, add/delete from the tail/head?)

insert to head or tail. random deletes by value. delete from head or tail. delete all.

Query behavior for lists (head/tail only?, n-th?, etc.)

get all. get all and delete all. getting n items from the head or tail. checking for the existence of an item in a list, so iterative from head.

0 replies

smoyerx · 2020-11-20T19:14:20Z

smoyerx
Nov 20, 2020
Collaborator

@victorstewart if I understand correctly, you’re mapping Redis semantics to HSE and this question is specifically around implementing Redis lists, correct? I am far from being a Redis expert, but after reading the Redis List description let me propose an example mapping to HSE, and then outline a potential optimization. In doing so, I’m using the notation established in the “Concepts” section of the HSE wiki.

We define three KVS, two of which (MetaKvs and KeyKvs) you’ll all use to support all Redis data types in your system, and one (ListKvs) for storing all Redis list data.

MetaKvs stores various metadata values for implementing your system. MetaKvs stores KV pairs with unsegmented keys (pfx_len=0) where the key is the name of the piece of metadata, and the value is the current value for that piece of metadata. In the examples below, we’ll store the key “keyID-next”, whose value is a uint64 initially set (put) to 0 when the system is first initialized.

KeyKvs maps Redis keys to a unique identifier. We do this both to support rename and to save storage space, as will become clear. KeyKvs stores KV pairs with unsegmented keys (pfx_len=0) where the key is just the Redis key, and the value is a tuple (keyID, keyType), where keyID is the unique identifier for that Redis key, and keyType is its Redis data type (e.g., “L” for list). To generate keyID, we get, increment, and put (in a transaction) the value for key “keyID-next” in MetaKvs.

ListKvs contains the data for all Redis lists. ListKvs stores KV pairs with multi-segment keys of the form (listID, typeID, elemID) where

listID and elemID are identifiers (uint64) for Redis lists and list elements, respectively
typeID is the entry (KV pair) type, either “D” for data or “I” for index
the key prefix for ListKvs is listID, so pfx_len=len(listID)=8
the ordering of the segments (listID, typeID, elemID) is chosen to 1) lexicographically group entries in ListKvs by listID first, and then by the type of entry for a given listID, which will deliver best performance for how they will be accessed (always group things that will be accessed together / consecutively), and 2) to support prefix-delete(listID) so that all data related to a list can be deleted in a single operation.

Values for KV pairs in ListKvs are defined as follows

Values for keys (listID, D, elemID) are the actual Redis element values (except for the distinguished keys defined below)
Values for keys (listID, I, elemID) are of the form (prevID, nextID), where prevID and nextID are identifiers (uint64) for the previous and next elements from elemID in the list listID

Furthermore

(listID, D, 0) is a distinguished key whose value is the elemID for the head of list listID
(listID, D, 1) is a distinguished key whose value is the elemID for the tail of list listID
(listID, D, 2) is a distinguished key whose value is the elemID that will be assigned to the next new element added to list listID

EXAMPLE: to create a new Redis list whose Redis key is “mylist” and whose only element is “a” (e.g., LPUSH mylist a), do all the following in a transaction.

KeyKvs: get(“mylist”) to confirm the Redis key doesn’t already exist (I’m assuming here it doesn’t)
MetaKvs: get(“keyID-next”) to get an identifier (listID) for “mylist”, increment the returned value, and put(“keyID-next”, incremented value)
KeyKvs: put(“mylist”, (listID, “L”))
ListKvs: put((listID, D, 0), 100); put((listID, D, 1), 100); put((listID, D, 2), 101)

Above we are initializing the new listID in ListKvs. I am starting elemID values at 100 to leave (lots) of room for more distinguished keys in ListKvs if needed in the future.

ListKvs: put((listID, D, 100), “a”); put((listID, I, 100), (0, 0))

Above we are storing the Redis list element. Since this is the only element in the list, we set its (prevID, nextID) to (0, 0) indicating no previous or next element in the list.

EXAMPLE: to add element “b” to the tail of existing Redis list “mylist” (e.g., RPUSH mylist b), do all the following in a transaction.

KeyKvs: get(“mylist”) to get the identifier (listID) and Redis data type (“L”) for Redis list “mylist” (I’m assuming here it exists and is of the correct type)
ListKvs: get((listID, D, 2) to get newID; put((listID, D, 2), newID+1)

Above we get an identifier for the new element and increment it for future use.

ListKvs: get((listID, D, 1)) to get tailID; put((listID, D, newID), “b”); put((listID, I, newID), (tailID, 0))

Above we add the new element to ListKvs with its prevID pointing to the former tail element (tailID)

ListKvs: get((listID, I, tailID)) to get the prevID from the former tail; put((listID, I, tailID), (prevID, newID))

Above we make the former tail element point to the new tail (newID)

ListKvs: put((listID, D, 1), newID)

Above we update the distinguished key pointing to the tail of list listID.

EXAMPLE: to rename Redis list “mylist” to “todolist” (e.g., RENAME mylist todolist), do all the following in a transaction.

KeyKvs: get(“mylist”) to get the identifier (listID) and Redis data type for “mylist” (I’m assuming here it exists and is type “L”)
KeyKvs: put(“todolist”, (listID, “L”)); del(“mylist”)

In an actual implementation of rename you need to check for “todolist” in KeyKvs and if it exists then delete all related data as determined by its Redis data type. You want to structure your KVS key prefixes so you can do this with a single prefix delete. For example, if “todolist” were an existing Redis list, we could delete all associated data with a single prefix-delete(todolistID) in ListKvs.

The above examples use a doubly linked list of elements, which the Redis docs imply is how they implement it. There are numerous other ways to structure this data to potentially achieve better performance, such as grouping (prevID, nextID) within a single value.

The main point here is to demonstrate how to use KVS and structure keys in HSE.

0 replies

victorstewart · 2020-11-20T19:42:34Z

victorstewart
Nov 20, 2020
Author

@smoyergh

wow thank you that was exceedingly thorough, let me digest it and see if i have any points to add.

but you're right adding those layers of KVS indirection is the only way to ensure the type safety of logical types built over key-value space.

also collapsing to a constant length key blueprint as you suggest does solve all the prefix based performance i was missing... fantastic! and again that's clearly the only way to do so generically without relying on KVS spread + integrating a brittle KVS design based on "hard coded" data assumptions.

i once heard someone said every problem in computer science can be solved by another layer of indirection, and now i'm always reminded of that.

currently, I didn't build in any type safety, just assumed that the application will always do the right thing, plus instituted naming conventions for keys that will assure no interference between anything, but as my team grows i'll definitely circle back and reimplemented the database as you've laid out for data integrity reasons.

0 replies

victorstewart · 2020-11-20T20:56:32Z

victorstewart
Nov 20, 2020
Author

@smoyergh

at the moment i implemented subtables as...

"HSET key subkey value" just maps to "key.subkey" -> value

and lists where... a blueprint exists at the list key, what holds a list count plus head and tail suffixes. those suffixes map to segments. say the head at key "listkey.0" and the tail at "listkey.2".

and i fill entries into those segments until the entire 1MB is consumed, and then the overflow into a new segment (like those water doors the have in the hull of a ship to arrest contagion).

i mapped out all the usage in mind, and the minimal indirection and lack of transactions or cursor construction in my design actually optimizes for the fast path, at the expense of the type safety and vast elegance of your design.

as far as lists go, it optimizes for the fast path since almost every time the operation can be served by the first segment (head or tail) gathered. and when it can't, there's a 1 time cost of overflow. or at worst served by seeking to the next segment (in the case of deleting the first instance of a value found). this also makes fully consuming a non head or tail segment, thus requiring "renaming" aka compaction, exceedingly rare.

for subtables, the only time i do a "get all values" (besides for 2 hyper specialized collections of data which thus require custom commands to transfer intent efficiently, thus personal KVSs are no added complexity), is one super cold path.. thus a non frequent performance cost.

that said, i'm sure we will eventually reimplement to your specification one people must less cavalier than me urge safety over fast path optimized design.

thank you so much again. V

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

List / Sub-Table Friendly Operations #22

{{title}}

Replies: 9 comments

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

List / Sub-Table Friendly Operations #22

victorstewart Nov 18, 2020

Replies: 9 comments

tristan957 Nov 18, 2020 Maintainer

victorstewart Nov 18, 2020 Author

tristan957 Nov 19, 2020 Maintainer

davidboles Nov 19, 2020 Collaborator

victorstewart Nov 19, 2020 Author

victorstewart Nov 19, 2020 Author

smoyerx Nov 20, 2020 Collaborator

victorstewart Nov 20, 2020 Author

victorstewart Nov 20, 2020 Author

victorstewart
Nov 18, 2020

tristan957
Nov 18, 2020
Maintainer

victorstewart
Nov 18, 2020
Author

tristan957
Nov 19, 2020
Maintainer

davidboles
Nov 19, 2020
Collaborator

victorstewart
Nov 19, 2020
Author

victorstewart
Nov 19, 2020
Author

smoyerx
Nov 20, 2020
Collaborator

victorstewart
Nov 20, 2020
Author

victorstewart
Nov 20, 2020
Author