
Jemalloc defrag situation #364

Open
zuiderkwast opened this issue Apr 24, 2024 · 13 comments
Labels
de-crapify Correct crap decisions made in the past

Comments

@zuiderkwast
Contributor

          > We do have some special code to allow for active defragmentation of Jemalloc arenas, so we might also want to figure out how to implement that in mimalloc.

Defrag is enabled explicitly only when jemalloc is used, so this should be fine. In other words, you get defrag with jemalloc only.

I am aligned with adding mimalloc as long as we don't vendor it. On a related note, I think we should move forward and remove the one-off je_get_defrag_hint patch and instead use more generic heuristics for defrag so that other allocators could benefit from the defrag feature as well. jemalloc should be devendored too.

Originally posted by @PingXie in #346 (comment)

@zuiderkwast
Contributor Author

Regarding the defrag and the need to defrag, I think we can try to think of ways to avoid the fragmentation. Few other projects seem to need defrag.

Am I wrong in assuming we get more fragmentation when I/O threads are used?

A request is parsed into string objects which are later stored in the database as keys and values. When this is done in different threads, they get allocated in different arenas. Then they are all stored in the database. Maybe we can avoid it if we let the main thread allocate the keys and values that will be stored. Then, all the allocations done by I/O threads would be short-lived and not cause fragmentation.

An idea I saw somewhere is that the request parser could return pointers into the query buffer, rather than duplicating the data as sds strings. A RESP request is always an array of strings (AKA multibulk) so we'd just need to store an offset and a length for each argument temporarily until the command returns.
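
A minimal sketch in C of that idea (the `arg_ref` struct and `parse_multibulk` function are hypothetical names, not Valkey's actual API): the parser records an offset and length per argument instead of duplicating each one as an sds string.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical reference into the query buffer: no copy, just a view. */
typedef struct {
    size_t offset; /* start of the argument within the query buffer */
    size_t len;    /* argument length in bytes */
} arg_ref;

/* Parse one RESP multibulk request, e.g. "*2\r\n$3\r\nGET\r\n$3\r\nfoo\r\n",
 * filling refs[] with offsets into buf instead of duplicating each
 * argument as a heap-allocated string. Returns argc, or -1 on error. */
static int parse_multibulk(const char *buf, size_t buflen,
                           arg_ref *refs, int maxargs) {
    const char *p = buf, *end = buf + buflen;
    if (p >= end || *p != '*') return -1;
    int argc = (int)strtol(p + 1, (char **)&p, 10);
    if (argc < 0 || argc > maxargs || p + 2 > end) return -1;
    p += 2; /* skip \r\n after the argument count */
    for (int i = 0; i < argc; i++) {
        if (p >= end || *p != '$') return -1;
        long len = strtol(p + 1, (char **)&p, 10);
        if (len < 0 || p + 2 + len + 2 > end) return -1;
        p += 2; /* skip \r\n after the length */
        refs[i].offset = (size_t)(p - buf);
        refs[i].len = (size_t)len;
        p += len + 2; /* skip payload and its trailing \r\n */
    }
    return argc;
}
```

The views stay valid only while the query buffer is alive, so the command would copy any argument it wants to store, which is the point: only keys and values that outlive the command get a real allocation.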

@madolson
Member

At AWS, we very rarely see places where defragmentation is actually useful. Most of the time, new data is getting added and removed at about the same rate and it naturally defragments itself. The only time it's really useful is when there is a fundamental shift in the workload and the data becomes permanently fragmented, and a one-time defrag will help free up a bunch of memory.

> A request is parsed into string objects which are later stored in the database as keys and values. When this is done in different threads, they get allocated in different arenas. Then they are all stored in the database. Maybe we can avoid it if we let the main thread allocate the keys and values that will be stored. Then, all the allocations done by I/O threads would be short-lived and not cause fragmentation.

Reminds me a bit of the segcache architecture, which allocates in large segments with a lot of key/value pairs that are all freed as a bunch.

> An idea I saw somewhere is that the request parser could return pointers into the query buffer, rather than duplicating the data as sds strings. A RESP request is always an array of strings (AKA multibulk) so we'd just need to store an offset and a length for each argument temporarily until the command returns.

@hpatro ^

@PingXie
Member

PingXie commented Apr 24, 2024

> The only time it's really useful is when there is a fundamental shift in the workload and the data becomes permanently fragmented, and a one-time defrag will help free up a bunch of memory.

This ^

This is why I don't think it justifies the overhead/complexity to vendor jemalloc. linking #15 (comment)

@zuiderkwast
Contributor Author

> The only time it's really useful is when there is a fundamental shift in the workload and the data becomes permanently fragmented, and a one-time defrag will help free up a bunch of memory.

Then I think we should gradually get rid of the get_defrag_hint() patch.

The first thing we can do is make it optional; I mean, add a dumb variant of defrag. If we're running unpatched jemalloc, we can just realloc everything.
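
A dumb variant could be as small as this sketch (hypothetical name, loosely after the existing defrag helper; it ignores allocator hints and always moves the allocation):

```c
#include <stdlib.h>
#include <string.h>

/* Naive defrag step that works on any allocator: unconditionally move
 * the allocation. A fresh malloc() lets the allocator place the data in
 * a denser slab; realloc() alone may return the same pointer, so we
 * copy explicitly. Sketch only, not the actual Valkey implementation. */
static void *defrag_realloc(void *ptr, size_t size) {
    void *newptr = malloc(size);
    if (newptr == NULL) return ptr; /* on OOM, keep the old allocation */
    memcpy(newptr, ptr, size);
    free(ptr);
    return newptr;
}
```

The obvious cost is that every allocation gets copied whether or not moving it helps, which is why this should only run when measured fragmentation is actually high.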

> Most of the time, new data is getting added and removed at about the same rate and it naturally defragments itself.

So, to avoid excessive defrag work, we could measure fragmentation but wait an hour or so before starting defrag, to see if it resolves itself?
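
That wait-and-see policy could be sketched like this (hypothetical names; the threshold and grace period are placeholders): track how long the fragmentation ratio, RSS divided by allocated bytes, has stayed above a threshold, and only start defrag after a grace period elapses without it recovering.

```c
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Sketch: remember when the fragmentation ratio first crossed the
 * threshold; trigger defrag only if it has stayed high for a full
 * grace period, since steady add/remove traffic often resolves
 * fragmentation on its own. */
typedef struct {
    time_t high_since; /* 0 = ratio currently below the threshold */
} frag_state;

static bool should_start_defrag(frag_state *st, size_t rss,
                                size_t allocated, double threshold,
                                time_t grace_secs, time_t now) {
    double ratio = allocated ? (double)rss / (double)allocated : 1.0;
    if (ratio < threshold) {
        st->high_since = 0; /* recovered on its own; reset the clock */
        return false;
    }
    if (st->high_since == 0) st->high_since = now; /* start the clock */
    return now - st->high_since >= grace_secs;
}
```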

The third thing would be to try to avoid creating fragmentation in the first place (if my speculation about I/O threads is right).

If these work well in practice (live, not just in a release candidate), then in another release we can make it permanent and unvendor jemalloc. How about this plan?

@zuiderkwast zuiderkwast added the de-crapify Correct crap decisions made in the past label Apr 25, 2024
@hpatro
Contributor

hpatro commented Apr 25, 2024

> An idea I saw somewhere is that the request parser could return pointers into the query buffer, rather than duplicating the data as sds strings. A RESP request is always an array of strings (AKA multibulk) so we'd just need to store an offset and a length for each argument temporarily until the command returns.

I came across this idea on a rust rewrite of Redis https://github.com/seppo0010/rsedis/blob/master/parser/src/lib.rs#L13-L28
I did try a dirty hack around this, and the problem I observed was that we would still need to create those objects further down the stack to store the data. So object creation is inevitable, and with the workload Madelyn mentioned above, fragmentation is bound to happen.
The other issue is that we would need to significantly change our db layer and plenty of other APIs that require the key/values as robj. So we would need to touch most of our code flows dealing with robj.
Overall, I didn't see much of a performance gain in SET/GET operations either.

@zuiderkwast
Contributor Author

@hpatro Yes, we still need to allocate those objects, but my point is that if it's done in the main thread (rather than in the I/O threads), they will be in the same jemalloc arena as the other allocations in the main thread. Jemalloc allocates from one arena per thread and if we have objects in lots of different arenas, that would cause more fragmentation. Does it make sense?

We can still get fragmentation though, just a bit less, if this is true.

@hpatro
Contributor

hpatro commented Apr 26, 2024

> @hpatro Yes, we still need to allocate those objects, but my point is that if it's done in the main thread (rather than in the I/O threads), they will be in the same jemalloc arena as the other allocations in the main thread. Jemalloc allocates from one arena per thread and if we have objects in lots of different arenas, that would cause more fragmentation. Does it make sense?
>
> We can still get fragmentation though, just a bit less, if this is true.

That makes sense. I was trying to highlight that delayed robj creation for the storage layer can get really messy with the overall code structure in place.

@zuiderkwast
Contributor Author

> I was trying to highlight that delayed robj creation for the storage layer can get really messy with the overall code structure in place.

I see. The easiest solution is probably that the main thread allocates robj for all arguments immediately when it starts executing the command. It's not optimal but maybe it fixes the fragmentation issue.

@zvi-code

zvi-code commented May 1, 2024

> Am I wrong in assuming we get more fragmentation when I/O threads are used?

I don't think this is true for most cases. By default we do not use the per-CPU or per-thread arena; jemalloc does create multiple arenas by default (depending on the number of CPUs), but usually all allocations come from arena 0. Allocations do go through the thread cache, which amortizes access to the arena by allocating batches of entries, reducing contention between threads. But if the system is not already fragmented, allocations that are close in time, even from different threads, will likely come from nearby slabs/extents/mmaps. The thread cache also avoids going down to the arena for short-lived alloc/free pairs.

In any case, as the number of threads is not huge, I expect they have very little effect on fragmentation either way. The customer workload, as @madolson mentioned, is a bigger factor in fragmentation (but not the only one).
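
For experimenting with how much the arena and thread-cache defaults actually matter, jemalloc's behavior can be steered at startup via `MALLOC_CONF` (these option names come from jemalloc's own tuning documentation; exact behavior varies by jemalloc version and build):

```shell
# Force a single arena so all threads share it:
MALLOC_CONF="narenas:1" ./valkey-server

# Disable the thread cache to hit the arena on every allocation:
MALLOC_CONF="tcache:false" ./valkey-server

# Dump allocator statistics (including per-arena usage) at exit:
MALLOC_CONF="stats_print:true" ./valkey-server
```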

@zvi-code

zvi-code commented May 2, 2024

> At AWS, we very rarely see places where defragmentation is actually useful. Most of the time, new data is getting added and removed at about the same rate and it naturally defragments itself. The only time it's really useful is when there is a fundamental shift in the workload and the data becomes permanently fragmented, and a one-time defrag will help free up a bunch of memory.

@madolson I agree the current defrag process is not so useful (I'm talking purely about the logic in jemalloc; it was improved with jemalloc 5.3, but it is still not guaranteed to even converge), but I disagree with the statement that it "naturally defragments itself". I think the current (pre-serverless) AWS experience hides the issue of memory being "locked" and not freed back to the OS. Another factor in evaluating the issues with defrag is utilization: thanks to jemalloc's bucket allocator, if you are at low utilization to begin with, fragmented memory will not cause major customer pain (besides visibility). Forward looking, when considering things like multi-tenancy, I believe fragmented memory will become a much bigger issue.

@PingXie
Member

PingXie commented May 2, 2024

Thanks for the insight, @zvi-code! Appreciate it.

> Forward looking, when considering things like multi-tenancy, I believe fragmented memory will become a much bigger issue.

I understand that you are providing a counter-argument to the statement about "naturally defragmenting itself", which makes sense to me. I just wanted to clarify that my main rationale for de-vendoring jemalloc is that the overhead of vendoring any allocator for the sake of active_defrag outweighs the perceived benefit by a large margin, given that the current use cases are exclusively single-tenancy.

@zvi-code

@PingXie here is the followup #692

@madolson
Member

madolson commented Jul 1, 2024

> Forward looking, when considering things like multi-tenancy, I believe fragmented memory will become a much bigger issue.

I didn't think about this case; you are right that we may want to defragment to release memory to the OS when running in a mixed environment (either multiple valkey processes or other processes running).
