-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implements qs2 serialization as an option #71
base: main
Are you sure you want to change the base?
Conversation
@traversc I'm using the parameter defaults you currently have in your header file (without actually using your header file). Let me know if you think something else might be optimal here. Thanks! |
Default parameters are safe, but optimal depends on use case. I think it would be nice to expose the parameters somehow (or at least the compression parameter). Could be a future update. |
The interface would get really busy. An alternative I'm thinking about is having a global switch that swaps out R serialization with qs, and that function could then possibly set the parameters. |
Scratch that idea with the global parameter - it doesn't sit well with the design of nanonext. On the other hand, if there's a way for qs2 to set default parameters independently, then I'd be happy to use the default rather than a fixed setting. |
By design, do you mean user interface design? In which case I can implement global parameters in the qs2 package and you could pull from there. Another idea, you have Just an idea, totally understand if it doesnt jive with your preference. |
Hi @traversc where I think I'm coming out at the moment is I would want to use qs2 serialization automatically instead of R serialization if qs2 is already loaded. This would avoid any changes to the UI entirely. So I have the following questions:
Thanks! |
Yes, there is a 4 byte sequence: https://github.com/qsbase/qs2/blob/main/src/qx_file_headers.h#L19 Here are some benchmarks and others in that repo: https://raw.githubusercontent.com/qsbase/qs2_analysis/main/benchmarks/plots/ubuntu_write_benchmarks.png The point on the far left is the compression level = 1. What do you mean by focusing on purely speed? Does this account for network transfer times? If it is only serialization to memory then R_serialize without any compression will be faster. |
Sounds good! I'll put together an implementation on this basis. I may be getting ahead of myself with the speed comment. Compression may actually still be useful in memory-limited environments. I think I'll just approach this empirically. If qs2 eventually offers a way to set a global default, then I'll move it to use that. |
Regarding empirical benchmarking, it highly depends on the dataset. The repo I linked to above collates 20 small to medium size datasets 100 MB ~ 10 GB in size that you could use. I went ahead and implemented global options: https://github.com/qsbase/qs2?tab=readme-ov-file#global-options-for-qs2 If that approach looks good, I can send it to CRAN. |
Thanks @traversc I just took a quick look at the global options, and this wasn't entirely what I had in mind.
I've only been able to look at this quickly, so if there's an issue you're encountering, just let me know. Btw. this isn't essential and won't be blocking - I can choose a default for the initial implementation. So please take your time on this. |
I didn't want to use special values like 0 to mean default, so here's what I came up with in the latest commit. From the R side, a user can set global options with e.g. From the C-side, you can retrieve this global option with Here's a full Rcpp example:
|
In reply to your question on the C level functions for getting the defaults - if there is any meaningful performance impact then I would obviously prefer a I've made progress on this - you can see what I have in mind from the current state of this PR. I'd still need to add tests, but mirai tests fine with The choice of serialization function is determined by a global variable with the different paths implemented directly in Because of this, I am actually thinking it makes sense to incorporate straight into |
I dont think there would be a performance difference, I'd be pulling the global static variables the same way.
Sakura currently has in memory serialize/unserialize, but the most likely use case would be writing to disk, right? If so ideally we would want to stream to disk, rather than serialize to memory and write to disk all at once. Otherwise we would be leaving performance on the table and also requiring higher memory usage. |
Leaving aside the global options for now, I started testing, and I'm getting something like this with I'm choosing to ship a largish object so that most of the time is spent on serialization. Is this roughly what you would expect? x <- rnorm(1e6)
library(mirai)
daemons(8)
#> [1] 8
microbenchmark::microbenchmark(collect_mirai(mirai(x, x = x)))
#> Unit: milliseconds
#> expr min lq mean median uq
#> collect_mirai(mirai(x, x = x)) 14.86581 18.27487 21.88207 21.07917 23.89938
#> max neval
#> 39.41691 100
nanonext::use_qs2()
#> NULL
microbenchmark::microbenchmark(collect_mirai(mirai(x, x = x)))
#> Unit: milliseconds
#> expr min lq mean median uq
#> collect_mirai(mirai(x, x = x)) 25.55989 28.87685 33.61875 30.80444 34.23783
#> max neval
#> 124.0863 100 Created on 2025-03-03 with reprex v2.1.1 |
Yes that looks reasonable. If you are limited by serialization (and not memory or transfer speed) then using qs2 is not beneficial. But also you could try -1 .. -5 as compress level and adding more threads. |
I'm sorry I've been under the impression that speed would be similar to base R with better compression built in i.e. it would strictly dominate. As there is a trade-off, then it seems that qs2 would really only make sense where memory is relatively constrained. You may have seen that I've released sakura to CRAN. I am now thinking that this is going to be the best place to integrate qs2. As that package may have alternative uses, it may be more suited to qs2 and you'd also be free to add more functionality e.g. to write directly to files. You would probably not implement the integration in exactly the same way - I'll leave it to you to consider. But please feel free to take anything in this PR over there. Also please don't let this discussion hold up a CRAN release on your end. The intention is for |
Really sorry for not merging this PR as is, but I think this will be better and cleaner in the long run. Also I do want to encourage you to contribute to that package. I intend to move it to the |
You can get closer by using negative compression levels and
But for sure if you are not memory constrained/bandwidth constrained there is no value. For me, disk usage is important so I targeted compression levels around what
That sounds reasonable. I will think on how best to integrate from the code perspective. |
Closes shikokuchuo/mirai#162.
Adds 'qs2' as an option for all send/recv functions.
Requires the qs2 package with registered C callables reaching CRAN.