Implement brotli's version of lz_extend. #364
base: master
Conversation
Force-pushed from bf13c0d to 26d4beb.
Thanks for looking into this! This is a part of the code where it feels like there should be something better, but it's been hard to find something that can actually be shown to be better.

When claiming a certain percent performance improvement, especially for small differences, it would be helpful to mention which CPU and compiler were used. In a couple of quick tests, I don't see the 2-3% improvement on the CPU and compiler combinations I currently do the most testing on: AMD Zen 1 with the latest versions of gcc and clang, and an arm64 processor with clang. It's either a negligible improvement or even a slight regression.

I think you're probably onto something, though: loops should only be unrolled when it provides a clear benefit, and writing the code to increment the pointers is probably a better choice than using indexing, especially since many non-x86 CPUs do not support indexed addressing, so the compiler would have to convert the code to incrementing the pointers anyway.

Before making a change here, it would be helpful to really understand how the two separate changes (re-rolling the loop, and not using indexed addressing) each affect the generated assembly with gcc and clang, and what result that has on different CPUs. That should allow making a more informed choice, and maybe point to something even better.
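To make the "re-rolled loop with incrementing pointers" idea concrete, here is a minimal sketch of a byte-at-a-time match extender written in that style. The function name and signature are illustrative only, not libdeflate's actual `lz_extend` API:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch (not libdeflate's code): extend a match one byte
 * at a time using incrementing pointers rather than indexed addressing.
 * Returns the total match length, starting from an already-known
 * matching prefix of start_len bytes. */
static size_t
lz_extend_simple(const uint8_t *strptr, const uint8_t *matchptr,
                 size_t start_len, size_t max_len)
{
    const uint8_t *p = strptr + start_len;
    const uint8_t *q = matchptr + start_len;
    const uint8_t *end = strptr + max_len;

    /* Pointers advance directly; no base+index address computation. */
    while (p < end && *p == *q) {
        p++;
        q++;
    }
    return (size_t)(p - strptr);
}
```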
I'm testing on Epyc Rome (Zen 2) with clang built from HEAD. It could also be the dataset I'm testing on: a pile of minified JavaScript and CSS. What is your preferred dataset?
This provides a 2-3% speed up for compression.

Before:
```
Compression          Total      Compressed         |       Compression        |
Method               Size       Size       Ratio   | Iters  Time        Rate         |
----------------------------------------------------------------------------------
libdeflate-gzip:1    270062086  85630362   3.154   |  20    26.493 s    194.43 MB/s  |
libdeflate-gzip:2    270062086  84037129   3.214   |  20    35.747 s    144.10 MB/s  |
libdeflate-gzip:3    270062086  82391861   3.278   |  20    39.707 s    129.73 MB/s  |
libdeflate-gzip:4    270062086  81420541   3.317   |  20    43.029 s    119.71 MB/s  |
libdeflate-gzip:5    270062086  78832080   3.426   |  20    50.630 s    101.74 MB/s  |
libdeflate-gzip:6    270062086  78021372   3.461   |  20    63.719 s     80.84 MB/s  |
libdeflate-gzip:7    270062086  77594012   3.480   |  20    87.918 s     58.59 MB/s  |
libdeflate-gzip:8    270062086  77190199   3.499   |  20   147.452 s     34.93 MB/s  |
libdeflate-gzip:9    270062086  77156775   3.500   |  20   191.025 s     26.97 MB/s  |
```
After:
```
Compression          Total      Compressed         |       Compression        |
Method               Size       Size       Ratio   | Iters  Time        Rate         |
----------------------------------------------------------------------------------
libdeflate-gzip:1    270062086  85630362   3.154   |  20    26.228 s    196.39 MB/s  |
libdeflate-gzip:2    270062086  84037129   3.214   |  20    34.950 s    147.38 MB/s  |
libdeflate-gzip:3    270062086  82391861   3.278   |  20    39.140 s    131.61 MB/s  |
libdeflate-gzip:4    270062086  81420541   3.317   |  20    41.927 s    122.86 MB/s  |
libdeflate-gzip:5    270062086  78832080   3.426   |  20    50.023 s    102.97 MB/s  |
libdeflate-gzip:6    270062086  78021372   3.461   |  20    61.799 s     83.35 MB/s  |
libdeflate-gzip:7    270062086  77594012   3.480   |  20    85.528 s     60.23 MB/s  |
libdeflate-gzip:8    270062086  77190199   3.499   |  20   145.867 s     35.31 MB/s  |
libdeflate-gzip:9    270062086  77156775   3.500   |  20   189.208 s     27.22 MB/s  |
```
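For readers unfamiliar with the technique being ported: brotli's match-length routine compares a machine word at a time and uses a trailing-zero count to locate the first differing byte in a mismatching word. Here is a hedged sketch of that idea; it assumes a little-endian target and the GCC/clang `__builtin_ctzll` builtin, and the function name is illustrative rather than either project's actual code:

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Sketch of word-at-a-time match extension in the spirit of brotli's
 * FindMatchLengthWithLimit (assumes little-endian; not the real code). */
static size_t
find_match_length(const uint8_t *s1, const uint8_t *s2, size_t limit)
{
    size_t matched = 0;

    /* Compare 8 bytes per iteration. */
    while (matched + 8 <= limit) {
        uint64_t w1, w2;
        memcpy(&w1, s1 + matched, 8);   /* memcpy = safe unaligned load */
        memcpy(&w2, s2 + matched, 8);
        uint64_t x = w1 ^ w2;
        if (x != 0) {
            /* On little-endian, the lowest set bit of the XOR
             * identifies the first differing byte. */
            return matched + ((size_t)__builtin_ctzll(x) >> 3);
        }
        matched += 8;
    }
    /* Byte-at-a-time tail for the final 0-7 bytes. */
    while (matched < limit && s1[matched] == s2[matched])
        matched++;
    return matched;
}
```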
Force-pushed from 26d4beb to 32a7739.
For quick testing I just use an amalgamation of a bunch of different files. For more detailed testing I do look at different types of files specifically, including types that users of libdeflate have mentioned are important to them, for example bioinformatics data. I should probably add more JavaScript and CSS to my testing.

Testing on a file containing DNA sequencing data, I happen to see a ~2% improvement with clang, which is similar to what you're reporting. I'm guessing that the significant factor here is the distribution of match lengths in the data. At the same time, on that file I don't see that improvement with gcc, and just building with gcc instead of clang results in a 10% improvement on that file in the first place.

I'm thinking your change is probably a benefit overall, but I'd like to really understand what's going on here, as there may be a way to do even better.
A quick test suggests that just removing the loop unrolling and the gotos is better than this change: (quality 6) HEAD: 81.82 MB/s
Just switching to pointer arithmetic also seems better than HEAD, but admittedly the variability of my measurements is pretty high: (quality 6) HEAD: 80.82 MB/s
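To illustrate the two variants being compared here, this sketch writes the same byte-compare loop once with indexed addressing and once with incrementing pointers. Both functions are hypothetical illustrations, not libdeflate's code; on ISAs without an indexed addressing mode the compiler must lower the indexed form to pointer increments anyway, which is why the two can generate different code on x86 but often identical code elsewhere:

```c
#include <stdint.h>
#include <stddef.h>

/* Indexed form: every access computes base + index. */
static size_t
match_len_indexed(const uint8_t *a, const uint8_t *b, size_t limit)
{
    size_t i = 0;
    while (i < limit && a[i] == b[i])
        i++;
    return i;
}

/* Pointer form: the pointers themselves advance. */
static size_t
match_len_ptr(const uint8_t *a, const uint8_t *b, size_t limit)
{
    const uint8_t *p = a;
    const uint8_t *end = a + limit;
    while (p < end && *p == *b) {
        p++;
        b++;
    }
    return (size_t)(p - a);
}
```

Comparing the generated assembly for these two (e.g. with `gcc -O2 -S` and `clang -O2 -S`) is one way to see the effect the reviewer describes.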
This provides a 2-3% speed up for compression, with no change to the bytestream. See here for the brotli version.