[Discussion] the bit unpacking performance #11000

zombee0 · 2024-09-13T10:50:15Z

zombee0
Sep 13, 2024

I like the work in #3000 and #2353, but I can't reproduce your results.
There are two files: arrow_none.txt is the result with the default compile configuration,
while arrow_avx2.txt is the result with AVX2 enabled for Arrow.
For uint16 and uint32, I found that Arrow with AVX2 performs better.
My test was run on an Intel(R) Xeon(R) Platinum 8269CY CPU @ 2.50GHz.
Would you be willing to help confirm this? @yingsu00 @Yuhta

Yuhta · 2024-09-18T22:10:32Z

Yuhta
Sep 18, 2024
Collaborator

I just reran the benchmark, and Arrow is only slightly faster (< 13%) in arrow_unpack_fullrows_{1,2,3,4,8}_32, not in the others.

6 replies

Yuhta Sep 19, 2024
Collaborator

Yes, maybe it's due to our build of Arrow. I am not using an environment where I can easily rebuild Arrow, though, so would you fix the runtime_simd_level in the CMake build and apply any change to the file that can make it faster? The example change does not handle trailing data, so it cannot be used as is.

zombee0 Sep 20, 2024
Author

@Yuhta Yes, the example needs extra logic to handle the trailing data. I will try to open a PR in Velox to make it faster.

mapleFU Sep 20, 2024

The BMI implementation is faster; I think it may be because of the compiler and how the stack and registers are used.
Would you like to check?

Just curious, how did the code change here? Would it be better to have a godbolt link so we can see the asm changes and find out?

zombee0 Oct 9, 2024
Author

@mapleFU The difference is between memcpy and value assignment; assignment needs extra logic to deal with trailing data.
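To illustrate the trailing-data point, here is a minimal sketch, not the actual Velox or Arrow code. The helper name `unpackBits` is hypothetical, and it assumes a little-endian host, bit widths from 1 to 32, and values packed LSB-first. The fast path loads a full 8-byte word, which is only safe while 8 bytes remain in the buffer; near the end, a shorter memcpy copies just the remaining bytes so nothing is read past the buffer.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (illustration only): unpack `count` values of
// `bitWidth` bits each (1..32), packed LSB-first, on a little-endian host.
// Precondition: sizeBytes * 8 >= count * bitWidth.
std::vector<uint32_t> unpackBits(
    const uint8_t* data, size_t sizeBytes, size_t count, unsigned bitWidth) {
  std::vector<uint32_t> out(count);
  const uint64_t mask = (1ULL << bitWidth) - 1;
  for (size_t i = 0; i < count; ++i) {
    const size_t bitPos = i * bitWidth;
    const size_t byteOff = bitPos / 8;
    const unsigned shift = static_cast<unsigned>(bitPos % 8);
    uint64_t word = 0;
    if (byteOff + sizeof(word) <= sizeBytes) {
      // Fast path: a fixed-size 8-byte load, which compilers typically
      // lower to a single move, much like a plain value assignment.
      std::memcpy(&word, data + byteOff, sizeof(word));
    } else {
      // Trailing data: fewer than 8 bytes remain, so copy only those bytes
      // to avoid reading past the end of the buffer.
      std::memcpy(&word, data + byteOff, sizeBytes - byteOff);
    }
    // shift <= 7 and bitWidth <= 32, so the value always fits in the word.
    out[i] = static_cast<uint32_t>((word >> shift) & mask);
  }
  return out;
}
```

For example, the three bytes `0xD1, 0x58, 0x1F` hold the eight 3-bit values 1, 2, 3, 4, 5, 6, 7, 0; because the buffer is shorter than 8 bytes, every load in that call goes through the trailing-data branch.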

zombee0 Oct 9, 2024
Author

https://godbolt.org/z/fPefzMvf3

mapleFU · 2024-09-20T15:16:31Z

mapleFU
Sep 20, 2024

Another thing I observed is that the input data size is important: the Arrow algorithm might be good for cases where the input is long enough. Otherwise BMI2 would be faster.

2 replies

zombee0 Oct 9, 2024
Author

> Another thing I observed is that the input data size is important: the Arrow algorithm might be good for cases where the input is long enough. Otherwise BMI2 would be faster.

Yes, but not by much. For Arrow's implementation with AVX-512 and AVX2, I think it does not take much data to show the performance advantage.

zombee0 Oct 9, 2024
Author

And from my tests, AVX2 is better than AVX-512, just as many articles have said.
