Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore: Remove USE_ALIGNED_ACCESS and enhance BYTE_ORDER handling #2456

Open
wants to merge 9 commits into
base: unstable
Choose a base branch
from

Conversation

mapleFU
Copy link
Member

@mapleFU mapleFU commented Jul 31, 2024

Redis uses lot of USE_ALIGNED_ACCESS as the "fastpath" for ARM like archtecture, however, I think modern compiler can handle this kind of optimization well. So this part of code is removed.

Besides, some vendored libraray uses BYTE_ORDER macro, this might not being defined in these files. So I use BYTE_ORDER instead

@mapleFU mapleFU requested a review from PragmaTwice July 31, 2024 10:30
}
}
}
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if the performance difference between this fast path and vanilla algorithm is large.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I think https://github.com/apache/arrow/blob/main/cpp/src/arrow/util/bitmap_ops.h#L160-L242 would be better, I'll do a benchmark there

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked the current impl, and I guess the fast path would be faster😅

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can use https://quick-bench.com/ which can generate a chart.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I revert the part because the underlying code is too slow 😅

Besides, I change to use memcpy to generalize it. I'll try a benchmark later

Copy link
Member Author

@mapleFU mapleFU Aug 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

apache/arrow@76cebfa
Lucky, arrow has similiar testing here. On My MacOS M1Pro with Release -O2:

BenchmarkBitmapVisitBitsetAnd/32768/0      753392 ns       749634 ns          937 bytes_per_second=41.687M/s
BenchmarkBitmapVisitBitsetAnd/131072/0    2986097 ns      2985449 ns          234 bytes_per_second=41.8698M/s
BenchmarkBitmapVisitBitsetAnd/32768/1      746267 ns       746040 ns          939 bytes_per_second=41.8878M/s
BenchmarkBitmapVisitBitsetAnd/131072/1    2991597 ns      2990679 ns          234 bytes_per_second=41.7965M/s
BenchmarkBitmapVisitBitsetAnd/32768/2      747519 ns       747314 ns          940 bytes_per_second=41.8164M/s
BenchmarkBitmapVisitBitsetAnd/131072/2    2985102 ns      2984500 ns          234 bytes_per_second=41.8831M/s

The code has no different from bit-hacking and

git-hulk
git-hulk previously approved these changes Jul 31, 2024
@mapleFU
Copy link
Member Author

mapleFU commented Aug 3, 2024

@git-hulk @PragmaTwice I've paste the result #2456 (comment)

https://github.com/apache/arrow/blob/45b176716cc667384577a2a1218c6da454854109/cpp/src/arrow/util/bit_util_benchmark.cc#L165-L189

The code runs same speed with highly optimized code in macos, and x86 would share this optimization

Copy link

sonarcloud bot commented Aug 3, 2024

@mapleFU
Copy link
Member Author

mapleFU commented Aug 7, 2024

@PragmaTwice would you mind check again?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants