-
Notifications
You must be signed in to change notification settings - Fork 111
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Try to mitigate data race in the HashAggregate #4906
Conversation
We're tracking the number of threads running the hash aggregate so that we can wait to do the parallel finalization until all threads have finished However it's possible for threads to finish before other threads have started, leading the remaining threads to start finalization early. This tries to mitigate that by registering the thread as soon as possible, however to truly fix it the finalization should be moved into a separate operator
Benchmark ResultMaster commit hash:
|
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #4906 +/- ##
=======================================
Coverage 86.48% 86.48%
=======================================
Files 1403 1403
Lines 60466 60466
Branches 7429 7431 +2
=======================================
+ Hits 52294 52296 +2
+ Misses 8003 8001 -2
Partials 169 169 ☔ View full report in Codecov by Sentry. |
Has this issue been fixed? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we heading to the actual fix as you said? Maybe let's discuss this a bit more.
however to truly fix it the finalization should be moved into a separate operator so we can guarantee that all data is partitioned before starting the parallel finalization.
We're tracking the number of threads running the hash aggregate so that we can wait to do the parallel finalization until all threads have finished However it's possible for threads to finish before other threads have started, leading the remaining threads to start finalization early. This tries to mitigate that by registering the thread as soon as possible, however to truly fix it the finalization should be moved into a separate operator
There's a test that has occasionally been failing in CI (e.g. afd089b), it's happening very infrequently and I've narrowed down the issue to a data race in the HashAggregate operator.
We're tracking the number of threads running the hash aggregate so that we can wait to do the parallel finalization until all threads have finished However it's possible for threads to finish before other threads have started, leading the remaining threads to start finalization early. This tries to mitigate that by registering the thread as soon as possible (not that it's skipping a lot of work, but this might improve things slightly), however to truly fix it the finalization should be moved into a separate operator so we can guarantee that all data is partitioned before starting the parallel finalization.
There's probably a similar issue in the IndexBuilder, it's just that the IndexBuilder doesn't have a multithreaded finalization stage, and the thread tracking is just to determine if a thread should continue working on the global queues after its finished processing its local data.