
Try to mitigate data race in the HashAggregate #4906

Merged: 1 commit merged into master on Feb 14, 2025

Conversation

benjaminwinger (Collaborator)
There's a test that has occasionally been failing in CI (e.g. afd089b). It happens very infrequently, and I've narrowed the issue down to a data race in the HashAggregate operator.

We track the number of threads running the hash aggregate so that we can wait to do the parallel finalization until all threads have finished. However, it's possible for some threads to finish before other threads have even started, leading the remaining threads to start finalization early. This PR tries to mitigate that by registering each thread as soon as possible (not that the registration skips a lot of work, but it might improve things slightly). To truly fix it, the finalization should be moved into a separate operator so we can guarantee that all data is partitioned before the parallel finalization starts.

There's probably a similar issue in the IndexBuilder, except that the IndexBuilder doesn't have a multithreaded finalization stage; its thread tracking is only used to determine whether a thread should continue working on the global queues after it has finished processing its local data.


Benchmark Result

Master commit hash: 5399489c9c70e09d78b45fb203bbf4b56404301e
Branch commit hash: 9b89350bf066afc6758769c3d9f52ac8f2fee002

| Query Group | Query Name | Mean Time - Commit (ms) | Mean Time - Master (ms) | Diff |
|---|---|---|---|---|
| aggregation | q24 | 722.04 | 729.62 | -7.57 (-1.04%) |
| aggregation | q28 | 6347.54 | 6362.61 | -15.07 (-0.24%) |
| filter | q14 | 126.29 | 129.79 | -3.50 (-2.70%) |
| filter | q15 | 125.91 | 124.99 | 0.93 (0.74%) |
| filter | q16 | 307.79 | 309.40 | -1.61 (-0.52%) |
| filter | q17 | 453.53 | 445.21 | 8.32 (1.87%) |
| filter | q18 | 1955.80 | 1924.43 | 31.37 (1.63%) |
| filter | zonemap-node | 89.15 | 91.79 | -2.64 (-2.88%) |
| filter | zonemap-node-lhs-cast | 90.18 | 90.31 | -0.13 (-0.15%) |
| filter | zonemap-node-null | 88.86 | 88.59 | 0.27 (0.31%) |
| filter | zonemap-rel | 5390.72 | 5434.17 | -43.45 (-0.80%) |
| fixed_size_expr_evaluator | q07 | 579.92 | 577.97 | 1.95 (0.34%) |
| fixed_size_expr_evaluator | q08 | 809.18 | 812.23 | -3.05 (-0.38%) |
| fixed_size_expr_evaluator | q09 | 807.95 | 808.80 | -0.85 (-0.11%) |
| fixed_size_expr_evaluator | q10 | 243.65 | 245.35 | -1.70 (-0.69%) |
| fixed_size_expr_evaluator | q11 | 236.36 | 237.47 | -1.11 (-0.47%) |
| fixed_size_expr_evaluator | q12 | 236.03 | 238.64 | -2.60 (-1.09%) |
| fixed_size_expr_evaluator | q13 | 1453.51 | 1453.42 | 0.09 (0.01%) |
| fixed_size_seq_scan | q23 | 115.50 | 119.23 | -3.73 (-3.13%) |
| join | q29 | 715.69 | 721.07 | -5.38 (-0.75%) |
| join | q30 | 10238.88 | 10508.05 | -269.17 (-2.56%) |
| join | q31 | 8.04 | 6.26 | 1.77 (28.34%) |
| join | SelectiveTwoHopJoin | 53.77 | 55.28 | -1.51 (-2.73%) |
| ldbc_snb_ic | q35 | 2619.81 | 2596.48 | 23.33 (0.90%) |
| ldbc_snb_ic | q36 | 454.27 | 474.31 | -20.05 (-4.23%) |
| ldbc_snb_is | q32 | 3.58 | 5.20 | -1.62 (-31.11%) |
| ldbc_snb_is | q33 | 14.42 | 15.39 | -0.97 (-6.27%) |
| ldbc_snb_is | q34 | 1.20 | 1.26 | -0.06 (-4.76%) |
| multi-rel | multi-rel-large-scan | 1371.04 | 1328.34 | 42.70 (3.21%) |
| multi-rel | multi-rel-lookup | 10.24 | 32.59 | -22.35 (-68.58%) |
| multi-rel | multi-rel-small-scan | 68.73 | 58.12 | 10.61 (18.26%) |
| order_by | q25 | 132.46 | 127.19 | 5.27 (4.14%) |
| order_by | q26 | 455.91 | 462.67 | -6.76 (-1.46%) |
| order_by | q27 | 1415.71 | 1414.42 | 1.29 (0.09%) |
| recursive_join | recursive-join-bidirection | 290.14 | 308.41 | -18.26 (-5.92%) |
| recursive_join | recursive-join-dense | 7403.25 | 7426.05 | -22.79 (-0.31%) |
| recursive_join | recursive-join-path | 24164.30 | 24248.80 | -84.50 (-0.35%) |
| recursive_join | recursive-join-sparse | 1057.39 | 1060.88 | -3.49 (-0.33%) |
| recursive_join | recursive-join-trail | 7365.96 | 7383.34 | -17.38 (-0.24%) |
| scan_after_filter | q01 | 172.83 | 175.73 | -2.90 (-1.65%) |
| scan_after_filter | q02 | 157.18 | 159.94 | -2.76 (-1.73%) |
| shortest_path_ldbc100 | q37 | 86.98 | 96.52 | -9.54 (-9.88%) |
| shortest_path_ldbc100 | q38 | 359.86 | 406.01 | -46.14 (-11.37%) |
| shortest_path_ldbc100 | q39 | 61.17 | 65.20 | -4.03 (-6.18%) |
| shortest_path_ldbc100 | q40 | 463.40 | 415.60 | 47.80 (11.50%) |
| var_size_expr_evaluator | q03 | 2084.88 | 2136.30 | -51.42 (-2.41%) |
| var_size_expr_evaluator | q04 | 2241.80 | 2259.95 | -18.15 (-0.80%) |
| var_size_expr_evaluator | q05 | 2915.72 | 2995.92 | -80.19 (-2.68%) |
| var_size_expr_evaluator | q06 | 1330.78 | 1330.58 | 0.20 (0.02%) |
| var_size_seq_scan | q19 | 1505.48 | 1504.61 | 0.87 (0.06%) |
| var_size_seq_scan | q20 | 2400.76 | 2373.59 | 27.16 (1.14%) |
| var_size_seq_scan | q21 | 2361.49 | 2315.78 | 45.71 (1.97%) |
| var_size_seq_scan | q22 | 127.67 | 126.69 | 0.97 (0.77%) |


codecov bot commented Feb 13, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.48%. Comparing base (5399489) to head (47283ee).
Report is 5 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #4906   +/-   ##
=======================================
  Coverage   86.48%   86.48%           
=======================================
  Files        1403     1403           
  Lines       60466    60466           
  Branches     7429     7431    +2     
=======================================
+ Hits        52294    52296    +2     
+ Misses       8003     8001    -2     
  Partials      169      169           


@acquamarin (Collaborator) commented Feb 14, 2025

Has this issue been fixed?
#4903

@ray6080 (Contributor) left a comment

> however to truly fix it the finalization should be moved into a separate operator so we can guarantee that all data is partitioned before starting the parallel finalization.

Are we heading to the actual fix as you said? Maybe let's discuss this a bit more.

@benjaminwinger benjaminwinger merged commit 0e1c19f into master Feb 14, 2025
25 checks passed
@benjaminwinger benjaminwinger deleted the hash-aggregate-data-race-mitigation branch February 14, 2025 17:01
acquamarin pushed a commit that referenced this pull request Feb 19, 2025