[SPARK-39328][SQL][TESTS] Fix flaky test `SPARK-37753: Inhibit broadcast in left outer join when there are many empty partitions on outer/left side` #52388

Last-remote11 · 2025-09-18T14:16:30Z

What changes were proposed in this pull request?

Improve the test `SPARK-37753: Inhibit broadcast in left outer join when there are many empty partitions on outer/left side` in `AdaptiveQueryExecSuite`.

Why are the changes needed?

This test appears to always succeed in the Apache GitHub Actions runner environment, but in some environments it does not seem to proceed as intended.

My environment:
4.18.0-553.8.1.el8_10.x86_64
Intel(R) Xeon(R) Silver 4210 CPU @ 2.20GHz
64G Mem
I ran the test on the master branch, following the official documentation:

./build/sbt
testOnly org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite
...
- SPARK-37753: Inhibit broadcast in left outer join when there are many empty partitions on outer/left side *** FAILED ***
  The code passed to eventually never returned normally. Attempted 25 times over 15.040156205999999 seconds. Last failure message:

Even increasing the test's timeout to 1500 seconds results in failure after many retries:

SPARK-37753: Inhibit broadcast in left outer join when there are many empty partitions on outer/left side *** FAILED ***
  The code passed to failAfter did not complete within 20 minutes. (AdaptiveQueryExecSuite.scala:743)

The comment in the test says:

    // if the right side is completed first and the left side is still being executed,
    // the right side does not know whether there are many empty partitions on the left side,
    // so there is no demote, and then the right side is broadcast in the planning stage.
    // so retry several times here to avoid unit test failure.
    eventually(timeout(15.seconds), interval(500.milliseconds)) {
...

The failure seems to occur with very high probability because the 'right side' finishes loading first.

While the reason is unclear, I believe it is better to control the subquery loading speed predictably by applying a simple UDF to the right side than to retry until both sides happen to load in the desired order.
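The ordering idea can be illustrated without Spark. The following is a minimal sketch, not the code in this PR: `slowIdentity`, `completionOrder`, and the object name are hypothetical, and two plain Scala `Future`s stand in for the left and right query stages. It assumes Scala 2.13 (for `scala.jdk.CollectionConverters`). Delaying the right side with a fixed sleep makes it very likely, though not strictly guaranteed, that the left side finishes first.

```scala
import java.util.concurrent.ConcurrentLinkedQueue

import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._
import scala.jdk.CollectionConverters._

object SlowSideSketch {
  // Hypothetical stand-in for the PR's `slow_udf`: returns its input after a fixed delay.
  def slowIdentity(delayMs: Long)(x: Int): Int = {
    Thread.sleep(delayMs)
    x
  }

  // Runs two independent "sides" concurrently and returns the order in which they finished.
  def completionOrder(rightDelayMs: Long): List[String] = {
    val finished = new ConcurrentLinkedQueue[String]()

    // Left side: cheap work, recorded as soon as it completes.
    val left = Future {
      val n = (1 to 100).sum
      finished.add("left")
      n
    }

    // Right side: deliberately delayed so that it completes after the left side.
    val right = Future {
      val n = slowIdentity(rightDelayMs)(1)
      finished.add("right")
      n
    }

    Await.result(Future.sequence(Seq(left, right)), 10.seconds)
    finished.asScala.toList
  }

  def main(args: Array[String]): Unit = {
    println(completionOrder(rightDelayMs = 300))
  }
}
```

With a delay of a few hundred milliseconds, the left side should normally be recorded first. The trade-off is the same as in the test: a fixed sleep adds run time and still depends on the scheduler, but it removes the need to retry until the desired order happens by chance.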

Does this PR introduce any user-facing change?

No

How was this patch tested?

By rerunning the test.

Was this patch authored or co-authored using generative AI tooling?

No.

dongjoon-hyun · 2025-09-18T16:40:27Z

sql/core/src/test/scala/org/apache/spark/sql/execution/adaptive/AdaptiveQueryExecSuite.scala

+    // so apply `slow_udf` to delay right side to avoid unit test failure.
+    withUserDefinedFunction("slow_udf" -> true) {
+      spark.udf.register("slow_udf", (x: Int) => {
+        Thread.sleep(300)


300 looks like a magic number, doesn't it? May I ask why you choose this value?

Last-remote11 · Sep 18, 2025

Yes, I just picked an arbitrary number; there's no specific reason.

cafri.sun added 2 commits September 18, 2025 22:40

Improve flaky test in AdaptiveQueryExecSuite

d528d02

modify quote of Inhibit broadcast test

9dba4b1

github-actions bot added the SQL label Sep 18, 2025

[empty commit to trigger action]

102d6a8

dongjoon-hyun reviewed Sep 18, 2025
