Skip to content

Test: configuration fuzzer for (external) sort queries #15501

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Apr 4, 2025

Conversation

2010YOUY01
Copy link
Contributor

@2010YOUY01 2010YOUY01 commented Mar 31, 2025

Which issue does this PR close?

  • Closes #.

Rationale for this change

Recently we have detected multiple bugs for out-of-core sorting, and there are many potential fixes/improvements that will change the sort execution code, thus more test coverage is intended.

The existing sort fuzzer https://github.com/apache/datafusion/blob/main/datafusion/core/tests/fuzz_cases/sort_fuzz.rs only tests on SortExec instead of end-to-end sort queries, and it only includes int and string types.
This new fuzzer will run end-to-end sort queries with multiple partiitons, and increases the test coverage for more dataset types, include TopK executors, and test all related configuration options.

The new sort fuzzer is able to detect known bugs in #14748, so the random query generation will avoid the known issues (e.g. don't generate utf8 sort key, more details can be found in code comment)
It has also detected several new bugs, for example: #15469 and #15355
There are other bugs detected and it's not obvious why it fails, so now by default, it will only run with unbounded memory limit, perhaps we should retry it after the known issues like #14748 are cleared. An example of such failed query is in sort_query_fuzz.rs's test case test_sort_query_fuzzer_reproduce

Implementation

There are two key structs in the sort query fuzzer:

  • SortQueryFuzzer: controls the runner config like how many rounds to run, and how many queries to test inside each round.
  • SortFuzzerTestGenerator: Generates random datasets, queries, and configs.

The log looks like

[SortQueryFuzzer] Round 0, Query 0 (Config 9)
  Seeds:
    init_seed   = 5007153919587973719
    query_seed  = 4047642740781589262
    config_seed = 13112349042886494469
  Dataset schema:
    [timestamp_ns:Timestamp(Nanosecond, None);N, u32:UInt32;N, time32_ms:Time32(Millisecond);N, i16:Int16;N, timestamp_ms:Timestamp(Millisecond, None);N, u8_low:UInt8;N]
  Query:
    SELECT * FROM sort_fuzz_table ORDER BY u32 DESC
  Config:
    Dataset size: 93.2 KB
    Number of partitions: 3
    Batch size: 6
    Memory limit: Unbounded
    Per partition memory limit: Unbounded
    Sort spill reservation bytes: 10.3 KB
    Sort in place threshold bytes: 540.0 B

There is a utility function to reproduce the failed execution deterministically using the above seeds, see the example in in sort_query_fuzz.rs's test case test_sort_query_fuzzer_reproduce

What changes are included in this PR?

  1. Refactor: move RecordBatchGenerator and related structs from datafusion/core/tests/fuzz_cases/aggregation_fuzzer/data_generator.rs to datafusion/core/tests/fuzz_cases/record_batch_generator.rs, to reuse the dataset generation utilities for aggregation fuzzer to generate dataset with more types.
  2. Implement SortQueryFuzzer and SortFuzzerTestGenerator in sort_query_fuzz.rs

The entry point is now in sort_query_fuzz.rs, and the fuzzer is limited to run for up to 20 seconds (around 75 queries on my machine) to keep the overall test runtime reasonable. Once the known issues are resolved and the fuzzer runs stably without uncovering new bugs, we can move it to the extended test CI job and allow it to run for a longer duration.

Are these changes tested?

NA

Are there any user-facing changes?

No.

@github-actions github-actions bot added the core Core DataFusion crate label Mar 31, 2025
@2010YOUY01 2010YOUY01 requested a review from Copilot March 31, 2025 07:53
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces a new configuration fuzzer for end-to-end sort queries to improve test coverage for sorting, including TopK executors and multiple dataset types. Key changes include:

  • Adding the new module "sort_query_fuzz" and its associated entry point.
  • Refactoring dataset generation utilities by moving RecordBatchGenerator-related code into a shared utility module.
  • Modifying the AggregationFuzzer API to require a mutable self reference during async execution.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

File Description
datafusion/core/tests/fuzz_cases/mod.rs Registers the new sort_query_fuzz module and adds the record_batch_generator utility module.
datafusion/core/tests/fuzz_cases/aggregation_fuzzer/mod.rs Updates the import of ColumnDescr to use the shared record_batch_generator module.
datafusion/core/tests/fuzz_cases/aggregation_fuzzer/fuzzer.rs Changes async functions to accept a mutable self reference to support internal state modifications.
datafusion/core/tests/fuzz_cases/aggregate_fuzz.rs Refactors inline column definitions to use the shared get_supported_types_columns function for dataset setup.
Comments suppressed due to low confidence (2)

datafusion/core/tests/fuzz_cases/aggregation_fuzzer/fuzzer.rs:167

  • [nitpick] Changing the function signature to require a mutable self reference indicates that the fuzzer's internal state is modified during execution; consider updating the function's documentation to clearly reflect this behavior for future maintainers.
pub async fn run(&mut self) {

datafusion/core/tests/fuzz_cases/aggregate_fuzz.rs:203

  • [nitpick] Ensure that get_supported_types_columns returns a comprehensive set of columns covering all the data types previously specified explicitly to avoid any unintended gap in test coverage.
let columns = get_supported_types_columns(rng.gen());

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome -- thank you @2010YOUY01

I think we should consider the random seed along with the runtime of this test, but otherwise it looks really nice

There are other bugs detected and it's not obvious why it fails, so now by default, it will only run with unbounded memory limit,

Could you make a list of known issues that are preventing this fuzz test from being able to run with bounded memory limits (or a placeholder if we don't know why it fails)

I think it would be valuable to track the progress towards "enable sort fuzzing with limited memory" as I think others would be interested in helping and we would avoid leaving a half completed test in here.


/// Columns that are supported by the record batch generator
/// The RNG is used to generate the precision and scale for the decimal columns, thread
/// RNG is not used because this is used in fuzzing and deterministic results are preferred
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think when fuzzing it probably would be preferred to use the thread_rng to add additional coverage.

It does make the test failures unpredictable however

}

#[tokio::test(flavor = "multi_thread")]
#[ignore] // TODO: This query failes, fix this known bug
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please add the test ticket URL here as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new bug, I think. I'll open an issue after this PR is merged so we can include a link to this reproducer in the issue.


/// Given the same seed, the result should be the same
#[tokio::test]
async fn test_sort_query_fuzzer_deterministic() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fuzz test takes almost 30 seconds on my laptop where most of the other tests are a few seconds

Any chance we can reduce the number of iterations?

ALso, as I mentioned using a fixed seed I think somewhat defeats the purpose of fuzz testing so I think we should change this to use the actual thread random generator

(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ cargo nextest run --test fuzz -- sort
    Finished `test` profile [unoptimized + debuginfo] target(s) in 0.21s
------------
 Nextest run ID f3c8f111-560a-414b-8f19-89b14765cfa0 with nextest profile: default
    Starting 13 tests across 1 binary (57 tests skipped)
        PASS [   0.101s] datafusion::fuzz fuzz_cases::sort_preserving_repartition_fuzz::sp_repartition_fuzz_tests::stream_merge_multi_order_preserve
        PASS [   0.175s] datafusion::fuzz fuzz_cases::sort_fuzz::test_sort_strings_100k_mem
        PASS [   0.193s] datafusion::fuzz fuzz_cases::sort_fuzz::test_sort_multi_columns_100k_mem
        PASS [   0.383s] datafusion::fuzz fuzz_cases::sort_query_fuzz::test::test_sort_query_fuzzer_deterministic
        PASS [   0.430s] datafusion::fuzz fuzz_cases::limit_fuzz::test_sort_topk_i32
        PASS [   0.539s] datafusion::fuzz fuzz_cases::limit_fuzz::test_sort_topk_f64
        PASS [   1.087s] datafusion::fuzz fuzz_cases::limit_fuzz::test_sort_topk_str
        PASS [   1.132s] datafusion::fuzz fuzz_cases::limit_fuzz::test_sort_topk_i64str
        PASS [   1.140s] datafusion::fuzz fuzz_cases::sort_fuzz::test_sort_10k_mem
        PASS [   1.561s] datafusion::fuzz fuzz_cases::sort_preserving_repartition_fuzz::sp_repartition_fuzz_tests::sort_preserving_repartition_test
        PASS [   1.742s] datafusion::fuzz fuzz_cases::sort_fuzz::test_sort_100k_mem
        PASS [   1.929s] datafusion::fuzz fuzz_cases::sort_fuzz::test_sort_unlimited_mem
        PASS [  27.918s] datafusion::fuzz fuzz_cases::sort_query_fuzz::sort_query_fuzzer_runner
------------

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I will address the running for too long issue, do you think letting it run for 10 seconds is okay?
  2. Regarding the fixed seed issue: its randomness is decided by a single start seed, and now it's using the timestamp when it starts:
// test entry point in `sort_query_fuzz.rs`
#[tokio::test(flavor = "multi_thread")]
async fn sort_query_fuzzer_runner() {
    let random_seed = std::time::SystemTime::now()
        .duration_since(std::time::UNIX_EPOCH)
        .unwrap()
        .as_secs();
...

This way every individual fuzzer run is different, however, if we know the start seed, the whole test sequence is deterministic, and heisenbugs can't slip away.
That said, I haven't printed out this start seed yet, I will update it later if there are no other considerations.

// ==== Execute the query ====

// Only enable memory limits if:
// 1. Query does not contain LIMIT (since topK does not support external execution)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the topk comment -- as long as the number of rows being limited is reasonable this should work I would think

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, only when the limit number is very big, the query fails.
I'll include it to the TODOs to also generate query with small LIMIT number for memory-limited configurations.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see -- sorting out how topk will work with memory limits is probably a good idea 🤔

If the plan doesn't have enough memory to hold "K" rows it could potentially fall back to normal sorting for example. Or maybe we need to eventually extend topk to support spilling (not in the near term, I am just musing here)

let init_seed = self.runner_rng.gen();
for query_i in 0..self.queries_per_round {
let query_seed = self.runner_rng.gen();
let mut expected_results: Option<Vec<RecordBatch>> = None; // use first config's result as the expected result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
@2010YOUY01
Copy link
Contributor Author

Thank you for the review!

I think we should consider the random seed along with the runtime of this test, but otherwise it looks really nice

It's using a random start seed, thread rng is only avoided in the middle, and the same rng will be propagated along the process. This way it can be both random and deterministic. See #15501 (comment) for more discussion.

There are other bugs detected and it's not obvious why it fails, so now by default, it will only run with unbounded memory limit,

Could you make a list of known issues that are preventing this fuzz test from being able to run with bounded memory limits (or a placeholder if we don't know why it fails)

I think it would be valuable to track the progress towards "enable sort fuzzing with limited memory" as I think others would be interested in helping and we would avoid leaving a half completed test in here.

I agree, filed the tracking issue #15517

@alamb
Copy link
Contributor

alamb commented Apr 2, 2025

In my mind the only thing remaining for this PR is to reduce the time down from 30 seconds somehow (maybe split it into multiple smaller tests that can run in parallel, for example?)

@2010YOUY01
Copy link
Contributor Author

In my mind the only thing remaining for this PR is to reduce the time down from 30 seconds somehow (maybe split it into multiple smaller tests that can run in parallel, for example?)

I let it run for a smaller amount of time in 1b015fb

[SortQueryFuzzer] Time limit reached: 7 queries (5 random configs each) in 1 rounds


successes:
    fuzz_cases::sort_query_fuzz::sort_query_fuzzer_runner

test result: ok. 1 passed; 0 failed; 0 ignored; 0 measured; 68 filtered out; finished in 5.03s

It will always break after 5 sec is reached, and it has executed 35 query/config combinations on my machine.

Breaking it into smaller tests might help, because now the testing throughput bottleneck is verifying the result: it has to do the expensive formatting on the result set, sort the formatted result, and check equal.
I think this is worth to do after this fuzzer can stably pass without discovering new issue, added this idea to the tracking issue.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Thank you so much @2010YOUY01

This is a major step forward in getting externalized sorting working better

        PASS [   5.036s] datafusion::fuzz fuzz_cases::sort_query_fuzz::sort_query_fuzzer_runner

@alamb alamb merged commit f6ecbcd into apache:main Apr 4, 2025
27 checks passed
@alamb
Copy link
Contributor

alamb commented Apr 4, 2025

Onward!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants