-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Test: configuration fuzzer for (external) sort queries #15501
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
c9ff883
to
0c6f558
Compare
0c6f558
to
77789a4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a new configuration fuzzer for end-to-end sort queries to improve test coverage for sorting, including TopK executors and multiple dataset types. Key changes include:
- Adding the new module "sort_query_fuzz" and its associated entry point.
- Refactoring dataset generation utilities by moving RecordBatchGenerator-related code into a shared utility module.
- Modifying the AggregationFuzzer API to require a mutable self reference during async execution.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
File | Description |
---|---|
datafusion/core/tests/fuzz_cases/mod.rs | Registers the new sort_query_fuzz module and adds the record_batch_generator utility module. |
datafusion/core/tests/fuzz_cases/aggregation_fuzzer/mod.rs | Updates the import of ColumnDescr to use the shared record_batch_generator module. |
datafusion/core/tests/fuzz_cases/aggregation_fuzzer/fuzzer.rs | Changes async functions to accept a mutable self reference to support internal state modifications. |
datafusion/core/tests/fuzz_cases/aggregate_fuzz.rs | Refactors inline column definitions to use the shared get_supported_types_columns function for dataset setup. |
Comments suppressed due to low confidence (2)
datafusion/core/tests/fuzz_cases/aggregation_fuzzer/fuzzer.rs:167
- [nitpick] Changing the function signature to require a mutable self reference indicates that the fuzzer's internal state is modified during execution; consider updating the function's documentation to clearly reflect this behavior for future maintainers.
pub async fn run(&mut self) {
datafusion/core/tests/fuzz_cases/aggregate_fuzz.rs:203
- [nitpick] Ensure that get_supported_types_columns returns a comprehensive set of columns covering all the data types previously specified explicitly to avoid any unintended gap in test coverage.
let columns = get_supported_types_columns(rng.gen());
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome -- thank you @2010YOUY01
I think we should consider the random seed along with the runtime of this test, but otherwise it looks really nice
There are other bugs detected and it's not obvious why it fails, so now by default, it will only run with unbounded memory limit,
Could you make a list of known issues that are preventing this fuzz test from being able to run with bounded memory limits (or a placeholder if we don't know why it fails)
I think it would be valuable to track the progress towards "enable sort fuzzing with limited memory" as I think others would be interested in helping and we would avoid leaving a half completed test in here.
|
||
/// Columns that are supported by the record batch generator | ||
/// The RNG is used to generate the precision and scale for the decimal columns, thread | ||
/// RNG is not used because this is used in fuzzing and deterministic results are preferred |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think when fuzzing it probably would be preferred to use the thread_rng to add additional coverage.
It does make the test failures unpredictable however
} | ||
|
||
#[tokio::test(flavor = "multi_thread")] | ||
#[ignore] // TODO: This query failes, fix this known bug |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please add the test ticket URL here as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a new bug, I think. I'll open an issue after this PR is merged so we can include a link to this reproducer in the issue.
|
||
/// Given the same seed, the result should be the same | ||
#[tokio::test] | ||
async fn test_sort_query_fuzzer_deterministic() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this fuzz test takes almost 30 seconds on my laptop where most of the other tests are a few seconds
Any chance we can reduce the number of iterations?
ALso, as I mentioned using a fixed seed I think somewhat defeats the purpose of fuzz testing so I think we should change this to use the actual thread random generator
(venv) andrewlamb@Andrews-MacBook-Pro-2:~/Software/datafusion$ cargo nextest run --test fuzz -- sort
Finished `test` profile [unoptimized + debuginfo] target(s) in 0.21s
------------
Nextest run ID f3c8f111-560a-414b-8f19-89b14765cfa0 with nextest profile: default
Starting 13 tests across 1 binary (57 tests skipped)
PASS [ 0.101s] datafusion::fuzz fuzz_cases::sort_preserving_repartition_fuzz::sp_repartition_fuzz_tests::stream_merge_multi_order_preserve
PASS [ 0.175s] datafusion::fuzz fuzz_cases::sort_fuzz::test_sort_strings_100k_mem
PASS [ 0.193s] datafusion::fuzz fuzz_cases::sort_fuzz::test_sort_multi_columns_100k_mem
PASS [ 0.383s] datafusion::fuzz fuzz_cases::sort_query_fuzz::test::test_sort_query_fuzzer_deterministic
PASS [ 0.430s] datafusion::fuzz fuzz_cases::limit_fuzz::test_sort_topk_i32
PASS [ 0.539s] datafusion::fuzz fuzz_cases::limit_fuzz::test_sort_topk_f64
PASS [ 1.087s] datafusion::fuzz fuzz_cases::limit_fuzz::test_sort_topk_str
PASS [ 1.132s] datafusion::fuzz fuzz_cases::limit_fuzz::test_sort_topk_i64str
PASS [ 1.140s] datafusion::fuzz fuzz_cases::sort_fuzz::test_sort_10k_mem
PASS [ 1.561s] datafusion::fuzz fuzz_cases::sort_preserving_repartition_fuzz::sp_repartition_fuzz_tests::sort_preserving_repartition_test
PASS [ 1.742s] datafusion::fuzz fuzz_cases::sort_fuzz::test_sort_100k_mem
PASS [ 1.929s] datafusion::fuzz fuzz_cases::sort_fuzz::test_sort_unlimited_mem
PASS [ 27.918s] datafusion::fuzz fuzz_cases::sort_query_fuzz::sort_query_fuzzer_runner
------------
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- I will address the running for too long issue, do you think letting it run for 10 seconds is okay?
- Regarding the fixed seed issue: its randomness is decided by a single start seed, and now it's using the timestamp when it starts:
// test entry point in `sort_query_fuzz.rs`
#[tokio::test(flavor = "multi_thread")]
async fn sort_query_fuzzer_runner() {
let random_seed = std::time::SystemTime::now()
.duration_since(std::time::UNIX_EPOCH)
.unwrap()
.as_secs();
...
This way every individual fuzzer run is different, however, if we know the start seed, the whole test sequence is deterministic, and heisenbugs can't slip away.
That said, I haven't printed out this start seed yet, I will update it later if there are no other considerations.
// ==== Execute the query ==== | ||
|
||
// Only enable memory limits if: | ||
// 1. Query does not contain LIMIT (since topK does not support external execution) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand the topk comment -- as long as the number of rows being limited is reasonable this should work I would think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, only when the limit number is very big, the query fails.
I'll include it to the TODOs to also generate query with small LIMIT
number for memory-limited configurations.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see -- sorting out how topk will work with memory limits is probably a good idea 🤔
If the plan doesn't have enough memory to hold "K" rows it could potentially fall back to normal sorting for example. Or maybe we need to eventually extend topk to support spilling (not in the near term, I am just musing here)
let init_seed = self.runner_rng.gen(); | ||
for query_i in 0..self.queries_per_round { | ||
let query_seed = self.runner_rng.gen(); | ||
let mut expected_results: Option<Vec<RecordBatch>> = None; // use first config's result as the expected result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Thank you for the review!
It's using a random start seed, thread rng is only avoided in the middle, and the same rng will be propagated along the process. This way it can be both random and deterministic. See #15501 (comment) for more discussion.
I agree, filed the tracking issue #15517 |
In my mind the only thing remaining for this PR is to reduce the time down from 30 seconds somehow (maybe split it into multiple smaller tests that can run in parallel, for example?) |
I let it run for a smaller amount of time in 1b015fb
It will always break after 5 sec is reached, and it has executed 35 query/config combinations on my machine. Breaking it into smaller tests might help, because now the testing throughput bottleneck is verifying the result: it has to do the expensive formatting on the result set, sort the formatted result, and check equal. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. Thank you so much @2010YOUY01
This is a major step forward in getting externalized sorting working better
PASS [ 5.036s] datafusion::fuzz fuzz_cases::sort_query_fuzz::sort_query_fuzzer_runner
Onward! |
Which issue does this PR close?
Rationale for this change
Recently we have detected multiple bugs for out-of-core sorting, and there are many potential fixes/improvements that will change the sort execution code, thus more test coverage is intended.
The existing sort fuzzer https://github.com/apache/datafusion/blob/main/datafusion/core/tests/fuzz_cases/sort_fuzz.rs only tests on
SortExec
instead of end-to-end sort queries, and it only includes int and string types.This new fuzzer will run end-to-end sort queries with multiple partiitons, and increases the test coverage for more dataset types, include
TopK
executors, and test all related configuration options.The new sort fuzzer is able to detect known bugs in #14748, so the random query generation will avoid the known issues (e.g. don't generate utf8 sort key, more details can be found in code comment)
It has also detected several new bugs, for example: #15469 and #15355
There are other bugs detected and it's not obvious why it fails, so now by default, it will only run with unbounded memory limit, perhaps we should retry it after the known issues like #14748 are cleared. An example of such failed query is in
sort_query_fuzz.rs
's test casetest_sort_query_fuzzer_reproduce
Implementation
There are two key structs in the sort query fuzzer:
SortQueryFuzzer
: controls the runner config like how many rounds to run, and how many queries to test inside each round.SortFuzzerTestGenerator
: Generates random datasets, queries, and configs.The log looks like
There is a utility function to reproduce the failed execution deterministically using the above seeds, see the example in in
sort_query_fuzz.rs
's test casetest_sort_query_fuzzer_reproduce
What changes are included in this PR?
RecordBatchGenerator
and related structs fromdatafusion/core/tests/fuzz_cases/aggregation_fuzzer/data_generator.rs
todatafusion/core/tests/fuzz_cases/record_batch_generator.rs
, to reuse the dataset generation utilities for aggregation fuzzer to generate dataset with more types.SortQueryFuzzer
andSortFuzzerTestGenerator
insort_query_fuzz.rs
The entry point is now in
sort_query_fuzz.rs
, and the fuzzer is limited to run for up to 20 seconds (around 75 queries on my machine) to keep the overall test runtime reasonable. Once the known issues are resolved and the fuzzer runs stably without uncovering new bugs, we can move it to the extended test CI job and allow it to run for a longer duration.Are these changes tested?
NA
Are there any user-facing changes?
No.