Bench: Add more datasets in router evaluation #270
Conversation
👥 vLLM Semantic Team Notification: The following members have been identified for the changed files in this PR and have been automatically assigned.
Force-pushed from bd78dd3 to 2743aad.
✅ Deploy Preview for vllm-semantic-router ready!
Pull Request Overview
This PR adds support for multiple new datasets in the router evaluation benchmark, expanding the original 6 datasets with 7 new ones covering mathematical, multi-step, and scientific reasoning.
Key changes:
- Added implementations for 7 new datasets: AQUA-RAT, DROP, GSM8K, MATH (disabled), OpenBookQA, SciQ, and StrategyQA
- Enhanced answer extraction with format-specific parsers for multiple-choice, binary, and free-form questions (see the sketch after this list)
- Updated token allocation system with dataset-specific optimal limits and model-aware multipliers
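
As a concrete sense of what such format-specific parsers can look like, here is a minimal sketch; the function names and regex patterns are illustrative, not the PR's actual implementation:

```python
import re

# Hypothetical format-specific extractors; the patterns are illustrative,
# not the regexes used in router_reason_bench_multi_dataset.py.
def extract_multiple_choice(text: str) -> str | None:
    m = re.search(r"\b(?:answer|choice)\s*(?:is|:)?\s*\(?([A-E])\)?", text, re.I)
    return m.group(1).upper() if m else None

def extract_binary(text: str) -> str | None:
    m = re.search(r"\b(yes|no)\b", text, re.I)
    return m.group(1).capitalize() if m else None

def extract_free_form(text: str) -> str | None:
    m = re.search(r"(?:final answer|answer)\s*(?:is|:)\s*(.+)", text, re.I)
    return m.group(1).strip() if m else None

print(extract_multiple_choice("The answer is (C)."))  # -> C
print(extract_binary("So the answer is yes."))        # -> Yes
```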
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| router_reason_bench_multi_dataset.py | Enhanced answer extraction patterns, expanded dataset token configs (see the token-allocation sketch after this table), improved model compatibility |
| strategyqa_dataset.py | New dataset for multi-step Yes/No reasoning questions |
| sciq_dataset.py | New dataset for science multiple-choice questions |
| openbookqa_dataset.py | New dataset for elementary science reasoning |
| math_dataset.py | New dataset for competition mathematics (commented as disabled) |
| gsm8k_dataset.py | New dataset for elementary math word problems |
| drop_dataset.py | New dataset for discrete reasoning over text passages |
| aqua_rat_dataset.py | New dataset for algebraic word problems with rationales |
| dataset_factory.py | Registration of all new dataset implementations |
| cli.py | Updated CLI choices to include new datasets |
| plot_comprehensive_results.py | New comprehensive plotting script for benchmark visualization |
| comprehensive_bench.sh | Enhanced benchmark script with new datasets and plotting integration |
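
As a rough illustration of the token-allocation change called out above, here is a hedged sketch; the dataset keys match the new datasets, but the limits, multiplier keys, and function are assumptions, not the PR's actual values:

```python
# Hypothetical per-dataset token limits and model-aware multipliers;
# the numbers are made up for illustration.
DATASET_MAX_TOKENS = {
    "gsm8k": 512,       # short chain-of-thought arithmetic
    "drop": 768,        # discrete reasoning over passages
    "aqua_rat": 1024,   # algebraic rationales
    "strategyqa": 256,  # multi-step Yes/No answers
}

MODEL_MULTIPLIERS = {
    "reasoning": 2.0,   # reasoning-tuned models emit longer traces
    "default": 1.0,
}

def max_tokens(dataset: str, model_family: str = "default") -> int:
    base = DATASET_MAX_TOKENS.get(dataset, 512)
    return int(base * MODEL_MULTIPLIERS.get(model_family, 1.0))

print(max_tokens("gsm8k", "reasoning"))  # -> 1024
```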
Comments suppressed due to low confidence (2)
bench/vllm_semantic_router_bench/router_reason_bench_multi_dataset.py:1
- [nitpick] Consider using f-string formatting consistently. The string concatenation here could be replaced with an f-string for better performance and readability:
```python
f"Reasoning steps:\n{chr(10).join([f'{i+1}. {step}' for i, step in enumerate(decomp)])}"
```
bench/vllm_semantic_router_bench/dataset_implementations/gsm8k_dataset.py:1
- This hardcoded seed conflicts with the seed parameter passed to the load_dataset method. Consider using the provided seed parameter instead of hardcoding 42, to maintain consistency with the overall seeding strategy.
…ison.py
- Add new dataset implementations: AQUA-RAT, DROP, GSM8K, MATH, OpenBookQA, SciQ, StrategyQA
- Update router_reason_bench_multi_dataset.py with adaptive max tokens
- Improve answer extraction and evaluation logic for multiple answer formats

Signed-off-by: Huamin Chen <hchen@redhat.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Force-pushed from f8bac71 to 8952bc9.
Do we want to let users choose which dataset split (test? validation?) to use, since some of the datasets may not have a test split? Also, instead of giving a sample count per dataset, do we want to support percentage-based sampling over the datasets?
You are right; at the moment, not all the datasets have train/validation/test splits. I will follow up with split support next.
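
One possible shape for that follow-up, sketched with hypothetical flag names (`--split`, `--sample-ratio`) rather than the PR's actual CLI:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--split", choices=["train", "validation", "test"], default="test",
                    help="Preferred split; callers would fall back when a dataset lacks it.")
parser.add_argument("--sample-ratio", type=float, default=None,
                    help="Evaluate this fraction of each dataset instead of a fixed count.")
args = parser.parse_args([])  # empty argv so the sketch runs standalone

def resolve_sample_count(total: int, ratio, fixed=None) -> int:
    # Percentage-based sampling takes precedence over a fixed per-dataset count.
    if ratio is not None:
        return max(1, int(total * ratio))
    return fixed if fixed is not None else total

print(resolve_sample_count(1319, args.sample_ratio, fixed=50))  # -> 50
```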
What type of PR is this?
Add new dataset implementations: AQUA-RAT, DROP, GSM8K, MATH, OpenBookQA, SciQ, StrategyQA
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Release Notes: Yes/No