Skip to content

Conversation

rootfs
Copy link
Collaborator

@rootfs rootfs commented Sep 28, 2025

What type of PR is this?

Add new dataset implementations: AQUA-RAT, DROP, GSM8K, MATH, OpenBookQA, SciQ, StrategyQA

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Release Notes: Yes/No

Copy link

github-actions bot commented Sep 28, 2025

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 bench

Owners: @yuezhu1, @Xunzhuo
Files changed:

  • bench/plot_comprehensive_results.py
  • bench/vllm_semantic_router_bench/dataset_implementations/aqua_rat_dataset.py
  • bench/vllm_semantic_router_bench/dataset_implementations/drop_dataset.py
  • bench/vllm_semantic_router_bench/dataset_implementations/gsm8k_dataset.py
  • bench/vllm_semantic_router_bench/dataset_implementations/math_dataset.py
  • bench/vllm_semantic_router_bench/dataset_implementations/openbookqa_dataset.py
  • bench/vllm_semantic_router_bench/dataset_implementations/sciq_dataset.py
  • bench/vllm_semantic_router_bench/dataset_implementations/strategyqa_dataset.py
  • bench/comprehensive_bench.sh
  • bench/vllm_semantic_router_bench/cli.py
  • bench/vllm_semantic_router_bench/dataset_factory.py
  • bench/vllm_semantic_router_bench/router_reason_bench_multi_dataset.py

vLLM

🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

Copy link

netlify bot commented Sep 28, 2025

Deploy Preview for vllm-semantic-router ready!

Name Link
🔨 Latest commit bd78dd3
🔍 Latest deploy log https://app.netlify.com/projects/vllm-semantic-router/deploys/68d95fbf2586260008c63cae
😎 Deploy Preview https://deploy-preview-270--vllm-semantic-router.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds support for multiple new datasets in the router evaluation benchmark, expanding from the original 6 datasets to include 7 new mathematical reasoning, multi-step reasoning, and scientific reasoning datasets.

Key changes:

  • Added implementations for 7 new datasets: AQUA-RAT, DROP, GSM8K, MATH (disabled), OpenBookQA, SciQ, and StrategyQA
  • Enhanced answer extraction with format-specific parsers for multiple-choice, binary, and free-form questions
  • Updated token allocation system with dataset-specific optimal limits and model-aware multipliers

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
router_reason_bench_multi_dataset.py Enhanced answer extraction patterns, expanded dataset token configs, improved model compatibility
strategyqa_dataset.py New dataset for multi-step Yes/No reasoning questions
sciq_dataset.py New dataset for science multiple-choice questions
openbookqa_dataset.py New dataset for elementary science reasoning
math_dataset.py New dataset for competition mathematics (commented as disabled)
gsm8k_dataset.py New dataset for elementary math word problems
drop_dataset.py New dataset for discrete reasoning over text passages
aqua_rat_dataset.py New dataset for algebraic word problems with rationales
dataset_factory.py Registration of all new dataset implementations
cli.py Updated CLI choices to include new datasets
plot_comprehensive_results.py New comprehensive plotting script for benchmark visualization
comprehensive_bench.sh Enhanced benchmark script with new datasets and plotting integration
Comments suppressed due to low confidence (2)

bench/vllm_semantic_router_bench/router_reason_bench_multi_dataset.py:1

  • [nitpick] Consider using f-string formatting consistently. The string concatenation here could be replaced with an f-string for better performance and readability: f\"Reasoning steps:\\n{chr(10).join([f'{i+1}. {step}' for i, step in enumerate(decomp)])}\"
"""

bench/vllm_semantic_router_bench/dataset_implementations/gsm8k_dataset.py:1

  • This hardcoded seed conflicts with the seed parameter passed to the load_dataset method. Consider using the provided seed parameter instead of hardcoding 42 to maintain consistency with the overall seeding strategy.
"""

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Copy link

netlify bot commented Sep 28, 2025

Deploy Preview for vllm-semantic-router ready!

Name Link
🔨 Latest commit efe2e77
🔍 Latest deploy log https://app.netlify.com/projects/vllm-semantic-router/deploys/68d96841a0c4300008ba6cca
😎 Deploy Preview https://deploy-preview-270--vllm-semantic-router.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

rootfs and others added 3 commits September 28, 2025 16:23
…ison.py

- Add new dataset implementations: AQUA-RAT, DROP, GSM8K, MATH, OpenBookQA, SciQ, StrategyQA
- Update router_reason_bench_multi_dataset.py with adaptive max token
- Improved answer extraction and evaluation logic for multiple answer formats

Signed-off-by: Huamin Chen <hchen@redhat.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Huamin Chen <hchen@redhat.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Huamin Chen <hchen@redhat.com>
rootfs and others added 2 commits September 28, 2025 16:24
Signed-off-by: Huamin Chen <hchen@redhat.com>
Copy link
Collaborator

@yuezhu1 yuezhu1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to let users to input which dataset subset (test? valid?) they want to use? As some of the datasets may not have test subsets. Also instead of giving sample count per dataset, do we want to support the percentage base sampling over the datasets?

@rootfs
Copy link
Collaborator Author

rootfs commented Sep 28, 2025

Do we want to let users to input which dataset subset (test? valid?) they want to use? As some of the datasets may not have test subsets. Also instead of giving sample count per dataset, do we want to support the percentage base sampling over the datasets?

you are right, at the moment, not all the datasets have test/train/validate splits. I will follow up with such split next.

@rootfs rootfs marked this pull request as draft September 29, 2025 00:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants