Bench: Add more datasets in router evaluation #270
Conversation
👥 vLLM Semantic Team Notification: The following members have been identified for the changed files in this PR and have been automatically assigned.
Force-pushed from bd78dd3 to 2743aad.
✅ Deploy Preview for vllm-semantic-router ready!
Pull Request Overview
This PR adds support for multiple new datasets in the router evaluation benchmark, expanding the original 6 datasets with 7 new ones covering mathematical, multi-step, and scientific reasoning.
Key changes:
- Added implementations for 7 new datasets: AQUA-RAT, DROP, GSM8K, MATH (disabled), OpenBookQA, SciQ, and StrategyQA
- Enhanced answer extraction with format-specific parsers for multiple-choice, binary, and free-form questions (see the sketch after this list)
- Updated token allocation system with dataset-specific optimal limits and model-aware multipliers
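
As a concrete sense of what such format-specific parsers can look like, here is a minimal sketch; the function names and regex patterns are illustrative, not the PR's actual implementation:

```python
import re

# Hypothetical format-specific extractors; the patterns are illustrative,
# not the regexes used in router_reason_bench_multi_dataset.py.
def extract_multiple_choice(text: str) -> str | None:
    m = re.search(r"\b(?:answer|choice)\s*(?:is|:)?\s*\(?([A-E])\)?", text, re.I)
    return m.group(1).upper() if m else None

def extract_binary(text: str) -> str | None:
    m = re.search(r"\b(yes|no)\b", text, re.I)
    return m.group(1).capitalize() if m else None

def extract_free_form(text: str) -> str | None:
    m = re.search(r"(?:final answer|answer)\s*(?:is|:)\s*(.+)", text, re.I)
    return m.group(1).strip() if m else None

print(extract_multiple_choice("The answer is (C)."))  # -> C
print(extract_binary("So the answer is yes."))        # -> Yes
```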
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| router_reason_bench_multi_dataset.py | Enhanced answer extraction patterns, expanded dataset token configs (see the token-allocation sketch after this table), improved model compatibility |
| strategyqa_dataset.py | New dataset for multi-step Yes/No reasoning questions |
| sciq_dataset.py | New dataset for science multiple-choice questions |
| openbookqa_dataset.py | New dataset for elementary science reasoning |
| math_dataset.py | New dataset for competition mathematics (commented as disabled) |
| gsm8k_dataset.py | New dataset for elementary math word problems |
| drop_dataset.py | New dataset for discrete reasoning over text passages |
| aqua_rat_dataset.py | New dataset for algebraic word problems with rationales |
| dataset_factory.py | Registration of all new dataset implementations |
| cli.py | Updated CLI choices to include new datasets |
| plot_comprehensive_results.py | New comprehensive plotting script for benchmark visualization |
| comprehensive_bench.sh | Enhanced benchmark script with new datasets and plotting integration |
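
As a rough illustration of the token-allocation change called out above, here is a hedged sketch; the dataset keys match the new datasets, but the limits, multiplier keys, and function are assumptions, not the PR's actual values:

```python
# Hypothetical per-dataset token limits and model-aware multipliers;
# the numbers are made up for illustration.
DATASET_MAX_TOKENS = {
    "gsm8k": 512,       # short chain-of-thought arithmetic
    "drop": 768,        # discrete reasoning over passages
    "aqua_rat": 1024,   # algebraic rationales
    "strategyqa": 256,  # multi-step Yes/No answers
}

MODEL_MULTIPLIERS = {
    "reasoning": 2.0,   # reasoning-tuned models emit longer traces
    "default": 1.0,
}

def max_tokens(dataset: str, model_family: str = "default") -> int:
    base = DATASET_MAX_TOKENS.get(dataset, 512)
    return int(base * MODEL_MULTIPLIERS.get(model_family, 1.0))

print(max_tokens("gsm8k", "reasoning"))  # -> 1024
```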
Comments suppressed due to low confidence (2)
bench/vllm_semantic_router_bench/router_reason_bench_multi_dataset.py:1
- [nitpick] Consider using f-string formatting consistently. The string concatenation here could be replaced with an f-string for better performance and readability:
```python
f"Reasoning steps:\n{chr(10).join([f'{i+1}. {step}' for i, step in enumerate(decomp)])}"
```
bench/vllm_semantic_router_bench/dataset_implementations/gsm8k_dataset.py:1
- This hardcoded seed conflicts with the seed parameter passed to the load_dataset method. Consider using the provided seed parameter instead of hardcoding 42, to maintain consistency with the overall seeding strategy.
…ison.py
- Add new dataset implementations: AQUA-RAT, DROP, GSM8K, MATH, OpenBookQA, SciQ, StrategyQA
- Update router_reason_bench_multi_dataset.py with adaptive max tokens
- Improve answer extraction and evaluation logic for multiple answer formats

Signed-off-by: Huamin Chen <hchen@redhat.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Force-pushed from f8bac71 to 8952bc9.
Do we want to let users choose which dataset split (test? validation?) to use, since some of the datasets may not have a test split? Also, instead of giving a sample count per dataset, do we want to support percentage-based sampling over the datasets?
You are right; at the moment, not all the datasets have train/validation/test splits. I will follow up with split support next.
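
One possible shape for that follow-up, sketched with hypothetical flag names (`--split`, `--sample-ratio`) rather than the PR's actual CLI:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--split", choices=["train", "validation", "test"], default="test",
                    help="Preferred split; callers would fall back when a dataset lacks it.")
parser.add_argument("--sample-ratio", type=float, default=None,
                    help="Evaluate this fraction of each dataset instead of a fixed count.")
args = parser.parse_args([])  # empty argv so the sketch runs standalone

def resolve_sample_count(total: int, ratio, fixed=None) -> int:
    # Percentage-based sampling takes precedence over a fixed per-dataset count.
    if ratio is not None:
        return max(1, int(total * ratio))
    return fixed if fixed is not None else total

print(resolve_sample_count(1319, args.sample_ratio, fixed=50))  # -> 50
```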
What type of PR is this?
Add new dataset implementations: AQUA-RAT, DROP, GSM8K, MATH, OpenBookQA, SciQ, StrategyQA
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Release Notes: Yes/No