
Cross-dataset benchmarking for OpenMath #30

@anjaligaikwad0502-bot

Description:

Currently, OpenMath is primarily evaluated on the GSM8K dataset. To improve model generalization and provide more comprehensive benchmarking, we propose evaluating the fine-tuned LoRA model on additional open-source math reasoning datasets such as:

SVAMP – to test arithmetic problem solving with varying difficulty.

ASDiv – to include diverse arithmetic and algebra problems.

AQuA / MathQA – optional, for advanced reasoning coverage.

Tasks:

Load the additional datasets and preprocess them to match the OpenMath input prompt format.
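As a starting point for the preprocessing task, here is a minimal sketch that normalizes a SVAMP-style record into a single GSM8K-style prompt. The field names (`Body`, `Question`, `Answer`) follow SVAMP's published schema, but the prompt template itself is an assumption and should be replaced with the actual OpenMath input format:

```python
# Sketch: normalize a SVAMP-style record into a GSM8K-style prompt.
# The prompt template below is an ASSUMPTION -- swap in the exact
# OpenMath prompt format used during fine-tuning.

PROMPT_TEMPLATE = (
    "Question: {question}\n"
    "Answer:"
)

def svamp_to_prompt(record: dict) -> dict:
    """Merge SVAMP's split context ("Body") and question into one example."""
    question = f"{record['Body'].strip()} {record['Question'].strip()}"
    return {
        "prompt": PROMPT_TEMPLATE.format(question=question),
        "answer": str(record["Answer"]),
    }

example = {
    "Body": "Mary has 3 apples. She buys 4 more.",
    "Question": "How many apples does she have now?",
    "Answer": 7.0,
}
print(svamp_to_prompt(example)["prompt"])
```

ASDiv and AQuA/MathQA would each need their own small adapter like this, since their field names and answer formats differ.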

Run inference using the current LoRA-adapted model.
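One way to keep the inference step reusable across datasets is a model-agnostic loop: the LoRA-adapted model (loaded in practice via transformers/PEFT) is passed in as a `generate_fn` callable, which is a hypothetical interface chosen here for illustration. The stub model below only demonstrates the plumbing:

```python
from typing import Callable, Iterable

def run_inference(
    examples: Iterable[dict],
    generate_fn: Callable[[str], str],
) -> list[dict]:
    """Run generate_fn over each preprocessed example, keeping the
    prompt, gold answer, and raw model completion together."""
    results = []
    for ex in examples:
        completion = generate_fn(ex["prompt"])
        results.append({**ex, "completion": completion})
    return results

# Stub standing in for the LoRA-adapted model; in practice this would
# wrap model.generate() on the PEFT-loaded checkpoint.
def echo_model(prompt: str) -> str:
    return "The answer is 7."

out = run_inference(
    [{"prompt": "Question: 3 + 4?\nAnswer:", "answer": "7"}],
    echo_model,
)
print(out[0]["completion"])
```

Keeping the raw completions alongside the gold answers makes it easy to re-score later without re-running generation.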

Compute accuracy and optionally step-by-step reasoning metrics.

Compare performance against GSM8K results.

Document results in a report or markdown file for future reference.
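The report step could be as simple as rendering per-dataset accuracy into a markdown table; the numbers below are purely illustrative, not real OpenMath results:

```python
def results_table(scores: dict[str, float]) -> str:
    """Render per-dataset accuracy as a markdown table for the report."""
    lines = ["| Dataset | Accuracy |", "| --- | --- |"]
    for name, acc in scores.items():
        lines.append(f"| {name} | {acc:.1%} |")
    return "\n".join(lines)

# Placeholder numbers for illustration only.
print(results_table({"GSM8K": 0.612, "SVAMP": 0.548, "ASDiv": 0.571}))
```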

Expected Outcome:

A clear evaluation report showing OpenMath performance across multiple datasets.

Insights into model generalization and possible areas of improvement.

Labels:

enhancement

evaluation

OSCG-2026

Optional:

Include plots/tables comparing accuracy across datasets.

Document the dataset preprocessing scripts so the evaluation can be reproduced.
