Description:
Currently, OpenMath is evaluated primarily on the GSM8K dataset. To measure how well the model generalizes and to provide more comprehensive benchmarking, we propose evaluating the fine-tuned LoRA model on additional open-source math reasoning datasets, such as:
SVAMP – to test arithmetic problem solving with varying difficulty.
ASDiv – to include diverse arithmetic and algebra problems.
AQuA / MathQA – optional, for advanced reasoning coverage.
Tasks:
Load the additional datasets and preprocess them to match the OpenMath input prompt format.
Run inference using the current LoRA-adapted model.
Compute final-answer accuracy and, optionally, step-by-step reasoning metrics.
Compare performance against GSM8K results.
Document results in a report or markdown file for future reference.
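The preprocessing and scoring tasks above could be sketched roughly as follows. This is only an illustration under stated assumptions: `PROMPT_TEMPLATE` is a hypothetical placeholder (the real OpenMath prompt format should be substituted), the helper names are invented for this sketch, and "accuracy" is taken to mean that the last number in the model output matches the gold answer.

```python
import re
from typing import Iterable, Optional

# Hypothetical template: replace with the prompt format OpenMath was fine-tuned on.
PROMPT_TEMPLATE = "Question: {question}\nAnswer:"

# Matches integers and decimals, optionally negative, with optional thousands commas.
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def build_prompt(body: str, question: str = "") -> str:
    """Join an optional context passage and the question into one prompt."""
    parts = (body, question)
    text = " ".join(p.strip() for p in parts if p and p.strip())
    return PROMPT_TEMPLATE.format(question=text)


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number found in the text, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def accuracy(predictions: Iterable[str], gold: Iterable[float], tol: float = 1e-6) -> float:
    """Fraction of outputs whose last number equals the gold answer within tol."""
    pairs = list(zip(predictions, gold))
    if not pairs:
        return 0.0
    correct = 0
    for output, target in pairs:
        value = extract_final_number(output)
        if value is not None and abs(value - target) <= tol:
            correct += 1
    return correct / len(pairs)
```

Datasets that split a problem into a context passage and a question can pass both to `build_prompt`; datasets with a single question field can pass it alone. Multiple-choice datasets such as AQuA and MathQA would likely need a different scorer that compares option letters rather than numbers.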
Expected Outcome:
A clear evaluation report showing OpenMath performance across multiple datasets.
Insights into model generalization and possible areas of improvement.
Labels:
enhancement
evaluation
OSCG-2026
Optional:
Include plots/tables comparing accuracy across datasets.
Add notes on the dataset preprocessing scripts so that results can be reproduced.
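For the comparison table, one possible approach is a small helper that renders per-dataset counts as a markdown table for the report. The function name and the input shape (a mapping from dataset name to a `(num_correct, num_examples)` pair) are assumptions made for this sketch, not part of the existing codebase.

```python
from typing import Mapping, Tuple


def accuracy_table(results: Mapping[str, Tuple[int, int]]) -> str:
    """Render {dataset: (num_correct, num_examples)} as a markdown table."""
    lines = [
        "| Dataset | Correct | Total | Accuracy |",
        "| --- | ---: | ---: | ---: |",
    ]
    for name, (correct, total) in results.items():
        # Guard against empty datasets so the report never divides by zero.
        acc = f"{100.0 * correct / total:.1f}%" if total else "n/a"
        lines.append(f"| {name} | {correct} | {total} | {acc} |")
    return "\n".join(lines)
```

Reporting raw counts alongside the percentage makes it easier to compare datasets of very different sizes.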