Cross-dataset benchmarking for OpenMath #30

## Description:

Currently, OpenMath is primarily evaluated on the GSM8K dataset. To measure how well the model generalizes and to provide more comprehensive benchmarking, we propose evaluating the fine-tuned LoRA model on additional open-source math reasoning datasets such as:

SVAMP – to test robustness on elementary arithmetic word problems whose structure is varied.

ASDiv – to include diverse arithmetic and algebra problems.

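AQuA / MathQA – optional, for more advanced reasoning coverage. Both are multiple-choice, so they are scored on the selected option rather than on a numeric match.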
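Because these datasets likely store the problem text and answer under different field names, a small normalization layer can map each record to one shared shape before any prompt is built. The sketch below is a minimal illustration, not the project's preprocessing code: `MathExample`, `FIELD_MAP`, and `normalize_record` are hypothetical names, and the field names in `FIELD_MAP` are assumptions that should be checked against the actual dataset files.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MathExample:
    """One problem in a dataset-independent shape."""
    dataset: str
    question: str
    answer: str


# ASSUMPTION: these field names are illustrative; verify them against the
# actual schema of each dataset release before relying on them.
FIELD_MAP = {
    "gsm8k": {"context": None, "question": "question", "answer": "answer"},
    "svamp": {"context": "Body", "question": "Question", "answer": "Answer"},
    "asdiv": {"context": "body", "question": "question", "answer": "answer"},
}


def normalize_record(dataset: str, record: dict) -> MathExample:
    """Map a raw record to MathExample, joining context and question text."""
    fields = FIELD_MAP[dataset]
    context_key = fields["context"]
    context = str(record.get(context_key, "")).strip() if context_key else ""
    question = str(record[fields["question"]]).strip()
    text = f"{context} {question}".strip()
    answer = str(record[fields["answer"]]).strip()
    return MathExample(dataset=dataset, question=text, answer=answer)
```

Keeping the mapping in one dictionary means that adding a dataset is a one-line change, and the OpenMath prompt template only needs to be applied to `MathExample.question`.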

## Tasks:

Load the additional datasets and preprocess them to match the OpenMath input prompt format.

Run inference using the current LoRA-adapted model.

Compute final-answer accuracy and, optionally, metrics on the step-by-step reasoning.

Compare performance against GSM8K results.

Document results in a report or markdown file for future reference.
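The accuracy step in the tasks above could be sketched as follows. This is one possible scoring rule, not necessarily the one OpenMath uses for GSM8K: it assumes the final answer is the last number in the generated text, and the helper names (`extract_final_number`, `is_correct`, `accuracy`) are hypothetical. Multiple-choice datasets such as AQuA would need a different extractor.

```python
import re
from typing import Optional, Sequence

# Matches integers and decimals, with optional sign and thousands separators.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text: str) -> Optional[float]:
    """Return the last number found in the text, or None if there is none."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def is_correct(prediction: str, reference: str, tol: float = 1e-6) -> bool:
    """Compare the final numbers of prediction and reference within tol."""
    pred = extract_final_number(prediction)
    ref = extract_final_number(reference)
    if pred is None or ref is None:
        return False
    return abs(pred - ref) <= tol


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Fraction of predictions whose final number matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have equal length")
    if not predictions:
        return 0.0
    correct = sum(is_correct(p, r) for p, r in zip(predictions, references))
    return correct / len(predictions)
```

Applying the same extractor to both the model output and the reference keeps the rule consistent across datasets whose reference answers carry extra text, such as units or a `####` prefix.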

## Expected Outcome:

A clear evaluation report showing OpenMath performance across multiple datasets.

Insights into model generalization and possible areas of improvement.
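One way to present the report is a markdown table that lists each dataset's accuracy next to its gap from the GSM8K baseline. The function below is a minimal sketch under that assumption: `comparison_table` is a hypothetical helper, and it expects accuracies as fractions between 0 and 1 keyed by dataset name.

```python
def comparison_table(accuracies: dict, baseline: str = "GSM8K") -> str:
    """Render per-dataset accuracy and the gap to the baseline as markdown."""
    if baseline not in accuracies:
        raise KeyError(f"baseline {baseline!r} missing from accuracies")
    base = accuracies[baseline]
    lines = [
        f"| Dataset | Accuracy (%) | Delta vs {baseline} (pp) |",
        "| --- | ---: | ---: |",
    ]
    # List the baseline first, then the other datasets in insertion order.
    ordered = [baseline] + [name for name in accuracies if name != baseline]
    for name in ordered:
        acc = accuracies[name]
        delta = (acc - base) * 100  # gap in percentage points
        lines.append(f"| {name} | {acc * 100:.1f} | {delta:+.1f} |")
    return "\n".join(lines)
```

The returned string can be written directly into the results markdown file, so the table is regenerated from the measured numbers instead of being edited by hand.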

## Labels:

enhancement

evaluation

OSCG-2026

## Optional:

Include plots/tables comparing accuracy across datasets.

Document the dataset preprocessing scripts so that the results can be reproduced.
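For the reproducibility notes, one lightweight option is to save a small manifest next to each evaluation run that records how the data was prepared. The sketch below is an assumption about what such a manifest might contain. `preprocessing_manifest` and its fields are hypothetical, and the fingerprint is simply a truncated SHA-256 of the recorded settings.

```python
import hashlib
import json


def preprocessing_manifest(dataset: str, split: str, num_examples: int,
                           prompt_template: str, seed: int) -> dict:
    """Record preprocessing settings plus a fingerprint derived from them."""
    manifest = {
        "dataset": dataset,
        "split": split,
        "num_examples": num_examples,
        "prompt_template": prompt_template,
        "seed": seed,
    }
    # Sorting keys makes the serialized form, and so the hash, deterministic.
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["fingerprint"] = hashlib.sha256(payload).hexdigest()[:16]
    return manifest
```

Two runs that report the same fingerprint used the same recorded settings, which makes it easier to tell whether a change in accuracy comes from the model or from the preprocessing.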
