ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models
ConceptMath is a bilingual (English and Chinese), fine-grained benchmark that evaluates concept-wise mathematical reasoning of Large Language Models.
conda create --name conceptmath python=3.9
conda activate conceptmath
pip install sympy scipy pandas
git clone https://github.com/conceptmath/conceptmath.git
cd conceptmath
Run the model you want to evaluate and save its responses in the inference folder, following the format of inference/Meta-Llama-3-70B-Instruction/middle_en.jsonl.
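As a rough illustration, the sketch below shows one way to generate responses and write them as JSONL, one record per line. The field names ("id", "question", "response") are assumptions made for this example; match the actual schema used in the referenced middle_en.jsonl file.

import json

# Hypothetical sketch: generate a response for each benchmark question and
# save them as JSONL, one record per line. The field names ("id", "question",
# "response") are assumptions; follow the schema in
# inference/Meta-Llama-3-70B-Instruction/middle_en.jsonl.
def save_responses(questions, generate_fn, path_out):
    with open(path_out, "w", encoding="utf-8") as f_out:
        for item in questions:
            answer = generate_fn(item["question"])  # your model call goes here
            record = {"id": item["id"], "question": item["question"], "response": answer}
            f_out.write(json.dumps(record, ensure_ascii=False) + "\n")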
After preparing the model responses, run the evaluation script:
python evaluation/main.py --path_in inference/Meta-Llama-3-70B-Instruction/middle_en.jsonl --dir_out inference/Meta-Llama-3-70B-Instruction/ --model Meta-Llama-3-70B-Instruction --grade middle_en
You will then get the overall accuracy, the concept-wise accuracies, and the bad cases in the dir_out directory.
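If you want to inspect these outputs programmatically, a sketch along the following lines may help. The file name concept_acc.json and its structure (a JSON object mapping concept names to accuracies) are assumptions; adjust them to whatever evaluation/main.py actually writes to dir_out.

import json

# Assumed-format sketch: load the concept-wise accuracies written to dir_out
# and list the weakest concepts. Both the file name and its structure
# (a dict mapping concept name -> accuracy) are assumptions.
def weakest_concepts(path_concept_acc, k=10):
    with open(path_concept_acc, encoding="utf-8") as f:
        concept_acc = json.load(f)
    return sorted(concept_acc.items(), key=lambda kv: kv[1])[:k]

for concept, acc in weakest_concepts("inference/Meta-Llama-3-70B-Instruction/concept_acc.json"):
    print(f"{concept}: {acc:.3f}")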
Leaderboard
Based on ConceptMath, we evaluate a broad range of LLMs and observe that existing LLMs, despite achieving high average accuracies on traditional benchmarks, exhibit significant performance variations across different math concepts and may even fail catastrophically on the most basic ones.
Results of different models on the ConceptMath benchmark: (a) concept accuracies on Middle-EN; (b) mean concept accuracies on Middle-EN.
About ConceptMath
ConceptMath is a bilingual (English and Chinese), fine-grained benchmark that evaluates concept-wise mathematical reasoning of Large Language Models (LLMs). Unlike traditional benchmarks that evaluate general mathematical reasoning with a single average accuracy, ConceptMath systematically organizes math problems under a hierarchy of math concepts, so that mathematical reasoning can be evaluated at different granularities with concept-wise accuracies.
ConceptMath comprises a total of 4011 math problems across 214 math concepts.
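As a toy illustration (not the benchmark's own evaluation code) of why concept-wise accuracies are more informative than a single average, the snippet below shows how a reasonable overall accuracy can hide a complete failure on one concept; the record fields and concept names are made up for this example.

from collections import defaultdict

# Toy example: per-problem results grouped by concept. The single average
# accuracy (0.6 here) hides the total failure on "Fractions".
results = [
    {"concept": "Linear Equations", "correct": True},
    {"concept": "Linear Equations", "correct": True},
    {"concept": "Linear Equations", "correct": True},
    {"concept": "Fractions", "correct": False},
    {"concept": "Fractions", "correct": False},
]

per_concept = defaultdict(list)
for r in results:
    per_concept[r["concept"]].append(r["correct"])

overall_acc = sum(r["correct"] for r in results) / len(results)      # 0.6
concept_acc = {c: sum(v) / len(v) for c, v in per_concept.items()}   # {'Linear Equations': 1.0, 'Fractions': 0.0}
print(overall_acc, concept_acc)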
How to efficiently address the weaknesses of existing LLMs
We also introduce an efficient fine-tuning strategy to address the weaknesses of existing LLMs.
Left: the concept-wise accuracies of LLaMA2-13B and the version fine-tuned with our efficient fine-tuning method (i.e., LLaMA2-FT). Right: introducing CS data specifically for the bottom 10 concepts significantly improves performance on these concepts, while slightly improving performance across the remaining 33 concepts.
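As a rough sketch of the idea behind such targeted fine-tuning (not the authors' actual data pipeline), one could over-sample training examples for a model's weakest concepts while keeping the rest of the data untouched; the sample structure and "concept" field below are hypothetical.

# Hypothetical sketch, not the authors' pipeline: over-sample training
# examples belonging to the weakest concepts so that fine-tuning focuses
# on them without discarding the remaining data.
def build_targeted_set(train_pool, weak_concepts, boost=3):
    targeted = []
    for sample in train_pool:  # each sample is assumed to carry a "concept" field
        copies = boost if sample["concept"] in weak_concepts else 1
        targeted.extend([sample] * copies)
    return targeted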
If you find ConceptMath useful, please cite:
@article{wu2024conceptmath,
title={ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models},
author={Wu, Yanan and Liu, Jie and Bu, Xingyuan and Liu, Jiaheng and Zhou, Zhanhui and Zhang, Yuanxing and Zhang, Chenchen and Bai, Zhiqi and Chen, Haibin and Ge, Tiezheng and others},
journal={arXiv preprint arXiv:2402.14660},
year={2024}
}