ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models
ConceptMath is a bilingual (English and Chinese), fine-grained benchmark that evaluates concept-wise mathematical reasoning of Large Language Models.
conda create --name conceptmath python=3.9
conda activate conceptmath
pip install sympy scipy pandas
git clone https://github.com/conceptmath/conceptmath.git
cd conceptmath
Run the model you want to evaluate and save its responses in the inference folder, following the format of inference/Meta-Llama-3-70B-Instruction/middle_en.jsonl.
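As a rough illustration, the sketch below shows one way to generate responses and write them as JSONL, one record per line. The field names ("id", "question", "response") are assumptions made for this example; match the actual schema used in the referenced middle_en.jsonl file.

import json

# Hypothetical sketch: generate a response for each benchmark question and
# save them as JSONL, one record per line. The field names ("id", "question",
# "response") are assumptions; follow the schema in
# inference/Meta-Llama-3-70B-Instruction/middle_en.jsonl.
def save_responses(questions, generate_fn, path_out):
    with open(path_out, "w", encoding="utf-8") as f_out:
        for item in questions:
            answer = generate_fn(item["question"])  # your model call goes here
            record = {"id": item["id"], "question": item["question"], "response": answer}
            f_out.write(json.dumps(record, ensure_ascii=False) + "\n")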
After preparing the model responses, run the evaluation script:
python evaluation/main.py --path_in inference/Meta-Llama-3-70B-Instruction/middle_en.jsonl --dir_out inference/Meta-Llama-3-70B-Instruction/ --model Meta-Llama-3-70B-Instruction --grade middle_en
You will then get the overall accuracy, the concept-wise accuracies, and the bad cases in the dir_out directory.
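If you want to inspect these outputs programmatically, a sketch along the following lines may help. The file name concept_acc.json and its structure (a JSON object mapping concept names to accuracies) are assumptions; adjust them to whatever evaluation/main.py actually writes to dir_out.

import json

# Assumed-format sketch: load the concept-wise accuracies written to dir_out
# and list the weakest concepts. Both the file name and its structure
# (a dict mapping concept name -> accuracy) are assumptions.
def weakest_concepts(path_concept_acc, k=10):
    with open(path_concept_acc, encoding="utf-8") as f:
        concept_acc = json.load(f)
    return sorted(concept_acc.items(), key=lambda kv: kv[1])[:k]

for concept, acc in weakest_concepts("inference/Meta-Llama-3-70B-Instruction/concept_acc.json"):
    print(f"{concept}: {acc:.3f}")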
Leaderboard
Based on ConceptMath, we evaluate a broad range of LLMs and observe that existing LLMs, despite achieving high average accuracies on traditional benchmarks, exhibit significant performance variations across different math concepts and may even fail catastrophically on the most basic ones.
Results of different models on the ConceptMath benchmark: (a) concept accuracies on Middle-EN; (b) mean concept accuracies on Middle-EN.
About ConceptMath
ConceptMath is a bilingual (English and Chinese), fine-grained benchmark that evaluates concept-wise mathematical reasoning of Large Language Models (LLMs). Unlike traditional benchmarks that evaluate general mathematical reasoning with a single average accuracy, ConceptMath systematically organizes math problems under a hierarchy of math concepts, so that mathematical reasoning can be evaluated at different granularities with concept-wise accuracies.
ConceptMath comprises a total of 4011 math problems across 214 math concepts.
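As a toy illustration (not the benchmark's own evaluation code) of why concept-wise accuracies are more informative than a single average, the snippet below shows how a reasonable overall accuracy can hide a complete failure on one concept; the record fields and concept names are made up for this example.

from collections import defaultdict

# Toy example: per-problem results grouped by concept. The single average
# accuracy (0.6 here) hides the total failure on "Fractions".
results = [
    {"concept": "Linear Equations", "correct": True},
    {"concept": "Linear Equations", "correct": True},
    {"concept": "Linear Equations", "correct": True},
    {"concept": "Fractions", "correct": False},
    {"concept": "Fractions", "correct": False},
]

per_concept = defaultdict(list)
for r in results:
    per_concept[r["concept"]].append(r["correct"])

overall_acc = sum(r["correct"] for r in results) / len(results)      # 0.6
concept_acc = {c: sum(v) / len(v) for c, v in per_concept.items()}   # {'Linear Equations': 1.0, 'Fractions': 0.0}
print(overall_acc, concept_acc)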
How to efficiently address the weaknesses of existing LLMs
We also introduce an efficient fine-tuning strategy to address the weaknesses of existing LLMs.
Left: the concept-wise accuracies of LLaMA2-13B and the version fine-tuned with our efficient fine-tuning method (i.e., LLaMA2-FT). Right: introducing CS data specifically for the bottom 10 concepts significantly improves performance on these concepts, while slightly improving performance across the remaining 33 concepts.
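As a rough sketch of the idea behind such targeted fine-tuning (not the authors' actual data pipeline), one could over-sample training examples for a model's weakest concepts while keeping the rest of the data untouched; the sample structure and "concept" field below are hypothetical.

# Hypothetical sketch, not the authors' pipeline: over-sample training
# examples belonging to the weakest concepts so that fine-tuning focuses
# on them without discarding the remaining data.
def build_targeted_set(train_pool, weak_concepts, boost=3):
    targeted = []
    for sample in train_pool:  # each sample is assumed to carry a "concept" field
        copies = boost if sample["concept"] in weak_concepts else 1
        targeted.extend([sample] * copies)
    return targeted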
If you find ConceptMath useful, please cite:
@article{wu2024conceptmath,
title={ConceptMath: A Bilingual Concept-wise Benchmark for Measuring Mathematical Reasoning of Large Language Models},
author={Wu, Yanan and Liu, Jie and Bu, Xingyuan and Liu, Jiaheng and Zhou, Zhanhui and Zhang, Yuanxing and Zhang, Chenchen and Bai, Zhiqi and Chen, Haibin and Ge, Tiezheng and others},
journal={arXiv preprint arXiv:2402.14660},
year={2024}
}