Experiment code for DiagramIR: An Automatic Evaluation Pipeline for Educational Math Diagrams.
DiagramIR is a scalable and reliable method for automatic evaluation of geometric figures. It uses a language model to translate diagram code into an intermediate representation (IR), a standardized, structured format where rule-based checks can be applied against the key geometric and mathematical constraints of the figure. This approach achieves higher agreement with human raters (Cohen's kappa) and enables smaller models, such as GPT-4.1 Mini, to perform on par with larger frontier models like GPT-5, at nearly 10x lower inference cost.
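To make the idea of rule-based checks over an IR concrete, here is a minimal sketch. The `Point` and `TriangleIR` classes, their fields, and `check_right_angle_label` are hypothetical names invented for illustration; the real schema lives in `IR_model.py` and the real checks in `evaluator.py`, and they may look quite different.

```python
import math
from dataclasses import dataclass
from typing import Optional


# Hypothetical IR: illustrative only, not the schema from IR_model.py.
@dataclass
class Point:
    x: float
    y: float


@dataclass
class TriangleIR:
    a: Point
    b: Point
    c: Point
    # Vertex the diagram marks as a right angle: "a", "b", "c", or None.
    labeled_right_angle_at: Optional[str] = None


def _angle_at(vertex: Point, p: Point, q: Point) -> float:
    """Angle at `vertex` in degrees, between the rays toward p and q."""
    v1 = (p.x - vertex.x, p.y - vertex.y)
    v2 = (q.x - vertex.x, q.y - vertex.y)
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        raise ValueError("degenerate triangle: coincident vertices")
    cos_theta = (v1[0] * v2[0] + v1[1] * v2[1]) / norm
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_theta))))


def check_right_angle_label(tri: TriangleIR, tol_deg: float = 0.5) -> bool:
    """Rule: a vertex marked as a right angle must measure 90 degrees."""
    if tri.labeled_right_angle_at is None:
        return True  # nothing is claimed, so there is nothing to verify
    points = {"a": tri.a, "b": tri.b, "c": tri.c}
    vertex = points.pop(tri.labeled_right_angle_at)
    p, q = points.values()
    return abs(_angle_at(vertex, p, q) - 90.0) <= tol_deg
```

The point of the design is that once the language model has produced structured data, the check itself is deterministic geometry and needs no further model calls.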
- ImageMagick (magick CLI)
- TeX distribution with lualatex and dvisvgm
- Optional: uv for environment management.
Verify these system tools are on your PATH:
magick -version
lualatex --version
dvisvgm --version
The TeX compilation step also expects the supporting styles in styles/ (IM.cls, IMlongdivision.sty, Tikz-IM.sty, Tikz-IM-ES.sty).
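The same prerequisites can be checked from Python before starting a long run. This is a sketch rather than part of the repository: `missing_tools` and `missing_styles` are hypothetical helper names, and the script assumes it is run from the repository root so that `styles/` resolves.

```python
import shutil
from pathlib import Path

REQUIRED_TOOLS = ("magick", "lualatex", "dvisvgm")
REQUIRED_STYLES = ("IM.cls", "IMlongdivision.sty", "Tikz-IM.sty", "Tikz-IM-ES.sty")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def missing_styles(styles_dir="styles", names=REQUIRED_STYLES):
    """Return the style files that are absent from styles_dir."""
    base = Path(styles_dir)
    return [name for name in names if not (base / name).is_file()]


if __name__ == "__main__":
    print("missing tools:", missing_tools() or "none")
    print("missing styles:", missing_styles() or "none")
```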
Create the virtual environment and install the project plus notebook extras:
uv venv
uv sync --extra notebooks
source .venv/bin/activate
Copy the example file and populate your API keys / configuration:
cp .env.example .env
Required entries:
- OPENAI_API_KEY, OPENAI_MODEL
- TORCH_DEVICE, MAX_CONCURRENCY
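A quick way to confirm that `.env` is complete is to parse it and report any required key that is missing or empty. The sketch below uses a small hand-written parser so that it has no dependencies; `parse_env_file` and `missing_keys` are hypothetical helpers, and the project itself may load its configuration differently.

```python
from pathlib import Path

REQUIRED_KEYS = ("OPENAI_API_KEY", "OPENAI_MODEL", "TORCH_DEVICE", "MAX_CONCURRENCY")


def parse_env_file(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return required keys that are absent or set to an empty string."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text(encoding="utf-8") if env_path.is_file() else ""
    print("missing entries:", missing_keys(parse_env_file(text)) or "none")
```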
- backtranslation.py - TikZ -> IR extraction helpers (includes compile_tikz).
- evaluator.py - Rule-based checks applied to the extracted IR.
- geometry_engine.py, IR_model.py - Geometry helpers and Pydantic schema for the IR.
- scripts/ - Utilities for LLM-as-a-judge evaluation.
- notebooks/ - Backtranslation experiment and analysis notebooks.
- data/ - Benchmark prompts and human annotation CSVs.
- results/ - Cached model outputs and evaluation results.
- Run notebooks/backtranslation_evaluation.ipynb to generate model IR extractions and populate results/backtranslation/.
- Run notebooks/backtranslation_analysis.ipynb to compute human agreement, cost, and time metrics.
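For reference, Cohen's kappa compares the observed agreement p_o between two raters with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The function below is a minimal standalone sketch of that formula. It is not taken from the analysis notebook, which may compute the metric another way (for example with a library routine).

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

With labels `[1, 1, 1, 0]` from a human and `[1, 1, 0, 0]` from the pipeline, the observed agreement is 0.75 and the chance agreement is 0.5, so kappa is 0.5.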
- Pre-render judge PNGs if you need image inputs:
  python scripts/precompute_judge_pngs.py
  Requires magick, lualatex, and dvisvgm.
- Ensure data/geometric_shapes_test_set.csv (or your chosen dataset) includes diagram_id, TikZ, and optional PNG paths.
- Adjust llm_judge.py to list the models, prompt mode (code, image, or both), and concurrency you want.
- Launch the evaluation:
  python llm_judge.py
  Results will appear under results/llm_judge/<mode>/<model>/diagram_*.json.
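Once a run has finished, the per-diagram JSON files can be gathered for inspection. The sketch below relies only on the directory layout stated above; `load_judge_results` is a hypothetical helper, and because the fields inside each JSON file are not documented here, it returns the parsed contents unchanged.

```python
import json
from pathlib import Path


def load_judge_results(mode, model, root="results/llm_judge"):
    """Load every diagram_*.json under root/mode/model, keyed by file stem."""
    folder = Path(root) / mode / model
    results = {}
    for path in sorted(folder.glob("diagram_*.json")):
        with path.open(encoding="utf-8") as handle:
            results[path.stem] = json.load(handle)
    return results
```

For example, `load_judge_results("code", "<model>")` would return one entry per diagram judged in code mode, with `<model>` replaced by the directory name your run produced. A missing folder simply yields an empty dictionary.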