We introduce LogicOCR, a benchmark of 2,780 questions for evaluating the logical reasoning abilities of Large Multimodal Models (LMMs) on text-rich images while minimizing reliance on complex STEM knowledge. It comprises two subsets: LogicOCR-Gen, with 1,100 multi-choice questions on generated images, and LogicOCR-Real, with 1,680 meticulously designed free-form questions on real-world images. To construct LogicOCR-Gen, we first curate a text corpus from the Chinese National Civil Servant Examination, then customize an automatic pipeline that steers GPT-Image-1 to generate images with varied layouts and fonts, ensuring contextual relevance and visual realism; the generated images are manually verified. We evaluate a range of representative LMMs under Chain-of-Thought (CoT) and direct-answer settings. Our multi-dimensional analysis reveals key insights, such as the impact of test-time scaling, input-modality differences, and sensitivity to visual-text orientation. Notably, LMMs still lag behind their text-only performance in multimodal reasoning, indicating that they have not fully bridged visual reading with reasoning. Moreover, we propose TextCue, a training-free method that enhances LMMs' perception of image regions containing text cues important for solving a question. It leverages the LMM's attention maps and an off-the-shelf text segmentation specialist to locate such a region, which is then cropped and enlarged to augment the original image.
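The final augmentation step of TextCue can be sketched as follows. This is a minimal illustration assuming the cue region has already been located from the attention maps and the text-segmentation mask (region selection itself is not shown), and the helper name is hypothetical:

```python
import numpy as np

def crop_and_enlarge(image, box, scale=2):
    """Crop a located text-cue region and stack an enlarged copy below
    the original image. Hypothetical helper: in TextCue, `box` would be
    derived from attention maps and a text segmentation specialist,
    which is not shown here.

    image: H x W x 3 uint8 array; box: (top, left, bottom, right).
    """
    top, left, bottom, right = box
    region = image[top:bottom, left:right]
    # Nearest-neighbour upsampling of the cropped cue region.
    enlarged = region.repeat(scale, axis=0).repeat(scale, axis=1)
    # Pad the narrower of the two with white so they can be stacked.
    width = max(image.shape[1], enlarged.shape[1])
    def pad(a):
        return np.pad(a, ((0, 0), (0, width - a.shape[1]), (0, 0)),
                      constant_values=255)
    return np.vstack([pad(image), pad(enlarged)])
```

The stacked canvas keeps the full-page context while making the small text cue legible at the model's input resolution.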
- CoT does not consistently improve accuracy on LogicOCR—most models fail to reason better step-by-step, suggesting flaws in their reasoning paths.
- Test-time scaling significantly improves performance on LogicOCR, though the efficiency of open-source LMMs still leaves room for improvement.
- State-of-the-art LMMs still fall short of fully integrating visual reading and reasoning. While vision-language alignment suffices for perception tasks like OCR, it remains inadequate for more complex reasoning, especially as model size grows.
- The perception robustness of LMMs across different visual-text orientations needs further improvement. Perturbations like image rotation can reduce accuracy to near-random levels.
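The rotation perturbation probed above amounts to a simple transform; a minimal sketch for 90° increments (arbitrary-angle rotation with resampling is not shown):

```python
import numpy as np

def rotate_image(image, quarter_turns=1):
    # Rotate an H x W x C image counter-clockwise in 90-degree steps,
    # one way to perturb visual-text orientation before evaluation.
    return np.rot90(image, k=quarter_turns, axes=(0, 1))
```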
For main results and detailed analysis, please refer to the paper.
- [11/28/2025]: A new version of the paper is released. LogicOCR now consists of two subsets, i.e., LogicOCR-Gen with 1,100 multi-choice questions on generated images, and LogicOCR-Real with 1,680 meticulously designed free-form questions on real-world images.
- [05/16/2025]: Release the dataset on Hugging Face. Release the code.
- Setup
Clone this repo and download the images and JSON files:

```shell
git clone https://github.com/MiliLab/LogicOCR
cd LogicOCR
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/LogicOCR_gen.zip
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/LogicOCR_real.zip
unzip LogicOCR_gen.zip && rm LogicOCR_gen.zip
unzip LogicOCR_real.zip && rm LogicOCR_real.zip
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/LogicOCR_gen.json
wget https://huggingface.co/datasets/MiliLab/LogicOCR/resolve/main/LogicOCR_real.json
```

- Recommended Environment
python>=3.10, torch 2.5.1, torchvision 0.20.1, transformers>=4.49.0, flash-attn 2.7.4.post1; see requirement.txt for details.
- Evaluate LMMs
Some evaluation scripts are provided in infer_models and infer_models_real.
For evaluation on LogicOCR-Gen:

```shell
bash eval_gen.sh
```

For evaluation on LogicOCR-Real:

```shell
bash eval_real.sh
```

Report the overall and detailed accuracy, for example:

```shell
python get_score.py \
    --gen_json res/LLaVA-OneVision-1.5-8B-Instruct_image-text_cot.json \
    --real_json res_real/LLaVA-OneVision-1.5-8B-Instruct_image-text_cot.json
```

- (Optional) Evaluate OCR and Two-Step Performance
```shell
bash eval_ocr.sh
```

You can also find the existing OCR evaluation results in the Hugging Face repo.
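Under the hood, the overall score reported by get_score.py reduces to per-record exact-match accuracy. A minimal sketch, assuming each result record exposes a prediction and a ground-truth field (the names `pred` and `answer` here are hypothetical placeholders; see get_score.py for the actual schema):

```python
def overall_accuracy(results):
    """Fraction of records whose prediction matches the ground truth.

    The "pred"/"answer" field names are hypothetical placeholders;
    the real schema is defined in get_score.py.
    """
    if not results:
        return 0.0
    correct = sum(r["pred"] == r["answer"] for r in results)
    return correct / len(results)
```

Per-category accuracy is the same computation restricted to records sharing a category tag.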
If you want to generate images yourself, a JSON file with 3 samples and a simple script are provided for reference. You can run the following commands. The generated images will be saved in gen_images/saved_folder.

```shell
cd gen_images
python gpt_generate.py samples.json $YOUR_API_KEY $YOUR_BASE_URL $NUM_WORKERS
```

LogicOCR is licensed under CC BY-NC-SA 4.0.
The raw text corpora for constructing LogicOCR-Gen are collected from LogiQA and LogiQA2.0.
The inference script is modified from OCRBench. The OCR evaluation tool is modified from Fox.
If you find LogicOCR helpful, please consider giving this repo a ⭐ and citing:
```bibtex
@article{ye2025logicocr,
  title={LogicOCR: Do Your Large Multimodal Models Excel at Logical Reasoning on Text-Rich Images?},
  author={Maoyuan Ye and Haibin He and Qihuang Zhong and Jing Zhang and Juhua Liu and Bo Du},
  journal={arXiv preprint arXiv:2505.12307},
  year={2025}
}
```

