This repository contains a preliminary version of the preprocessing scripts for the Multimodal Multilingual Global Exams project, used to extract text, images, and tables from PDF files. Language-specific OCR pipeline scripts will be added as they are developed. Feel free to contribute scripts for your language! The base script is intended to serve as a template for implementing more advanced PDF parsing functionality.
- Text Extraction: Reads and saves all text content from each page of a PDF.
- Image Extraction: Extracts images from PDF pages and saves them as individual files.
- Table Extraction: Detects tables on PDF pages and saves their regions as images.
- JSON Layout: Outputs a structured JSON file (a sketch of this flow follows the list) containing:
  - Page numbers
  - Extracted text
  - Paths to extracted images and tables
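For orientation, here is a minimal sketch of the flow described above, assuming PyMuPDF. The function name, output paths, and JSON field names are illustrative, not the base script's actual interface:

```python
import json
import os

import fitz  # PyMuPDF (pip install pymupdf)

def extract_pdf(pdf_path: str, out_dir: str = "output") -> list[dict]:
    """Extract text, images, and table crops from a PDF into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    doc = fitz.open(pdf_path)
    layout = []
    for page_num, page in enumerate(doc, start=1):
        entry = {"page": page_num, "text": page.get_text(), "images": [], "tables": []}
        # Save every embedded image on the page as its own file.
        for i, img in enumerate(page.get_images(full=True)):
            info = doc.extract_image(img[0])  # img[0] is the image xref
            img_path = os.path.join(out_dir, f"page{page_num}_img{i}.{info['ext']}")
            with open(img_path, "wb") as f:
                f.write(info["image"])
            entry["images"].append(img_path)
        # Render each detected table region to PNG (find_tables needs PyMuPDF >= 1.23).
        for i, table in enumerate(page.find_tables().tables):
            tbl_path = os.path.join(out_dir, f"page{page_num}_table{i}.png")
            page.get_pixmap(clip=fitz.Rect(table.bbox)).save(tbl_path)
            entry["tables"].append(tbl_path)
        layout.append(entry)
    # Write the page-level layout alongside the extracted assets.
    with open(os.path.join(out_dir, "layout.json"), "w", encoding="utf-8") as f:
        json.dump(layout, f, ensure_ascii=False, indent=2)
    return layout
```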
To set up the environment, follow these steps:
```bash
git clone https://github.com/for-ai/multimodal-exams.git
cd multimodal-exams
conda env create -f environment.yml
conda activate mm-exams
```
This is a preliminary version with the following known limitations:
- Dividing Graphs into Subgraphs: Graphs and charts are sometimes split into separate, smaller image fragments. Clustering is used to regroup them (see the sketch after this list), but the clustering algorithm could be improved.
- Reading Nonexistent Images: The script may detect and attempt to read images that don't exist, leading to black images being stored (a possible filter is sketched after this list).
- Poor Equation Parsing: Mathematical equations are not accurately parsed or extracted; a new version addressing this is being tested.
- Moderate Dependence on LLMs: The current script relies on a large language model (LLM) to parse the JSON structure. While placeholders are included for the images, some text, particularly in tables, may be repeated, leaving it to the LLM to resolve any inconsistencies.
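The clustering step is a good place to contribute. As a hedged illustration of the idea (not the script's actual algorithm), nearby image bounding boxes can be merged greedily before cropping, so fragments of one chart land in a single image:

```python
import fitz  # PyMuPDF

def merge_nearby_rects(rects: list[fitz.Rect], gap: float = 15.0) -> list[fitz.Rect]:
    """Greedily merge bounding boxes that lie within `gap` points of each other."""
    merged: list[fitz.Rect] = []
    for rect in rects:
        # Grow the candidate by the gap so "close" counts as intersecting.
        grown = fitz.Rect(rect.x0 - gap, rect.y0 - gap, rect.x1 + gap, rect.y1 + gap)
        hit = next((m for m in merged if m.intersects(grown)), None)
        if hit is not None:
            hit.include_rect(rect)  # expand the existing cluster in place
        else:
            merged.append(fitz.Rect(rect))  # start a new cluster
    return merged
```

A single greedy pass can miss a fragment that bridges two earlier clusters; iterating until no rectangles merge would be more robust.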
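For the nonexistent-image limitation, one possible guard (an assumption, not something the current script does) is to decode each extracted image and skip near-black results before writing them to disk; the threshold would need tuning per source PDF:

```python
import io

import numpy as np
from PIL import Image

def is_mostly_black(image_bytes: bytes, threshold: float = 8.0) -> bool:
    """Return True if the decoded image's mean grayscale intensity is near zero."""
    img = Image.open(io.BytesIO(image_bytes)).convert("L")  # "L" = 8-bit grayscale
    return float(np.asarray(img).mean()) < threshold
```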
If you’re exploring PDF parsing, consider looking into:
- PyMuPDF (pymupdf): A powerful library for PDF content extraction, including text, images, and annotations.
- PDFMiner.six: Another versatile library for text extraction and layout analysis from PDFs.
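Both expose simple high-level entry points for text; a quick taste of each, with the file name as a placeholder:

```python
import fitz  # PyMuPDF (pip install pymupdf)
from pdfminer.high_level import extract_text  # pip install pdfminer.six

doc = fitz.open("exam.pdf")      # "exam.pdf" is a placeholder file name
print(doc[0].get_text())         # PyMuPDF: text of the first page

print(extract_text("exam.pdf"))  # pdfminer.six: text of the whole document
```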