A tool for extracting structured materials science data from tables using Large Language Models.
MatSciTableExtract supports four input methods:
- Image-based extraction (directly processing table images)
- OCR-based extraction (extracting text from images before processing)
- Structured format extraction (parsing tables in CSV format)
- Structured format with captions (contextual captions are added to the input)
The system extracts key materials science information, including:
- Matrix name
- Filler name
- Composition information (amount and type)
- Particle surface treatment (PST) name
- Material properties (values, units, and conditions)
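As a rough illustration of the fields listed above, one extracted sample could be modelled as shown below. This is only a sketch: the class names, field names, and example values are hypothetical and are not taken from the project's actual JSON schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class PropertyRecord:
    # Hypothetical container for one material property.
    name: str
    value: Optional[float] = None
    unit: Optional[str] = None
    conditions: dict = field(default_factory=dict)


@dataclass
class SampleRecord:
    # Hypothetical container for one sample (one table row or column).
    matrix_name: str
    filler_name: Optional[str] = None
    composition_amount: Optional[float] = None
    composition_type: Optional[str] = None  # e.g. "wt%" or "vol%"
    pst_name: Optional[str] = None
    properties: List[PropertyRecord] = field(default_factory=list)


# Made-up values, for illustration only.
sample = SampleRecord(
    matrix_name="epoxy",
    filler_name="silica",
    composition_amount=5.0,
    composition_type="wt%",
    properties=[
        PropertyRecord("glass transition temperature", 120.0, "C", {"method": "DSC"})
    ],
)

# asdict() converts nested dataclasses to plain dicts, ready for JSON output.
print(json.dumps(asdict(sample), indent=2))
```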
To process images of tables, run:
python code/imagesasinput.py
This will:
- Process PNG images from the tables directory of each article in the data folder.
- Generate JSON output in data/imageoutput3.
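The directory walk described here can be sketched as follows. The layout `data/<article>/tables/*.png` and the output file naming are assumptions made for illustration; the script itself may organize files differently.

```python
from pathlib import Path
from typing import List


def find_table_images(data_dir: str) -> List[Path]:
    # Assumed layout: <data_dir>/<article>/tables/<table>.png
    return sorted(Path(data_dir).glob("*/tables/*.png"))


def output_path_for(image_path: Path, output_dir: str) -> Path:
    # Hypothetical naming scheme: <article>_<table>.json
    article = image_path.parent.parent.name
    return Path(output_dir) / f"{article}_{image_path.stem}.json"


if __name__ == "__main__":
    for image in find_table_images("data"):
        print(image, "->", output_path_for(image, "data/imageoutput3"))
```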
To process tables stored as CSV files, run:
python code/StructuredFormatasInput.py
To process tables after applying OCR, run:
python code/OCRasInput.py
MatSciTableExtract includes multiple evaluation methods to assess extraction accuracy.
- Property name evaluation:
python code/F1score-properties_with_missing.py
- Detailed property evaluation (including all parameters):
python code/F1score-properties_with_missing_including_prop_details.py
- Property evaluation without missing samples:
python code/F1score-propertieswtmissing.py
- Stage 1: Property Name Matching
  - Uses Levenshtein distance (0.6 threshold) to handle variations in property naming.
  - Computes initial F1 scores based on property names.
- Stage 2: Detailed Property Matching
  - Evaluates values, units, and conditions.
  - Scores based on exact matches for values and units.
  - Normalized scoring between 0 and 1 for complex condition matching.
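The two stages above can be sketched in a few lines of Python. This is a minimal illustration, not the project's evaluation code. It assumes that the 0.6 threshold applies to a normalized Levenshtein similarity (1 minus distance divided by the longer string length), that names are matched greedily one-to-one, and that the detailed score is the plain average of the value, unit, and condition scores. All function names are hypothetical.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, keeping one row at a time.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(
                min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost)
            )
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    # Assumption: similarity = 1 - distance / longer length, case-insensitive.
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def match_names(
    predicted: List[str], gold: List[str], threshold: float = 0.6
) -> List[Tuple[str, str]]:
    # Stage 1: greedy one-to-one matching of predicted to gold property names.
    unmatched = list(gold)
    pairs = []
    for name in predicted:
        best = max(unmatched, key=lambda g: similarity(name, g), default=None)
        if best is not None and similarity(name, best) >= threshold:
            pairs.append((name, best))
            unmatched.remove(best)
    return pairs


def f1_score(n_matched: int, n_predicted: int, n_gold: int) -> float:
    if n_matched == 0:
        return 0.0
    precision = n_matched / n_predicted
    recall = n_matched / n_gold
    return 2 * precision * recall / (precision + recall)


def condition_score(pred: Dict, gold: Dict) -> float:
    # Fraction of gold conditions reproduced exactly, so the score is in [0, 1].
    if not gold:
        return 1.0 if not pred else 0.0
    hits = sum(1 for key, value in gold.items() if pred.get(key) == value)
    return hits / len(gold)


def detail_score(pred: Dict, gold: Dict) -> float:
    # Stage 2: exact match on value and unit, normalized score on conditions.
    value_ok = float(pred.get("value") == gold.get("value"))
    unit_ok = float(pred.get("unit") == gold.get("unit"))
    cond = condition_score(pred.get("conditions", {}), gold.get("conditions", {}))
    return (value_ok + unit_ok + cond) / 3  # equal weighting is an assumption


if __name__ == "__main__":
    predicted = ["Tensile strength", "Youngs modulus", "Density"]
    gold = ["tensile strength", "Young's modulus", "glass transition temperature"]
    pairs = match_names(predicted, gold)
    print(pairs)
    print(f1_score(len(pairs), len(predicted), len(gold)))
```

With these made-up names, two of the three predictions find a gold match, so precision and recall are both 2/3.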
- Evaluate composition with missing samples:
python code/Accuracy-composition_with_missing.py
- Evaluate composition without missing samples:
python code/Accuracy-composition_wt_missing.py
- Flexible string matching:
  - Sub-string comparison for PST, filler, and matrix names.
  - Case-insensitive comparisons.
- Partial accuracy scoring for composition fields.
- Special handling for:
  - Numeric values and percentages.
  - Control samples (0% composition).
  - Missing or invalid samples.
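The matching rules above could look roughly like the following. This is a sketch under stated assumptions, not the scripts' actual logic: the dictionary keys (`matrix`, `filler`, `pst`, `amount`) are hypothetical, a missing prediction scores 0, and for control samples (gold amount of 0) only the matrix name and the amount are scored.

```python
import re
from typing import Optional


def names_match(pred: Optional[str], gold: Optional[str]) -> bool:
    # Case-insensitive sub-string comparison in either direction.
    if pred is None or gold is None:
        return pred is None and gold is None
    p, g = pred.strip().lower(), gold.strip().lower()
    if not p or not g:
        return p == g
    return p in g or g in p


def parse_amount(raw) -> Optional[float]:
    # Accepts numbers or strings such as "5 wt%" or "0%".
    if raw is None:
        return None
    if isinstance(raw, (int, float)):
        return float(raw)
    match = re.search(r"-?\d+(?:\.\d+)?", str(raw))
    return float(match.group()) if match else None


def composition_accuracy(pred: Optional[dict], gold: dict) -> float:
    # Partial accuracy: fraction of scored fields that agree.
    if pred is None:  # missing sample (assumed to score 0)
        return 0.0
    gold_amount = parse_amount(gold.get("amount"))
    # Assumption: a control sample has no filler or PST to compare.
    name_fields = ["matrix"] if gold_amount == 0 else ["matrix", "filler", "pst"]
    checks = [names_match(pred.get(f), gold.get(f)) for f in name_fields]
    pred_amount = parse_amount(pred.get("amount"))
    if pred_amount is None or gold_amount is None:
        checks.append(pred_amount is gold_amount)
    else:
        checks.append(abs(pred_amount - gold_amount) < 1e-9)
    return sum(checks) / len(checks)


if __name__ == "__main__":
    # Made-up example: matrix, filler, and amount agree; PST is missing.
    pred = {"matrix": "Epoxy resin", "filler": "SiO2", "pst": None, "amount": "5 wt%"}
    gold = {"matrix": "epoxy", "filler": "sio2 nanoparticles", "pst": "silane", "amount": 5.0}
    print(composition_accuracy(pred, gold))
```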
- Composition Accuracy:
  - Overall accuracy: 0.910 (matrix name, filler name, composition, and PST).
- Property Extraction:
  - Basic F1 score: 0.863 (property names only).
  - Detailed F1 score: 0.419 to 0.769 (including values, units, and conditions).
If you use this work, please cite:
@article{circi2024well,
title={How Well Do Large Language Models Understand Tables in Materials Science?},
author={Circi, Defne and Khalighinejad, Ghazal and Chen, Anlan and Dhingra, Bhuwan and Brinson, L. Catherine},
journal={Integrating Materials and Manufacturing Innovation},
volume={13},
pages={669--687},
year={2024},
publisher={Springer}
}
This project is licensed under the MIT License.