This project provides a set of tools for extracting data from PDF files, visualizing text locations, and comparing the extracted data with ground truth data stored in CSV files. It calculates errors using Mean Absolute Error (MAE) and provides accuracy metrics for different fields.
- Data Extraction: Extracts data from PDF files using the pymupdf library.
- Visualization: Visualizes text locations from PDF files to assist in data verification.
- Comparison: Compares extracted data with ground truth data stored in CSV files.
- Error Calculation: Computes the Mean Absolute Error (MAE) between the extracted and ground truth data.
- Accuracy Metrics: Provides accuracy percentages for specific fields and coordinates.
└── Data-Extraction-PDFs/
    ├── Data Extraction
    │   ├── DE-TEST.csv
    │   ├── DE-TRAIN.csv
    │   ├── Test Files
    │   └── Train Files
    ├── Final v1.ipynb
    ├── Final v2.ipynb
    ├── Final v3.ipynb
    ├── README.md
    ├── requirement.txt
    ├── test.csv
    ├── text.csv
    └── train.csv
- Python 3.x
- Required Python libraries:
  - pandas
  - pymupdf
  - pillow
  - scikit-learn
- Clone the repository:
  git clone https://github.com/Eemayas/Data-Extraction-PDFs.git
  cd Data-Extraction-PDFs
- Install the required dependencies:
  pip install -r requirements.txt
  (The directory listing above names this file requirement.txt; use that name if pip cannot find requirements.txt.)
- Place your PDF files and CSV files in the ./Data Extraction/ directory.
- Run all the cells in Final v3.ipynb.
- Load CSV Files: The script loads the ground truth CSV (DE-TRAIN.csv) and the predicted CSV (train.csv) into pandas DataFrames.
- Identify Missing Rows: The script identifies and displays any missing rows in the ground truth CSV compared to the predicted CSV.
- Calculate MAE and Accuracy: The script calculates the Mean Absolute Error (MAE) for specific fields (e.g., x0, y0, x2, y2) and prints the accuracy percentage for each.
- Compare Specific Fields: The script compares specific fields (e.g., value, label) between the two CSV files and calculates accuracy based on equality.
Here's an example of the output generated by the script:
Missing Rows in df_true compared to df_pred:
x0 y0 x2 y2 value label
0 10 100 150 200 Text1 Label1
Accuracy of x0 = 98.45%
Accuracy of y0 = 97.82%
Accuracy of x2 = 99.12%
Accuracy of y2 = 98.76%
Accuracy of field 'value': 95.00%
Accuracy of field 'label': 96.50%
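The comparison steps described above can be sketched with pandas as follows. This is a minimal illustration under stated assumptions, not the notebook's code: the sample data is made up, the `compare` helper is hypothetical, rows are matched on the `value` column (assumed unique), and "missing" is read as rows present in the ground truth but absent from the predictions. The README does not say how the coordinate accuracy percentages are derived from the MAE, so the sketch reports the MAE itself.

```python
import io

import pandas as pd

COORDS = ["x0", "y0", "x2", "y2"]

# Made-up stand-ins for DE-TRAIN.csv (ground truth) and train.csv (predictions).
TRUE_CSV = """x0,y0,x2,y2,value,label
10,100,150,200,Text1,Label1
20,110,160,210,Text2,Label2
30,120,170,220,Text3,Label3
"""
PRED_CSV = """x0,y0,x2,y2,value,label
20,112,160,210,Text2,Label2
31,120,170,222,Text3,LabelX
"""


def compare(df_true, df_pred, key="value"):
    """Return (missing keys, MAE per coordinate, label accuracy in percent)."""
    # Assumption: `key` identifies a row uniquely in both files.
    merged = df_true.merge(
        df_pred, on=key, how="left", suffixes=("_true", "_pred"), indicator=True
    )
    # Rows found only in the ground truth have no prediction.
    missing = merged.loc[merged["_merge"] == "left_only", key].tolist()
    matched = merged[merged["_merge"] == "both"]
    # Mean absolute error per coordinate, over matched rows only.
    mae = {
        c: float((matched[f"{c}_true"] - matched[f"{c}_pred"]).abs().mean())
        for c in COORDS
    }
    # Equality-based accuracy for a categorical field.
    label_accuracy = float((matched["label_true"] == matched["label_pred"]).mean() * 100)
    return missing, mae, label_accuracy


df_true = pd.read_csv(io.StringIO(TRUE_CSV))
df_pred = pd.read_csv(io.StringIO(PRED_CSV))
missing, mae, label_accuracy = compare(df_true, df_pred)

print("Missing values:", missing)
for column, error in mae.items():
    print(f"MAE of {column} = {error:.2f}")
print(f"Accuracy of field 'label': {label_accuracy:.2f}%")
```

Matching on a key column keeps the coordinate errors meaningful when a row is missing; comparing by row position instead would misalign every row after the gap.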
Contributions are welcome! If you'd like to contribute, please fork the repository, create a new branch, and submit a pull request.
- Fork the repository
- Create a new branch (git checkout -b feature-branch)
- Make your changes
- Commit your changes (git commit -m 'Add some feature')
- Push to the branch (git push origin feature-branch)
- Open a pull request
This project is licensed under the MIT License. See the LICENSE file for more details.