PDF Data Extraction and Comparison Tool

This project provides a set of tools for extracting data from PDF files, visualizing text locations, and comparing the extracted data with ground truth data stored in CSV files. It calculates errors using Mean Absolute Error (MAE) and provides accuracy metrics for different fields.

Features

Data Extraction: Extracts data from PDF files using the pymupdf library.
Visualization: Visualizes text locations from PDF files to assist in data verification.
Comparison: Compares extracted data with ground truth data stored in CSV files.
Error Calculation: Computes the Mean Absolute Error (MAE) between the extracted and ground truth data.
Accuracy Metrics: Provides accuracy percentages for specific fields and coordinates.

Repository Structure

└── Data-Extraction-PDFs/
    ├── Data Extraction
    │   ├── DE-TEST.csv
    │   ├── DE-TRAIN.csv
    │   ├── Test Files
    │   └── Train Files
    ├── Final v1.ipynb
    ├── Final v2.ipynb
    ├── Final v3.ipynb
    ├── README.md
    ├── requirement.txt
    ├── test.csv
    ├── text.csv
    └── train.csv

Installation

Prerequisites

Python 3.x
Required Python libraries:
- pandas
- pymupdf
- pillow
- scikit-learn

Installation Steps

Clone the repository:

git clone https://github.com/Eemayas/Data-Extraction-PDFs.git
cd Data-Extraction-PDFs

Install the required dependencies:
```
pip install -r requirement.txt
```
Place your PDF files and CSV files in the ./Data Extraction/ directory.
Run all the cells in Final v3.ipynb.

Usage

Load CSV Files: The script loads the ground truth CSV (DE-TRAIN.csv) and predicted CSV (train.csv) files into pandas DataFrames.
Identify Missing Rows: The script identifies and displays the rows of the ground truth CSV that have no matching row in the predicted CSV.
Calculate MAE and Accuracy: The script calculates the Mean Absolute Error (MAE) for specific fields (e.g., x0, y0, x2, y2) and prints the accuracy percentage for each.
Compare Specific Fields: The script compares specific fields (e.g., value, label) between the two CSV files and calculates accuracy based on equality.
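
The comparison steps above can be sketched with pandas and scikit-learn as follows. This is a minimal sketch, not the notebook's code: the helper names are hypothetical, the two small DataFrames stand in for DE-TRAIN.csv and train.csv, and it assumes both tables have the same number of rows in the same order. The README does not state how the MAE is turned into an accuracy percentage, so the sketch reports the raw MAE for the coordinate columns.

```python
import pandas as pd
from sklearn.metrics import mean_absolute_error

COORD_COLS = ["x0", "y0", "x2", "y2"]
TEXT_COLS = ["value", "label"]


def missing_rows(df_true, df_pred):
    """Rows of df_true that have no exact match in df_pred."""
    merged = df_true.merge(df_pred.drop_duplicates(), how="left", indicator=True)
    return merged.loc[merged["_merge"] == "left_only"].drop(columns="_merge")


def coordinate_mae(df_true, df_pred):
    """MAE per coordinate column, assuming rows are aligned by position."""
    return {c: mean_absolute_error(df_true[c], df_pred[c]) for c in COORD_COLS}


def field_accuracy(df_true, df_pred):
    """Percentage of rows whose text fields are exactly equal, by position."""
    return {
        c: float((df_true[c].to_numpy() == df_pred[c].to_numpy()).mean() * 100)
        for c in TEXT_COLS
    }


# Stand-ins for pd.read_csv("Data Extraction/DE-TRAIN.csv") and pd.read_csv("train.csv").
df_true = pd.DataFrame({
    "x0": [10, 20, 30], "y0": [100, 110, 120],
    "x2": [150, 160, 170], "y2": [200, 210, 220],
    "value": ["a", "b", "c"], "label": ["L1", "L2", "L3"],
})
df_pred = pd.DataFrame({
    "x0": [10, 22, 29], "y0": [100, 110, 120],
    "x2": [150, 160, 170], "y2": [200, 210, 220],
    "value": ["a", "b", "x"], "label": ["L1", "L2", "L3"],
})

missing = missing_rows(df_true, df_pred)
mae = coordinate_mae(df_true, df_pred)
accuracy = field_accuracy(df_true, df_pred)

print("Missing rows in df_true compared to df_pred:")
print(missing)
for col, err in mae.items():
    print(f"MAE of {col} = {err:.2f}")
for col, acc in accuracy.items():
    print(f"Accuracy of field '{col}': {acc:.2f}%")
```

Matching rows by exact equality on every shared column is strict: a single off-by-one coordinate marks the whole row as missing. If the extracted coordinates are floats, rounding both tables before the merge may be a more forgiving choice.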

Example Output

Here's an example of the output generated by the script:

Missing Rows in df_true compared to df_pred:
   x0   y0   x2   y2  value  label
0  10  100  150  200  Text1  Label1

Accuracy of x0 = 98.45%
Accuracy of y0 = 97.82%
Accuracy of x2 = 99.12%
Accuracy of y2 = 98.76%

Accuracy of field 'value': 95.00%
Accuracy of field 'label': 96.50%

Contributing

Contributions are welcome! If you'd like to contribute, please fork the repository, create a new branch, and submit a pull request.

Steps to Contribute

Fork the repository
Create a new branch (git checkout -b feature-branch)
Make your changes
Commit your changes (git commit -m 'Add some feature')
Push to the branch (git push origin feature-branch)
Open a pull request

License

This project is licensed under the MIT License. See the LICENSE file for more details.
