


Eemayas/Data-Extraction-PDFs


PDF Data Extraction and Comparison Tool

This project provides a set of tools for extracting data from PDF files, visualizing text locations, and comparing the extracted data with ground truth data stored in CSV files. It calculates errors using Mean Absolute Error (MAE) and provides accuracy metrics for different fields.




Features

  • Data Extraction: Extracts data from PDF files using the pymupdf library.
  • Visualization: Visualizes text locations from PDF files to assist in data verification.
  • Comparison: Compares extracted data with ground truth data stored in CSV files.
  • Error Calculation: Computes the Mean Absolute Error (MAE) between the extracted and ground truth data.
  • Accuracy Metrics: Provides accuracy percentages for specific fields and coordinates.
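
As a rough illustration of the extraction feature, the sketch below collects word bounding boxes with pymupdf. It is not the notebook's actual code: the helper names `words_to_rows` and `extract_rows` are hypothetical, and mapping pymupdf's `x1`/`y1` onto the `x2`/`y2` columns used in the CSV files is an assumption.

```python
from typing import Dict, Iterable, List


def words_to_rows(words: Iterable[tuple]) -> List[Dict]:
    """Convert PyMuPDF word tuples into dict rows.

    Each tuple returned by page.get_text("words") starts with
    (x0, y0, x1, y1, text, ...). Storing x1/y1 under the names
    x2/y2 is an assumption made to match the CSV columns.
    """
    rows = []
    for x0, y0, x1, y1, text, *_ in words:
        rows.append({"x0": x0, "y0": y0, "x2": x1, "y2": y1, "value": text})
    return rows


def extract_rows(pdf_path: str) -> List[Dict]:
    """Hypothetical helper: collect word boxes from every page of a PDF."""
    import fitz  # PyMuPDF; imported lazily so words_to_rows works without it

    rows = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            rows.extend(words_to_rows(page.get_text("words")))
    return rows
```

The resulting rows can be passed to `pandas.DataFrame(rows)` and written to CSV for comparison with the ground truth files.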

Repository Structure

└── Data-Extraction-PDFs/
    ├── Data Extraction
    │   ├── DE-TEST.csv
    │   ├── DE-TRAIN.csv
    │   ├── Test Files
    │   └── Train Files
    ├── Final v1.ipynb
    ├── Final v2.ipynb
    ├── Final v3.ipynb
    ├── README.md
    ├── requirement.txt
    ├── test.csv
    ├── text.csv
    └── train.csv

Installation

Prerequisites

  • Python 3.x
  • Required Python libraries:
    • pandas
    • pymupdf
    • pillow
    • scikit-learn

Installation Steps

  1. Clone the repository:

    git clone https://github.com/Eemayas/Data-Extraction-PDFs.git
    cd Data-Extraction-PDFs
  2. Install the required dependencies:

    pip install -r requirement.txt
  3. Place your PDF files and CSV files in the ./Data Extraction/ directory.

  4. Run all the cells in Final v3.ipynb.
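
Before running the notebook, it can help to confirm that the prerequisites are importable. The snippet below is a minimal sketch of such a check; the function name `missing_packages` is hypothetical. Note that the import names differ from the pip package names for three of the libraries.

```python
from importlib.util import find_spec

# pip package name -> module name used in import statements
REQUIRED = {
    "pandas": "pandas",
    "pymupdf": "fitz",
    "pillow": "PIL",
    "scikit-learn": "sklearn",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return sorted(pkg for pkg, module in required.items() if find_spec(module) is None)


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing: " + ", ".join(missing))
    else:
        print("All prerequisites found.")
```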


Usage

  1. Load CSV Files: The script loads the ground truth CSV (DE-TRAIN.csv) and predicted CSV (train.csv) files into pandas DataFrames.

  2. Identify Missing Rows: The script identifies and displays any missing rows in the ground truth CSV compared to the predicted CSV.

  3. Calculate MAE and Accuracy: The script calculates the Mean Absolute Error (MAE) for specific fields (e.g., x0, y0, x2, y2) and prints the accuracy percentage for each.

  4. Compare Specific Fields: The script compares specific fields (e.g., value, label) between the two CSV files and calculates accuracy based on equality.
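
The row and field comparisons in steps 2 and 4 can be sketched as follows. This is an illustration under stated assumptions, not the notebook's code: it uses plain lists of dicts (as produced by `csv.DictReader`) instead of pandas DataFrames so that it runs with the standard library alone, it treats "missing" as rows present in one file with no exact match in the other, and it assumes the two files are aligned row by row when comparing a field. All helper names are hypothetical.

```python
import csv
from typing import Dict, List, Sequence

Row = Dict[str, str]


def load_rows(path: str) -> List[Row]:
    """Read a CSV file into a list of dicts keyed by column name."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def rows_not_in(rows: Sequence[Row], other: Sequence[Row], keys: Sequence[str]) -> List[Row]:
    """Return the rows of `rows` that have no exact match in `other` on `keys`."""
    seen = {tuple(r[k] for k in keys) for r in other}
    return [r for r in rows if tuple(r[k] for k in keys) not in seen]


def field_accuracy(true_rows: Sequence[Row], pred_rows: Sequence[Row], field: str) -> float:
    """Percentage of positionally aligned rows whose `field` values are equal."""
    pairs = list(zip(true_rows, pred_rows))
    if not pairs:
        return 0.0
    matches = sum(t[field] == p[field] for t, p in pairs)
    return 100.0 * matches / len(pairs)
```

For example, `rows_not_in(load_rows("Data Extraction/DE-TRAIN.csv"), load_rows("train.csv"), ["x0", "y0", "x2", "y2", "value", "label"])` would list ground truth rows that have no identical predicted row, assuming both files contain those columns.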


Example Output

Here's an example of the output generated by the script:

Missing Rows in df_true compared to df_pred:
   x0   y0   x2   y2  value  label
0  10  100  150  200  Text1  Label1

Accuracy of x0 = 98.45%
Accuracy of y0 = 97.82%
Accuracy of x2 = 99.12%
Accuracy of y2 = 98.76%

Accuracy of field 'value': 95.00%
Accuracy of field 'label': 96.50%
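
This README does not spell out how the coordinate accuracy percentages are derived from the MAE. The sketch below computes the MAE by hand and then applies one possible normalization, `100 * (1 - MAE / mean(|true|))`. That formula is an assumption chosen for illustration and may differ from what the notebook does; the function names are hypothetical.

```python
from typing import Sequence


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average of |true - predicted| over all paired values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy_from_mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """ASSUMED formula: 100 * (1 - MAE / mean(|y_true|)), clamped at 0."""
    mae = mean_absolute_error(y_true, y_pred)
    scale = sum(abs(t) for t in y_true) / len(y_true)
    if scale == 0:
        # All true values are zero: only a perfect prediction counts as accurate.
        return 100.0 if mae == 0 else 0.0
    return max(0.0, 100.0 * (1.0 - mae / scale))


if __name__ == "__main__":
    x0_true = [10.0, 20.0, 30.0]
    x0_pred = [12.0, 18.0, 30.0]
    print(f"Accuracy of x0 = {accuracy_from_mae(x0_true, x0_pred):.2f}%")
```

With scikit-learn installed, `sklearn.metrics.mean_absolute_error` can replace the hand-written MAE function.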

Contributing

Contributions are welcome! If you'd like to contribute, please fork the repository, create a new branch, and submit a pull request.

Steps to Contribute

  1. Fork the repository
  2. Create a new branch (git checkout -b feature-branch)
  3. Make your changes
  4. Commit your changes (git commit -m 'Add some feature')
  5. Push to the branch (git push origin feature-branch)
  6. Open a pull request

License

This project is licensed under the MIT License. See the LICENSE file for more details.

