This tool returns the reading order of a PDF.
Create venv:
make install_venv
Get the reading order of a PDF:
source venv/bin/activate
python src/predict.py /path/to/pdf
Get the labeled data tool from the GitHub repository:
https://github.com/huridocs/pdf-labeled-data
Change the paths in src/config.py:
LABELED_DATA_ROOT_PATH = "/path/to/pdf-labeled-data/project"
TRAINED_MODEL_PATH = "/path/to/save/trained/model"
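Before training, it can help to confirm that both configured locations are usable. The following is a minimal sketch, not part of this repository: `check_paths` is a hypothetical helper, and it assumes that the labeled data directory must already exist and that the folder that will contain the trained model must exist too.

```python
from pathlib import Path
from typing import List


def check_paths(labeled_data_root: str, trained_model_path: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means no problem was found."""
    problems = []
    # Assumption: the labeled data must already be present on disk.
    if not Path(labeled_data_root).is_dir():
        problems.append(f"Labeled data directory not found: {labeled_data_root}")
    # The model itself may not exist yet, so only its parent folder is checked.
    model_parent = Path(trained_model_path).parent
    if not model_parent.is_dir():
        problems.append(f"Model output directory not found: {model_parent}")
    return problems


if __name__ == "__main__":
    # Placeholder values copied from the text above; replace them with real paths.
    for problem in check_paths("/path/to/pdf-labeled-data/project", "/path/to/save/trained/model"):
        print(problem)
```

The helper takes the two paths as arguments rather than importing src/config.py, so it makes no assumption about how that file is organized.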
Create venv:
make install_venv
Train a new model:
source venv/bin/activate
python src/create_candidate_finder_model.py
python src/create_reading_order_model.py
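Get the reading order using a specific model:
python src/predict.py /path/to/pdf --model-path /path/to/model
Get the reading order and extract figures and tables:
python src/predict.py /path/to/pdf --extract-figures-and-tables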
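To process several PDFs, the commands above can be wrapped in a small script. This is a sketch under stated assumptions: `build_predict_command` and `predict_folder` are hypothetical helpers, the script is assumed to run from the repository root with the virtual environment active, and it assumes that src/predict.py writes its result to standard output. Whether the two flags can be combined in one call is not stated here, so treat that as untested.

```python
import subprocess
import sys
from pathlib import Path
from typing import Dict, List


def build_predict_command(pdf_path, model_path=None, extract_figures_and_tables=False) -> List[str]:
    """Build the argument list for one call to src/predict.py."""
    command = [sys.executable, "src/predict.py", str(pdf_path)]
    if model_path is not None:
        command += ["--model-path", str(model_path)]
    if extract_figures_and_tables:
        command.append("--extract-figures-and-tables")
    return command


def predict_folder(folder, **options) -> Dict[str, str]:
    """Run the prediction for every *.pdf in folder and return {file name: captured stdout}."""
    results = {}
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # check=True raises CalledProcessError if predict.py exits with an error.
        completed = subprocess.run(
            build_predict_command(pdf, **options),
            capture_output=True,
            text=True,
            check=True,
        )
        results[pdf.name] = completed.stdout
    return results
```

For example, `predict_folder("/path/to/pdfs", model_path="/path/to/model")` would call the script once per PDF with the `--model-path` option.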