This project is no longer actively maintained. We are focusing our efforts on developing a new and improved version, which can be found in the following repository:
https://github.com/huridocs/pdf-document-layout-analysis
We encourage you to check out the new project, as it offers enhanced features, better performance, and an updated codebase.
Thank you for your understanding and continued support!
PDF Paragraphs Extraction
This service provides one endpoint to get paragraphs from PDFs. Each paragraph includes its page number, its position on the page, its size, and its text. An asynchronous flow using Redis message queues is also available.
Start the service:
make start
Get the paragraphs from a PDF:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5051
To stop the server:
make stop
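The same upload can be done from Python. The following is a minimal sketch using only the standard library, assuming the service is running on localhost:5051 as in the curl command above. The function names `build_multipart_body` and `extract_paragraphs` are hypothetical helpers, not part of this project, and the response is returned as raw bytes because its exact format is not described here.

```python
import urllib.request
import uuid
from pathlib import Path


def build_multipart_body(field_name: str, filename: str, content: bytes, boundary: str) -> bytes:
    """Encode a single file as a multipart/form-data body."""
    header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    footer = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return header + content + footer


def extract_paragraphs(pdf_path: str, url: str = "http://localhost:5051") -> bytes:
    """POST the PDF in the 'file' form field and return the raw response body."""
    path = Path(pdf_path)
    boundary = uuid.uuid4().hex
    body = build_multipart_body("file", path.name, path.read_bytes(), boundary)
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

Calling `extract_paragraphs("/PATH/TO/PDF/pdf_name.pdf")` sends the same request as the curl command.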
- Quick Start
- Dependencies
- Requirements
- Docker containers
- How to use it asynchronously
- HTTP server
- Queue processor
- Service configuration
- Set up environment for development
- Train the paragraph extraction model
- Execute tests
- Troubleshooting
- Docker Desktop 4.25.0 install link
- 2 GB of RAM
- Single core
A Redis server is needed to use the service asynchronously. For that purpose, the command make start:testing can be used, which starts the service with a built-in Redis server.
Containers with make start
Containers with make start:testing
- Send PDF to extract
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5051/async_extraction/[tenant_name]
- Add extraction task
To add an extraction task, a message should be sent to a queue.
Python code:
from rsmq import RedisSMQ
queue = RedisSMQ(host=[redis host], port=[redis port], qname='segmentation_tasks', quiet=True)
message_json = '{"tenant": "tenant_name", "task": "segmentation", "params": {"filename": "pdf_file_name.pdf"}}'
message = queue.sendMessage(message_json).exceptions(False).execute()
- Get paragraphs
When the segmentation task is done, a message is placed in the results queue:
queue = RedisSMQ(host=[redis host], port=[redis port], qname='segmentation_results', quiet=True)
results_message = queue.receiveMessage().exceptions(False).execute()
# results_message["message"] contains the following information:
# {"tenant": "tenant_name",
# "task": "pdf_name.pdf",
# "success": true,
# "error_message": "",
# "data_url": "http://localhost:5051/get_paragraphs/[tenant_name]/[pdf_name]",
# "file_url": "http://localhost:5051/get_xml/[tenant_name]/[pdf_name]"
# }
curl -X GET http://localhost:5051/get_paragraphs/[tenant_name]/[pdf_name]
curl -X GET http://localhost:5051/get_xml/[tenant_name]/[pdf_name]
or in Python:
requests.get(results_message.data_url)
requests.get(results_message.file_url)
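The message handling in the steps above can be sketched with two small helpers. This is an illustration only: `build_task_message` and `parse_result_message` are hypothetical names, and the sketch assumes the result payload has exactly the fields listed in the comment above and may arrive either as a JSON string or as an already decoded dict.

```python
import json
from typing import Tuple, Union


def build_task_message(tenant: str, filename: str) -> str:
    """Build the JSON payload expected on the 'segmentation_tasks' queue."""
    return json.dumps(
        {"tenant": tenant, "task": "segmentation", "params": {"filename": filename}}
    )


def parse_result_message(raw: Union[str, dict]) -> Tuple[str, str]:
    """Return (data_url, file_url) from a results payload, or raise on failure."""
    result = json.loads(raw) if isinstance(raw, str) else raw
    if not result.get("success"):
        # error_message is empty on success, so fall back to a generic text
        raise RuntimeError(result.get("error_message") or "segmentation failed")
    return result["data_url"], result["file_url"]
```

The string returned by `build_task_message` can be sent to the task queue, and the two URLs returned by `parse_result_message` can then be fetched with any HTTP client.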
The container HTTP server is coded using Python 3.9 and uses the FastAPI web framework.
If the service is running, the endpoint definitions can be found at the following URL:
http://localhost:5051/docs
The endpoint code can be found in the file app.py.
Errors are reported to the file docker_volume/service.log, unless the configuration is changed (see Get service logs).
The container Queue processor is coded using Python 3.9, and it is in charge of the communication with Redis.
The code can be found in the file QueueProcessor.py, and it uses the library RedisSMQ to interact with the Redis queues.
Some parameters can be configured using environment variables. If a parameter is not provided, its default value is used.
Default parameters:
REDIS_HOST=redis_paragraphs
REDIS_PORT=6379
MONGO_HOST=mongo_paragraphs
MONGO_PORT=28017
SERVICE_HOST=http://127.0.0.1
SERVICE_PORT=5051
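As an illustration of the fallback behaviour, a client or helper script could read the same variables as follows. This is a sketch, not the service's own configuration code: the `ServiceConfig` class and `load_config` function are hypothetical, and only the variable names and default values come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Defaults copied from the list above
DEFAULTS = {
    "REDIS_HOST": "redis_paragraphs",
    "REDIS_PORT": "6379",
    "MONGO_HOST": "mongo_paragraphs",
    "MONGO_PORT": "28017",
    "SERVICE_HOST": "http://127.0.0.1",
    "SERVICE_PORT": "5051",
}


@dataclass(frozen=True)
class ServiceConfig:
    redis_host: str
    redis_port: int
    mongo_host: str
    mongo_port: int
    service_host: str
    service_port: int

    @property
    def service_url(self) -> str:
        return f"{self.service_host}:{self.service_port}"


def load_config(env: Mapping[str, str] = os.environ) -> ServiceConfig:
    """Read each variable from env, falling back to the documented default."""
    values = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return ServiceConfig(
        redis_host=values["REDIS_HOST"],
        redis_port=int(values["REDIS_PORT"]),
        mongo_host=values["MONGO_HOST"],
        mongo_port=int(values["MONGO_PORT"]),
        service_host=values["SERVICE_HOST"],
        service_port=int(values["SERVICE_PORT"]),
    )
```

Passing a plain dict instead of `os.environ` makes it easy to check an override without touching the real environment.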
It works with Python 3.9 [install](https://runnable.com/docker/getting-started/)
make install_venv
NOTE: The model training was only tested using Python 3.11
Get the labeled data
git clone https://github.com/huridocs/pdf-labeled-data.git
Place the pdf-labeled-data project in the same folder as this repository
.
├── pdf_paragraphs_extraction
├── pdf-labeled-data
Install the virtual environment and initialize it
make install_venv
source .venv/bin/activate
Create the paragraph extraction model
python src/create_paragraph_extractor_model.py
The trained model is in the following path
model/paragraph_extraction_model.model
make test
Solution: Increase the RAM available to the Docker containers to 3 GB or 4 GB