A Docker-powered service for extracting text from PDF documents
This project extracts text from PDF files using the output of the pdf-document-layout-analysis service. It relies on that tool's segmentation and classification capabilities to automate the extraction.
The pdf-document-layout-analysis service is available here:
https://github.com/huridocs/pdf-document-layout-analysis
Start the service:
# With GPU support
make start
# Without GPU support [if you do not have a GPU on your system]
make start_no_gpu
Get the segments from a PDF:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080
Get only the text:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text
To stop the server:
make stop
- Docker Desktop 4.25.0
- 4 GB RAM
- 6 GB GPU memory (otherwise the service runs on the CPU)
As mentioned in the Quick Start, the simplest way to use the service is:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080
This returns the analysis results from the pdf-document-layout-analysis service directly. The output is a list of SegmentBox elements, each with this shape:
{
"left": Left position of the segment
"top": Top position of the segment
"width": Width of the segment
"height": Height of the segment
"page_number": Page number which the segment belongs to
"text": Text inside the segment
"type": Type of the segment (one of the types listed below)
}
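As a minimal sketch of how a client might consume this output, the Python snippet below loads a response body into a small dataclass that mirrors the fields above and groups the text by page. The `parse_segments` and `text_by_page` helpers and the sample values are hypothetical illustrations, not part of the service. Any keys beyond the seven listed are ignored.

```python
import json
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class SegmentBox:
    """Client-side mirror of the SegmentBox shape described above."""
    left: float
    top: float
    width: float
    height: float
    page_number: int
    text: str
    type: str


def parse_segments(response_body: str) -> List[SegmentBox]:
    """Turn the JSON list returned by the service into SegmentBox objects."""
    names = [f.name for f in fields(SegmentBox)]
    return [
        SegmentBox(**{name: item[name] for name in names})
        for item in json.loads(response_body)
    ]


def text_by_page(segments: List[SegmentBox]) -> Dict[int, List[str]]:
    """Group segment texts by page number, keeping the order received."""
    pages: Dict[int, List[str]] = {}
    for segment in segments:
        pages.setdefault(segment.page_number, []).append(segment.text)
    return pages


# Made-up sample data, shaped like the description above.
SAMPLE_RESPONSE = """[
  {"left": 72.0, "top": 90.5, "width": 450.0, "height": 24.0,
   "page_number": 1, "text": "Annual Report", "type": "Title"},
  {"left": 72.0, "top": 130.0, "width": 450.0, "height": 60.0,
   "page_number": 1, "text": "This report covers 2023.", "type": "Text"},
  {"left": 72.0, "top": 80.0, "width": 450.0, "height": 40.0,
   "page_number": 2, "text": "Second page body.", "type": "Text"}
]"""

segments = parse_segments(SAMPLE_RESPONSE)
pages = text_by_page(segments)
for page_number, texts in pages.items():
    print(page_number, texts)
```

In practice, `response_body` would be the body returned by the curl command above rather than the sample string.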
You can also pass the types of the SegmentBoxes you want to extract:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080 -F "types=text title section_header list_item"
These are the types you can pass:
"Caption"
"Footnote"
"Formula"
"List_Item"
"Page_Footer"
"Page_Header"
"Picture"
"Section_Header"
"Table"
"Text"
"Title"
If you only want the contents as a single string, use this command:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text
This returns only the text content. Similarly, you can pass the types of text you want to extract:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text -F "types=text title section_header list_item"
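The `types` field in the commands above is a single space-separated string. The sketch below shows one way a client could build that string from a list and reject misspelled names before sending a request. `build_types_field` is a hypothetical helper. It lowercases the names because the curl examples use lowercase, and whether the service also accepts the capitalized spellings is not stated here.

```python
# The type names listed above.
SEGMENT_TYPES = (
    "Caption", "Footnote", "Formula", "List_Item", "Page_Footer",
    "Page_Header", "Picture", "Section_Header", "Table", "Text", "Title",
)


def build_types_field(types):
    """Return the space-separated value for the `types` form field.

    Names are lowercased to match the curl examples (an assumption about
    what the service expects). Unknown names raise ValueError.
    """
    known = {name.lower() for name in SEGMENT_TYPES}
    normalized = [name.strip().lower() for name in types]
    unknown = [name for name in normalized if name not in known]
    if unknown:
        raise ValueError("Unknown segment types: " + ", ".join(unknown))
    return " ".join(normalized)


types_value = build_types_field(["Text", "Title", "Section_Header", "List_Item"])
print(types_value)
```

The resulting string can be passed as `-F "types=<value>"` in the curl commands.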
If you want results faster (at the cost of slightly less accurate output), you can run this command:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/fast
To get only the contents with the fast method:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text_fast
You can pass the types to these endpoints too:
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/fast -F "types=text section_header list_item"
curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text_fast -F "types=text title"
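The four endpoints differ only in their path, so a client can select one from two flags. The sketch below, using only the Python standard library, picks the URL and builds the same multipart POST that the curl commands send. `endpoint_url` and `build_request` are hypothetical helpers, the base URL assumes the default `localhost:5080` from the examples, and the `application/pdf` part content type is an assumption.

```python
import urllib.request
import uuid

BASE_URL = "http://localhost:5080"  # default address used in the curl examples

# (text_only, fast) -> path, matching the four endpoints described above.
PATHS = {
    (False, False): "",
    (True, False): "/text",
    (False, True): "/fast",
    (True, True): "/text_fast",
}


def endpoint_url(text_only=False, fast=False, base_url=BASE_URL):
    """Return the endpoint URL for the requested output and speed."""
    return base_url + PATHS[(bool(text_only), bool(fast))]


def build_request(pdf_name, pdf_bytes, types="", text_only=False, fast=False):
    """Build a multipart/form-data POST with a `file` part and optional `types`."""
    boundary = uuid.uuid4().hex
    parts = [
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{pdf_name}"\r\n'
            "Content-Type: application/pdf\r\n\r\n"
        ).encode() + pdf_bytes + b"\r\n"
    ]
    if types:
        parts.append(
            (
                f"--{boundary}\r\n"
                'Content-Disposition: form-data; name="types"\r\n\r\n'
                f"{types}\r\n"
            ).encode()
        )
    parts.append(f"--{boundary}--\r\n".encode())
    return urllib.request.Request(
        endpoint_url(text_only, fast),
        data=b"".join(parts),
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


# Placeholder bytes stand in for a real PDF file here.
request = build_request("pdf_name.pdf", b"%PDF-1.4", types="text title", fast=True)
print(request.full_url)
```

With the service running, `urllib.request.urlopen(request).read()` would send the request and return the response body.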
For more information about models and this fast method, check this link.
To stop the server, run:
make stop