huridocs/pdf-text-extraction

PDF Text Extraction

A Docker-powered service for extracting text from PDF documents

This project aims to extract text from PDF files using the outputs generated by the pdf-document-layout-analysis service. By leveraging the segmentation and classification capabilities of the underlying analysis tool, this project automates the process of text extraction from PDF files.

The pdf-document-layout-analysis service is available here:

https://github.com/huridocs/pdf-document-layout-analysis

Quick Start

Start the service:

# With GPU support
make start

# Without GPU support [if you do not have a GPU on your system]
make start_no_gpu

Get the segments from a PDF:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080

Get only the text:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text

To stop the server:

make stop

Dependencies

Docker Desktop 4.25.0 install link

Requirements

4 GB RAM
6 GB GPU memory (if no GPU is available, the service runs on the CPU)

Usage

As mentioned in the Quick Start, you can call the service like this:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080

This returns the analysis results directly from the pdf-document-layout-analysis service. The output is a list of SegmentBox elements, each with this shape:

    {
        "left": Left position of the segment
        "top": Top position of the segment
        "width": Width of the segment
        "height": Height of the segment
        "page_number": Page number which the segment belongs to
        "text": Text inside the segment
        "type": Type of the segment (one of the types listed below)
    }
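As a sketch of how a client might consume this output, the following Python snippet parses a JSON list of SegmentBox elements into typed objects. It assumes the response body is a JSON array whose items carry the fields shown above. The `SegmentBox` dataclass, the `parse_segments` helper, and the sample payload are illustrative and not part of the service.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class SegmentBox:
    """Client-side mirror of the shape described above (hypothetical class)."""
    left: float
    top: float
    width: float
    height: float
    page_number: int
    text: str
    type: str


def parse_segments(payload: str) -> list[SegmentBox]:
    """Parse a JSON array of segment dictionaries into SegmentBox objects.

    Keys that are not part of the documented shape are ignored, so an
    extra field in the response does not break parsing.
    """
    known = {f.name for f in fields(SegmentBox)}
    return [
        SegmentBox(**{k: v for k, v in item.items() if k in known})
        for item in json.loads(payload)
    ]


# Made-up sample payload for illustration; real values come from the service.
sample = json.dumps([
    {"left": 72.0, "top": 50.0, "width": 400.0, "height": 20.0,
     "page_number": 1, "text": "A title", "type": "Title"},
    {"left": 72.0, "top": 90.0, "width": 400.0, "height": 60.0,
     "page_number": 2, "text": "Body text", "type": "Text"},
])

segments = parse_segments(sample)
first_page_text = [s.text for s in segments if s.page_number == 1]
print(first_page_text)
```

Typed objects make it easier to group segments by page or filter them by type on the client side.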

You can also pass the types of the SegmentBoxes you want to extract:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080 -F "types=text title section_header list_item"

These are the types you can pass:

   "Caption"
   "Footnote"
   "Formula"
   "List_Item"
   "Page_Footer"
   "Page_Header"
   "Picture"
   "Section_Header"
   "Table"
   "Text"
   "Title"
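The `types` field in the command above is a single space-separated string. A small helper, sketched below, can check names against this list before a request is sent. The lowercase form mirrors the curl example; whether the service also accepts the capitalized spellings is not stated here, so the normalization is an assumption, and `build_types_field` is a hypothetical name.

```python
# Lowercase spellings of the types listed above, as used in the curl example.
ALLOWED_TYPES = {
    "caption", "footnote", "formula", "list_item", "page_footer",
    "page_header", "picture", "section_header", "table", "text", "title",
}


def build_types_field(types):
    """Return the space-separated value for the `types` form field.

    Raises ValueError if a name is not one of the listed types.
    """
    normalized = [t.strip().lower() for t in types]
    unknown = [t for t in normalized if t not in ALLOWED_TYPES]
    if unknown:
        raise ValueError(f"Unknown segment types: {unknown}")
    return " ".join(normalized)


print(build_types_field(["Text", "Title", "Section_Header", "List_Item"]))
# prints: text title section_header list_item
```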

If you only want to get the contents in a single string, you can use this command:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text

This returns only the text content. Similarly, you can pass the types of the text you want to extract:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text -F "types=text title section_header list_item"

If you want results faster, at the cost of slightly less accurate output, you can run this command:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/fast

To get only the contents with the fast method:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text_fast

You can pass the types to these endpoints too:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/fast -F "types=text section_header list_item"

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text_fast -F "types=text title"
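The four endpoints and the optional `types` field combine in a regular way. The sketch below captures that by assembling the same curl argument list used in this section. The names `endpoint_path` and `build_curl_command` and their parameters are illustrative; the host defaults to the `localhost:5080` address used above.

```python
import shlex


def endpoint_path(text_only=False, fast=False):
    """Map the two options onto the endpoint paths used in this section."""
    if text_only and fast:
        return "/text_fast"
    if text_only:
        return "/text"
    if fast:
        return "/fast"
    return ""


def build_curl_command(pdf_path, text_only=False, fast=False,
                       types=None, host="localhost:5080"):
    """Build the curl argument list for one request (not executed here)."""
    command = [
        "curl", "-X", "POST",
        "-F", f"file=@{pdf_path}",
        host + endpoint_path(text_only, fast),
    ]
    if types:
        # The types are sent as one space-separated form field.
        command += ["-F", "types=" + " ".join(types)]
    return command


command = build_curl_command(
    "/PATH/TO/PDF/pdf_name.pdf",
    text_only=True,
    fast=True,
    types=["text", "title"],
)
print(shlex.join(command))
```

Because the result is an argument list, it can be passed to `subprocess.run` without going through a shell.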

For more information about models and this fast method, check this link.

To stop the server, run:

make stop
