huridocs/pdf-text-extraction

PDF Text Extraction

A Docker-powered service for extracting text from PDF documents


This project aims to extract text from PDF files using the outputs generated by the pdf-document-layout-analysis service. By leveraging the segmentation and classification capabilities of the underlying analysis tool, this project automates the process of text extraction from PDF files.

The pdf-document-layout-analysis service is available here:

https://github.com/huridocs/pdf-document-layout-analysis

Quick Start

Start the service:

# With GPU support
make start

# Without GPU support [if you do not have a GPU on your system]
make start_no_gpu

Get the segments from a PDF:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080

Get only the text:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text

To stop the server:

make stop

Dependencies

Requirements

  • 4 GB RAM
  • 6 GB GPU memory (optional; without a GPU, the service runs on the CPU)

Usage

As mentioned in the Quick Start, the simplest way to use the service is:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080

This returns the analysis results from the pdf-document-layout-analysis service directly. The output is a list of SegmentBox elements, each with this shape:

    {
        "left": Left position of the segment
        "top": Top position of the segment
        "width": Width of the segment
        "height": Height of the segment
        "page_number": Page number which the segment belongs to
        "text": Text inside the segment
        "type": Type of the segment (one of the types listed below)
    }
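
For readers consuming this output from Python, the following is a minimal sketch of how the list could be parsed. It assumes the response body is a JSON array whose objects carry the seven keys shown above; the `SegmentBox` dataclass, `parse_segments`, and `text_by_page` are illustrative names and not part of the service. Any extra keys in the response are ignored.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class SegmentBox:
    # Field names mirror the keys documented above; the numeric types are assumed.
    left: float
    top: float
    width: float
    height: float
    page_number: int
    text: str
    type: str


FIELD_NAMES = [field.name for field in fields(SegmentBox)]


def parse_segments(response_body: str) -> list:
    """Turn a JSON array of segment objects into SegmentBox instances."""
    items = json.loads(response_body)
    # Keep only the documented keys so unexpected extras do not break parsing.
    return [SegmentBox(**{name: item[name] for name in FIELD_NAMES}) for item in items]


def text_by_page(segments: list) -> dict:
    """Join segment texts per page, preserving the order of the input list."""
    pages = {}
    for segment in segments:
        pages.setdefault(segment.page_number, []).append(segment.text)
    return {page: "\n".join(texts) for page, texts in pages.items()}
```

`text_by_page` is only one way to regroup the segments; it does not reorder them, so the result follows whatever order the service returned.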

You can also pass the types of the SegmentBoxes you want to extract:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080 -F "types=text title section_header list_item"

These are the types you can pass:

   "Caption"
   "Footnote"
   "Formula"
   "List_Item"
   "Page_Footer"
   "Page_Header"
   "Picture"
   "Section_Header"
   "Table"
   "Text"
   "Title"

If you only want the contents as a single string, use this command:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text

The /text endpoint returns only the text content. As before, you can pass the types you want to extract:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text -F "types=text title section_header list_item"
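
When the `types` value is assembled programmatically, a small helper can catch misspelled type names before a request is sent. The sketch below copies the eleven type names from the list in this section and emits them lowercase and space-separated, matching the curl examples. Whether the service itself is case-sensitive is not stated here, so the lowercase output is an assumption; `build_types_field` is a hypothetical helper name.

```python
# Type names copied from the list of types in this README.
SEGMENT_TYPES = (
    "Caption", "Footnote", "Formula", "List_Item", "Page_Footer", "Page_Header",
    "Picture", "Section_Header", "Table", "Text", "Title",
)


def build_types_field(wanted):
    """Return a space-separated, lowercase `types` value for the form field.

    Raises ValueError for names that are not in SEGMENT_TYPES.
    Duplicates are dropped and the caller's order is kept.
    """
    known = {name.lower() for name in SEGMENT_TYPES}
    normalized = []
    for name in wanted:
        key = name.strip().lower()
        if key not in known:
            raise ValueError(f"Unknown segment type: {name!r}")
        if key not in normalized:
            normalized.append(key)
    return " ".join(normalized)
```

For example, `build_types_field(["Text", "Title", "Section_Header", "List_Item"])` produces the same string that the curl command above passes in `types`.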

If you want results faster, at the cost of slightly lower accuracy, use the fast endpoint:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/fast

To get only the contents with the fast method:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text_fast
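
The four endpoints shown so far differ along two axes: full segments versus text only, and the default method versus the fast one. As a rough sketch, a client could select the URL like this; the function name and the `base_url` default are illustrative, with the host and port taken from the curl examples.

```python
def endpoint_url(text_only=False, fast=False, base_url="http://localhost:5080"):
    """Map the two options onto the four endpoint paths used in this README."""
    paths = {
        (False, False): "",            # segments, default method
        (True, False): "/text",        # text only, default method
        (False, True): "/fast",        # segments, fast method
        (True, True): "/text_fast",    # text only, fast method
    }
    return base_url.rstrip("/") + paths[(bool(text_only), bool(fast))]
```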

You can pass the types to these endpoints too:

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/fast -F "types=text section_header list_item"

curl -X POST -F 'file=@/PATH/TO/PDF/pdf_name.pdf' localhost:5080/text_fast -F "types=text title"
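
The same requests can be sent from Python without curl. The sketch below uses only the standard library and builds the multipart/form-data body by hand, mirroring `-F 'file=@...'` and `-F "types=..."`. The function names, the 600-second timeout, and the assumption that the response body is UTF-8 text are illustrative choices, not documented behavior of the service.

```python
import os
import urllib.request
import uuid


def build_multipart(file_name, file_bytes, fields=None):
    """Encode one PDF plus optional text fields as multipart/form-data.

    Returns (body, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in (fields or {}).items():
        header = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        chunks.append(header.encode("utf-8") + str(value).encode("utf-8") + b"\r\n")
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    )
    chunks.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def post_pdf(pdf_path, url="http://localhost:5080/text_fast", types=None, timeout=600):
    """POST a PDF to the given endpoint and return the raw response text."""
    with open(pdf_path, "rb") as handle:
        file_bytes = handle.read()
    fields = {"types": types} if types else {}
    body, content_type = build_multipart(os.path.basename(pdf_path), file_bytes, fields)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read().decode("utf-8")  # assumes a UTF-8 response body
```

With the service running, `post_pdf("/PATH/TO/PDF/pdf_name.pdf", types="text title")` corresponds to the last curl command above.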

For more information about the models and the fast method, see the pdf-document-layout-analysis repository linked above.

To stop the server:

make stop
