Local LLM PDF Text Extractor

This script converts each page of a PDF file into an image, extracts the text from each image with a locally hosted LLM through the Ollama API, and saves the result as a Markdown file. The extracted text is structured according to the formatting instructions in the script.

Features

Converts each page of a PDF into an image.
Extracts text from each image using a Large Language Model (LLM) served through the Ollama API.
Structures the extracted text in Markdown format.
Outputs the structured text as a .md file.

Requirements

Python 3.8+
Libraries: requests, PyMuPDF (imported as fitz), python-dotenv (imported as dotenv)

Install the required libraries using:

pip install requests pymupdf python-dotenv

Setup

Clone the repository.
Install the required libraries.
Add a .env file with the following environment variables:

OLLAMA_COMPLETIONS_URL=`<Your Ollama API URL>`
OLLAMA_CHAT_COMPLETIONS_URL=`<Your Ollama Chat API URL>`

Place your PDF file in the root directory or specify the correct path in the pdf_path variable.
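
As a rough illustration of how the two .env variables might be read at startup, the sketch below validates them once python-dotenv has loaded the file. The helper names read_ollama_urls and load_config are hypothetical, and the guarded import is only there so the snippet still runs where python-dotenv is not installed; the actual script may handle this differently.

```python
import os

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
except ImportError:  # keeps this sketch runnable without the package
    load_dotenv = None

REQUIRED_VARS = ("OLLAMA_COMPLETIONS_URL", "OLLAMA_CHAT_COMPLETIONS_URL")


def read_ollama_urls(env=None):
    """Return the Ollama URLs from a mapping, raising if any is missing or empty."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing in .env: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


def load_config():
    """Load .env (if python-dotenv is available) and return the validated URLs."""
    if load_dotenv is not None:
        load_dotenv()  # reads .env from the working directory into os.environ
    return read_ollama_urls()
```

Failing early with the name of the missing variable tends to be easier to debug than a connection error raised later, in the middle of processing a PDF.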

Usage

To run the script, simply execute:

python app.py

Parameters

pdf_path: Set this variable to the path of the PDF file you wish to process.
output_folder (optional): Directory where page images are temporarily saved.
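
To show how these two parameters could drive the page rendering step, here is a minimal sketch using PyMuPDF. The function names, the page_N.png naming scheme, the default folder name, and the dpi value are all assumptions made for illustration, not details taken from the script.

```python
from pathlib import Path


def page_image_path(output_folder, page_number):
    """Build the path for one page image, e.g. <output_folder>/page_1.png."""
    return Path(output_folder) / f"page_{page_number}.png"


def pdf_to_images(pdf_path, output_folder="pdf_images", dpi=150):
    """Render every page of pdf_path to a PNG and return the image paths."""
    import fitz  # PyMuPDF; imported here so the path helper works without it

    Path(output_folder).mkdir(parents=True, exist_ok=True)
    image_paths = []
    with fitz.open(pdf_path) as doc:
        for number, page in enumerate(doc, start=1):
            pix = page.get_pixmap(dpi=dpi)  # rasterize the page
            target = page_image_path(output_folder, number)
            pix.save(str(target))
            image_paths.append(target)
    return image_paths
```

A higher dpi generally gives the model sharper text to read, at the cost of larger images and slower requests.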

Workflow

Convert PDF to Images: Each page in the PDF is converted to an image and stored in the specified output folder.
Encode Images: Each image is converted to base64 for API transmission.
Extract Text: Using the Ollama API, text is extracted from each image.
Structure Output: The extracted text is structured in Markdown format and saved as a .md file.
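
The encode and extract steps above can be sketched as follows. This assumes the completions URL points at Ollama's native generate endpoint (which accepts base64 images in an images list and, with streaming disabled, returns the text in a response field). The model name, the prompt, and the function names are placeholders chosen for illustration; the script's own values may differ.

```python
import base64


def encode_image(image_path):
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as handle:
        return base64.b64encode(handle.read()).decode("utf-8")


def build_payload(image_b64, model="llama3.2-vision",
                  prompt="Extract all text from this page as Markdown."):
    """Build a non-streaming request body with one attached image."""
    # model and prompt are assumed placeholder values
    return {"model": model, "prompt": prompt, "images": [image_b64], "stream": False}


def extract_text(url, image_path, timeout=300):
    """POST one page image to the endpoint and return the model's text."""
    import requests  # imported here so the helpers above work without it

    payload = build_payload(encode_image(image_path))
    response = requests.post(url, json=payload, timeout=timeout)
    response.raise_for_status()
    return response.json()["response"]
```

If the URL instead points at a chat-style endpoint, the request body and the response field would need to change accordingly.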

Example Output

The output is saved as a Markdown file (extracted_text.md) with structured text extracted from the PDF images.
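
One plausible way to assemble that file is shown below. The per-page "## Page N" headings and the helper names are assumptions for illustration only; the actual layout depends on the formatting instructions the script gives the model.

```python
from pathlib import Path


def build_markdown(page_texts):
    """Join per-page text into one Markdown document with a heading per page."""
    sections = [
        f"## Page {number}\n\n{text.strip()}"
        for number, text in enumerate(page_texts, start=1)
    ]
    return "\n\n".join(sections) + "\n"


def save_markdown(page_texts, output_path="extracted_text.md"):
    """Write the assembled Markdown to output_path and return the path."""
    path = Path(output_path)
    path.write_text(build_markdown(page_texts), encoding="utf-8")
    return path
```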

Important Notes

API Requirements: Ensure that the Ollama API URLs (and any API keys your deployment requires) are set correctly in the .env file.
File Cleanup: Temporary images created during the process are automatically deleted.
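
For readers curious what such cleanup might look like, here is a small sketch that deletes page images from the output folder after processing. The function name and the page_*.png pattern are hypothetical; the script may name or remove its temporary files differently.

```python
from pathlib import Path


def remove_page_images(output_folder, pattern="page_*.png"):
    """Delete page images matching pattern and return how many were removed."""
    removed = 0
    for image in Path(output_folder).glob(pattern):
        image.unlink()
        removed += 1
    return removed
```

Matching on a filename pattern, rather than deleting the whole folder, leaves any unrelated files in the output directory untouched.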
