This script extracts text from images embedded within a PDF file, structures it according to the formatting instructions, and saves the result as a Markdown file.
- Converts each page of a PDF into an image.
- Extracts text from each image using a Large Language Model (LLM) API (specifically the Ollama API).
- Structures the extracted text in Markdown format.
- Outputs the structured text as a `.md` file.
- Python 3.8+
- Libraries: `requests`, `fitz` (from PyMuPDF), `dotenv`
Install the required libraries using:
`pip install requests pymupdf python-dotenv`
- Clone the repository.
- Install the required libraries.
- Add a `.env` file with the following environment variables:
OLLAMA_COMPLETIONS_URL=`<Your Ollama API URL>`
OLLAMA_CHAT_COMPLETIONS_URL=`<Your Ollama Chat API URL>`
- Place your PDF file in the root directory or specify the correct path in the `pdf_path` variable.
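As a rough illustration of how the two variables might be read at startup, here is a minimal sketch. The helper names `read_ollama_urls` and `load_settings` are hypothetical and not taken from the script; only `load_dotenv` (from `python-dotenv`) and the two variable names come from the setup steps above.

```python
import os

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
except ImportError:
    load_dotenv = None  # fall back to variables already exported in the shell

REQUIRED_VARS = ("OLLAMA_COMPLETIONS_URL", "OLLAMA_CHAT_COMPLETIONS_URL")


def read_ollama_urls(env):
    """Return the two Ollama URLs from a mapping, or raise if one is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


def load_settings():
    """Load a .env file if python-dotenv can find one, then read the URLs."""
    if load_dotenv is not None:
        load_dotenv()
    return read_ollama_urls(os.environ)
```

Failing early with the names of the missing variables tends to be easier to debug than a request to an empty URL later on.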
To run the script, execute:
`python script.py`
- `pdf_path`: Set this variable to the path of the PDF file you wish to process.
- `output_folder` (optional): Directory where page images will be temporarily saved.
- Convert PDF to Images: Each page in the PDF is converted to an image and stored in the specified output folder.
- Encode Images: Each image is converted to base64 for API transmission.
- Extract Text: Using the Ollama API, text is extracted from each image.
- Structure Output: The extracted text is structured in Markdown format and saved as a `.md` file.
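The first three steps above can be sketched roughly as follows. This is an illustration, not the script's actual code: the function names, the `llava` model name, the prompt, and the `page_N.png` file naming are all assumptions. The request body follows the shape of Ollama's native `/api/generate` endpoint; if your configured URL points at a different endpoint, the payload and response fields will differ.

```python
import base64
from pathlib import Path


def pdf_to_images(pdf_path, output_folder, dpi=150):
    """Render each PDF page to a PNG and return the image paths in page order."""
    import fitz  # PyMuPDF; imported here so the helpers below work without it

    out_dir = Path(output_folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    with fitz.open(pdf_path) as doc:
        for index, page in enumerate(doc, start=1):
            pix = page.get_pixmap(dpi=dpi)  # rasterize the page
            image_path = out_dir / f"page_{index}.png"  # hypothetical naming
            pix.save(str(image_path))
            paths.append(image_path)
    return paths


def encode_image(image_path):
    """Return the file contents as a base64 ASCII string."""
    return base64.b64encode(Path(image_path).read_bytes()).decode("ascii")


def build_payload(image_b64, model="llava",
                  prompt="Extract all text from this image."):
    """Build a request body shaped like Ollama's /api/generate input.

    The model name and prompt are placeholders, not the script's values.
    """
    return {"model": model, "prompt": prompt,
            "images": [image_b64], "stream": False}


def extract_text(url, image_b64, timeout=120):
    """POST one image to the API and return the generated text."""
    import requests  # imported lazily for the same reason as fitz

    response = requests.post(url, json=build_payload(image_b64), timeout=timeout)
    response.raise_for_status()
    # Assumes a non-streaming /api/generate reply, which has a "response" field.
    return response.json()["response"]
```

A typical call chain would be `pdf_to_images(pdf_path, output_folder)`, then `encode_image` and `extract_text` for each returned path.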
The output is saved as a Markdown file (`extracted_text.md`) with structured text extracted from the PDF images.
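For illustration, writing the per-page text to the output file could look something like the sketch below. The `write_markdown` helper and the `## Page N` headings are assumptions made for this example; the structure the script actually produces depends on its formatting instructions.

```python
from pathlib import Path


def write_markdown(page_texts, output_path="extracted_text.md"):
    """Join the text of each page under a page heading and write one file."""
    sections = [
        f"## Page {number}\n\n{text.strip()}"
        for number, text in enumerate(page_texts, start=1)
    ]
    target = Path(output_path)
    target.write_text("\n\n".join(sections) + "\n", encoding="utf-8")
    return target
```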
- API Requirements: Ensure that your API keys and URLs for the Ollama API are set correctly in the `.env` file.
- File Cleanup: Temporary images created during the process are automatically deleted.
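One way such cleanup could be arranged is with a `try`/`finally` block, so the page images are removed even if extraction fails partway through. The helper names below are hypothetical and the script may handle this differently.

```python
from pathlib import Path


def remove_images(image_paths):
    """Delete each temporary image, ignoring files that are already gone."""
    for path in image_paths:
        Path(path).unlink(missing_ok=True)  # missing_ok requires Python 3.8+


def process_pages(image_paths, handle_image):
    """Apply handle_image to each path, deleting the files even on error."""
    try:
        return [handle_image(path) for path in image_paths]
    finally:
        remove_images(image_paths)
```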