Skip to content

This Python-based tool automatically processes large PDF documents, identifies sections based on text formatting, and generates intelligent summaries using a Local Large Language Model (LLM). The summaries are then compiled into a well-structured Word document.

License

Notifications You must be signed in to change notification settings

filosofo33/summary-of-large-files

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 

Repository files navigation

PDF Section Summarizer with LLM

Intelligent Document Analysis & Summarization Tool

Overview

This Python-based tool automatically processes large PDF documents, identifies sections based on text formatting, and generates intelligent summaries using a Local Large Language Model (LLM). The summaries are then compiled into a well-structured Word document.

Features

  • PDF text extraction with formatting awareness
  • Automatic section detection based on font size
  • Intelligent text summarization using Local LLM
  • Word document generation with formatted summaries
  • Special character handling for API compatibility
  • Progress tracking during processing

Requirements

PyMuPDF>=1.18.0
python-docx>=0.8.11
subprocess
json

Usage

from pdf_summarizer import read_pdf_and_summarize

# Basic usage
read_pdf_and_summarize("input.pdf", "output.docx")

# Start from specific page
read_pdf_and_summarize("input.pdf", "output.docx", start_page=5)

API Configuration

The tool requires a local LLM API running on port 1234., you could also use an external API, but a local one is preferred because of costs because we are working with very big pdf files.

Text Processing Rules

  • Sections are identified by font size > 13
  • Minimum 35 words required for summarization
  • Text chunks limited to 1900 words per API call so you are not exceeding with your tokens the context wimdow

Error Handling

  • JSON decode error management
  • API communication error handling
  • Empty response handling
  • Unicode character sanitization

Contributing

  1. Fork the repository
  2. Create your feature branch
  3. Commit your changes
  4. Push to the branch
  5. Create a Pull Request

License

This project is licensed under the MIT License - see the LICENSE file for details.

#DocumentProcessing #MachineLearning #PDF #Python #LLM

Note: Make sure your local LLM API is properly configured before running the tool, i use llmstudio

About

This Python-based tool automatically processes large PDF documents, identifies sections based on text formatting, and generates intelligent summaries using a Local Large Language Model (LLM). The summaries are then compiled into a well-structured Word document.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages