ilanit1997/PDFAnalyzer
# 💾 PDF Analyzer — README

This project implements a document intelligence system, allowing you to process and analyze business documents (invoices, contracts, reports) using LLMs and serve results through a FastAPI interface.


## Setup Instructions

### 1. 📦 Install dependencies

Create a virtual environment (optional but recommended):

```bash
python -m venv venv
source venv/bin/activate  # or venv\Scripts\activate on Windows
```

Then install requirements:

```bash
pip install -r requirements.txt
```

### 2. 🔑 Set your OpenAI key

Create a `.env` file or set your environment variable:

```bash
export OPENAI_API_KEY=sk-...
```

### 3. 🚀 Run the API

```bash
python api.py
```

Access the Swagger UI at:

http://localhost:8000/docs

Upload a PDF and view the extracted metadata.


## 🧠 Approach Overview

This project is organized into two main components:

### ✅ 1. Document Understanding System

This component handles PDF text extraction, document classification, and metadata extraction using GPT-4o-mini.

#### 📌 Label Definitions

```python
label_descriptions = {
    "Invoice": "A bill for goods or services, typically including vendor, amount, due date, and line items.",
    "Contract": "A legal agreement between parties, containing terms, dates, and responsibilities.",
    "Earnings": "A financial or business report summarizing revenue, profits, expenses, and other key metrics.",
    "Other": "Any other type of document that does not fit the above categories."
}
```

#### 🧩 Subcomponents

- Classification
  - Document loading: uses pdfplumber to extract text on a per-page basis.
  - Classifier: `RunnableGPTLogprobClassifier` uses GPT-4o-mini token logprobs to assign one of the four labels (Invoice, Contract, Earnings, or Other).
- Metadata Extraction
  - Uses a type-specific `RunnableMetadataExtractor`, which combines:
    - prompts tailored to the document type
    - structured output parsing with Pydantic models
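The internals of `RunnableGPTLogprobClassifier` are not reproduced here, but the core logprob step can be sketched as follows (function names are hypothetical): the log-probabilities the model assigns to each candidate label token are exponentiated and renormalized into a probability distribution over the labels.

```python
import math

LABELS = ["Invoice", "Contract", "Earnings", "Other"]

def label_probabilities(label_logprobs: dict) -> dict:
    """Convert raw per-label logprobs into a normalized probability distribution."""
    # exp() turns each logprob back into an (unnormalized) probability
    probs = {label: math.exp(lp) for label, lp in label_logprobs.items()}
    total = sum(probs.values())
    return {label: p / total for label, p in probs.items()}

def classify(label_logprobs: dict) -> tuple:
    """Return the most likely label together with its normalized probability."""
    probs = label_probabilities(label_logprobs)
    best = max(probs, key=probs.get)
    return best, probs[best]
```

In practice the logprobs would come from the chat completions API with `logprobs` enabled; the dict-in/dict-out shape above is an illustrative simplification.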

### ✅ 2. API Design (FastAPI)

📄 See `api_docs.md` for full endpoint documentation and examples.


## 🧠 Testing Approach

We evaluated the pipeline using real-world PDF documents collected from open-source repositories. The core/main.py script was used to test multiple files locally. The goal was to validate pipeline performance across diverse document types under realistic conditions.

### 📂 Document Types

1. Invoices

   - Source: invoice2data test set
   - Note: this repository provides invoice samples used for testing invoice parsers. We are not using their codebase—only the PDF files.
   - Usage: 11 invoices were selected to test the pipeline.

2. Contracts

3. Earnings Reports

   - Source: collected via targeted web search for "Earnings report presentation PDF". Several public companies regularly publish earnings summaries in PDF format.
   - Samples:

๐Ÿ” Key Findings

1. Contract Length: contracts tend to be lengthy; the number of pages processed should be optimized for efficiency.

2. Classification Strategy: document classification can be reliably performed using only a subset of pages (e.g., the first 3 pages), since the document type is usually evident early on.

3. Overconfidence in Predictions: the classifier often returns overly confident predictions (e.g., probabilities near 1.0), which is suboptimal. Confidence calibration should be considered.

4. "Other" Category Addition: introduced an "Other" category to handle outlier documents and to encourage a more realistic distribution of classification probabilities.

5. Metadata Extraction Observations:
   - Due Date (Invoices): needs a more explicit definition to distinguish it clearly from issue dates or payment dates.
   - Vendor Name (Invoices): ambiguity exists regarding whether this refers to the issuing or receiving company; requires clarification.

6. Classification Errors:
   - Invoice → Other: occurred when the document was a payment receipt, not a formal invoice.
   - Earnings → Other: happened when the document was a general investor presentation, although it included earnings data in later pages.

   📌 Suggestion: instead of using a fixed `max_pages`, consider dynamically adjusting the page limit based on a fraction of the document's total length (e.g., `alpha × total_pages`) to improve classification and extraction performance.
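The dynamic page-limit suggestion above could be implemented along these lines (the helper name, default `alpha`, and bounds are illustrative, not tuned values from the project):

```python
import math

def pages_to_process(total_pages: int, alpha: float = 0.3,
                     min_pages: int = 3, max_pages: int = 20) -> int:
    """Pick how many pages to feed the pipeline: a fraction of the
    document (alpha * total_pages), bounded below by min_pages and
    above by max_pages, and never more than the document itself."""
    dynamic = math.ceil(alpha * total_pages)
    return min(total_pages, max(min_pages, min(dynamic, max_pages)))
```

This keeps short documents fully covered while capping the token cost of very long contracts.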

๐Ÿ” File Overview


```
├── api.py                      # API entrypoint (FastAPI app)
├── core/
│   ├── main.py                 # Script to test multiple files locally
│   ├── document_loader.py      # PDF parsing using pdfplumber
│   ├── document_classification.py  # Classifier with GPT logprobs
│   ├── metadata_extraction.py  # Metadata prompts + extraction runners
│   ├── document_pipeline.py    # Manages the pipeline per document: load PDF -> classification -> extraction
│   └── action_generator.py     # Suggests next steps based on metadata, e.g. "Schedule payment"
├── documents/                  # Assignment PDF files for prediction
├── output/                     # Output directory for processed files
├── documents-extra/            # Additional documents for testing
│   ├── Invoice/                # Sample invoices for testing
│   ├── Contract/               # Sample contracts for testing
│   └── Earning Report/         # Sample earnings reports for testing
├── output-extra/               # Output directory for processed additional testing files
│   ├── Invoice/
│   ├── Contract/
│   └── Earning Report/
├── requirements.txt            # Clean dependency list
├── api_docs.md                 # Endpoint documentation and examples
└── README.md                   # This file
```

## 🤖 Part 3 – AI-Powered Features for Factify

### 🔹 Feature 1: Natural Language Summary of Extracted Metadata

What: after extracting structured metadata, the system generates a short, human-readable summary (e.g., "Invoice from Example LLC for $1200, due July 1").

Difficulty: easy; summarization is a common LLM task.

How:

- Use templating or LLM summarization to convert key metadata fields into natural language.
- Triggered post-extraction and cached for fast UI display.
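The templating variant is the simplest option and can be sketched in a few lines (field names here are hypothetical, not the project's actual schema):

```python
def summarize_invoice(metadata: dict) -> str:
    """Render extracted invoice metadata as a one-line summary.
    An LLM call could replace this template for more natural phrasing."""
    return (
        f"Invoice from {metadata.get('vendor_name', 'an unknown vendor')} "
        f"for ${metadata.get('amount', '?')}, due {metadata.get('due_date', 'an unspecified date')}"
    )
```

Because the output depends only on already-extracted fields, it is cheap to compute once and cache alongside the metadata.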

Business Value:

- Enables metadata previews in the UI or chatbot responses.
- Makes structured fields more accessible to non-technical users.
- AI value: makes later AI tasks (like search or Q&A) more effective by providing context.

### 🔹 Feature 2: Next-Step Action Suggestion

What: the system proposes the next logical action for the document (e.g., "Schedule payment", "Review contract", "Flag for renewal").

How:

- A rule-based system identifies actionable metadata like due dates or termination periods.
- Optionally enhanced by an LLM that interprets metadata contextually.
- Served via the `/documents/{id}/actions` endpoint.
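A minimal sketch of the rule-based layer might look like this (the rules, thresholds, and field names are illustrative assumptions, not the logic of `action_generator.py`):

```python
from datetime import date, timedelta

def suggest_actions(doc_type: str, metadata: dict, today: date) -> list:
    """Map actionable metadata to next-step suggestions via simple rules."""
    actions = []
    if doc_type == "Invoice" and metadata.get("due_date"):
        actions.append("Schedule payment")          # an unpaid bill with a due date
    if doc_type == "Contract":
        end = metadata.get("end_date")              # assumed to be a datetime.date
        if end and end - today <= timedelta(days=30):
            actions.append("Flag for renewal")      # termination window approaching
        else:
            actions.append("Review contract")
    return actions
```

An LLM layer could then rerank or rephrase these suggestions using the full metadata context.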

Business Value:

- Transforms documents into workflow triggers.
- Helps users stay ahead of deadlines, obligations, and business tasks.
- Can power reminders, dashboards, or automated task queues.

๐Ÿญ Production Considerations

๐Ÿ”ง Handling LLM API Failures

  • Use retry logic (tenacity) for various failures (e.g., rate limits, timeouts, parsing errors).
  • If classification or metadata parsing fails after retries, send to human review or log for manual inspection.
  • Log all failures with document ID and traceback for audit/debugging.
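In the project this role would be played by tenacity's decorators; the stdlib sketch below just illustrates the intended control flow (retry on specific exception types with exponential backoff, then surface the error for human review):

```python
import time

def with_retries(fn, attempts: int = 3, base_delay: float = 0.01,
                 retryable=(TimeoutError, ValueError)):
    """Call fn(), retrying on retryable exceptions with exponential backoff.
    Exception types and delays are illustrative placeholders."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retryable:
            if attempt == attempts:
                raise  # retries exhausted: caller logs and routes to human review
            time.sleep(base_delay * 2 ** (attempt - 1))  # 1x, 2x, 4x, ... backoff
```

With tenacity the equivalent would be a `@retry(...)` decorator combining `retry_if_exception_type`, `wait_exponential`, and `stop_after_attempt`.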

### 💾 Caching Strategy

- Use a hash of the extracted PDF text (or the file hash) as the cache key.
- Store the classification + metadata outputs under that key.
- Before calling the LLM, check whether a previously processed result exists for this hash.

This ensures exact duplicates (same file content) are only processed once.

A more sophisticated caching scheme could additionally store the textual content in a vector database to enable similarity search over near-duplicates.
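The exact-duplicate part of this strategy is straightforward to sketch (the on-disk JSON layout and `cache/` location are assumptions for illustration; `process` stands in for the full LLM pipeline call):

```python
import hashlib
import json
from pathlib import Path

def cache_key(text: str) -> str:
    """Stable cache key derived from the extracted PDF text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def get_or_process(text: str, process, cache_dir: Path = Path("cache")) -> dict:
    """Return the cached classification/metadata result for this text,
    or run the pipeline once and store the result."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(text)}.json"
    if path.exists():
        return json.loads(path.read_text())  # cache hit: skip the LLM entirely
    result = process(text)                   # cache miss: run the pipeline
    path.write_text(json.dumps(result))
    return result
```

Identical file contents hash to the same key, so each duplicate costs only one LLM call.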

### 💰 Cost Estimate per Document

Assuming usage of gpt-4o-mini:

- Input pricing: $0.60 per 1M tokens
- Output pricing: $2.40 per 1M tokens

#### 🧾 Breakdown per Document

- Classification prompt: ~100 input tokens
- Metadata extraction prompt: ~700 input tokens
- Document text (Y): estimated size of the extracted content in tokens; it is sent to both the classification call and the extraction call
- Total input tokens: 2 × Y + 800
- Output tokens:
  - Classification: ~1 token
  - Metadata: ~200 tokens
  - Total output: ~201 tokens

If we cut classification off at page N, the input tokens drop to Y + Y_N + 800, where Y_N is the number of tokens in the first N pages.

#### 💸 Formula

Total Cost = ((2 × Y + 800) / 1,000,000 × $0.60) + (201 / 1,000,000 × $2.40)

#### 🧮 Example (Y = 1000 tokens of document content)

- Input: 2 × 1000 + 800 = 2800 tokens → $0.00168
- Output: 201 tokens → $0.0004824
- Total: ~$0.00216 per document
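The formula translates directly into a small helper (the defaults simply restate the estimates above):

```python
def cost_per_document(doc_tokens: int, prompt_overhead: int = 800,
                      output_tokens: int = 201,
                      input_rate: float = 0.60, output_rate: float = 2.40) -> float:
    """Estimated USD cost per document; rates are $ per 1M tokens.
    The document text is counted twice because it is sent to both
    the classification call and the extraction call."""
    input_tokens = 2 * doc_tokens + prompt_overhead
    return (input_tokens / 1_000_000) * input_rate \
         + (output_tokens / 1_000_000) * output_rate
```

For Y = 1000 this reproduces the ~$0.00216 figure worked out above.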
