This project implements a document intelligence system that processes and analyzes business documents (invoices, contracts, reports) using LLMs and serves the results through a FastAPI interface.
Create a virtual environment (optional but recommended):

    python -m venv venv
    source venv/bin/activate   # or venv\Scripts\activate on Windows

Then install the requirements:

    pip install -r requirements.txt

Create a `.env` file or set your environment variable:

    export OPENAI_API_KEY=sk-...

Run the API:

    python api.py

Access the Swagger UI (served by FastAPI at `/docs` by default), upload a PDF, and view the extracted metadata.
This project is organized into two main components:
This component handles PDF text extraction, document classification, and metadata extraction using GPT-4o-mini.

    label_descriptions = {
        "Invoice": "A bill for goods or services, typically including vendor, amount, due date, and line items.",
        "Contract": "A legal agreement between parties, containing terms, dates, and responsibilities.",
        "Earnings": "A financial or business report summarizing revenue, profits, expenses, and other key metrics.",
        "Other": "Any other type of document that does not fit the above categories."
    }
- Classification
  - Document loading: Uses `pdfplumber` to extract text on a per-page basis.
  - Classifier: `RunnableGPTLogprobClassifier` applies GPT-4o-mini logprobs to assign one of the four labels (`Invoice`, `Contract`, `Earnings`, or `Other`).
- Metadata Extraction
  - Uses a type-specific `RunnableMetadataExtractor`, which combines:
    - Prompts based on document type
    - Structured parsing with `pydantic`
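For illustration only, the sketch below shows how per-page loading and logprob-based classification could fit together. It does not reproduce the repository's `RunnableGPTLogprobClassifier`; the function names, prompt wording, and token-matching heuristic are assumptions, and it assumes the OpenAI Python SDK.

```python
# Hypothetical sketch of the load -> classify-with-logprobs flow described above.
import math
import pdfplumber
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment
LABELS = ["Invoice", "Contract", "Earnings", "Other"]

def load_pages(path: str, max_pages: int = 3) -> list[str]:
    """Extract text per page with pdfplumber, keeping only the first few pages."""
    with pdfplumber.open(path) as pdf:
        return [page.extract_text() or "" for page in pdf.pages[:max_pages]]

def classify_with_logprobs(text: str) -> dict[str, float]:
    """Request a single-token label and turn its top_logprobs into label probabilities."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system",
             "content": f"Classify the document as one of: {', '.join(LABELS)}. Answer with the label only."},
            {"role": "user", "content": text},
        ],
        max_tokens=1,
        logprobs=True,
        top_logprobs=10,
    )
    top = response.choices[0].logprobs.content[0].top_logprobs
    scores = {label: 0.0 for label in LABELS}
    for candidate in top:
        token = candidate.token.strip()
        for label in LABELS:
            # Crude heuristic: credit a label if the candidate token starts it.
            if token and label.startswith(token):
                scores[label] += math.exp(candidate.logprob)
    return scores

pages = load_pages("documents/sample.pdf")
print(classify_with_logprobs("\n".join(pages)))
```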
View the full API documentation in `api_docs.md`.
We evaluated the pipeline using real-world PDF documents collected from open-source repositories. The core/main.py script was used to test multiple files locally. The goal was to validate pipeline performance across diverse document types under realistic conditions.
- Source: invoice2data test set
- Note: This repository provides invoice samples used for testing invoice parsers. We are not using their codebase, only the PDF files.
- Usage: 11 invoices were selected to test the pipeline.
- Source: CUAD v1 (Contract Understanding Atticus Dataset)
- Note: Contains numerous contract PDFs. We sampled 8 documents for testing.
- Source: Collected via targeted web search for "Earnings report presentation PDF". Several public companies regularly publish earnings summaries in PDF format.
- Samples:
- Contract Length: Contracts tend to be lengthy; the number of pages processed should be optimized for efficiency.
- Classification Strategy: Document classification can be reliably performed using only a subset of pages (e.g., the first 3 pages), since the document type is usually evident early on.
- Overconfidence in Predictions: The classifier often returns overly confident predictions (e.g., probabilities near 1.0), which is suboptimal. Confidence calibration should be considered.
- "Other" Category Addition: Introduced an "Other" category to handle outlier documents and to encourage a more realistic distribution of classification probabilities.
- Metadata Extraction Observations:
  - Due Date (Invoices): Needs a more explicit definition to distinguish it clearly from issue dates or payment dates.
  - Vendor Name (Invoices): Ambiguity exists regarding whether this refers to the issuing or receiving company; requires clarification.
- Classification Errors:
  - Invoice → Other: Occurred when the document was a payment receipt, not a formal invoice.
  - Earnings → Other: Happened when the document was a general investor presentation, although it included earnings data in later pages.

Suggestion: Instead of using a fixed `max_pages`, consider dynamically adjusting the page limit based on a fraction of the document's total length (e.g., `alpha × total_pages`) to improve classification and extraction performance; a small sketch follows below.
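As an illustration of that suggestion, the helper below computes a proportional page limit. The function name and the default values for `alpha`, `min_pages`, and `hard_cap` are hypothetical choices, not values from this repository.

```python
def dynamic_page_limit(total_pages: int, alpha: float = 0.2,
                       min_pages: int = 3, hard_cap: int = 20) -> int:
    """Scale the number of processed pages with document length.

    Reads at least `min_pages`, roughly alpha * total_pages for longer
    documents, and never more than `hard_cap` pages to bound cost.
    """
    proportional = int(alpha * total_pages)
    return max(min_pages, min(proportional, hard_cap, total_pages))

# Example: a 4-page invoice -> 3 pages, an 80-page contract -> 16 pages.
assert dynamic_page_limit(4) == 3
assert dynamic_page_limit(80) == 16
```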

    ├── api.py                          # API entrypoint (FastAPI app)
    ├── core/
    │   ├── main.py                     # Script to test multiple files locally
    │   ├── document_loader.py          # PDF parsing using pdfplumber
    │   ├── document_classification.py  # Classifier with GPT logprobs
    │   ├── metadata_extraction.py      # Metadata prompts + extraction runners
    │   ├── document_pipeline.py        # Manages pipeline: loading pdf -> classification -> extraction (per document)
    │   └── action_generator.py         # Suggests next steps based on metadata, e.g. "Schedule payment"
    ├── documents/                      # Assignment PDF files for prediction
    ├── output/                         # Output directory for processed files
    ├── documents-extra/                # Additional documents for testing
    │   ├── Invoice/                    # Sample invoices for testing
    │   ├── Contract/                   # Sample contracts for testing
    │   └── Earning Report/             # Sample earnings reports for testing
    ├── output-extra/                   # Output directory for processed additional testing files
    │   ├── Invoice/
    │   ├── Contract/
    │   └── Earning Report/
    ├── requirements.txt                # Clean dependency list
    ├── api_docs.md                     # Endpoint documentation and examples
    └── README.md                       # This file
What: After extracting structured metadata, the system generates a short, human-readable summary (e.g., "Invoice from Example LLC for $1200, due July 1").

Difficulty: Easy; summarization is a common LLM task.

How:
- Use templating or LLM summarization to convert key metadata fields into natural language.
- Triggered post-extraction and cached for fast UI display.
Business Value:
- Enables metadata previews in the UI or chatbot responses.
- Makes structured fields more accessible to non-technical users.
- AI value: Makes later AI tasks (like search or Q&A) more effective by providing context.
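For the templating option above, a minimal sketch is shown below. The `InvoiceMetadata` fields are hypothetical stand-ins for the extracted fields, not this repository's actual pydantic models.

```python
from dataclasses import dataclass

@dataclass
class InvoiceMetadata:
    # Hypothetical fields; the real extraction models are defined elsewhere with pydantic.
    vendor_name: str
    total_amount: float
    currency: str
    due_date: str

def summarize_invoice(meta: InvoiceMetadata) -> str:
    """Turn extracted fields into a one-line, human-readable preview."""
    return (f"Invoice from {meta.vendor_name} for {meta.currency}{meta.total_amount:,.2f}, "
            f"due {meta.due_date}")

print(summarize_invoice(InvoiceMetadata("Example LLC", 1200.0, "$", "July 1")))
# -> Invoice from Example LLC for $1,200.00, due July 1
```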
What: The system proposes the next logical action for the document (e.g., "Schedule payment", "Review contract", "Flag for renewal").
How:
- Rule-based system identifies actionable metadata like due dates or termination periods.
- Optionally enhanced by an LLM that interprets metadata contextually.
- Served via the `/documents/{id}/actions` endpoint.
Business Value:
- Transforms documents into workflow triggers.
- Helps users stay ahead of deadlines, obligations, and business tasks.
- Can power reminders, dashboards, or automated task queues.
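A minimal sketch of the rule-based part described above, assuming the extracted metadata is available as a dict; the field names and action strings are illustrative, not the exact ones produced by `action_generator.py`.

```python
from datetime import date, timedelta

def suggest_actions(doc_type: str, metadata: dict) -> list[str]:
    """Map actionable metadata fields to suggested next steps."""
    actions = []
    if doc_type == "Invoice" and metadata.get("due_date"):
        actions.append(f"Schedule payment before {metadata['due_date']}")
    if doc_type == "Contract" and metadata.get("termination_date"):
        actions.append("Review contract and flag for renewal")
    if not actions:
        actions.append("File document for reference")
    return actions

# Example usage for an invoice due in two weeks.
due = (date.today() + timedelta(days=14)).isoformat()
print(suggest_actions("Invoice", {"due_date": due}))
```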
- Use retry logic (`tenacity`) for various failures (e.g., rate limits, timeouts, parsing errors).
- If classification or metadata parsing fails after retries, send the document to human review or log it for manual inspection.
- Log all failures with document ID and traceback for audit/debugging.
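A sketch of the retry wrapper using `tenacity`; the exception type, retry budget, and backoff values are assumptions, not the project's actual configuration.

```python
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

class LLMCallError(Exception):
    """Placeholder for rate-limit / timeout / parsing failures raised by the pipeline."""

@retry(
    retry=retry_if_exception_type(LLMCallError),
    stop=stop_after_attempt(3),                    # give up after three tries
    wait=wait_exponential(multiplier=1, max=30),   # back off between attempts
    reraise=True,                                  # surface the last error to the caller
)
def classify_document(text: str) -> str:
    # Hypothetical wrapper around the classifier call; a raised LLMCallError
    # triggers a retry, and the final failure propagates to the caller, which
    # can log it with the document ID and queue it for manual review.
    ...
```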
- Use a hash of the extracted PDF text (or file hash) as a cache key.
- Store classification + metadata outputs.
- Before calling the LLM, check if a previously processed result exists for this hash.
This ensures exact duplicates (same file content) are only processed once.
More complicated caching may involve storing the textual content in a vector database for similarity search.
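A minimal sketch of the exact-duplicate cache, assuming a local JSON file as the store (`output/cache.json` is an assumed location; a real deployment might use a database instead).

```python
import hashlib
import json
from pathlib import Path

CACHE_PATH = Path("output/cache.json")  # assumed location, not part of the repo layout

def cache_key(text: str) -> str:
    """Hash the extracted PDF text so identical content maps to the same key."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def load_cache() -> dict:
    return json.loads(CACHE_PATH.read_text()) if CACHE_PATH.exists() else {}

def get_or_process(text: str, process) -> dict:
    """Return the cached classification + metadata, or compute and store it."""
    cache = load_cache()
    key = cache_key(text)
    if key not in cache:  # only call the LLM for unseen content
        cache[key] = process(text)
        CACHE_PATH.parent.mkdir(parents=True, exist_ok=True)
        CACHE_PATH.write_text(json.dumps(cache, indent=2))
    return cache[key]
```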
Assuming usage of gpt-4o-mini:
- Input Pricing: $0.60 per 1M tokens
- Output Pricing: $2.40 per 1M tokens
- Classification Prompt: ~100 input tokens
- Metadata Extraction Prompt: ~700 input tokens
- Document Text (Y): estimated size of extracted content in tokens
- Total Input Tokens: 2 × Y + 800
- Output Tokens:
  - Classification: ~1 token
  - Metadata: ~200 tokens
  - Total Output: ~201 tokens
If we cut off at page N for classification, the input tokens would be Y + Y_N + 800, where Y_N is the number of tokens in the first N pages (classification reads only the first N pages, while metadata extraction still reads the full document).
Total Cost = ((2 × Y + 800) / 1,000,000 × $0.60) + (201 / 1,000,000 × $2.40)
Example with Y = 1000 tokens:

- Input: 2 × 1000 + 800 = 2800 tokens → $0.00168
- Output: 201 tokens → $0.0004824
- Total: ~$0.00216 per document
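The same arithmetic as a small helper, using the per-token prices and prompt-size estimates listed above (these constants are the section's estimates, not measured values):

```python
INPUT_PRICE_PER_M = 0.60   # $ per 1M input tokens (gpt-4o-mini)
OUTPUT_PRICE_PER_M = 2.40  # $ per 1M output tokens (gpt-4o-mini)
PROMPT_TOKENS = 800        # ~100 classification + ~700 extraction prompt tokens
OUTPUT_TOKENS = 201        # ~1 classification + ~200 metadata tokens

def estimate_cost(doc_tokens: int) -> float:
    """Estimate per-document cost when the full text Y is sent to both calls."""
    input_tokens = 2 * doc_tokens + PROMPT_TOKENS
    return (input_tokens / 1_000_000 * INPUT_PRICE_PER_M
            + OUTPUT_TOKENS / 1_000_000 * OUTPUT_PRICE_PER_M)

print(f"{estimate_cost(1000):.5f}")  # ~0.00216 for a 1000-token document
```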