AI-powered document data extraction toolkit
Extract structured data from documents (invoices, receipts, forms) using any supported provider. Easily integrate into your Python applications with flexible input options and built-in cost tracking.
⚠️ Early Development: This project is in active development. Core functionality is working, but many features are still being built.
- ✅ Vision API Integration: Extract data from images (.jpg, .png, .gif, .webp)
- ✅ Flexible Input: Accepts file paths, bytes, or file-like objects (e.g., from PIL or requests)
- ✅ Cost Tracking: Built-in monitoring and limits for API usage (still being refined)
- ✅ Structured Output: Returns Pydantic-validated data models that you can define
- ✅ Providers: Currently supports Anthropic, OpenAI, and local models via Ollama
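As a sketch of what a user-defined output model might look like (the `InvoiceData` name and its fields are illustrative assumptions, not part of the library):

```python
from pydantic import BaseModel


class InvoiceData(BaseModel):
    """Illustrative schema; the field names here are assumptions, not library defaults."""

    invoice_number: str
    total_amount: float


# Validating raw extracted data against the model
raw = {"invoice_number": "INV-001", "total_amount": 149.99}
invoice = InvoiceData.model_validate(raw)
print(invoice.invoice_number)  # INV-001
```

Pydantic raises a `ValidationError` if the extracted data does not match the declared types, which is what makes the output "structured" rather than a loose dict.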
```shell
# Install dependencies
uv sync

# Setup environment
cp .env.template .env
# Add your Anthropic or OpenAI API key to .env

# Run a test
uv run python example.py
```

```python
from harvestor import harvest

# From file path
result = harvest("invoice.jpg")
print(f"Invoice #: {result.data.get('invoice_number')}")
print(f"Total: ${result.data.get('total_amount')}")
print(f"Cost: ${result.total_cost:.4f}")

# From bytes (e.g., API upload)
with open("invoice.jpg", "rb") as f:
    data = f.read()
result = harvest(data, filename="invoice.jpg")

# From file-like object
from io import BytesIO

buffer = BytesIO(image_data)  # image_data: raw image bytes
result = harvest(buffer, filename="invoice.jpg")

# Display summary output
print(result.to_summary())
```

```shell
# Install test dependencies
uv sync --extra dev

# Run tests
make test

# Run with coverage
make test-cov
```

- Python 3.13
- Anthropic API key or OpenAI API key
- Optional: Ollama for local model support
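Because each result exposes `total_cost`, a caller-side budget guard is easy to layer on top of batch processing. Here is a minimal sketch; the `process_within_budget` helper and the per-document cost figures are hypothetical, not part of the toolkit:

```python
def process_within_budget(costs, budget):
    """Return (documents processed, total spent) before cumulative cost would exceed budget."""
    spent = 0.0
    processed = 0
    for cost in costs:
        if spent + cost > budget:
            break  # stop before exceeding the budget
        spent += cost
        processed += 1
    return processed, spent


# Hypothetical per-document API costs (e.g., collected from result.total_cost)
costs = [0.012, 0.015, 0.011, 0.018]
n, spent = process_within_budget(costs, budget=0.04)
print(n, round(spent, 3))  # 3 0.038
```

In a real batch run, the same check would wrap each `harvest()` call so processing halts as soon as the next document would push spending past the limit.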
For testing and evaluation, we are currently using the following dataset:
Limam, M., et al. FATURA Dataset. Zenodo, 13 Dec. 2023, https://doi.org/10.5281/zenodo.10371464.
MIT