An automated platform to scrape, classify, parse, and compile multi-year financial statements from European company investor relations websites.
The Financial Data Extractor automates the labor-intensive process of collecting and standardizing financial data from annual reports. It handles the following stages, sketched in code after the list:
- Web Scraping: Automated discovery and download of annual reports using intelligent LLM-powered extraction
- Document Classification: Categorization of PDFs (Annual Reports, Presentations, etc.)
- Data Extraction: LLM-powered parsing of financial statements from PDF documents
- Normalization: Fuzzy matching and deduplication of line items across multiple years
- Compilation: Aggregation of 10 years of financial data into unified views
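Since the backend runs on Celery (see the tech stack below), these stages map naturally onto chained tasks. The following is a minimal sketch of that idea, not the project's actual task definitions: the task names, signatures, and the example company ID are assumptions for illustration.

```python
# Hypothetical sketch of the pipeline as a Celery chain.
# Task names and signatures are illustrative, not the project's real ones.
from celery import Celery, chain

app = Celery("extractor", broker="redis://localhost:6379/0")

@app.task
def scrape_reports(company_id: int) -> list[str]:
    """Discover and download annual-report PDFs; return document IDs."""
    ...

@app.task
def classify_documents(doc_ids: list[str]) -> list[str]:
    """Keep only the documents classified as Annual Reports."""
    ...

@app.task
def parse_statements(doc_ids: list[str]) -> list[dict]:
    """LLM-parse financial statements out of each PDF."""
    ...

@app.task
def compile_statements(statements: list[dict]) -> dict:
    """Normalize, deduplicate, and merge line items across years."""
    ...

# In a chain, each task's return value feeds the next task's first argument.
pipeline = chain(
    scrape_reports.s(company_id=42),
    classify_documents.s(),
    parse_statements.s(),
    compile_statements.s(),
)
pipeline.apply_async()
```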
Concretely, the pipeline breaks down into five steps:
- Scrape & Classify: Identify and categorize PDFs from investor relations websites using Crawl4AI
- Parse: Extract financial data from Annual Reports using an LLM (via OpenRouter)
- Compile: Aggregate 10 years of financial data into unified views
- Deduplicate: Align and merge similarly-named line items across years (see the fuzzy-matching sketch after this list)
- Prioritize Latest: Use restated data from newer reports when available
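The Deduplicate step leans on fuzzy string matching, and rapidfuzz appears in the stack below. Here is a minimal sketch of the idea, assuming a flat list of line-item labels; the scorer and the 90-point threshold are illustrative assumptions, not the project's actual settings.

```python
# Minimal fuzzy-dedup sketch using rapidfuzz; the scorer and threshold
# are illustrative assumptions, not the project's real configuration.
from rapidfuzz import fuzz

def merge_line_items(labels: list[str], threshold: float = 90.0) -> dict[str, str]:
    """Map each label to the first canonical label it fuzzily matches."""
    canonical: list[str] = []
    mapping: dict[str, str] = {}
    for label in labels:
        match = next(
            (c for c in canonical
             if fuzz.token_sort_ratio(label, c) >= threshold),
            None,
        )
        if match is None:          # no close match yet: new canonical label
            canonical.append(label)
            match = label
        mapping[label] = match
    return mapping

# "Revenues" scores ~93 against "Revenue", so both collapse onto one item.
print(merge_line_items(["Revenue", "Revenues", "Cost of sales"]))
```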
Backend: FastAPI, Celery, PostgreSQL, Redis, SQLAlchemy, Alembic
Frontend: Next.js 15, React 19, TypeScript, TailwindCSS, shadcn/ui, React Query
Processing: OpenRouter (LLM API gateway), PyMuPDF, pdfplumber, Crawl4AI, rapidfuzz
Infrastructure: Docker, Flower (Celery monitoring), PostgreSQL 16, Redis 8, MinIO, Prometheus, Grafana, Loki
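Before any LLM call, the parse stage presumably needs raw text and tables pulled out of the PDFs, which is what pdfplumber and PyMuPDF in the stack above are for. A minimal, illustrative pdfplumber sketch (the file path and the "income statement" filter are placeholders):

```python
# Extract raw text and tables from an annual report with pdfplumber.
# "annual_report.pdf" and the page filter are placeholder assumptions.
import pdfplumber

with pdfplumber.open("annual_report.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text() or ""
        tables = page.extract_tables()  # list of rows, each a list of cells
        if "income statement" in text.lower():
            for table in tables:
                for row in table:
                    print(row)
```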
- Initial Scope: 6 European companies seeded in database migrations
  - AstraZeneca PLC, SAP SE, Siemens AG, ASML Holding N.V., Unilever PLC, Allianz SE
- Scalable: Architecture supports adding more companies dynamically via API
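Adding a company at runtime is an API call against the backend. The sketch below assumes a hypothetical POST /api/v1/companies route and payload; the authoritative endpoint and schema live in the interactive API docs at http://localhost:3030/docs.

```python
# Hypothetical request: the route and payload fields are assumptions;
# check http://localhost:3030/docs for the real schema.
import requests

resp = requests.post(
    "http://localhost:3030/api/v1/companies",
    json={
        "name": "Nestlé S.A.",
        "investor_relations_url": "https://www.nestle.com/investors",
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())
```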
Complete documentation is available on GitHub Pages, including:
- Architecture overview and system design
- API reference with all endpoints
- Database schema and migrations
- Task processing with Celery
- Frontend development guide
- Infrastructure setup
```bash
# Clone the repository
git clone https://github.com/PatrykQuantumNomad/financial-data-extractor.git
cd financial-data-extractor

# Setup infrastructure (PostgreSQL, Redis, MinIO, monitoring)
cd infrastructure
make up-dev

# Setup backend
cd ../backend
make install-dev
make migrate

# Start backend in one terminal
make run

# Start Celery worker in another terminal
make celery-worker

# Setup frontend in a third terminal
cd ../frontend
npm install
npm run dev
```

Access Points:
- Frontend: http://localhost:3000
- Backend API: http://localhost:3030
- API Docs: http://localhost:3030/docs
- Grafana: http://localhost:3200 (admin/admin)
- Flower: http://localhost:5555
- MinIO Console: http://localhost:9001
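A quick way to confirm the backend came up is to fetch the OpenAPI spec that FastAPI serves by default, assuming the project kept the default /openapi.json path:

```python
# Smoke-test the backend: FastAPI serves its OpenAPI spec at
# /openapi.json by default (assuming the default path is unchanged).
import requests

resp = requests.get("http://localhost:3030/openapi.json", timeout=10)
resp.raise_for_status()
print(f"Backend is up; API exposes {len(resp.json()['paths'])} routes.")
```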
For detailed setup instructions, see the Full Documentation.
Apache 2.0 License. See LICENSE for details.