A Python-based pipeline for parsing email archives and building a temporal knowledge graph using Graphiti.
Why I Built This: I wanted to visualize and explore the complex connections and relationships within this dataset. Knowledge graphs are well suited to traversing relationships between people, organizations, and events over time. Graphiti's LLM-powered entity extraction, combined with Neo4j's graph database capabilities, allows for semantic search and relationship discovery that would be difficult with traditional databases.
This project demonstrates a complete data engineering pipeline that transforms unstructured email archives into a queryable knowledge graph. It showcases skills in:
- Data Processing: Parsing and normalizing complex, multi-format email data
- Knowledge Graph Construction: Building temporal graphs with custom entity schemas
- LLM Integration: Using AI for intelligent entity extraction and relationship mapping
- Data Quality: Implementing duplicate detection, date normalization, and validation pipelines
The knowledge graph enables semantic search, relationship analysis, and temporal queries across thousands of email communications.
| Feature | Description |
|---|---|
| Custom Entity Schema | Pydantic-based schema for Person, Organization, Location, Event, and Document entities |
| Temporal Graph | Time-aware graph structure with normalized datetime handling |
| Multi-language Support | Automatic translation and normalization of dates in French, Slovak, and other languages |
| Duplicate Detection | Advanced similarity matching to identify and deduplicate near-identical emails |
| Semantic Search | AI-powered querying using Graphiti's LLM integration |
| Data Validation | Comprehensive testing and verification pipelines |
| Component | Technology |
|---|---|
| Language | Python 3.8+ |
| Graph Database | Neo4j 5.x |
| Knowledge Graph | Graphiti (Zep) |
| LLM | OpenAI GPT (configurable) |
| Data Processing | Pydantic, pandas, regex |
| Analysis | Jupyter, NetworkX, matplotlib |
This project processes email archives and ingests them into a Graphiti knowledge graph, enabling AI-powered querying and analysis of email threads, participants, and communication patterns.
The knowledge graph uses custom entity types (Person, Organization, Location, Event, Document) and custom relationships (REPRESENTS, ALLEGED_VICTIM_OF, INVESTIGATED_BY, EMPLOYED_BY) defined using Pydantic models to guide Graphiti's LLM extraction. See docs/ENTITY_SCHEMA.md for full schema documentation.
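As a sketch of how such a schema can be declared (the field names below are illustrative, not the project's exact schema — see docs/ENTITY_SCHEMA.md for the real one):

```python
from typing import Optional
from pydantic import BaseModel, Field

class Person(BaseModel):
    """A person mentioned in an email. Docstrings and Field descriptions
    help guide the LLM during Graphiti's entity extraction."""
    full_name: str = Field(..., description="Canonical full name")
    role: Optional[str] = Field(None, description="Role or title, if stated")

class Organization(BaseModel):
    """A company, agency, or institution mentioned in an email."""
    name: str = Field(..., description="Organization name")

# Models like these can be passed to Graphiti when adding episodes,
# e.g. entity_types={"Person": Person, "Organization": Organization}
```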
```bash
pip install -r requirements.txt
```

See docs/SETUP.md for detailed setup instructions. Quick Docker setup:
```bash
# Start Neo4j with Docker
docker run \
  --name neo4j-email-graph \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/emailgraph123 \
  -e NEO4J_PLUGINS='["apoc"]' \
  -d \
  neo4j:5.26-community
```

Create a .env file from the example:
```bash
cp .env.example .env
```

Then edit .env and set your values:
- `NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD` - Neo4j connection settings
- `OPENAI_API_KEY` - Your OpenAI API key (required)

The scripts will automatically load these from the .env file.
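The loading step can be approximated in a few lines of stdlib Python (the scripts likely use a library such as python-dotenv; this is just an illustrative equivalent):

```python
import os

def load_env(path: str = ".env") -> None:
    """Read KEY=VALUE lines from a .env file into os.environ.
    Skips blank lines and comments; ignores quoting edge cases."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault so real environment variables take precedence
            os.environ.setdefault(key.strip(), value.strip())
```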
Or use the quick start script:
```bash
./scripts/utils/quick_start_local.sh
```

```bash
# Ingest pre-processed episodes into Graphiti knowledge graph
python scripts/core/ingest_to_graphiti.py

# Or test with limited data first
python scripts/core/ingest_to_graphiti.py --limit 10
```

Note: The script uses data/graphiti_episodes.json by default. If you need to parse emails first, run:
```bash
python scripts/core/parse_emails.py
```

```
epstein-emails/
├── scripts/
│   ├── core/                      # Main pipeline scripts
│   │   ├── parse_emails.py        # Parse email files
│   │   ├── ingest_to_graphiti.py  # Ingest to knowledge graph
│   │   ├── generate_episodes.py   # Generate episodes from emails
│   │   └── sort_episodes.py       # Sort episodes chronologically
│   ├── utils/                     # Utility scripts
│   │   ├── quick_start_local.sh   # Neo4j setup script
│   │   └── start_jupyter.sh       # Jupyter launcher
│   ├── maintenance/               # Maintenance and debugging scripts
│   └── archive/                   # Deprecated/one-time-use scripts
├── data/
│   ├── raw/                       # Raw email files (gitignored)
│   ├── processed/                 # Cleaned email data (gitignored)
│   ├── outputs/                   # Generated outputs
│   │   ├── episodes/              # Episode JSON files
│   │   └── backups/               # Backup files
│   └── intermediate/              # Intermediate parsing results (gitignored)
├── config/                        # Configuration files
│   ├── manually_fixed_dates.json
│   └── episode_date_issues.json
├── docs/                          # Documentation
│   ├── SETUP.md                   # Setup instructions
│   ├── SETUP_ENV.md               # Environment setup
│   ├── QUERIES.md                 # Query examples
│   └── GRAPHITI.md                # Graphiti details
└── analysis/                      # Analysis notebooks and reports
    ├── notebooks/                 # Jupyter notebooks
    └── reports/                   # Generated analysis reports
```
Ingest pre-processed episodes into the knowledge graph:

```bash
# Test with limited data
python scripts/core/ingest_to_graphiti.py --limit 10

# Full ingestion (uses data/graphiti_episodes.json by default)
python scripts/core/ingest_to_graphiti.py
```

If you need to parse raw email files first:

```bash
python scripts/core/parse_emails.py
```

This creates:

- data/intermediate/graph_db_export.json - Graph structure
- data/intermediate/graphiti_threads.json - Thread data
- data/intermediate/zep_export/zep_documents.json - Documents for ingestion
To generate episodes from parsed emails:
```bash
python scripts/core/generate_episodes.py
```

This creates data/outputs/episodes/graphiti_episodes.json, which can then be ingested.
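The chronological sort performed by sort_episodes.py presumably keys on each episode's timestamp; a minimal sketch, assuming an ISO 8601 reference_time field (the actual file layout may differ):

```python
from datetime import datetime

def sort_episodes(episodes):
    """Sort episode dicts by their ISO-8601 reference_time, oldest first."""
    return sorted(episodes, key=lambda ep: datetime.fromisoformat(ep["reference_time"]))

episodes = [
    {"name": "Re: meeting", "reference_time": "2015-06-03T10:00:00"},
    {"name": "Intro", "reference_time": "2011-01-15T08:30:00"},
]
print([ep["name"] for ep in sort_episodes(episodes)])  # → ['Intro', 'Re: meeting']
```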
Access the Neo4j Browser at http://localhost:7474 or query programmatically:

```python
from graphiti_core import Graphiti

graphiti = Graphiti(
    uri="bolt://localhost:7687",
    user="neo4j",
    password="emailgraph123"
)

results = await graphiti.search(
    query="emails about Trump",
    limit=10
)
```

See docs/QUERIES.md for more query examples.
- Date Normalization: Handled 20+ datetime formats, OCR errors, and multi-language dates (French, Slovak)
- Duplicate Detection: Implemented similarity algorithms to identify exact and near-duplicate pairs across thousands of emails
- Schema Design: Created custom Pydantic models for domain-specific entity extraction
- Data Quality: Built comprehensive validation pipelines to ensure data integrity
- Temporal Episodes: Used Graphiti's episodic model for time-aware graph queries
- Incremental Processing: Designed pipeline to handle large datasets with progress tracking
- Flexible LLM Integration: Support for both OpenAI and local LLM providers (Ollama)
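Two of the items above can be sketched in isolation (the month table and similarity threshold here are illustrative subsets, not the project's actual values):

```python
import re
from difflib import SequenceMatcher

# Map a few French and Slovak month names to month numbers (illustrative subset)
MONTHS = {
    "janvier": 1, "juin": 6, "novembre": 11,   # French
    "januára": 1, "júna": 6, "novembra": 11,   # Slovak (genitive forms)
}

def normalize_date(text: str) -> str:
    """Turn e.g. '3 juin 2015' into '2015-06-03'."""
    m = re.match(r"(\d{1,2})\s+(\w+)\s+(\d{4})", text.strip())
    if not m or m.group(2).lower() not in MONTHS:
        raise ValueError(f"unrecognized date: {text!r}")
    day, month, year = int(m.group(1)), MONTHS[m.group(2).lower()], int(m.group(3))
    return f"{year:04d}-{month:02d}-{day:02d}"

def is_near_duplicate(a: str, b: str, threshold: float = 0.95) -> bool:
    """Flag two email bodies as near-duplicates via sequence similarity."""
    return SequenceMatcher(None, a, b).ratio() >= threshold
```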
The knowledge graph can be visualized in Neo4j Browser, showing:
- Email episodes as nodes with temporal relationships
- Extracted entities (people, organizations) connected to relevant emails
- Communication patterns and relationship networks
Interactive Graph Exploration:
```cypher
// Find all emails involving a specific person
MATCH (p:Entity {name: "Jeffrey Epstein"})-[r]->(e:Episodic)
RETURN e.name, e.reference_time
ORDER BY e.reference_time DESC
LIMIT 10
```

See docs/QUERIES.md for more examples.
| Document | Description |
|---|---|
| Setup Guide | Detailed installation and configuration |
| Entity Schema | Custom entity types and relationships |
| Query Guide | Example queries and patterns |
| Graphiti Guide | Understanding Graphiti integration |
| Schema Reference | Overall graph database schema |
| Project Overview | Methodology, challenges, and solutions |
- Python 3.8+
- Neo4j 5.x (or other graph database)
- OpenAI API key (for entity extraction)
See CONTRIBUTING.md for guidelines on contributing to this project.
This project is licensed under the MIT License - see the LICENSE file for details.
