Transform email archives into a queryable knowledge graph using Graphiti & Neo4j. LLM-powered entity extraction, temporal relationships, semantic search.

kev-hu/epstein-emails

Epstein Emails Knowledge Graph

A Python-based pipeline for parsing email archives and building a temporal knowledge graph using Graphiti.

Neo4j Graph Visualization - 2-hop network from Jeffrey Epstein

📋 About This Project

This project demonstrates a complete data engineering pipeline that transforms unstructured email archives into a queryable knowledge graph. It showcases skills in:

  • Data Processing: Parsing and normalizing complex, multi-format email data
  • Knowledge Graph Construction: Building temporal graphs with custom entity schemas
  • LLM Integration: Using AI for intelligent entity extraction and relationship mapping
  • Data Quality: Implementing duplicate detection, date normalization, and validation pipelines

Why I Built This: I wanted to visualize and explore the complex connections and relationships within this dataset. Knowledge graphs are perfect for traversing relationships between people, organizations, and events over time. Graphiti's LLM-powered entity extraction combined with Neo4j's graph database capabilities allows for semantic search and relationship discovery that would be difficult with traditional databases.

The knowledge graph enables semantic search, relationship analysis, and temporal queries across thousands of email communications.

✨ Key Features

| Feature | Description |
|---|---|
| Custom Entity Schema | Pydantic-based schema for Person, Organization, Location, Event, and Document entities |
| Temporal Graph | Time-aware graph structure with normalized datetime handling |
| Multi-language Support | Automatic translation and normalization of dates in French, Slovak, and other languages |
| Duplicate Detection | Advanced similarity matching to identify and deduplicate near-identical emails |
| Semantic Search | AI-powered querying using Graphiti's LLM integration |
| Data Validation | Comprehensive testing and verification pipelines |
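
The duplicate-detection step listed above is not shown in this README. As a rough illustration only, the sketch below scores every pair of email bodies with the standard library's `difflib.SequenceMatcher`. The `normalize` and `find_near_duplicates` helpers, the 0.9 threshold, and the sample emails are assumptions made for this example, not the project's actual algorithm.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations
from typing import Dict, List, Tuple


def normalize(body: str) -> str:
    """Drop quoted-reply markers, collapse whitespace, and lowercase."""
    body = re.sub(r"^\s*>+\s?", "", body, flags=re.MULTILINE)
    return re.sub(r"\s+", " ", body).strip().lower()


def find_near_duplicates(
    emails: Dict[str, str], threshold: float = 0.9
) -> List[Tuple[str, str, float]]:
    """Return (id_a, id_b, ratio) for every pair at or above the threshold."""
    normalized = {eid: normalize(text) for eid, text in emails.items()}
    pairs = []
    for a, b in combinations(sorted(normalized), 2):
        ratio = SequenceMatcher(None, normalized[a], normalized[b]).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 3)))
    return pairs


# Made-up sample data: e2 is a quoted copy of e1 with extra whitespace.
emails = {
    "e1": "Please call me tomorrow about the flight.",
    "e2": "> Please call me   tomorrow about the flight.",
    "e3": "Lunch on Friday?",
}
print(find_near_duplicates(emails))
```

Comparing every pair is quadratic in the number of emails, so a pipeline handling thousands of messages would probably narrow the candidates first, for example by subject line or a content hash.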

🛠️ Technical Stack

| Component | Technology |
|---|---|
| Language | Python 3.8+ |
| Graph Database | Neo4j 5.x |
| Knowledge Graph | Graphiti (Zep) |
| LLM | OpenAI GPT (configurable) |
| Data Processing | Pydantic, pandas, regex |
| Analysis | Jupyter, NetworkX, matplotlib |

🔍 Overview

This project processes email archives and ingests them into a Graphiti knowledge graph, enabling AI-powered querying and analysis of email threads, participants, and communication patterns.

The knowledge graph uses custom entity types (Person, Organization, Location, Event, Document) and custom relationships (REPRESENTS, ALLEGED_VICTIM_OF, INVESTIGATED_BY, EMPLOYED_BY) defined using Pydantic models to guide Graphiti's LLM extraction. See docs/ENTITY_SCHEMA.md for full schema documentation.
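
The real models live in the repository and are documented in docs/ENTITY_SCHEMA.md. As a hedged illustration of the general shape, the sketch below defines two entity types as Pydantic models. The attribute names (`occupation`, `org_type`) and the `ENTITY_TYPES` registry are hypothetical; attributes such as `name` are deliberately avoided here because Graphiti's entity nodes already carry a name of their own.

```python
from typing import Optional

from pydantic import BaseModel, Field


class Person(BaseModel):
    """An individual who sends, receives, or is mentioned in an email."""

    occupation: Optional[str] = Field(
        default=None, description="Role or profession, if stated in the text"
    )


class Organization(BaseModel):
    """A company, law firm, agency, or other institution."""

    org_type: Optional[str] = Field(
        default=None, description="Kind of organization, e.g. law firm or bank"
    )


# Hypothetical registry mapping type names to models. A mapping of this shape
# is what Graphiti's add_episode() accepts through its entity_types argument.
ENTITY_TYPES = {"Person": Person, "Organization": Organization}

print(Person(occupation="attorney"))
```

The class docstrings and field descriptions matter here because they are what guide the LLM during extraction.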

🚀 Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Set Up Database

See docs/SETUP.md for detailed setup instructions. Quick Docker setup:

# Start Neo4j with Docker
docker run \
    --name neo4j-email-graph \
    -p 7474:7474 -p 7687:7687 \
    -e NEO4J_AUTH=neo4j/emailgraph123 \
    -e NEO4J_PLUGINS='["apoc"]' \
    -d \
    neo4j:5.26-community

3. Configure Environment Variables

Create a .env file from the example:

cp .env.example .env

Then edit .env and set your values:

  • NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD - Neo4j connection
  • OPENAI_API_KEY - Your OpenAI API key (required)

The scripts load these values from the .env file automatically.
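
How the scripts read .env is not shown in this README (a library such as python-dotenv is a common choice). The standard-library sketch below is one way to do it. The helper names `parse_env`, `load_env_file`, and `get_settings` are hypothetical, and the default URI and user simply mirror the Docker command above.

```python
import os
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blanks, comments, and malformed lines."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env_file(path: str = ".env") -> dict:
    """Read a .env file if it exists; return an empty dict otherwise."""
    env_path = Path(path)
    if not env_path.exists():
        return {}
    return parse_env(env_path.read_text(encoding="utf-8"))


def get_settings(path: str = ".env") -> dict:
    """Merge .env values with the process environment (environment wins)."""
    merged = {**load_env_file(path), **os.environ}
    if not merged.get("OPENAI_API_KEY"):
        raise RuntimeError("OPENAI_API_KEY is required")
    return {
        "uri": merged.get("NEO4J_URI", "bolt://localhost:7687"),
        "user": merged.get("NEO4J_USER", "neo4j"),
        "password": merged.get("NEO4J_PASSWORD", ""),
        "openai_api_key": merged["OPENAI_API_KEY"],
    }
```

Failing fast on a missing `OPENAI_API_KEY` surfaces the configuration problem before any Neo4j connection or LLM call is attempted.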

Or use the quick start script:

./scripts/utils/quick_start_local.sh

4. Ingest Emails

# Ingest pre-processed episodes into Graphiti knowledge graph
python scripts/core/ingest_to_graphiti.py

# Or test with limited data first
python scripts/core/ingest_to_graphiti.py --limit 10

Note: The script uses data/graphiti_episodes.json by default. If you need to parse emails first, run:

python scripts/core/parse_emails.py

📁 Project Structure

epstein-emails/
├── scripts/
│   ├── core/          # Main pipeline scripts
│   │   ├── parse_emails.py          # Parse email files
│   │   ├── ingest_to_graphiti.py    # Ingest to knowledge graph
│   │   ├── generate_episodes.py     # Generate episodes from emails
│   │   └── sort_episodes.py         # Sort episodes chronologically
│   ├── utils/         # Utility scripts
│   │   ├── quick_start_local.sh     # Neo4j setup script
│   │   └── start_jupyter.sh         # Jupyter launcher
│   ├── maintenance/   # Maintenance and debugging scripts
│   └── archive/       # Deprecated/one-time-use scripts
├── data/
│   ├── raw/           # Raw email files (gitignored)
│   ├── processed/     # Cleaned email data (gitignored)
│   ├── outputs/       # Generated outputs
│   │   ├── episodes/  # Episode JSON files
│   │   └── backups/   # Backup files
│   └── intermediate/  # Intermediate parsing results (gitignored)
├── config/            # Configuration files
│   ├── manually_fixed_dates.json
│   └── episode_date_issues.json
├── docs/              # Documentation
│   ├── SETUP.md       # Setup instructions
│   ├── SETUP_ENV.md   # Environment setup
│   ├── QUERIES.md     # Query examples
│   └── GRAPHITI.md    # Graphiti details
└── analysis/          # Analysis notebooks and reports
    ├── notebooks/     # Jupyter notebooks
    └── reports/       # Generated analysis reports

💻 Usage

Ingest to Graphiti

Ingest pre-processed episodes into the knowledge graph:

# Test with limited data
python scripts/core/ingest_to_graphiti.py --limit 10

# Full ingestion (uses data/graphiti_episodes.json by default)
python scripts/core/ingest_to_graphiti.py

Parse Emails (Optional)

If you need to parse raw email files first:

python scripts/core/parse_emails.py

This creates:

  • data/intermediate/graph_db_export.json - Graph structure
  • data/intermediate/graphiti_threads.json - Thread data
  • data/intermediate/zep_export/zep_documents.json - Documents for ingestion

To generate episodes from parsed emails:

python scripts/core/generate_episodes.py

This creates data/outputs/episodes/graphiti_episodes.json which can then be ingested.
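
Because the graph is temporal, episodes are presumably ingested oldest-first (the repository includes sort_episodes.py for this). The sketch below shows one way such a sort could work. The episode fields `name` and `reference_time` and the sample records are assumptions for illustration; the real episode file may use a different layout.

```python
import json
from datetime import datetime, timezone


def parse_time(value: str) -> datetime:
    """Parse an ISO-8601 timestamp, treating naive values as UTC."""
    dt = datetime.fromisoformat(value)
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)


def sort_episodes(episodes: list) -> list:
    """Return episodes ordered oldest-first by their reference_time."""
    return sorted(episodes, key=lambda ep: parse_time(ep["reference_time"]))


# Made-up records standing in for the contents of an episode JSON file.
episodes = json.loads("""[
  {"name": "thread-b", "reference_time": "2011-03-02T09:15:00+00:00"},
  {"name": "thread-a", "reference_time": "2009-07-21T18:30:00"}
]""")
print([ep["name"] for ep in sort_episodes(episodes)])
```

Normalizing naive timestamps to UTC before comparing avoids the `TypeError` Python raises when naive and timezone-aware datetimes are mixed in one sort.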

Query the Graph

Access Neo4j Browser at http://localhost:7474 or query programmatically:

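import asyncio

from graphiti_core import Graphiti


async def main():
    # Connect with the credentials from the Docker command above
    graphiti = Graphiti(
        uri="bolt://localhost:7687",
        user="neo4j",
        password="emailgraph123"
    )
    try:
        # search() is a coroutine, so it must be awaited inside an async function
        results = await graphiti.search(
            query="emails about Trump",
            num_results=10
        )
        for edge in results:
            print(edge.fact)  # each result is an edge carrying an extracted fact
    finally:
        await graphiti.close()


asyncio.run(main())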

See docs/QUERIES.md for more query examples.

🌟 Project Highlights

Data Processing Challenges Solved

  • Date Normalization: Handled 20+ datetime formats, OCR errors, and multi-language dates (French, Slovak)
  • Duplicate Detection: Implemented similarity matching to identify exact (100% similarity) duplicate pairs across thousands of emails
  • Schema Design: Created custom Pydantic models for domain-specific entity extraction
  • Data Quality: Built comprehensive validation pipelines to ensure data integrity
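
To make the date-normalization challenge concrete, here is a minimal sketch that translates French month names to English and then tries a short list of formats in turn. The month table, the four formats, and the `normalize_date` helper are assumptions for illustration; the project reports handling 20+ formats plus OCR errors and Slovak dates, none of which this sketch covers. It also assumes an English (or C) locale, since `%B` matching is locale-dependent.

```python
import re
from datetime import datetime
from typing import Optional

# French month names mapped to English; other languages would extend this table.
FOREIGN_MONTHS = {
    "janvier": "January", "février": "February", "mars": "March",
    "avril": "April", "mai": "May", "juin": "June", "juillet": "July",
    "août": "August", "septembre": "September", "octobre": "October",
    "novembre": "November", "décembre": "December",
}

# Candidate formats, tried in order until one parses.
FORMATS = [
    "%d %B %Y %H:%M",
    "%B %d, %Y %I:%M %p",
    "%m/%d/%Y %H:%M",
    "%Y-%m-%d %H:%M:%S",
]


def normalize_date(raw: str) -> Optional[str]:
    """Return an ISO-8601 string, or None when no known format matches."""
    text = raw.strip().lower()
    for foreign, english in FOREIGN_MONTHS.items():
        text = re.sub(rf"\b{re.escape(foreign)}\b", english, text)
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).isoformat()
        except ValueError:
            continue
    return None


print(normalize_date("12 février 2015 14:30"))
print(normalize_date("March 3, 2011 4:05 PM"))
print(normalize_date("not a date"))
```

Returning `None` instead of raising lets unparseable values be collected for manual review, which fits the presence of config/manually_fixed_dates.json in the repository.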

Architecture Decisions

  • Temporal Episodes: Used Graphiti's episodic model for time-aware graph queries
  • Incremental Processing: Designed pipeline to handle large datasets with progress tracking
  • Flexible LLM Integration: Support for both OpenAI and local LLM providers (Ollama)
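
The incremental-processing idea can be sketched as a checkpointed loop that records each ingested episode, so an interrupted run resumes where it stopped. Everything here is hypothetical: the `ingest_with_checkpoint` helper, the checkpoint file, and the `ingest_one` callback stand in for whatever the real ingestion script does.

```python
import json
import tempfile
from pathlib import Path


def ingest_with_checkpoint(episodes, ingest_one, checkpoint_path, limit=None):
    """Ingest episodes not yet recorded in the checkpoint; return the count."""
    path = Path(checkpoint_path)
    done = set(json.loads(path.read_text())) if path.exists() else set()
    processed = 0
    for episode in episodes:
        if limit is not None and processed >= limit:
            break
        if episode["name"] in done:
            continue  # already ingested in an earlier run
        ingest_one(episode)
        done.add(episode["name"])
        processed += 1
        # Persist after every episode so a crash loses at most one item.
        path.write_text(json.dumps(sorted(done)))
        print(f"[{len(done)}/{len(episodes)}] ingested {episode['name']}")
    return processed


# Demo with made-up episodes: a limited first run, then a resumed full run.
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "ingested.json"
    episodes = [{"name": f"ep-{i}"} for i in range(5)]
    seen = []
    first = ingest_with_checkpoint(episodes, seen.append, checkpoint, limit=2)
    second = ingest_with_checkpoint(episodes, seen.append, checkpoint)
```

The `limit` parameter mirrors the `--limit` flag shown in the Quick Start, which makes it cheap to test the pipeline on a few episodes before paying for LLM calls on the full dataset.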

📊 Visual Elements

Graph Visualization

The knowledge graph can be visualized in Neo4j Browser, showing:

  • Email episodes as nodes with temporal relationships
  • Extracted entities (people, organizations) connected to relevant emails
  • Communication patterns and relationship networks

Neo4j Graph Visualization - 2-hop network from Jeffrey Epstein

🔍 Interactive Graph Exploration
Visualizing a 2-hop network from "Jeffrey Epstein" showing relationships via MENTIONS and RELATES_TO edges. The graph reveals connections between people, organizations, and locations extracted from email communications.
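
For readers unfamiliar with the term, a 2-hop network is every node reachable from a starting node in at most two edges. The sketch below computes that with a breadth-first search over a plain adjacency dictionary. The `k_hop_neighborhood` helper and the placeholder graph are illustrative assumptions, not data from the project.

```python
from collections import deque


def k_hop_neighborhood(adjacency, start, max_hops=2):
    """Breadth-first search returning {node: hop distance} within max_hops."""
    distances = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


# Placeholder graph: A-B, A-C, B-D, D-E (undirected, listed both ways).
adjacency = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A"],
    "D": ["B", "E"],
    "E": ["D"],
}
print(k_hop_neighborhood(adjacency, "A"))
```

Node E sits three hops from A, so it is excluded from the result.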

Example Queries

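// Find the most recent emails (episodes) that mention a specific person.
// Graphiti writes MENTIONS edges from Episodic nodes to Entity nodes and
// stores each episode's reference time in the valid_at property.
MATCH (e:Episodic)-[:MENTIONS]->(p:Entity {name: "Jeffrey Epstein"})
RETURN e.name, e.valid_at
ORDER BY e.valid_at DESC
LIMIT 10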

See docs/QUERIES.md for more examples.

📚 Documentation

| Document | Description |
|---|---|
| Setup Guide | Detailed installation and configuration |
| Entity Schema | Custom entity types and relationships |
| Query Guide | Example queries and patterns |
| Graphiti Guide | Understanding Graphiti integration |
| Schema Reference | Overall graph database schema |
| Project Overview | Methodology, challenges, and solutions |

📦 Requirements

  • Python 3.8+
  • Neo4j 5.x (or other graph database)
  • OpenAI API key (for entity extraction)

🤝 Contributing

See CONTRIBUTING.md for guidelines on contributing to this project.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
