A Python-based pipeline for parsing email archives and building a temporal knowledge graph using Graphiti.
Why I Built This: I wanted to visualize and explore the complex connections and relationships within this dataset. Knowledge graphs are well suited to traversing relationships between people, organizations, and events over time. Graphiti's LLM-powered entity extraction, combined with Neo4j's graph database capabilities, allows for semantic search and relationship discovery that would be difficult with traditional databases.
This project demonstrates a complete data engineering pipeline that transforms unstructured email archives into a queryable knowledge graph. It showcases skills in:
- Data Processing: Parsing and normalizing complex, multi-format email data
- Knowledge Graph Construction: Building temporal graphs with custom entity schemas
- LLM Integration: Using AI for intelligent entity extraction and relationship mapping
- Data Quality: Implementing duplicate detection, date normalization, and validation pipelines
The knowledge graph enables semantic search, relationship analysis, and temporal queries across thousands of email communications.
| Feature | Description |
|---|---|
| Custom Entity Schema | Pydantic-based schema for Person, Organization, Location, Event, and Document entities |
| Temporal Graph | Time-aware graph structure with normalized datetime handling |
| Multi-language Support | Automatic translation and normalization of dates in French, Slovak, and other languages |
| Duplicate Detection | Advanced similarity matching to identify and deduplicate near-identical emails |
| Semantic Search | AI-powered querying using Graphiti's LLM integration |
| Data Validation | Comprehensive testing and verification pipelines |
| Component | Technology |
|---|---|
| Language | Python 3.8+ |
| Graph Database | Neo4j 5.x |
| Knowledge Graph | Graphiti (Zep) |
| LLM | OpenAI GPT (configurable) |
| Data Processing | Pydantic, pandas, regex |
| Analysis | Jupyter, NetworkX, matplotlib |
This project processes email archives and ingests them into a Graphiti knowledge graph, enabling AI-powered querying and analysis of email threads, participants, and communication patterns.
The knowledge graph uses custom entity types (Person, Organization, Location, Event, Document) and custom relationships (REPRESENTS, ALLEGED_VICTIM_OF, INVESTIGATED_BY, EMPLOYED_BY) defined using Pydantic models to guide Graphiti's LLM extraction. See docs/ENTITY_SCHEMA.md for full schema documentation.
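As a sketch of how such a schema can be declared (the field names below are illustrative, not the project's exact schema — see docs/ENTITY_SCHEMA.md for the real one):

```python
from typing import Optional
from pydantic import BaseModel, Field

class Person(BaseModel):
    """A person mentioned in an email. Docstrings and Field descriptions
    help guide the LLM during Graphiti's entity extraction."""
    full_name: str = Field(..., description="Canonical full name")
    role: Optional[str] = Field(None, description="Role or title, if stated")

class Organization(BaseModel):
    """A company, agency, or institution mentioned in an email."""
    name: str = Field(..., description="Organization name")

# Models like these can be passed to Graphiti when adding episodes,
# e.g. entity_types={"Person": Person, "Organization": Organization}
```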
```bash
pip install -r requirements.txt
```

See docs/SETUP.md for detailed setup instructions. Quick Docker setup:
```bash
# Start Neo4j with Docker
docker run \
  --name neo4j-email-graph \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/emailgraph123 \
  -e NEO4J_PLUGINS='["apoc"]' \
  -d \
  neo4j:5.26-community
```

Create a .env file from the example:
```bash
cp .env.example .env
```

Then edit .env and set your values:
- `NEO4J_URI`, `NEO4J_USER`, `NEO4J_PASSWORD` - Neo4j connection settings
- `OPENAI_API_KEY` - Your OpenAI API key (required)

The scripts will automatically load these from the .env file.
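The loading step can be approximated in a few lines of stdlib Python (the scripts likely use a library such as python-dotenv; this is just an illustrative equivalent):

```python
import os

def load_env(path: str = ".env") -> None:
    """Read KEY=VALUE lines from a .env file into os.environ.
    Skips blank lines and comments; ignores quoting edge cases."""
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            # setdefault so real environment variables take precedence
            os.environ.setdefault(key.strip(), value.strip())
```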
Or use the quick start script:
```bash
./scripts/utils/quick_start_local.sh
```

```bash
# Ingest pre-processed episodes into Graphiti knowledge graph
python scripts/core/ingest_to_graphiti.py

# Or test with limited data first
python scripts/core/ingest_to_graphiti.py --limit 10
```

Note: The script uses data/graphiti_episodes.json by default. If you need to parse emails first, run:
```bash
python scripts/core/parse_emails.py
```

```
epstein-emails/
├── scripts/
│   ├── core/                      # Main pipeline scripts
│   │   ├── parse_emails.py        # Parse email files
│   │   ├── ingest_to_graphiti.py  # Ingest to knowledge graph
│   │   ├── generate_episodes.py   # Generate episodes from emails
│   │   └── sort_episodes.py       # Sort episodes chronologically
│   ├── utils/                     # Utility scripts
│   │   ├── quick_start_local.sh   # Neo4j setup script
│   │   └── start_jupyter.sh       # Jupyter launcher
│   ├── maintenance/               # Maintenance and debugging scripts
│   └── archive/                   # Deprecated/one-time-use scripts
├── data/
│   ├── raw/                       # Raw email files (gitignored)
│   ├── processed/                 # Cleaned email data (gitignored)
│   ├── outputs/                   # Generated outputs
│   │   ├── episodes/              # Episode JSON files
│   │   └── backups/               # Backup files
│   └── intermediate/              # Intermediate parsing results (gitignored)
├── config/                        # Configuration files
│   ├── manually_fixed_dates.json
│   └── episode_date_issues.json
├── docs/                          # Documentation
│   ├── SETUP.md                   # Setup instructions
│   ├── SETUP_ENV.md               # Environment setup
│   ├── QUERIES.md                 # Query examples
│   └── GRAPHITI.md                # Graphiti details
└── analysis/                      # Analysis notebooks and reports
    ├── notebooks/                 # Jupyter notebooks
    └── reports/                   # Generated analysis reports
```
Ingest pre-processed episodes into the knowledge graph:

```bash
# Test with limited data
python scripts/core/ingest_to_graphiti.py --limit 10

# Full ingestion (uses data/graphiti_episodes.json by default)
python scripts/core/ingest_to_graphiti.py
```

If you need to parse raw email files first:

```bash
python scripts/core/parse_emails.py
```

This creates:

- data/intermediate/graph_db_export.json - Graph structure
- data/intermediate/graphiti_threads.json - Thread data
- data/intermediate/zep_export/zep_documents.json - Documents for ingestion
To generate episodes from parsed emails:
```bash
python scripts/core/generate_episodes.py
```

This creates data/outputs/episodes/graphiti_episodes.json, which can then be ingested.
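The chronological sort performed by sort_episodes.py presumably keys on each episode's timestamp; a minimal sketch, assuming an ISO 8601 reference_time field (the actual file layout may differ):

```python
from datetime import datetime

def sort_episodes(episodes):
    """Sort episode dicts by their ISO-8601 reference_time, oldest first."""
    return sorted(episodes, key=lambda ep: datetime.fromisoformat(ep["reference_time"]))

episodes = [
    {"name": "Re: meeting", "reference_time": "2015-06-03T10:00:00"},
    {"name": "Intro", "reference_time": "2011-01-15T08:30:00"},
]
print([ep["name"] for ep in sort_episodes(episodes)])  # → ['Intro', 'Re: meeting']
```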
Access the Neo4j Browser at http://localhost:7474 or query programmatically:

```python
from graphiti_core import Graphiti

graphiti = Graphiti(
    uri="bolt://localhost:7687",
    user="neo4j",
    password="emailgraph123"
)

results = await graphiti.search(
    query="emails about Trump",
    limit=10
)
```

See docs/QUERIES.md for more query examples.
- Date Normalization: Handled 20+ datetime formats, OCR errors, and multi-language dates (French, Slovak)
- Duplicate Detection: Implemented similarity algorithms to identify exact and near-duplicate pairs across thousands of emails
- Schema Design: Created custom Pydantic models for domain-specific entity extraction
- Data Quality: Built comprehensive validation pipelines to ensure data integrity
- Temporal Episodes: Used Graphiti's episodic model for time-aware graph queries
- Incremental Processing: Designed pipeline to handle large datasets with progress tracking
- Flexible LLM Integration: Support for both OpenAI and local LLM providers (Ollama)
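Two of the items above can be sketched in isolation (the month table and similarity threshold here are illustrative subsets, not the project's actual values):

```python
import re
from difflib import SequenceMatcher

# Map a few French and Slovak month names to month numbers (illustrative subset)
MONTHS = {
    "janvier": 1, "juin": 6, "novembre": 11,   # French
    "januára": 1, "júna": 6, "novembra": 11,   # Slovak (genitive forms)
}

def normalize_date(text: str) -> str:
    """Turn e.g. '3 juin 2015' into '2015-06-03'."""
    m = re.match(r"(\d{1,2})\s+(\w+)\s+(\d{4})", text.strip())
    if not m or m.group(2).lower() not in MONTHS:
        raise ValueError(f"unrecognized date: {text!r}")
    day, month, year = int(m.group(1)), MONTHS[m.group(2).lower()], int(m.group(3))
    return f"{year:04d}-{month:02d}-{day:02d}"

def is_near_duplicate(a: str, b: str, threshold: float = 0.95) -> bool:
    """Flag two email bodies as near-duplicates via sequence similarity."""
    return SequenceMatcher(None, a, b).ratio() >= threshold
```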
The knowledge graph can be visualized in Neo4j Browser, showing:
- Email episodes as nodes with temporal relationships
- Extracted entities (people, organizations) connected to relevant emails
- Communication patterns and relationship networks
Interactive Graph Exploration:
```cypher
// Find all emails involving a specific person
MATCH (p:Entity {name: "Jeffrey Epstein"})-[r]->(e:Episodic)
RETURN e.name, e.reference_time
ORDER BY e.reference_time DESC
LIMIT 10
```

See docs/QUERIES.md for more examples.
| Document | Description |
|---|---|
| Setup Guide | Detailed installation and configuration |
| Entity Schema | Custom entity types and relationships |
| Query Guide | Example queries and patterns |
| Graphiti Guide | Understanding Graphiti integration |
| Schema Reference | Overall graph database schema |
| Project Overview | Methodology, challenges, and solutions |
- Python 3.8+
- Neo4j 5.x (or other graph database)
- OpenAI API key (for entity extraction)
See CONTRIBUTING.md for guidelines on contributing to this project.
This project is licensed under the MIT License - see the LICENSE file for details.
