A production-ready document intelligence pipeline that classifies email attachments from .eml files as relevant or irrelevant, using only the email's HTML body context.
This system leverages the Anthropic Claude API for contextual reasoning and includes an evaluation module for performance benchmarking against ground truth data.
The project processes .eml email files and performs the following:

- Extracts:
  - HTML body
  - Attachment filenames
- Classifies each attachment into exactly one category: `relevant` or `irrelevant`
- Generates structured JSON output files
- Evaluates predictions against labeled ground truth using standard classification metrics
Classification must rely exclusively on the email's HTML body.
The following information must not be used as a classification signal:

- Attachment contents
- MIME types
- Filenames (they are passed to the model only as identifiers for labeling)
- Headers
- Any other metadata
This constraint simulates real-world scenarios where reasoning must be based solely on rendered email content.
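To make the constraint concrete, here is a minimal sketch (not the project's actual implementation; `extract_eml` is a hypothetical helper name) of pulling only the HTML body and attachment filenames out of an .eml file with Python's standard library:

```python
# Minimal sketch (assumed structure): extract only the HTML body and
# attachment filenames from an .eml file with the stdlib parser.
# Attachment bytes, MIME types, and other headers are never inspected.
import email
from email import policy

def extract_eml(path):
    with open(path, "rb") as f:
        msg = email.message_from_binary_file(f, policy=policy.default)
    html_body = ""
    filenames = []
    for part in msg.walk():
        if (part.get_content_type() == "text/html"
                and part.get_content_disposition() != "attachment"):
            html_body += part.get_content()
        elif part.get_filename():
            # record the filename only, as an identifier for the output JSON
            filenames.append(part.get_filename())
    return html_body, filenames
```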
```
doczen/
│
├── examples/
│   ├── example_00001.eml
│   ├── example_00002.eml
│   └── ...
│
├── ground_truth/
│   ├── attachments_00001.json
│   ├── attachments_00002.json
│   └── ...
│
├── output/
│
├── classify_attachments.py
├── evaluate.py
├── requirements.txt
└── README.md
```
```
git clone https://github.com/your-org/doczen.git
cd doczen
```

Create and activate a virtual environment:

```
python -m venv venv
source venv/bin/activate    # macOS/Linux
venv\Scripts\activate       # Windows
```

Install dependencies:

```
pip install -r requirements.txt
```

Example `requirements.txt`:

```
anthropic
beautifulsoup4
tqdm
scikit-learn
```
Set your Anthropic API key:

```
# macOS/Linux
export ANTHROPIC_API_KEY=your_key_here

# Windows
set ANTHROPIC_API_KEY=your_key_here
```

`classify_attachments.py` reads .eml files from `examples/`, extracts HTML content and attachment filenames, and classifies attachments using Claude.
```
python classify_attachments.py
```

Generated files:

```
output/
  attachments_00001.json
  attachments_00002.json
  ...
```
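The naming convention pairs each input file with its output file. A small sketch of deriving the output filename (the convention is inferred from the directory layout above; `output_path` is a hypothetical helper):

```python
# Sketch of the assumed naming convention:
# examples/example_XXXXX.eml -> output/attachments_XXXXX.json
import os
import re

def output_path(eml_name, out_dir="output"):
    m = re.match(r"example_(\d+)\.eml$", eml_name)
    if not m:
        raise ValueError(f"unexpected input filename: {eml_name}")
    return os.path.join(out_dir, f"attachments_{m.group(1)}.json")
```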
Example output:
```json
{
  "relevant": [
    "example_00001_attachment_02.pdf"
  ],
  "irrelevant": [
    "example_00001_attachment_01.jpg"
  ]
}
```

Each attachment must appear in exactly one category.
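The exactly-one-category rule can be checked mechanically. A sketch, assuming the JSON schema shown above (`validate_output` is a hypothetical helper):

```python
# Sketch: verify that every attachment appears in exactly one category,
# assuming the {"relevant": [...], "irrelevant": [...]} output schema.
def validate_output(result, filenames):
    combined = result.get("relevant", []) + result.get("irrelevant", [])
    if sorted(combined) != sorted(filenames):
        raise ValueError("each attachment must appear in exactly one category")
    return True
```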
The model receives:
- Full HTML body
- List of attachment filenames
It is instructed to:
- Identify attachments materially referenced in the email
- Detect decorative or structural HTML elements (logos, icons, signature images)
- Return strictly structured JSON output
- Avoid explanations
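A sketch of how such a prompt and Claude call might look; the prompt wording, helper name, and model ID are assumptions, not the project's actual prompt:

```python
# Hypothetical prompt construction; wording is illustrative only.
import json

def build_prompt(html_body, filenames):
    return (
        "Classify each attachment as relevant or irrelevant using ONLY the "
        "email HTML body below. Treat logos, icons, and signature images as "
        "irrelevant. Return strict JSON of the form "
        '{"relevant": [...], "irrelevant": [...]} with no explanation.\n\n'
        f"Attachment filenames: {json.dumps(filenames)}\n\n"
        f"HTML body:\n{html_body}"
    )

# The call itself (requires ANTHROPIC_API_KEY to be set):
# import anthropic
# client = anthropic.Anthropic()
# resp = client.messages.create(
#     model="claude-3-5-sonnet-20241022",  # assumed model ID
#     max_tokens=1024,
#     messages=[{"role": "user", "content": build_prompt(body, names)}],
# )
# result = json.loads(resp.content[0].text)
```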
Compares generated outputs against ground truth labels.

```
python evaluate.py
```

Reported metrics:

- Accuracy
- Precision
- Recall
- F1 Score
- Per-file breakdown
- Macro-averaged summary
Each attachment is treated as a binary classification:
- Positive → relevant
- Negative → irrelevant
Ground truth files must match output naming format:
ground_truth/attachments_00001.json
Evaluation compares attachment-level predictions against reference labels.
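The per-file metrics can be computed directly from the two label sets. A pure-Python sketch (`score_file` is a hypothetical helper; `evaluate.py` may compute the same numbers with scikit-learn):

```python
# Pure-Python sketch of the per-file metrics ("relevant" = positive class).
def score_file(pred, truth):
    truth_pos = set(truth["relevant"])
    truth_neg = set(truth["irrelevant"])
    pred_pos = set(pred.get("relevant", []))
    tp = len(truth_pos & pred_pos)   # relevant, predicted relevant
    fp = len(truth_neg & pred_pos)   # irrelevant, predicted relevant
    fn = len(truth_pos - pred_pos)   # relevant, predicted irrelevant
    tn = len(truth_neg - pred_pos)   # irrelevant, predicted irrelevant
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / total if total else 0.0,
            "precision": precision, "recall": recall, "f1": f1}
```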
Strict JSON formatting enables automated validation and evaluation.
- `classify_attachments.py` handles inference
- `evaluate.py` handles benchmarking
Consistent file naming and structured outputs ensure experiment tracking.
The classification pipeline includes:
- API retry handling
- JSON schema validation
- Attachment coverage verification
- Logging for malformed responses
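The retry handling above can be sketched as a small exponential-backoff wrapper (a generic sketch; the pipeline's actual retry policy and exception types are assumptions):

```python
# Generic exponential-backoff sketch for transient API errors.
import random
import time

def with_retries(call, max_attempts=5, base_delay=1.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception:  # in practice, catch the SDK's transient error types
            if attempt == max_attempts - 1:
                raise
            # back off: base, 2x, 4x, ... plus a little jitter
            time.sleep(base_delay * 2 ** attempt + random.uniform(0, base_delay))
```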
```
# Step 1: Generate classifications
python classify_attachments.py

# Step 2: Evaluate performance
python evaluate.py
```

For production use, consider adding:

- Rate limiting and exponential backoff
- Deterministic JSON validation
- Cost monitoring for API usage
- Parallel processing support
- Prompt versioning
- CI-based regression evaluation
This pipeline can be extended to support:
- Confidence scoring
- Multi-class categorization
- Prompt optimization experiments
- Async API batching
- Docker deployment
- Model comparison benchmarking
MIT License
Copyright (c) 2026 Will