Extract structured medical information from documents using AI.
Smart Artifact Parser uses Docling for document parsing and Claude for intelligent extraction, converting unstructured medical documents into clean, validated JSON.
- Multi-format support: PDF, DOCX, TXT, and images (PNG, JPG, TIFF, BMP)
- OCR included: Automatic text extraction from scanned documents and images
- Structured output: Validated JSON with Pydantic schemas
- Extensible: Add new extraction fields by modifying the schema
- CLI interface: Simple command-line tool for batch processing
Requires Python 3.12+ and uv.
# Clone the repository
git clone https://github.com/danjamk/smart-artifact-parser.git
cd smart-artifact-parser
# Install dependencies
uv sync

Create a .env file with your Anthropic API key:

cp .env.example .env

Edit .env and add your key:

ANTHROPIC_API_KEY=your_api_key_here
Get an API key at console.anthropic.com.
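
A minimal sketch of how a script could fail fast when the key is missing or was never filled in. The helper name `require_api_key` is hypothetical and not part of this project. It assumes the variable is already present in the process environment (for example, exported in the shell); how the project itself loads `.env` is not shown here.

```python
import os
from collections.abc import Mapping


def require_api_key(env: Mapping[str, str] = os.environ) -> str:
    """Return ANTHROPIC_API_KEY, or raise if it is unset or still the placeholder."""
    key = env.get("ANTHROPIC_API_KEY", "").strip()
    # "your_api_key_here" is the placeholder value from .env.example.
    if not key or key == "your_api_key_here":
        raise RuntimeError("Set ANTHROPIC_API_KEY in .env before running the parser.")
    return key
```

Passing the environment as a parameter keeps the check easy to exercise without touching real credentials.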
# Extract from a document
uv run python -m src.cli document.pdf
# Specify output directory
uv run python -m src.cli document.pdf --output-dir ./results
# Using the installed command (after uv sync)
uv run smart-parser document.pdf

Results are saved as JSON files with timestamped names:
output/document_20250126_143022.json
The parser extracts the following information:
| Field | Description |
|---|---|
| `document_type` | `visit_note`, `lab_result`, `discharge_summary`, `prescription`, `referral`, `imaging_report`, or `other` |
| `document_date` | Date of the document/visit |
| `provider` | Healthcare provider name, specialty, and facility |
| `chief_complaint` | Primary reason for the visit |
| `assessment` | Provider's clinical assessment |
| `diagnoses` | List of diagnoses with optional ICD-10 codes |
| `medications` | Medications with dosage, frequency, and instructions |
| `follow_up_instructions` | Follow-up care instructions |

{
"document_type": "visit_note",
"document_date": "2025-01-15",
"provider": {
"name": "Dr. Jane Smith",
"specialty": "Internal Medicine",
"facility": "City Medical Center"
},
"chief_complaint": "Persistent cough for 2 weeks",
"assessment": "Acute bronchitis, improving",
"diagnoses": [
{
"description": "Acute bronchitis",
"icd_code": "J20.9"
}
],
"medications": [
{
"name": "Amoxicillin",
"dosage": "500mg",
"frequency": "Three times daily",
"instructions": "Take with food for 7 days"
}
],
"follow_up_instructions": "Return if symptoms worsen or fever develops",
"source_file": "visit_note.pdf",
"extracted_at": "2025-01-26T14:30:22.123456"
}

| Format | Extension | Notes |
|---|---|---|
| PDF | .pdf | Native text and scanned (via OCR) |
| Word | .docx | Microsoft Word documents |
| Text | .txt | Plain text files |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | OCR extraction |

To extract additional fields, modify src/schemas.py:
- Add new Pydantic models for complex types
- Add fields to `MedicalDocumentExtraction`
- Claude will automatically extract the new fields
Example - adding patient vitals:
from pydantic import BaseModel, Field


class Vitals(BaseModel):
    blood_pressure: str | None = Field(default=None, description="Blood pressure reading")
    heart_rate: int | None = Field(default=None, description="Heart rate in BPM")
    temperature: float | None = Field(default=None, description="Temperature in Fahrenheit")


class MedicalDocumentExtraction(BaseModel):
    # ... existing fields ...
    vitals: Vitals | None = Field(default=None, description="Patient vital signs")

smart-artifact-parser/
├── src/
│   ├── cli.py          # Typer CLI entry point
│   ├── parser.py       # Document parsing (Docling)
│   ├── extractor.py    # Claude API extraction
│   └── schemas.py      # Pydantic data models
├── output/             # Default output directory
├── pyproject.toml      # Project configuration
└── .env.example        # Environment template

- Parse: Docling converts the document to markdown, preserving tables and structure
- Extract: Claude analyzes the text and extracts structured data using tool_use
- Validate: Pydantic validates the extracted data against the schema
- Output: Results saved as JSON with metadata
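
The Extract step relies on tool_use: the model is given a tool whose `input_schema` is a JSON Schema, and its reply contains a `tool_use` content block with the structured fields. The sketch below shows only the data shapes, using plain dictionaries as they appear in a raw Messages API request and response. The tool name and both helpers are hypothetical, and the actual code in src/extractor.py may be organized differently.

```python
from typing import Any

# Hypothetical tool name; the project's own name may differ.
TOOL_NAME = "record_medical_extraction"


def build_tool(schema: dict[str, Any]) -> dict[str, Any]:
    """Wrap a JSON Schema in the tool definition shape used for tool_use."""
    return {
        "name": TOOL_NAME,
        "description": "Record structured fields extracted from a medical document.",
        "input_schema": schema,
    }


def tool_input(response: dict[str, Any]) -> dict[str, Any]:
    """Return the input of the first matching tool_use block in a raw response body."""
    for block in response.get("content", []):
        if block.get("type") == "tool_use" and block.get("name") == TOOL_NAME:
            return block["input"]
    raise ValueError(f"response contained no tool_use block for {TOOL_NAME}")
```

The dictionary returned by `tool_input` is what the Validate step would then check against the Pydantic schema before the result is written out.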
- Python 3.12+
- uv package manager
- Anthropic API key
- docling - Document parsing
- anthropic - Claude API client
- pydantic - Data validation
- typer - CLI framework
MIT License - see LICENSE for details.
Contributions are welcome! Please feel free to submit a Pull Request.