Smart Artifact Parser

Extract structured medical information from documents using AI.

Smart Artifact Parser uses Docling for document parsing and Claude for intelligent extraction, converting unstructured medical documents into clean, validated JSON.

Features

  • Multi-format support: PDF, DOCX, TXT, and images (PNG, JPG, TIFF, BMP)
  • OCR included: Automatic text extraction from scanned documents and images
  • Structured output: Validated JSON with Pydantic schemas
  • Extensible: Add new extraction fields by modifying the schema
  • CLI interface: Simple command-line tool for batch processing

Installation

Requires Python 3.12+ and uv.

# Clone the repository
git clone https://github.com/danjamk/smart-artifact-parser.git
cd smart-artifact-parser

# Install dependencies
uv sync

Configuration

Create a .env file with your Anthropic API key:

cp .env.example .env

Edit .env and add your key:

ANTHROPIC_API_KEY=your_api_key_here

Get an API key at console.anthropic.com.

Usage

# Extract from a document
uv run python -m src.cli document.pdf

# Specify output directory
uv run python -m src.cli document.pdf --output-dir ./results

# Using the installed command (after uv sync)
uv run smart-parser document.pdf

Output

Results are saved as JSON files with timestamped names:

output/document_20250126_143022.json
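The suffix encodes the extraction date and time. A minimal sketch of how a name following this pattern could be generated (illustrative only; the project's actual naming code lives in the source):

```python
from datetime import datetime
from pathlib import Path

def output_name(source: str, when: datetime) -> str:
    """Build a timestamped JSON name like document_20250126_143022.json."""
    stem = Path(source).stem  # strip directory and extension
    return f"{stem}_{when.strftime('%Y%m%d_%H%M%S')}.json"

print(output_name("document.pdf", datetime(2025, 1, 26, 14, 30, 22)))
# document_20250126_143022.json
```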

Extracted Fields

The parser extracts the following information:

  • document_type: one of visit_note, lab_result, discharge_summary, prescription, referral, imaging_report, or other
  • document_date: date of the document/visit
  • provider: healthcare provider name, specialty, and facility
  • chief_complaint: primary reason for the visit
  • assessment: provider's clinical assessment
  • diagnoses: list of diagnoses with optional ICD-10 codes
  • medications: medications with dosage, frequency, and instructions
  • follow_up_instructions: follow-up care instructions

Example Output

{
  "document_type": "visit_note",
  "document_date": "2025-01-15",
  "provider": {
    "name": "Dr. Jane Smith",
    "specialty": "Internal Medicine",
    "facility": "City Medical Center"
  },
  "chief_complaint": "Persistent cough for 2 weeks",
  "assessment": "Acute bronchitis, improving",
  "diagnoses": [
    {
      "description": "Acute bronchitis",
      "icd_code": "J20.9"
    }
  ],
  "medications": [
    {
      "name": "Amoxicillin",
      "dosage": "500mg",
      "frequency": "Three times daily",
      "instructions": "Take with food for 7 days"
    }
  ],
  "follow_up_instructions": "Return if symptoms worsen or fever develops",
  "source_file": "visit_note.pdf",
  "extracted_at": "2025-01-26T14:30:22.123456"
}
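Because results are plain JSON, downstream code can consume them with the standard library alone. A small sketch using field names from the example above:

```python
import json

# A trimmed-down version of the example output above
record = json.loads("""
{
  "document_type": "visit_note",
  "diagnoses": [{"description": "Acute bronchitis", "icd_code": "J20.9"}],
  "medications": [{"name": "Amoxicillin", "dosage": "500mg"}]
}
""")

# Collect ICD-10 codes and medication names from the parsed result;
# icd_code is optional, so skip diagnoses without one
icd_codes = [d["icd_code"] for d in record["diagnoses"] if d.get("icd_code")]
med_names = [m["name"] for m in record["medications"]]
print(icd_codes, med_names)  # ['J20.9'] ['Amoxicillin']
```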

Supported Document Types

  • PDF (.pdf): native text and scanned documents (via OCR)
  • Word (.docx): Microsoft Word documents
  • Text (.txt): plain text files
  • Images (.png, .jpg, .jpeg, .tiff, .bmp): OCR extraction
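An extension check along these lines could gate inputs before parsing (a sketch; `is_supported` is a hypothetical helper, not part of the project's API):

```python
from pathlib import Path

# Extensions listed in the table above
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt",
                        ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}

def is_supported(path: str) -> bool:
    """Return True if the file extension is one the parser accepts."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS

print(is_supported("scan.TIFF"), is_supported("notes.md"))  # True False
```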

Extending the Schema

To extract additional fields, modify src/schemas.py:

  1. Add new Pydantic models for complex types
  2. Add fields to MedicalDocumentExtraction
  3. Claude will automatically extract the new fields

Example - adding patient vitals:

class Vitals(BaseModel):
    blood_pressure: str | None = Field(default=None, description="Blood pressure reading")
    heart_rate: int | None = Field(default=None, description="Heart rate in BPM")
    temperature: float | None = Field(default=None, description="Temperature in Fahrenheit")

class MedicalDocumentExtraction(BaseModel):
    # ... existing fields ...
    vitals: Vitals | None = Field(default=None, description="Patient vital signs")

Project Structure

smart-artifact-parser/
├── src/
│   ├── cli.py          # Typer CLI entry point
│   ├── parser.py       # Document parsing (Docling)
│   ├── extractor.py    # Claude API extraction
│   └── schemas.py      # Pydantic data models
├── output/             # Default output directory
├── pyproject.toml      # Project configuration
└── .env.example        # Environment template

How It Works

  1. Parse: Docling converts the document to markdown, preserving tables and structure
  2. Extract: Claude analyzes the text and extracts structured data using tool_use
  3. Validate: Pydantic validates the extracted data against the schema
  4. Output: Results saved as JSON with metadata

Requirements

  • Python 3.12+
  • uv package manager
  • Anthropic API key

Dependencies

Key dependencies (all referenced elsewhere in this README): Docling for document parsing, the Anthropic SDK for Claude extraction, Pydantic for schema validation, and Typer for the CLI. See pyproject.toml for the full list.

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
