Smart Artifact Parser

Extract structured medical information from documents using AI.

Smart Artifact Parser uses Docling for document parsing and Claude for intelligent extraction, converting unstructured medical documents into clean, validated JSON.

Features

  • Multi-format support: PDF, DOCX, TXT, and images (PNG, JPG, TIFF, BMP)
  • OCR included: Automatic text extraction from scanned documents and images
  • Structured output: Validated JSON with Pydantic schemas
  • Extensible: Add new extraction fields by modifying the schema
  • CLI interface: Simple command-line tool for batch processing

Installation

Requires Python 3.12+ and uv.

# Clone the repository
git clone https://github.com/danjamk/smart-artifact-parser.git
cd smart-artifact-parser

# Install dependencies
uv sync

Configuration

Create a .env file with your Anthropic API key:

cp .env.example .env

Edit .env and add your key:

ANTHROPIC_API_KEY=your_api_key_here

Get an API key at console.anthropic.com.

Usage

# Extract from a document
uv run python -m src.cli document.pdf

# Specify output directory
uv run python -m src.cli document.pdf --output-dir ./results

# Using the installed command (after uv sync)
uv run smart-parser document.pdf

Output

Results are saved as JSON files with timestamped names:

output/document_20250126_143022.json
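The suffix encodes the extraction date and time. A minimal sketch of how a name following this pattern could be generated (illustrative only; the project's actual naming code lives in the source):

```python
from datetime import datetime
from pathlib import Path

def output_name(source: str, when: datetime) -> str:
    """Build a timestamped JSON name like document_20250126_143022.json."""
    stem = Path(source).stem  # strip directory and extension
    return f"{stem}_{when.strftime('%Y%m%d_%H%M%S')}.json"

print(output_name("document.pdf", datetime(2025, 1, 26, 14, 30, 22)))
# document_20250126_143022.json
```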

Extracted Fields

The parser extracts the following information:

  • document_type: one of visit_note, lab_result, discharge_summary, prescription, referral, imaging_report, or other
  • document_date: date of the document/visit
  • provider: healthcare provider name, specialty, and facility
  • chief_complaint: primary reason for the visit
  • assessment: provider's clinical assessment
  • diagnoses: list of diagnoses with optional ICD-10 codes
  • medications: medications with dosage, frequency, and instructions
  • follow_up_instructions: follow-up care instructions

Example Output

{
  "document_type": "visit_note",
  "document_date": "2025-01-15",
  "provider": {
    "name": "Dr. Jane Smith",
    "specialty": "Internal Medicine",
    "facility": "City Medical Center"
  },
  "chief_complaint": "Persistent cough for 2 weeks",
  "assessment": "Acute bronchitis, improving",
  "diagnoses": [
    {
      "description": "Acute bronchitis",
      "icd_code": "J20.9"
    }
  ],
  "medications": [
    {
      "name": "Amoxicillin",
      "dosage": "500mg",
      "frequency": "Three times daily",
      "instructions": "Take with food for 7 days"
    }
  ],
  "follow_up_instructions": "Return if symptoms worsen or fever develops",
  "source_file": "visit_note.pdf",
  "extracted_at": "2025-01-26T14:30:22.123456"
}
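Because results are plain JSON, downstream code can consume them with the standard library alone. A small sketch using field names from the example above:

```python
import json

# A trimmed-down version of the example output above
record = json.loads("""
{
  "document_type": "visit_note",
  "diagnoses": [{"description": "Acute bronchitis", "icd_code": "J20.9"}],
  "medications": [{"name": "Amoxicillin", "dosage": "500mg"}]
}
""")

# Collect ICD-10 codes and medication names from the parsed result;
# icd_code is optional, so skip diagnoses without one
icd_codes = [d["icd_code"] for d in record["diagnoses"] if d.get("icd_code")]
med_names = [m["name"] for m in record["medications"]]
print(icd_codes, med_names)  # ['J20.9'] ['Amoxicillin']
```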

Supported Document Types

  • PDF (.pdf): native text and scanned documents (via OCR)
  • Word (.docx): Microsoft Word documents
  • Text (.txt): plain text files
  • Images (.png, .jpg, .jpeg, .tiff, .bmp): OCR extraction
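An extension check along these lines could gate inputs before parsing (a sketch; `is_supported` is a hypothetical helper, not part of the project's API):

```python
from pathlib import Path

# Extensions listed in the table above
SUPPORTED_EXTENSIONS = {".pdf", ".docx", ".txt",
                        ".png", ".jpg", ".jpeg", ".tiff", ".bmp"}

def is_supported(path: str) -> bool:
    """Return True if the file extension is one the parser accepts."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS

print(is_supported("scan.TIFF"), is_supported("notes.md"))  # True False
```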

Extending the Schema

To extract additional fields, modify src/schemas.py:

  1. Add new Pydantic models for complex types
  2. Add fields to MedicalDocumentExtraction
  3. Claude will automatically extract the new fields

Example - adding patient vitals:

class Vitals(BaseModel):
    blood_pressure: str | None = Field(default=None, description="Blood pressure reading")
    heart_rate: int | None = Field(default=None, description="Heart rate in BPM")
    temperature: float | None = Field(default=None, description="Temperature in Fahrenheit")

class MedicalDocumentExtraction(BaseModel):
    # ... existing fields ...
    vitals: Vitals | None = Field(default=None, description="Patient vital signs")

Project Structure

smart-artifact-parser/
├── src/
│   ├── cli.py          # Typer CLI entry point
│   ├── parser.py       # Document parsing (Docling)
│   ├── extractor.py    # Claude API extraction
│   └── schemas.py      # Pydantic data models
├── output/             # Default output directory
├── pyproject.toml      # Project configuration
└── .env.example        # Environment template

How It Works

  1. Parse: Docling converts the document to markdown, preserving tables and structure
  2. Extract: Claude analyzes the text and extracts structured data using tool_use
  3. Validate: Pydantic validates the extracted data against the schema
  4. Output: Results saved as JSON with metadata

Requirements

  • Python 3.12+
  • uv package manager
  • Anthropic API key

Dependencies

Key dependencies (all referenced elsewhere in this README): Docling for document parsing, the Anthropic SDK for Claude extraction, Pydantic for schema validation, and Typer for the CLI. See pyproject.toml for the full list.

License

MIT License - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
