A Python-based engine for processing radiology reports using the Qwen3-30B-A3B-FP8 model with sglang for efficient batch inference. Includes quality control (QC) file generation and performance evaluation tools for validation workflows, plus a debug mode for faster iteration.
curl -LsSf https://astral.sh/uv/install.sh | sh
export PATH="$HOME/.local/bin:$PATH"
uv sync
source .venv/bin/activate

First, launch the sglang server with the Qwen model:
python -m sglang.launch_server \
--model-path Qwen/Qwen3-30B-A3B-FP8 \
--reasoning-parser qwen3 \
--port 8000 \
--host 127.0.0.1 \
--dp 8 \
--schedule-conservativeness 0.1

Key parameters:
- `--dp`: Number of GPUs to use for data parallelism
- `--schedule-conservativeness`: Controls scheduling behavior (lower = more aggressive, higher = more conservative)
- `--port`: Server port (default: 8000)
- `--host`: Server host (default: 127.0.0.1)
- `--reasoning-parser`: Model-specific parser for handling output format (`qwen3` for Qwen models)
Note: The --reasoning-parser qwen3 parameter is required when using Qwen models with sglang. It tells sglang how to parse the model's output format and handle its reasoning capabilities. This is specific to the Qwen model architecture and is not configurable in our codebase.
If the server doesn't start properly (especially on FAC), try:
export NO_PROXY=localhost,127.0.0.1

Expected server output:
[2025-06-03 17:55:56] INFO: Started server process [3645360]
[2025-06-03 17:55:56] INFO: Waiting for application startup.
[2025-06-03 17:55:56] INFO: Application startup complete.
[2025-06-03 17:55:56] INFO: Uvicorn running on http://127.0.0.1:8000 (Press CTRL+C to quit)
[2025-06-03 17:55:57] INFO: 127.0.0.1:56852 - "GET /get_model_info HTTP/1.1" 200 OK
[2025-06-03 17:55:57 DP0] Prefill batch. #new-seq: 1, #new-token: 6, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0
...
[2025-06-03 17:56:02] The server is fired up and ready to roll!
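Once the server reports it is ready, you can optionally confirm it is reachable before kicking off a long run. A minimal check, assuming the default host and port (the `/get_model_info` endpoint is the one visible in the log above):

```python
# Quick reachability check against the sglang server (default host/port assumed).
import json
import urllib.request

resp = urllib.request.urlopen("http://127.0.0.1:8000/get_model_info", timeout=10)
print(json.loads(resp.read()))  # should report the served model, e.g. Qwen/Qwen3-30B-A3B-FP8
```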
The easiest way to process reports is using the command-line interface:
python src/cli.py \
--input-files /path/to/your/reports.csv \
--modality-config config/modalities/cxr.yaml \
--save-dir output

If you just want to try the CLI wiring without preparing your own dataset, a tiny sample file lives at docs/examples/reports_dummy.csv. It follows the default column names (Accession, Report Text), so you can run:
python src/cli.py \
--input-files docs/examples/reports_dummy.csv \
--modality-config config/modalities/cxr.yaml \
--save-dir output

Required:
- `--modality-config`: Path to modality-specific configuration file
- `--save-dir`: Directory to save all processed results and logs
Stage Selection:
- `--stages`: Which processing stages to run (default: `all`)
  - Choices: `remove-comparisons`, `extract-findings`, `map-categories`, `process-questions`, `all`
Input Options:
- `--input-files`: Path(s) to CSV file(s) containing reports (required for `remove-comparisons` and `process-questions` stages)
- `--input-no-comparisons`: Path to `no_comparisons.csv` from a previous run
- `--input-findings`: Path to `findings.csv` from a previous run
Debug Mode Options:
- `--debug-mode`: Enable debug mode with subsampling for faster iteration
- `--debug-sample`: Maximum number of reports to sample (default: 5000)
- `--debug-categories`: Number of categories to sample (default: 4)
- `--debug-num-questions`: Number of questions per category (default: 10)
- `--debug-seed`: Random seed for reproducible sampling (default: 42)
Optional:
- `--batch-size`: Number of reports to process per batch (default: 1024)
- `--accession-col`: Column name for accession numbers in the CSV (default: "Accession")
- `--report-col`: Column name for report text in the CSV (default: "Report Text")
- `--config`: Path to default configuration file (default: "config/default_config.yaml")
Basic usage:
python src/cli.py \
--input-files data/mimic_reports.csv \
--modality-config config/modalities/cxr.yaml \
--save-dir results

Debug mode for rapid iteration:
python src/cli.py \
--input-files data/mimic_reports.csv \
--modality-config config/modalities/cxr.yaml \
--save-dir debug_test \
--debug-mode \
--debug-sample 50 \
--debug-categories 2 \
--debug-num-questions 3 \
--debug-seed 42

Process multiple files with custom batch size:
python src/cli.py \
--input-files data/file1.csv data/file2.csv data/file3.csv \
--modality-config config/modalities/breast_mr.yaml \
--save-dir results \
--batch-size 2048

Custom column names:
python src/cli.py \
--input-files data/reports.csv \
--modality-config config/modalities/cxr.yaml \
--save-dir results \
--accession-col "Study_ID" \
--report-col "Report_Text"

The engine supports running individual processing stages independently and resuming incomplete runs.
- `remove-comparisons`: Removes temporal comparisons from reports
- `extract-findings`: Extracts the findings section from reports
- `map-categories`: Maps findings to specific categories
- `process-questions`: Answers specific questions about reports
Raw Reports
↓
remove-comparisons → no_comparisons.csv
↓
extract-findings → findings.csv
↓
map-categories → category_findings.csv
Raw Reports
↓
process-questions → questions.csv (independent)
All combined → final_results.json
The system automatically saves results incrementally:
How It Works:
- API Batching: Processes reports in batches (default: 1024 reports per batch, controlled by `--batch-size`)
- Incremental Saving: After each API batch completes, results are saved to CSV files
- Resume Protection: If processing fails, work from completed API batches is preserved
- Progress Tracking: Each stage tracks which reports have been processed to avoid reprocessing
Benefits:
- Fault Tolerance: Processing failures only lose the current API batch, not all work
- Flexible Resume: Can resume from any stage using existing files as input
Resume Behavior: When you restart a stage that was previously interrupted:
- Automatic Detection: The system checks existing CSV files for completed report IDs
- Skip Processed: Only processes reports that haven't been completed yet
- Seamless Continuation: Appends new results to existing CSV files
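For illustration, the incremental-save and skip-processed behavior described above could be sketched as follows. This is a simplified sketch, not the engine's actual implementation; it assumes pandas is available, and `process_batch` is a hypothetical callable standing in for the sglang call:

```python
# Sketch of incremental saving with resume (hypothetical helper, not src/cli.py itself).
import os
import pandas as pd

def run_stage(input_csv, output_csv, process_batch, batch_size=1024,
              id_col="Accession", text_col="Report Text"):
    """process_batch: callable mapping a list of report texts to a list of outputs."""
    reports = pd.read_csv(input_csv)

    # Automatic detection: collect IDs already present in the output from a previous run.
    done = set()
    if os.path.exists(output_csv):
        done = set(pd.read_csv(output_csv)[id_col])

    # Skip processed: keep only reports that have not been completed yet.
    pending = reports[~reports[id_col].isin(done)]

    for start in range(0, len(pending), batch_size):
        batch = pending.iloc[start:start + batch_size]
        results = process_batch(batch[text_col].tolist())
        out = pd.DataFrame({id_col: batch[id_col].values, "output": results})
        # Seamless continuation: append each completed batch, so a crash only loses the current batch.
        out.to_csv(output_csv, mode="a", header=not os.path.exists(output_csv), index=False)
```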
Example:
# Initial run processes 5000 reports, fails at report 3500
python src/cli.py --stages extract-findings --input-files data.csv
# Restart automatically detects completed reports and continues from report 3501
python src/cli.py --stages extract-findings --input-files data.csv

Run individual stages:
# Stage 1: Remove comparisons only
python src/cli.py \
--stages remove-comparisons \
--input-files data.csv \
--modality-config config/modalities/cxr.yaml \
--save-dir results
# Stage 2: Extract findings using previous results
python src/cli.py \
--stages extract-findings \
--input-no-comparisons results/no_comparisons.csv \
--modality-config config/modalities/cxr.yaml \
--save-dir results

Run multiple dependent stages:
python src/cli.py \
--stages remove-comparisons extract-findings map-categories \
--input-files data.csv \
--modality-config config/modalities/cxr.yaml \
--save-dir results

Your CSV file should contain at least two columns:
- Accession column: Unique identifier for each report
- Report column: The radiology report text
Example CSV structure:
Accession,Report Text
12345,"FINDINGS: No acute cardiopulmonary abnormality..."
12346,"IMPRESSION: Mild cardiomegaly. Otherwise normal..."
12347,"COMPARISON: Prior chest X-ray from 2023..."
Results are organized in the specified save directory:
save_dir/
├── logs/
│ └── Engine_filename_timestamp.log
├── no_comparisons.csv
├── findings.csv
├── category_findings.csv
├── questions.csv
└── final_results.json
File descriptions:
- logs/: Contains detailed processing logs with timestamps
- no_comparisons.csv: Reports with comparison sections removed
- findings.csv: Extracted findings from each report
- category_findings.csv: Findings mapped to specific categories
- questions.csv: Question-answer pairs for each report
- final_results.json: Complete structured results
{
"report_id": {
"raw_text": "Original report text",
"no_comparison_text": "Report with comparisons removed",
"findings": {
"findings": "Extracted findings text",
"findings_impressions": null,
"impression": ""
},
"category_findings": {
"Lung": "Lung-related findings...",
"Heart": "Heart-related findings...",
"Pleura": "No relevant findings"
},
"answers": {
"Lung": [
{
"question": "Is there any airspace opacity?",
"answer": "Yes"
},
{
"question": "Is there any atelectasis?",
"answer": "Yes"
}
],
"Heart": [
{
"question": "Is there cardiomegaly?",
"answer": "No"
}
]
},
"processing_time": 12.34
}
}

After processing your reports, generate QC files for human annotation and evaluate performance metrics using the tools described below.
Features
- Produces QC files for four validation tasks: no-comparisons, findings extraction, category assignments, and binary question answering
- Consolidates all questions into a single CSV with balanced positive/negative sampling (illustrated in the sketch after this list)
- Allows per-task budgets, random seeding, and verbose logging
- Designed for Python ≥3.9 and installs automatically with `pip install .`
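The balanced positive/negative sampling could look roughly like the sketch below. This is a hypothetical illustration of the idea, not the generator's actual code; the actual behavior is controlled by the budgets and `--seed` flag documented further down:

```python
# Illustration of balanced positive/negative sampling for the questions QC file.
import random

def balanced_sample(rows, budget, seed=42):
    """rows: list of dicts with an 'answer' field holding 'Yes' or 'No'."""
    rng = random.Random(seed)
    positives = [r for r in rows if r["answer"].lower() == "yes"]
    negatives = [r for r in rows if r["answer"].lower() == "no"]
    per_class = min(budget // 2, len(positives), len(negatives))
    return rng.sample(positives, per_class) + rng.sample(negatives, per_class)
```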
# Quick start: generate all QC artifacts in a target directory
python src/qc_cli.py \
--results-path final_results.json \
--qc-dir qc_validation/ \
--questions-budget 20 \
--category-budget 15

Advanced usage
# Generate specific QC types
python src/qc_cli.py --results-path results.json --qc-dir qc_output/ --qc-types findings categories
# Custom sample budgets and reproducibility
python src/qc_cli.py \
--results-path results.json \
--qc-dir qc_output/ \
--findings-budget 50 \
--category-budget 15 \
--questions-budget 20 \
--seed 123
# Verbose logging
python src/qc_cli.py --results-path results.json --qc-dir qc_output/ --verbose

Command line arguments
- `--results-path` (required): JSON produced by the main engine
- `--qc-dir` (required): Output folder for QC CSVs
- `--qc-types`: Subset of `no-comparisons`, `findings`, `categories`, `questions`, or `all`
- `--no-comparisons-budget`, `--findings-budget`, `--category-budget`, `--questions-budget`: Sample counts per task
- `--seed`: RNG seed for deterministic sampling
- `--verbose`/`-v`: Extra logging
Input data format
{
"report_id_1": {
"raw_text": "Original report text...",
"no_comparison_text": "Report with comparisons removed...",
"findings": {
"findings": "Extracted findings section..."
},
"category_findings": {
"Lung": "Lung-related findings...",
"Pleura": "No relevant findings",
"Heart": "Heart-related findings..."
},
"qa_results": {
"Lung": {
"Is there pneumonia?": "Yes"
},
"Devices": {
"Is there any chest tube or pigtail catheter?": "No"
}
}
}
}

Generated files
- `combined_findings_qc.csv`: Human reviewers validate both comparison removal and findings extraction in one place
- `categories_qc.csv`: Aggregates all category samples with per-category budgets
- `questions_qc.csv`: Consolidated binary classification questions with balanced labels
Features
- Evaluates no-comparisons, findings, category, and questions QC files
- Computes accuracy, precision (PPV), recall (sensitivity), specificity, NPV, F1, confusion matrices, and sample counts
- Supports consolidated and per-question analysis with imperfect-task highlighting
- Exports JSON metrics plus optional CSV of tasks below a configurable accuracy threshold
# Quick start: evaluate every QC artifact in a directory
python src/eval_cli.py --qc-dir qc_validation/ --output-file performance_metrics.json

Questions QC evaluation
- Consolidated mode measures aggregate binary classification performance across all questions.
- Individual mode surfaces weak questions with detailed per-question metrics.
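As a rough illustration of the difference between the two modes, the sketch below computes one aggregate accuracy and a per-question breakdown from an annotated questions QC file. It assumes pandas and the `questions_qc.csv` columns described below; the `question` column name is an assumption:

```python
# Consolidated vs. per-question accuracy from an annotated questions QC file (sketch).
import pandas as pd

df = pd.read_csv("qc_output/questions_qc.csv").dropna(subset=["human_label"])
norm = lambda s: s.astype(str).str.strip().str.lower()
df["match"] = norm(df["predicted_label"]) == norm(df["human_label"])

print("Consolidated accuracy:", df["match"].mean())           # one number across all questions
print(df.groupby("question")["match"].agg(["mean", "size"]))  # per-question breakdown
```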
Enhanced capabilities
- Imperfect task identification with custom accuracy thresholds
- CSV export of tasks that fall below the threshold
- Summary dashboard in logs for fast triage
- Verbose logging for step-by-step tracing
Command line arguments
- `--qc-dir` (required): Directory containing annotated QC CSVs
- `--qc-types`: Same choices as the QC generator; defaults to `all`
- `--output-file`: JSON file capturing metrics by task/question
- `--imperfect-csv`: CSV listing tasks with accuracy below `--accuracy-threshold`
- `--accuracy-threshold`: Default 1.0 (100%); adjust to match review tolerance
- `--verbose`/`-v`: Detailed logging
Usage examples
# Evaluate only questions and categories
python src/eval_cli.py --qc-dir qc_output/ --qc-types questions categories
# Persist metrics and imperfect tasks
python src/eval_cli.py \
--qc-dir qc_output/ \
--output-file evaluation_results.json \
--imperfect-csv tasks_to_review.csv \
--accuracy-threshold 0.95 \
--verbose

Expected QC annotations
- `combined_findings_qc.csv`: The `correct` column must contain reviewer verdicts (1/0, yes/no, true/false)
- `categories_qc.csv`: Each row includes `category`, `model_text`, and `correct`
- `questions_qc.csv`: Requires `predicted_label`/`human_label` with binary values (1/0 or yes/no)
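If you post-process the annotated files yourself, verdicts in these formats might be normalized along the lines of the sketch below (the evaluator handles this internally; blank cells are treated as unannotated):

```python
# Normalize reviewer verdicts (1/0, yes/no, true/false); blanks stay unannotated (None).
def parse_verdict(value):
    if value is None or str(value).strip() == "":
        return None
    text = str(value).strip().lower()
    if text in {"1", "yes", "true"}:
        return True
    if text in {"0", "no", "false"}:
        return False
    raise ValueError(f"Unsupported verdict: {value!r}")
```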
Sample evaluation output
{
"summary": {
"overall_accuracy": 0.91,
"total_tasks": 3,
"tasks_below_threshold": 1
},
"categories": {
"Lung": {
"accuracy": 0.75,
"n_samples": 10,
"n_annotated": 8,
"file": "qc_output/categories_qc.csv"
}
},
"questions": {
"Is_there_pneumonia": {
"accuracy": 0.83,
"precision": 0.75,
"recall": 0.90,
"specificity": 0.80,
"npv": 0.92,
"f1_score": 0.82,
"confusion_matrix": {
"true_negative": 8,
"false_positive": 3,
"false_negative": 1,
"true_positive": 9
}
}
}
}

Metrics reference
- Accuracy: `(TP + TN) / total`
- Precision (PPV): `TP / (TP + FP)`
- Recall (Sensitivity): `TP / (TP + FN)`
- Specificity: `TN / (TN + FP)`
- NPV: `TN / (TN + FN)`
- F1 Score: Harmonic mean of precision and recall
- Correctness evaluations (no-comparisons/findings/categories) report simple accuracy
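For reference, the formulas above can be reproduced from confusion-matrix counts with a few lines of Python (a standalone sketch, independent of the evaluator; division-by-zero handling omitted):

```python
# Standalone sketch of the metric formulas above.
def metrics(tp, fp, tn, fn):
    total = tp + fp + tn + fn
    precision = tp / (tp + fp)   # PPV
    recall = tp / (tp + fn)      # sensitivity
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),
        "f1_score": 2 * precision * recall / (precision + recall),
    }

# Example usage with arbitrary counts.
print(metrics(tp=90, fp=10, tn=85, fn=15))
```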
QC → Evaluation workflow
- Generate QC files
  python src/qc_cli.py --results-path final_results.json --qc-dir qc_output/
- Human annotation
  - Fill in `correct` columns (extraction/categorization) and `human_label` columns (questions)
- Evaluate
  python src/eval_cli.py --qc-dir qc_output/ --output-file performance_metrics.json --imperfect-csv review.csv
- Triaging
  - Inspect JSON metrics for aggregate performance
  - Review the imperfect CSV to prioritize remediation
Troubleshooting
- No human annotations found: ensure `correct`/`human_label` columns aren’t empty and use supported values (1/0/yes/no)
- File not found: verify `--qc-dir` and that filenames follow the generator’s naming scheme
- Missing required columns: confirm CSV headers match expectations (e.g., `predicted_label`, `human_label`, `correct`)
- Partial annotations: the evaluator automatically ignores blank rows while reporting annotated counts
Create modality-specific configuration files in config/modalities/:
# config/modalities/cxr.yaml
categories:
Lung:
description: "Findings related to lung parenchyma, airways, and pulmonary vessels"
questions:
- question: "Is there evidence of pneumonia or infection?"
- question: "Are there any nodules or masses?"
Heart:
description: "Findings related to heart size, shape, and cardiac structures"
questions:
- question: "Is there cardiomegaly?"
- question: "Is there evidence of heart failure?"Modify config/default_config.yaml for model and server settings:
model:
name: "Qwen/Qwen3-30B-A3B-FP8"
temperature: 0.1
top_p: 0.1
max_tokens: 4096
server:
port: 8000
base_url: "http://127.0.0.1"

For optimal performance:
- Adjust batch size: Use `--batch-size` to optimize for your hardware
  - Smaller batches: Lower memory usage, more API calls, more checkpointing
  - Larger batches: Higher memory usage, fewer API calls, less checkpointing

  For an 8x H100 server, I found a batch size of ~24,000 (~3,000 per GPU) to be effective at maximizing GPU utilization.
- Server tuning: Adjust sglang server parameters:
  - `--dp`: Number of GPUs for data parallelism
  - `--schedule-conservativeness`: 0.1-0.3 for throughput, 0.4-0.8 for latency
- Monitor progress: Track processing in real-time:
  # Watch CSV files grow during processing
  watch -n 10 'wc -l results/findings.csv'
  # Check processing times in logs
  tail -f results/logs/Engine_*.log
- Batch size optimization examples:
  # Memory-constrained environments
  python src/cli.py --batch-size 512 --input-files data.csv
  # High-memory systems (faster processing)
  python src/cli.py --batch-size 4096 --input-files data.csv
Memory issues:
- Reduce `--batch-size`
- Monitor GPU memory usage
Processing errors:
- Check logs in `save_dir/logs/`
- Verify CSV format and column names
- Ensure modality config file exists
Stage-based processing errors:
- Ensure required input files exist for dependent stages
- Use `--stages all` for complete processing with no intermediate files
- Check batch directory paths when using `--resume-from-batch`
Start with debug mode for rapid iteration using a subset of your dataset:
# Test with debug mode first - processes 50 reports with 2 categories
python src/cli.py \
--input-files large_dataset.csv \
--modality-config config/modalities/cxr.yaml \
--save-dir debug \
--debug-mode \
--debug-sample 50 \
--debug-categories 2 \
--debug-num-questions 5
# Scale up once you've validated the approach
python src/cli.py \
--input-files large_dataset.csv \
--modality-config config/modalities/cxr.yaml \
--save-dir production

- Debug: Start with `--debug-mode` to validate your configuration
- Iterate: Adjust prompts and configurations based on debug results
- Scale: Remove debug mode for full dataset processing
- Validate: Generate QC files for quality assessment
- Evaluate: Use eval CLI to measure performance and identify improvements
If processing fails partway through:
- Identify which stages completed successfully by checking the save directory
- Use the intermediate files from the save directory
- Run only the remaining stages
Here's a complete pipeline from debug to production with QC validation:
1. Start with Debug Mode (2-3 minutes):
# Test configuration with small subset
python src/cli.py \
--input-files data/mimic_reports.csv \
--modality-config config/modalities/cxr.yaml \
--save-dir debug_test \
--debug-mode \
--debug-sample 50 \
--debug-categories 2 \
--debug-num-questions 3 \
--debug-seed 42

2. Generate Debug QC Files:
# Generate QC files for validation
python src/qc_cli.py \
--results-path debug_test/final_results.json \
--qc-dir debug_test/qc \
--combined-findings-budget 20 \
--category-budget 5 \
--questions-budget 10

3. Evaluate Debug Results:
# Check performance metrics
python src/eval_cli.py \
--qc-dir debug_test/qc \
--output-file debug_test/eval_results.json \
--verbose

4. Scale to Production (hours):
# Full dataset processing
python src/cli.py \
--input-files data/mimic_reports.csv \
--modality-config config/modalities/cxr.yaml \
--save-dir production \
--batch-size 2048

5. Production QC and Evaluation:
# Generate production QC files
python src/qc_cli.py \
--results-path production/final_results.json \
--qc-dir production/qc \
--combined-findings-budget 100 \
--category-budget 20 \
--questions-budget 50
# Evaluate production results
python src/eval_cli.py \
--qc-dir production/qc \
--output-file production/eval_results.json \
--imperfect-csv production/needs_review.csv

If you use this code in your research, please cite the following paper:
@article{pillar0,
title = {Pillar-0: A New Frontier for Radiology Foundation Models},
author = {Agrawal, Kumar Krishna and Liu, Longchao and Lian, Long and Nercessian, Michael and Harguindeguy, Natalia and Wu, Yufu and Mikhael, Peter and Lin, Gigin and Sequist, Lecia V. and Fintelmann, Florian and Darrell, Trevor and Bai, Yutong and Chung, Maggie and Yala, Adam},
year = {2025}
}