LLM Epidemiological Code Composition

Can large language models write epidemiologically correct code for estimating the time-varying reproduction number (Rt)?

Study Design

This study evaluates LLM-generated code for Rt estimation across:

2 models: Claude Sonnet 4, Llama 3.1 8B
4 scenarios: From basic Rt estimation to complex multi-stream models
5 framework conditions: Stan, PyMC, Turing, EpiAware, plain R
3 runs each: 120 total submissions

Important Limitations

This study tests zero-context, single-shot prompting - a deliberately harsh baseline:

No documentation or examples provided
No iterative refinement
No tool use or agentic behavior
No access to framework codebases

This is not how practitioners would realistically use LLMs. Real-world use involves iteration, documentation in context, and error feedback. Results should be interpreted as a lower bound on capability.

See Issue #1 and Issue #2 for discussion.

Repository Structure

├── prompts/                 # Scenario prompts sent to LLMs
│   ├── scenario_1a/         # Open method choice
│   ├── scenario_1b/         # Renewal equation specified
│   ├── scenario_2/          # Complex model (day-of-week, ascertainment)
│   └── scenario_3/          # Multiple data streams
├── experiments/             # LLM responses (120 submissions)
├── evaluation/              # Execution evaluation
│   ├── run_evaluation.R     # Evaluation script
│   └── results/             # Execution results
├── expert_review/           # Expert assessment materials
│   ├── README.md            # Reviewer instructions
│   ├── all_code.md          # All submissions (blinded)
│   └── scoresheet.csv       # Scoring spreadsheet
├── reference_solutions/     # Ground truth implementations
├── data/                    # Synthetic COVID-19 case data
└── analysis_plan.md         # Pre-registered analysis plan

Scenarios

Scenario	Description	Method
1a	Estimate Rt from daily cases	Open choice
1b	Estimate Rt using renewal equation	Specified
2	Rt with day-of-week effects, time-varying ascertainment, NegBin noise	Specified
3	Joint model of cases, hospitalisations, deaths with shared Rt	Specified

Expert Review

Expert reviewers assess each submission for epidemiological correctness:

Method identification (Scenario 1a): Renewal equation, Wallinga-Teunis, etc.
Departures from reference: List differences from gold standard
Departure classification:
- A: Equivalent alternative
- B: Minor error
- C: Major error
- D: Fundamental misunderstanding
Overall assessment: Acceptable / Minor issues / Major issues / Incorrect

See expert_review/README.md for full instructions.

Reproducing

Run experiments

Rscript experiments/run_all.R

Evaluate execution

Rscript evaluation/run_evaluation.R

Generate review materials

Rscript expert_review/generate_review_materials.R

License

MIT

Citation

Paper forthcoming

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM Epidemiological Code Composition

Study Design

Important Limitations

Repository Structure

Scenarios

Expert Review

Reproducing

Run experiments

Evaluate execution

Generate review materials

License

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 134 Commits
data		data
evaluation		evaluation
expert_review		expert_review
prompts		prompts
reference_solutions		reference_solutions
scripts		scripts
setup		setup
.gitignore		.gitignore
README.md		README.md
analysis_plan.md		analysis_plan.md

Folders and files

Latest commit

History

Repository files navigation

LLM Epidemiological Code Composition

Study Design

Important Limitations

Repository Structure

Scenarios

Expert Review

Reproducing

Run experiments

Evaluate execution

Generate review materials

License

Citation

About

Resources

Code of conduct

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages