Publicly-available clinical datasets

EHR records:

MIMIC-IV | dataset

Contains ICU-only EHR structured data, text, and images for nearly 300k patients

MIMIC-III | dataset

Contains ICU-only EHR structured data, text, and images for nearly 40k patients

MIMIC-III CareVue subset

Contains the subset of MIMIC-III patients not included in MIMIC-IV

eICU | dataset

ICU-only EHR structured data from across the US for 200k admissions

MOVER

Hospital visit data for surgery patients at UCI.
Comprehensive EMR records and waveforms for each patient encounter
Contains patient information, medical history, and specific surgical procedure information including: medicines used, lines or drains used, and post-operative complications
58,799 unique patients with data from 83,468 surgeries. Data spans over 4 years.

CARMEN-I

COVID-19 patients with diverse comorbidities like kidney failure, cardiovascular diseases, malignancies, and immunosuppression
2,000 clinical records, encompassing discharge letters, referrals, and radiology reports from Hospital Clínic of Barcelona between March 2020 and March 2022
Primarily in Spanish, some Catalan sections
Annotated by specialists for medical concepts, encompassing symptoms, diseases, procedures, medications, species, and humans (including family members)

Manually annotated datasets:

CORAL | dataset

Expert-labeled: 20 breast cancer and 20 pancreatic cancer de-identified progress notes from UCSF, comprehensively annotated
Unannotated: 100 breast cancer and 100 pancreatic cancer de-identified progress notes from UCSF, auto-labeled with GPT-4
The expert-labeled subset should only be used as a test set

RadQA: A Question Answering Dataset to Improve Comprehension of Radiology Reports | dataset

1000+ expert-annotated radiology reports from MIMIC-III

CliniQG4QA: Generating Diverse Questions for Domain Adaptation of Clinical Question Answering

1287 annotated QA pairs on 36 sampled discharge summaries from MIMIC-III Clinical Notes

MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records | [awaiting dataset]

Instruction-tuning dataset of 983 questions/instructions from 7 specialties
Specialties include: Internal Medicine, Neurology, Radiology, Cardiology, Oncology, Surgery, and Primary Care
303 are expert-labeled, with EHR references for correct answers also annotated
Reference EHR contains structured and note data

n2c2/i2b2 shared task datasets

Other datasets:

CliCR

Nearly 100k automated queries from 11,846 clinical case reports for question answering / reading comprehension

EMR-QA

i2b2 datasets repurposed for question answering
400,000+ question-answer evidence pairs and 1 million question-logical form pairs

PMC-Patients

167k patient summaries from PMC case reports
3.1M patient-article relevance and 293k patient-patient similarity annotations defined by PubMed citation graph

Med-HALT: Medical Domain Hallucination Test for Large Language Models

Medical exam and PubMed QA datasets
Combination of 7 datasets: MedMCQA, HEADQA, MedQA USMLE, MedQA Taiwan, PubMed

Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | dataset

JAMA Clinical Challenge dataset containing questions based on challenging clinical cases
Medbullets comprising USMLE Step 2&3 style clinical questions
All are multiple-choice question-answering tasks
Each question is accompanied by an expert-written explanation

A Dataset for Pharmacovigilance in German, French, and Japanese: Annotating Adverse Drug Reactions across Languages | dataset

Multilingual corpus of texts concerning ADRs gathered from patient fora and social media in German, French, and Japanese
12 entity types, four attribute types, and 13 relation types

Exploring the Generalization of Cancer Clinical Trial Eligibility Classifiers Across Diseases | dataset

2,490 annotated eligibility criteria across seven exclusion types in the following groups: (1) additional phase 3 cancer trials, (2) phase 1 and 2 cancer trials, (3) heart disease trials, (4) type 2 diabetes trials, and (5) observational trials for any disease.
The paper also has several references to other clinical trial datasets.

CT-ADE: An Evaluation Benchmark for Adverse Drug Event Prediction from Clinical Trial Results

12,000 instances extracted from clinical trial results
Integrates drug, patient population, and contextual information for multilabel ADE classification tasks in monopharmacy treatments

Synthetic data:

Asclepius
