- Contains ICU-only EHR structured data, text, and images for nearly 300k patients
- Contains ICU-only EHR structured data, text, and images for nearly 40k patients
- Contains the subset of MIMIC-III patients not included in MIMIC-IV
- ICU-only EHR structured data from across the US for 200k admissions
- Hospital visit data for surgery patients at UCI
- Comprehensive EMR records and waveforms for each patient encounter
- Contains patient information, medical history, and specific surgical procedure information including: medicines used, lines or drains used, and post-operative complications
- 58,799 unique patients with data from 83,468 surgeries. Data spans over 4 years.
- COVID-19 patients with diverse comorbidities like kidney failure, cardiovascular diseases, malignancies, and immunosuppression
- 2,000 clinical records, encompassing discharge letters, referrals, and radiology reports from Hospital Clínic of Barcelona between March 2020 and March 2022
- Primarily in Spanish, some Catalan sections
- Annotated by specialists for medical concepts, covering symptoms, diseases, procedures, medications, species, and humans (including family members)
- Expert-labeled: 20 breast cancer and 20 pancreatic cancer de-identified progress notes from UCSF, comprehensively annotated
- Unannotated: 100 breast cancer and 100 pancreatic cancer de-identified progress notes from UCSF, auto-labeled with GPT-4
- Expert-labeled subset should only be used as a test-set
RadQA: A Question Answering Dataset to Improve Comprehension of Radiology Reports | dataset
- 1000+ expert-annotated radiology reports from MIMIC-III
CliniQG4QA: Generating Diverse Questions for Domain Adaptation of Clinical Question Answering
- 1287 annotated QA pairs on 36 sampled discharge summaries from MIMIC-III Clinical Notes
MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records | [awaiting dataset]
- Instruction-tuning dataset of 983 questions/instructions from 7 specialties
- Specialties include: Internal Medicine, Neurology, Radiology, Cardiology, Oncology, Surgery, and Primary Care
- 303 are expert-labeled, with EHR references for correct answers also annotated
- Reference EHR contains structured and note data
n2c2/i2b2 shared task datasets
- Nearly 100k automated queries from 11,846 clinical case reports for question answering / reading comprehension
- i2b2 datasets repurposed for question answering
- 400,000+ question-answer evidence pairs and 1 million question-logical form pairs
- 167k patient summaries from PMC case reports
- 3.1M patient-article relevance and 293k patient-patient similarity annotations defined by PubMed citation graph
Med-HALT: Medical Domain Hallucination Test for Large Language Models
- Medical exam and PubMed QA datasets
- Combination of 7 datasets: MedMCQA, HEADQA, MedQA USMLE, MedQA Taiwan, PubMed
Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | dataset
- JAMA Clinical Challenge dataset containing questions based on challenging clinical cases
- Medbullets dataset comprising USMLE Step 2/3-style clinical questions
- All are multiple-choice question-answering tasks
- Each question is accompanied by an expert-written explanation
A Dataset for Pharmacovigilance in German, French, and Japanese: Annotating Adverse Drug Reactions across Languages | dataset
- Multilingual corpus of texts concerning ADRs gathered from patient fora and social media in German, French, and Japanese
- 12 entity types, four attribute types, and 13 relation types
Exploring the Generalization of Cancer Clinical Trial Eligibility Classifiers Across Diseases | dataset
- 2,490 annotated eligibility criteria across seven exclusion types in the following groups: (1) additional phase 3 cancer trials, (2) phase 1 and 2 cancer trials, (3) heart disease trials, (4) type 2 diabetes trials, and (5) observational trials for any disease.
- The paper also has several references to other clinical trial datasets.
CT-ADE: An Evaluation Benchmark for Adverse Drug Event Prediction from Clinical Trial Results
- 12,000 instances extracted from clinical trial results
- Integrates drug, patient population, and contextual information for multilabel ADE classification tasks in monopharmacy treatments