Publicly-available clinical datasets

EHR records:

MIMIC-IV | dataset

  • Contains ICU-only EHR structured data, text, and images for nearly 300k patients

MIMIC-III | dataset

  • Contains ICU-only EHR structured data, text, and images for nearly 40k patients

MIMIC-III CareVue subset

  • Contains the subset of MIMIC-III patients not included in MIMIC-IV

eICU | dataset

  • ICU-only EHR structured data from across the US for 200k admissions

MOVER

  • Hospital visit data for surgery patients at UCI.
  • Comprehensive EMR records and waveforms for each patient encounter
  • Contains patient information, medical history, and specific surgical procedure information including: medicines used, lines or drains used, and post-operative complications
  • 58,799 unique patients with data from 83,468 surgeries. Data spans over 4 years.

CARMEN-I

  • COVID-19 patients with diverse comorbidities like kidney failure, cardiovascular diseases, malignancies, and immunosuppression
  • 2,000 clinical records, encompassing discharge letters, referrals, and radiology reports from Hospital Clínic of Barcelona between March 2020 and March 2022
  • Primarily in Spanish, some Catalan sections
  • Annotated by specialists for medical concepts: symptoms, diseases, procedures, medications, species, and humans (including family members)

Manually annotated datasets:

CORAL | dataset

  • Expert-labeled: 20 breast cancer and 20 pancreatic cancer de-identified progress notes from UCSF, comprehensively annotated
  • Unannotated: 100 breast cancer and 100 pancreatic cancer de-identified progress notes from UCSF, auto-labeled with GPT-4
  • The expert-labeled subset should be used only as a test set
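
A minimal sketch of how the test-only rule for the expert-labeled notes might be enforced when loading the data. The `Note` record, its `label_source` field, and the `"expert"` / `"gpt4"` values are hypothetical names chosen for illustration; they are not taken from the CORAL release, whose actual file layout may differ.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Note:
    """Hypothetical in-memory record for one progress note."""
    note_id: str
    text: str
    label_source: str  # assumed values: "expert" or "gpt4"


def split_by_label_source(notes: List[Note]) -> Tuple[List[Note], List[Note]]:
    """Return (train, test), keeping every expert-labeled note out of training."""
    train = [n for n in notes if n.label_source != "expert"]
    test = [n for n in notes if n.label_source == "expert"]
    return train, test


def assert_no_leakage(train: List[Note], test: List[Note]) -> None:
    """Raise ValueError if any test note ID also appears in the training split."""
    overlap = {n.note_id for n in train} & {n.note_id for n in test}
    if overlap:
        raise ValueError(f"test notes leaked into training: {sorted(overlap)}")
```

Calling `assert_no_leakage` right after the split gives an early failure if a note is ever duplicated across the two label sources.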

RadQA: A Question Answering Dataset to Improve Comprehension of Radiology Reports | dataset

  • 1000+ expert-annotated radiology reports from MIMIC-III

CliniQG4QA: Generating Diverse Questions for Domain Adaptation of Clinical Question Answering

  • 1287 annotated QA pairs on 36 sampled discharge summaries from MIMIC-III Clinical Notes

MedAlign: A Clinician-Generated Dataset for Instruction Following with Electronic Medical Records | [awaiting dataset]

  • Instruction-tuning dataset of 983 questions/instructions from 7 specialties
  • Specialties include: Internal Medicine, Neurology, Radiology, Cardiology, Oncology, Surgery, and Primary Care
  • 303 are expert-labeled, with EHR references for correct answers also annotated
  • Reference EHR contains structured and note data

n2c2/i2b2 shared task datasets

Other datasets:

CliCR

  • Nearly 100k automated queries from 11,846 clinical case reports for question answering / reading comprehension

EMR-QA

  • i2b2 datasets repurposed for question answering
  • 400,000+ question-answer evidence pairs and 1 million question-logical form pairs

PMC-Patients

  • 167k patient summaries from PMC case reports
  • 3.1M patient-article relevance and 293k patient-patient similarity annotations defined by PubMed citation graph

Med-HALT: Medical Domain Hallucination Test for Large Language Models

  • Medical exam and PubMed QA datasets
  • Combination of 7 datasets: MedMCQA, HEADQA, MedQA USMLE, MedQA Taiwan, PubMed

Benchmarking Large Language Models on Answering and Explaining Challenging Medical Questions | dataset

  • JAMA Clinical Challenge dataset containing questions based on challenging clinical cases
  • Medbullets, comprising USMLE Step 2 and Step 3 style clinical questions
  • All are multiple-choice question-answering tasks
  • Each question is accompanied by an expert-written explanation

A Dataset for Pharmacovigilance in German, French, and Japanese: Annotating Adverse Drug Reactions across Languages | dataset

  • Multilingual corpus of texts concerning ADRs gathered from patient fora and social media in German, French, and Japanese
  • 12 entity types, four attribute types, and 13 relation types

Exploring the Generalization of Cancer Clinical Trial Eligibility Classifiers Across Diseases | dataset

  • 2,490 annotated eligibility criteria across seven exclusion types in the following groups: (1) additional phase 3 cancer trials, (2) phase 1 and 2 cancer trials, (3) heart disease trials, (4) type 2 diabetes trials, and (5) observational trials for any disease.
  • The paper also has several references to other clinical trial datasets.

CT-ADE: An Evaluation Benchmark for Adverse Drug Event Prediction from Clinical Trial Results

  • 12,000 instances extracted from clinical trial results
  • Integrates drug, patient population, and contextual information for multilabel ADE classification tasks in monopharmacy treatments

Synthetic data:

Asclepius