eICU Data

Obtain access to the eICU Collaborative Research Database on PhysioNet and download the dataset.
Clone the eICU Benchmarks repository and follow the instructions under the "Data extraction" section.
Update the eicu_dir and benchmark_dir variables in clinicaldg/eicu/Constants.py to point to the raw data and processed data folders.
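
Before running anything downstream, it can help to confirm that both paths resolve to real directories. The following is a minimal sketch under stated assumptions: the two paths are placeholders, and missing_dirs is a hypothetical helper that is not part of clinicaldg. The actual layout of Constants.py may differ.

```python
from pathlib import Path

# Placeholder values (assumptions); replace with the locations on your machine.
eicu_dir = "/path/to/eicu-raw"
benchmark_dir = "/path/to/eicu-benchmark"


def missing_dirs(paths):
    """Return the names whose path is not an existing directory."""
    return [name for name, path in paths.items() if not Path(path).is_dir()]


# Prints the variable names that still point at a non-existent directory.
print(missing_dirs({"eicu_dir": eicu_dir, "benchmark_dir": benchmark_dir}))
```

An empty list means both variables point at existing folders.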

Chest X-ray Data

Obtain access to the MIMIC-CXR-JPG Database on PhysioNet and download the dataset. We recommend downloading from the GCP bucket:

gcloud auth login
mkdir MIMIC-CXR-JPG
gsutil -m rsync -d -r gs://mimic-cxr-jpg-2.0.0.physionet.org MIMIC-CXR-JPG

To obtain gender information for each patient, you will also need access to MIMIC-IV. Download core/patients.csv.gz and place it in the MIMIC-CXR-JPG directory.
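
To check that the file is in place and readable, the gender column can be loaded with the standard library alone. This is a sketch, not the repository's own preprocessing: load_patient_gender is a hypothetical helper, and the column names subject_id and gender are assumed from the MIMIC-IV patients table.

```python
import csv
import gzip
from pathlib import Path


def load_patient_gender(mimic_cxr_dir):
    """Map subject_id -> gender from patients.csv.gz in the given directory.

    Assumes the CSV header contains 'subject_id' and 'gender' columns.
    """
    path = Path(mimic_cxr_dir) / "patients.csv.gz"
    # Open the gzip file in text mode so csv can parse it directly.
    with gzip.open(path, mode="rt", newline="") as handle:
        reader = csv.DictReader(handle)
        return {row["subject_id"]: row["gender"] for row in reader}
```

For example, load_patient_gender("MIMIC-CXR-JPG") returns a dictionary keyed by subject ID (as strings), which is enough to confirm the download before running the preprocessing step.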

Sign up with your email address on the CheXpert website.
Download either the original or the downsampled dataset (we recommend the downsampled version - CheXpert-v1.0-small.zip) and extract it.

Download the images folder and Data_Entry_2017_v2020.csv from the NIH website.
Unzip all of the files in the images folder.
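
Unpacking many archives by hand is tedious, so the step above can be scripted. The sketch below assumes the archives sit directly inside the images folder and use a .zip, .tar, .tar.gz, or .tgz suffix; extract_archives is a hypothetical helper, not part of clinicaldg.

```python
import shutil
from pathlib import Path

# Suffixes treated as archives (an assumption about how the files are packaged).
ARCHIVE_SUFFIXES = (".zip", ".tar", ".tar.gz", ".tgz")


def extract_archives(images_dir):
    """Unpack every archive found directly inside images_dir, in place.

    Returns the sorted names of the archives that were unpacked.
    """
    images_dir = Path(images_dir)
    extracted = []
    # sorted() materializes the listing before any new files are written.
    for archive in sorted(images_dir.iterdir()):
        if archive.is_file() and archive.name.endswith(ARCHIVE_SUFFIXES):
            shutil.unpack_archive(str(archive), str(images_dir))
            extracted.append(archive.name)
    return extracted
```

Calling extract_archives("/path/to/images") leaves the original archives in place, so they can be deleted afterwards once the extracted files have been checked.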

In clinicaldg/cxr/Constants.py, update image_paths to point to each of the four directories that you downloaded.
Run python -m clinicaldg.cxr.preprocess.preprocess.
(Optional) If you are training many models, it may be faster to cache all images on disk as 224x224 binary files. To do so, update the cache_dir path in clinicaldg/cxr/Constants.py and then run python -m clinicaldg.cxr.preprocess.cache_data. For speed, you can run one process per --env_id value (0, 1, 2, or 3) in parallel. To use the cached files, pass --use_cache 1 to train.py or sweep.py.
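
The parallel caching run can be launched from a small driver script instead of four separate terminals. This is a sketch under stated assumptions: build_cache_commands and run_in_parallel are hypothetical helpers, and the script is assumed to be run from the repository root with the project's environment active. Only the module path and the --env_id flag come from the instructions above.

```python
import subprocess
import sys

# The four --env_id values listed in the caching instructions.
ENV_IDS = (0, 1, 2, 3)


def build_cache_commands(env_ids=ENV_IDS):
    """Build one cache_data command line per env_id."""
    return [
        [sys.executable, "-m", "clinicaldg.cxr.preprocess.cache_data",
         "--env_id", str(env_id)]
        for env_id in env_ids
    ]


def run_in_parallel(commands):
    """Start all commands at once, wait for each, and return their exit codes."""
    processes = [subprocess.Popen(command) for command in commands]
    return [process.wait() for process in processes]
```

Calling run_in_parallel(build_cache_commands()) starts all four processes together. A non-zero entry in the returned list indicates that the corresponding --env_id run failed and should be repeated.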