- Obtain access to the eICU Collaborative Research Database on PhysioNet and download the dataset.
- Clone the eICU Benchmarks repository and follow the instructions under the "Data extraction" section.
- Update the `eicu_dir` and `benchmark_dir` variables in `clinicaldg/eicu/Constants.py` to point to the raw data and processed data folders.
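As a rough illustration, the two edited variables might look like the following. Both paths are placeholders, and `check_dirs` is a hypothetical helper (not part of the repository) that reports any configured folder that does not exist, so a wrong path is caught before training starts.

```python
from pathlib import Path

# Placeholder locations (assumptions): replace with your own folders.
eicu_dir = "/data/eicu/raw"             # raw eICU download from PhysioNet
benchmark_dir = "/data/eicu/benchmark"  # output of the eICU Benchmarks data extraction


def check_dirs(*dirs):
    """Hypothetical helper: return the paths that are not existing directories."""
    return [d for d in dirs if not Path(d).is_dir()]
```

Calling `check_dirs(eicu_dir, benchmark_dir)` returns an empty list when both folders are in place.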
- Obtain access to the MIMIC-CXR-JPG Database on PhysioNet and download the dataset. We recommend downloading from the GCP bucket:
gcloud auth login
mkdir MIMIC-CXR-JPG
gsutil -m rsync -d -r gs://mimic-cxr-jpg-2.0.0.physionet.org MIMIC-CXR-JPG
- In order to obtain gender information for each patient, you will need to obtain access to MIMIC-IV. Download `core/patients.csv.gz` and place the file in the `MIMIC-CXR-JPG` directory.
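For orientation, here is a minimal sketch of how the gender column could be read from that file once it is in place. It assumes the file has `subject_id` and `gender` columns, as in the MIMIC-IV `patients` table. `load_gender_map` is a hypothetical helper, not a function from this repository, and the preprocessing script may read the file differently.

```python
import csv
import gzip
from pathlib import Path


def load_gender_map(mimic_cxr_dir):
    """Return {subject_id: gender} read from <mimic_cxr_dir>/patients.csv.gz."""
    path = Path(mimic_cxr_dir) / "patients.csv.gz"
    # Open the gzip-compressed CSV in text mode and read it row by row.
    with gzip.open(path, mode="rt", newline="") as f:
        return {row["subject_id"]: row["gender"] for row in csv.DictReader(f)}
```

For example, `load_gender_map("MIMIC-CXR-JPG")` returns a dictionary keyed by the patient identifier (as a string).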
- Sign up with your email address here.
- Download either the original or the downsampled dataset (we recommend the downsampled version, `CheXpert-v1.0-small.zip`) and extract it.
- Download the `images` folder and `Data_Entry_2017_v2020.csv` from the NIH website.
- Unzip all of the files in the `images` folder.
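If you prefer to script the extraction, a minimal sketch follows. It assumes the downloaded archives sit directly inside the `images` folder and use a format that Python's `shutil.unpack_archive` recognizes from the file extension (for example `.zip` or `.tar.gz`). `unpack_all` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def unpack_all(folder):
    """Extract every recognized archive in `folder` into that same folder."""
    folder = Path(folder)
    # Collect the extensions shutil can unpack (e.g. ".zip", ".tar", ".tar.gz").
    extensions = tuple(
        ext for _, exts, _ in shutil.get_unpack_formats() for ext in exts
    )
    unpacked = []
    # sorted() snapshots the listing, so newly extracted files are not revisited.
    for archive in sorted(folder.iterdir()):
        if archive.is_file() and archive.name.endswith(extensions):
            shutil.unpack_archive(str(archive), str(folder))
            unpacked.append(archive.name)
    return unpacked
```

The function returns the names of the archives it extracted, which makes it easy to confirm that none were skipped.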
- We use a resized version of PadChest, which can be downloaded here.
- Unzip `images-224.tar`.
- In `clinicaldg/cxr/Constants.py`, update `image_paths` to point to each of the four directories that you downloaded.
- Run `python -m clinicaldg.cxr.preprocess.preprocess`.
- (Optional) If you are training a lot of models, it might be faster to cache all images to binary 224x224 files on disk. In this case, you should update the `cache_dir` path in `clinicaldg/cxr/Constants.py` and then run `python -m clinicaldg.cxr.preprocess.cache_data`, optionally parallelizing over `--env_id {0, 1, 2, 3}` for speed. To use the cached files, pass `--use_cache 1` to `train.py` or `sweep.py`.
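One way to parallelize the optional caching step is to launch one process per environment. The sketch below uses only the module path and the `--env_id` flag quoted above. The helper names are hypothetical, and it assumes the script is run from the repository root with `cache_dir` already set.

```python
import subprocess
import sys


def cache_commands(env_ids=(0, 1, 2, 3)):
    """Build one cache_data command per environment ID."""
    return [
        [sys.executable, "-m", "clinicaldg.cxr.preprocess.cache_data",
         "--env_id", str(i)]
        for i in env_ids
    ]


def run_parallel(commands):
    """Start all commands at once, wait for them, and return their exit codes."""
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [p.wait() for p in procs]
```

Calling `run_parallel(cache_commands())` starts the four caching jobs together. A non-zero entry in the returned list indicates that the corresponding environment failed and should be rerun.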
- Update `mnist_dir` in `clinicaldg/scripts/download.py`.
- Run `python -m clinicaldg.scripts.download`.