- Obtain access to the MIMIC-CXR-JPG Database on PhysioNet and download the dataset. We recommend downloading from the GCP bucket:

      gcloud auth login
      mkdir MIMIC-CXR-JPG
      gsutil -m rsync -d -r gs://mimic-cxr-jpg-2.0.0.physionet.org MIMIC-CXR-JPG

- To obtain gender information for each patient, you will also need access to MIMIC-IV. Download `core/patients.csv.gz` and place the file in the `MIMIC-CXR-JPG` directory.
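Once the patients file is in place, a quick read can confirm that it is intact and carries the gender field before preprocessing starts. The sketch below is illustrative and not part of the repository: the helper name `load_gender_map` is hypothetical, and it assumes the MIMIC-IV patients table has `subject_id` and `gender` columns.

```python
import csv
import gzip


def load_gender_map(patients_csv_gz):
    """Return {subject_id: gender} read from a gzipped patients table.

    Assumption: the file has 'subject_id' and 'gender' columns.
    """
    gender_by_subject = {}
    # Text mode with newline="" lets the csv module handle line endings itself.
    with gzip.open(patients_csv_gz, mode="rt", newline="") as handle:
        for row in csv.DictReader(handle):
            gender_by_subject[int(row["subject_id"])] = row["gender"]
    return gender_by_subject


# Usage (path follows the placement described above):
#     genders = load_gender_map("MIMIC-CXR-JPG/patients.csv.gz")
#     print(len(genders), "patients with gender information")
```

A `KeyError` from this helper would indicate that the downloaded file does not have the assumed column names.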
- Sign up with your email address here.
- Download either the original or the downsampled dataset (we recommend the downsampled version, `CheXpert-v1.0-small.zip`) and extract it.
- Download the `images` folder and `Data_Entry_2017_v2020.csv` from the NIH website.
- Unzip all of the files in the `images` folder.
- We use a resized version of PadChest, which can be downloaded here.
- Unzip `images-224.tar`.
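Both the NIH `images` folder and the PadChest download have to be unpacked before use. A minimal sketch for extracting every archive in a folder with Python's standard library follows. The helper name `extract_all` is hypothetical, and the sketch assumes the archives are in a format that `shutil.unpack_archive` recognizes (for example `.tar`, `.tar.gz`, or `.zip`).

```python
import shutil
from pathlib import Path


def extract_all(folder, dest=None):
    """Extract every recognized archive found directly inside `folder`.

    Files whose extension shutil does not recognize are skipped.
    Returns the list of archive paths that were extracted.
    """
    folder = Path(folder)
    dest = Path(dest) if dest is not None else folder
    # Collect the extensions shutil can unpack, e.g. '.zip', '.tar', '.tar.gz'.
    known_suffixes = tuple(
        ext for _, exts, _ in shutil.get_unpack_formats() for ext in exts
    )
    extracted = []
    for archive in sorted(folder.iterdir()):
        if archive.is_file() and archive.name.endswith(known_suffixes):
            shutil.unpack_archive(str(archive), str(dest))
            extracted.append(archive)
    return extracted


# Usage (the folder path is a placeholder for wherever you downloaded the data):
#     extract_all("path/to/images")
```

By default the contents are extracted next to the archives. Pass `dest` to unpack somewhere else.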
- In `clinicaldg/cxr/Constants.py`, update `image_paths` to point to each of the four directories that you downloaded.
- Run `python -m clinicaldg.cxr.preprocess.preprocess`.
- (Optional) If you are training a lot of models, it might be faster to cache all images to binary 224x224 files on disk. In this case, you should update the `cache_dir` path in `clinicaldg/cxr/Constants.py` and then run `python -m clinicaldg.cxr.preprocess.cache_data`, optionally parallelizing over `--env_id {0, 1, 2, 3}` for speed. To use the cached files, pass `--use_cache 1` to `train.py` or `sweep.py`.
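The optional caching step can be parallelized with a small launcher script. The sketch below starts one `cache_data` process per `--env_id` value and waits for all of them to finish. The helper names are hypothetical, and it assumes the script is run from the repository root with `cache_dir` already set in `clinicaldg/cxr/Constants.py`.

```python
import subprocess
import sys

ENV_IDS = (0, 1, 2, 3)  # the four --env_id values mentioned in the caching step


def build_cache_commands(env_ids=ENV_IDS, python=sys.executable):
    """Return one cache_data command (as an argument list) per environment."""
    return [
        [python, "-m", "clinicaldg.cxr.preprocess.cache_data", "--env_id", str(i)]
        for i in env_ids
    ]


def run_in_parallel(commands):
    """Start every command at once, wait for all, and return their exit codes."""
    processes = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in processes]


# Usage (from the repository root):
#     exit_codes = run_in_parallel(build_cache_commands())
#     failed = [i for i, code in zip(ENV_IDS, exit_codes) if code != 0]
```

Running the four environments as separate processes keeps them independent, so a failure in one does not interrupt the others. A nonzero exit code identifies which `--env_id` to rerun.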