⚠️ Data Access Notice: MIMIC-III data is not included in this repository. Access requires:
- Completing CITI training at citiprogram.org
- Submitting a credentialing application at physionet.org
- Downloading MIMIC-III v1.4 from physionet.org/content/mimiciii/1.4
Place the downloaded `.csv.gz` files in the `data/` directory before running the notebook. A demo subset (~100 patients) is available at physionet.org/content/mimiciii-demo/1.4 for testing without full access.
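
A minimal sketch of loading one table from `data/`, assuming the files keep their PhysioNet names (for example `PATIENTS.csv.gz`). The helper name `load_table` is hypothetical and is not part of `utils/mimic_utils.py`. pandas infers gzip compression from the `.gz` suffix, so compressed and decompressed files can be read the same way.

```python
from pathlib import Path

import pandas as pd


def load_table(name: str, data_dir: str = "data", **read_csv_kwargs) -> pd.DataFrame:
    """Load a MIMIC-III table, preferring NAME.csv.gz and falling back to NAME.csv.

    Hypothetical helper for illustration; not part of this repository.
    """
    base = Path(data_dir)
    for candidate in (base / f"{name}.csv.gz", base / f"{name}.csv"):
        if candidate.exists():
            # read_csv defaults to compression="infer", which detects gzip from ".gz".
            return pd.read_csv(candidate, **read_csv_kwargs)
    raise FileNotFoundError(f"Neither {name}.csv.gz nor {name}.csv found in {base}")
```

For the large tables it usually helps to pass `usecols=[...]` so that only the needed columns are read into memory.
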
Predict in-hospital mortality using the first 24 hours of an ICU stay.
- Why this task? It is distinct from readmission and length-of-stay prediction. Mortality prediction within the first 24h is clinically actionable: it guides care escalation decisions, ICU resource allocation, and palliative care discussions.
- Data: MIMIC-III v1.4 (Beth Israel Deaconess Medical Center ICU data)
- Cohort: First ICU stay, age ≥ 18, ICU LOS ≥ 24 hours (~46K patients)
- Outcome: Hospital expire flag (binary: 0 = survived, 1 = died in hospital)
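
The cohort criteria above can be sketched in pandas as follows. This is an illustration on toy data, not the notebook's code: the function name `select_cohort` is hypothetical, the order of the filters (first stay, then age and LOS) is an assumption, and age is approximated by a year difference to sidestep the timedelta overflow caused by the shifted birth dates of patients older than 89.

```python
import pandas as pd


def select_cohort(patients: pd.DataFrame, admissions: pd.DataFrame,
                  icustays: pd.DataFrame) -> pd.DataFrame:
    """One row per patient: first ICU stay, age >= 18, ICU LOS >= 24 hours."""
    # Keep only the earliest ICU stay of each patient.
    stays = (icustays.sort_values(["SUBJECT_ID", "INTIME"])
                     .drop_duplicates("SUBJECT_ID", keep="first"))
    stays = stays.merge(patients[["SUBJECT_ID", "DOB"]], on="SUBJECT_ID")
    stays = stays.merge(admissions[["HADM_ID", "HOSPITAL_EXPIRE_FLAG"]], on="HADM_ID")
    # Approximate age (within one year) from calendar years only.
    stays["AGE"] = stays["INTIME"].dt.year - stays["DOB"].dt.year
    # LOS in ICUSTAYS is in fractional days, so 24 hours corresponds to 1.0.
    keep = (stays["AGE"] >= 18) & (stays["LOS"] >= 1.0)
    return stays.loc[keep].reset_index(drop=True)


# Toy tables using MIMIC-III column names.
patients = pd.DataFrame({
    "SUBJECT_ID": [1, 2, 3, 4],
    "DOB": pd.to_datetime(["2100-01-01", "2140-06-01", "1850-01-01", "2090-01-01"]),
})
admissions = pd.DataFrame({
    "HADM_ID": [10, 11, 20, 30, 40],
    "HOSPITAL_EXPIRE_FLAG": [0, 0, 0, 1, 0],
})
icustays = pd.DataFrame({
    "SUBJECT_ID": [1, 1, 2, 3, 4],
    "HADM_ID": [10, 11, 20, 30, 40],
    "ICUSTAY_ID": [100, 101, 200, 300, 400],
    "INTIME": pd.to_datetime(["2150-01-01", "2151-01-01", "2150-01-01",
                              "2150-03-01", "2150-05-01"]),
    "LOS": [2.5, 3.0, 4.0, 1.2, 0.5],
})

cohort = select_cohort(patients, admissions, icustays)
```

In the toy data, patient 2 is excluded for age, patient 4 for a stay shorter than 24 hours, and patient 1 contributes only the earlier of two stays.
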
| # | Model | Type | Notes |
|---|---|---|---|
| 1 | SVM (RBF kernel) | Classical ML | Baseline; PCA-reduced features |
| 2 | Decision Tree (pruned) | Classical ML | Cost-complexity pruning via CV |
| 3 | Cox Proportional Hazards | Survival | Time-to-event; interpretable HRs |
| 4 | LSTM (Bidirectional + Attention) | Deep Learning | Temporal vital sign sequences |
| 5 | Transformer Encoder (CLS token) | Deep Learning | Self-attention on 12 time steps |
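
As an illustration of the "cost-complexity pruning via CV" note for model 2, the sketch below picks `ccp_alpha` by cross-validation with scikit-learn. It runs on synthetic data from `make_classification`; the grid size, the 5 folds, and the AUROC scoring are assumptions, not settings taken from the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the feature matrix (about 10% positives).
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           weights=[0.9, 0.1], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Candidate alphas come from the pruning path of a fully grown tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
# Drop the last alpha (it prunes down to the root) and clip tiny negative round-off.
alphas = np.unique(np.clip(path.ccp_alphas[:-1], 0.0, None))
# Thin the grid to roughly 20 values to keep the search cheap.
alphas = alphas[:: max(1, len(alphas) // 20)]

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"ccp_alpha": alphas},
    scoring="roc_auc",
    cv=5,
)
search.fit(X_train, y_train)
pruned_tree = search.best_estimator_  # refit on the full training split
```

Larger values of `ccp_alpha` remove more of the tree, so the cross-validated choice trades training fit against tree size.
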
icu-mortality-prediction/
├── mortality_prediction.ipynb ← Main notebook (all sections)
├── utils/
│ ├── mimic_utils.py ← Feature engineering utilities
│ └── __init__.py
├── data/ ← Place MIMIC-III CSV files here
├── outputs/
│ ├── figures/ ← All saved plots
│ └── models/ ← Saved model checkpoints
├── requirements.txt
└── README.md
# 1. Create virtual environment
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
# 2. Install dependencies
pip install -r requirements.txt
# 3. Place MIMIC-III CSV files in data/
# Required files:
# PATIENTS.csv
# ADMISSIONS.csv
# ICUSTAYS.csv
# CHARTEVENTS.csv (large — ~33GB uncompressed)
# LABEVENTS.csv (large — ~10GB uncompressed)
# DIAGNOSES_ICD.csv
# 4. Launch Jupyter
jupyter notebook mortality_prediction.ipynb

- Demographics: age, gender
- Admission type: EMERGENCY / ELECTIVE / URGENT
- Ethnicity (5 categories)
- ICU unit type (MICU, SICU, CCU, CSRU, etc.)
| Vital | Item IDs | Stats |
|---|---|---|
| Heart rate | 211, 220045 | min, max, mean, std |
| Systolic BP | 51, 442, 220179... | min, max, mean, std |
| Diastolic BP | 8368, 220180... | min, max, mean, std |
| Temperature (°C) | 223762, 676, 223761, 678 | min, max, mean, std |
| SpO2 | 646, 220277 | min, max, mean, std |
| Respiratory rate | 615, 618, 220210... | min, max, mean, std |
| GCS total | 198, 226755 | min, max, mean, std |
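
The per-stay statistics in the table can be sketched as a filter on `ITEMID`, a restriction to the first 24 hours after `INTIME`, and a group-by aggregation. The function names below are hypothetical. The split of the temperature items into Celsius (676, 223762) and Fahrenheit (678, 223761) is an assumption that should be checked against `D_ITEMS` before use.

```python
import pandas as pd

HEART_RATE_IDS = [211, 220045]
TEMP_C_IDS = [676, 223762]
TEMP_F_IDS = [678, 223761]  # assumption: these two items are charted in Fahrenheit


def fahrenheit_to_celsius(chartevents: pd.DataFrame) -> pd.DataFrame:
    """Convert Fahrenheit temperature rows to Celsius; other rows are unchanged."""
    out = chartevents.copy()
    is_f = out["ITEMID"].isin(TEMP_F_IDS)
    out.loc[is_f, "VALUENUM"] = (out.loc[is_f, "VALUENUM"] - 32.0) * 5.0 / 9.0
    return out


def vital_stats(chartevents, icustays, item_ids, name, hours=24):
    """min/max/mean/std of one vital sign over the first `hours` of each ICU stay."""
    ev = chartevents[chartevents["ITEMID"].isin(item_ids)]
    ev = ev.merge(icustays[["ICUSTAY_ID", "INTIME"]], on="ICUSTAY_ID")
    elapsed = ev["CHARTTIME"] - ev["INTIME"]
    in_window = (elapsed >= pd.Timedelta(0)) & (elapsed < pd.Timedelta(hours=hours))
    stats = (ev.loc[in_window]
               .groupby("ICUSTAY_ID")["VALUENUM"]
               .agg(["min", "max", "mean", "std"]))
    return stats.add_prefix(f"{name}_")


# Toy data: the third heart-rate reading is 30 hours after admission and is ignored.
icustays = pd.DataFrame({"ICUSTAY_ID": [100],
                         "INTIME": pd.to_datetime(["2150-01-01 08:00"])})
chartevents = pd.DataFrame({
    "ICUSTAY_ID": [100, 100, 100, 100],
    "ITEMID": [220045, 211, 220045, 678],
    "CHARTTIME": pd.to_datetime(["2150-01-01 09:00", "2150-01-01 13:00",
                                 "2150-01-02 14:00", "2150-01-01 10:00"]),
    "VALUENUM": [80.0, 100.0, 120.0, 98.6],
})

hr = vital_stats(chartevents, icustays, HEART_RATE_IDS, "hr")
temp = vital_stats(fahrenheit_to_celsius(chartevents), icustays,
                   TEMP_C_IDS + TEMP_F_IDS, "temp")
```

Stays with a single reading get `NaN` for the standard deviation, so the downstream imputation step needs to handle that column.
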
Labs: Creatinine, BUN, WBC, Hemoglobin, Sodium, Glucose, Bicarbonate, Lactate, Potassium, Bilirubin (min, max, mean per stay)
Diagnoses: Top-50 ICD-9 diagnosis codes as binary bag-of-codes features
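
A binary bag-of-codes encoding over the most frequent ICD-9 codes might look like the sketch below. The function name `bag_of_codes` and the `dx_` column prefix are hypothetical; the column names follow `DIAGNOSES_ICD`. Codes are kept as strings so that leading zeros survive.

```python
import pandas as pd


def bag_of_codes(diagnoses: pd.DataFrame, top_k: int = 50) -> pd.DataFrame:
    """One row per admission, one 0/1 column per frequent ICD-9 code."""
    dx = (diagnoses.dropna(subset=["ICD9_CODE"])
                   .drop_duplicates(["HADM_ID", "ICD9_CODE"]))
    # Rank codes by the number of admissions in which they appear.
    top_codes = dx["ICD9_CODE"].value_counts().head(top_k).index
    dx = dx[dx["ICD9_CODE"].isin(top_codes)]
    features = pd.crosstab(dx["HADM_ID"], dx["ICD9_CODE"]).clip(upper=1)
    return features.add_prefix("dx_")


# Toy example with top_k=2: code 25000 is too rare to get a column.
diagnoses = pd.DataFrame({
    "HADM_ID": [10, 10, 11, 11, 12, 12],
    "ICD9_CODE": ["4019", "4280", "4019", "4280", "4019", "25000"],
})
dx_features = bag_of_codes(diagnoses, top_k=2)
```

Admissions with none of the selected codes do not appear in the result, so reindexing on the cohort's `HADM_ID` values with `fill_value=0` is needed before joining with the other features.
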
| Section | Content |
|---|---|
| 1 | Environment setup, imports, reproducibility |
| 2 | MIMIC-III data loading, cohort selection |
| 3 | Feature engineering (static, vitals, labs, diagnoses) |
| 4 | EDA — outcome distribution, feature distributions, heatmap |
| 5 | PCA explained variance + scatter, UMAP projection |
| 6 | Train/test split, SMOTE class balancing |
| 7 | SVM (RBF) baseline |
| 8 | Decision Tree with cost-complexity pruning |
| 9 | Cox Proportional Hazards + KM curves |
| 10 | Bidirectional LSTM with temporal attention |
| 11 | Transformer Encoder with CLS token |
| 12 | ROC curves, PR curves, calibration, confusion matrices, results table |
| 15 | Key takeaways, feature importance, hazard ratios |
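
Sections 6 and 7 can be sketched with scikit-learn alone, as below. This is an illustration on synthetic data, not the notebook's code. Two choices are assumptions: PCA keeps 95% of the variance, and `class_weight="balanced"` stands in for SMOTE so that the example needs no extra dependency. When SMOTE is used, it should be fitted on the training split only.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 123-feature matrix with about 11% positives.
X, y = make_classification(n_samples=1500, n_features=123, n_informative=20,
                           n_clusters_per_class=1, weights=[0.89, 0.11],
                           random_state=42)

# Stratify so that both splits keep the same positive rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

svm_baseline = make_pipeline(
    StandardScaler(),                                # RBF kernels are scale-sensitive
    PCA(n_components=0.95, svd_solver="full"),       # keep 95% of the variance
    SVC(kernel="rbf", class_weight="balanced"),      # reweighting instead of SMOTE
)
svm_baseline.fit(X_train, y_train)

# decision_function gives a continuous score, which is all AUROC needs.
test_auroc = roc_auc_score(y_test, svm_baseline.decision_function(X_test))
```

Putting the scaler and PCA inside the pipeline means they are fitted on the training split only, which avoids leaking test-set statistics into the model.
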
- AUROC — primary metric (area under ROC curve); threshold-independent
- AUPRC — especially important given class imbalance (~10–15% mortality)
- F1 Score — at default 0.5 threshold
- Calibration — reliability diagram; well-calibrated models are safer clinically
- Concordance Index (C-statistic) — for Cox PH model
- Confusion Matrix — false negatives (missed deaths) are clinically costly
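
The metrics above can be computed from predicted probabilities with scikit-learn, roughly as follows. The function name `evaluate` is hypothetical, and `average_precision_score` is used as the usual step-wise estimate of AUPRC.

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             f1_score, roc_auc_score)


def evaluate(y_true, y_prob, threshold=0.5):
    """Threshold-free and thresholded metrics for a binary mortality classifier."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    y_pred = (y_prob >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    # Points of the reliability diagram: observed rate vs. mean predicted risk per bin.
    frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=10)
    return {
        "auroc": roc_auc_score(y_true, y_prob),
        "auprc": average_precision_score(y_true, y_prob),
        "f1": f1_score(y_true, y_pred),
        "false_negatives": int(fn),  # missed deaths
        "calibration": (mean_pred, frac_pos),
    }


metrics = evaluate([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

On this four-sample toy input the patient scored 0.35 falls below the 0.5 threshold and is counted as a false negative, even though the ranking-based metrics are unaffected by the threshold.
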
Cohort: 32,575 ICU stays · 10.9% mortality rate · 123 features
| Model | AUROC | AUPRC | F1 | Accuracy |
|---|---|---|---|---|
| Transformer Encoder | 0.7683 | 0.3309 | 0.3891 | 0.8195 |
| SVM (RBF) | 0.7502 | 0.2458 | 0.3687 | 0.7450 |
| LSTM | 0.7479 | 0.3067 | 0.3721 | 0.8196 |
| Cox PH (C-index) | 0.7238 | — | — | — |
| Decision Tree | 0.6955 | 0.2565 | 0.3036 | 0.7050 |
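
Two reference points help when reading the table, assuming the test split has roughly the cohort's 10.9% mortality rate: a classifier that always predicts survival reaches about 89.1% accuracy, and a random ranking has an expected AUPRC close to the prevalence. The arithmetic is shown below.

```python
prevalence = 0.109            # mortality rate reported for the cohort
transformer_auprc = 0.3309    # best AUPRC in the results table

majority_accuracy = 1 - prevalence     # always predict "survived"
chance_auprc = prevalence              # expected AUPRC of a random ranking
auprc_lift = transformer_auprc / chance_auprc

print(f"majority-class accuracy: {majority_accuracy:.3f}")
print(f"chance-level AUPRC:      {chance_auprc:.3f}")
print(f"Transformer AUPRC lift:  {auprc_lift:.2f}x")
```

Accuracy is therefore best read alongside AUROC and AUPRC rather than on its own.
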
- Johnson et al. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data.
- Wang et al. (2020). MIMIC-Extract: A data extraction, preprocessing, and representation pipeline for MIMIC-III. CHIL.
- Harutyunyan et al. (2019). Multitask learning and benchmarking with clinical time series data. Scientific Data.
- Rajpurkar et al. (2017). CheXNet: Radiologist-level pneumonia detection on chest X-rays. arXiv.