Evaluating Calibration Methods for Breast Cancer Mortality Prediction
A rigorous evaluation of probability calibration methods for machine learning models predicting 5-year mortality in breast cancer patients using the METABRIC dataset.
Modern ML models can achieve high discrimination (AUC) but their probability estimates are often poorly calibrated. In clinical settings, uncalibrated probabilities cannot be trusted for risk-based decision-making.
Implements and evaluates multiple calibration methods (Platt Scaling, Temperature Scaling, Conformal Prediction) across two model architectures (XGBoost, Neural Networks) on the METABRIC breast cancer dataset (N=1,492).
Key Finding: Post-hoc calibration substantially reduced miscalibration for neural networks; point-estimate ECE approached zero under fixed-bin evaluation, though uncertainty is high due to low event count (4% event rate, ~12 test events). Discrimination preserved (AUC maintained).
Outcome: Death within 5 years of diagnosis
Event Rate: 4.0% (60/1,492 patients)
Split: 60% train / 20% validation / 20% test (stratified)
| Model | Calibration | ECE ↓ | Brier ↓ | AUC |
|---|---|---|---|---|
| Neural Network | None | 0.5772 | 0.4390 | 0.596 |
| Neural Network | Platt Scaling | 0.0001† | 0.0385 | 0.596 |
| Neural Network | Temperature (T=10) | 0.4760 | 0.2655 | 0.596 |
| XGBoost | None | 0.0250 | 0.0406 | 0.634 |
| XGBoost | Platt Scaling | 0.0009† | 0.0385 | 0.634 |

† Point estimate under fixed-bin evaluation; high variance due to sparse events.
Calibration Metrics:
- ECE = Expected Calibration Error (10 bins, equal-frequency binning; see the sketch after this list)
- Brier = Brier Score (reliability + resolution decomposition)
- Results reported on held-out test set (N=299, ~12 events)
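For reference, a minimal sketch of the equal-frequency ECE computation described above (the function name and signature are illustrative, not the project's `compute_all_metrics` API):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-frequency bins: |observed rate - mean prediction| per bin, weighted by bin size."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Equal-frequency bin edges from the empirical quantiles of the predictions
    edges = np.quantile(y_prob, np.linspace(0, 1, n_bins + 1))
    bin_ids = np.clip(np.searchsorted(edges, y_prob, side="right") - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece
```

With roughly 30 test patients per bin but only ~12 test events in total, most bins contain zero or one event, which is why these point estimates should be read with caution.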
Important Context:
- Substantial ECE reduction reflects correction of severe initial miscalibration combined with sparse-bin effects
- With only 60 total events, calibration metrics have high variance
- Bootstrap CIs implemented in comprehensive analysis (see CALIBRATION_RIGOR.md); main pipeline reports point estimates only (a minimal bootstrap sketch follows this list)
- Reliability diagrams use 10 bins; finer binning creates sparser bins
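A minimal sketch of a percentile bootstrap CI for a calibration metric, resampling test patients with replacement (illustrative; the implementation in comprehensive_calibration_analysis.py may differ):

```python
import numpy as np

def bootstrap_ci(y_true, y_prob, metric_fn, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample patients with replacement and recompute the metric."""
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        # With ~12 events, many resamples contain very few events;
        # the resulting spread is exactly the uncertainty these CIs expose.
        stats.append(metric_fn(y_true[idx], y_prob[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Example: ci = bootstrap_ci(y_test, calibrated_probs, expected_calibration_error)
```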
Neural Network - Before/After Calibration:
Uncalibrated NN shows severe miscalibration (points far from diagonal); Platt Scaling corrects this.
XGBoost - Before/After Calibration:
XGBoost is better calibrated initially but still benefits from post-hoc calibration.
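A minimal sketch of how such before/after reliability diagrams can be drawn with scikit-learn and matplotlib (the project's own plotting code in `src/evaluation/` may differ):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

def plot_reliability(y_true, prob_uncal, prob_cal, n_bins=10):
    """Overlay uncalibrated and calibrated reliability curves against the diagonal."""
    fig, ax = plt.subplots(figsize=(5, 5))
    for probs, label in [(prob_uncal, "Uncalibrated"), (prob_cal, "Platt-scaled")]:
        obs, pred = calibration_curve(y_true, probs, n_bins=n_bins, strategy="quantile")
        ax.plot(pred, obs, marker="o", label=label)
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Perfect calibration")
    ax.set_xlabel("Mean predicted probability")
    ax.set_ylabel("Observed event frequency")
    ax.legend()
    return fig
```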
ROC curves (test set): XGBoost (AUC = 0.634) | Neural Network (AUC = 0.596)
Calibration preserves discrimination - AUC unchanged by monotone transformations.
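A toy illustration of this point (synthetic numbers, not METABRIC data): any strictly increasing transform of the scores preserves their ranking, so the ROC curve and AUC are unchanged.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 1, 0, 1, 0, 0, 1])
p = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.20, 0.50, 0.90])

# A Platt-style map q = 1 / (1 + exp(A*p + B)) with A < 0 is strictly increasing in p
q = 1.0 / (1.0 + np.exp(-3.0 * p + 1.0))

assert roc_auc_score(y, p) == roc_auc_score(y, q)  # ranking, and hence AUC, is identical
```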
Strict separation enforced across train/validation/test:
Fitted on Training Set Only:
- Missing data imputation (median/mode)
- Feature standardization (mean/std)
- Model training (XGBoost, Neural Network)
Fitted on Validation Set Only:
- Calibration methods (Platt Scaling parameters A/B, Temperature T)
- Conformal prediction quantiles
Test Set:
- Never seen during any fitting step
- Used only for final evaluation
- All metrics reported on test set
Implementation Details:
- Stratified splitting by outcome + ER status + tumor stage
- Preprocessing pipeline frozen after training fit
- See `src/data/splitter.py:87-124` and `src/data/preprocessor.py:45-89` for details; a minimal splitting/preprocessing sketch follows this list
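A minimal sketch of this leakage-free workflow with scikit-learn; the helper name, column names, and median-only imputation are illustrative and simpler than the actual splitter/preprocessor modules.

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def split_and_preprocess(df, feature_cols, outcome_col="death_within_5y", seed=42):
    # Composite stratification key: outcome + ER status + tumor stage (illustrative column names;
    # very rare strata may need to be merged before stratifying)
    strata = (df[outcome_col].astype(str) + "_" + df["er_status"].astype(str)
              + "_" + df["tumor_stage"].astype(str))
    train_df, temp_df = train_test_split(df, test_size=0.4, stratify=strata, random_state=seed)
    val_df, test_df = train_test_split(temp_df, test_size=0.5,
                                       stratify=strata.loc[temp_df.index], random_state=seed)

    # Imputation and standardization are fitted on the training split only, then frozen
    prep = Pipeline([("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())])
    X_train = prep.fit_transform(train_df[feature_cols])
    X_val = prep.transform(val_df[feature_cols])    # transform only: no refitting on validation
    X_test = prep.transform(test_df[feature_cols])  # test set never seen during any fitting step
    return (X_train, train_df[outcome_col]), (X_val, val_df[outcome_col]), (X_test, test_df[outcome_col])
```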
- Low event rate (4%) makes calibration metrics noisy
- Only 60 events total; ~12 events in test set
- Reliability diagrams have sparse bins
- Bootstrap confidence intervals implemented in comprehensive_calibration_analysis.py; main pipeline reports point estimates only
- Outcome is 5-year mortality (overall survival), not recurrence-free survival
- Recurrence data not available in this METABRIC format
- Clinical scenario examples in documentation should reference mortality risk, not recurrence risk
- Single dataset (METABRIC); no external validation performed
- Patients excluded if <5 year follow-up without event (412/1904 excluded)
- Missing data handled by simple imputation (median/mode)
- Limited to key cancer genes; full genomic profiles not used
- ECE can be unstable with low event rates and fixed binning
- Platt Scaling fit on validation set (N=298, ~12 events)
- Temperature Scaling may underperform with small validation sets
- No recalibration for distribution shift or temporal drift
```bash
git clone https://github.com/bcfeen/OncoCalibrate.git
cd OncoCalibrate

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```bash
# 1. Download METABRIC data from Kaggle
export KAGGLE_API_KEY='your_key_here'
python scripts/download_metabric_kaggle.py

# 2. Run full pipeline (trains models, applies calibration, generates figures)
python scripts/run_pipeline.py

# Results saved to:
# - results/models/  (trained models)
# - results/figures/ (reliability diagrams, ROC curves)
# - results/metrics/ (performance metrics)
```

Quick test run (synthetic data, no hyperparameter tuning):

```bash
python scripts/run_pipeline.py --use-synthetic --skip-tuning
```
1. Platt Scaling (Logistic Calibration)

```python
# Fits: q = 1 / (1 + exp(A*p + B))
calibrator = PlattScaling()
calibrator.fit(predictions_val, y_val)
calibrated = calibrator.predict(predictions_test)
```
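For intuition, Platt Scaling is just a two-parameter logistic fit on the validation predictions; a minimal sketch (illustrative, not the project's PlattScaling class):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class MinimalPlatt:
    """Fit q = 1 / (1 + exp(A*p + B)) by logistic regression on the raw model predictions."""

    def fit(self, p_val, y_val):
        p = np.asarray(p_val, dtype=float).reshape(-1, 1)
        # A large C makes this an effectively unregularized two-parameter fit
        self.lr_ = LogisticRegression(C=1e6)
        self.lr_.fit(p, np.asarray(y_val))
        return self

    def predict(self, p_test):
        p = np.asarray(p_test, dtype=float).reshape(-1, 1)
        return self.lr_.predict_proba(p)[:, 1]  # calibrated probabilities
```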
2. Temperature Scaling (Neural Networks)

```python
# Applies: q = sigmoid(logits / T)
temp_calibrator = TemperatureScaling()
optimal_T = temp_calibrator.fit(logits_val, y_val)
calibrated = temp_calibrator.predict(logits_test)
```
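For intuition, Temperature Scaling fits a single scalar T > 0 by minimizing the negative log-likelihood of the validation labels; a minimal sketch with scipy (illustrative, not the project's TemperatureScaling class):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits_val, y_val):
    """Return the T that minimizes binary cross-entropy of sigmoid(logits / T) on validation data."""
    logits = np.asarray(logits_val, dtype=float)
    y = np.asarray(y_val, dtype=float)

    def nll(T):
        p = np.clip(1.0 / (1.0 + np.exp(-logits / T)), 1e-12, 1 - 1e-12)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    return minimize_scalar(nll, bounds=(0.05, 100.0), method="bounded").x

# Usage: T = fit_temperature(logits_val, y_val); calibrated = 1 / (1 + np.exp(-logits_test / T))
```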
3. Conformal Prediction (Uncertainty Quantification)

```python
# Provides prediction intervals with coverage guarantees
# Note: uncertainty quantification under miscalibration, not calibration itself
conformal = ConformalPredictor(alpha=0.1)  # 90% coverage
conformal.fit(predictions_val, y_val)
intervals = conformal.predict(predictions_test)
```

Evaluation metrics:
- Calibration: ECE, MCE, Brier Score (with decomposition)
- Statistical Tests: Hosmer-Lemeshow, Calibration Slope (a slope/intercept sketch follows this list)
- Discrimination: AUC-ROC, Average Precision
- Visualization: Reliability diagrams, ROC curves
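The calibration slope and intercept listed above can be obtained by regressing the outcome on the logit of the predicted probabilities; a minimal sketch (illustrative; the project's implementation in `src/evaluation/` may differ). Perfect calibration corresponds to slope ≈ 1 and intercept ≈ 0.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibration_slope_intercept(y_true, y_prob, eps=1e-6):
    """Logistic recalibration fit: slope and intercept of the outcome on logit(predicted probability)."""
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    logit_p = np.log(p / (1 - p)).reshape(-1, 1)
    lr = LogisticRegression(C=1e12)  # effectively unpenalized
    lr.fit(logit_p, np.asarray(y_true))
    return float(lr.coef_[0, 0]), float(lr.intercept_[0])
```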
Models:
- XGBoost with hyperparameter tuning (Optuna)
- PyTorch Neural Network with logit access for temperature scaling
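Because Temperature Scaling operates on pre-sigmoid logits, the network must expose them; a minimal sketch of such a model (class name and layer sizes are illustrative, not the project's architecture):

```python
import torch
import torch.nn as nn

class MortalityNet(nn.Module):
    """Small feed-forward classifier that returns raw logits, enabling temperature scaling."""

    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.body(x).squeeze(-1)         # logits, not probabilities

    @torch.no_grad()
    def predict_proba(self, x):
        return torch.sigmoid(self.forward(x))   # probabilities for metrics and Platt Scaling
```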
```
OncoCalibrate/
├── config/                      # YAML configurations
│   ├── data_config.yaml         # Preprocessing, features, outcome
│   ├── model_config.yaml        # XGBoost/NN hyperparameters
│   └── eval_config.yaml         # Calibration methods, metrics
├── data/                        # Data storage (gitignored)
├── src/                         # Source code
│   ├── data/                    # Loading, preprocessing, splitting
│   ├── models/                  # XGBoost & Neural Network
│   ├── calibration/             # Platt, Temperature, Conformal
│   ├── evaluation/              # Metrics & visualization
│   └── utils/                   # Config, logging, seed management
├── scripts/                     # Executable scripts
│   ├── run_pipeline.py          # Full end-to-end pipeline
│   ├── download_metabric_kaggle.py
│   └── test_pipeline.py         # Module testing
└── results/                     # Generated outputs (gitignored)
```
Uncalibrated NN showed severe miscalibration (ECE = 0.577), confirming findings from Guo et al. (2017). Direct probability outputs cannot be trusted for clinical use without post-hoc calibration.
Despite being a simple 2-parameter method, Platt Scaling markedly improved calibration for both model types under the chosen evaluation scheme. Temperature Scaling (T=10) helped but was less effective, likely due to small validation set and class imbalance.
XGBoost showed better out-of-the-box calibration (ECE = 0.025) than the neural network (ECE = 0.577), consistent with the calibration literature.
All calibration methods preserved AUC-ROC (monotone transformations don't change ranking). No trade-off between calibration and discrimination.
With 4% event rate, calibration metrics are noisy and reliability diagrams have sparse bins. Larger samples needed for robust calibration assessment.
- RESULTS_SUMMARY.md - Detailed experimental results and analysis
- CALIBRATION_RIGOR.md - Rigor analysis: Bootstrap CIs, bin sensitivity, multiple baselines
- PROJECT_COMPLETE.md - Development timeline and deliverables
Train a model:

```python
from src.models.xgboost_model import XGBoostMortalityModel

model = XGBoostMortalityModel()
model.train(X_train, y_train, X_val, y_val)
predictions = model.predict_proba(X_test)
```

Apply calibration:

```python
from src.calibration.platt_scaling import PlattScaling

calibrator = PlattScaling()
calibrator.fit(predictions_val, y_val)
calibrated_probs = calibrator.predict(predictions_test)
```

Evaluate:

```python
from src.evaluation.metrics import compute_all_metrics

metrics = compute_all_metrics(y_test, calibrated_probs)
print(f"ECE: {metrics['ece']:.4f}")
print(f"Brier: {metrics['brier_score']:.4f}")
print(f"AUC: {metrics['auc_roc']:.3f}")
```

METABRIC (Molecular Taxonomy of Breast Cancer International Consortium)
- Source: Kaggle (raghadalharbi/breast-cancer-gene-expression-profiles-metabric)
- Patients: 1,904 total; 1,492 after exclusions (<5 year follow-up without event)
- Features: 49 total
- Clinical (18): Age, tumor characteristics, ER/PR/HER2 status, treatment
- Genetic (28): BRCA1/2, TP53, key cancer gene mutations
- Derived (4): Mutation burden, treatment intensity, PAM50 risk grouping
- Outcome: Death within 5 years of diagnosis
- Event Rate: 4.0% (60/1,492 patients)
Citation:
Curtis et al. (2012). "The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups." Nature 486:346-352.
- Bootstrap confidence intervals (see CALIBRATION_RIGOR.md)
- Bin sensitivity analysis (5/10/20 bins)
- Multiple baseline comparisons
- Isotonic regression and beta calibration baselines
- External validation on independent cohort
- Recalibration monitoring for temporal drift
- SHAP values for feature importance
- Survival analysis with time-to-event calibration
Calibration Methods:
- Guo et al. (2017). "On Calibration of Modern Neural Networks." ICML.
- Platt (1999). "Probabilistic outputs for support vector machines."
- Vovk et al. (2005). "Algorithmic Learning in a Random World."
Dataset:
- Curtis et al. (2012). "The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups." Nature 486:346-352.
Suggestions and improvements welcome via issues or pull requests. Particularly interested in:
- Bootstrap CI implementation for calibration metrics
- Isotonic/beta calibration implementations
- External validation datasets
MIT License - see LICENSE for details.
Ben Feeney
- GitHub: @bcfeen
- LinkedIn: Ben Feeney
Portfolio project demonstrating ML engineering, statistical rigor, and clinical ML expertise.
- METABRIC Consortium for dataset
- cBioPortal and Kaggle for data access
- Calibration research community
Note: This is a research/educational project. Not validated for clinical use.



