Evaluating Calibration Methods for Breast Cancer Mortality Prediction
A rigorous evaluation of probability calibration methods for machine learning models predicting 5-year mortality in breast cancer patients using the METABRIC dataset.
Modern ML models can achieve high discrimination (AUC) but their probability estimates are often poorly calibrated. In clinical settings, uncalibrated probabilities cannot be trusted for risk-based decision-making.
Implements and evaluates multiple calibration methods (Platt Scaling, Temperature Scaling, Conformal Prediction) across two model architectures (XGBoost, Neural Networks) on the METABRIC breast cancer dataset (N=1,492).
Key Finding: Post-hoc calibration substantially reduced miscalibration for neural networks; point-estimate ECE approached zero under fixed-bin evaluation, though uncertainty is high due to low event count (4% event rate, ~12 test events). Discrimination preserved (AUC maintained).
Outcome: Death within 5 years of diagnosis
Event Rate: 4.0% (60/1,492 patients)
Split: 60% train / 20% validation / 20% test (stratified)
| Model | Calibration | ECE ↓ | Brier ↓ | AUC |
|---|---|---|---|---|
| Neural Network | None | 0.5772 | 0.4390 | 0.596 |
| Neural Network | Platt Scaling | 0.0001† | 0.0385 | 0.596 |
| Neural Network | Temperature (T=10) | 0.4760 | 0.2655 | 0.596 |
| XGBoost | None | 0.0250 | 0.0406 | 0.634 |
| XGBoost | Platt Scaling | 0.0009† | 0.0385 | 0.634 |

† Point estimate under fixed-bin evaluation; high variance due to sparse events.
Calibration Metrics:
- ECE = Expected Calibration Error (10 bins, equal-frequency binning; see the sketch after this list)
- Brier = Brier Score (reliability + resolution decomposition)
- Results reported on held-out test set (N=299, ~12 events)
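For reference, a minimal sketch of the equal-frequency ECE computation described above (the function name and signature are illustrative, not the project's `compute_all_metrics` API):

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """ECE with equal-frequency bins: |observed rate - mean prediction| per bin, weighted by bin size."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Equal-frequency bin edges from the empirical quantiles of the predictions
    edges = np.quantile(y_prob, np.linspace(0, 1, n_bins + 1))
    bin_ids = np.clip(np.searchsorted(edges, y_prob, side="right") - 1, 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            ece += mask.mean() * abs(y_true[mask].mean() - y_prob[mask].mean())
    return ece
```

With roughly 30 test patients per bin but only ~12 test events in total, most bins contain zero or one event, which is why these point estimates should be read with caution.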
Important Context:
- Substantial ECE reduction reflects correction of severe initial miscalibration combined with sparse-bin effects
- With only 60 total events, calibration metrics have high variance
- Bootstrap CIs implemented in comprehensive analysis (see CALIBRATION_RIGOR.md); main pipeline reports point estimates only (a minimal bootstrap sketch follows this list)
- Reliability diagrams use 10 bins; finer binning creates sparser bins
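A minimal sketch of a percentile bootstrap CI for a calibration metric, resampling test patients with replacement (illustrative; the implementation in comprehensive_calibration_analysis.py may differ):

```python
import numpy as np

def bootstrap_ci(y_true, y_prob, metric_fn, n_boot=2000, alpha=0.05, seed=42):
    """Percentile bootstrap CI: resample patients with replacement and recompute the metric."""
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        # With ~12 events, many resamples contain very few events;
        # the resulting spread is exactly the uncertainty these CIs expose.
        stats.append(metric_fn(y_true[idx], y_prob[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return lo, hi

# Example: ci = bootstrap_ci(y_test, calibrated_probs, expected_calibration_error)
```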
Neural Network - Before/After Calibration:
Uncalibrated NN shows severe miscalibration (points far from diagonal); Platt Scaling corrects this.
XGBoost - Before/After Calibration:
XGBoost is better calibrated initially but still benefits from post-hoc calibration.
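A minimal sketch of how such before/after reliability diagrams can be drawn with scikit-learn and matplotlib (the project's own plotting code in `src/evaluation/` may differ):

```python
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

def plot_reliability(y_true, prob_uncal, prob_cal, n_bins=10):
    """Overlay uncalibrated and calibrated reliability curves against the diagonal."""
    fig, ax = plt.subplots(figsize=(5, 5))
    for probs, label in [(prob_uncal, "Uncalibrated"), (prob_cal, "Platt-scaled")]:
        obs, pred = calibration_curve(y_true, probs, n_bins=n_bins, strategy="quantile")
        ax.plot(pred, obs, marker="o", label=label)
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Perfect calibration")
    ax.set_xlabel("Mean predicted probability")
    ax.set_ylabel("Observed event frequency")
    ax.legend()
    return fig
```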
ROC curves (test set): XGBoost (AUC = 0.634) | Neural Network (AUC = 0.596)
Calibration preserves discrimination - AUC unchanged by monotone transformations.
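A toy illustration of this point (synthetic numbers, not METABRIC data): any strictly increasing transform of the scores preserves their ranking, so the ROC curve and AUC are unchanged.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y = np.array([0, 0, 1, 0, 1, 0, 0, 1])
p = np.array([0.10, 0.40, 0.35, 0.80, 0.70, 0.20, 0.50, 0.90])

# A Platt-style map q = 1 / (1 + exp(A*p + B)) with A < 0 is strictly increasing in p
q = 1.0 / (1.0 + np.exp(-3.0 * p + 1.0))

assert roc_auc_score(y, p) == roc_auc_score(y, q)  # ranking, and hence AUC, is identical
```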
Strict separation enforced across train/validation/test:
Fitted on Training Set Only:
- Missing data imputation (median/mode)
- Feature standardization (mean/std)
- Model training (XGBoost, Neural Network)
Fitted on Validation Set Only:
- Calibration methods (Platt Scaling parameters A/B, Temperature T)
- Conformal prediction quantiles
Test Set:
- Never seen during any fitting step
- Used only for final evaluation
- All metrics reported on test set
Implementation Details:
- Stratified splitting by outcome + ER status + tumor stage
- Preprocessing pipeline frozen after training fit
- See `src/data/splitter.py:87-124` and `src/data/preprocessor.py:45-89` for details; a minimal splitting/preprocessing sketch follows this list
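A minimal sketch of this leakage-free workflow with scikit-learn; the helper name, column names, and median-only imputation are illustrative and simpler than the actual splitter/preprocessor modules.

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def split_and_preprocess(df, feature_cols, outcome_col="death_within_5y", seed=42):
    # Composite stratification key: outcome + ER status + tumor stage (illustrative column names;
    # very rare strata may need to be merged before stratifying)
    strata = (df[outcome_col].astype(str) + "_" + df["er_status"].astype(str)
              + "_" + df["tumor_stage"].astype(str))
    train_df, temp_df = train_test_split(df, test_size=0.4, stratify=strata, random_state=seed)
    val_df, test_df = train_test_split(temp_df, test_size=0.5,
                                       stratify=strata.loc[temp_df.index], random_state=seed)

    # Imputation and standardization are fitted on the training split only, then frozen
    prep = Pipeline([("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())])
    X_train = prep.fit_transform(train_df[feature_cols])
    X_val = prep.transform(val_df[feature_cols])    # transform only: no refitting on validation
    X_test = prep.transform(test_df[feature_cols])  # test set never seen during any fitting step
    return (X_train, train_df[outcome_col]), (X_val, val_df[outcome_col]), (X_test, test_df[outcome_col])
```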
- Low event rate (4%) makes calibration metrics noisy
- Only 60 events total; ~12 events in test set
- Reliability diagrams have sparse bins
- Bootstrap confidence intervals implemented in comprehensive_calibration_analysis.py; main pipeline reports point estimates only
- Outcome is 5-year mortality (overall survival), not recurrence-free survival
- Recurrence data not available in this METABRIC format
- Clinical scenario examples in documentation should reference mortality risk, not recurrence risk
- Single dataset (METABRIC); no external validation performed
- Patients excluded if <5 year follow-up without event (412/1904 excluded)
- Missing data handled by simple imputation (median/mode)
- Limited to key cancer genes; full genomic profiles not used
- ECE can be unstable with low event rates and fixed binning
- Platt Scaling fit on validation set (N=298, ~12 events)
- Temperature Scaling may underperform with small validation sets
- No recalibration for distribution shift or temporal drift
```bash
git clone https://github.com/bcfeen/OncoCalibrate.git
cd OncoCalibrate

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```

```bash
# 1. Download METABRIC data from Kaggle
export KAGGLE_API_KEY='your_key_here'
python scripts/download_metabric_kaggle.py

# 2. Run full pipeline (trains models, applies calibration, generates figures)
python scripts/run_pipeline.py

# Results saved to:
# - results/models/  (trained models)
# - results/figures/ (reliability diagrams, ROC curves)
# - results/metrics/ (performance metrics)
```

Quick test run (synthetic data, no hyperparameter tuning):

```bash
python scripts/run_pipeline.py --use-synthetic --skip-tuning
```
1. Platt Scaling (Logistic Calibration)

```python
# Fits: q = 1 / (1 + exp(A*p + B))
calibrator = PlattScaling()
calibrator.fit(predictions_val, y_val)
calibrated = calibrator.predict(predictions_test)
```
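For intuition, Platt Scaling is just a two-parameter logistic fit on the validation predictions; a minimal sketch (illustrative, not the project's PlattScaling class):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class MinimalPlatt:
    """Fit q = 1 / (1 + exp(A*p + B)) by logistic regression on the raw model predictions."""

    def fit(self, p_val, y_val):
        p = np.asarray(p_val, dtype=float).reshape(-1, 1)
        # A large C makes this an effectively unregularized two-parameter fit
        self.lr_ = LogisticRegression(C=1e6)
        self.lr_.fit(p, np.asarray(y_val))
        return self

    def predict(self, p_test):
        p = np.asarray(p_test, dtype=float).reshape(-1, 1)
        return self.lr_.predict_proba(p)[:, 1]  # calibrated probabilities
```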
2. Temperature Scaling (Neural Networks)

```python
# Applies: q = sigmoid(logits / T)
temp_calibrator = TemperatureScaling()
optimal_T = temp_calibrator.fit(logits_val, y_val)
calibrated = temp_calibrator.predict(logits_test)
```
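For intuition, Temperature Scaling fits a single scalar T > 0 by minimizing the negative log-likelihood of the validation labels; a minimal sketch with scipy (illustrative, not the project's TemperatureScaling class):

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits_val, y_val):
    """Return the T that minimizes binary cross-entropy of sigmoid(logits / T) on validation data."""
    logits = np.asarray(logits_val, dtype=float)
    y = np.asarray(y_val, dtype=float)

    def nll(T):
        p = np.clip(1.0 / (1.0 + np.exp(-logits / T)), 1e-12, 1 - 1e-12)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    return minimize_scalar(nll, bounds=(0.05, 100.0), method="bounded").x

# Usage: T = fit_temperature(logits_val, y_val); calibrated = 1 / (1 + np.exp(-logits_test / T))
```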
3. Conformal Prediction (Uncertainty Quantification)

```python
# Provides prediction intervals with coverage guarantees
# Note: uncertainty quantification under miscalibration, not calibration itself
conformal = ConformalPredictor(alpha=0.1)  # 90% coverage
conformal.fit(predictions_val, y_val)
intervals = conformal.predict(predictions_test)
```

Evaluation metrics:
- Calibration: ECE, MCE, Brier Score (with decomposition)
- Statistical Tests: Hosmer-Lemeshow, Calibration Slope (a slope/intercept sketch follows this list)
- Discrimination: AUC-ROC, Average Precision
- Visualization: Reliability diagrams, ROC curves
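The calibration slope and intercept listed above can be obtained by regressing the outcome on the logit of the predicted probabilities; a minimal sketch (illustrative; the project's implementation in `src/evaluation/` may differ). Perfect calibration corresponds to slope ≈ 1 and intercept ≈ 0.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def calibration_slope_intercept(y_true, y_prob, eps=1e-6):
    """Logistic recalibration fit: slope and intercept of the outcome on logit(predicted probability)."""
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    logit_p = np.log(p / (1 - p)).reshape(-1, 1)
    lr = LogisticRegression(C=1e12)  # effectively unpenalized
    lr.fit(logit_p, np.asarray(y_true))
    return float(lr.coef_[0, 0]), float(lr.intercept_[0])
```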
Models:
- XGBoost with hyperparameter tuning (Optuna)
- PyTorch Neural Network with logit access for temperature scaling
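Because Temperature Scaling operates on pre-sigmoid logits, the network must expose them; a minimal sketch of such a model (class name and layer sizes are illustrative, not the project's architecture):

```python
import torch
import torch.nn as nn

class MortalityNet(nn.Module):
    """Small feed-forward classifier that returns raw logits, enabling temperature scaling."""

    def __init__(self, n_features, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(n_features, hidden), nn.ReLU(), nn.Dropout(0.2),
            nn.Linear(hidden, 1),
        )

    def forward(self, x):
        return self.body(x).squeeze(-1)         # logits, not probabilities

    @torch.no_grad()
    def predict_proba(self, x):
        return torch.sigmoid(self.forward(x))   # probabilities for metrics and Platt Scaling
```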
```
OncoCalibrate/
├── config/                      # YAML configurations
│   ├── data_config.yaml         # Preprocessing, features, outcome
│   ├── model_config.yaml        # XGBoost/NN hyperparameters
│   └── eval_config.yaml         # Calibration methods, metrics
├── data/                        # Data storage (gitignored)
├── src/                         # Source code
│   ├── data/                    # Loading, preprocessing, splitting
│   ├── models/                  # XGBoost & Neural Network
│   ├── calibration/             # Platt, Temperature, Conformal
│   ├── evaluation/              # Metrics & visualization
│   └── utils/                   # Config, logging, seed management
├── scripts/                     # Executable scripts
│   ├── run_pipeline.py          # Full end-to-end pipeline
│   ├── download_metabric_kaggle.py
│   └── test_pipeline.py         # Module testing
└── results/                     # Generated outputs (gitignored)
```
Uncalibrated NN showed severe miscalibration (ECE = 0.577), confirming findings from Guo et al. (2017). Direct probability outputs cannot be trusted for clinical use without post-hoc calibration.
Despite being a simple 2-parameter method, Platt Scaling markedly improved calibration for both model types under the chosen evaluation scheme. Temperature Scaling (T=10) helped but was less effective, likely due to small validation set and class imbalance.
XGBoost showed better out-of-the-box calibration (ECE = 0.025) than the neural network (ECE = 0.577), consistent with the calibration literature.
All calibration methods preserved AUC-ROC (monotone transformations don't change ranking). No trade-off between calibration and discrimination.
With 4% event rate, calibration metrics are noisy and reliability diagrams have sparse bins. Larger samples needed for robust calibration assessment.
- RESULTS_SUMMARY.md - Detailed experimental results and analysis
- CALIBRATION_RIGOR.md - Rigor analysis: Bootstrap CIs, bin sensitivity, multiple baselines
- PROJECT_COMPLETE.md - Development timeline and deliverables
Train a model:

```python
from src.models.xgboost_model import XGBoostMortalityModel

model = XGBoostMortalityModel()
model.train(X_train, y_train, X_val, y_val)
predictions = model.predict_proba(X_test)
```

Apply calibration:

```python
from src.calibration.platt_scaling import PlattScaling

calibrator = PlattScaling()
calibrator.fit(predictions_val, y_val)
calibrated_probs = calibrator.predict(predictions_test)
```

Evaluate:

```python
from src.evaluation.metrics import compute_all_metrics

metrics = compute_all_metrics(y_test, calibrated_probs)
print(f"ECE: {metrics['ece']:.4f}")
print(f"Brier: {metrics['brier_score']:.4f}")
print(f"AUC: {metrics['auc_roc']:.3f}")
```

METABRIC (Molecular Taxonomy of Breast Cancer International Consortium)
- Source: Kaggle (raghadalharbi/breast-cancer-gene-expression-profiles-metabric)
- Patients: 1,904 total; 1,492 after exclusions (<5 year follow-up without event)
- Features: 49 total
- Clinical (18): Age, tumor characteristics, ER/PR/HER2 status, treatment
- Genetic (28): BRCA1/2, TP53, key cancer gene mutations
- Derived (4): Mutation burden, treatment intensity, PAM50 risk grouping
- Outcome: Death within 5 years of diagnosis
- Event Rate: 4.0% (60/1,492 patients)
Citation:
Curtis et al. (2012). "The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups." Nature 486:346-352.
- Bootstrap confidence intervals (see CALIBRATION_RIGOR.md)
- Bin sensitivity analysis (5/10/20 bins)
- Multiple baseline comparisons
- Isotonic regression and beta calibration baselines
- External validation on independent cohort
- Recalibration monitoring for temporal drift
- SHAP values for feature importance
- Survival analysis with time-to-event calibration
Calibration Methods:
- Guo et al. (2017). "On Calibration of Modern Neural Networks." ICML.
- Platt (1999). "Probabilistic outputs for support vector machines."
- Vovk et al. (2005). "Algorithmic Learning in a Random World."
Dataset:
- Curtis et al. (2012). "The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups." Nature 486:346-352.
Suggestions and improvements welcome via issues or pull requests. Particularly interested in:
- Bootstrap CI implementation for calibration metrics
- Isotonic/beta calibration implementations
- External validation datasets
MIT License - see LICENSE for details.
Ben Feeney
- GitHub: @bcfeen
- LinkedIn: Ben Feeney
Portfolio project demonstrating ML engineering, statistical rigor, and clinical ML expertise.
- METABRIC Consortium for dataset
- cBioPortal and Kaggle for data access
- Calibration research community
Note: This is a research/educational project. Not validated for clinical use.



