OncoCalibrate

Evaluating Calibration Methods for Breast Cancer Mortality Prediction

Python 3.10+ License: MIT

A rigorous evaluation of probability calibration methods for machine learning models predicting 5-year mortality in breast cancer patients using the METABRIC dataset.


🎯 The Problem

Modern ML models can achieve high discrimination (AUC), but their probability estimates are often poorly calibrated. In clinical settings, uncalibrated probabilities cannot be trusted for risk-based decision-making.

💡 This Project

Implements and evaluates multiple calibration methods (Platt Scaling, Temperature Scaling, Conformal Prediction) across two model architectures (XGBoost, Neural Networks) on the METABRIC breast cancer dataset (N=1,492).

Key Finding: Post-hoc calibration substantially reduced miscalibration for neural networks; the point-estimate ECE approached zero under fixed-bin evaluation, though uncertainty is high due to the low event count (4% event rate, ~12 test events). Discrimination (AUC) was preserved.


🔬 Results Summary

Dataset: METABRIC (1,492 patients, 5-year mortality)

Outcome: Death within 5 years of diagnosis
Event Rate: 4.0% (60/1,492 patients)
Split: 60% train / 20% validation / 20% test (stratified)

Model            Calibration          ECE ↓     Brier ↓   AUC
Neural Network   None                 0.5772    0.4390    0.596
Neural Network   Platt Scaling        0.0001†   0.0385    0.596
Neural Network   Temperature (T=10)   0.4760    0.2655    0.596
XGBoost          None                 0.0250    0.0406    0.634
XGBoost          Platt Scaling        0.0009†   0.0385    0.634

† Point estimate under fixed-bin evaluation; high variance due to sparse events.

โš ๏ธ Critical Interpretation Note: Given the low number of test events (~12), ECE estimates are unstable and should be interpreted as qualitative improvement rather than precise calibration values. Point estimates approach zero but have high variance.

Calibration Metrics:

  • ECE = Expected Calibration Error (10 bins, equal-frequency binning); a minimal computation sketch follows this list
  • Brier = Brier Score (reliability + resolution decomposition)
  • Results reported on held-out test set (N=299, ~12 events)
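
For intuition, here is a minimal, self-contained sketch of equal-frequency ECE as described above. It is illustrative only; the function name and binning details are assumptions, not the project's src.evaluation.metrics API.

import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-frequency ECE: bin by quantiles of the predicted probability,
    then take the sample-weighted mean of |mean prediction - observed event rate|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.quantile(y_prob, np.linspace(0, 1, n_bins + 1))
    bin_ids = np.searchsorted(edges[1:-1], y_prob, side="right")  # bins 0 .. n_bins-1
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_prob[mask].mean() - y_true[mask].mean())
            ece += mask.mean() * gap  # weight by fraction of samples in the bin
    return ece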

Important Context:

  • Substantial ECE reduction reflects correction of severe initial miscalibration combined with sparse-bin effects
  • With only 60 total events, calibration metrics have high variance
  • Bootstrap CIs implemented in comprehensive analysis (see CALIBRATION_RIGOR.md); main pipeline reports point estimates only (an illustrative bootstrap sketch follows this list)
  • Reliability diagrams use 10 bins; finer binning creates sparser bins
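
As a rough illustration of how such intervals can be obtained, the sketch below runs a percentile bootstrap over patients, reusing the illustrative expected_calibration_error function above. It is not the CALIBRATION_RIGOR.md implementation; with ~12 test events the resulting intervals will be wide.

import numpy as np

def bootstrap_ece_ci(y_true, y_prob, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for ECE by resampling patients with replacement."""
    rng = np.random.default_rng(seed)
    y_true, y_prob = np.asarray(y_true), np.asarray(y_prob)
    n = len(y_true)
    stats = [
        expected_calibration_error(y_true[idx], y_prob[idx])
        for idx in (rng.integers(0, n, size=n) for _ in range(n_boot))
    ]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)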

📊 Visual Results

Reliability Diagrams

Neural Network - Before/After Calibration:

Uncalibrated NN shows severe miscalibration (points far from diagonal); Platt Scaling corrects this.

XGBoost - Before/After Calibration:

XGBoost is better calibrated initially but still benefits from post-hoc calibration.

ROC Curves

XGBoost (AUC = 0.634) Neural Network (AUC = 0.596)

Calibration preserves discrimination: AUC is unchanged by monotone transformations.
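
A quick self-contained check of this property on synthetic scores (using scikit-learn's roc_auc_score; the sigmoid transform below is an arbitrary Platt-style example, not the project's calibrator):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
p = rng.random(500)
p_cal = 1 / (1 + np.exp(-(3.0 * p - 1.5)))  # strictly increasing transform of p
# The ranking of patients is unchanged, so the AUC is too
assert np.isclose(roc_auc_score(y, p), roc_auc_score(y, p_cal))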


🔒 Data Leakage Prevention

Strict separation enforced across train/validation/test:

Fitted on Training Set Only:

  • Missing data imputation (median/mode)
  • Feature standardization (mean/std)
  • Model training (XGBoost, Neural Network)

Fitted on Validation Set Only:

  • Calibration methods (Platt Scaling parameters A/B, Temperature T)
  • Conformal prediction quantiles

Test Set:

  • Never seen during any fitting step
  • Used only for final evaluation
  • All metrics reported on test set

Implementation Details:

  • Stratified splitting by outcome + ER status + tumor stage
  • Preprocessing pipeline frozen after the training fit (see the sketch after this list)
  • See src/data/splitter.py:87-124 and src/data/preprocessor.py:45-89 for details
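
A minimal sketch of that fit-on-train-only pattern with scikit-learn, on synthetic data. Names and transforms here are illustrative, not the project's preprocessor.py API:

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train, X_val, X_test = (rng.normal(size=(n, 5)) for n in (900, 300, 300))
X_train[rng.random(X_train.shape) < 0.05] = np.nan  # simulate missing clinical values

# Fit imputation and standardization on the training split only
imputer = SimpleImputer(strategy="median").fit(X_train)
scaler = StandardScaler().fit(imputer.transform(X_train))

def preprocess(X):
    """Apply the frozen, train-fitted transforms; never re-fit on validation or test."""
    return scaler.transform(imputer.transform(X))

X_val_t, X_test_t = preprocess(X_val), preprocess(X_test)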

โš ๏ธ Limitations & Caveats

Sample Size & Event Rate

  • Low event rate (4%) makes calibration metrics noisy
  • Only 60 events total; ~12 events in test set
  • Reliability diagrams have sparse bins
  • Bootstrap confidence intervals implemented in comprehensive_calibration_analysis.py; main pipeline reports point estimates only

Endpoint Definition

  • Outcome is 5-year mortality (overall survival), not recurrence-free survival
  • Recurrence data not available in this METABRIC format
  • Clinical scenario examples in documentation should reference mortality risk, not recurrence risk

Dataset Limitations

  • Single dataset (METABRIC); no external validation performed
  • Patients with <5 years of follow-up and no event were excluded (412/1,904)
  • Missing data handled by simple imputation (median/mode)
  • Limited to key cancer genes; full genomic profiles not used

Methodological Notes

  • ECE can be unstable with low event rates and fixed binning
  • Platt Scaling fit on validation set (N=298, ~12 events)
  • Temperature Scaling may underperform with small validation sets
  • No recalibration for distribution shift or temporal drift

🚀 Quick Start

Installation

git clone https://github.com/bcfeen/OncoCalibrate.git
cd OncoCalibrate

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt

Reproduce Full Pipeline

# 1. Download METABRIC data from Kaggle
export KAGGLE_API_KEY='your_key_here'
python scripts/download_metabric_kaggle.py

# 2. Run full pipeline (trains models, applies calibration, generates figures)
python scripts/run_pipeline.py

# Results saved to:
#   - results/models/     (trained models)
#   - results/figures/    (reliability diagrams, ROC curves)
#   - results/metrics/    (performance metrics)

Quick Test with Synthetic Data

python scripts/run_pipeline.py --use-synthetic --skip-tuning

๐Ÿ” What's Implemented

Calibration Methods

1. Platt Scaling (Logistic Calibration)

# Fits: q = 1 / (1 + exp(A*p + B))
calibrator = PlattScaling()
calibrator.fit(predictions_val, y_val)
calibrated = calibrator.predict(predictions_test)
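
Under the hood this is a two-parameter logistic fit on the validation predictions. A minimal stand-alone sketch using scikit-learn (illustrative only, not the project's PlattScaling class; some implementations fit on logit(p) rather than p):

import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(p_val, y_val):
    """Fit q = sigmoid(w * p + b), i.e. the README's form with A = -w, B = -b."""
    return LogisticRegression(C=1e6).fit(np.asarray(p_val).reshape(-1, 1), y_val)

def apply_platt(platt, p_test):
    return platt.predict_proba(np.asarray(p_test).reshape(-1, 1))[:, 1]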

2. Temperature Scaling (Neural Networks)

# Applies: q = sigmoid(logits / T)
temp_calibrator = TemperatureScaling()
optimal_T = temp_calibrator.fit(logits_val, y_val)
calibrated = temp_calibrator.predict(logits_test)
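
T is typically chosen by minimizing validation log loss. A minimal sketch with scipy (illustrative, not the project's TemperatureScaling class):

import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import expit  # numerically stable sigmoid

def fit_temperature(logits_val, y_val):
    """Return T > 0 minimizing the validation negative log-likelihood of sigmoid(logits / T)."""
    y_val = np.asarray(y_val, dtype=float)
    def nll(T):
        q = np.clip(expit(np.asarray(logits_val) / T), 1e-12, 1 - 1e-12)
        return -np.mean(y_val * np.log(q) + (1 - y_val) * np.log(1 - q))
    return minimize_scalar(nll, bounds=(0.05, 100.0), method="bounded").x

# Usage: T = fit_temperature(logits_val, y_val); calibrated = expit(logits_test / T)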

3. Conformal Prediction (Uncertainty Quantification)

# Provides prediction intervals with coverage guarantees
# Note: Uncertainty quantification under miscalibration, not calibration itself
conformal = ConformalPredictor(alpha=0.1)  # 90% coverage
conformal.fit(predictions_val, y_val)
intervals = conformal.predict(predictions_test)
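
The core idea of split conformal is to turn validation-set nonconformity scores into a coverage threshold. The project's ConformalPredictor returns intervals; purely for intuition, here is a minimal prediction-set variant for a binary outcome (an assumption, not the project's implementation):

import numpy as np

def split_conformal_threshold(p_val, y_val, alpha=0.1):
    """Quantile of validation nonconformity scores (1 - probability assigned to the true class)."""
    p_val, y_val = np.asarray(p_val), np.asarray(y_val)
    scores = np.where(y_val == 1, 1 - p_val, p_val)
    n = len(scores)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    return np.sort(scores)[k - 1]

def prediction_sets(p_test, qhat):
    """Boolean membership of labels {0, 1}; sets contain the true label ~ (1 - alpha) of the time."""
    p_test = np.asarray(p_test)
    return np.stack([p_test <= qhat, (1 - p_test) <= qhat], axis=1)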

Evaluation Metrics

  • Calibration: ECE, MCE, Brier Score (with decomposition)
  • Statistical Tests: Hosmer-Lemeshow, Calibration Slope (a slope sketch follows this list)
  • Discrimination: AUC-ROC, Average Precision
  • Visualization: Reliability diagrams, ROC curves
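
For reference, the calibration slope can be estimated by regressing the outcome on the logit of the predicted probability; a slope near 1 indicates good calibration. A minimal sketch (assumed implementation, not the project's metrics module):

import numpy as np
from sklearn.linear_model import LogisticRegression

def calibration_slope(y_true, y_prob, eps=1e-12):
    """Cox-style recalibration fit: outcome ~ logit(p); slope ~ 1 means well calibrated."""
    p = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    logit = np.log(p / (1 - p))
    lr = LogisticRegression(C=1e6).fit(logit.reshape(-1, 1), np.asarray(y_true))
    return float(lr.coef_[0, 0])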

Models

  • XGBoost with hyperparameter tuning (Optuna)
  • PyTorch Neural Network with logit access for temperature scaling

๐Ÿ“ Project Structure

OncoCalibrate/
├── config/               # YAML configurations
│   ├── data_config.yaml  # Preprocessing, features, outcome
│   ├── model_config.yaml # XGBoost/NN hyperparameters
│   └── eval_config.yaml  # Calibration methods, metrics
├── data/                 # Data storage (gitignored)
├── src/                  # Source code
│   ├── data/            # Loading, preprocessing, splitting
│   ├── models/          # XGBoost & Neural Network
│   ├── calibration/     # Platt, Temperature, Conformal
│   ├── evaluation/      # Metrics & visualization
│   └── utils/           # Config, logging, seed management
├── scripts/             # Executable scripts
│   ├── run_pipeline.py         # Full end-to-end pipeline
│   ├── download_metabric_kaggle.py
│   └── test_pipeline.py        # Module testing
└── results/             # Generated outputs (gitignored)

🎓 Key Takeaways

1. Neural Networks Require Calibration

Uncalibrated NN showed severe miscalibration (ECE = 0.577), confirming findings from Guo et al. (2017). Direct probability outputs cannot be trusted for clinical use without post-hoc calibration.

2. Platt Scaling is Effective

Despite being a simple 2-parameter method, Platt Scaling markedly improved calibration for both model types under the chosen evaluation scheme. Temperature Scaling (T=10) helped but was less effective, likely due to the small validation set and class imbalance.

3. Tree Models Better Calibrated Initially

XGBoost showed better out-of-box calibration (ECE = 0.025) compared to neural networks (ECE = 0.577), consistent with calibration literature.

4. Calibration Preserves Discrimination

All calibration methods preserved AUC-ROC (monotone transformations don't change ranking). No trade-off between calibration and discrimination.

5. Low Event Rates Create Challenges

With a 4% event rate, calibration metrics are noisy and reliability diagrams have sparse bins. Larger samples are needed for robust calibration assessment.


📖 Documentation


๐Ÿ› ๏ธ Usage Examples

Training a Model

from src.models.xgboost_model import XGBoostMortalityModel

model = XGBoostMortalityModel()
model.train(X_train, y_train, X_val, y_val)
predictions = model.predict_proba(X_test)

Applying Calibration

from src.calibration.platt_scaling import PlattScaling

calibrator = PlattScaling()
calibrator.fit(predictions_val, y_val)
calibrated_probs = calibrator.predict(predictions_test)

Computing Metrics

from src.evaluation.metrics import compute_all_metrics

metrics = compute_all_metrics(y_test, calibrated_probs)
print(f"ECE: {metrics['ece']:.4f}")
print(f"Brier: {metrics['brier_score']:.4f}")
print(f"AUC: {metrics['auc_roc']:.3f}")

📊 Dataset

METABRIC (Molecular Taxonomy of Breast Cancer International Consortium)

  • Source: Kaggle (raghadalharbi/breast-cancer-gene-expression-profiles-metabric)
  • Patients: 1,904 total; 1,492 after exclusions (<5 year follow-up without event)
  • Features: 49 total
    • Clinical (18): Age, tumor characteristics, ER/PR/HER2 status, treatment
    • Genetic (28): BRCA1/2, TP53, key cancer gene mutations
    • Derived (4): Mutation burden, treatment intensity, PAM50 risk grouping
  • Outcome: Death within 5 years of diagnosis
  • Event Rate: 4.0% (60/1,492 patients)

Citation:

Curtis et al. (2012). "The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups." Nature 486:346-352.


🔮 Future Work

Implemented (Sensitivity Analysis)

  • Bootstrap confidence intervals (see CALIBRATION_RIGOR.md)
  • Bin sensitivity analysis (5/10/20 bins)
  • Multiple baseline comparisons

Not Yet Implemented

  • Isotonic regression and beta calibration baselines
  • External validation on independent cohort
  • Recalibration monitoring for temporal drift
  • SHAP values for feature importance
  • Survival analysis with time-to-event calibration

📚 References

Calibration Methods:

  1. Guo et al. (2017). "On Calibration of Modern Neural Networks." ICML.
  2. Platt (1999). "Probabilistic outputs for support vector machines."
  3. Vovk et al. (2005). "Algorithmic Learning in a Random World."

Dataset:

  4. Curtis et al. (2012). "The genomic and transcriptomic architecture of 2,000 breast tumours reveals novel subgroups." Nature 486:346-352.


๐Ÿค Contributing

Suggestions and improvements welcome via issues or pull requests. Particularly interested in:

  • Bootstrap CI reporting in the main pipeline (currently only in the comprehensive analysis)
  • Isotonic/beta calibration implementations
  • External validation datasets

📄 License

MIT License - see LICENSE for details.


👤 Author

Ben Feeney

Portfolio project demonstrating ML engineering, statistical rigor, and clinical ML expertise.


๐Ÿ™ Acknowledgments

  • METABRIC Consortium for dataset
  • cBioPortal and Kaggle for data access
  • Calibration research community

Note: This is a research/educational project. Not validated for clinical use.
