A comprehensive Python toolkit for analyzing Kaplan-Meier survival curves, reconstructing individual patient data, and fitting parametric survival distributions.
Table of Contents:
- Overview
- Features
- Installation
- Quick Start
- Detailed Usage
- File Structure
- Workflows
- Input Data Format
- Output Files
- Troubleshooting
- References
This pipeline addresses a common challenge in health economics and survival analysis: extracting usable data from published Kaplan-Meier curves. It implements:
- Guyot Algorithm - Reconstructs Individual Patient Data (IPD) from digitized KM curves
- Parametric Fitting - Fits 6 standard survival distributions
- Piecewise Models - Creates flexible piecewise parameterizations
- Google Sheets Integration - Bypasses Excel's character limits for collaborative work
Why this matters:
- Published studies often show only KM curves, not raw data
- Excel has a 32,767 character limit per cell, making large datasets problematic
- Manual extraction is time-consuming and error-prone
- Parametric models are needed for economic modeling and extrapolation
| Feature | Description |
|---|---|
| Guyot Algorithm | Reconstruct IPD from digitized KM curves + number at risk tables |
| 6 Parametric Distributions | Exponential, Weibull, Log-Normal, Log-Logistic, Gompertz, Gamma |
| Piecewise Models | Piecewise Exponential and Piecewise Weibull with customizable breakpoints |
| Automatic Model Selection | Compare models using AIC/BIC criteria |
| Google Sheets Integration | Read inputs and write results directly to Google Sheets |
| Visualization | Generate publication-ready survival curve plots |
Prerequisites:
- Python 3.8 or higher
- pip package manager
pip install numpy pandas scipy matplotlib gspread google-auth

# If using git
git clone <repository-url>
cd survival_analysis_pipeline
# Or download and extract the zip file

Verify the installation:

python -c "from survival_analysis import guyot_algorithm; print('Installation successful!')"

Then run the test script:

python test_pipeline.py

This runs a complete analysis with sample data and outputs results to CSV files.

To work interactively, open the main notebook:
# Open in VS Code
code survival_analysis_pipeline.ipynb
# Or start Jupyter
jupyter notebook survival_analysis_pipeline.ipynb

Or run the analysis directly in Python:

import numpy as np
from survival_analysis import guyot_algorithm, fit_all_distributions
# Your digitized KM curve data
t_km = np.array([0, 6, 12, 18, 24])
s_km = np.array([1.0, 0.8, 0.6, 0.45, 0.35])
# Number at risk table
t_risk = np.array([0, 6, 12, 18, 24])
n_risk = np.array([100, 75, 50, 35, 25])
# Reconstruct IPD
ipd_df = guyot_algorithm(t_km, s_km, t_risk, n_risk)
print(f"Reconstructed {len(ipd_df)} patients")
# Fit parametric distributions
results, summary = fit_all_distributions(ipd_df['time'], ipd_df['event'])
print(summary[['Distribution', 'AIC']].to_string())

The Guyot algorithm reconstructs Individual Patient Data from published Kaplan-Meier curves.
Reference: Guyot P, et al. (2012) BMC Medical Research Methodology 12:9
Required inputs:
- Digitized KM curve - Time points and survival probabilities
- Number at risk table - Patient counts at specific intervals
from survival_analysis import guyot_algorithm, calculate_km_from_ipd
# Digitized KM curve (from tools like WebPlotDigitizer)
t_km = np.array([0, 1, 2, 3, 4, 5, 6, 8, 10, 12])
s_km = np.array([1.0, 0.95, 0.90, 0.85, 0.78, 0.72, 0.65, 0.55, 0.48, 0.42])
# Number at risk (from the table below the KM plot)
t_risk = np.array([0, 3, 6, 9, 12])
n_risk = np.array([150, 120, 90, 65, 50])
# Reconstruct IPD
ipd_df = guyot_algorithm(t_km, s_km, t_risk, n_risk)
# Output: DataFrame with columns 'time' and 'event'
# event = 1 (death/event occurred), event = 0 (censored)
print(ipd_df.head())
# time event
# 0 0.234567 1
# 1 0.456789 0
# 2 0.678901 1
# ...
# Validate by reconstructing KM curve
reconstructed = calculate_km_from_ipd(ipd_df)

Parameters of `guyot_algorithm`:

| Parameter | Type | Description |
|---|---|---|
| `t_km` | array | Time points from the digitized KM curve |
| `s_km` | array | Survival probabilities (starting at 1.0) |
| `t_risk` | array | Time points where the number at risk is reported |
| `n_risk` | array | Number at risk at each time point |
| `tot_events` | int (optional) | Total number of events, if known |
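To check the reconstruction visually, overlay the original digitized curve on the curve recomputed from the IPD. A minimal sketch continuing from the example above, assuming `calculate_km_from_ipd` returns a DataFrame with `time` and `survival` columns (check the actual return format in your version):

```python
import matplotlib.pyplot as plt

# Recompute the KM curve from the reconstructed IPD
reconstructed = calculate_km_from_ipd(ipd_df)

plt.step(t_km, s_km, where='post', label='Original (digitized)')
# Column names below are assumptions - adjust to the actual output
plt.step(reconstructed['time'], reconstructed['survival'], where='post',
         linestyle='--', label='Reconstructed from IPD')
plt.xlabel('Time')
plt.ylabel('Survival probability')
plt.legend()
plt.show()
```

The two step functions should overlap closely; large gaps usually mean the digitized curve or the risk table was misread.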
Fit standard survival distributions to the reconstructed IPD.
| Distribution | Hazard Shape | Use Case |
|---|---|---|
| Exponential | Constant | Simplest model, memoryless property |
| Weibull | Monotonic (increasing or decreasing) | Most common, flexible |
| Log-Normal | Non-monotonic (unimodal) | When hazard rises then falls |
| Log-Logistic | Non-monotonic | Similar to Log-Normal |
| Gompertz | Exponentially increasing | Aging/mortality studies |
| Gamma | Flexible | General purpose |
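The "Hazard Shape" column is easiest to see by plotting. A short illustration using SciPy directly (independent of the package's own classes), showing how the Weibull shape parameter controls whether the hazard rises or falls; the scale value is arbitrary:

```python
import numpy as np
from scipy.stats import weibull_min
import matplotlib.pyplot as plt

t = np.linspace(0.1, 24, 200)
for shape in [0.7, 1.0, 1.5]:
    # Hazard h(t) = f(t) / S(t); scale fixed at 12 purely for illustration
    hazard = weibull_min.pdf(t, shape, scale=12) / weibull_min.sf(t, shape, scale=12)
    plt.plot(t, hazard, label=f'Weibull shape = {shape}')
plt.xlabel('Time')
plt.ylabel('Hazard rate')
plt.legend()
plt.show()
```

Shape < 1 gives a decreasing hazard, shape = 1 reduces to the exponential (constant hazard), and shape > 1 gives an increasing hazard. The package helpers below fit all six distributions to your data at once: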
from survival_analysis import fit_all_distributions, select_best_distribution
# Fit all distributions
results, summary_df = fit_all_distributions(
times=ipd_df['time'].values,
events=ipd_df['event'].values
)
# View comparison table (sorted by AIC)
print(summary_df)
# Distribution Log-Likelihood AIC BIC
# 0 Weibull -450.23 904.46 910.12
# 1 Log-Normal -452.11 908.22 913.88
# ...
# Get best model
best = select_best_distribution(results, criterion='AIC')
print(f"Best: {best.name}, Parameters: {best.params}")
# Predict survival at specific times
t_predict = np.array([6, 12, 24, 36])
survival_probs = best.survival(t_predict)
hazard_rates = best.hazard(t_predict)

You can also fit a single distribution directly:

from survival_analysis import WeibullDistribution
weibull = WeibullDistribution()
weibull.fit(times, events)
print(f"Shape: {weibull.params['shape']}")
print(f"Scale: {weibull.params['scale']}")
print(f"AIC: {weibull.aic}")
# Survival at 12 months
print(f"S(12) = {weibull.survival(12)}")For complex survival patterns, use piecewise models that allow different hazard rates in different time periods.
In the piecewise exponential model, the hazard is constant within each segment and can change at each breakpoint.
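To make this concrete, here is a small illustrative helper (a sketch only, not the package's implementation): the cumulative hazard at time t is each segment's rate multiplied by the time spent in that segment, summed over all segments.

```python
import numpy as np

# Illustrative sketch only - the package's internal implementation may differ.
def piecewise_exp_survival(t, breakpoints, lambdas):
    """S(t) = exp(-H(t)), where H(t) sums (segment rate) * (time spent in segment)."""
    edges = np.concatenate(([0.0], breakpoints, [np.inf]))
    lambdas = np.asarray(lambdas, dtype=float)   # one rate per segment
    t = np.atleast_1d(np.asarray(t, dtype=float))
    surv = np.empty_like(t)
    for i, ti in enumerate(t):
        # Time spent in each segment up to ti (0 for segments not yet reached)
        exposure = np.clip(np.minimum(ti, edges[1:]) - edges[:-1], 0.0, None)
        surv[i] = np.exp(-np.dot(lambdas, exposure))
    return surv

# Example with made-up rates: the hazard falls after each breakpoint
print(piecewise_exp_survival([3, 9, 15, 24],
                             breakpoints=[6, 12, 18],
                             lambdas=[0.045, 0.038, 0.032, 0.025]))
```

The package's `PiecewiseExponential` class estimates these segment rates from your data: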
from survival_analysis import PiecewiseExponential
# Define breakpoints (e.g., every 6 months)
breakpoints = [6, 12, 18]
pw_exp = PiecewiseExponential(breakpoints)
pw_exp.fit(times, events)
# View segment parameters
summary = pw_exp.summary()
print(summary['segments'])
# Segment Start End Lambda Median_Survival
# 0 1 0 6 0.0450 15.40
# 1 2 6 12 0.0380 18.24
# 2 3 12 18 0.0320 21.66
# 3 4 18 Inf 0.0250 27.73

The piecewise Weibull is more flexible: both shape and scale can vary by segment.
from survival_analysis import PiecewiseWeibull
pw_weibull = PiecewiseWeibull(breakpoints=[12, 24])
pw_weibull.fit(times, events)
summary = pw_weibull.summary()
print(summary['segments'])

To choose the number and location of breakpoints automatically:

from survival_analysis import find_optimal_breakpoints
result = find_optimal_breakpoints(
times, events,
max_breakpoints=3,
model_type='exponential'
)
print(f"Optimal breakpoints: {result['optimal_breakpoints']}")
print(result['comparison'])  # AIC for each number of breakpoints

Bypass Excel's character limits by using Google Sheets as your data source and output destination.

In Google Colab, authenticate and create a spreadsheet directly:
from google.colab import auth
from google.auth import default
import gspread
# Authenticate (opens popup)
auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)
# Create new spreadsheet
spreadsheet = gc.create("My_Survival_Analysis")
print(f"URL: {spreadsheet.url}")- Create a Google Cloud project
- Enable Google Sheets API and Google Drive API
- Create a service account and download JSON key
- Share your spreadsheet with the service account email
import gspread
gc = gspread.service_account(filename='service_account.json')
spreadsheet = gc.open("My_Survival_Analysis")# Read KM curve data
ws = spreadsheet.worksheet("KM_Curve_Input")
data = ws.get_all_values()[1:] # Skip header
t_km = np.array([float(row[0]) for row in data if row[0]])
s_km = np.array([float(row[1]) for row in data if row[1]])

# Write IPD results (cell by cell - no character limit!)
ws = spreadsheet.worksheet("IPD_Results")
ws.clear()
data = [['time', 'event']] + ipd_df.values.tolist()
ws.update('A1', data)

Project layout:

survival_analysis_pipeline/
│
├── survival_analysis/ # Core Python package
│ ├── __init__.py # Package exports
│ ├── guyot_algorithm.py # IPD reconstruction
│ ├── parametric_fitting.py # Distribution fitting
│ ├── piecewise_fitting.py # Piecewise models
│ └── sheets_integration.py # Google Sheets helpers
│
├── survival_analysis_pipeline.ipynb # Main analysis notebook
├── google_sheets_setup.ipynb # Sheets integration tutorial
├── test_pipeline.py # Test script with sample data
├── visualize_results.py # Generate plots
│
├── sheets_simulation/ # Local test outputs
│ ├── KM_Curve_Input.csv
│ ├── Risk_Table_Input.csv
│ ├── IPD_Results.csv
│ ├── Parametric_Results.csv
│ ├── Piecewise_Results.csv
│ └── Survival_Predictions.csv
│
├── README.md # This file
└── GOOGLE_SHEETS_QUICKSTART.md # Quick reference for Sheets
The pipeline supports two typical workflows.

Workflow 1: Local analysis in the notebook

1. Prepare your data
   - Digitize your KM curve using WebPlotDigitizer or similar
   - Record the number at risk table from the publication
2. Open the notebook
   - `code survival_analysis_pipeline.ipynb`
3. Enter your data in the input cells
4. Run all cells to generate:
   - Reconstructed IPD
   - Parametric fits
   - Piecewise models
   - Visualizations
5. Export results to CSV files (see the export sketch after the workflows)

Workflow 2: Collaborative analysis via Google Sheets (Colab)

1. Upload files to Colab
   - Upload the `survival_analysis/` folder
   - Upload `google_sheets_setup.ipynb`
2. Authenticate with Google
   - `from google.colab import auth; auth.authenticate_user()`
3. Create the template spreadsheet
   - The notebook creates worksheets for input and output
4. Enter data in Google Sheets
   - Open the spreadsheet URL
   - Enter KM data in the `KM_Curve_Input` tab
   - Enter the risk table in the `Risk_Table_Input` tab
5. Run the analysis
   - Results are written back to the spreadsheet
6. Create charts
   - Use Google Sheets' built-in charting on the predictions
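For step 5 of the local workflow (exporting results), a minimal sketch continuing from the quick-start example above (`ipd_df`, `results`, and `summary_df` already exist); the output file names mirror the `sheets_simulation/` folder and are only illustrative:

```python
import numpy as np
import pandas as pd
from survival_analysis import select_best_distribution

# Reconstructed IPD and the model comparison table
ipd_df.to_csv('IPD_Results.csv', index=False)
summary_df.to_csv('Parametric_Results.csv', index=False)

# Survival predictions from the best-fitting model on a common time grid
best = select_best_distribution(results, criterion='AIC')
t_grid = np.arange(0, 37, 6)
pred = pd.DataFrame({'Time': t_grid, best.name: best.survival(t_grid)})
pred.to_csv('Survival_Predictions.csv', index=False)
```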
KM curve input (the `KM_Curve_Input` tab or CSV):

| Time | Survival |
|---|---|
| 0 | 1.00 |
| 2 | 0.95 |
| 4 | 0.88 |
| 6 | 0.80 |
| ... | ... |
Requirements:
- First row must be Time=0, Survival=1.0
- Time should be monotonically increasing
- Survival should be monotonically decreasing
Number at risk input (the `Risk_Table_Input` tab or CSV):

| Time | N_at_Risk |
|---|---|
| 0 | 200 |
| 6 | 150 |
| 12 | 100 |
| ... | ... |
Requirements:
- Must start at Time=0
- N_at_Risk should decrease over time
- Time points should align with or be subset of KM curve times
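A quick sanity check for the requirements above, as a minimal sketch; the column names follow the tables above and the file paths follow the `sheets_simulation/` examples, so adjust them to your own inputs:

```python
import pandas as pd

km = pd.read_csv('sheets_simulation/KM_Curve_Input.csv')      # columns: Time, Survival
risk = pd.read_csv('sheets_simulation/Risk_Table_Input.csv')  # columns: Time, N_at_Risk

assert km['Time'].iloc[0] == 0 and km['Survival'].iloc[0] == 1.0, "KM curve must start at (0, 1.0)"
assert km['Time'].is_monotonic_increasing, "Time must be monotonically increasing"
assert km['Survival'].is_monotonic_decreasing, "Survival must be non-increasing"
assert risk['Time'].iloc[0] == 0, "Risk table must start at Time=0"
assert risk['N_at_Risk'].is_monotonic_decreasing, "N at risk should decrease over time"
assert set(risk['Time']).issubset(km['Time']), "Risk times should be a subset of KM curve times"
```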
The pipeline writes the following outputs (CSV files locally, or tabs in Google Sheets).

IPD_Results.csv:

time,event
0.234,1
0.567,0
0.891,1
...

- `time`: Event or censoring time
- `event`: 1 = event occurred, 0 = censored

Parametric_Results.csv:
Distribution,Log-Likelihood,AIC,BIC
Weibull,-450.23,904.46,910.12
Log-Normal,-452.11,908.22,913.88
...

Piecewise_Results.csv:

Segment,Start,End,Lambda,Median_Survival
1,0,6,0.045,15.4
2,6,12,0.038,18.2
...

Survival_Predictions.csv:

Time,Original_KM,Weibull,Piecewise_Exp
0,1.0,1.0,1.0
6,0.77,0.76,0.75
12,0.55,0.54,0.53
...

| Problem | Solution |
|---|---|
| `ModuleNotFoundError` | Run `pip install numpy pandas scipy matplotlib` |
| Google auth fails | Re-run authentication cell, check API is enabled |
| "Spreadsheet not found" | Share sheet with service account email |
| Poor fit quality | Check input data quality, try different distributions |
| Negative survival values | Ensure KM data is monotonically decreasing |
Tips for validating your results:

1. Always validate the Guyot reconstruction
   - `reconstructed = calculate_km_from_ipd(ipd_df)`
   - Plot the original vs reconstructed curves - they should overlap closely
2. Check parameter reasonableness
   - Weibull shape > 1: increasing hazard
   - Weibull shape < 1: decreasing hazard
   - Weibull shape = 1: constant hazard (exponential)
3. Compare AIC differences (a short ΔAIC example follows this list)
   - ΔAIC < 2: models essentially equivalent
   - ΔAIC 4-7: some evidence for the better model
   - ΔAIC > 10: strong evidence for the better model
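As a small worked example of the ΔAIC rule, the differences can be read directly off the comparison table returned by `fit_all_distributions` (column names as shown earlier):

```python
# Delta AIC relative to the best (lowest-AIC) model in the comparison table
summary_df['Delta_AIC'] = summary_df['AIC'] - summary_df['AIC'].min()
print(summary_df[['Distribution', 'AIC', 'Delta_AIC']].to_string(index=False))
# Delta_AIC < 2: essentially equivalent to the best fit
# Delta_AIC > 10: strong evidence against that model
```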
- Guyot Algorithm: Guyot P, Ades AE, Ouwens MJ, Welton NJ. (2012). Enhanced secondary analysis of survival data: reconstructing the data from published Kaplan-Meier survival curves. BMC Medical Research Methodology, 12:9.
- Parametric Survival Models: Collett D. (2015). Modelling Survival Data in Medical Research (3rd ed.). CRC Press.
- Digitizing Tools:
  - WebPlotDigitizer: https://automeris.io/WebPlotDigitizer/
This project is provided for educational and research purposes.
For issues or questions:
- Check the Troubleshooting section
- Review the example notebooks
- Open an issue on the repository