A comprehensive Python toolkit for analyzing Kaplan-Meier survival curves, reconstructing individual patient data, and fitting parametric survival distributions.
Table of Contents:
- Overview
- Features
- Installation
- Quick Start
- Detailed Usage
- File Structure
- Workflows
- Input Data Format
- Output Files
- Troubleshooting
- References
This pipeline addresses a common challenge in health economics and survival analysis: extracting usable data from published Kaplan-Meier curves. It implements:
- Guyot Algorithm - Reconstructs Individual Patient Data (IPD) from digitized KM curves
- Parametric Fitting - Fits 6 standard survival distributions
- Piecewise Models - Creates flexible piecewise parameterizations
- Google Sheets Integration - Bypasses Excel's character limits for collaborative work
Why this matters:
- Published studies often show only KM curves, not raw data
- Excel has a 32,767 character limit per cell, making large datasets problematic
- Manual extraction is time-consuming and error-prone
- Parametric models are needed for economic modeling and extrapolation
| Feature | Description |
|---|---|
| Guyot Algorithm | Reconstruct IPD from digitized KM curves + number at risk tables |
| 6 Parametric Distributions | Exponential, Weibull, Log-Normal, Log-Logistic, Gompertz, Gamma |
| Piecewise Models | Piecewise Exponential and Piecewise Weibull with customizable breakpoints |
| Automatic Model Selection | Compare models using AIC/BIC criteria |
| Google Sheets Integration | Read inputs and write results directly to Google Sheets |
| Visualization | Generate publication-ready survival curve plots |
Prerequisites:
- Python 3.8 or higher
- pip package manager
pip install numpy pandas scipy matplotlib gspread google-auth

# If using git
git clone <repository-url>
cd survival_analysis_pipeline
# Or download and extract the zip file

Verify the installation:

python -c "from survival_analysis import guyot_algorithm; print('Installation successful!')"

Then run the test script:

python test_pipeline.py

This runs a complete analysis with sample data and outputs results to CSV files.

To work interactively, open the main notebook:
# Open in VS Code
code survival_analysis_pipeline.ipynb
# Or start Jupyter
jupyter notebook survival_analysis_pipeline.ipynb

Or run the analysis directly in Python:

import numpy as np
from survival_analysis import guyot_algorithm, fit_all_distributions
# Your digitized KM curve data
t_km = np.array([0, 6, 12, 18, 24])
s_km = np.array([1.0, 0.8, 0.6, 0.45, 0.35])
# Number at risk table
t_risk = np.array([0, 6, 12, 18, 24])
n_risk = np.array([100, 75, 50, 35, 25])
# Reconstruct IPD
ipd_df = guyot_algorithm(t_km, s_km, t_risk, n_risk)
print(f"Reconstructed {len(ipd_df)} patients")
# Fit parametric distributions
results, summary = fit_all_distributions(ipd_df['time'], ipd_df['event'])
print(summary[['Distribution', 'AIC']].to_string())

The Guyot algorithm reconstructs Individual Patient Data from published Kaplan-Meier curves.
Reference: Guyot P, et al. (2012) BMC Medical Research Methodology 12:9
Required inputs:
- Digitized KM curve - Time points and survival probabilities
- Number at risk table - Patient counts at specific intervals
from survival_analysis import guyot_algorithm, calculate_km_from_ipd
# Digitized KM curve (from tools like WebPlotDigitizer)
t_km = np.array([0, 1, 2, 3, 4, 5, 6, 8, 10, 12])
s_km = np.array([1.0, 0.95, 0.90, 0.85, 0.78, 0.72, 0.65, 0.55, 0.48, 0.42])
# Number at risk (from the table below the KM plot)
t_risk = np.array([0, 3, 6, 9, 12])
n_risk = np.array([150, 120, 90, 65, 50])
# Reconstruct IPD
ipd_df = guyot_algorithm(t_km, s_km, t_risk, n_risk)
# Output: DataFrame with columns 'time' and 'event'
# event = 1 (death/event occurred), event = 0 (censored)
print(ipd_df.head())
# time event
# 0 0.234567 1
# 1 0.456789 0
# 2 0.678901 1
# ...
# Validate by reconstructing KM curve
reconstructed = calculate_km_from_ipd(ipd_df)

Parameters of `guyot_algorithm`:

| Parameter | Type | Description |
|---|---|---|
| `t_km` | array | Time points from the digitized KM curve |
| `s_km` | array | Survival probabilities (starting at 1.0) |
| `t_risk` | array | Time points where the number at risk is reported |
| `n_risk` | array | Number at risk at each time point |
| `tot_events` | int (optional) | Total number of events, if known |
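To check the reconstruction visually, overlay the original digitized curve on the curve recomputed from the IPD. A minimal sketch continuing from the example above, assuming `calculate_km_from_ipd` returns a DataFrame with `time` and `survival` columns (check the actual return format in your version):

```python
import matplotlib.pyplot as plt

# Recompute the KM curve from the reconstructed IPD
reconstructed = calculate_km_from_ipd(ipd_df)

plt.step(t_km, s_km, where='post', label='Original (digitized)')
# Column names below are assumptions - adjust to the actual output
plt.step(reconstructed['time'], reconstructed['survival'], where='post',
         linestyle='--', label='Reconstructed from IPD')
plt.xlabel('Time')
plt.ylabel('Survival probability')
plt.legend()
plt.show()
```

The two step functions should overlap closely; large gaps usually mean the digitized curve or the risk table was misread.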
Fit standard survival distributions to the reconstructed IPD.
| Distribution | Hazard Shape | Use Case |
|---|---|---|
| Exponential | Constant | Simplest model, memoryless property |
| Weibull | Monotonic (increasing or decreasing) | Most common, flexible |
| Log-Normal | Non-monotonic (unimodal) | When hazard rises then falls |
| Log-Logistic | Non-monotonic | Similar to Log-Normal |
| Gompertz | Exponentially increasing | Aging/mortality studies |
| Gamma | Flexible | General purpose |
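The "Hazard Shape" column is easiest to see by plotting. A short illustration using SciPy directly (independent of the package's own classes), showing how the Weibull shape parameter controls whether the hazard rises or falls; the scale value is arbitrary:

```python
import numpy as np
from scipy.stats import weibull_min
import matplotlib.pyplot as plt

t = np.linspace(0.1, 24, 200)
for shape in [0.7, 1.0, 1.5]:
    # Hazard h(t) = f(t) / S(t); scale fixed at 12 purely for illustration
    hazard = weibull_min.pdf(t, shape, scale=12) / weibull_min.sf(t, shape, scale=12)
    plt.plot(t, hazard, label=f'Weibull shape = {shape}')
plt.xlabel('Time')
plt.ylabel('Hazard rate')
plt.legend()
plt.show()
```

Shape < 1 gives a decreasing hazard, shape = 1 reduces to the exponential (constant hazard), and shape > 1 gives an increasing hazard. The package helpers below fit all six distributions to your data at once: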
from survival_analysis import fit_all_distributions, select_best_distribution
# Fit all distributions
results, summary_df = fit_all_distributions(
times=ipd_df['time'].values,
events=ipd_df['event'].values
)
# View comparison table (sorted by AIC)
print(summary_df)
# Distribution Log-Likelihood AIC BIC
# 0 Weibull -450.23 904.46 910.12
# 1 Log-Normal -452.11 908.22 913.88
# ...
# Get best model
best = select_best_distribution(results, criterion='AIC')
print(f"Best: {best.name}, Parameters: {best.params}")
# Predict survival at specific times
t_predict = np.array([6, 12, 24, 36])
survival_probs = best.survival(t_predict)
hazard_rates = best.hazard(t_predict)

You can also fit a single distribution directly:

from survival_analysis import WeibullDistribution
weibull = WeibullDistribution()
weibull.fit(times, events)
print(f"Shape: {weibull.params['shape']}")
print(f"Scale: {weibull.params['scale']}")
print(f"AIC: {weibull.aic}")
# Survival at 12 months
print(f"S(12) = {weibull.survival(12)}")For complex survival patterns, use piecewise models that allow different hazard rates in different time periods.
In the piecewise exponential model, the hazard is constant within each segment and can change at each breakpoint.
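To make this concrete, here is a small illustrative helper (a sketch only, not the package's implementation): the cumulative hazard at time t is each segment's rate multiplied by the time spent in that segment, summed over all segments.

```python
import numpy as np

# Illustrative sketch only - the package's internal implementation may differ.
def piecewise_exp_survival(t, breakpoints, lambdas):
    """S(t) = exp(-H(t)), where H(t) sums (segment rate) * (time spent in segment)."""
    edges = np.concatenate(([0.0], breakpoints, [np.inf]))
    lambdas = np.asarray(lambdas, dtype=float)   # one rate per segment
    t = np.atleast_1d(np.asarray(t, dtype=float))
    surv = np.empty_like(t)
    for i, ti in enumerate(t):
        # Time spent in each segment up to ti (0 for segments not yet reached)
        exposure = np.clip(np.minimum(ti, edges[1:]) - edges[:-1], 0.0, None)
        surv[i] = np.exp(-np.dot(lambdas, exposure))
    return surv

# Example with made-up rates: the hazard falls after each breakpoint
print(piecewise_exp_survival([3, 9, 15, 24],
                             breakpoints=[6, 12, 18],
                             lambdas=[0.045, 0.038, 0.032, 0.025]))
```

The package's `PiecewiseExponential` class estimates these segment rates from your data: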
from survival_analysis import PiecewiseExponential
# Define breakpoints (e.g., every 6 months)
breakpoints = [6, 12, 18]
pw_exp = PiecewiseExponential(breakpoints)
pw_exp.fit(times, events)
# View segment parameters
summary = pw_exp.summary()
print(summary['segments'])
# Segment Start End Lambda Median_Survival
# 0 1 0 6 0.0450 15.40
# 1 2 6 12 0.0380 18.24
# 2 3 12 18 0.0320 21.66
# 3 4 18 Inf 0.0250 27.73

The piecewise Weibull is more flexible: both shape and scale can vary by segment.
from survival_analysis import PiecewiseWeibull
pw_weibull = PiecewiseWeibull(breakpoints=[12, 24])
pw_weibull.fit(times, events)
summary = pw_weibull.summary()
print(summary['segments'])

To choose the number and location of breakpoints automatically:

from survival_analysis import find_optimal_breakpoints
result = find_optimal_breakpoints(
times, events,
max_breakpoints=3,
model_type='exponential'
)
print(f"Optimal breakpoints: {result['optimal_breakpoints']}")
print(result['comparison'])  # AIC for each number of breakpoints

Bypass Excel's character limits by using Google Sheets as your data source and output destination.

In Google Colab, authenticate and create a spreadsheet directly:
from google.colab import auth
from google.auth import default
import gspread
# Authenticate (opens popup)
auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)
# Create new spreadsheet
spreadsheet = gc.create("My_Survival_Analysis")
print(f"URL: {spreadsheet.url}")- Create a Google Cloud project
- Enable Google Sheets API and Google Drive API
- Create a service account and download JSON key
- Share your spreadsheet with the service account email
import gspread
gc = gspread.service_account(filename='service_account.json')
spreadsheet = gc.open("My_Survival_Analysis")# Read KM curve data
ws = spreadsheet.worksheet("KM_Curve_Input")
data = ws.get_all_values()[1:] # Skip header
t_km = np.array([float(row[0]) for row in data if row[0]])
s_km = np.array([float(row[1]) for row in data if row[1]])

# Write IPD results (cell by cell - no character limit!)
ws = spreadsheet.worksheet("IPD_Results")
ws.clear()
data = [['time', 'event']] + ipd_df.values.tolist()
ws.update('A1', data)

Project layout:

survival_analysis_pipeline/
│
├── survival_analysis/ # Core Python package
│ ├── __init__.py # Package exports
│ ├── guyot_algorithm.py # IPD reconstruction
│ ├── parametric_fitting.py # Distribution fitting
│ ├── piecewise_fitting.py # Piecewise models
│ └── sheets_integration.py # Google Sheets helpers
│
├── survival_analysis_pipeline.ipynb # Main analysis notebook
├── google_sheets_setup.ipynb # Sheets integration tutorial
├── test_pipeline.py # Test script with sample data
├── visualize_results.py # Generate plots
│
├── sheets_simulation/ # Local test outputs
│ ├── KM_Curve_Input.csv
│ ├── Risk_Table_Input.csv
│ ├── IPD_Results.csv
│ ├── Parametric_Results.csv
│ ├── Piecewise_Results.csv
│ └── Survival_Predictions.csv
│
├── README.md # This file
└── GOOGLE_SHEETS_QUICKSTART.md # Quick reference for Sheets
The pipeline supports two typical workflows.

Workflow 1: Local analysis in the notebook

1. Prepare your data
   - Digitize your KM curve using WebPlotDigitizer or similar
   - Record the number at risk table from the publication
2. Open the notebook
   - `code survival_analysis_pipeline.ipynb`
3. Enter your data in the input cells
4. Run all cells to generate:
   - Reconstructed IPD
   - Parametric fits
   - Piecewise models
   - Visualizations
5. Export results to CSV files (see the export sketch after the workflows)

Workflow 2: Collaborative analysis via Google Sheets (Colab)

1. Upload files to Colab
   - Upload the `survival_analysis/` folder
   - Upload `google_sheets_setup.ipynb`
2. Authenticate with Google
   - `from google.colab import auth; auth.authenticate_user()`
3. Create the template spreadsheet
   - The notebook creates worksheets for input and output
4. Enter data in Google Sheets
   - Open the spreadsheet URL
   - Enter KM data in the `KM_Curve_Input` tab
   - Enter the risk table in the `Risk_Table_Input` tab
5. Run the analysis
   - Results are written back to the spreadsheet
6. Create charts
   - Use Google Sheets' built-in charting on the predictions
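For step 5 of the local workflow (exporting results), a minimal sketch continuing from the quick-start example above (`ipd_df`, `results`, and `summary_df` already exist); the output file names mirror the `sheets_simulation/` folder and are only illustrative:

```python
import numpy as np
import pandas as pd
from survival_analysis import select_best_distribution

# Reconstructed IPD and the model comparison table
ipd_df.to_csv('IPD_Results.csv', index=False)
summary_df.to_csv('Parametric_Results.csv', index=False)

# Survival predictions from the best-fitting model on a common time grid
best = select_best_distribution(results, criterion='AIC')
t_grid = np.arange(0, 37, 6)
pred = pd.DataFrame({'Time': t_grid, best.name: best.survival(t_grid)})
pred.to_csv('Survival_Predictions.csv', index=False)
```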
KM curve input (the `KM_Curve_Input` tab or CSV):

| Time | Survival |
|---|---|
| 0 | 1.00 |
| 2 | 0.95 |
| 4 | 0.88 |
| 6 | 0.80 |
| ... | ... |
Requirements:
- First row must be Time=0, Survival=1.0
- Time should be monotonically increasing
- Survival should be monotonically decreasing
Number at risk input (the `Risk_Table_Input` tab or CSV):

| Time | N_at_Risk |
|---|---|
| 0 | 200 |
| 6 | 150 |
| 12 | 100 |
| ... | ... |
Requirements:
- Must start at Time=0
- N_at_Risk should decrease over time
- Time points should align with or be subset of KM curve times
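A quick sanity check for the requirements above, as a minimal sketch; the column names follow the tables above and the file paths follow the `sheets_simulation/` examples, so adjust them to your own inputs:

```python
import pandas as pd

km = pd.read_csv('sheets_simulation/KM_Curve_Input.csv')      # columns: Time, Survival
risk = pd.read_csv('sheets_simulation/Risk_Table_Input.csv')  # columns: Time, N_at_Risk

assert km['Time'].iloc[0] == 0 and km['Survival'].iloc[0] == 1.0, "KM curve must start at (0, 1.0)"
assert km['Time'].is_monotonic_increasing, "Time must be monotonically increasing"
assert km['Survival'].is_monotonic_decreasing, "Survival must be non-increasing"
assert risk['Time'].iloc[0] == 0, "Risk table must start at Time=0"
assert risk['N_at_Risk'].is_monotonic_decreasing, "N at risk should decrease over time"
assert set(risk['Time']).issubset(km['Time']), "Risk times should be a subset of KM curve times"
```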
The pipeline writes the following outputs (CSV files locally, or tabs in Google Sheets).

IPD_Results.csv:

time,event
0.234,1
0.567,0
0.891,1
...

- `time`: Event or censoring time
- `event`: 1 = event occurred, 0 = censored

Parametric_Results.csv:
Distribution,Log-Likelihood,AIC,BIC
Weibull,-450.23,904.46,910.12
Log-Normal,-452.11,908.22,913.88
...

Piecewise_Results.csv:

Segment,Start,End,Lambda,Median_Survival
1,0,6,0.045,15.4
2,6,12,0.038,18.2
...

Survival_Predictions.csv:

Time,Original_KM,Weibull,Piecewise_Exp
0,1.0,1.0,1.0
6,0.77,0.76,0.75
12,0.55,0.54,0.53
...

| Problem | Solution |
|---|---|
| `ModuleNotFoundError` | Run `pip install numpy pandas scipy matplotlib` |
| Google auth fails | Re-run authentication cell, check API is enabled |
| "Spreadsheet not found" | Share sheet with service account email |
| Poor fit quality | Check input data quality, try different distributions |
| Negative survival values | Ensure KM data is monotonically decreasing |
Tips for validating your results:

1. Always validate the Guyot reconstruction
   - `reconstructed = calculate_km_from_ipd(ipd_df)`
   - Plot the original vs reconstructed curves - they should overlap closely
2. Check parameter reasonableness
   - Weibull shape > 1: increasing hazard
   - Weibull shape < 1: decreasing hazard
   - Weibull shape = 1: constant hazard (exponential)
3. Compare AIC differences (a short ΔAIC example follows this list)
   - ΔAIC < 2: models essentially equivalent
   - ΔAIC 4-7: some evidence for the better model
   - ΔAIC > 10: strong evidence for the better model
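As a small worked example of the ΔAIC rule, the differences can be read directly off the comparison table returned by `fit_all_distributions` (column names as shown earlier):

```python
# Delta AIC relative to the best (lowest-AIC) model in the comparison table
summary_df['Delta_AIC'] = summary_df['AIC'] - summary_df['AIC'].min()
print(summary_df[['Distribution', 'AIC', 'Delta_AIC']].to_string(index=False))
# Delta_AIC < 2: essentially equivalent to the best fit
# Delta_AIC > 10: strong evidence against that model
```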
- Guyot Algorithm: Guyot P, Ades AE, Ouwens MJ, Welton NJ. (2012). Enhanced secondary analysis of survival data: reconstructing the data from published Kaplan-Meier survival curves. BMC Medical Research Methodology, 12:9.
- Parametric Survival Models: Collett D. (2015). Modelling Survival Data in Medical Research (3rd ed.). CRC Press.
- Digitizing Tools:
  - WebPlotDigitizer: https://automeris.io/WebPlotDigitizer/
This project is provided for educational and research purposes.
For issues or questions:
- Check the Troubleshooting section
- Review the example notebooks
- Open an issue on the repository