
Survival Analysis Pipeline

A comprehensive Python toolkit for analyzing Kaplan-Meier survival curves, reconstructing individual patient data, and fitting parametric survival distributions.

Overview

This pipeline addresses a common challenge in health economics and survival analysis: extracting usable data from published Kaplan-Meier curves. It implements:

  1. Guyot Algorithm - Reconstructs Individual Patient Data (IPD) from digitized KM curves
  2. Parametric Fitting - Fits 6 standard survival distributions
  3. Piecewise Models - Creates flexible piecewise parameterizations
  4. Google Sheets Integration - Bypasses Excel's character limits for collaborative work

Why This Tool?

  • Published studies often show only KM curves, not the underlying raw data
  • Excel's 32,767-character limit per cell makes large datasets problematic
  • Manual extraction is time-consuming and error-prone
  • Parametric models are needed for economic modeling and extrapolation

Features

| Feature | Description |
| --- | --- |
| Guyot Algorithm | Reconstructs IPD from digitized KM curves plus number-at-risk tables |
| 6 Parametric Distributions | Exponential, Weibull, Log-Normal, Log-Logistic, Gompertz, Gamma |
| Piecewise Models | Piecewise Exponential and Piecewise Weibull with customizable breakpoints |
| Automatic Model Selection | Compares models using AIC/BIC criteria |
| Google Sheets Integration | Reads inputs and writes results directly to Google Sheets |
| Visualization | Generates publication-ready survival curve plots |

Installation

Prerequisites

  • Python 3.8 or higher
  • pip package manager

Step 1: Install Dependencies

pip install numpy pandas scipy matplotlib gspread google-auth

Step 2: Clone/Download the Project

# If using git
git clone <repository-url>
cd survival_analysis_pipeline

# Or download and extract the zip file

Step 3: Verify Installation

python -c "from survival_analysis import guyot_algorithm; print('Installation successful!')"

Quick Start

Option A: Run the Test Script

python test_pipeline.py

This runs a complete analysis with sample data and outputs results to CSV files.

Option B: Use the Jupyter Notebook

# Open in VS Code
code survival_analysis_pipeline.ipynb

# Or start Jupyter
jupyter notebook survival_analysis_pipeline.ipynb

Option C: Minimal Python Script

import numpy as np
from survival_analysis import guyot_algorithm, fit_all_distributions

# Your digitized KM curve data
t_km = np.array([0, 6, 12, 18, 24])
s_km = np.array([1.0, 0.8, 0.6, 0.45, 0.35])

# Number at risk table
t_risk = np.array([0, 6, 12, 18, 24])
n_risk = np.array([100, 75, 50, 35, 25])

# Reconstruct IPD
ipd_df = guyot_algorithm(t_km, s_km, t_risk, n_risk)
print(f"Reconstructed {len(ipd_df)} patients")

# Fit parametric distributions
results, summary = fit_all_distributions(ipd_df['time'], ipd_df['event'])
print(summary[['Distribution', 'AIC']].to_string())

Detailed Usage

1. Guyot Algorithm

The Guyot algorithm reconstructs Individual Patient Data from published Kaplan-Meier curves.

Reference: Guyot P, et al. (2012) BMC Medical Research Methodology 12:9

Input Requirements

  1. Digitized KM curve - Time points and survival probabilities
  2. Number at risk table - Patient counts at specific intervals

Code Example

import numpy as np
from survival_analysis import guyot_algorithm, calculate_km_from_ipd

# Digitized KM curve (from tools like WebPlotDigitizer)
t_km = np.array([0, 1, 2, 3, 4, 5, 6, 8, 10, 12])
s_km = np.array([1.0, 0.95, 0.90, 0.85, 0.78, 0.72, 0.65, 0.55, 0.48, 0.42])

# Number at risk (from the table below the KM plot)
t_risk = np.array([0, 3, 6, 9, 12])
n_risk = np.array([150, 120, 90, 65, 50])

# Reconstruct IPD
ipd_df = guyot_algorithm(t_km, s_km, t_risk, n_risk)

# Output: DataFrame with columns 'time' and 'event'
# event = 1 (death/event occurred), event = 0 (censored)
print(ipd_df.head())
#        time  event
# 0  0.234567      1
# 1  0.456789      0
# 2  0.678901      1
# ...

# Validate by reconstructing KM curve
reconstructed = calculate_km_from_ipd(ipd_df)

Parameters

| Parameter | Type | Description |
| --- | --- | --- |
| t_km | array | Time points from the digitized KM curve |
| s_km | array | Survival probabilities (must start at 1.0) |
| t_risk | array | Time points at which the number at risk is reported |
| n_risk | array | Number at risk at each time point |
| tot_events | int (optional) | Total number of events, if known |
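
If the publication reports the total number of events, passing it can tighten the reconstruction. A minimal sketch, assuming the keyword name matches the parameter table above (85 is an illustrative value, not taken from the example data):

# tot_events is optional; pass it when the publication reports total events
ipd_df = guyot_algorithm(t_km, s_km, t_risk, n_risk, tot_events=85)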

2. Parametric Fitting

Fit standard survival distributions to the reconstructed IPD.

Available Distributions

| Distribution | Hazard Shape | Use Case |
| --- | --- | --- |
| Exponential | Constant | Simplest model; memoryless property |
| Weibull | Monotonic (increasing or decreasing) | Most common; flexible |
| Log-Normal | Non-monotonic (unimodal) | When the hazard rises then falls |
| Log-Logistic | Non-monotonic | Similar to Log-Normal |
| Gompertz | Exponentially increasing | Aging/mortality studies |
| Gamma | Flexible | General purpose |
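
To build intuition for the hazard shapes above, here is a self-contained numpy sketch of the Weibull hazard h(t) = (shape/scale) * (t/scale)^(shape - 1); it does not depend on this package:

import numpy as np

def weibull_hazard(t, shape, scale):
    # h(t) = (shape/scale) * (t/scale)**(shape - 1)
    return (shape / scale) * (t / scale) ** (shape - 1)

t = np.linspace(1, 24, 4)
for shape in (0.5, 1.0, 1.5):
    # shape < 1: decreasing hazard; shape = 1: constant (exponential);
    # shape > 1: increasing hazard
    print(f"shape={shape}:", weibull_hazard(t, shape, scale=12.0).round(4))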

Code Example

import numpy as np
from survival_analysis import fit_all_distributions, select_best_distribution

# Fit all distributions
results, summary_df = fit_all_distributions(
    times=ipd_df['time'].values,
    events=ipd_df['event'].values
)

# View comparison table (sorted by AIC)
print(summary_df)
#   Distribution  Log-Likelihood      AIC      BIC
# 0     Weibull         -450.23   904.46   910.12
# 1   Log-Normal        -452.11   908.22   913.88
# ...

# Get best model
best = select_best_distribution(results, criterion='AIC')
print(f"Best: {best.name}, Parameters: {best.params}")

# Predict survival at specific times
t_predict = np.array([6, 12, 24, 36])
survival_probs = best.survival(t_predict)
hazard_rates = best.hazard(t_predict)

Individual Distribution Usage

from survival_analysis import WeibullDistribution

weibull = WeibullDistribution()
weibull.fit(times, events)

print(f"Shape: {weibull.params['shape']}")
print(f"Scale: {weibull.params['scale']}")
print(f"AIC: {weibull.aic}")

# Survival at 12 months
print(f"S(12) = {weibull.survival(12)}")

3. Piecewise Models

For complex survival patterns, use piecewise models that allow different hazard rates in different time periods.

Piecewise Exponential

Constant hazard within each segment; the rate can change at each breakpoint.

from survival_analysis import PiecewiseExponential

# Define breakpoints (e.g., every 6 months)
breakpoints = [6, 12, 18]

pw_exp = PiecewiseExponential(breakpoints)
pw_exp.fit(times, events)

# View segment parameters
summary = pw_exp.summary()
print(summary['segments'])
#   Segment  Start   End    Lambda  Median_Survival
# 0       1      0     6    0.0450           15.40
# 1       2      6    12    0.0380           18.24
# 2       3     12    18    0.0320           21.66
# 3       4     18   Inf    0.0250           27.73
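
Under the hood, survival is the exponential of the accumulated segment hazards, S(t) = exp(-sum of lambda_i * dt_i), where dt_i is the time spent in segment i. A minimal sketch using the example rates printed above, independent of the package API:

import numpy as np

def piecewise_exp_survival(t, breakpoints, lambdas):
    # Accumulate hazard segment by segment: each segment contributes
    # its lambda times the portion of [0, t] that falls inside it
    edges = [0.0] + list(breakpoints) + [float('inf')]
    cum_hazard = 0.0
    for lam, lo, hi in zip(lambdas, edges[:-1], edges[1:]):
        cum_hazard += lam * max(0.0, min(t, hi) - lo)
        if t <= hi:
            break
    return np.exp(-cum_hazard)

# Survival at t=15 with the rates from the segment table above (~0.55)
print(piecewise_exp_survival(15.0, breakpoints=[6, 12, 18],
                             lambdas=[0.045, 0.038, 0.032, 0.025]))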

Piecewise Weibull

More flexible: both shape and scale can vary by segment.

from survival_analysis import PiecewiseWeibull

pw_weibull = PiecewiseWeibull(breakpoints=[12, 24])
pw_weibull.fit(times, events)

summary = pw_weibull.summary()
print(summary['segments'])

Automatic Breakpoint Selection

from survival_analysis import find_optimal_breakpoints

result = find_optimal_breakpoints(
    times, events,
    max_breakpoints=3,
    model_type='exponential'
)

print(f"Optimal breakpoints: {result['optimal_breakpoints']}")
print(result['comparison'])  # AIC for each number of breakpoints

4. Google Sheets Integration

Bypass Excel's character limits by using Google Sheets as your data source and output destination.

Setup for Google Colab (Recommended)

from google.colab import auth
from google.auth import default
import gspread

# Authenticate (opens popup)
auth.authenticate_user()
creds, _ = default()
gc = gspread.authorize(creds)

# Create new spreadsheet
spreadsheet = gc.create("My_Survival_Analysis")
print(f"URL: {spreadsheet.url}")

Setup for Local Environment

  1. Create a Google Cloud project
  2. Enable the Google Sheets API and the Google Drive API
  3. Create a service account and download its JSON key
  4. Share your spreadsheet with the service account's email address

import gspread

gc = gspread.service_account(filename='service_account.json')
spreadsheet = gc.open("My_Survival_Analysis")

Reading Data from Sheets

# Read KM curve data
ws = spreadsheet.worksheet("KM_Curve_Input")
data = ws.get_all_values()[1:]  # Skip header

t_km = np.array([float(row[0]) for row in data if row[0]])
s_km = np.array([float(row[1]) for row in data if row[1]])

Writing Results to Sheets

# Write IPD results as a 2-D range: one value per cell, so Excel's
# 32,767-character-per-cell limit never applies
ws = spreadsheet.worksheet("IPD_Results")
ws.clear()

data = [['time', 'event']] + ipd_df.values.tolist()
ws.update('A1', data)  # gspread >= 6 prefers ws.update(values=data, range_name='A1')

File Structure

survival_analysis_pipeline/
│
├── survival_analysis/              # Core Python package
│   ├── __init__.py                 # Package exports
│   ├── guyot_algorithm.py          # IPD reconstruction
│   ├── parametric_fitting.py       # Distribution fitting
│   ├── piecewise_fitting.py        # Piecewise models
│   └── sheets_integration.py       # Google Sheets helpers
│
├── survival_analysis_pipeline.ipynb # Main analysis notebook
├── google_sheets_setup.ipynb        # Sheets integration tutorial
├── test_pipeline.py                 # Test script with sample data
├── visualize_results.py             # Generate plots
│
├── sheets_simulation/               # Local test outputs
│   ├── KM_Curve_Input.csv
│   ├── Risk_Table_Input.csv
│   ├── IPD_Results.csv
│   ├── Parametric_Results.csv
│   ├── Piecewise_Results.csv
│   └── Survival_Predictions.csv
│
├── README.md                        # This file
└── GOOGLE_SHEETS_QUICKSTART.md      # Quick reference for Sheets

Workflows

Local Workflow (VS Code)

  1. Prepare your data

    • Digitize your KM curve using WebPlotDigitizer or similar
    • Record the number at risk table from the publication
  2. Open the notebook

    code survival_analysis_pipeline.ipynb
  3. Enter your data in the input cells

  4. Run all cells to generate:

    • Reconstructed IPD
    • Parametric fits
    • Piecewise models
    • Visualizations
  5. Export results to CSV files

Google Colab Workflow

  1. Upload files to Colab

    • Upload the survival_analysis/ folder
    • Upload google_sheets_setup.ipynb
  2. Authenticate with Google

    from google.colab import auth
    auth.authenticate_user()
  3. Create template spreadsheet

    • The notebook creates worksheets for input and output
  4. Enter data in Google Sheets

    • Open the spreadsheet URL
    • Enter KM data in KM_Curve_Input tab
    • Enter risk table in Risk_Table_Input tab
  5. Run analysis

    • Results are written back to the spreadsheet
  6. Create charts

    • Use Google Sheets' built-in charting on the predictions

Input Data Format

KM Curve Data

| Time | Survival |
| --- | --- |
| 0 | 1.00 |
| 2 | 0.95 |
| 4 | 0.88 |
| 6 | 0.80 |
| ... | ... |

Requirements:

  • First row must be Time=0, Survival=1.0
  • Time should be monotonically increasing
  • Survival should be monotonically decreasing

Number at Risk Table

| Time | N_at_Risk |
| --- | --- |
| 0 | 200 |
| 6 | 150 |
| 12 | 100 |
| ... | ... |

Requirements:

  • Must start at Time=0
  • N_at_Risk should decrease over time
  • Time points should align with, or be a subset of, the KM curve times
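
Before running the pipeline, it is worth checking both inputs against the requirements above. A minimal sketch; validate_inputs is a hypothetical helper, not part of the package:

import numpy as np

def validate_inputs(t_km, s_km, t_risk, n_risk):
    # Checks the input requirements listed above for both tables
    assert t_km[0] == 0 and s_km[0] == 1.0, "KM curve must start at (0, 1.0)"
    assert np.all(np.diff(t_km) > 0), "KM times must be strictly increasing"
    assert np.all(np.diff(s_km) <= 0), "survival must be non-increasing"
    assert t_risk[0] == 0, "risk table must start at Time=0"
    assert np.all(np.diff(n_risk) <= 0), "N at risk must be non-increasing"

validate_inputs(t_km, s_km, t_risk, n_risk)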

Output Files

IPD_Results.csv

time,event
0.234,1
0.567,0
0.891,1
...

  • time: Event or censoring time
  • event: 1 = event occurred, 0 = censored

Parametric_Results.csv

Distribution,Log-Likelihood,AIC,BIC
Weibull,-450.23,904.46,910.12
Log-Normal,-452.11,908.22,913.88
...

Piecewise_Results.csv

Segment,Start,End,Lambda,Median_Survival
1,0,6,0.045,15.4
2,6,12,0.038,18.2
...

Survival_Predictions.csv

Time,Original_KM,Weibull,Piecewise_Exp
0,1.0,1.0,1.0
6,0.77,0.76,0.75
12,0.55,0.54,0.53
...
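
A table like this can be assembled from the fitted models with pandas. A sketch under assumptions: 'best' is the model returned by select_best_distribution earlier, and its survival() method accepts an array as shown in the parametric-fitting example; the Original_KM and Piecewise_Exp columns would be added analogously:

import numpy as np
import pandas as pd

t_grid = np.arange(0, 37, 6)
pred = pd.DataFrame({
    'Time': t_grid,
    'Weibull': best.survival(t_grid),  # best model from fit_all_distributions
})
pred.to_csv('sheets_simulation/Survival_Predictions.csv', index=False)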

Troubleshooting

Common Issues

| Problem | Solution |
| --- | --- |
| ModuleNotFoundError | Run pip install numpy pandas scipy matplotlib gspread google-auth |
| Google auth fails | Re-run the authentication cell and check that the APIs are enabled |
| "Spreadsheet not found" | Share the sheet with the service account email |
| Poor fit quality | Check input data quality and try different distributions |
| Negative survival values | Ensure the KM data is monotonically decreasing |

Validation Tips

  1. Always validate the Guyot reconstruction (see the plotting sketch after this list)

    reconstructed = calculate_km_from_ipd(ipd_df)
    # Plot original vs reconstructed - the curves should overlap closely
  2. Check parameter reasonableness

    • Weibull shape > 1: increasing hazard
    • Weibull shape < 1: decreasing hazard
    • Weibull shape = 1: constant hazard (exponential)
  3. Compare AIC differences (AIC = 2k - 2 ln L, where k is the number of fitted parameters)

    • ΔAIC < 2: models essentially equivalent
    • ΔAIC 4-7: some evidence for the better model
    • ΔAIC > 10: strong evidence for the better model
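
A minimal matplotlib sketch for the overlay in tip 1; the column names of the reconstructed curve ('time' and 'survival') are assumptions, so adjust them to whatever calculate_km_from_ipd actually returns:

import matplotlib.pyplot as plt

reconstructed = calculate_km_from_ipd(ipd_df)

plt.step(t_km, s_km, where='post', label='Original (digitized)')
plt.step(reconstructed['time'], reconstructed['survival'],  # column names assumed
         where='post', linestyle='--', label='Reconstructed')
plt.xlabel('Time')
plt.ylabel('Survival probability')
plt.legend()
plt.title('Guyot reconstruction check')
plt.show()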

References

  1. Guyot Algorithm: Guyot P, Ades AE, Ouwens MJ, Welton NJ. (2012). Enhanced secondary analysis of survival data: reconstructing the data from published Kaplan-Meier survival curves. BMC Medical Research Methodology, 12:9.

  2. Parametric Survival Models: Collett D. (2015). Modelling Survival Data in Medical Research (3rd ed.). CRC Press.

  3. Digitizing Tools: WebPlotDigitizer (https://automeris.io/WebPlotDigitizer) or similar curve-digitizing software.


License

This project is provided for educational and research purposes.


Support

For issues or questions:

  1. Check the Troubleshooting section
  2. Review the example notebooks
  3. Open an issue on the repository
