washma-sajjad/Epic-Array-Aging-Clocks

EPIC Array Aging Clocks — Benchmarking Epigenetic Biomarkers of Aging

This repository contains a Jupyter notebook that benchmarks eight established epigenetic aging clocks across two publicly available DNA methylation datasets using the Bio-Learn open-source library. The analysis was completed as part of a course assignment on biomarkers of aging.


Background

DNA methylation patterns change predictably with age, and this property has been exploited to build computational models — commonly called "epigenetic clocks" — that estimate biological age from methylation array data. Unlike chronological age, epigenetic age captures meaningful biological variation: individuals who appear epigenetically older than their calendar age tend to have worse health outcomes, higher mortality risk, and faster functional decline.

The Bio-Learn library (Ying et al., 2025, Nature Aging) provides a unified framework for loading public datasets and running a standardised collection of aging clocks in only a few lines of code, making cross-dataset benchmarking straightforward and reproducible.


Datasets

Two datasets were selected from the Gene Expression Omnibus (GEO). Both were profiled on the Illumina HumanMethylation450 (450k) array and are included in the Bio-Learn data library.

| GEO Accession | Description | Samples | Array |
| --- | --- | --- | --- |
| GSE41169 | Dutch Schizophrenia Case-Control Cohort | 95 | Illumina 450k |
| GSE64495 | Developmental Disorder Study | 113 | Illumina 450k |

These two cohorts were chosen because they span a reasonable chronological age range, have curated age metadata within Bio-Learn, and represent distinct biological contexts — one psychiatric, one developmental — allowing us to examine how clock performance generalises across settings.


Aging Clocks

Eight aging clocks were selected to cover the major categories represented in the Bio-Learn model gallery.

| Clock | Category | Description |
| --- | --- | --- |
| Horvathv1 | Chronological | Multi-tissue clock based on 353 CpGs; the original and most widely used epigenetic clock (Horvath 2013) |
| Hannum | Chronological | Blood-specific clock based on 71 CpGs (Hannum 2013) |
| Lin | Chronological | Blood-specific clock using 99 CpGs (Lin 2016) |
| PhenoAge | Mortality / Healthspan | Trained to predict "phenotypic age" derived from clinical markers; captures morbidity beyond calendar age (Levine 2018) |
| GrimAge | Mortality / Healthspan | Trained on DNA methylation surrogates of plasma proteins and smoking pack-years; among the strongest predictors of lifespan (Lu 2019) |
| Zhang_10 | Mortality | Mortality risk score trained directly on mortality outcomes using 10 CpGs (Zhang 2017) |
| DunedinPACE | Rate of Aging | Unlike the others, outputs a rate (biological years per chronological year) rather than an age estimate; values above 1.0 indicate accelerated aging (Belsky 2022) |
| YingCausAge | Causality-Enriched | Uses CpGs selected for causal rather than merely correlative relationships with aging (Ying 2022) |
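
The difference between an age-estimating clock and a rate clock can be illustrated with a small sketch. The function names and sample values below are hypothetical and are not taken from the notebook or from Bio-Learn; the only facts assumed are that age deviation is predicted minus chronological age and that a DunedinPACE value of 1.0 is the reference rate.

```python
def age_deviation(predicted_age, chronological_age):
    """Predicted minus chronological age in years (positive = acceleration)."""
    return predicted_age - chronological_age


def describe_pace(pace, reference=1.0):
    """Classify a DunedinPACE-style rate relative to the reference rate of 1.0."""
    if pace > reference:
        return "accelerated"
    if pace < reference:
        return "decelerated"
    return "at reference rate"


# Hypothetical sample: chronological age 40, clock estimate 45.5, pace 1.12.
print(age_deviation(45.5, 40.0))  # 5.5
print(describe_pace(1.12))        # accelerated
```

Because the two outputs are on different scales (years versus a unitless rate), they should not be plotted on the same axis.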

Analysis

The notebook works through six main steps.

1. Data Loading. Both GEO datasets are downloaded and cached automatically using DataLibrary. No manual file handling is required.

2. Dataset Description. Each dataset is summarised with sample counts, CpG site counts, age range and distribution, and metadata previews.

3. Clock Predictions. All eight clocks are run on both datasets using ModelGallery. Clocks that fail on a given dataset (e.g. due to missing CpGs) are skipped gracefully with an informative message.

4. Correlation Matrix. A Pearson correlation matrix is computed across all clock predictions for each dataset separately. This reveals which clocks share biological signal and which measure distinct dimensions of aging. Both the Bio-Learn built-in visualisation and a custom annotated heatmap are produced.

5. Age Deviation Heatmap. For each chronological age clock, the deviation (predicted age minus chronological age) is computed per sample and displayed as a heatmap with samples ordered by increasing chronological age. Red indicates epigenetic age acceleration; blue indicates deceleration.

6. Predicted vs. Chronological Age. Scatter plots are produced for each age clock on each dataset, annotated with Pearson R, RMSE, and a linear regression line. A summary performance table comparing all clocks side-by-side is also included. DunedinPACE is handled separately and visualised as a histogram of its rate distribution.
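
Steps 1 and 3 might look roughly like the sketch below. It assumes the `DataLibrary().get(...).load()` and `ModelGallery().get(...).predict(...)` call pattern described in the Bio-Learn documentation; the helper names `run_clocks` and `load_and_run` are hypothetical, and the notebook's own code may differ.

```python
def run_clocks(gallery, data, clock_names):
    """Run each named clock on `data`, skipping any clock that raises.

    `gallery` is any object exposing gallery.get(name).predict(data),
    which is the pattern assumed here for Bio-Learn's ModelGallery.
    """
    predictions, skipped = {}, {}
    for name in clock_names:
        try:
            predictions[name] = gallery.get(name).predict(data)
        except Exception as exc:  # e.g. CpGs required by the clock are missing
            skipped[name] = str(exc)
            print(f"Skipping {name}: {exc}")
    return predictions, skipped


def load_and_run(dataset_id="GSE41169", clock_names=("Horvathv1", "Hannum")):
    """Hypothetical entry point; needs biolearn installed and network access."""
    # Imported lazily so that run_clocks can be used and tested without biolearn.
    from biolearn.data_library import DataLibrary
    from biolearn.model_gallery import ModelGallery

    data = DataLibrary().get(dataset_id).load()  # downloads and caches on first use
    return run_clocks(ModelGallery(), data, clock_names)
```

Collecting the skipped clocks in a separate dictionary keeps the failure reasons available for the summary table instead of losing them in the log output.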


Repository Contents

Epic-Array-Aging-Clocks/
└── EPIC_Array_Aging_Clocks_BioLearn.ipynb   # Main analysis notebook

Requirements

The notebook is designed to run in Google Colab with no local setup. All dependencies are installed in the first cell.

biolearn
pandas
numpy
matplotlib
seaborn
scipy

Python 3.8 or higher is required. The notebook has been tested on Google Colab (May 2026).


Running the Notebook

Option 1 — Google Colab (recommended)

  1. Download EPIC_Array_Aging_Clocks_BioLearn.ipynb from this repository.
  2. Upload it to Google Colab.
  3. Select Runtime > Run all.

The first run will take several minutes as the GEO datasets are downloaded and cached. Subsequent runs are faster.

Option 2 — Local Jupyter

pip install biolearn pandas numpy matplotlib seaborn scipy notebook
jupyter notebook EPIC_Array_Aging_Clocks_BioLearn.ipynb

Key Observations

Chronological clocks (Horvath, Hannum, Lin) tend to correlate strongly with one another, as they are all optimised to predict the same target. Mortality-oriented clocks (GrimAge, PhenoAge, Zhang_10) show more variable correlation, reflecting that they capture aspects of biological aging that go beyond calendar age. DunedinPACE is structurally distinct — its output is not comparable to predicted age in years and is visualised separately. Clock accuracy, measured by Pearson R and RMSE, varies across datasets, which is expected given differences in age range, tissue type, and cohort composition.
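
For readers who want to see how these comparisons are computed, here is a minimal standard-library sketch of Pearson R, RMSE, and a clock-by-clock correlation matrix. The clock names and all numbers are synthetic and purely illustrative; they are not results from the notebook, which uses pandas and scipy rather than these hand-written helpers.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rmse(predicted, actual):
    """Root mean squared error between predicted and actual values."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(actual))


def correlation_matrix(columns):
    """Pairwise Pearson correlations for a dict of name -> list of values."""
    names = list(columns)
    return {a: {b: pearson_r(columns[a], columns[b]) for b in names} for a in names}


# Synthetic values for illustration only (not taken from either GEO dataset).
chronological_age = [20.0, 30.0, 40.0, 50.0, 60.0]
clock_predictions = {
    "clock_a": [22.0, 31.0, 43.0, 52.0, 61.0],
    "clock_b": [19.0, 33.0, 38.0, 54.0, 58.0],
}

# Accuracy of each clock against chronological age.
for name, predicted in clock_predictions.items():
    r = pearson_r(predicted, chronological_age)
    err = rmse(predicted, chronological_age)
    print(f"{name}: R={r:.3f}, RMSE={err:.3f}")

# Agreement between clocks, independent of chronological age.
matrix = correlation_matrix(clock_predictions)
print(f"clock_a vs clock_b: R={matrix['clock_a']['clock_b']:.3f}")
```

Pearson R measures how well a clock tracks the ordering and spread of ages, while RMSE also penalises a constant offset, so a clock can score a high R and still have a large RMSE.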


Reference

Ying, K., Paulson, S., Perez-Guevara, M., Emamifar, M., Martinez, M. C., Kwon, D., Poganik, J. R., Moqri, M., and Gladyshev, V. N. (2025). A unified framework for systematic curation and evaluation of aging biomarkers. Nature Aging. https://doi.org/10.1038/s43587-025-00987-y


License

This project is for academic and educational purposes. The underlying datasets are publicly available through NCBI GEO. Clock implementations are provided by the Bio-Learn library under its respective license.
