EPIC Array Aging Clocks — Benchmarking Epigenetic Biomarkers of Aging

This repository contains a Jupyter notebook that benchmarks eight established epigenetic aging clocks across two publicly available DNA methylation datasets using the Bio-Learn open-source library. The analysis was completed as part of a course assignment on biomarkers of aging.

Background

DNA methylation patterns change predictably with age, and this property has been exploited to build computational models — commonly called "epigenetic clocks" — that estimate biological age from methylation array data. Unlike chronological age, epigenetic age captures meaningful biological variation: individuals who appear epigenetically older than their calendar age tend to have worse health outcomes, higher mortality risk, and faster functional decline.

The Bio-Learn library (Ying et al., 2025, Nature Aging) provides a unified framework for loading public datasets and running a standardised collection of aging clocks in only a few lines of code, making cross-dataset benchmarking straightforward and reproducible.
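
As a rough illustration of that workflow, the sketch below loads one dataset and runs a list of clocks, skipping any clock that raises an error. It assumes the `DataLibrary().get(...).load()` and `ModelGallery().get(...).predict(...)` calls described in the Bio-Learn documentation, and that `predict` returns a table with a `Predicted` column; the exact signatures may differ between Bio-Learn versions. `run_clocks` and `load_and_run` are hypothetical helper names, not part of Bio-Learn and not necessarily what the notebook uses.

```python
import pandas as pd


def run_clocks(data, clock_names, get_model):
    """Run each named clock on `data`; skip clocks that raise.

    `get_model` is any callable mapping a clock name to an object with a
    `.predict(data)` method (for Bio-Learn, assumed to be `ModelGallery().get`).
    """
    predictions = {}
    skipped = {}
    for name in clock_names:
        try:
            result = get_model(name).predict(data)
        except Exception as exc:  # e.g. required CpGs missing from the data
            skipped[name] = str(exc)
            print(f"Skipping {name}: {exc}")
            continue
        # Assumption: the result is a DataFrame with a "Predicted" column.
        predictions[name] = result["Predicted"]
    # One column per successful clock, one row per sample.
    return pd.DataFrame(predictions), skipped


def load_and_run(geo_id, clock_names):
    """Load a GEO dataset through Bio-Learn and run the requested clocks."""
    # Imported lazily so run_clocks can be used without Bio-Learn installed.
    from biolearn.data_library import DataLibrary
    from biolearn.model_gallery import ModelGallery

    data = DataLibrary().get(geo_id).load()
    return run_clocks(data, clock_names, ModelGallery().get)
```

A call such as `load_and_run("GSE41169", ["Horvathv1", "Hannum", "PhenoAge"])` would then return a table of predictions and a dictionary of skipped clocks with their error messages. Passing the model lookup in as an argument keeps the skipping logic testable without network access.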

Datasets

Two datasets were selected from the Gene Expression Omnibus (GEO). Both were profiled on the Illumina HumanMethylation450 (450k) array and are included in the Bio-Learn data library.

GEO Accession	Description	Samples	Array
GSE41169	Dutch Schizophrenia Case-Control Cohort	95	Illumina 450k
GSE64495	Developmental Disorder Study	113	Illumina 450k

These two cohorts were chosen because they span a reasonable chronological age range, have curated age metadata within Bio-Learn, and represent distinct biological contexts — one psychiatric, one developmental — allowing us to examine how clock performance generalises across settings.
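
Checking sample counts and the age range of a cohort can be sketched with pandas alone. The example below uses a small made-up methylation matrix and metadata table in place of the real data. It assumes CpG sites are rows and samples are columns, and that the metadata has a numeric `age` column; `summarise_dataset` is a hypothetical helper, not a Bio-Learn function.

```python
import pandas as pd


def summarise_dataset(dnam, metadata):
    """Return basic counts and age statistics for one dataset.

    Assumes `dnam` has CpG sites as rows and samples as columns, and that
    `metadata` is indexed by sample with a numeric "age" column.
    """
    age = metadata["age"].dropna()
    return {
        "n_samples": dnam.shape[1],
        "n_cpgs": dnam.shape[0],
        "age_min": float(age.min()),
        "age_max": float(age.max()),
        "age_mean": float(age.mean()),
        "n_missing_age": int(metadata["age"].isna().sum()),
    }


# Synthetic stand-in data: 3 CpGs x 4 samples (values are illustrative only).
dnam = pd.DataFrame(
    [[0.10, 0.20, 0.30, 0.40], [0.80, 0.70, 0.60, 0.50], [0.50, 0.50, 0.50, 0.50]],
    index=["cg0001", "cg0002", "cg0003"],
    columns=["s1", "s2", "s3", "s4"],
)
metadata = pd.DataFrame({"age": [20.0, 35.0, None, 65.0]}, index=dnam.columns)

summary = summarise_dataset(dnam, metadata)
print(summary)
```

Counting missing ages separately is useful because samples without an age cannot contribute to any comparison of predicted and chronological age.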

Aging Clocks

Eight aging clocks were selected to cover the major categories represented in the Bio-Learn model gallery.

Clock	Category	Description
Horvathv1	Chronological	Multi-tissue clock trained on 353 CpGs; the original and most widely used epigenetic clock (Horvath 2013)
Hannum	Chronological	Blood-specific clock trained on 71 CpGs (Hannum 2013)
Lin	Chronological	Blood-specific clock using 99 CpGs (Lin 2016)
PhenoAge	Mortality / Healthspan	Trained to predict "phenotypic age", a composite measure derived from clinical biomarkers and chronological age; captures morbidity beyond calendar age (Levine 2018)
GrimAge	Mortality / Healthspan	Trained on DNA methylation surrogates of plasma proteins and of smoking pack-years; among the strongest predictors of lifespan (Lu 2019)
Zhang_10	Mortality	Trained directly on mortality outcomes using 10 CpGs (Zhang 2017)
DunedinPACE	Rate of Aging	Unlike the others, outputs a rate (biological years per chronological year) rather than an age estimate; values above 1.0 indicate accelerated aging (Belsky 2022)
YingCausAge	Causality-Enriched	Uses CpGs selected for causal rather than merely correlative relationships with aging (Ying 2022)

Analysis

The notebook works through six main steps.

1. Data Loading. Both GEO datasets are downloaded and cached automatically using DataLibrary. No manual file handling is required.

2. Dataset Description. Each dataset is summarised with sample counts, CpG site counts, age range and distribution, and metadata previews.

3. Clock Predictions. All eight clocks are run on both datasets using ModelGallery. Clocks that fail on a given dataset (e.g. due to missing CpGs) are skipped gracefully with an informative message.

4. Correlation Matrix. A Pearson correlation matrix is computed across all clock predictions for each dataset separately. This reveals which clocks share biological signal and which measure distinct dimensions of aging. Both the Bio-Learn built-in visualisation and a custom annotated heatmap are produced.

5. Age Deviation Heatmap. For each chronological age clock, the deviation (predicted age minus chronological age) is computed per sample and displayed as a heatmap with samples ordered by increasing chronological age. Red indicates epigenetic age acceleration; blue indicates deceleration.

6. Predicted vs. Chronological Age. Scatter plots are produced for each age clock on each dataset, annotated with Pearson R, RMSE, and a linear regression line. A summary performance table comparing all clocks side-by-side is also included. DunedinPACE is handled separately and visualised as a histogram of its rate distribution.
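
The deviation in step 5 and the metrics in step 6 can be sketched with NumPy on made-up ages. The helper names `age_deviation` and `clock_performance` are hypothetical, and the notebook may compute these quantities differently (for example with scipy).

```python
import numpy as np


def age_deviation(predicted, chronological):
    """Per-sample deviation: predicted age minus chronological age."""
    return np.asarray(predicted, dtype=float) - np.asarray(chronological, dtype=float)


def clock_performance(predicted, chronological):
    """Pearson R, RMSE, and least-squares line of predicted on chronological age."""
    pred = np.asarray(predicted, dtype=float)
    chron = np.asarray(chronological, dtype=float)
    r = np.corrcoef(chron, pred)[0, 1]
    rmse = np.sqrt(np.mean((pred - chron) ** 2))
    slope, intercept = np.polyfit(chron, pred, deg=1)
    return {"pearson_r": r, "rmse": rmse, "slope": slope, "intercept": intercept}


# Illustrative values only: a clock that overestimates every sample by 2 years.
chronological = [20.0, 30.0, 40.0, 50.0]
predicted = [22.0, 32.0, 42.0, 52.0]

deviation = age_deviation(predicted, chronological)
order = np.argsort(chronological)  # sample order for a heatmap sorted by age
print(deviation[order])
print(clock_performance(predicted, chronological))
```

In this toy case the correlation is perfect while the RMSE is 2 years, which shows why both metrics are reported: Pearson R reflects how well a clock ranks samples by age, whereas RMSE also penalises a systematic offset.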

Repository Contents

Epic-Array-Aging-Clocks/
└── EPIC_Array_Aging_Clocks_BioLearn.ipynb   # Main analysis notebook

Requirements

The notebook is designed to run in Google Colab with no local setup. All dependencies are installed in the first cell.

biolearn
pandas
numpy
matplotlib
seaborn
scipy

Python 3.8 or higher is required. The notebook has been tested on Google Colab (May 2026).

Running the Notebook

Option 1 — Google Colab (recommended)

1. Download EPIC_Array_Aging_Clocks_BioLearn.ipynb from this repository.
2. Upload it to Google Colab.
3. Select Runtime > Run all.

The first run will take several minutes as the GEO datasets are downloaded and cached. Subsequent runs are faster.

Option 2 — Local Jupyter

pip install biolearn pandas numpy matplotlib seaborn scipy notebook
jupyter notebook EPIC_Array_Aging_Clocks_BioLearn.ipynb

Key Observations

Chronological clocks (Horvath, Hannum, Lin) tend to correlate strongly with one another, as they are all optimised to predict the same target. Mortality-oriented clocks (GrimAge, PhenoAge, Zhang_10) show more variable correlation, reflecting that they capture aspects of biological aging that go beyond calendar age. DunedinPACE is structurally distinct — its output is not comparable to predicted age in years and is visualised separately. Clock accuracy, measured by Pearson R and RMSE, varies across datasets, which is expected given differences in age range, tissue type, and cohort composition.
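
The correlation pattern described above can be illustrated with synthetic numbers. The sketch below is not based on the notebook's results: it simulates two age predictors that track chronological age with noise and, as a deliberately extreme simplification, one rate-like output generated independently of age. All column names are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=200)

# Synthetic clock outputs (not real results): two age predictors that track
# chronological age with noise, and one rate-like output centred near 1.0
# that is generated independently of age.
predictions = pd.DataFrame(
    {
        "clock_a": age + rng.normal(0, 3, size=200),
        "clock_b": 0.9 * age + 5 + rng.normal(0, 4, size=200),
        "rate_like": rng.normal(1.0, 0.1, size=200),
    }
)

# Pairwise Pearson correlation between the columns.
corr = predictions.corr(method="pearson")
print(corr.round(2))
```

Because both simulated age predictors are driven by the same underlying variable, they correlate strongly with each other, while the independently generated rate-like column does not. With real data, the same `DataFrame.corr` call would be applied to the table of clock predictions for each dataset.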

Reference

Ying, K., Paulson, S., Perez-Guevara, M., Emamifar, M., Martinez, M. C., Kwon, D., Poganik, J. R., Moqri, M., and Gladyshev, V. N. (2025). A unified framework for systematic curation and evaluation of aging biomarkers. Nature Aging. https://doi.org/10.1038/s43587-025-00987-y

License

This project is for academic and educational purposes. The underlying datasets are publicly available through NCBI GEO. Clock implementations are provided by the Bio-Learn library under its respective license.
