This repository contains a Jupyter notebook that benchmarks eight established epigenetic aging clocks across two publicly available DNA methylation datasets using the Bio-Learn open-source library. The analysis was completed as part of a course assignment on biomarkers of aging.
DNA methylation patterns change predictably with age, and this property has been exploited to build computational models — commonly called "epigenetic clocks" — that estimate biological age from methylation array data. Unlike chronological age, epigenetic age captures meaningful biological variation: individuals who appear epigenetically older than their calendar age tend to have worse health outcomes, higher mortality risk, and faster functional decline.
The Bio-Learn library (Ying et al., 2025, Nature Aging) provides a unified framework for loading public datasets and running a standardised collection of aging clocks in only a few lines of code, making cross-dataset benchmarking straightforward and reproducible.
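As a rough illustration of the "few lines of code" claim, the sketch below loads one dataset and runs one clock on it. It assumes the `DataLibrary().get(...).load()` and `ModelGallery().get(...).predict(...)` calls shown in the Bio-Learn quickstart. `predict_one_clock` and its `library` and `gallery` parameters are hypothetical names introduced here so the function can also be exercised with stand-in objects; they are not part of Bio-Learn or of the notebook.

```python
def predict_one_clock(dataset_id, clock_name, library=None, gallery=None):
    """Load one dataset and return one clock's predictions for it.

    `library` and `gallery` are optional stand-ins (hypothetical parameters);
    when omitted, the Bio-Learn classes are imported and used instead.
    """
    if library is None:
        # Assumed import path, as in the Bio-Learn quickstart.
        from biolearn.data_library import DataLibrary
        library = DataLibrary()
    if gallery is None:
        # Assumed import path, as in the Bio-Learn quickstart.
        from biolearn.model_gallery import ModelGallery
        gallery = ModelGallery()

    data = library.get(dataset_id).load()         # downloads and caches on first use
    return gallery.get(clock_name).predict(data)  # per-sample predictions
```

With `biolearn` installed and network access available, a call such as `predict_one_clock("GSE41169", "Horvathv1")` would be expected to return one prediction per sample.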
Two datasets were selected from the Gene Expression Omnibus (GEO), both profiled on the Illumina HumanMethylation450k array and included in the Bio-Learn data library.
| GEO Accession | Description | Samples | Array |
|---|---|---|---|
| GSE41169 | Dutch Schizophrenia Case-Control Cohort | 95 | Illumina 450k |
| GSE64495 | Developmental Disorder Study | 113 | Illumina 450k |
These two cohorts were chosen because they span a reasonable chronological age range, have curated age metadata within Bio-Learn, and represent distinct biological contexts — one psychiatric, one developmental — allowing us to examine how clock performance generalises across settings.
Eight aging clocks were selected to cover the major categories represented in the Bio-Learn model gallery.
| Clock | Category | Description |
|---|---|---|
| Horvathv1 | Chronological | Multi-tissue clock based on 353 CpGs; the original and most widely used epigenetic clock (Horvath 2013) |
| Hannum | Chronological | Blood-specific clock based on 71 CpGs (Hannum 2013) |
| Lin | Chronological | Blood-specific clock using 99 CpGs (Lin 2016) |
| PhenoAge | Mortality / Healthspan | Trained to predict "phenotypic age", a composite measure derived from clinical markers; captures morbidity beyond calendar age (Levine 2018) |
| GrimAge | Mortality / Healthspan | Built from DNA methylation surrogates of plasma proteins and smoking pack-years; among the strongest predictors of lifespan (Lu 2019) |
| Zhang_10 | Mortality | Trained directly on mortality outcomes using 10 CpGs (Zhang 2017) |
| DunedinPACE | Rate of Aging | Unlike the others, outputs a rate (biological years per chronological year) rather than an age estimate; values above 1.0 indicate accelerated aging (Belsky 2022) |
| YingCausAge | Causality-Enriched | Uses CpGs selected for causal rather than merely correlative relationships with aging (Ying 2022) |
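Because DunedinPACE reports a rate rather than an age, its values are read relative to 1.0. The helpers below are a hypothetical illustration (not part of Bio-Learn or the notebook) that converts a pace value into the percentage by which aging runs faster or slower than one biological year per chronological year. The example values are made up.

```python
def pace_to_percent(pace):
    """Percent faster (+) or slower (-) than 1 biological year per chronological year."""
    return (pace - 1.0) * 100.0


def describe_pace(pace):
    """Label a DunedinPACE-style rate relative to the reference value of 1.0."""
    if pace > 1.0:
        return "accelerated"
    if pace < 1.0:
        return "decelerated"
    return "reference"


# Made-up example values, not results from either dataset.
for pace in (0.90, 1.00, 1.15):
    print(f"{pace:.2f}: {describe_pace(pace)} ({pace_to_percent(pace):+.0f}%)")
```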
The notebook works through six main steps.
1. Data Loading. Both GEO datasets are downloaded and cached automatically using `DataLibrary`. No manual file handling is required.
2. Dataset Description. Each dataset is summarised with sample counts, CpG site counts, age range and distribution, and metadata previews.
3. Clock Predictions. All eight clocks are run on both datasets using `ModelGallery`. Clocks that fail on a given dataset (e.g. due to missing CpGs) are skipped gracefully with an informative message.
4. Correlation Matrix. A Pearson correlation matrix is computed across all clock predictions for each dataset separately. This reveals which clocks share biological signal and which measure distinct dimensions of aging. Both the Bio-Learn built-in visualisation and a custom annotated heatmap are produced.
5. Age Deviation Heatmap. For each chronological age clock, the deviation (predicted age minus chronological age) is computed per sample and displayed as a heatmap with samples ordered by increasing chronological age. Red indicates epigenetic age acceleration; blue indicates deceleration.
6. Predicted vs. Chronological Age. Scatter plots are produced for each age clock on each dataset, annotated with Pearson R, RMSE, and a linear regression line. A summary performance table comparing all clocks side-by-side is also included. DunedinPACE is handled separately and visualised as a histogram of its rate distribution.
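Steps 3 to 5 above can be sketched as follows. This is a minimal illustration under stated assumptions, not the notebook's code: `run_clocks` and `ToyClock` are hypothetical names, the toy ages and errors are made up, and the sketch assumes each model's `predict` returns a DataFrame with a `Predicted` column indexed by sample. With Bio-Learn, `get_model` would presumably be `ModelGallery().get`.

```python
import pandas as pd


def run_clocks(clock_names, get_model, data):
    """Run each clock on `data`, skipping any that fail, and collect predictions."""
    predictions = {}
    for name in clock_names:
        try:
            result = get_model(name).predict(data)
        except Exception as exc:  # e.g. CpGs required by the clock are missing
            print(f"Skipping {name}: {exc}")
            continue
        predictions[name] = result["Predicted"]
    return pd.DataFrame(predictions)  # rows: samples, columns: clocks


class ToyClock:
    """Hypothetical stand-in for a clock: chronological age plus a fixed error."""

    def __init__(self, errors=None):
        self.errors = errors  # None simulates a clock that cannot run

    def predict(self, ages):
        if self.errors is None:
            raise ValueError("required CpGs missing")
        noise = pd.Series(self.errors, index=ages.index)
        return pd.DataFrame({"Predicted": ages + noise})


# Made-up chronological ages for three samples.
ages = pd.Series([60.0, 25.0, 40.0], index=["s1", "s2", "s3"])
toy_gallery = {
    "ClockA": ToyClock([2.0, -1.0, 0.5]),
    "ClockB": ToyClock([-3.0, 1.0, 2.0]),
    "ClockC": ToyClock(None),  # will be skipped with a message
}

# Step 3: predictions from every clock that runs.
preds = run_clocks(toy_gallery, toy_gallery.__getitem__, ages)

# Step 4: Pearson correlation between clocks, computed across samples.
corr = preds.corr(method="pearson")

# Step 5: deviation = predicted minus chronological age, ordered by age.
deviation = preds.sub(ages, axis=0).loc[ages.sort_values().index]

print(corr.round(3))
print(deviation)
```

A frame like `deviation`, transposed so that samples run along one axis, could then be drawn as a heatmap with a diverging colour map centred on zero, so that positive values appear red and negative values blue.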
Epic-Array-Aging-Clocks/
└── EPIC_Array_Aging_Clocks_BioLearn.ipynb # Main analysis notebook
The notebook is designed to run in Google Colab with no local setup. All dependencies are installed in the first cell.
biolearn
pandas
numpy
matplotlib
seaborn
scipy
Python 3.8 or higher is required. The notebook has been tested on Google Colab (May 2026).
Option 1 — Google Colab (recommended)
- Download `EPIC_Array_Aging_Clocks_BioLearn.ipynb` from this repository.
- Upload it to Google Colab.
- Select `Runtime > Run all`.
The first run will take several minutes as the GEO datasets are downloaded and cached. Subsequent runs are faster.
Option 2 — Local Jupyter
pip install biolearn pandas numpy matplotlib seaborn scipy notebook
jupyter notebook EPIC_Array_Aging_Clocks_BioLearn.ipynb

Chronological clocks (Horvath, Hannum, Lin) tend to correlate strongly with one another, as they are all optimised to predict the same target. Mortality-oriented clocks (GrimAge, PhenoAge, Zhang_10) show more variable correlation, reflecting that they capture aspects of biological aging that go beyond calendar age. DunedinPACE is structurally distinct: its output is not comparable to predicted age in years and is visualised separately. Clock accuracy, measured by Pearson R and RMSE, varies across datasets, which is expected given differences in age range, tissue type, and cohort composition.
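Pearson R and RMSE answer different questions, so reporting both is informative: R measures how closely predictions track chronological age, while RMSE measures how far off they are in years. The sketch below uses NumPy, which may differ from what the notebook calls, with a hypothetical `clock_metrics` helper and made-up ages. It shows a clock that is perfectly correlated with age yet biased by five years.

```python
import numpy as np


def clock_metrics(chronological, predicted):
    """Return Pearson r, RMSE (in years), and the least-squares line of predicted on age."""
    x = np.asarray(chronological, dtype=float)
    y = np.asarray(predicted, dtype=float)
    r = np.corrcoef(x, y)[0, 1]                  # linear association
    rmse = np.sqrt(np.mean((y - x) ** 2))        # typical error size in years
    slope, intercept = np.polyfit(x, y, deg=1)   # regression line for the scatter plot
    return {"r": r, "rmse": rmse, "slope": slope, "intercept": intercept}


# Made-up values: every prediction is exactly five years too old.
ages = [20, 30, 40, 50]
biased_predictions = [25, 35, 45, 55]

metrics = clock_metrics(ages, biased_predictions)
print({name: round(float(value), 3) for name, value in metrics.items()})
```

Here the correlation is perfect while the RMSE is five years, which is why a high R on its own does not show that a clock is well calibrated on a given cohort.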
Ying, K., Paulson, S., Perez-Guevara, M., Emamifar, M., Martinez, M. C., Kwon, D., Poganik, J. R., Moqri, M., and Gladyshev, V. N. (2025). A unified framework for systematic curation and evaluation of aging biomarkers. Nature Aging. https://doi.org/10.1038/s43587-025-00987-y
This project is for academic and educational purposes. The underlying datasets are publicly available through NCBI GEO. Clock implementations are provided by the Bio-Learn library under its respective license.