AAAI Symposium Series 2024
Paper: https://doi.org/10.1609/aaai.v4i1.31212
Code: https://github.com/Sajib-006/DeepAge
DeepAge is a deep learning framework for predicting biological age from DNA methylation profiles. The pipeline integrates large-scale methylation datasets, selects informative CpG sites with correlation-based filtering, and trains a neural network on chronological age labels to produce an epigenetic age estimate.
This repository provides a fully reproducible pipeline including data preprocessing, feature selection, model training, evaluation, and visualization.
Epigenetic clocks based on DNA methylation have become powerful tools for studying aging and disease. However, many existing models rely on limited CpG sets or shallow statistical models.
DeepAge introduces a scalable deep learning pipeline that:
- Integrates multiple public methylation datasets
- Performs dual-correlation feature selection
- Learns complex nonlinear relationships between CpG sites and age
- Provides interpretable evaluation metrics and visualizations
This repository contains everything needed to reproduce the modeling pipeline and experiments.
DeepAge
│
├── README.md
├── LICENSE
├── requirements.txt
├── pyproject.toml
├── .gitignore
│
├── data/
│   ├── raw/            # Raw datasets (not included in repo)
│   ├── processed/      # Processed training data
│   └── README.md
│
├── src/
│   └── deepage/
│       ├── __init__.py
│       ├── preprocessing.py
│       ├── feature_selection.py
│       ├── model.py
│       ├── train.py
│       ├── evaluate.py
│       ├── visualization.py
│       └── utils.py
│
├── scripts/
│   ├── prepare_data.py
│   ├── select_features.py
│   ├── train_model.py
│   ├── evaluate_model.py
│   └── plot_results.py
│
├── results/
│   ├── models/
│   ├── figures/
│   └── metrics/
│
├── docs/
│   └── methodology.md
│
└── notebooks_archive/
git clone https://github.com/<your-username>/DeepAge.git
cd DeepAge
Using conda:
conda create -n deepage python=3.10
conda activate deepage
Or using venv:
python -m venv deepage_env
source deepage_env/bin/activate
pip install -r requirements.txt
This repository does not distribute raw methylation datasets due to licensing restrictions.
Users should download the datasets from their original sources (e.g., GEO) and place them in:
data/raw/
Expected data format:
sample_id,age,cpg1,cpg2,cpg3,...
S1,45,0.78,0.43,0.91,...
S2,62,0.21,0.88,0.12,...
Where:
- rows = samples
- sample_id = unique sample identifier
- age = chronological age
- remaining columns = CpG methylation beta values (between 0 and 1)
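The expected layout can be checked before running the pipeline. The following is a minimal sketch, not code from this repository: the helper name `load_methylation_csv` is hypothetical, and it assumes every column other than `sample_id` and `age` holds beta values.

```python
import io

import pandas as pd


def load_methylation_csv(source):
    """Load a methylation table and check it matches the documented layout.

    Hypothetical helper for illustration; it is not part of the deepage package.
    """
    df = pd.read_csv(source)

    # The documented format requires these two metadata columns.
    required = {"sample_id", "age"}
    missing = required - set(df.columns)
    if missing:
        raise ValueError(f"missing required columns: {sorted(missing)}")

    # Assumption: every remaining column is a CpG beta value.
    cpg_cols = [c for c in df.columns if c not in required]
    betas = df[cpg_cols]

    # Beta values are proportions, so anything outside [0, 1] is suspect.
    # Missing values (NaN) pass this check and are left for preprocessing.
    if ((betas < 0) | (betas > 1)).any().any():
        raise ValueError("beta values must lie in [0, 1]")

    return df, cpg_cols


example = io.StringIO(
    "sample_id,age,cpg1,cpg2,cpg3\n"
    "S1,45,0.78,0.43,0.91\n"
    "S2,62,0.21,0.88,0.12\n"
)
table, cpgs = load_methylation_csv(example)
print(table.shape, cpgs)
```

The same function accepts a file path, for example a CSV placed under data/raw/.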
After downloading the datasets, run the preprocessing pipeline to generate the processed dataset.
The full pipeline consists of five stages.
Clean datasets, harmonize metadata, and prepare training matrices.
python scripts/prepare_data.py \
    --input_dir data/raw \
    --output data/processed/combined_dataset.csv
Outputs:
data/processed/combined_dataset.csv
Identify informative CpG sites using Pearson and Spearman correlations.
python scripts/select_features.py \
    --input data/processed/combined_dataset.csv \
    --output data/processed/selected_features.csv
Outputs:
data/processed/selected_features.csv
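As an illustration of the dual-correlation idea, the sketch below keeps a CpG site only when both its Pearson and its Spearman correlation with age exceed a threshold. It is a sketch under stated assumptions rather than the logic of `select_features.py`: the function name `select_cpgs`, the threshold value of 0.4, and the rule that both correlations must pass are all assumptions. Spearman is computed as Pearson on ranks, which avoids an extra SciPy dependency.

```python
import numpy as np
import pandas as pd


def select_cpgs(df, age_col="age", id_col="sample_id", min_abs_corr=0.4):
    """Keep CpG columns whose Pearson AND Spearman |r| with age pass a threshold.

    Hypothetical helper; the threshold and the AND rule are assumptions.
    """
    cpgs = df.drop(columns=[id_col, age_col])
    age = df[age_col]

    # Pearson: linear association between each CpG column and age.
    pearson = cpgs.corrwith(age)

    # Spearman: Pearson correlation of the (average) ranks.
    spearman = cpgs.rank().corrwith(age.rank())

    keep = (pearson.abs() >= min_abs_corr) & (spearman.abs() >= min_abs_corr)

    result = pd.DataFrame({"pearson": pearson, "spearman": spearman})[keep]
    result.index.name = "cpg"
    return result


# Synthetic demonstration: one age-associated site and one pure-noise site.
rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=200)
demo = pd.DataFrame(
    {
        "sample_id": [f"S{i}" for i in range(200)],
        "age": age,
        "cpg_signal": age / 100 + rng.normal(0, 0.02, size=200),
        "cpg_noise": rng.uniform(0, 1, size=200),
    }
)
print(select_cpgs(demo))
```

Requiring both correlations to pass favors sites whose relationship with age is monotonic and roughly linear; a looser rule (either correlation) would retain more nonlinear candidates.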
Train the deep learning model.
python scripts/train_model.py \
    --data data/processed/combined_dataset.csv \
    --features data/processed/selected_features.csv \
    --output results/models/deepage_model.pt
Outputs:
results/models/deepage_model.pt
Evaluate predictions on the test set.
python scripts/evaluate_model.py \
    --model results/models/deepage_model.pt \
    --data data/processed/combined_dataset.csv \
    --features data/processed/selected_features.csv \
    --output results/metrics/evaluation.json
Evaluation metrics include:
- MAE
- RMSE
- R²
- Median Absolute Error
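For reference, the four metrics listed above can be computed from true and predicted ages as in the sketch below. This uses the standard definitions; the function name `age_metrics` and the dictionary keys are illustrative and may not match the layout of `evaluation.json`.

```python
import numpy as np


def age_metrics(y_true, y_pred):
    """Compute MAE, RMSE, R² and median absolute error (ages in years)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true

    # R² = 1 - (residual sum of squares) / (total sum of squares).
    ss_res = np.sum(err ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)

    return {
        "mae": float(np.mean(np.abs(err))),
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "r2": float(1.0 - ss_res / ss_tot),
        "median_ae": float(np.median(np.abs(err))),
    }


print(age_metrics([30, 40, 50, 60], [32, 38, 53, 59]))
```

MAE and median absolute error are both in years; the median is less sensitive to a few samples with large errors, so reporting both gives a fuller picture of the error distribution.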
Generate plots of model predictions and embeddings.
python scripts/plot_results.py \
    --model results/models/deepage_model.pt \
    --data data/processed/combined_dataset.csv \
    --features data/processed/selected_features.csv \
    --output_dir results/figures
Generated figures include:
- Predicted vs chronological age
- Residual error distribution
- Feature heatmaps
- t-SNE / UMAP embeddings
To reproduce the full pipeline:
python scripts/prepare_data.py
python scripts/select_features.py
python scripts/train_model.py
python scripts/evaluate_model.py
python scripts/plot_results.py
After running the full pipeline, the repository will produce:
results/
│
├── models/
│   └── deepage_model.pt
│
├── metrics/
│   └── evaluation.json
│
└── figures/
    ├── predicted_vs_age.png
    ├── residuals.png
    ├── tsne_embedding.png
    └── heatmap.png
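The five stages can also be chained from a single Python entry point. The sketch below is a hypothetical wrapper, not a script shipped with the repository; it passes the same paths and flags shown in the per-stage commands and stops at the first stage that fails.

```python
import subprocess
import sys

# Paths taken from the per-stage commands in this README.
DATA = "data/processed/combined_dataset.csv"
FEATURES = "data/processed/selected_features.csv"
MODEL = "results/models/deepage_model.pt"


def pipeline_commands(python=sys.executable):
    """Return the five stage commands, in order, as argument lists."""
    shared = ["--model", MODEL, "--data", DATA, "--features", FEATURES]
    return [
        [python, "scripts/prepare_data.py",
         "--input_dir", "data/raw", "--output", DATA],
        [python, "scripts/select_features.py",
         "--input", DATA, "--output", FEATURES],
        [python, "scripts/train_model.py",
         "--data", DATA, "--features", FEATURES, "--output", MODEL],
        [python, "scripts/evaluate_model.py", *shared,
         "--output", "results/metrics/evaluation.json"],
        [python, "scripts/plot_results.py", *shared,
         "--output_dir", "results/figures"],
    ]


def run_pipeline(dry_run=False):
    """Print each command and, unless dry_run is set, execute it."""
    for cmd in pipeline_commands():
        print(" ".join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError if a stage exits non-zero.
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline(dry_run=True)` from the repository root prints the commands without executing them, which is a quick way to confirm the paths before a long training run.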
Main dependencies:
- Python 3.9+
- PyTorch
- NumPy
- Pandas
- Scikit-learn
- Matplotlib
- Seaborn
All dependencies are listed in:
requirements.txt
To ensure reproducibility:
- All random seeds are fixed
- Dataset splits are deterministic
- Model hyperparameters are documented
- All scripts support command-line arguments
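Fixed seeds and deterministic splits are commonly implemented along the following lines. This is a sketch, assuming a default seed of 42 and a 20% test fraction; neither value, nor the helper names `set_seed` and `deterministic_split`, is taken from the repository.

```python
import random

import numpy as np


def set_seed(seed=42):
    """Seed the Python, NumPy and (if installed) PyTorch random generators."""
    random.seed(seed)
    np.random.seed(seed)
    try:
        import torch  # optional here; only seeded when available

        torch.manual_seed(seed)
    except ImportError:
        pass


def deterministic_split(n_samples, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx); identical for identical arguments."""
    # A dedicated generator keeps the split independent of global RNG state.
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return np.sort(order[n_test:]), np.sort(order[:n_test])


set_seed(42)
train_idx, test_idx = deterministic_split(10)
print(train_idx, test_idx)
```

Using a separate generator for the split means that changing the number of random calls elsewhere in the code does not silently change which samples end up in the test set.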
If you use this repository in your research, please cite:
@inproceedings{dip2024deepage,
title={DeepAge: Harnessing Deep Neural Network for Epigenetic Age Estimation From DNA Methylation Data of Human Blood Samples},
author={Dip, Sajib Acharjee and Ma, Da and Zhang, Liqing},
booktitle={Proceedings of the AAAI Symposium Series},
volume={4},
number={1},
pages={267--274},
year={2024}
}
Contributions are welcome.
To contribute:
- Fork the repository
- Create a feature branch
- Submit a pull request
Please ensure all new code includes documentation and tests where applicable.
This project is released under the MIT License.
See LICENSE for details.
For questions or collaboration inquiries, please open an issue or contact:
Sajib Acharjee Dip
sajibacharjeedip@vt.edu