This repository contains a fully reproducible pipeline for analyzing DNA methylation data from GSE140686. The pipeline handles data downloading (IDAT files), preprocessing, normalization, and dimensionality reduction (UMAP/t-SNE).
The analysis is implemented in R and containerized with Docker to ensure reproducibility of results across different computing environments.
.
├── Dockerfile # Instructions to build the reproducible container
├── renv.lock # Exact package versions used in the analysis
├── README.md # This file
├── code/ # Analysis scripts
│ ├── run_pipeline.R # Master wrapper script
│ ├── download_GEO_IDAT.R # Data acquisition
│ ├── process_IDAT.R # Normalization & filtering
│ ├── prepare_metadata.R # Metadata cleaning
│ └── make_plots.R # Figure generation
├── data/ # (Auto-generated) Input and processed data
└── plots/ # (Auto-generated) Final figure PDFs
Using Docker is the most reliable way to reproduce the analysis, as it encapsulates the operating system, system libraries, and R environment.
Prerequisites
- Docker Desktop (or Docker Engine on Linux)
Getting the Docker Image
Pull the pre-built image from Docker Hub:
docker pull bemert/methylation_umap_pipeline:latest
If you want to build the Docker image yourself, see Building the Docker Image.
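Once you have the image (either built locally or pulled from a registry), use the following command to run the analysis. If you pulled the pre-built image rather than building it, use bemert/methylation_umap_pipeline:latest in place of methylation_umap_pipeline as the image name.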
docker run --rm \
  -v $(pwd)/data:/home/methylation_umap_example/data \
  -v $(pwd)/plots:/home/methylation_umap_example/plots \
  methylation_umap_pipeline Rscript code/run_pipeline.R
What this command does:
- --rm: Automatically removes the container after the script finishes.
- -v $(pwd)/data:...: Maps your local data folder to the container.
- methylation_umap_pipeline: The name of the Docker image.
- Rscript code/run_pipeline.R: The command to execute the pipeline.
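One practical wrinkle with the bind mounts above: on Linux, if the local data/ or plots/ folder does not exist yet, Docker typically creates it owned by root. The following is a minimal sketch, not part of the repository, that creates both folders first and then prints the same docker run command for review. The function names `prepare_dirs` and `build_run_cmd` and the `IMAGE` variable are hypothetical helpers introduced for illustration.

```shell
#!/usr/bin/env bash
# Sketch: prepare host folders, then assemble the docker run command from this README.
# IMAGE is an assumption of this sketch; it defaults to the locally built image name.
IMAGE="${IMAGE:-methylation_umap_pipeline}"
CONTAINER_HOME="/home/methylation_umap_example"

prepare_dirs() {
  # Create data/ and plots/ under the given project root so the bind mounts
  # point at folders owned by the current user.
  local root="$1"
  mkdir -p "$root/data" "$root/plots"
}

build_run_cmd() {
  # Print the docker run command for the given project root (does not execute it).
  local root="$1"
  printf 'docker run --rm -v %s/data:%s/data -v %s/plots:%s/plots %s Rscript code/run_pipeline.R\n' \
    "$root" "$CONTAINER_HOME" "$root" "$CONTAINER_HOME" "$IMAGE"
}
```

From the project root, `prepare_dirs "$(pwd)"` followed by `build_run_cmd "$(pwd)"` prints the command so it can be inspected before it is run. Setting `IMAGE=bemert/methylation_umap_pipeline:latest` beforehand targets the pulled image instead.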
If you prefer to run the analysis directly on your machine, you can use renv to restore the exact R package versions used in this project.
Prerequisites:
- R (version 4.3 or higher)
- Git
- Note for Linux users: You may need to install system libraries (e.g., libcurl4-openssl-dev, libxml2-dev) before installing R packages. See the Dockerfile for a reference list.
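The prerequisites above can be checked from a terminal before cloning. This is a rough sketch under stated assumptions: `version_at_least` and `check_prereqs` are hypothetical helper names, only the major and minor version numbers are compared, and the R version is read by calling `Rscript` with base R's `getRversion()`.

```shell
#!/usr/bin/env bash
# Sketch: compare a dotted version string against a minimum (major.minor only).
version_at_least() {
  # $1: version found (e.g. "4.3.2"), $2: minimum required (e.g. "4.3")
  local have_major have_minor need_major need_minor
  have_major="${1%%.*}"
  have_minor="${1#*.}"; have_minor="${have_minor%%.*}"
  need_major="${2%%.*}"
  need_minor="${2#*.}"; need_minor="${need_minor%%.*}"
  if [ "$have_major" -gt "$need_major" ]; then return 0; fi
  if [ "$have_major" -lt "$need_major" ]; then return 1; fi
  [ "$have_minor" -ge "$need_minor" ]
}

check_prereqs() {
  # Report whether git and a new-enough R are available on PATH.
  command -v git >/dev/null 2>&1 || { echo "git not found"; return 1; }
  command -v Rscript >/dev/null 2>&1 || { echo "Rscript not found"; return 1; }
  local r_version
  r_version="$(Rscript -e 'cat(as.character(getRversion()))')"
  version_at_least "$r_version" "4.3" || { echo "R $r_version is older than 4.3"; return 1; }
  echo "prerequisites look OK"
}
```

Calling `check_prereqs` prints the first missing prerequisite, or a confirmation if both are found. It does not check the Linux system libraries mentioned in the note above.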
- Clone the repository:
  git clone https://github.com/BenEmert/methylation_umap_example.git
  cd methylation_umap_example
- Restore the R environment: Open R in the project root directory and run:
  if (!require("renv")) install.packages("renv")
  renv::restore()
  This will automatically install all required packages specified in renv.lock.
- Run the analysis: You can execute the entire pipeline using the wrapper script from your terminal:
  Rscript code/run_pipeline.R
  Alternatively, you can source code/run_pipeline.R from within an RStudio session.
Upon successful execution, the pipeline will populate the following directories:
- data/raw/: Contains downloaded IDAT files and metadata.
- data/analyzed/: Contains intermediate RDS files (normalized beta values, M-values) and the clean metadata CSV.
- plots/: Contains the final visualizations, including:
  - minfi_QC_plot.png
  - dx_legend.pdf
  - umap_GPL13534_M_random_state_comparison.pdf
  - umap_GPL13534_M_n_neighbors_comparison.pdf
  - tsne_GPL13534_M_perplexity_comparison.pdf
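To confirm a run completed, the plot files named above can be checked with a short script. This is a sketch only: `check_plots` and `EXPECTED_PLOTS` are hypothetical names, and the list covers just the five files mentioned in this README, which may not be every file the pipeline writes.

```shell
#!/usr/bin/env bash
# Sketch: report which of the plot files listed in this README are missing.
EXPECTED_PLOTS="minfi_QC_plot.png
dx_legend.pdf
umap_GPL13534_M_random_state_comparison.pdf
umap_GPL13534_M_n_neighbors_comparison.pdf
tsne_GPL13534_M_perplexity_comparison.pdf"

check_plots() {
  # $1: path to the plots/ directory. Prints each missing file and
  # returns 0 only if every expected file is present.
  local dir="$1" missing=0 f
  for f in $EXPECTED_PLOTS; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $f"
      missing=1
    fi
  done
  return "$missing"
}
```

Running `check_plots plots` from the project root prints nothing and returns success when all five files exist.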
This repository includes a Shiny application to interactively explore the UMAP and t-SNE embeddings, allowing you to tune hyperparameters (e.g., neighbors, perplexity) and filter samples by diagnosis on your local computer.
Prerequisites
- The methylation analysis pipeline must be run first so that processed data exists in data/analyzed/.
- If running locally, ensure you have restored the R environment with renv::restore().
How to Run
You can launch the app directly from an R console at the project root:
shiny::runApp("shinyApp")
Alternatively, if using RStudio:
- Open shinyApp/app.R
- Click the Run App button in the source editor.
Features
- Dual View: Compare two different embedding settings (e.g., UMAP vs t-SNE) side-by-side.
- Dynamic Tuning: Adjust n_neighbors, min_dist, and perplexity on the fly.
- Filtering: Select specific diagnoses to highlight or subset.
- Download: Export the coordinate data for your custom views.
If you prefer not to install R or run the pipeline locally, you can explore the data using our Python-based dashboard on Google Colab. This notebook downloads the necessary data from the cloud and launches an interactive interface similar to the Shiny app.
Features:
- Zero Setup: Runs entirely in your browser.
- Interactive: Uses Panel and Plotly to replicate the UMAP/t-SNE tuning.
- Self-Contained: Automatically fetches processed data.
If you want to build the Docker image yourself (e.g., to verify the build process or modify the environment), follow these steps.
- Navigate to the project root:
cd /path/to/methylation_umap_example
- Build the image: Run the following command. This may take 20-40 minutes the first time as it compiles the R packages from source for Linux.
docker build -t methylation_umap_pipeline .
- Verify the build: You can verify the image was created by running:
docker images
You should see methylation_umap_pipeline in the list. You can now proceed to step 2 ("Running the Pipeline with Docker").
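The visual check of the image list can also be scripted. The sketch below assumes the repository names are piped in one per line, for example from `docker images --format '{{.Repository}}'`; the function name `image_listed` is a hypothetical helper, not something the repository provides.

```shell
#!/usr/bin/env bash
# Sketch: succeed only if the given repository name appears on stdin as a whole line.
image_listed() {
  # $1: repository name to look for; stdin: one repository name per line.
  # -F: fixed string, -x: match the entire line, -q: no output.
  grep -Fxq -- "$1"
}
```

For example, `docker images --format '{{.Repository}}' | image_listed methylation_umap_pipeline && echo "image found"` prints the confirmation only when the build produced the image.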