diff --git a/code/xqtl_protocol_demo.ipynb b/code/xqtl_protocol_demo.ipynb index 8b1bc7a6..3dbb2df6 100644 --- a/code/xqtl_protocol_demo.ipynb +++ b/code/xqtl_protocol_demo.ipynb @@ -2,140 +2,318 @@ "cells": [ { "cell_type": "markdown", - "id": "extensive-communication", - "metadata": { - "kernel": "SoS", - "tags": [] - }, - "source": [ - "# Illustration of xQTL protocol\n", - "\n", - "This notebook illustrates the computational protocols available from this repository for the detection and analysis of molecular QTLs (xQTLs). A minimal toy data-set consisting of 49 de-identified samples are used for the analysis." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "4a3d482b-982f-47b2-b3ef-fc6775a74e33", "metadata": {}, - "outputs": [], - "source": [] - }, - { - "cell_type": "markdown", - "id": "physical-postage", - "metadata": { - "kernel": "SoS", - "tags": [] - }, "source": [ - "## Analysis\n", + "# Getting Started\n", "\n", - "Please visit [the homepage of the protocol website](https://statfungen.github.io/xqtl-protocol/) for the general background on this resource, in particular the [How to use the resource](https://statfungen.github.io/xqtl-protocol/README.html#how-to-use-the-resource) section. To perform a complete analysis from molecular phenotype quantification to xQTL discovery, please conduct your analysis in the order listed below, each link contains a mini-protocol for a specific task. All commands documented in each mini-protocol should be executed in the command line environment.\n", + "**A reproducible pipeline for molecular QTL analysis \u2014 from raw genotypes and phenotypes through discovery, fine-mapping, and integration with GWAS.**\n", "\n", - "### Molecular Phenotype Quantification\n", + "This guide takes you from a clean machine to your first successful run in about an hour.\n", "\n", - "Molecular phenotypic data is required for the generation of QTLs. 
We support bulk RNA-Seq, methylation and splicing phenotypes in our pipeline. Multiple [reference data](https://statfungen.github.io/xqtl-protocol/code/reference_data/reference_data.html#) files are required before molecular phenotypes are quantified in samples. These include, but are not limited to, reference genomes, gene annotations, variant annotations, linkage disequilibirum data and topologically associated domains. [Quantification of gene expression](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/bulk_expression.html) is conducted with either RNA-SeQC for gene-level counts, or RSEM for transcript-level counts. [Quantification of alternative splicing events](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/splicing.html) is conducted with leafcutter2 to identify alternatively excised introns. [Quantification of DNA methylation](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/methylation.html) is done using SeSAMe. Each of these molecular phenotypes then undergo phenotype specific quality control and normalization.\n", + "::::{grid} 1 1 3 3\n", + ":gutter: 3\n", "\n", - "### Data Pre-Processing\n", + ":::{grid-item-card} \ud83d\udce6 Install\n", + ":link: #step-1-install-pixi\n", + "One installer for Python, R, JupyterLab, and bioinformatics tools via [pixi](https://pixi.sh/).\n", + ":::\n", "\n", - "[Preprocessing of genotype data](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/genotype_preprocessing.html) begins with the application of variant filters using bcftools. VCF files are then converted to plink format so that kinship analyses may be performed to identify unrelated individuals. Genetic principal components are then generated for unrelated samples and genotype files are formatted for later generation of quantitative trait loci. 
\n", + ":::{grid-item-card} \ud83e\uddec Run\n", + ":link: #step-5-run-your-first-workflow\n", + "Clone the repo, grab demo data, and launch a cis-QTL scan.\n", + ":::\n", "\n", - "[Preprocessing of phenotypic data](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/phenotype_preprocessing.html) begins with annotation of features, if required. Missing entries may then be imputed using a variety of methods included in the pipeline. Last, the phenotypes are formatted for later generation of quantitative trait loci. \n", + ":::{grid-item-card} \ud83d\ude80 Go Further\n", + ":link: #what-to-do-next\n", + "Fine-mapping, multivariate analysis, GWAS integration, HPC templates.\n", + ":::\n", "\n", - "[Preprocessing of covariates](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/covariate_preprocessing.html) begins with the merging of phenotypic data with previously generated genetic principal components. The merged data is then used to calculate hidden factors which will later be used as additional covariates. \n", + "::::\n", "\n", - "### QTL Association Analysis\n", "\n", - "[QTL association analysis](https://statfungen.github.io/xqtl-protocol/code/association_scan/qtl_association_testing.html) is conducted with TensorQTL. We include options for cis or trans analysis, with options to include interaction terms. [Hierarchical multiple testing](https://statfungen.github.io/xqtl-protocol/code/association_scan/qtl_association_postprocessing.html) may then be applied to the results to adjust p-values. \n", + "---\n", "\n", - "### Integrative Analysis\n", + "## Before You Start\n", "\n", - "We include methods to conduct [TWAS](https://statfungen.github.io/xqtl-protocol/code/pecotmr_integration/twas_ctwas.html) in our pipeline to identify genes associated with complex traits. \n", + "You'll need a Linux or macOS shell. 
Windows users: install [WSL2](https://learn.microsoft.com/windows/wsl/install) first.\n", "\n", - "Our pipeline includes multiple methods for fine-mapping of QTLs. [Univariate fine-mapping and TWAS with SuSiE](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/univariate_fine_mapping_twas_vignette.html) generates TWAS weights and credible sets using SuSiE. [Regression with summary statistics](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/summary_stats_finemapping_vignette.html) allows for the inclusion of summary statistics from GWAS in SuSiE finemapping. [Univariate fine-mapping of functional data](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/univariate_fine_mapping_fsusie_vignette.html) uses epigenomic data to fine-map with fSuSiE. \n", + "| Requirement | Minimum | Recommended |\n", + "|---|---|---|\n", + "| Disk space | 10 GB (minimal install) | 40 GB (full bioinformatics stack) |\n", + "| Memory | 16 GB | 50 GB+ on HPC for the installer |\n", + "| Network | GitHub, conda-forge, synapse.org | Same |\n", + "| Git | Any recent version | 2.30+ |\n", "\n", - "We also include method for [colocalization analysis](https://statfungen.github.io/xqtl-protocol/code/pecotmr_integration/SuSiE_enloc.html). This starts with the generation of prior probabilities followed by pairwise colocalization analysis of xQTL and GWAS fine-mapping results to identifies shared causal variants. We also include an alternative method, [colocboost](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_methods/colocboost.html), to identify shared genetic variants influencing multiple molecular traits. \n", + ":::{admonition} On HPC? Start on a compute node.\n", + ":class: tip\n", + "The installer is memory-hungry and login nodes will kill it. 
Grab an interactive session first:\n", "\n", - "We utilize an [excess of overlap](https://statfungen.github.io/xqtl-protocol/code/enrichment/eoo_enrichment.html) method to evaluate the enrichment of significant variants within specific genomic annotations. [Pathway enrichment analysis](https://statfungen.github.io/xqtl-protocol/code/enrichment/gsea.html) identifies biological pathways that are statistically overrepresented in a given gene set, giving information on potential biological functions, disease relevance, or regulatory mechanisms associated with the gene set. [Stratified LD Score Regression](https://statfungen.github.io/xqtl-protocol/code/enrichment/sldsc_enrichment.html) (S-LDSC) is used to quantify the contribution of different genomic functional annotations to the heritability of complex traits and assess their statistical significance. By integrating GWAS summary statistics with genome annotations, S-LDSC distinguishes true polygenic signals from confounding effects.\n", + "```bash\n", + "srun --mem=50G --pty bash # SLURM\n", + "bsub -Is -M 50000 -n 4 bash # LSF\n", + "```\n", + ":::\n", "\n", "\n", - "\n" - ] - }, - { - "cell_type": "markdown", - "id": "dietary-vector", - "metadata": { - "kernel": "SoS", - "tags": [] - }, - "source": [ - "## Data\n", + "---\n", + "\n", + "## Step 1 \u2014 Install pixi\n", + "\n", + "We manage every dependency \u2014 Python, R, JupyterLab, bioinformatics tools \u2014 with [pixi](https://pixi.sh/). One installer sets it all up.\n", + "\n", + "```bash\n", + "curl -fsSL https://raw.githubusercontent.com/StatFunGen/pixi-setup/refs/heads/main/pixi-setup.sh -o pixi-setup.sh\n", + "bash pixi-setup.sh\n", + "```\n", + "\n", + "The installer will prompt you for two choices:\n", + "\n", + ":::{dropdown} 1. 
Installation path\n", + ":open:\n", + "Where pixi stores environments and the package cache.\n", + "\n", + "| Setting | When to use |\n", + "|---|---|\n", + "| `$HOME/.pixi` (default) | Laptops and workstations with plenty of home-directory space |\n", + "| `/lab/$USER/.pixi` or scratch | HPC systems with strict home-directory quotas |\n", + "\n", + ":::\n", + "\n", + ":::{dropdown} 2. Installation type\n", + ":open:\n", + "Pick based on what you plan to do.\n", + "\n", + "| Type | Size | Files | Includes |\n", + "|---|---|---|---|\n", + "| **1. minimal** | ~5 GB | ~100k | CLI tools, Python data-science stack, JupyterLab, base R |\n", + "| **2. full** | ~35 GB | ~350k | Everything above, **plus** samtools, bcftools, plink2, GATK4, STAR, Seurat, Bioconductor |\n", + "\n", + "Choose **minimal** for xQTL runs with pre-processed inputs; choose **full** if you'll also do upstream QC, alignment, or single-cell work.\n", + ":::\n", + "\n", + "**Activate and verify:**\n", + "\n", + "```bash\n", + "source ~/.bashrc # or ~/.zshrc on macOS\n", + "pixi --version\n", + "```\n", + "\n", + "You should see a version number. If not, open a fresh terminal.\n", + "\n", + "\n", + "---\n", + "\n", + "## Step 2 \u2014 Add SoS\n", + "\n", + "The protocol's pipelines are written as [SoS (Script of Scripts)](https://vatlab.github.io/sos-docs/) workflows. 
Install the SoS suite into pixi's Python environment:\n", + "\n", + "```bash\n", + "pixi global install --environment python -c conda-forge \\\n", + " sos sos-pbs sos-notebook jupyterlab-sos \\\n", + " sos-bash sos-python sos-r\n", + "\n", + "pixi run -e python python -m sos_notebook.install\n", + "```\n", + "\n", + "**Verify:**\n", + "\n", + "```bash\n", + "sos --version\n", + "jupyter kernelspec list # should include 'sos'\n", + "```\n", + "\n", + "\n", + "---\n", + "\n", + "## Step 3 \u2014 Clone the Protocol\n", "\n", - "For record keeping: preparation of the demo dataset is documented [on this page](https://github.com/cumc/fungen-xqtl-analysis/tree/main/analysis/Wang_Columbia/ROSMAP/MWE) --- this is a private repository accessible to FunGen-xQTL analysis working group members.\n", + "```bash\n", + "git clone https://github.com/StatFunGen/xqtl-protocol.git\n", + "cd xqtl-protocol\n", + "```\n", + "\n", + ":::{admonition} What's in the repo?\n", + ":class: note\n", + "\n", + "| Folder | Contents |\n", + "|---|---|\n", + "| `pipeline/` | The SoS workflows you'll run |\n", + "| `code/` | Notebook documentation (this page lives here) |\n", + "| `data/` | Small example inputs and configuration templates |\n", + "| `website/` | JupyterBook sources for the docs site |\n", + ":::\n", + "\n", + "\n", + "---\n", + "\n", + "## Step 4 \u2014 Download the Demo Data\n", + "\n", + "The demo dataset lives on [Synapse](https://www.synapse.org/#!Synapse:syn36416559). Create a free account first, then:\n", + "\n", + "```bash\n", + "pixi global install -c conda-forge --environment python synapseclient\n", + "synapse login -p\n", + "synapse get -r syn36416559 --downloadLocation data/example/\n", + "```\n", + "\n", + "\n", + "---\n", + "\n", + "## Step 5 \u2014 Run Your First Workflow\n", + "\n", + "Confirm SoS can see the pipelines:\n", + "\n", + "```bash\n", + "sos run pipeline/1_xqtl_association.ipynb -h\n", + "```\n", + "\n", + "You should see a list of workflow options. 
Now run a minimal cis-QTL scan:\n", + "\n", + "```bash\n", + "sos run pipeline/TensorQTL.ipynb cis \\\n", + " --genotype-file data/example/genotype.bed \\\n", + " --phenotype-file data/example/phenotype.bed.gz \\\n", + " --covariate-file data/example/covariates.tsv \\\n", + " --cwd output/demo_tensorqtl\n", + "```\n", + "\n", + "Results land in `output/demo_tensorqtl/`.\n", + "\n", + ":::{tip}\n", + "Every pipeline supports `-h` and prints the shell commands it runs under the hood \u2014 a great way to learn what's happening and debug failures.\n", + ":::\n", + "\n", + "\n", + "---\n", + "\n", + "## What to Do Next\n", "\n", - "For protocols listed in this page, downloaded required input data in [Synapse](https://www.synapse.org/#!Synapse:syn36416601). \n", - "* To be able downloading the data, first create user account on [Synapse Login](https://www.synapse.org/). Username and password will be required when downloading\n", - "* Downloading required installing of Synapse API Clients, type `pip install synapseclient` in terminal or Command Prompt to install the Python package. 
Details list [on this page](https://help.synapse.org/docs/Installing-Synapse-API-Clients.1985249668.html).\n", - "* Each folder in different level has unique Synapse ID, which allowing you to download only some folders or files within the entire folder.\n", + "::::{grid} 1 2 2 2\n", + ":gutter: 3\n", "\n", - "To download the test data for section \"Bulk RNA-seq molecular phenotype quantification\", please use the following Python codes,\n", + ":::{grid-item-card} \ud83d\udd2c Preprocess your data\n", + "`1_phenotype_preprocessing.ipynb`\n", + "`2_genotype_preprocessing.ipynb`\n", + "`4_covariates_preprocessing.ipynb`\n", + ":::\n", "\n", + ":::{grid-item-card} \ud83e\udded Discover QTLs\n", + "`TensorQTL.ipynb`\n", + "`1_xqtl_association.ipynb`\n", + "`APEX.ipynb`\n", + ":::\n", + "\n", + ":::{grid-item-card} \ud83c\udfaf Fine-map\n", + "`SuSiE.ipynb`\n", + "`mvSuSiE.ipynb`\n", + "`fSuSiE.ipynb`\n", + ":::\n", + "\n", + ":::{grid-item-card} \ud83d\udd17 Integrate with GWAS\n", + "`coloc.ipynb`\n", + "`cTWAS.ipynb`\n", + "`GWAS_integration.ipynb`\n", + ":::\n", + "\n", + "::::\n", + "\n", + "Full documentation: [statfungen.github.io/xqtl-protocol](https://statfungen.github.io/xqtl-protocol/).\n", + "\n", + "\n", + "---\n", + "\n", + "## Troubleshooting\n", + "\n", + ":::{dropdown} `pixi: command not found` after install\n", + "Open a new terminal, or re-source your shell rc file:\n", + "```bash\n", + "source ~/.bashrc # Linux / HPC\n", + "source ~/.zshrc # macOS\n", "```\n", - "import synapseclient \n", - "import synapseutils \n", - "syn = synapseclient.Synapse()\n", - "syn.login(\"your username on synapse.org\",\"your password on synapse.org\")\n", - "files = synapseutils.syncFromSynapse(syn, 'syn53174239', path=\"./\")\n", + ":::\n", + "\n", + ":::{dropdown} Installer killed on HPC\n", + "You're running on a login node. 
Request a compute node with at least 50 GB of memory and re-run the installer:\n", "```bash\n", "srun --mem=50G --pty bash\n", "bash pixi-setup.sh\n", "```\n", ":::\n", "\n", ":::{dropdown} `sos: command not found`\n", "Step 2 didn't complete. Re-run the `pixi global install` command and make sure `jupyter kernelspec list` shows the `sos` kernel.\n", ":::\n", "\n", ":::{dropdown} `ModuleNotFoundError` during a pipeline\n", "Install the missing package into pixi's Python environment:\n", "```bash\n", "pixi global install -c conda-forge --environment python <package-name>\n", "```\n", ":::\n", "\n", ":::{dropdown} R package conflicts or install failures\n", "Prefer conda-forge R packages over `install.packages()`:\n", "```bash\n", "pixi global install --environment r-base r-<package-name>\n", "```\n", "Mixing CRAN builds with conda R leads to ABI mismatches \u2014 avoid it.\n", ":::\n", "\n", ":::{dropdown} Still stuck?\n", "[Open an issue](https://github.com/StatFunGen/xqtl-protocol/issues) with the command you ran and the full error output.\n", ":::\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] }, { "cell_type": "markdown", - "id": "complete-extent", "metadata": {}, "source": [ - "## Software environment: use Singularity containers\n", + "## Analysis Overview\n", "\n", - "Analysis documented on this website are best performed using containers we provide either through `singularity` (recommended) or `docker`, via the `--container` option pointing to a container image file. 
For example, `--container oras://ghcr.io/statfungen/tensorqtl_apptainer:latest` uses a singularity image to perform analysis for QTL association mapping via software `TensorQTL`. If you drop the `--container` option then you will rely on software installed on your computer to perform the analysis. \n", + "The protocol is modular. Each numbered pipeline is a self-contained SoS notebook that can run independently or chained together.\n", "\n", - "#### Troubleshooting\n", + "::::{grid} 1 2 2 2\n", + ":gutter: 3\n", "\n", - "If you run into errors relating to R libraries while including the `--container` option then you may need to unload your R packages locally before running the sos commands. For example, this error:\n", + ":::{grid-item-card} 1. Preprocess\n", + "`1_phenotype_preprocessing.ipynb` \u2014 QC, normalization\n", + "`2_genotype_preprocessing.ipynb` \u2014 variant QC, imputation\n", + "`4_covariates_preprocessing.ipynb` \u2014 PEER / hidden factors\n", + ":::\n", "\n", - "```\n", - "Error in dyn.load(file, DLLpath = DLLPath, ...):\n", - "unable to load shared object '$PATH/R/x86_64-pc-linux-gnu-library/4.2/stringi/libs/stringi.so':\n", - "libicui18n.so.63: cannot open shared object file: No such file or directory\n", - "```\n", + ":::{grid-item-card} 2. Discover\n", + "`TensorQTL.ipynb` \u2014 cis/trans scans\n", + "`APEX.ipynb` \u2014 interaction QTLs\n", + "`1_xqtl_association.ipynb` \u2014 end-to-end wrapper\n", + ":::\n", "\n", - "May be fixed by running this before the sos commands are run:\n", + ":::{grid-item-card} 3. Fine-map\n", + "`SuSiE.ipynb` \u2014 single-context credible sets\n", + "`mvSuSiE.ipynb` \u2014 multi-context\n", + "`fSuSiE.ipynb` \u2014 functional annotations\n", + ":::\n", "\n", - "```\n", - "export R_LIBS=\"\"\n", - "export R_LIBS_USER=\"\"\n", - "```\n", + ":::{grid-item-card} 4. 
Integrate\n", + "`coloc.ipynb` \u2014 colocalization with GWAS\n", + "`cTWAS.ipynb` \u2014 causal TWAS\n", + "`GWAS_integration.ipynb` \u2014 joint reporting\n", + ":::\n", "\n", - "## Analyses on High Performance Computing clusters\n", + "::::\n", "\n", - "The protocol example shown above performs analysis on a desktop workstation, as a demonstration. Typically the analyses should be performed on HPC cluster environments. This can be achieved via [SoS Remote Tasks](https://vatlab.github.io/sos-docs/doc/user_guide/task_statement.html) on [configured host computers](https://vatlab.github.io/sos-docs/doc/user_guide/host_setup.html). We provide this [toy example for running SoS pipeline on a typical HPC cluster environment](https://github.com/statfungen/xqtl-protocol/blob/main/code/misc/Job_Example.ipynb). First time users are encouraged to try it out in order to help setting up the computational environment necessary to run the analysis in this protocol." + "All pipelines share a common config layout, so once you know one you can read the rest.\n" ] }, { "cell_type": "code", "execution_count": null, - "id": "17f0cb95-9663-4c3f-a911-caa5f2d130a4", "metadata": {}, "outputs": [], "source": [] @@ -143,15 +321,17 @@ ], "metadata": { "kernelspec": { - "display_name": "Bash", - "language": "bash", - "name": "bash" + "display_name": "SoS", + "language": "sos", + "name": "sos" }, "language_info": { - "codemirror_mode": "shell", - "file_extension": ".sh", - "mimetype": "text/x-sh", - "name": "bash" + "codemirror_mode": "sos", + "file_extension": ".sos", + "mimetype": "text/x-sos", + "name": "sos", + "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", + "pygments_lexer": "sos" }, "sos": { "kernels": [ @@ -162,7 +342,7 @@ "" ] ], - "version": "0.22.6" + "version": "0.24.4" } }, "nbformat": 4,