From 682c698267a28c6de5a78d02d3eab1c48d896c78 Mon Sep 17 00:00:00 2001 From: Jenny Date: Wed, 15 Apr 2026 19:22:21 -0400 Subject: [PATCH 1/5] Rewrite getting-started: minimalist website-style layout, pixi + SoS setup --- code/xqtl_protocol_demo.ipynb | 135 ++++++++-------------------------- 1 file changed, 29 insertions(+), 106 deletions(-) diff --git a/code/xqtl_protocol_demo.ipynb b/code/xqtl_protocol_demo.ipynb index 3dbb2df6..db810a38 100644 --- a/code/xqtl_protocol_demo.ipynb +++ b/code/xqtl_protocol_demo.ipynb @@ -10,27 +10,6 @@ "\n", "This guide takes you from a clean machine to your first successful run in about an hour.\n", "\n", - "::::{grid} 1 1 3 3\n", - ":gutter: 3\n", - "\n", - ":::{grid-item-card} \ud83d\udce6 Install\n", - ":link: #step-1-install-pixi\n", - "One installer for Python, R, JupyterLab, and bioinformatics tools via [pixi](https://pixi.sh/).\n", - ":::\n", - "\n", - ":::{grid-item-card} \ud83e\uddec Run\n", - ":link: #step-5-run-your-first-workflow\n", - "Clone the repo, grab demo data, and launch a cis-QTL scan.\n", - ":::\n", - "\n", - ":::{grid-item-card} \ud83d\ude80 Go Further\n", - ":link: #what-to-do-next\n", - "Fine-mapping, multivariate analysis, GWAS integration, HPC templates.\n", - ":::\n", - "\n", - "::::\n", - "\n", - "\n", "---\n", "\n", "## Before You Start\n", @@ -44,9 +23,8 @@ "| Network | GitHub, conda-forge, synapse.org | Same |\n", "| Git | Any recent version | 2.30+ |\n", "\n", - ":::{admonition} On HPC? Start on a compute node.\n", - ":class: tip\n", - "The installer is memory-hungry and login nodes will kill it. Grab an interactive session first:\n", + ":::{tip}\n", + "**On HPC, start on a compute node.** The installer is memory-hungry and login nodes will kill it. Grab an interactive session first:\n", "\n", "```bash\n", "srun --mem=50G --pty bash # SLURM\n", @@ -54,10 +32,9 @@ "```\n", ":::\n", "\n", - "\n", "---\n", "\n", - "## Step 1 \u2014 Install pixi\n", + "## Step 1. 
Install pixi\n", "\n", "We manage every dependency \u2014 Python, R, JupyterLab, bioinformatics tools \u2014 with [pixi](https://pixi.sh/). One installer sets it all up.\n", "\n", @@ -66,22 +43,16 @@ "bash pixi-setup.sh\n", "```\n", "\n", - "The installer will prompt you for two choices:\n", + "The installer will prompt you for two things:\n", "\n", - ":::{dropdown} 1. Installation path\n", - ":open:\n", - "Where pixi stores environments and the package cache.\n", + "**1. Installation path** \u2014 where pixi stores environments and the package cache.\n", "\n", "| Setting | When to use |\n", "|---|---|\n", "| `$HOME/.pixi` (default) | Laptops and workstations with plenty of home-directory space |\n", "| `/lab/$USER/.pixi` or scratch | HPC systems with strict home-directory quotas |\n", "\n", - ":::\n", - "\n", - ":::{dropdown} 2. Installation type\n", - ":open:\n", - "Pick based on what you plan to do.\n", + "**2. Installation type** \u2014 pick based on what you plan to do.\n", "\n", "| Type | Size | Files | Includes |\n", "|---|---|---|---|\n", @@ -89,7 +60,6 @@ "| **2. full** | ~35 GB | ~350k | Everything above, **plus** samtools, bcftools, plink2, GATK4, STAR, Seurat, Bioconductor |\n", "\n", "Choose **minimal** for xQTL runs with pre-processed inputs; choose **full** if you'll also do upstream QC, alignment, or single-cell work.\n", - ":::\n", "\n", "**Activate and verify:**\n", "\n", @@ -100,10 +70,9 @@ "\n", "You should see a version number. If not, open a fresh terminal.\n", "\n", - "\n", "---\n", "\n", - "## Step 2 \u2014 Add SoS\n", + "## Step 2. Add SoS\n", "\n", "The protocol's pipelines are written as [SoS (Script of Scripts)](https://vatlab.github.io/sos-docs/) workflows. Install the SoS suite into pixi's Python environment:\n", "\n", @@ -122,18 +91,17 @@ "jupyter kernelspec list # should include 'sos'\n", "```\n", "\n", - "\n", "---\n", "\n", - "## Step 3 \u2014 Clone the Protocol\n", + "## Step 3. 
Clone the Protocol\n", "\n", "```bash\n", "git clone https://github.com/StatFunGen/xqtl-protocol.git\n", "cd xqtl-protocol\n", "```\n", "\n", - ":::{admonition} What's in the repo?\n", - ":class: note\n", + ":::{note}\n", + "**What's in the repo**\n", "\n", "| Folder | Contents |\n", "|---|---|\n", @@ -143,10 +111,9 @@ "| `website/` | JupyterBook sources for the docs site |\n", ":::\n", "\n", - "\n", "---\n", "\n", - "## Step 4 \u2014 Download the Demo Data\n", + "## Step 4. Download the Demo Data\n", "\n", "The demo dataset lives on [Synapse](https://www.synapse.org/#!Synapse:syn36416559). Create a free account first, then:\n", "\n", @@ -156,10 +123,9 @@ "synapse get -r syn36416559 --downloadLocation data/example/\n", "```\n", "\n", - "\n", "---\n", "\n", - "## Step 5 \u2014 Run Your First Workflow\n", + "## Step 5. Run Your First Workflow\n", "\n", "Confirm SoS can see the pipelines:\n", "\n", @@ -180,46 +146,25 @@ "Results land in `output/demo_tensorqtl/`.\n", "\n", ":::{tip}\n", - "Every pipeline supports `-h` and prints the shell commands it runs under the hood \u2014 a great way to learn what's happening and debug failures.\n", + "Every pipeline supports `-h` and prints the shell commands it runs under the hood \u2014 a great way to learn what's happening and to debug failures.\n", ":::\n", "\n", - "\n", "---\n", "\n", "## What to Do Next\n", "\n", - "::::{grid} 1 2 2 2\n", - ":gutter: 3\n", + "Pick a path based on what you're analyzing:\n", "\n", - ":::{grid-item-card} \ud83d\udd2c Preprocess your data\n", - "`1_phenotype_preprocessing.ipynb`\n", - "`2_genotype_preprocessing.ipynb`\n", - "`4_covariates_preprocessing.ipynb`\n", - ":::\n", - "\n", - ":::{grid-item-card} \ud83e\udded Discover QTLs\n", - "`TensorQTL.ipynb`\n", - "`1_xqtl_association.ipynb`\n", - "`APEX.ipynb`\n", - ":::\n", - "\n", - ":::{grid-item-card} \ud83c\udfaf Fine-map\n", - "`SuSiE.ipynb`\n", - "`mvSuSiE.ipynb`\n", - "`fSuSiE.ipynb`\n", - ":::\n", - "\n", - ":::{grid-item-card} 
\ud83d\udd17 Integrate with GWAS\n", - "`coloc.ipynb`\n", - "`cTWAS.ipynb`\n", - "`GWAS_integration.ipynb`\n", - ":::\n", - "\n", - "::::\n", + "| Goal | Pipelines |\n", + "|---|---|\n", + "| Preprocess your data | `1_phenotype_preprocessing.ipynb`, `2_genotype_preprocessing.ipynb`, `4_covariates_preprocessing.ipynb` |\n", + "| Discover QTLs | `TensorQTL.ipynb`, `1_xqtl_association.ipynb`, `APEX.ipynb` |\n", + "| Fine-map | `SuSiE.ipynb`, `mvSuSiE.ipynb`, `fSuSiE.ipynb` |\n", + "| Integrate with GWAS | `coloc.ipynb`, `cTWAS.ipynb`, `GWAS_integration.ipynb` |\n", + "| Run on HPC | `Job_Example.ipynb` (SLURM / LSF / SGE / PBS template) |\n", "\n", "Full documentation: [statfungen.github.io/xqtl-protocol](https://statfungen.github.io/xqtl-protocol/).\n", "\n", - "\n", "---\n", "\n", "## Troubleshooting\n", @@ -241,7 +186,7 @@ ":::\n", "\n", ":::{dropdown} `sos: command not found`\n", - "Step 2 didn't complete. Re-run the `pixi global install` command and make sure `jupyter kernelspec list` shows the `sos` kernel.\n", + "Step 2 didn't complete. Re-run the `pixi global install` command and confirm with `jupyter kernelspec list` that the `sos` kernel is registered.\n", ":::\n", "\n", ":::{dropdown} `ModuleNotFoundError` during a pipeline\n", @@ -277,36 +222,14 @@ "source": [ "## Analysis Overview\n", "\n", - "The protocol is modular. Each numbered pipeline is a self-contained SoS notebook that can run independently or chained together.\n", - "\n", - "::::{grid} 1 2 2 2\n", - ":gutter: 3\n", + "The protocol is modular. Each pipeline is a self-contained SoS notebook that can run independently or chained together.\n", "\n", - ":::{grid-item-card} 1. Preprocess\n", - "`1_phenotype_preprocessing.ipynb` \u2014 QC, normalization\n", - "`2_genotype_preprocessing.ipynb` \u2014 variant QC, imputation\n", - "`4_covariates_preprocessing.ipynb` \u2014 PEER / hidden factors\n", - ":::\n", - "\n", - ":::{grid-item-card} 2. 
Discover\n", - "`TensorQTL.ipynb` \u2014 cis/trans scans\n", - "`APEX.ipynb` \u2014 interaction QTLs\n", - "`1_xqtl_association.ipynb` \u2014 end-to-end wrapper\n", - ":::\n", - "\n", - ":::{grid-item-card} 3. Fine-map\n", - "`SuSiE.ipynb` \u2014 single-context credible sets\n", - "`mvSuSiE.ipynb` \u2014 multi-context\n", - "`fSuSiE.ipynb` \u2014 functional annotations\n", - ":::\n", - "\n", - ":::{grid-item-card} 4. Integrate\n", - "`coloc.ipynb` \u2014 colocalization with GWAS\n", - "`cTWAS.ipynb` \u2014 causal TWAS\n", - "`GWAS_integration.ipynb` \u2014 joint reporting\n", - ":::\n", - "\n", - "::::\n", + "| Stage | Pipelines | What happens |\n", + "|---|---|---|\n", + "| **1. Preprocess** | `1_phenotype_preprocessing.ipynb`, `2_genotype_preprocessing.ipynb`, `4_covariates_preprocessing.ipynb` | QC, normalization, imputation, PEER / hidden covariates |\n", + "| **2. Discover** | `TensorQTL.ipynb`, `APEX.ipynb`, `1_xqtl_association.ipynb` | Cis/trans scans, interaction QTLs, end-to-end wrapper |\n", + "| **3. Fine-map** | `SuSiE.ipynb`, `mvSuSiE.ipynb`, `fSuSiE.ipynb` | Credible sets, multi-context, functional annotations |\n", + "| **4. 
Integrate** | `coloc.ipynb`, `cTWAS.ipynb`, `GWAS_integration.ipynb` | Colocalization, causal TWAS, joint reporting with GWAS |\n", "\n", "All pipelines share a common config layout, so once you know one you can read the rest.\n" ] From 4e7d89fb3004a12faa5b567d3f389566a9b737e4 Mon Sep 17 00:00:00 2001 From: Jenny Date: Wed, 15 Apr 2026 19:34:46 -0400 Subject: [PATCH 2/5] Rewrite getting-started: minimalist website-style layout, pixi + SoS setup --- code/xqtl_protocol_demo.ipynb | 204 +++++++++++++++++++--------------- 1 file changed, 115 insertions(+), 89 deletions(-) diff --git a/code/xqtl_protocol_demo.ipynb b/code/xqtl_protocol_demo.ipynb index db810a38..25ca206e 100644 --- a/code/xqtl_protocol_demo.ipynb +++ b/code/xqtl_protocol_demo.ipynb @@ -6,9 +6,11 @@ "source": [ "# Getting Started\n", "\n", - "**A reproducible pipeline for molecular QTL analysis \u2014 from raw genotypes and phenotypes through discovery, fine-mapping, and integration with GWAS.**\n", + "This notebook is the on-ramp to the xQTL protocol \u2014 a reproducible pipeline for molecular QTL analysis from raw genotypes and phenotypes through discovery, fine-mapping, and integration with GWAS.\n", "\n", - "This guide takes you from a clean machine to your first successful run in about an hour.\n", + "A minimal toy dataset of **49 de-identified samples** is used throughout the examples on this site so you can try every pipeline end-to-end before running on real data.\n", + "\n", + "This page walks you from a clean machine to your first successful run in about an hour. If you already have pixi and SoS installed, jump to [Analysis](#analysis).\n", "\n", "---\n", "\n", @@ -68,8 +70,6 @@ "pixi --version\n", "```\n", "\n", - "You should see a version number. If not, open a fresh terminal.\n", - "\n", "---\n", "\n", "## Step 2. 
Add SoS\n", @@ -109,129 +109,155 @@ "| `code/` | Notebook documentation (this page lives here) |\n", "| `data/` | Small example inputs and configuration templates |\n", "| `website/` | JupyterBook sources for the docs site |\n", - ":::\n", + ":::\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Analysis\n", "\n", - "---\n", + "Please visit [the homepage of the protocol website](https://statfungen.github.io/xqtl-protocol/) for general background on this resource, in particular the [How to use the resource](https://statfungen.github.io/xqtl-protocol/README.html#how-to-use-the-resource) section. To perform a complete analysis from molecular phenotype quantification to xQTL discovery, conduct your analysis in the order listed below. Each link contains a mini-protocol for a specific task, and all commands documented in each mini-protocol should be executed from the command line.\n", "\n", - "## Step 4. Download the Demo Data\n", + "### Molecular Phenotype Quantification\n", "\n", - "The demo dataset lives on [Synapse](https://www.synapse.org/#!Synapse:syn36416559). Create a free account first, then:\n", + "Molecular phenotype data is required for the generation of QTLs. We support bulk RNA-seq, methylation, and splicing phenotypes. 
Multiple [reference data](https://statfungen.github.io/xqtl-protocol/code/reference_data/reference_data.html#) files are required before molecular phenotypes are quantified \u2014 reference genomes, gene annotations, variant annotations, linkage disequilibrium data, and topologically associated domains.\n", "\n", - "```bash\n", - "pixi global install -c conda-forge --environment python synapseclient\n", - "synapse login -p\n", - "synapse get -r syn36416559 --downloadLocation data/example/\n", - "```\n", + "- [Quantification of gene expression](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/bulk_expression.html) \u2014 RNA-SeQC for gene-level counts, or RSEM for transcript-level counts\n", + "- [Quantification of alternative splicing](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/splicing.html) \u2014 leafcutter2 to identify alternatively excised introns\n", + "- [Quantification of DNA methylation](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/methylation.html) \u2014 SeSAMe\n", "\n", - "---\n", + "Each phenotype then undergoes phenotype-specific quality control and normalization.\n", "\n", - "## Step 5. 
Run Your First Workflow\n", + "### Data Pre-Processing\n", "\n", - "Confirm SoS can see the pipelines:\n", + "- [Genotype preprocessing](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/genotype_preprocessing.html) \u2014 variant filters with bcftools, conversion to plink format, kinship and PCA on unrelated individuals\n", + "- [Phenotype preprocessing](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/phenotype_preprocessing.html) \u2014 feature annotation, imputation of missing entries, formatting for QTL analysis\n", + "- [Covariate preprocessing](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/covariate_preprocessing.html) \u2014 merge phenotypic data with genetic PCs, then compute hidden factors to use as additional covariates\n", "\n", - "```bash\n", - "sos run pipeline/1_xqtl_association.ipynb -h\n", - "```\n", + "### QTL Association Analysis\n", "\n", - "You should see a list of workflow options. Now run a minimal cis-QTL scan:\n", + "- [QTL association analysis](https://statfungen.github.io/xqtl-protocol/code/association_scan/qtl_association_testing.html) \u2014 TensorQTL with options for cis, trans, and interaction terms\n", + "- [Hierarchical multiple testing](https://statfungen.github.io/xqtl-protocol/code/association_scan/qtl_association_postprocessing.html) \u2014 adjust p-values across levels\n", "\n", - "```bash\n", - "sos run pipeline/TensorQTL.ipynb cis \\\n", - " --genotype-file data/example/genotype.bed \\\n", - " --phenotype-file data/example/phenotype.bed.gz \\\n", - " --covariate-file data/example/covariates.tsv \\\n", - " --cwd output/demo_tensorqtl\n", - "```\n", + "### Integrative Analysis\n", "\n", - "Results land in `output/demo_tensorqtl/`.\n", + "- [TWAS](https://statfungen.github.io/xqtl-protocol/code/pecotmr_integration/twas_ctwas.html) \u2014 identify genes associated with complex traits\n", + "- [Univariate fine-mapping and TWAS with 
SuSiE](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/univariate_fine_mapping_twas_vignette.html) \u2014 TWAS weights and credible sets\n", + "- [Regression with summary statistics](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/summary_stats_finemapping_vignette.html) \u2014 include GWAS summary stats in SuSiE fine-mapping\n", + "- [Univariate fine-mapping of functional data](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/univariate_fine_mapping_fsusie_vignette.html) \u2014 fSuSiE with epigenomic annotations\n", + "- [Colocalization analysis](https://statfungen.github.io/xqtl-protocol/code/pecotmr_integration/SuSiE_enloc.html) \u2014 pairwise colocalization of xQTL and GWAS fine-mapping results\n", + "- [Colocboost](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_methods/colocboost.html) \u2014 alternative shared-variant discovery across multiple molecular traits\n", + "- [Excess-of-overlap enrichment](https://statfungen.github.io/xqtl-protocol/code/enrichment/eoo_enrichment.html) \u2014 enrichment of significant variants within genomic annotations\n", + "- [Pathway enrichment (GSEA)](https://statfungen.github.io/xqtl-protocol/code/enrichment/gsea.html) \u2014 overrepresented biological pathways in a gene set\n", + "- [Stratified LD Score Regression](https://statfungen.github.io/xqtl-protocol/code/enrichment/sldsc_enrichment.html) \u2014 heritability partitioning by annotation\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data\n", "\n", - ":::{tip}\n", - "Every pipeline supports `-h` and prints the shell commands it runs under the hood \u2014 a great way to learn what's happening and to debug failures.\n", - ":::\n", + "For record-keeping, preparation of the demo dataset is documented [on this page](https://github.com/cumc/fungen-xqtl-analysis/tree/main/analysis/Wang_Columbia/ROSMAP/MWE) \u2014 a private repository accessible to FunGen-xQTL analysis working group 
members.\n", "\n", - "---\n", + "All required input data for the protocols on this site live on [Synapse](https://www.synapse.org/#!Synapse:syn36416601). To download it:\n", "\n", - "## What to Do Next\n", + "1. Create a free account on [synapse.org](https://www.synapse.org/) \u2014 username and password are required to download.\n", + "2. Install the Synapse API client. It's already packaged for pixi:\n", + " ```bash\n", + " pixi global install -c conda-forge --environment python synapseclient\n", + " ```\n", + " Alternatively, `pip install synapseclient`. See [the Synapse install docs](https://help.synapse.org/docs/Installing-Synapse-API-Clients.1985249668.html) for details.\n", + "3. Every folder at each level of the Synapse project has its own unique ID, so you can download just the subset you need.\n", "\n", - "Pick a path based on what you're analyzing:\n", + "To download the test data for **Bulk RNA-seq molecular phenotype quantification**:\n", "\n", - "| Goal | Pipelines |\n", - "|---|---|\n", - "| Preprocess your data | `1_phenotype_preprocessing.ipynb`, `2_genotype_preprocessing.ipynb`, `4_covariates_preprocessing.ipynb` |\n", - "| Discover QTLs | `TensorQTL.ipynb`, `1_xqtl_association.ipynb`, `APEX.ipynb` |\n", - "| Fine-map | `SuSiE.ipynb`, `mvSuSiE.ipynb`, `fSuSiE.ipynb` |\n", - "| Integrate with GWAS | `coloc.ipynb`, `cTWAS.ipynb`, `GWAS_integration.ipynb` |\n", - "| Run on HPC | `Job_Example.ipynb` (SLURM / LSF / SGE / PBS template) |\n", + "```python\n", + "import synapseclient\n", + "import synapseutils\n", + "syn = synapseclient.Synapse()\n", + "syn.login(\"your username on synapse.org\", \"your password on synapse.org\")\n", + "files = synapseutils.syncFromSynapse(syn, 'syn53174239', path=\"./\")\n", + "```\n", "\n", - "Full documentation: [statfungen.github.io/xqtl-protocol](https://statfungen.github.io/xqtl-protocol/).\n", + "To download the test data for **xQTL association analysis**:\n", "\n", - "---\n", + "```python\n", + "import 
synapseclient\n", + "import synapseutils\n", + "syn = synapseclient.Synapse()\n", + "syn.login(\"your username on synapse.org\", \"your password on synapse.org\")\n", + "files = synapseutils.syncFromSynapse(syn, 'syn52369482', path=\"./\")\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Software Environment\n", "\n", - "## Troubleshooting\n", + "Every protocol on this site is designed to run inside the [pixi](https://pixi.sh/) environment set up in Steps 1\u20132 above. Once pixi and SoS are installed, every example command \"just works\" \u2014 no per-pipeline container, no manual dependency wrangling.\n", "\n", - ":::{dropdown} `pixi: command not found` after install\n", - "Open a new terminal, or re-source your shell rc file:\n", - "```bash\n", - "source ~/.bashrc # Linux / HPC\n", - "source ~/.zshrc # macOS\n", - "```\n", - ":::\n", + "If you need to add extra software later, install it into the appropriate pixi environment:\n", "\n", - ":::{dropdown} Installer killed on HPC\n", - "You're running on a login node. Request a compute node with at least 50 GB of memory and re-run the installer:\n", "```bash\n", - "srun --mem=50G --pty bash\n", - "bash pixi-setup.sh\n", + "# Python package (into the shared python env)\n", + "pixi global install -c conda-forge --environment python \n", + "\n", + "# R package (into the r-base env)\n", + "pixi global install -c conda-forge --environment r-base r-\n", + "\n", + "# Standalone CLI tools\n", + "pixi global install -c bioconda \n", "```\n", - ":::\n", "\n", - ":::{dropdown} `sos: command not found`\n", - "Step 2 didn't complete. 
Re-run the `pixi global install` command and confirm with `jupyter kernelspec list` that the `sos` kernel is registered.\n", - ":::\n", + "### Troubleshooting\n", + "\n", + "**R library conflicts** \u2014 if you see errors like\n", "\n", - ":::{dropdown} `ModuleNotFoundError` during a pipeline\n", - "Install the missing package into pixi's python environment:\n", - "```bash\n", - "pixi global install -c conda-forge --environment python \n", "```\n", - ":::\n", + "Error in dyn.load(file, DLLpath = DLLPath, ...):\n", + "unable to load shared object '$PATH/R/x86_64-pc-linux-gnu-library/4.2/stringi/libs/stringi.so':\n", + "libicui18n.so.63: cannot open shared object file: No such file or directory\n", + "```\n", + "\n", + "your system R libraries are being picked up alongside the pixi ones. Unset them before running the pipeline:\n", "\n", - ":::{dropdown} R package conflicts or install failures\n", - "Prefer conda-forge R packages over `install.packages()`:\n", "```bash\n", - "pixi global install --environment r-base r-\n", + "export R_LIBS=\"\"\n", + "export R_LIBS_USER=\"\"\n", "```\n", - "Mixing CRAN builds with conda R leads to ABI mismatches \u2014 avoid it.\n", - ":::\n", "\n", - ":::{dropdown} Still stuck?\n", - "[Open an issue](https://github.com/StatFunGen/xqtl-protocol/issues) with the command you ran and the full error output.\n", - ":::\n" + "**`pixi: command not found`** \u2014 open a new terminal or `source ~/.bashrc` (Linux/HPC) / `source ~/.zshrc` (macOS).\n", + "\n", + "**Installer killed on HPC** \u2014 you're on a login node. Request a compute node with at least 50 GB of memory and re-run.\n", + "\n", + "**`sos: command not found`** \u2014 Step 2 didn't complete. Re-run the `pixi global install` command for SoS.\n", + "\n", + "**`ModuleNotFoundError` during a pipeline** \u2014 install the missing package into pixi's python env with the command above.\n", + "\n", + "Still stuck? 
[Open an issue](https://github.com/StatFunGen/xqtl-protocol/issues) with the command you ran and the full error output.\n" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [] - }, { "cell_type": "markdown", "metadata": {}, "source": [ - "## Analysis Overview\n", + "## Analyses on High Performance Computing Clusters\n", "\n", - "The protocol is modular. Each pipeline is a self-contained SoS notebook that can run independently or chained together.\n", - "\n", - "| Stage | Pipelines | What happens |\n", - "|---|---|---|\n", - "| **1. Preprocess** | `1_phenotype_preprocessing.ipynb`, `2_genotype_preprocessing.ipynb`, `4_covariates_preprocessing.ipynb` | QC, normalization, imputation, PEER / hidden covariates |\n", - "| **2. Discover** | `TensorQTL.ipynb`, `APEX.ipynb`, `1_xqtl_association.ipynb` | Cis/trans scans, interaction QTLs, end-to-end wrapper |\n", - "| **3. Fine-map** | `SuSiE.ipynb`, `mvSuSiE.ipynb`, `fSuSiE.ipynb` | Credible sets, multi-context, functional annotations |\n", - "| **4. Integrate** | `coloc.ipynb`, `cTWAS.ipynb`, `GWAS_integration.ipynb` | Colocalization, causal TWAS, joint reporting with GWAS |\n", + "The protocol example on this page runs on a desktop workstation as a demonstration. Typical production analyses should run on an HPC cluster. 
SoS supports this natively via [SoS Remote Tasks](https://vatlab.github.io/sos-docs/doc/user_guide/task_statement.html) on [configured host computers](https://vatlab.github.io/sos-docs/doc/user_guide/host_setup.html).\n", "\n", - "All pipelines share a common config layout, so once you know one you can read the rest.\n" + "We provide a [toy example for running SoS pipelines on a typical HPC cluster environment](https://github.com/statfungen/xqtl-protocol/blob/main/code/misc/Job_Example.ipynb) \u2014 first-time users are encouraged to work through it before launching real jobs, as it covers the host and task configuration you'll reuse for every subsequent pipeline.\n" ] }, { From ddb33a846634b1b00cfa5f2a04d2fdd54102080c Mon Sep 17 00:00:00 2001 From: Jenny Date: Wed, 15 Apr 2026 19:43:55 -0400 Subject: [PATCH 3/5] Rewrite getting-started: minimalist website-style layout, pixi + SoS setup --- code/xqtl_protocol_demo.ipynb | 211 +++++++++++++++++++++------------- 1 file changed, 132 insertions(+), 79 deletions(-) diff --git a/code/xqtl_protocol_demo.ipynb b/code/xqtl_protocol_demo.ipynb index 25ca206e..56b0734a 100644 --- a/code/xqtl_protocol_demo.ipynb +++ b/code/xqtl_protocol_demo.ipynb @@ -6,17 +6,42 @@ "source": [ "# Getting Started\n", "\n", - "This notebook is the on-ramp to the xQTL protocol \u2014 a reproducible pipeline for molecular QTL analysis from raw genotypes and phenotypes through discovery, fine-mapping, and integration with GWAS.\n", + "**The FunGen-xQTL protocol is a reproducible, end-to-end pipeline for molecular quantitative trait loci (QTL) analysis** \u2014 from raw genotypes and phenotypes through discovery, fine-mapping, and integration with GWAS.\n", "\n", - "A minimal toy dataset of **49 de-identified samples** is used throughout the examples on this site so you can try every pipeline end-to-end before running on real data.\n", + "This page is a guided on-ramp. 
A minimal toy dataset of **49 de-identified samples** is used throughout the examples so you can try every pipeline end-to-end before running on real data. In about an hour you'll install the environment, clone the repo, download the demo dataset, and run your first cis-QTL scan.\n", + "\n", + "```{image} images/complete_workflow.png\n", + ":alt: FunGen-xQTL analysis workflow\n", + ":align: center\n", + ":width: 90%\n", + "```\n", + "\n", + ":::{seealso}\n", + "**New to the consortium?** Start with [How to use the resource](https://statfungen.github.io/xqtl-protocol/README.html#how-to-use-the-resource) on the homepage for the big-picture background, then come back here to set up.\n", + ":::\n", + "\n", + "\n", + "---\n", + "\n", + "## At a Glance\n", + "\n", + "The protocol is modular. Each numbered pipeline is a self-contained [SoS (Script of Scripts)](https://vatlab.github.io/sos-docs/) notebook that can run independently or be chained into the full workflow.\n", + "\n", + "| Stage | What it does | Key pipelines |\n", + "|---|---|---|\n", + "| **1. Preprocess** | Clean, normalize, and align inputs | phenotype QC, genotype QC, covariate generation |\n", + "| **2. Discover** | Scan for QTLs | TensorQTL (cis/trans), APEX (interactions) |\n", + "| **3. Fine-map** | Identify credible causal variants | SuSiE, mvSuSiE, fSuSiE |\n", + "| **4. Integrate** | Link QTLs to disease and biology | coloc, cTWAS, GWAS integration, enrichment |\n", + "\n", + "Full details with links to every mini-protocol are further down in [Analysis](#analysis). For now, let's get you set up.\n", "\n", - "This page walks you from a clean machine to your first successful run in about an hour. If you already have pixi and SoS installed, jump to [Analysis](#analysis).\n", "\n", "---\n", "\n", "## Before You Start\n", "\n", - "You'll need a Linux or macOS shell. Windows users: install [WSL2](https://learn.microsoft.com/windows/wsl/install) first.\n", + "You'll need a Linux or macOS shell. 
Windows users: install [WSL2](https://learn.microsoft.com/windows/wsl/install) first, then follow the Linux path.\n", "\n", "| Requirement | Minimum | Recommended |\n", "|---|---|---|\n", @@ -26,7 +51,7 @@ "| Git | Any recent version | 2.30+ |\n", "\n", ":::{tip}\n", - "**On HPC, start on a compute node.** The installer is memory-hungry and login nodes will kill it. Grab an interactive session first:\n", + "**On HPC, start on a compute node.** The pixi installer is memory-hungry and login nodes will kill it mid-run. Grab an interactive session first:\n", "\n", "```bash\n", "srun --mem=50G --pty bash # SLURM\n", @@ -34,11 +59,12 @@ "```\n", ":::\n", "\n", + "\n", "---\n", "\n", "## Step 1. Install pixi\n", "\n", - "We manage every dependency \u2014 Python, R, JupyterLab, bioinformatics tools \u2014 with [pixi](https://pixi.sh/). One installer sets it all up.\n", + "We manage every dependency \u2014 Python, R, JupyterLab, common CLI utilities, and a full bioinformatics stack \u2014 with [pixi](https://pixi.sh/), a fast reproducible package manager for conda channels. One installer sets it all up.\n", "\n", "```bash\n", "curl -fsSL https://raw.githubusercontent.com/StatFunGen/pixi-setup/refs/heads/main/pixi-setup.sh -o pixi-setup.sh\n", @@ -58,8 +84,8 @@ "\n", "| Type | Size | Files | Includes |\n", "|---|---|---|---|\n", - "| **1. minimal** | ~5 GB | ~100k | CLI tools, Python data-science stack, JupyterLab, base R |\n", - "| **2. full** | ~35 GB | ~350k | Everything above, **plus** samtools, bcftools, plink2, GATK4, STAR, Seurat, Bioconductor |\n", + "| **1. minimal** | ~5 GB | ~100k | CLI tools, Python data-science stack, JupyterLab, base R (tidyverse, devtools, IRkernel) |\n", + "| **2. 
full** | ~35 GB | ~350k | Everything above, **plus** samtools, bcftools, plink2, GATK4, STAR, Seurat, tensorQTL, Bioconductor |\n", "\n", "Choose **minimal** for xQTL runs with pre-processed inputs; choose **full** if you'll also do upstream QC, alignment, or single-cell work.\n", "\n", @@ -70,11 +96,14 @@ "pixi --version\n", "```\n", "\n", + "You should see a version number. If not, open a fresh terminal.\n", + "\n", + "\n", "---\n", "\n", "## Step 2. Add SoS\n", "\n", - "The protocol's pipelines are written as [SoS (Script of Scripts)](https://vatlab.github.io/sos-docs/) workflows. Install the SoS suite into pixi's Python environment:\n", + "The protocol's pipelines are written as [SoS](https://vatlab.github.io/sos-docs/) workflows, so we install the SoS suite on top of pixi's Python environment.\n", "\n", "```bash\n", "pixi global install --environment python -c conda-forge \\\n", @@ -91,6 +120,7 @@ "jupyter kernelspec list # should include 'sos'\n", "```\n", "\n", + "\n", "---\n", "\n", "## Step 3. Clone the Protocol\n", @@ -106,9 +136,71 @@ "| Folder | Contents |\n", "|---|---|\n", "| `pipeline/` | The SoS workflows you'll run |\n", - "| `code/` | Notebook documentation (this page lives here) |\n", + "| `code/` | Notebook-based documentation (this page lives here) |\n", "| `data/` | Small example inputs and configuration templates |\n", - "| `website/` | JupyterBook sources for the docs site |\n", + "| `website/` | JupyterBook sources for [statfungen.github.io/xqtl-protocol](https://statfungen.github.io/xqtl-protocol/) |\n", + ":::\n", + "\n", + "\n", + "---\n", + "\n", + "## Step 4. Download the Demo Data\n", + "\n", + "Preparation of the demo dataset is documented [on this page](https://github.com/cumc/fungen-xqtl-analysis/tree/main/analysis/Wang_Columbia/ROSMAP/MWE) (a private repository accessible to FunGen-xQTL working group members). The data itself lives on [Synapse](https://www.synapse.org/#!Synapse:syn36416601).\n", + "\n", + "1. 
Create a free account on [synapse.org](https://www.synapse.org/) \u2014 username and password are required to download.\n", + "2. Install the Synapse API client into pixi's python environment:\n", + " ```bash\n", + " pixi global install -c conda-forge --environment python synapseclient\n", + " ```\n", + " Alternatively, `pip install synapseclient`. See [the Synapse install docs](https://help.synapse.org/docs/Installing-Synapse-API-Clients.1985249668.html) for details.\n", + "3. Every folder at each level of the Synapse project has its own unique ID, so you can download just the subset you need.\n", + "\n", + "**Bulk RNA-seq molecular phenotype quantification** \u2014 test data:\n", + "\n", + "```python\n", + "import synapseclient\n", + "import synapseutils\n", + "syn = synapseclient.Synapse()\n", + "syn.login(\"your username on synapse.org\", \"your password on synapse.org\")\n", + "files = synapseutils.syncFromSynapse(syn, 'syn53174239', path=\"./\")\n", + "```\n", + "\n", + "**xQTL association analysis** \u2014 test data:\n", + "\n", + "```python\n", + "import synapseclient\n", + "import synapseutils\n", + "syn = synapseclient.Synapse()\n", + "syn.login(\"your username on synapse.org\", \"your password on synapse.org\")\n", + "files = synapseutils.syncFromSynapse(syn, 'syn52369482', path=\"./\")\n", + "```\n", + "\n", + "\n", + "---\n", + "\n", + "## Step 5. Run Your First Workflow\n", + "\n", + "Confirm SoS can see the pipelines:\n", + "\n", + "```bash\n", + "sos run pipeline/1_xqtl_association.ipynb -h\n", + "```\n", + "\n", + "You should see a list of workflow options. 
Now run a minimal cis-QTL scan using the demo data you just downloaded:\n", + "\n", + "```bash\n", + "sos run pipeline/TensorQTL.ipynb cis \\\n", + " --genotype-file data/example/genotype.bed \\\n", + " --phenotype-file data/example/phenotype.bed.gz \\\n", + " --covariate-file data/example/covariates.tsv \\\n", + " --cwd output/demo_tensorqtl\n", + "```\n", + "\n", + "Results land in `output/demo_tensorqtl/`. You now have a working environment and a known-good reference run to compare against when you bring in your own data.\n", + "\n", + ":::{tip}\n", + "Every pipeline supports `-h` and `--help`, and SoS prints the exact shell commands it runs under the hood \u2014 a great way to learn what's happening and to debug failures.\n", ":::\n" ] }, @@ -125,31 +217,33 @@ "source": [ "## Analysis\n", "\n", - "Please visit [the homepage of the protocol website](https://statfungen.github.io/xqtl-protocol/) for general background on this resource, in particular the [How to use the resource](https://statfungen.github.io/xqtl-protocol/README.html#how-to-use-the-resource) section. To perform a complete analysis from molecular phenotype quantification to xQTL discovery, conduct your analysis in the order listed below. Each link contains a mini-protocol for a specific task, and all commands documented in each mini-protocol should be executed from the command line.\n", + "With the environment set up, here's the full protocol in order. Each link is a self-contained mini-protocol; all commands in them should be executed from the command line.\n", "\n", "### Molecular Phenotype Quantification\n", "\n", - "Molecular phenotype data is required for the generation of QTLs. We support bulk RNA-seq, methylation, and splicing phenotypes. 
Multiple [reference data](https://statfungen.github.io/xqtl-protocol/code/reference_data/reference_data.html#) files are required before molecular phenotypes are quantified \u2014 reference genomes, gene annotations, variant annotations, linkage disequilibrium data, and topologically associated domains.\n", + "Molecular phenotype data is required to generate QTLs. We support bulk RNA-seq, methylation, and splicing phenotypes. Before quantification, you'll need a handful of [reference data](https://statfungen.github.io/xqtl-protocol/code/reference_data/reference_data.html#) files \u2014 reference genomes, gene annotations, variant annotations, linkage disequilibrium maps, and topologically associated domains.\n", "\n", - "- [Quantification of gene expression](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/bulk_expression.html) \u2014 RNA-SeQC for gene-level counts, or RSEM for transcript-level counts\n", - "- [Quantification of alternative splicing](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/splicing.html) \u2014 leafcutter2 to identify alternatively excised introns\n", - "- [Quantification of DNA methylation](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/methylation.html) \u2014 SeSAMe\n", + "- [Gene expression quantification](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/bulk_expression.html) \u2014 RNA-SeQC (gene-level) or RSEM (transcript-level)\n", + "- [Alternative splicing quantification](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/splicing.html) \u2014 leafcutter2 to identify alternatively excised introns\n", + "- [DNA methylation quantification](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/methylation.html) \u2014 SeSAMe\n", "\n", - "Each phenotype then undergoes phenotype-specific quality control and normalization.\n", + "Each phenotype then undergoes phenotype-specific QC and normalization.\n", "\n", "### Data 
Pre-Processing\n", "\n", - "- [Genotype preprocessing](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/genotype_preprocessing.html) \u2014 variant filters with bcftools, conversion to plink format, kinship and PCA on unrelated individuals\n", + "- [Genotype preprocessing](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/genotype_preprocessing.html) \u2014 variant filters with bcftools, conversion to plink format, kinship analysis, and genetic PCs on unrelated individuals\n", "- [Phenotype preprocessing](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/phenotype_preprocessing.html) \u2014 feature annotation, imputation of missing entries, formatting for QTL analysis\n", - "- [Covariate preprocessing](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/covariate_preprocessing.html) \u2014 merge phenotypic data with genetic PCs, then compute hidden factors to use as additional covariates\n", + "- [Covariate preprocessing](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/covariate_preprocessing.html) \u2014 merge phenotypic data with genetic PCs, then compute hidden factors as additional covariates\n", "\n", "### QTL Association Analysis\n", "\n", - "- [QTL association analysis](https://statfungen.github.io/xqtl-protocol/code/association_scan/qtl_association_testing.html) \u2014 TensorQTL with options for cis, trans, and interaction terms\n", + "- [QTL association analysis](https://statfungen.github.io/xqtl-protocol/code/association_scan/qtl_association_testing.html) \u2014 TensorQTL scans with cis, trans, and interaction options\n", "- [Hierarchical multiple testing](https://statfungen.github.io/xqtl-protocol/code/association_scan/qtl_association_postprocessing.html) \u2014 adjust p-values across levels\n", "\n", "### Integrative Analysis\n", "\n", + "Multiple methods are available for fine-mapping and for linking xQTLs to GWAS and disease biology:\n", + "\n", "- 
[TWAS](https://statfungen.github.io/xqtl-protocol/code/pecotmr_integration/twas_ctwas.html) \u2014 identify genes associated with complex traits\n", "- [Univariate fine-mapping and TWAS with SuSiE](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/univariate_fine_mapping_twas_vignette.html) \u2014 TWAS weights and credible sets\n", "- [Regression with summary statistics](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/summary_stats_finemapping_vignette.html) \u2014 include GWAS summary stats in SuSiE fine-mapping\n", @@ -157,58 +251,17 @@ "- [Colocalization analysis](https://statfungen.github.io/xqtl-protocol/code/pecotmr_integration/SuSiE_enloc.html) \u2014 pairwise colocalization of xQTL and GWAS fine-mapping results\n", "- [Colocboost](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_methods/colocboost.html) \u2014 alternative shared-variant discovery across multiple molecular traits\n", "- [Excess-of-overlap enrichment](https://statfungen.github.io/xqtl-protocol/code/enrichment/eoo_enrichment.html) \u2014 enrichment of significant variants within genomic annotations\n", - "- [Pathway enrichment (GSEA)](https://statfungen.github.io/xqtl-protocol/code/enrichment/gsea.html) \u2014 overrepresented biological pathways in a gene set\n", - "- [Stratified LD Score Regression](https://statfungen.github.io/xqtl-protocol/code/enrichment/sldsc_enrichment.html) \u2014 heritability partitioning by annotation\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Data\n", + "- [Pathway enrichment (GSEA)](https://statfungen.github.io/xqtl-protocol/code/enrichment/gsea.html) \u2014 overrepresented pathways in a gene set\n", + "- [Stratified LD Score Regression](https://statfungen.github.io/xqtl-protocol/code/enrichment/sldsc_enrichment.html) \u2014 heritability partitioning by annotation\n", "\n", - "For record-keeping, preparation of the demo dataset is documented [on this 
page](https://github.com/cumc/fungen-xqtl-analysis/tree/main/analysis/Wang_Columbia/ROSMAP/MWE) \u2014 a private repository accessible to FunGen-xQTL analysis working group members.\n", "\n", - "All required input data for the protocols on this site live on [Synapse](https://www.synapse.org/#!Synapse:syn36416601). To download it:\n", - "\n", - "1. Create a free account on [synapse.org](https://www.synapse.org/) \u2014 username and password are required to download.\n", - "2. Install the Synapse API client. It's already packaged for pixi:\n", - " ```bash\n", - " pixi global install -c conda-forge --environment python synapseclient\n", - " ```\n", - " Alternatively, `pip install synapseclient`. See [the Synapse install docs](https://help.synapse.org/docs/Installing-Synapse-API-Clients.1985249668.html) for details.\n", - "3. Every folder at each level of the Synapse project has its own unique ID, so you can download just the subset you need.\n", - "\n", - "To download the test data for **Bulk RNA-seq molecular phenotype quantification**:\n", - "\n", - "```python\n", - "import synapseclient\n", - "import synapseutils\n", - "syn = synapseclient.Synapse()\n", - "syn.login(\"your username on synapse.org\", \"your password on synapse.org\")\n", - "files = synapseutils.syncFromSynapse(syn, 'syn53174239', path=\"./\")\n", - "```\n", - "\n", - "To download the test data for **xQTL association analysis**:\n", + "---\n", "\n", - "```python\n", - "import synapseclient\n", - "import synapseutils\n", - "syn = synapseclient.Synapse()\n", - "syn.login(\"your username on synapse.org\", \"your password on synapse.org\")\n", - "files = synapseutils.syncFromSynapse(syn, 'syn52369482', path=\"./\")\n", - "```\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ "## Software Environment\n", "\n", - "Every protocol on this site is designed to run inside the [pixi](https://pixi.sh/) environment set up in Steps 1\u20132 above. 
Once pixi and SoS are installed, every example command \"just works\" \u2014 no per-pipeline container, no manual dependency wrangling.\n", + "Every protocol on this site runs inside the pixi environment configured in Steps 1\u20132. Once pixi and SoS are installed, each example \"just works\" \u2014 no per-pipeline container, no manual dependency wrangling.\n", "\n", - "If you need to add extra software later, install it into the appropriate pixi environment:\n", + "Need something extra? Install it into the right pixi environment:\n", "\n", "```bash\n", "# Python package (into the shared python env)\n", @@ -217,13 +270,14 @@ "# R package (into the r-base env)\n", "pixi global install -c conda-forge --environment r-base r-\n", "\n", - "# Standalone CLI tools\n", + "# Standalone bioinformatics CLI tool\n", "pixi global install -c bioconda \n", "```\n", "\n", "### Troubleshooting\n", "\n", - "**R library conflicts** \u2014 if you see errors like\n", + ":::{warning}\n", + "**R library conflicts.** If you see an error like\n", "\n", "```\n", "Error in dyn.load(file, DLLpath = DLLPath, ...):\n", @@ -237,27 +291,26 @@ "export R_LIBS=\"\"\n", "export R_LIBS_USER=\"\"\n", "```\n", + ":::\n", "\n", - "**`pixi: command not found`** \u2014 open a new terminal or `source ~/.bashrc` (Linux/HPC) / `source ~/.zshrc` (macOS).\n", + "**`pixi: command not found`** \u2014 open a new terminal, or re-source your shell rc file (`source ~/.bashrc` on Linux/HPC, `source ~/.zshrc` on macOS).\n", "\n", - "**Installer killed on HPC** \u2014 you're on a login node. Request a compute node with at least 50 GB of memory and re-run.\n", + "**Installer killed on HPC** \u2014 you're on a login node. Request a compute node with \u2265 50 GB memory and re-run.\n", "\n", "**`sos: command not found`** \u2014 Step 2 didn't complete. 
Re-run the `pixi global install` command for SoS.\n", "\n", "**`ModuleNotFoundError` during a pipeline** \u2014 install the missing package into pixi's python env with the command above.\n", "\n", - "Still stuck? [Open an issue](https://github.com/StatFunGen/xqtl-protocol/issues) with the command you ran and the full error output.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ + "Still stuck? [Open an issue](https://github.com/StatFunGen/xqtl-protocol/issues) with the command you ran and the full error output.\n", + "\n", + "\n", + "---\n", + "\n", "## Analyses on High Performance Computing Clusters\n", "\n", - "The protocol example on this page runs on a desktop workstation as a demonstration. Typical production analyses should run on an HPC cluster. SoS supports this natively via [SoS Remote Tasks](https://vatlab.github.io/sos-docs/doc/user_guide/task_statement.html) on [configured host computers](https://vatlab.github.io/sos-docs/doc/user_guide/host_setup.html).\n", + "The demo on this page runs on a desktop workstation. Production analyses typically run on an HPC cluster, and SoS supports this natively via [SoS Remote Tasks](https://vatlab.github.io/sos-docs/doc/user_guide/task_statement.html) on [configured host computers](https://vatlab.github.io/sos-docs/doc/user_guide/host_setup.html).\n", "\n", - "We provide a [toy example for running SoS pipelines on a typical HPC cluster environment](https://github.com/statfungen/xqtl-protocol/blob/main/code/misc/Job_Example.ipynb) \u2014 first-time users are encouraged to work through it before launching real jobs, as it covers the host and task configuration you'll reuse for every subsequent pipeline.\n" + "We provide a [toy example for running SoS pipelines on a typical HPC cluster environment](https://github.com/statfungen/xqtl-protocol/blob/main/code/misc/Job_Example.ipynb) \u2014 first-time users are encouraged to work through it before launching real jobs. 
It covers the host and task configuration you'll reuse for every subsequent pipeline, and it's scheduler-agnostic (SLURM, LSF, SGE, PBS/Torque all work).\n" ] }, { From a9e534fac53aca5ef2e51d9d65e34151a1e5f6ba Mon Sep 17 00:00:00 2001 From: Jenny Date: Wed, 15 Apr 2026 19:48:05 -0400 Subject: [PATCH 4/5] Rewrite getting-started: minimalist website-style layout, pixi + SoS setup --- code/xqtl_protocol_demo.ipynb | 101 ++++++++++++++++++++++++++-------- 1 file changed, 77 insertions(+), 24 deletions(-) diff --git a/code/xqtl_protocol_demo.ipynb b/code/xqtl_protocol_demo.ipynb index 56b0734a..894f7c0b 100644 --- a/code/xqtl_protocol_demo.ipynb +++ b/code/xqtl_protocol_demo.ipynb @@ -217,43 +217,96 @@ "source": [ "## Analysis\n", "\n", - "With the environment set up, here's the full protocol in order. Each link is a self-contained mini-protocol; all commands in them should be executed from the command line.\n", + "With the environment set up, here's the full protocol in order. Each link is a self-contained mini-protocol; all commands in them should be executed from the command line with `sos run pipeline/.ipynb ...`.\n", "\n", - "### Molecular Phenotype Quantification\n", + ":::{important}\n", + "**Minimum Working Example (MWE) \u2014 new users, start here.**\n", "\n", - "Molecular phenotype data is required to generate QTLs. We support bulk RNA-seq, methylation, and splicing phenotypes. Before quantification, you'll need a handful of [reference data](https://statfungen.github.io/xqtl-protocol/code/reference_data/reference_data.html#) files \u2014 reference genomes, gene annotations, variant annotations, linkage disequilibrium maps, and topologically associated domains.\n", + "Every module in the repo ships a minimal `MWE`-prefixed test dataset under [Synapse `syn36416559`](https://www.synapse.org/#!Synapse:syn36416559/files/). 
To go end-to-end on the demo data, work through these **five** steps in order and skip everything else on the first pass:\n", "\n", - "- [Gene expression quantification](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/bulk_expression.html) \u2014 RNA-SeQC (gene-level) or RSEM (transcript-level)\n", - "- [Alternative splicing quantification](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/splicing.html) \u2014 leafcutter2 to identify alternatively excised introns\n", - "- [DNA methylation quantification](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/methylation.html) \u2014 SeSAMe\n", + "1. [`reference_data.ipynb`](https://statfungen.github.io/xqtl-protocol/code/reference_data/reference_data.html) \u2014 pull the standardized reference files\n", + "2. [`bulk_expression.ipynb`](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/bulk_expression.html) \u2014 quantify gene expression (MWE default)\n", + "3. [`genotype_preprocessing.ipynb`](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/genotype_preprocessing.html) \u2192 [`phenotype_preprocessing.ipynb`](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/phenotype_preprocessing.html) \u2192 [`covariate_preprocessing.ipynb`](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/covariate_preprocessing.html) \u2014 QC + normalization\n", + "4. [`qtl_association_testing.ipynb`](https://statfungen.github.io/xqtl-protocol/code/association_scan/qtl_association_testing.html) \u2014 run cis-QTL with TensorQTL\n", + "5. 
[`mnm_miniprotocol.ipynb`](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_miniprotocol.html) \u2014 single-trait fine-mapping + TWAS with SuSiE\n", "\n", - "Each phenotype then undergoes phenotype-specific QC and normalization.\n", + "Once this pass completes cleanly, branch out to the additional modules below (methylation, splicing, multivariate mixture, GWAS integration, enrichment, EMS) based on what your project needs.\n", + ":::\n", + "\n", + "### 1. Reference Data\n", + "\n", + "Before quantifying phenotypes, set up the standardized reference files \u2014 genomes, gene annotations, variant annotations, LD maps, and topologically associated domains.\n", + "\n", + "- [Reference data setup](https://statfungen.github.io/xqtl-protocol/code/reference_data/reference_data.html) \u2014 main entry point \u2b50 *MWE*\n", + "- [Reference data preparation](https://statfungen.github.io/xqtl-protocol/code/reference_data/reference_data_preparation.html) \u2014 detailed preparation steps\n", + "- [Generalized TAD-B](https://statfungen.github.io/xqtl-protocol/code/reference_data/generalized_TADB.html) \u2014 TAD boundaries for analysis windows\n", + "- [LD reference pruning](https://statfungen.github.io/xqtl-protocol/code/reference_data/ld_prune_reference.html) and [RSS LD sketching](https://statfungen.github.io/xqtl-protocol/code/reference_data/rss_ld_sketch.html) \u2014 advanced LD utilities\n", + "\n", + "### 2. Molecular Phenotypes\n", + "\n", + "We support bulk RNA-seq, DNA methylation, and alternative splicing phenotypes. 
Each path has its own calling, QC, and normalization steps.\n", + "\n", + "- **Bulk RNA-seq** \u2014 [bulk_expression mini-protocol](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/bulk_expression.html) \u2b50 *MWE*, with sub-modules for [RNA calling](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/calling/RNA_calling.html), [QC](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/QC/bulk_expression_QC.html), and [normalization](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/QC/bulk_expression_normalization.html)\n", + "- **DNA methylation** \u2014 [methylation mini-protocol](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/methylation.html) with [methylation calling via SeSAMe](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/calling/methylation_calling.html)\n", + "- **Alternative splicing** \u2014 [splicing mini-protocol](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/splicing.html) with [splicing calling via leafcutter2](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/calling/splicing_calling.html) and [normalization](https://statfungen.github.io/xqtl-protocol/code/molecular_phenotypes/QC/splicing_normalization.html)\n", + "\n", + "### 3. Data Pre-processing\n", + "\n", + "- [Genotype preprocessing](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/genotype_preprocessing.html) \u2b50 *MWE* \u2014 VCF QC, GWAS QC, PCA, GRM, plink formatting\n", + "- [Phenotype preprocessing](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/phenotype_preprocessing.html) \u2b50 *MWE* \u2014 gene annotation, imputation, formatting\n", + "- [Covariate preprocessing](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/covariate_preprocessing.html) \u2b50 *MWE* \u2014 merge genetic PCs with phenotypes, compute hidden factors\n", + "\n", + "### 4. 
QTL Association Testing\n", "\n", - "### Data Pre-Processing\n", + "- [QTL association testing](https://statfungen.github.io/xqtl-protocol/code/association_scan/qtl_association_testing.html) \u2b50 *MWE* \u2014 [TensorQTL](https://statfungen.github.io/xqtl-protocol/code/association_scan/TensorQTL/TensorQTL.html) scans (cis, trans, interaction) and [quantile regression QTL](https://statfungen.github.io/xqtl-protocol/code/association_scan/quantile_models/qr_and_twas.html)\n", + "- [Association postprocessing](https://statfungen.github.io/xqtl-protocol/code/association_scan/qtl_association_postprocessing.html) \u2014 hierarchical multiple testing and p-value adjustment\n", "\n", - "- [Genotype preprocessing](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/genotype_preprocessing.html) \u2014 variant filters with bcftools, conversion to plink format, kinship analysis, and genetic PCs on unrelated individuals\n", - "- [Phenotype preprocessing](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/phenotype_preprocessing.html) \u2014 feature annotation, imputation of missing entries, formatting for QTL analysis\n", - "- [Covariate preprocessing](https://statfungen.github.io/xqtl-protocol/code/data_preprocessing/covariate_preprocessing.html) \u2014 merge phenotypic data with genetic PCs, then compute hidden factors as additional covariates\n", + "### 5. 
Multivariate Mixture Model\n", "\n", - "### QTL Association Analysis\n", + "Learn a data-driven mixture prior across contexts/tissues for multivariate fine-mapping.\n", "\n", - "- [QTL association analysis](https://statfungen.github.io/xqtl-protocol/code/association_scan/qtl_association_testing.html) \u2014 TensorQTL scans with cis, trans, and interaction options\n", - "- [Hierarchical multiple testing](https://statfungen.github.io/xqtl-protocol/code/association_scan/qtl_association_postprocessing.html) \u2014 adjust p-values across levels\n", + "- [Multivariate mixture vignette](https://statfungen.github.io/xqtl-protocol/code/multivariate_genome/multivariate_mixture_vignette.html) \u2014 overview\n", + "- [Mixture prior with MASH](https://statfungen.github.io/xqtl-protocol/code/multivariate_genome/MASH/mixture_prior.html) and [MASH fit](https://statfungen.github.io/xqtl-protocol/code/multivariate_genome/MASH/mash_fit.html) \u2014 data-driven prior estimation\n", "\n", - "### Integrative Analysis\n", + "### 6. 
Multiomics Regression Models\n", "\n", - "Multiple methods are available for fine-mapping and for linking xQTLs to GWAS and disease biology:\n", + "Fine-mapping and multi-context regression \u2014 the core of the post-discovery analysis.\n", "\n", - "- [TWAS](https://statfungen.github.io/xqtl-protocol/code/pecotmr_integration/twas_ctwas.html) \u2014 identify genes associated with complex traits\n", - "- [Univariate fine-mapping and TWAS with SuSiE](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/univariate_fine_mapping_twas_vignette.html) \u2014 TWAS weights and credible sets\n", - "- [Regression with summary statistics](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/summary_stats_finemapping_vignette.html) \u2014 include GWAS summary stats in SuSiE fine-mapping\n", - "- [Univariate fine-mapping of functional data](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/univariate_fine_mapping_fsusie_vignette.html) \u2014 fSuSiE with epigenomic annotations\n", - "- [Colocalization analysis](https://statfungen.github.io/xqtl-protocol/code/pecotmr_integration/SuSiE_enloc.html) \u2014 pairwise colocalization of xQTL and GWAS fine-mapping results\n", - "- [Colocboost](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_methods/colocboost.html) \u2014 alternative shared-variant discovery across multiple molecular traits\n", - "- [Excess-of-overlap enrichment](https://statfungen.github.io/xqtl-protocol/code/enrichment/eoo_enrichment.html) \u2014 enrichment of significant variants within genomic annotations\n", - "- [Pathway enrichment (GSEA)](https://statfungen.github.io/xqtl-protocol/code/enrichment/gsea.html) \u2014 overrepresented pathways in a gene set\n", + "- [Multi-omic regression mini-protocol](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_miniprotocol.html) \u2b50 *MWE* \u2014 start here\n", + "- [Univariate fine-mapping + TWAS with 
SuSiE](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/univariate_fine_mapping_twas_vignette.html)\n", + "- [Multivariate multi-gene fine-mapping](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/multivariate_multigene_fine_mapping_vignette.html)\n", + "- [Univariate fine-mapping with fSuSiE](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/univariate_fine_mapping_fsusie_vignette.html) \u2014 functional / epigenomic data\n", + "- [Multivariate fine-mapping vignette](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/multivariate_fine_mapping_vignette.html)\n", + "- [Summary-statistics fine-mapping](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/summary_stats_finemapping_vignette.html)\n", + "- [Multi-omic multi-trait regression](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_methods/mnm_regression.html) and [RSS analysis](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_methods/rss_analysis.html)\n", + "- [MNM postprocessing](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_postprocessing.html)\n", + "\n", + "### 7. GWAS Integration\n", + "\n", + "Link xQTL signals to disease-associated loci.\n", + "\n", + "- [SuSiE-enloc colocalization](https://statfungen.github.io/xqtl-protocol/code/pecotmr_integration/SuSiE_enloc.html) \u2014 pairwise colocalization of xQTL and GWAS fine-mapping\n", + "- [TWAS / cTWAS](https://statfungen.github.io/xqtl-protocol/code/pecotmr_integration/twas_ctwas.html) \u2014 causal TWAS for complex traits\n", + "- [Colocboost](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_methods/colocboost.html) \u2014 shared-variant discovery across multiple molecular traits\n", + "\n", + "### 8. 
Enrichment and Validation\n", + "\n", + "- [Excess-of-overlap enrichment](https://statfungen.github.io/xqtl-protocol/code/enrichment/eoo_enrichment.html) \u2014 significance of variants in annotation sets\n", + "- [Pathway enrichment (GSEA)](https://statfungen.github.io/xqtl-protocol/code/enrichment/gsea.html)\n", + "- [GREGOR](https://statfungen.github.io/xqtl-protocol/code/enrichment/gregor.html) \u2014 annotation-based enrichment for significant variants\n", "- [Stratified LD Score Regression](https://statfungen.github.io/xqtl-protocol/code/enrichment/sldsc_enrichment.html) \u2014 heritability partitioning by annotation\n", "\n", + "### 9. xQTL Modifier Score (EMS)\n", + "\n", + "Train and apply a per-variant score for prioritizing regulatory variants.\n", + "\n", + "- [EMS training](https://statfungen.github.io/xqtl-protocol/code/xqtl_modifier_score/ems_training.html)\n", + "- [EMS prediction](https://statfungen.github.io/xqtl-protocol/code/xqtl_modifier_score/ems_prediction.html)\n", + "\n", + "### Command Generator (shortcut)\n", + "\n", + "Want to skip writing SoS commands by hand? 
The [eQTL analysis command generator](https://statfungen.github.io/xqtl-protocol/code/commands_generator/eQTL_analysis_commands.html) produces the full pipeline from a single configuration file \u2014 great for reproducing a run or sharing a recipe.\n", + "\n", "\n", "---\n", "\n", From 3f61b9a6e6268535d215e63cd8758acdbaaa7151 Mon Sep 17 00:00:00 2001 From: Jenny Date: Fri, 17 Apr 2026 09:15:08 -0400 Subject: [PATCH 5/5] Rewrite getting-started: SoS conda env + pixi install, updated analysis modules --- code/xqtl_protocol_demo.ipynb | 85 +++++++++++++++++++++++------------ 1 file changed, 57 insertions(+), 28 deletions(-) diff --git a/code/xqtl_protocol_demo.ipynb b/code/xqtl_protocol_demo.ipynb index 894f7c0b..35dbb303 100644 --- a/code/xqtl_protocol_demo.ipynb +++ b/code/xqtl_protocol_demo.ipynb @@ -51,36 +51,78 @@ "| Git | Any recent version | 2.30+ |\n", "\n", ":::{tip}\n", - "**On HPC, start on a compute node.** The pixi installer is memory-hungry and login nodes will kill it mid-run. Grab an interactive session first:\n", + "**On HPC** \u2014 make sure you have access to a compute node with at least 50 GB of memory for the pixi installation step (Step 2). Login nodes often kill large installs. See Step 2 for details.\n", + ":::\n", + "\n", + "\n", + "---\n", + "\n", + "## Step 1. Install SoS in a Conda Environment\n", + "\n", + "The protocol's pipelines are written as [SoS (Script of Scripts)](https://vatlab.github.io/sos-docs/) workflows. First, create a dedicated conda environment and install SoS along with its language modules. 
Full installation reference: [SoS Conda installation guide](https://vatlab.github.io/sos-docs/running.html#Conda-installation).\n", + "\n", + "If you don't have conda yet, install [Miniforge](https://github.com/conda-forge/miniforge) (recommended) or [Anaconda](https://www.anaconda.com/download).\n", "\n", "```bash\n", - "srun --mem=50G --pty bash # SLURM\n", - "bsub -Is -M 50000 -n 4 bash # LSF\n", + "# Create and activate a new environment for SoS\n", + "conda create -n sos python=3.12 -y\n", + "conda activate sos\n", + "\n", + "# Install the full SoS suite\n", + "conda install -c conda-forge \\\n", + " sos sos-pbs sos-notebook jupyterlab-sos sos-papermill \\\n", + " sos-bash sos-python sos-r\n", + "\n", + "# Register the SoS kernel with Jupyter\n", + "python -m sos_notebook.install\n", "```\n", + "\n", + "**Verify:**\n", + "\n", + "```bash\n", + "sos --version\n", + "jupyter kernelspec list # should include 'sos'\n", + "```\n", + "\n", + ":::{tip}\n", + "Make sure you always `conda activate sos` before running any pipeline commands.\n", ":::\n", "\n", "\n", "---\n", "\n", - "## Step 1. Install pixi\n", + "## Step 2. Install the xQTL Software Stack with pixi\n", "\n", - "We manage every dependency \u2014 Python, R, JupyterLab, common CLI utilities, and a full bioinformatics stack \u2014 with [pixi](https://pixi.sh/), a fast reproducible package manager for conda channels. One installer sets it all up.\n", + "Next, install the bioinformatics and data-science packages the protocol depends on using [pixi](https://pixi.sh/) via the [StatFunGen/pixi-setup](https://github.com/StatFunGen/pixi-setup) installer.\n", + "\n", + "**On HPC systems**, your home directory likely has a storage quota that won't fit the full install. 
Temporarily point `$HOME` to a path with enough space, and add pixi to your `$PATH`:\n", "\n", "```bash\n", - "curl -fsSL https://raw.githubusercontent.com/StatFunGen/pixi-setup/refs/heads/main/pixi-setup.sh -o pixi-setup.sh\n", - "bash pixi-setup.sh\n", + "# Point HOME to a location with enough disk space\n", + "export HOME=\"/your_pixi_install_path\"\n", + "\n", + "# Add pixi to your path\n", + "export PATH=\"/your_pixi_install_path/.pixi/bin:$PATH\"\n", "```\n", "\n", + "Then run the installer:\n", + "\n", + "```bash\n", + "curl -fsSL https://raw.githubusercontent.com/StatFunGen/pixi-setup/refs/heads/main/pixi-setup.sh | bash\n", + "```\n", + "\n", + "**On a laptop or workstation** you can skip the `HOME`/`PATH` exports and just run the `curl` command \u2014 the installer will prompt you to choose an install path and type interactively.\n", + "\n", "The installer will prompt you for two things:\n", "\n", - "**1. Installation path** \u2014 where pixi stores environments and the package cache.\n", + "**1. Installation path** \u2014 where pixi stores environments and packages.\n", "\n", "| Setting | When to use |\n", "|---|---|\n", "| `$HOME/.pixi` (default) | Laptops and workstations with plenty of home-directory space |\n", - "| `/lab/$USER/.pixi` or scratch | HPC systems with strict home-directory quotas |\n", + "| `/your_pixi_install_path/.pixi` | HPC systems with strict home-directory quotas |\n", "\n", - "**2. Installation type** \u2014 pick based on what you plan to do.\n", + "**2. Installation type**\n", "\n", "| Type | Size | Files | Includes |\n", "|---|---|---|---|\n", @@ -98,27 +140,14 @@ "\n", "You should see a version number. If not, open a fresh terminal.\n", "\n", - "\n", - "---\n", - "\n", - "## Step 2. 
Add SoS\n", - "\n", - "The protocol's pipelines are written as [SoS](https://vatlab.github.io/sos-docs/) workflows, so we install the SoS suite on top of pixi's Python environment.\n", - "\n", - "```bash\n", - "pixi global install --environment python -c conda-forge \\\n", - " sos sos-pbs sos-notebook jupyterlab-sos \\\n", - " sos-bash sos-python sos-r\n", - "\n", - "pixi run -e python python -m sos_notebook.install\n", - "```\n", - "\n", - "**Verify:**\n", + ":::{warning}\n", + "**On HPC**, run the installer from a compute node with at least 50 GB of memory, not the login node. The install process can be memory-intensive and may be killed on login nodes:\n", "\n", "```bash\n", - "sos --version\n", - "jupyter kernelspec list # should include 'sos'\n", + "srun --mem=50G --pty bash # SLURM\n", + "bsub -Is -M 50000 -n 4 bash # LSF\n", "```\n", + ":::\n", "\n", "\n", "---\n",