From f7ee2a3f67dc9f913545da89b09fc37053cb2873 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Wed, 22 Apr 2026 14:03:45 -0400 Subject: [PATCH 1/2] Delete code/mnm_analysis/mnm_methods/colocboost.ipynb --- code/mnm_analysis/mnm_methods/colocboost.ipynb | 5 ----- 1 file changed, 5 deletions(-) delete mode 100644 code/mnm_analysis/mnm_methods/colocboost.ipynb diff --git a/code/mnm_analysis/mnm_methods/colocboost.ipynb b/code/mnm_analysis/mnm_methods/colocboost.ipynb deleted file mode 100644 index 6d8d9586..00000000 --- a/code/mnm_analysis/mnm_methods/colocboost.ipynb +++ /dev/null @@ -1,5 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - \ No newline at end of file From 8abeb96643d255b3acf82f077dccef9bac282095 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Wed, 22 Apr 2026 14:04:13 -0400 Subject: [PATCH 2/2] Fixed bugs. --- .../mnm_analysis/mnm_methods/colocboost.ipynb | 1430 +++++++++++++++++ 1 file changed, 1430 insertions(+) create mode 100644 code/mnm_analysis/mnm_methods/colocboost.ipynb diff --git a/code/mnm_analysis/mnm_methods/colocboost.ipynb b/code/mnm_analysis/mnm_methods/colocboost.ipynb new file mode 100644 index 00000000..7ffeed0e --- /dev/null +++ b/code/mnm_analysis/mnm_methods/colocboost.ipynb @@ -0,0 +1,1430 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "# Multi-trait colocalization using ColocBoost" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Description" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "We use ColocBoost to perform colocalization analyses [cf. Cao et al (2025)](https://www.medrxiv.org/content/10.1101/2025.04.17.25326042v1). In short, ColocBoost performs colocalization while accounting for multiple causal variants within a genomic region of interest and can scale to hundreds of traits. We allow users to include or omit GWAS summary statistics. Required inputs are individual xQTL data from the same cohort (multiple phenotypes, same genotype). Linkage disequilibrium reference data is required if GWAS summary statistics are used." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS", + "tags": [] + }, + "source": [ + "## Input\n", + "\n", + "1. A list of regions to be analyzed (optional); the last column of this file should be region name.\n", + "2. Either a list of per chromosome genotype files, or one file for genotype data of the entire genome. Genotype data has to be in PLINK `bed` format. \n", + "3. Vector of lists of phenotype files per region to be analyzed, in UCSC `bed.gz` with index in `bed.gz.tbi` formats.\n", + "4. Vector of covariate files corresponding to the lists above.\n", + "5. Customized association windows file for variants (cis or trans). If it is not provided, a fixed sized window will be used around the region (a cis-window)\n", + "6. Optionally a vector of names of the phenotypic conditions in the form of `cond1 cond2 cond3` separated with whitespace.\n", + "7. Optionally summary statistics meta-data file and LD reference meta-data file.\n", + "\n", + "Input 2 and 3 should be outputs from `genotype_per_region` and `annotate_coord` modules in previous preprocessing steps. 4 should be output of `covariate_preprocessing` pipeline that contains genotype PC, phenotypic hidden confounders and fixed covariates." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS", + "tags": [] + }, + "source": [ + "### Example genotype, phenotype and association analysis windows \n", + "\n", + "See [this page](mnm_regression) for example inputs of these information." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### Example summary statistics and LD reference\n", + "\n", + "See [rss_analysis](rss_analysis) for LD reference formats (pre-computed blocks and PLINK genotype files),\n", + "per-study LD reference via the `ld_meta_data` column in GWAS meta-data, and mixture LD panels.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### About indels\n", + "\n", + "Use `--no-indel` to remove indel variants from analysis. This applies to both individual-level genotype loading and summary statistics QC. Default keeps indels.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "jp-MarkdownHeadingCollapsed": true, + "kernel": "SoS", + "tags": [] + }, + "source": [ + "## Output\n", + "\n", + "For each analysis region, the output are various ColocBoost models fitted and saved in RDS format. An RDS file with the following values will be generated for each gene of interest and for each GWAS, if GWAS integration is used. \n", + "\n", + "1. ucos_summary - a summary table of colocalization events\n", + "2. vpa - the variable colocalized probability for each variable, or the probability of a variant being colocalized with at least one trait.\n", + "3. data_info - an object containing information on the input data\n", + "4. model_info - an object containing information on the colocboost model\n", + "5. ucos_details - contains trait specific (uncolocalized) effects information\n", + "6. region_info - information on the region(s) of interest\n", + "7. computing_time - time to run the steps of coloc boost (loading, QC and analysis)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Minimal Working Example Steps" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "Timing: ~X min" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### xQTL only anlaysis" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "Below we duplicate the examples for phenotype and covariates to demonstrate that when there are multiple phenotypes for the same genotype it is possible to use this pipeline to analyze all of them (more than two is accepted as well).\n", + "\n", + "Here using `--region-name` we focus the analysis on 3 genes. In practice if this parameter is dropped, the union of all regions in all phenotype region lists will be analyzed. It is possible for some of the regions there are no genotype data, in which case the pipeline will output RDS files with a warning message to indicate the lack of genotype data to analyze." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "**Note:** Suggested output naming convention is cohort_modality, eg ROSMAP_snRNA_pseudobulk." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### v. Colocboost" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS", + "vscode": { + "languageId": "plaintext" + } + }, + "outputs": [], + "source": [ + "sos run pipeline/colocboost.ipynb colocboost \\\n", + " --name test_coloc_boost_xqtl_only \\\n", + " --cwd output/colocboost \\\n", + " --genoFile output/plink/wgs.merged.bed \\\n", + " --phenoFile output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt \\\n", + " --covFile output/covariate/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \\\n", + " --customized-association-windows reference_data/TAD/TADB_enhanced_cis.bed \\\n", + " --no-separate-gwas --xqtl-coloc \\\n", + " --region-name ENSG00000239945 \\\n", + " --phenotype-names trait_A\n", + "\n", + "#It is also possible to analyze a selected list of regions using option `--region-list`. The last column of this file will be used for the list to analyze. Here for example use the same list of regions as we used for customized association-window:\n", + "sos run pipeline/colocboost.ipynb colcoboost \\\n", + " --name protocol_example_protein \\\n", + " --genoFile input/xqtl_association/protocol_example.genotype.chr21_22.bed \\\n", + " --phenoFile output/phenotype/protocol_example.protein.region_list.txt \\\n", + " output/phenotype/protocol_example.protein.region_list.txt \\\n", + " --covFile output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \\\n", + " output/covariate/protocol_example.protein.protocol_example.samples.protocol_example.genotype.chr21_22.pQTL.plink_qc.prune.pca.Marchenko_PC.gz \\\n", + " --customized-association-windows input/xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \\\n", + " --no-separate-gwas --xqtl-coloc \\\n", + " --region-list xqtl_association/protocol_example.protein.enhanced_cis_chr21_chr22.bed \\\n", + " --phenotype-names trait_A trait_B\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "```\n", + "\n", + "INFO: Running get_rss_analysis_regions: \n", + "INFO: get_rss_analysis_regions is completed.\n", + "INFO: get_rss_analysis_regions output: regional_rss_data\n", + "INFO: Running get_analysis_regions: \n", + "Loading customized association analysis window from reference_data/TAD/TADB_enhanced_cis.bed\n", + "INFO: get_analysis_regions is completed.\n", + "INFO: get_analysis_regions output: regional_data\n", + "INFO: Running colocboost: \n", + "INFO: colocboost is completed.\n", + "INFO: colocboost output: /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/colocboost/colocboost/test_coloc_boost_xqtl_only.chr1_ENSG00000239945.cb_xqtl.rds\n", + "INFO: Workflow colocboost (ID=w675afec4b9a12f39) is executed successfully with 3 completed steps.\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "### vi. Colocboost with GWAS\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run pipeline/colocboost.ipynb colocboost \\\n", + " --name colocboost_gwas \\\n", + " --cwd output/colocboost_gwas \\\n", + " --genoFile output/plink/wgs.merged.bed \\\n", + " --phenoFile output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt \\\n", + " --covFile output/covariate/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \\\n", + " --customized-association-windows reference_data/TAD/TADB_enhanced_cis.bed \\\n", + " --ld-meta-data data/ld_meta_file.tsv \\\n", + " --gwas-meta-data data/colocboost/gwas_meta.txt \\\n", + " --separate-gwas --xqtl-coloc \\\n", + " --region-name ENSG00000239945 \\\n", + " --phenotype-names trait_A\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "```\n", + "INFO: Running get_rss_analysis_regions: \n", + "Using customized association window file: reference_data/TAD/TADB_enhanced_cis.bed\n", + "INFO: Running get_analysis_regions: \n", + "INFO: get_rss_analysis_regions is completed.\n", + "INFO: get_rss_analysis_regions output: regional_rss_data\n", + "Loading customized association analysis window from reference_data/TAD/TADB_enhanced_cis.bed\n", + "INFO: get_analysis_regions is completed.\n", + "INFO: get_analysis_regions output: regional_data\n", + "INFO: Running colocboost: \n", + "INFO: colocboost is completed.\n", + "INFO: colocboost output: /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/colocboost_gwas/colocboost/colocboost_gwas.chr1_ENSG00000239945.cb_xqtl.rds /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/colocboost_gwas/colocboost/colocboost_gwas.chr1_ENSG00000239945.cb_xqtl_Wightman.rds\n", + "INFO: Workflow colocboost (ID=w39fc8e9aa4cca1be) is executed successfully with 3 completed steps.\n", + "\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### vii. Colocboost with multiple LD references (multi-ancestry)\n", + "\n", + "When GWAS studies come from different populations, each can use its own LD reference.\n", + "Add an `ld_meta_data` column to the GWAS meta-data file to specify per-study LD:\n", + "\n", + "```\n", + "# gwas_meta_multi.tsv\n", + "study_id chrom file_path column_mapping_file n_sample n_case n_control ld_meta_data\n", + "AD_EUR 0 ad_eur.txt mapping.yml 0 111326 677663 ld_ref_EUR.tsv\n", + "PD_EUR 0 pd_eur.txt mapping.yml 123455 0 0 ld_ref_EUR.tsv\n", + "AD_AFR 0 ad_afr.txt mapping.yml 0 21982 41944 ld_ref_AFR.tsv\n", + "```\n" + ] + }, + { + "cell_type": "code", + "metadata": {}, + "source": [ + "# Multi-ancestry: per-study LD reference specified in gwas_meta_multi.tsv\n", + "sos run pipeline/colocboost.ipynb colocboost \\\n", + " --name colocboost_multi_ld \\\n", + " --cwd output/colocboost_multi_ld \\\n", + " --genoFile output/plink/wgs.merged.bed \\\n", + " --phenoFile output/phenotype/pheno.bed.gz \\\n", + " --covFile output/covariate/cov.gz \\\n", + " --customized-association-windows reference_data/TAD/TADB_enhanced_cis.bed \\\n", + " --ld-meta-data reference_data/ld_EUR.tsv \\\n", + " --gwas-meta-data data/colocboost/gwas_meta_multi.tsv \\\n", + " --separate-gwas --xqtl-coloc \\\n", + " --region-name ENSG00000239945 \\\n", + " --phenotype-names trait_A\n" + ], + "outputs": [], + "execution_count": null + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Troubleshooting" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "| Step | Substep | Problem | Possible Reason | Solution |\n", + "|------|---------|---------|------------------|---------|\n", + "| | | | | |\n", + "\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Command interface" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "sos run colocboost.ipynb -h" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS", + "tags": [] + }, + "source": [ + "```\n", + "usage: sos run /restricted/projectnb/xqtl/xqtl_protocol/xqtl-protocol/code/mnm_analysis/mnm_methods/colocboost.ipynb\n", + " [workflow_name | -t targets] [options] [workflow_options]\n", + " workflow_name: Single or combined workflows defined in this script\n", + " targets: One or more targets to generate\n", + " options: Single-hyphen sos parameters (see \"sos run -h\" for details)\n", + " workflow_options: Double-hyphen workflow-specific parameters\n", + "\n", + "Workflows:\n", + " get_analysis_regions\n", + " get_rss_analysis_regions\n", + " colocboost\n", + "\n", + "Global Workflow Options:\n", + " --name VAL (as str, required)\n", + " It is required to input the name of the analysis\n", + " --cwd output (as path)\n", + " --genoFile VAL (as path, required)\n", + " A list of file paths for genotype data, or the genotype data itself.\n", + " --phenoFile paths\n", + "\n", + " One or multiple lists of file paths for phenotype data.\n", + " --phenoIDFile paths()\n", + "\n", + " One or multiple lists of file paths for phenotype ID mapping file. The first column should be the original ID, the 2nd column should be the ID to be mapped to.\n", + " --covFile paths\n", + "\n", + " Covariate file path\n", + " --gwas-meta-data . (as path)\n", + " Summary statistics interface, see `rss_analysis.ipynb` for details\n", + " --ld-meta-data . (as path)\n", + " --gwas-name (as list)\n", + " --gwas-data (as list)\n", + " --column-mapping (as list)\n", + " --region-list . (as path)\n", + " Optional: if a region list is provide the analysis will be focused on provided region. The LAST column of this list will contain the ID of regions to focus on when\n", + " region_name is given Otherwise, all regions with both genotype and phenotype files will be analyzed\n", + " --region-name (as list)\n", + " Optional: if a region name is provided the analysis would be focused on the union of provides region list and region names\n", + " --keep-samples . (as path)\n", + " Only focus on a subset of samples\n", + " --keep-variants . (as path)\n", + " Only focus on a subset of variants\n", + " --customized-association-windows . (as path)\n", + " An optional list documenting the custom association window for each region to analyze, with four column, chr, start, end, region ID (eg gene ID). If this list is not\n", + " provided, the default `window` parameter (see below) will be used.\n", + " --cis-window -1 (as int)\n", + " Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp When this is set to negative, we will rely on using\n", + " customized_association_windows\n", + " --[no-]save-data (default to False)\n", + " save data object or not\n", + " --phenotype-names [f'{x:bn}' for x in phenoFile]\n", + "\n", + " Name of phenotypes\n", + " --[no-]trans-analysis (default to False)\n", + " And indicator whether it is trans-analysis, ie, not using phenotypic coordinate information\n", + " --imiss 1.0 (as float)\n", + " remove a variant if it has more than imiss missing individual level data\n", + " --maf 0.0025 (as float)\n", + " MAF and variance of X cutoff\n", + " --xvar-cutoff 0.0 (as float)\n", + " --mac 5 (as int)\n", + " MAC cutoff, on top of MAF cutoff\n", + " --[no-]indel (default to True)\n", + " Remove indels if indel = False\n", + " --event-filter-rules . (as path)\n", + " --skip-analysis-pip-cutoff (as list)\n", + " If this value is not 0, then an initial single effect analysis will be performed to determine if follow up analysis will be continued or to simply return NULL If this is\n", + " negative we use a default way to determine this cutoff which is conservative but still useful\n", + " --skip-sumstats-analysis-pip-cutoff -1.0 (as float)\n", + " --[no-]xqtl-coloc (default to True)\n", + " Perform xQTL colocalization\n", + " --[no-]joint-gwas (default to False)\n", + " Perform joint colocalization with many traits\n", + " --[no-]separate-gwas (default to True)\n", + " Perform separate GWAS targeted anlaysis one at a time\n", + " --[no-]impute (default to False)\n", + " Whether to impute the sumstat for all the snp in LD but not in sumstat.\n", + " --rcond 0.01 (as float)\n", + " --lamb 0.01 (as float)\n", + " --R2-threshold 0.6 (as float)\n", + " --minimum-ld 5 (as int)\n", + " --qc-method NULL\n", + " summary stats QC methods: rss_qc, dentist, slalom\n", + " --extract-sumstats-region-name NULL\n", + " --sumstats-region-name-col NULL\n", + " Either an index number in str format or \"NULL\" as a string\n", + " --comment-string NULL\n", + " --ld-data-name (as list)\n", + " --container ''\n", + " Analysis environment settings\n", + " --job-size 200 (as int)\n", + " For cluster jobs, number commands to run per job\n", + " --walltime 1h\n", + " Wall clock time expected\n", + " --mem 20G\n", + " Memory expected\n", + " --numThreads 1 (as int)\n", + " Number of threads\n", + "\n", + "Sections\n", + " get_analysis_regions:\n", + " get_rss_analysis_regions:\n", + " colocboost:\n", + " ```" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## Workflow implementation" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[global]\n", + "# It is required to input the name of the analysis\n", + "parameter: name = str\n", + "parameter: cwd = path(\"output\")\n", + "# A list of file paths for genotype data, or the genotype data itself. \n", + "parameter: genoFile = path\n", + "# One or multiple lists of file paths for phenotype data.\n", + "parameter: phenoFile = paths\n", + "# One or multiple lists of file paths for phenotype ID mapping file. The first column should be the original ID, the 2nd column should be the ID to be mapped to.\n", + "parameter: phenoIDFile = paths()\n", + "# Covariate file path\n", + "parameter: covFile = paths\n", + "# Summary statistics interface, see `rss_analysis.ipynb` for details\n", + "parameter: gwas_meta_data = path()\n", + "# Default LD reference. Can be overridden per-study via ld_meta_data column in gwas_meta_data.\n", + "parameter: ld_meta_data = path()\n", + "parameter: gwas_name = []\n", + "parameter: gwas_data = []\n", + "parameter: column_mapping = []\n", + "# Optional: if a region list is provide the analysis will be focused on provided region. \n", + "# The LAST column of this list will contain the ID of regions to focus on when region_name is given\n", + "# Otherwise, all regions with both genotype and phenotype files will be analyzed\n", + "parameter: region_list = path()\n", + "# Optional: if a region name is provided \n", + "# the analysis would be focused on the union of provides region list and region names\n", + "parameter: region_name = []\n", + "# Only focus on a subset of samples\n", + "parameter: keep_samples = path()\n", + "# Only focus on a subset of variants\n", + "parameter: keep_variants = path()\n", + "# An optional list documenting the custom association window for each region to analyze, with four column, chr, start, end, region ID (eg gene ID).\n", + "# If this list is not provided, the default `window` parameter (see below) will be used.\n", + "parameter: customized_association_windows = path()\n", + "# Specify the cis window for the up and downstream radius to analyze around the region of interest in units of bp\n", + "# When this is set to negative, we will rely on using customized_association_windows\n", + "parameter: cis_window = -1\n", + "# save data object or not\n", + "parameter: save_data = False\n", + "# Name of phenotypes\n", + "parameter: phenotype_names = [f'{x:bn}' for x in phenoFile]\n", + "# And indicator whether it is trans-analysis, ie, not using phenotypic coordinate information\n", + "parameter: trans_analysis = False\n", + "\n", + "# remove a variant if it has more than imiss missing individual level data\n", + "parameter: imiss = 1.0\n", + "# MAF and variance of X cutoff\n", + "parameter: maf = 0.0025\n", + "parameter: xvar_cutoff = 0.0\n", + "# MAC cutoff, on top of MAF cutoff\n", + "parameter: mac = 5\n", + "# Remove indels if indel = False\n", + "parameter: indel = True\n", + "parameter: event_filter_rules = path()\n", + "# If this value is not 0, then an initial single effect analysis will be performed \n", + "# to determine if follow up analysis will be continued or to simply return NULL\n", + "# If this is negative we use a default way to determine this cutoff which is conservative but still useful\n", + "parameter: skip_analysis_pip_cutoff = []\n", + "parameter: skip_sumstats_analysis_pip_cutoff = -1.0\n", + "# Perform xQTL colocalization\n", + "parameter: xqtl_coloc = True\n", + "# Perform joint colocalization with many traits\n", + "parameter: joint_gwas = False\n", + "# Perform separate GWAS targeted anlaysis one at a time\n", + "parameter: separate_gwas = True\n", + "# Whether to impute the sumstat for all the snp in LD but not in sumstat.\n", + "parameter: impute = False \n", + "parameter: rcond = 0.01\n", + "parameter: lamb = 0.01\n", + "parameter: R2_threshold = 0.6\n", + "parameter: minimum_ld = 5\n", + "# summary stats QC methods: rss_qc, dentist, slalom\n", + "parameter: qc_method = \"NULL\"\n", + "parameter: extract_sumstats_region_name = \"NULL\"\n", + "# Either an index number in str format or \"NULL\" as a string\n", + "parameter: sumstats_region_name_col = \"NULL\"\n", + "parameter: comment_string = \"NULL\"\n", + "# Maps each GWAS study to an LD reference name. If empty, all studies use the first LD reference.\n", + "\n", + "# Analysis environment settings\n", + "parameter: container = \"\"\n", + "# For cluster jobs, number commands to run per job\n", + "parameter: job_size = 200\n", + "# Wall clock time expected\n", + "parameter: walltime = \"1h\"\n", + "# Memory expected\n", + "parameter: mem = \"20G\"\n", + "# Number of threads\n", + "parameter: numThreads = 1\n", + "\n", + "if len(phenoFile) != len(covFile):\n", + " raise ValueError(\"Number of input phenotypes files must match that of covariates files\")\n", + "if len(phenoFile) != len(phenotype_names):\n", + " raise ValueError(\"Number of input phenotypes files must match the number of phenotype names\")\n", + "if len(phenoIDFile) > 0 and len(phenoFile) != len(phenoIDFile):\n", + " raise ValueError(\"Number of input phenotypes files must match the number of phenotype ID mapping files\")\n", + "\n", + "if len(skip_analysis_pip_cutoff) == 0:\n", + " skip_analysis_pip_cutoff = [0.0] * len(phenoFile)\n", + "if len(skip_analysis_pip_cutoff) == 1:\n", + " skip_analysis_pip_cutoff = skip_analysis_pip_cutoff * len(phenoFile)\n", + "if len(skip_analysis_pip_cutoff) != len(phenoFile):\n", + " raise ValueError(f\"``skip_analysis_pip_cutoff`` should have either length 1 or length the same as phenotype files ({len(phenoFile)} in this case)\")\n", + "\n", + "# make it into an R List string\n", + "skip_analysis_pip_cutoff = [f\"'{y}'={x}\" for x,y in zip(skip_analysis_pip_cutoff, phenotype_names)]\n", + " \n", + "def group_by_region(lst, partition):\n", + " # from itertools import accumulate\n", + " # partition = [len(x) for x in partition]\n", + " # Compute the cumulative sums once\n", + " # cumsum_vector = list(accumulate(partition))\n", + " # Use slicing based on the cumulative sums\n", + " # return [lst[(cumsum_vector[i-1] if i > 0 else 0):cumsum_vector[i]] for i in range(len(partition))]\n", + " return partition\n", + "\n", + "import os\n", + "import pandas as pd\n", + "\n", + "def adapt_file_path(file_path, reference_file):\n", + " \"\"\"\n", + " Adapt a single file path based on its existence and a reference file's path.\n", + "\n", + " Args:\n", + " - file_path (str): The file path to adapt.\n", + " - reference_file (str): File path to use as a reference for adaptation.\n", + "\n", + " Returns:\n", + " - str: Adapted file path.\n", + "\n", + " Raises:\n", + " - FileNotFoundError: If no valid file path is found.\n", + " \"\"\"\n", + " reference_path = os.path.dirname(reference_file)\n", + "\n", + " # Check if the file exists\n", + " if os.path.isfile(file_path):\n", + " return file_path\n", + "\n", + " # Check file name without path\n", + " file_name = os.path.basename(file_path)\n", + " if os.path.isfile(file_name):\n", + " return file_name\n", + "\n", + " # Check file name in reference file's directory\n", + " file_in_ref_dir = os.path.join(reference_path, file_name)\n", + " if os.path.isfile(file_in_ref_dir):\n", + " return file_in_ref_dir\n", + "\n", + " # Check original file path prefixed with reference file's directory\n", + " file_prefixed = os.path.join(reference_path, file_path)\n", + " if os.path.isfile(file_prefixed):\n", + " return file_prefixed\n", + "\n", + " # If all checks fail, raise an error\n", + " raise FileNotFoundError(f\"No valid path found for file: {file_path}\")\n", + "\n", + "def adapt_file_path_all(df, column_name, reference_file):\n", + " return df[column_name].apply(lambda x: adapt_file_path(x, reference_file))" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[get_analysis_regions: shared = \"regional_data\"]\n", + "# input is genoFile, phenoFile, covFile and optionally region_list. If region_list presents then we only analyze what's contained in the list.\n", + "# regional_data should be a dictionary like:\n", + "#{'data': [(\"genotype_1.bed\", \"phenotype_1.bed.gz\", \"covariate_1.gz\"), (\"genotype_2.bed\", \"phenotype_1.bed.gz\", \"phenotype_2.bed.gz\", \"covariate_1.gz\", \"covariate_2.gz\") ... ],\n", + "# 'meta_info': [(\"chr12:752578-752579\",\"chr12:752577-752580\", \"gene_1\", \"trait_1\"), (\"chr13:852580-852581\",\"chr13:852579-852580\", \"gene_2\", \"trait_1\", \"trait_2\") ... ]}\n", + "import numpy as np\n", + "\n", + "def preload_id_map(id_map_files):\n", + " id_maps = {}\n", + " missing_files = []\n", + " \n", + " for id_map_file in id_map_files:\n", + " if id_map_file is not None:\n", + " if os.path.isfile(id_map_file):\n", + " df = pd.read_csv(id_map_file, sep=r\"\\s+\", header=None, comment='#', names=['old_ID', 'new_ID'])\n", + " id_maps[id_map_file] = df.set_index('old_ID')['new_ID'].to_dict()\n", + " else:\n", + " missing_files.append(id_map_file)\n", + " \n", + " # If there are missing files, print a summary\n", + " # if missing_files:\n", + " # print(f\"Warning: {len(missing_files)} ID map file(s) specified but not found:\")\n", + " # for file in missing_files:\n", + " # print(f\" - {file}\")\n", + " \n", + " return id_maps\n", + "\n", + "def load_and_apply_id_map(pheno_path, id_map_path, preloaded_id_maps):\n", + " pheno_df = pd.read_csv(pheno_path, sep=r\"\\s+\", header=0)\n", + " pheno_df['Original_ID'] = pheno_df['ID']\n", + " \n", + " # Check if id_map_path is specified\n", + " if id_map_path is not None:\n", + " # Check if the id_map_path exists in preloaded maps\n", + " if id_map_path in preloaded_id_maps:\n", + " id_map = preloaded_id_maps[id_map_path]\n", + " pheno_df['ID'] = pheno_df['ID'].map(id_map).fillna(pheno_df['ID'])\n", + " else:\n", + " # ID map file was specified but doesn't exist or wasn't loaded\n", + " print(f\"Warning: ID map file '{id_map_path}' was specified but does not exist. Using original IDs.\")\n", + " # No mapping is done here, so ID remains the same as Original_ID\n", + " \n", + " return pheno_df\n", + "\n", + "def filter_by_region_ids(data, region_ids):\n", + " if region_ids is not None and len(region_ids) > 0:\n", + " data = data[data['ID'].isin(region_ids)]\n", + " return data\n", + "\n", + "def custom_join(series):\n", + " # Initialize an empty list to hold the processed items\n", + " result = []\n", + " for item in series:\n", + " if ',' in item:\n", + " # If the item contains commas, split by comma and convert to tuple\n", + " result.append(tuple(item.split(',')))\n", + " else:\n", + " # If the item does not contain commas, add it directly\n", + " result.append(item)\n", + " # Convert the list of items to a tuple and return\n", + " return tuple(result)\n", + "\n", + "def aggregate_phenotype_data(accumulated_pheno_df):\n", + " if not accumulated_pheno_df.empty:\n", + " accumulated_pheno_df = accumulated_pheno_df.groupby(['#chr','ID','cond','path','cov_path'], as_index=False).agg({\n", + " '#chr': lambda x: np.unique(x).astype(str)[0],\n", + " 'ID': lambda x: np.unique(x).astype(str)[0],\n", + " 'Original_ID': ','.join,\n", + " 'start': 'min',\n", + " 'end': 'max'\n", + " }).groupby(['#chr','ID'], as_index=False).agg({\n", + " 'cond': ','.join,\n", + " 'path': ','.join,\n", + " 'Original_ID': custom_join,\n", + " 'cov_path': ','.join,\n", + " 'start': 'min',\n", + " 'end': 'max'\n", + " })\n", + " return accumulated_pheno_df\n", + "\n", + "def process_cis_files(pheno_files, cov_files, phenotype_names, pheno_id_files, region_ids, preloaded_id_maps):\n", + " '''\n", + " Example output:\n", + " #chr start end ID Original_ID path cov_path cond\n", + " chr12 752578 752579 ENSG00000060237 Q9H4A3,P62873 protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz covar_1.gz,covar_2.gz trait_A,trait_B\n", + " '''\n", + " accumulated_pheno_df = pd.DataFrame()\n", + " pheno_id_files = [None] * len(pheno_files) if len(pheno_id_files) == 0 else pheno_id_files\n", + "\n", + " for pheno_path, cov_path, phenotype_name, id_map_path in zip(pheno_files, cov_files, phenotype_names, pheno_id_files):\n", + " if not os.path.isfile(cov_path):\n", + " raise FileNotFoundError(f\"No valid path found for file: {cov_path}\")\n", + " pheno_df = load_and_apply_id_map(pheno_path, id_map_path, preloaded_id_maps)\n", + " pheno_df = filter_by_region_ids(pheno_df, region_ids)\n", + " \n", + " if not pheno_df.empty:\n", + " pheno_df.iloc[:, 4] = adapt_file_path_all(pheno_df, pheno_df.columns[4], f\"{pheno_path:a}\")\n", + " pheno_df = pheno_df.assign(cov_path=str(cov_path), cond=phenotype_name) \n", + " accumulated_pheno_df = pd.concat([accumulated_pheno_df, pheno_df], ignore_index=True)\n", + "\n", + " accumulated_pheno_df = aggregate_phenotype_data(accumulated_pheno_df)\n", + " return accumulated_pheno_df\n", + "\n", + "def process_trans_files(pheno_files, cov_files, phenotype_names, pheno_id_files, region_ids, customized_association_windows):\n", + " '''\n", + " Example output:\n", + " #chr start end ID Original_ID path cov_path cond\n", + " chr21 0 0 chr21_18133254_19330300 carnitine,benzoate,hippurate metabolon_1.bed.gz,metabolon_2.bed.gz covar_1.gz,covar_2.gz trait_A,trait_B\n", + " '''\n", + " \n", + " if not os.path.isfile(customized_association_windows):\n", + " raise ValueError(\"Customized association analysis window must be specified for trans analysis.\")\n", + " accumulated_pheno_df = pd.DataFrame()\n", + " pheno_id_files = [None] * len(pheno_files) if len(pheno_id_files) == 0 else pheno_id_files\n", + " genotype_windows = pd.read_csv(customized_association_windows, comment=\"#\", header=None, names=[\"#chr\",\"start\",\"end\",\"ID\"], sep=\"\\t\")\n", + " genotype_windows = filter_by_region_ids(genotype_windows, region_ids)\n", + " if genotype_windows.empty:\n", + " return accumulated_pheno_df\n", + " \n", + " for pheno_path, cov_path, phenotype_name, id_map_path in zip(pheno_files, cov_files, phenotype_names, pheno_id_files):\n", + " if not os.path.isfile(cov_path):\n", + " raise FileNotFoundError(f\"No valid path found for file: {cov_path}\")\n", + " pheno_df = pd.read_csv(pheno_path, sep=r\"\\s+\", header=0, names=['Original_ID', 'path'])\n", + " if not pheno_df.empty:\n", + " pheno_df.iloc[:, -1] = adapt_file_path_all(pheno_df, pheno_df.columns[-1], f\"{pheno_path:a}\")\n", + " pheno_df = pheno_df.assign(cov_path=str(cov_path), cond=phenotype_name)\n", + " # Here we combine genotype_windows which contains \"#chr\" and \"ID\" to pheno_df by creating a cartesian product\n", + " pheno_df = pd.merge(genotype_windows.assign(key=1), pheno_df.assign(key=1), on='key').drop('key', axis=1)\n", + " # then set start and end columns to zero\n", + " pheno_df['start'] = 0\n", + " pheno_df['end'] = 0\n", + " if id_map_path is not None:\n", + " # Filter pheno_df by specific association-window and phenotype pairs\n", + " association_analysis_pair = pd.read_csv(id_map_path, sep=r\"\\s+\", header=None, comment='#', names=['ID', 'Original_ID'])\n", + " pheno_df = pd.merge(pheno_df, association_analysis_pair, on=['ID', 'Original_ID'])\n", + " accumulated_pheno_df = pd.concat([accumulated_pheno_df, pheno_df], ignore_index=True)\n", + "\n", + " accumulated_pheno_df = aggregate_phenotype_data(accumulated_pheno_df)\n", + " return accumulated_pheno_df\n", + "\n", + "def load_regional_data(genoFile, phenoFile, covFile, phenotype_names, phenoIDFile, \n", + " region_list=None, region_name=None, trans_analysis=False, \n", + " customized_association_windows=None, cis_window=-1): \n", + " # Ensure region_name is a list\n", + " if region_name is None:\n", + " region_name = []\n", + " \n", + " # Load genotype meta data\n", + " if f\"{genoFile:x}\" == \".bed\":\n", + " geno_meta_data = pd.DataFrame([(\"chr\"+str(x), f\"{genoFile:a}\") for x in range(1,23)] + [(\"chrX\", f\"{genoFile:a}\")], columns=['#chr', 'geno_path'])\n", + " else:\n", + " geno_meta_data = pd.read_csv(f\"{genoFile:a}\", sep=r\"\\s+\", header=0)\n", + " geno_meta_data.iloc[:, 1] = adapt_file_path_all(geno_meta_data, geno_meta_data.columns[1], f\"{genoFile:a}\")\n", + " geno_meta_data.columns = ['#chr', 'geno_path']\n", + " geno_meta_data['#chr'] = geno_meta_data['#chr'].apply(lambda x: str(x) if str(x).startswith('chr') else f'chr{x}')\n", + "\n", + " # Checking the DataFrame\n", + " valid_chr_values = [f'chr{x}' for x in range(1, 23)] + ['chrX']\n", + " if not all(value in valid_chr_values for value in geno_meta_data['#chr']):\n", + " raise ValueError(\"Invalid chromosome values found. Allowed values are chr1 to chr22 and chrX.\")\n", + "\n", + " region_ids = []\n", + " # If region_list is provided, read the file and extract IDs\n", + " if region_list is not None and region_list.is_file():\n", + " region_list_df = pd.read_csv(region_list, delim_whitespace=True, header=None, comment=\"#\")\n", + " region_ids = region_list_df.iloc[:, -1].unique() # Extracting the last column for IDs\n", + "\n", + " # If region_name is provided, include those IDs as well\n", + " # --region-name A B C will result in a list of [\"A\", \"B\", \"C\"] here\n", + " if len(region_name) > 0:\n", + " region_ids = list(set(region_ids).union(set(region_name)))\n", + "\n", + " if trans_analysis:\n", + " meta_data = process_trans_files(phenoFile, covFile, phenotype_names, phenoIDFile, region_ids, customized_association_windows)\n", + " else:\n", + " meta_data = process_cis_files(phenoFile, covFile, phenotype_names, phenoIDFile, region_ids, preload_id_map(phenoIDFile))\n", + " \n", + " if not meta_data.empty:\n", + " meta_data = meta_data.merge(geno_meta_data, on='#chr', how='inner')\n", + " # Adjust association-window\n", + " if customized_association_windows is not None and os.path.isfile(customized_association_windows):\n", + " print(f\"Loading customized association analysis window from {customized_association_windows}\")\n", + " association_windows_list = pd.read_csv(customized_association_windows, comment=\"#\", header=None, \n", + " names=[\"#chr\",\"start\",\"end\",\"ID\"], sep=\"\\t\")\n", + " meta_data = pd.merge(meta_data, association_windows_list, on=['#chr', 'ID'], how='left', suffixes=('', '_association'))\n", + " mismatches = meta_data[meta_data['start_association'].isna()]\n", + " if not mismatches.empty:\n", + " print(\"First 5 mismatches:\")\n", + " print(mismatches[['ID']].head())\n", + " raise ValueError(f\"{len(mismatches)} regions to analyze cannot be found in ``{customized_association_windows}``. \"\n", + " f\"Please check your ``{customized_association_windows}`` database to make sure it contains all \"\n", + " f\"association-window definitions. \")\n", + " else:\n", + " if cis_window < 0:\n", + " raise ValueError(\"Please either input valid path to association-window file via ``--customized-association-windows``, \"\n", + " \"or set ``--cis-window`` to a non-negative integer.\")\n", + " if cis_window == 0:\n", + " print(\"Warning: only variants within the range of start and end of molecular phenotype will be considered \"\n", + " \"since cis_window is set to zero and no customized association window file was found. \"\n", + " \"Please make sure this is by design.\")\n", + " meta_data['start_association'] = meta_data['start'].apply(lambda x: max(x - cis_window, 0))\n", + " meta_data['end_association'] = meta_data['end'] + cis_window\n", + "\n", + " # Example meta_data:\n", + " # #chr start end start_association end_association ID Original_ID path cov_path cond coordinate geno_path\n", + " # 0 chr12 752578 752579 652578 852579 ENSG00000060237 Q9H4A3,P62873 protocol_example.protein_1.bed.gz,protocol_example.protein_2.bed.gz covar_1.gz,covar_2.gz trait_A,trait_B chr12:752578-752579 protocol_example.genotype.chr21_22.bed \n", + " # Create the final dictionary\n", + " regional_data = {\n", + " 'data': [(row['geno_path'], *row['path'].split(','), *row['cov_path'].split(',')) for _, row in meta_data.iterrows()],\n", + " 'meta_info': [(f\"{row['#chr']}:{row['start']}-{row['end']}\", # this is the phenotypic region to extract data from\n", + " f\"{row['#chr']}:{row['start_association']}-{row['end_association']}\", # this is the association window region\n", + " row['ID'], row['Original_ID'], *row['cond'].split(',')) for _, row in meta_data.iterrows()]\n", + " }\n", + " else:\n", + " regional_data = {'data': [], 'meta_info': []}\n", + " \n", + " return regional_data\n", + "\n", + "\n", + "# Modified main code section to document the key format alignment\n", + "regional_data = load_regional_data(\n", + " genoFile=genoFile, \n", + " phenoFile=phenoFile, \n", + " covFile=covFile, \n", + " phenotype_names=phenotype_names, \n", + " phenoIDFile=phenoIDFile,\n", + " region_list=region_list, \n", + " region_name=region_name, \n", + " trans_analysis=trans_analysis,\n", + " customized_association_windows=customized_association_windows, \n", + " cis_window=cis_window\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "kernel": "SoS" + }, + "outputs": [], + "source": [ + "[get_rss_analysis_regions: shared = \"regional_rss_data\"]\n", + "\n", + "from collections import OrderedDict\n", + "\n", + "def file_exists(file_path, relative_path=None):\n", + " \"\"\"Check if a file exists at the given path or relative to a specified path.\"\"\"\n", + " if os.path.exists(file_path) and os.path.isfile(file_path):\n", + " return True\n", + " elif relative_path:\n", + " relative_file_path = os.path.join(relative_path, file_path)\n", + " return os.path.exists(relative_file_path) and os.path.isfile(relative_file_path)\n", + " return False\n", + "\n", + "def check_required_columns(df, required_columns):\n", + " \"\"\"Check if the required columns are present in the dataframe.\"\"\"\n", + " missing_columns = [col for col in required_columns if col not in df.columns]\n", + " if missing_columns:\n", + " raise ValueError(f\"Missing required columns: {', '.join(missing_columns)}\")\n", + "\n", + "def parse_region(region):\n", + " \"\"\"\n", + " Parse a region string in 'chr:start-end' format into a list [chr, start, end].\n", + " Returns a list containing [chromosome_with_chr_prefix, start_position, end_position].\n", + " \"\"\"\n", + " # Handle both numeric and 'chr' prefixed chromosome formats\n", + " if ':' in region:\n", + " chrom, rest = region.split(':')\n", + " # Store original chrom value with 'chr' prefix preserved\n", + " original_chrom = chrom\n", + " \n", + " # Strip 'chr' prefix if present (for internal processing)\n", + " if chrom.startswith('chr'):\n", + " chrom = chrom[3:] # Remove 'chr' prefix if present\n", + " \n", + " # Handle both hyphen and underscore separators for start-end\n", + " if '-' in rest:\n", + " start, end = rest.split('-')\n", + " elif '_' in rest:\n", + " start, end = rest.split('_')\n", + " else:\n", + " raise ValueError(f\"Invalid region format: {region}. Expected format: chr:start-end or chr:start_end\")\n", + " \n", + " # Return with the original chromosome format that includes 'chr' prefix\n", + " return [original_chrom, int(start), int(end)]\n", + " else:\n", + " raise ValueError(f\"Invalid region format: {region}. Expected format: chr:start-end or chr:start_end\")\n", + "\n", + "def resolve_region_names(region_name, customized_association_windows=None, region_list=None):\n", + " \"\"\"\n", + " Resolve region names by looking them up in either customized_association_windows or region_list.\n", + " Returns a dictionary mapping region keys to parsed regions.\n", + " \"\"\"\n", + " region_dict = {}\n", + " \n", + " # If no region names provided, return empty dict\n", + " if not region_name:\n", + " return region_dict\n", + " \n", + " # Check if either file exists\n", + " if not ((customized_association_windows and os.path.isfile(customized_association_windows)) or \n", + " (region_list and os.path.isfile(region_list))):\n", + " raise ValueError(\"Either customized association window file or region list file must exist when region names are provided\")\n", + " \n", + " # Determine which file to use (prioritize customized_association_windows)\n", + " region_file = None\n", + " if customized_association_windows and os.path.isfile(customized_association_windows):\n", + " region_file = customized_association_windows\n", + " print(f\"Using customized association window file: {region_file}\")\n", + " elif region_list and os.path.isfile(region_list):\n", + " region_file = region_list\n", + " print(f\"Using region list file: {region_file}\")\n", + " \n", + " # Load region definitions from the file\n", + " name_to_region_map = {}\n", + " with open(region_file, 'r') as file:\n", + " for line in file:\n", + " # Skip empty lines and comments\n", + " if not line.strip() or line.startswith(\"#\"):\n", + " continue\n", + " \n", + " parts = line.strip().split()\n", + " if len(parts) >= 4: # Ensure there's at least 4 columns\n", + " # Extract ID column (4th column)\n", + " region_id = parts[3]\n", + " \n", + " # Extract chromosome, start, and end positions\n", + " chrom = parts[0].replace(\"chr\", \"\") # Remove 'chr' prefix if present\n", + " try:\n", + " # Create the region info\n", + " parsed_region = [int(chrom) if chrom.isdigit() else chrom, \n", + " int(parts[1]), \n", + " int(parts[2])]\n", + " \n", + " # Include the ID\n", + " parsed_region.append(parts[3])\n", + " \n", + " # Add to the mapping\n", + " name_to_region_map[region_id] = parsed_region\n", + " except ValueError:\n", + " # Skip lines with non-integer positions\n", + " print(f\"Warning: Skipping invalid region definition: {line.strip()}\")\n", + " \n", + " # Process each region name\n", + " for region in region_name:\n", + " # Check if this region name exists in our mapping\n", + " if region in name_to_region_map:\n", + " parsed_region = name_to_region_map[region]\n", + " else:\n", + " # Try to parse it as a coordinate string as a last resort\n", + " try:\n", + " parsed_region = parse_region(region)\n", + " print(f\"Warning: Region '{region}' not found in file, parsing as coordinate string\")\n", + " except ValueError:\n", + " raise ValueError(f\"Could not parse region '{region}' and it was not found in the region file\")\n", + " \n", + " # Format the region key\n", + " region_key = f\"chr{parsed_region[0]}:{parsed_region[1]}-{parsed_region[2]}\"\n", + " if region_key not in region_dict:\n", + " region_dict[region_key] = parsed_region\n", + " \n", + " return region_dict\n", + "\n", + "def load_regional_rss_data(gwas_meta_data, gwas_name, gwas_data, column_mapping, region_name, region_list):\n", + " \"\"\"\n", + " Extracts data from GWAS metadata files and additional GWAS data provided. \n", + " Uses the same region information as load_regional_data to ensure consistent\n", + " region identification across both data structures.\n", + " \n", + " IMPORTANT: This function processes the same regions as load_regional_data and \n", + " formats the region keys in regional_rss_data[\"regions\"] to exactly match the \n", + " 2nd element of regional_data[\"meta_info\"] tuples (chr:start-end format).\n", + "\n", + " Args:\n", + " - gwas_meta_data (str): File path to the GWAS metadata file.\n", + " - gwas_name (list): Vector of GWAS study names.\n", + " - gwas_data (list): Vector of GWAS data.\n", + " - column_mapping (list, optional): Vector of column mapping files.\n", + " - region_name (list, optional): List of region names (same as in load_regional_data).\n", + " - region_list (str, required): File path to a file containing region mapping between names and coordinates (same file used in load_regional_data).\n", + "\n", + " Returns:\n", + " - GWAS Dictionary: Maps study IDs to a list containing chromosome number, \n", + " GWAS file path, and optional column mapping file path.\n", + " - Region Dictionary: Maps region names to lists [chr, start, end] using the same\n", + " key format as regional_data[\"meta_info\"][1] (chr:start-end).\n", + "\n", + " Raises:\n", + " - FileNotFoundError: If any specified file path does not exist.\n", + " - ValueError: If required columns are missing in the input files or vector lengths mismatch.\n", + " \"\"\"\n", + " # Check vector lengths\n", + " if len(gwas_name) != len(gwas_data):\n", + " raise ValueError(\"gwas_name and gwas_data must be of equal length\")\n", + " \n", + " if len(column_mapping) > 0 and len(column_mapping) != len(gwas_name):\n", + " raise ValueError(\"If column_mapping is provided, it must be of the same length as gwas_name and gwas_data\")\n", + "\n", + " # Required columns for GWAS file type\n", + " required_gwas_columns = ['study_id', 'chrom', 'file_path']\n", + "\n", + " # Base directory of the metadata files\n", + " gwas_base_dir = os.path.dirname(gwas_meta_data)\n", + " \n", + " # Reading the GWAS metadata file\n", + " gwas_df = pd.read_csv(gwas_meta_data, sep=\"\\t\")\n", + " check_required_columns(gwas_df, required_gwas_columns)\n", + " gwas_dict = OrderedDict()\n", + "\n", + " # If one LD name provided, apply to all studies\n", + " # If multiple LD names provided, ensure they match the number of studies\n", + " # Apply specific LD name to each study using vectorized operations\n", + " \n", + " # Process additional GWAS data from vectors\n", + " for name, data, mapping in zip(gwas_name, gwas_data, column_mapping or [None]*len(gwas_name)):\n", + " gwas_dict[name] = {0: [data, mapping]}\n", + "\n", + " for _, row in gwas_df.iterrows():\n", + " file_path = row['file_path']\n", + " mapping_file = row.get('column_mapping_file')\n", + " n_sample = row.get('n_sample')\n", + " n_case = row.get('n_case')\n", + " n_control = row.get('n_control')\n", + "\n", + " # Check if the file and optional mapping file exist\n", + " if not file_exists(file_path, gwas_base_dir) or (mapping_file and not file_exists(mapping_file, gwas_base_dir)):\n", + " raise FileNotFoundError(f\"File {file_path} not found for {row['study_id']}\")\n", + " \n", + " # Adjust paths if necessary\n", + " file_path = file_path if file_exists(file_path) else os.path.join(gwas_base_dir, file_path)\n", + " if mapping_file:\n", + " mapping_file = mapping_file if file_exists(mapping_file) else os.path.join(gwas_base_dir, mapping_file)\n", + " \n", + " # Create or update the entry for the study_id\n", + " if row['study_id'] not in gwas_dict:\n", + " gwas_dict[row['study_id']] = {}\n", + "\n", + " # Expand chrom 0 to chrom 1-22 or use the specified chrom\n", + " chrom_range = range(1, 23) if row['chrom'] == 0 else [row['chrom']]\n", + " for chrom in chrom_range:\n", + " if chrom in gwas_dict[row['study_id']]:\n", + " existing_entry = gwas_dict[row['study_id']][chrom]\n", + " raise ValueError(f\"Duplicate chromosome specification for study_id {row['study_id']}, chrom {chrom}. \"\n", + " f\"Conflicting entries: {existing_entry} and {[file_path, mapping_file]}\")\n", + " gwas_dict[row['study_id']][chrom] = [file_path, mapping_file, n_sample, n_case, n_control, row.get('ld_meta_data', str(ld_meta_data))]\n", + "\n", + " # Process region_list and region_name\n", + " region_dict = dict()\n", + " \n", + " if region_name:\n", + " for region in region_name:\n", + " try:\n", + " # First try parsing the region as a coordinate\n", + " parsed_region = parse_region(region)\n", + " region_key = f\"chr{parsed_region[0]}:{parsed_region[1]}-{parsed_region[2]}\"\n", + " if region_key not in region_dict:\n", + " region_dict[region_key] = parsed_region\n", + " except ValueError:\n", + " # If that fails, it means it's a name that needs to be resolved\n", + " resolved_regions = resolve_region_names([region], customized_association_windows, region_list)\n", + " region_dict.update(resolved_regions)\n", + " else:\n", + " # Analyze all regions found in region list\n", + " with open(region_list, 'r') as file:\n", + " for line in file:\n", + " # Skip empty lines and comments\n", + " if not line.strip() or line.startswith(\"#\"):\n", + " continue\n", + " \n", + " parts = line.strip().split()\n", + " if len(parts) == 1:\n", + " # Format: \"chr:start-end\"\n", + " region = parse_region(parts[0])\n", + " elif len(parts) == 3:\n", + " # Format: chr start end\n", + " chrom = parts[0]\n", + " # Ensure 'chr' prefix is preserved\n", + " if not chrom.startswith('chr'):\n", + " chrom = f\"chr{chrom}\"\n", + " region = [chrom, int(parts[1]), int(parts[2])]\n", + " elif len(parts) >= 4:\n", + " # Format for eQTL: chr start end gene_id gene_name ...\n", + " chrom = parts[0]\n", + " # Ensure 'chr' prefix is preserved\n", + " if not chrom.startswith('chr'):\n", + " chrom = f\"chr{chrom}\"\n", + " region = [chrom, int(parts[1]), int(parts[2]), parts[3]]\n", + " else:\n", + " raise ValueError(f\"Invalid region format in region_list: {line.strip()}\")\n", + " \n", + " # Create key in the same format as regional_data[\"meta_info\"][1]\n", + " region_key = f\"chr{region[0]}:{region[1]}-{region[2]}\"\n", + " region_dict[region_key] = region\n", + "\n", + " return gwas_dict, region_dict\n", + "\n", + "if gwas_meta_data.is_file():\n", + " # Load GWAS data with region keys formatted to match regional_data[\"meta_info\"]\n", + " # Using the same region_list and region_name as load_regional_data ensures consistency\n", + " gwas_dict, region_dict = load_regional_rss_data(gwas_meta_data, gwas_name, gwas_data, column_mapping, region_name, region_list)\n", + " regional_rss_data = dict([(\"GWAS\", gwas_dict), (\"regions\", region_dict)])\n", + " \n", + " # The keys in regional_rss_data[\"regions\"] are now formatted as \"chr:start-end\"\n", + " # to match the second element of each tuple in regional_data[\"meta_info\"],\n", + " # with both formats ensuring the chromosome includes the 'chr' prefix.\n", + " # Since we use the same region_list and region_name parameters for both \n", + " # load_regional_data and load_regional_rss_data, we guarantee consistent\n", + " # region identification across both data structures.\n", + " # The expected format for shared keys is: \"chr1:12345-67890\"\n", + "else:\n", + " regional_rss_data = dict()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "kernel": "SoS" + }, + "source": [ + "## ColocBoost analysis" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": { + "kernel": "SoS", + "tags": [] + }, + "outputs": [], + "source": [ + "[colocboost]\n", + "depends: sos_variable(\"regional_data\"), sos_variable(\"regional_rss_data\")\n", + "# Check if both 'data' and 'meta_info' are empty lists\n", + "stop_if(len(regional_data['data']) == 0, f'Either genotype or phenotype data are not available for region {\", \".join(region_name)}.')\n", + "meta_info = regional_data['meta_info']\n", + "input: regional_data[\"data\"], group_by = lambda x: group_by_region(x, regional_data[\"data\"]), group_with = \"meta_info\"\n", + "\n", + "base_dir = f'{cwd:a}'\n", + "chr_info = _meta_info[0].split(\":\")[0]\n", + "gene_id = _meta_info[2]\n", + "base_filename = f'{name}.{chr_info}_{gene_id}'\n", + "\n", + "# Initialize output_files list\n", + "output_files = []\n", + "\n", + "# Default case: if all analysis flags are False, save data\n", + "if not xqtl_coloc and not joint_gwas and not separate_gwas:\n", + " save_data = True\n", + "\n", + "if save_data:\n", + " output_files.append(f'{base_dir}/data/{base_filename}.cb_data.rds')\n", + "\n", + "if xqtl_coloc:\n", + " output_files.append(f'{base_dir}/colocboost/{base_filename}.cb_xqtl.rds')\n", + "\n", + "if joint_gwas:\n", + " output_files.append(f'{base_dir}/colocboost/{base_filename}.cb_xqtl_joint_gwas.rds')\n", + "\n", + "if separate_gwas and \"GWAS\" in regional_rss_data and regional_rss_data[\"GWAS\"]:\n", + " for study_name in regional_rss_data[\"GWAS\"].keys():\n", + " output_files.append(f'{base_dir}/colocboost/{base_filename}.cb_xqtl_{study_name}.rds')\n", + " \n", + "output: output_files\n", + "task: trunk_workers = 1, trunk_size = job_size, walltime = walltime, mem = mem, cores = numThreads, tags = f'{step_name}_{_output[0]:bnn}'\n", + "R: expand = '${ }', stdout = f\"{_output[0]:nn}.colocboost.stdout\", stderr = f\"{_output[0]:nn}.colocboost.stderr\", container = container\n", + " options(warn=1)\n", + " \n", + " library(colocboost)\n", + " library(pecotmr)\n", + " library(dplyr)\n", + " \n", + " ##### ======= set up all input parameters ====== #####\n", + " \n", + " # individual level data\n", + " region = ${(\"'%s'\" % _meta_info[0]) if int(_meta_info[0].split('-')[-1])>0 else 'NULL'} # if the end position is zero return NULL\n", + " genotype_list = ${_input[0]:anr}\n", + " phenotype_list = c(${\",\".join(['\"%s\"' % x.absolute() for x in _input[1:len(_input)//2+1]])})\n", + " covariate_list = c(${\",\".join(['\"%s\"' % x.absolute() for x in _input[len(_input)//2+1:]])})\n", + " conditions_list_individual = c(${\",\".join(['\"%s\"' % x for x in _meta_info[4:]])})\n", + " maf_cutoff = ${maf}\n", + " mac_cutoff = ${mac}\n", + " imiss_cutoff = ${imiss}\n", + " association_window = \"${_meta_info[1]}\"\n", + " extract_region_name = list(${\",\".join([(\"c('\"+x+\"')\") if isinstance(x, str) else (\"c\"+ str(x)) for x in _meta_info[3]])})\n", + " region_name_col = ${\"4\" if int(_meta_info[0].split('-')[-1])>0 else \"1\"}\n", + " keep_indel = ${\"TRUE\" if indel else \"FALSE\"}\n", + " # extract subset of samples\n", + " keep_samples = NULL\n", + " if (${\"TRUE\" if keep_samples.is_file() else \"FALSE\"}) {\n", + " keep_samples = unlist(strsplit(readLines(${keep_samples:ar}), \"\\\\s+\"))\n", + " message(paste(length(keep_samples), \"samples are selected to be loaded for analysis\"))\n", + " }\n", + " phenotype_header = ${\"4\" if int(_meta_info[0].split('-')[-1])>0 else \"1\"}\n", + " scale_residuals = FALSE\n", + " keep_variants = ${'\"%s\"' % keep_variants if not keep_variants.is_dir() else \"NULL\"} # use \"not keep_variants.is_dir()\" in case user input filename is wrong without realizing\n", + " \n", + " # - summary statistics\n", + " sumstat_path_list = c(${paths([regional_rss_data['GWAS'][x][regional_rss_data['regions'][_meta_info[1]][0]][0] for x in regional_rss_data[\"GWAS\"].keys()] if \"GWAS\" in regional_rss_data else []):r,})\n", + " column_file_path_list = c(${paths([regional_rss_data['GWAS'][x][regional_rss_data['regions'][_meta_info[1]][0]][1] for x in regional_rss_data[\"GWAS\"].keys()] if \"GWAS\" in regional_rss_data else []):r,})\n", + " # Replace _regions with _meta_info[1] which is the association window region\n", + " n_samples = c(${paths([regional_rss_data['GWAS'][x][regional_rss_data['regions'][_meta_info[1]][0]][2] for x in regional_rss_data[\"GWAS\"].keys()] if \"GWAS\" in regional_rss_data else []):r,}) %>% as.numeric\n", + " n_cases = c(${paths([regional_rss_data['GWAS'][x][regional_rss_data['regions'][_meta_info[1]][0]][3] for x in regional_rss_data[\"GWAS\"].keys()] if \"GWAS\" in regional_rss_data else []):r,}) %>% as.numeric\n", + " n_controls = c(${paths([regional_rss_data['GWAS'][x][regional_rss_data['regions'][_meta_info[1]][0]][4] for x in regional_rss_data[\"GWAS\"].keys()] if \"GWAS\" in regional_rss_data else []):r,}) %>% as.numeric\n", + " if(\"${extract_sumstats_region_name}\" == \"NULL\"){\n", + " extract_sumstats_region_name = NULL\n", + " } else {\n", + " extract_sumstats_region_name = \"${extract_sumstats_region_name}\"\n", + " }\n", + " sumstats_region_name_col = ${sumstats_region_name_col}\n", + " comment_string = ${\"NULL\" if comment_string == \"NULL\" else f\"'{comment_string}'\"}\n", + " qc_method = ${\"NULL\" if qc_method == \"NULL\" else \"'%s'\" % qc_method}\n", + " impute = ${\"TRUE\" if impute else \"FALSE\"}\n", + " impute_opts = list(rcond = ${rcond}, R2_threshold = ${R2_threshold}, minimum_ld = ${minimum_ld}, lamb=${lamb})\n", + " keep_indel = ${\"TRUE\" if indel else \"FALSE\"}\n", + " \n", + " # Summary stats data - Using the 2nd element (association window) from _meta_info directly\n", + " studies = c(${', '.join(['\"{}\"'.format(item) for item in regional_rss_data[\"GWAS\"].keys()] if \"GWAS\" in regional_rss_data else [])})\n", + " # Build LD-to-study mapping from per-study ld_meta_data in GWAS metadata.\n", + " # Each study's ld_meta_data path comes from the GWAS meta TSV (index [5]).\n", + " # Falls back to --ld-meta-data default when the column is absent.\n", + " match_LD_sumstat <- list()\n", + " study_ld_paths <- c(${\",\".join(['\"%s\"' % regional_rss_data[\"GWAS\"][x][regional_rss_data[\"regions\"][_meta_info[1]][0]][5] for x in regional_rss_data[\"GWAS\"].keys()] if \"GWAS\" in regional_rss_data else [])})\n", + " names(study_ld_paths) <- studies\n", + " for (ld_path in unique(study_ld_paths)) {\n", + " match_LD_sumstat[[ld_path]] <- names(study_ld_paths)[study_ld_paths == ld_path]\n", + " }\n", + " LD_meta_file_path_list <- unique(study_ld_paths)\n", + "\n", + "\n", + " # If we have any LD data names\n", + " # About analysis\n", + " xqtl_coloc = ${\"TRUE\" if xqtl_coloc else \"FALSE\"}\n", + " joint_gwas = ${\"TRUE\" if joint_gwas else \"FALSE\"}\n", + " separate_gwas = ${\"TRUE\" if separate_gwas else \"FALSE\"}\n", + " # About QC parameters\n", + " pip_cutoff_to_skip_ind = unlist(list(${\",\".join(skip_analysis_pip_cutoff)})[conditions_list_individual])\n", + " pip_cutoff_to_skip_sumstat = rep(${skip_sumstats_analysis_pip_cutoff}, length(studies)) %>% setNames(studies)\n", + " event_filter_rules = ${\"NULL\" if not event_filter_rules.is_file() else \"'%s'\" % event_filter_rules}\n", + " if (!is.null(event_filter_rules)) {\n", + " event_filters <- read.table(\"${event_filter_rules}\")\n", + " event_filters <- lapply(1:nrow(event_filters), function(ii) as.list(event_filters[ii,] %>% unlist))\n", + " } else { event_filters <- NULL }\n", + " \n", + " # Add computational time\n", + " computing_time = list()\n", + " # Load regional multitask data\n", + " t1 <- Sys.time()\n", + " tryCatch({\n", + " fdat = load_multitask_regional_data(region = region, \n", + " # individual\n", + " genotype_list = genotype_list,\n", + " phenotype_list = phenotype_list, \n", + " covariate_list = covariate_list, \n", + " conditions_list_individual = conditions_list_individual,\n", + " maf_cutoff = maf_cutoff, mac_cutoff = mac_cutoff,\n", + " imiss_cutoff = imiss_cutoff,\n", + " association_window = association_window, \n", + " extract_region_name = extract_region_name,\n", + " region_name_col = region_name_col,\n", + " keep_indel = keep_indel, keep_samples = keep_samples,\n", + " phenotype_header = phenotype_header, scale_residuals = scale_residuals,\n", + " keep_variants = keep_variants,\n", + " # summary stats\n", + " sumstat_path_list = sumstat_path_list,\n", + " column_file_path_list = column_file_path_list,\n", + " LD_meta_file_path_list = LD_meta_file_path_list,\n", + " match_LD_sumstat = match_LD_sumstat,\n", + " conditions_list_sumstat = studies,\n", + " n_samples = n_samples, n_cases = n_cases, n_controls = n_controls,\n", + " extract_sumstats_region_name = extract_sumstats_region_name, sumstats_region_name_col = sumstats_region_name_col,\n", + " comment_string = comment_string)\n", + " }, NoSNPsError = function(e) {\n", + " message(\"Error: \", paste(e$message, \"${_meta_info[2] + '@' + _meta_info[1]}\"))\n", + " quit(save=\"no\")\n", + " }) \n", + " t2 <- Sys.time()\n", + " computing_time$Loading = t2 - t1\n", + "\n", + " \n", + " # save data-set\n", + " if (${\"TRUE\" if save_data else \"FALSE\"}) {\n", + " saveRDS(list(${_meta_info[2]} = fdat), \"${_output[0]:ann}.cb_data.rds\", compress='xz')\n", + " }\n", + " if (\"${_meta_info[2]}\" != \"${_meta_info[3]}\") {\n", + " region_name = c(\"${_meta_info[2]}\", c(${\",\".join([(\"c('\"+x+\"')\") if isinstance(x, str) else (\"c\"+ str(x)) for x in _meta_info[3]])}))\n", + " } else {\n", + " region_name = \"${_meta_info[2]}\"\n", + " }\n", + " \n", + " # - analysis parameters\n", + " region_info = list(region_coord=parse_region(\"${_meta_info[0]}\"), grange=parse_region(\"${_meta_info[1]}\"), region_name=region_name)\n", + " res <- colocboost_analysis_pipeline(region_data = fdat, \n", + " event_filters = event_filters,\n", + " # - analysis\n", + " xqtl_coloc = xqtl_coloc,\n", + " joint_gwas = joint_gwas,\n", + " separate_gwas = separate_gwas,\n", + " # - individual QC\n", + " maf_cutoff = maf_cutoff, \n", + " pip_cutoff_to_skip_ind = pip_cutoff_to_skip_ind,\n", + " # - sumstat QC\n", + " pip_cutoff_to_skip_sumstat = pip_cutoff_to_skip_sumstat,\n", + " keep_indel = keep_indel,\n", + " qc_method = qc_method,\n", + " impute = impute, \n", + " impute_opts = impute_opts)\n", + "\n", + " # Reorganize outputs\n", + " computing_time <- c(computing_time, res$computing_time)\n", + " if (xqtl_coloc) {\n", + " res$xqtl_coloc$region_info <- region_info\n", + " res$xqtl_coloc$computing_time <- computing_time\n", + " saveRDS(list(\"${_meta_info[2]}\" = res$xqtl_coloc), \"${_output[0]:ann}.cb_xqtl.rds\", compress='xz')\n", + " }\n", + " if (joint_gwas) {\n", + " res$joint_gwas$region_info <- region_info\n", + " res$joint_gwas$computing_time <- computing_time\n", + " saveRDS(list(\"${_meta_info[2]}\" = res$joint_gwas), \"${_output[0]:ann}.cb_xqtl_joint_gwas.rds\", compress='xz')\n", + " }\n", + " if (separate_gwas) {\n", + " if (any(is.na(names(res$separate_gwas))) || is.null(names(res$separate_gwas))) { names(res$separate_gwas) <- studies[seq_along(res$separate_gwas)] }\n", + " for (name in names(res$separate_gwas)) {\n", + " res$separate_gwas[[name]]$region_info <- region_info\n", + " res$separate_gwas[[name]]$computing_time <- computing_time\n", + " saveRDS(list(\"${_meta_info[2]}\" = res$separate_gwas[[name]]), paste0(\"${_output[0]:ann}.cb_xqtl_\", name, \".rds\"), compress='xz')\n", + " }\n", + " }\n", + " " + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "SoS", + "language": "sos", + "name": "sos" + }, + "language_info": { + "codemirror_mode": "sos", + "file_extension": ".sos", + "mimetype": "text/x-sos", + "name": "sos", + "nbconvert_exporter": "sos_notebook.converter.SoS_Exporter", + "pygments_lexer": "sos" + }, + "sos": { + "kernels": [ + [ + "Bash", + "calysto_bash", + "Bash", + "#E6EEFF", + "shell" + ], + [ + "Markdown", + "markdown", + "markdown", + "", + "" + ], + [ + "SoS", + "sos", + "", + "", + "sos" + ] + ], + "version": "0.24.3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file