From 7ce643832b335c5081d197ad994556f10d2b452c Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Fri, 24 Apr 2026 17:03:02 -0400 Subject: [PATCH 1/2] Delete code/mnm_analysis/univariate_fine_mapping_twas_vignette.ipynb --- ...nivariate_fine_mapping_twas_vignette.ipynb | 610 ------------------ 1 file changed, 610 deletions(-) delete mode 100644 code/mnm_analysis/univariate_fine_mapping_twas_vignette.ipynb diff --git a/code/mnm_analysis/univariate_fine_mapping_twas_vignette.ipynb b/code/mnm_analysis/univariate_fine_mapping_twas_vignette.ipynb deleted file mode 100644 index cebb30c83..000000000 --- a/code/mnm_analysis/univariate_fine_mapping_twas_vignette.ipynb +++ /dev/null @@ -1,610 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "5839a8ae-e13c-44ba-b06c-8346a0e65655", - "metadata": {}, - "source": [ - "# Univariate Fine-Mapping and TWAS with SuSiE\n" - ] - }, - { - "cell_type": "markdown", - "id": "eeeb172c-8630-45a9-9203-eb8c11d6d00a", - "metadata": {}, - "source": [ - "## Description" - ] - }, - { - "cell_type": "markdown", - "id": "4b860370-77bb-4095-b9b1-1613ed681bf4", - "metadata": {}, - "source": [ - "Our pipeline is capable of performing univariate fine-mapping with SuSiE with TWAS weights. The TWAS portion makes use of TWAS weights, linkage disequilibrium data and GWAS summary statistics. Preset variants used are taken from the linkage disequilibrium data and used only for TWAS. TWAS cross validation tell us which of the four methods (enet, lasso, mrash, SuSiE) are best to use. By default, we limit to under 5000 variants for cross validation. In cross validation, the data is split into five parts. Training is done on four parts, and prediction is done on the fifth. Linear regression is used to assess the results and get r squared and pvalues. \n", - "\n", - "Fine mapping with SuSiE follows the formulat y=xb+e where x has many highly correlated variables due to linkage disequilibrium. Therefore, true effects (b), are very sparse. The SuSiE wrapper looks for five independent signals in each region to increase convergence speed. However, if five signals are found, then the the upper limit is increased. SuSiE does not allow for the inclusion of covariates. Therefore, covariates are regressed in." - ] - }, - { - "cell_type": "markdown", - "id": "a48d1cb3-d073-44ef-853f-6980199e2b6d", - "metadata": {}, - "source": [ - "## Input:" - ] - }, - { - "cell_type": "markdown", - "id": "c31442aa-01ea-4053-bec6-48deebb724d5", - "metadata": {}, - "source": [ - " \n", - "`--genoFile`: path to a plink bed file containing genotypes. \n", - "`--phenoFile`: at least one tab delimited file containing chr, start, end, ID and path for the regions. For example:\n", - "```\n", - "#chr start end ID path\n", - "chr12 389272 389273 ENSG00000120647 $PATH/Ast.log2cpm.bed.gz\n", - "chr12 389319 389320 ENSG00000073614 $PATH/Ast.log2cpm.bed.gz\n", - "```\n", - "\n", - "`--covFile`: at least one tab delimited file containing covariates in the rows and samples in the columns. \n", - "`--customized-association-windows`: a tab delimited file containing chr, start, end, and gene_id. For example:\n", - "```\n", - "#chr start end gene_id\n", - "chr1 0 6480000 ENSG00000008128\n", - "chr1 0 6480000 ENSG00000008130\n", - "```\n", - "`----phenotype-names`: the names of the phenotypes if multiple are included. There should be one for each phenotype file you include. \n", - "`--max-cv-variants`: maximum number of variants for cross-validation. \n", - "`--ld_reference_meta_file`: path to file containing chrom, start, end and path for linkage disequilibrium region information. For example:\n", - "```\n", - "#chrom start end path\n", - "chr1 101384274 104443097 $PATH/chr1_101384274_104443097.cor.xz.bim\n", - "chr1 104443097 106225286 $PATH/chr1_104443097_106225286.cor.xz.bim\n", - "```\n", - "`--region-name`: if you only wish to analyze one region, then include the gene_id of a region found in the `customized-association-windows` file" - ] - }, - { - "cell_type": "markdown", - "id": "b7a02bb3-0100-4c28-8cd9-bd9f2ad20266", - "metadata": {}, - "source": [ - "## Output" - ] - }, - { - "cell_type": "markdown", - "id": "6021de35-de88-4d69-a2a9-78e2fc9e171e", - "metadata": {}, - "source": [ - "* `*.univariate_bvrs.rds`\n", - "```\n", - "> str(readRDS(\"/restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/mnm_regression/susie_twas/fine_mapping/test_susie_twas.chr11_ENSG00000073921.univariate_bvsr.rds\"))\n", - "List of 1\n", - " $ ENSG00000073921:List of 1\n", - " ..$ test_pheno_ENSG00000073921:List of 10\n", - " .. ..$ variant_names : chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n", - " .. ..$ analysis_script : chr \"options(warn=1)\\nlibrary(pecotmr)\\nphenotype_files = c(\\\"/restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_prot\"| __truncated__\n", - " .. ..$ other_quantities :List of 1\n", - " .. .. ..$ dropped_samples:List of 3\n", - " .. .. .. ..$ X : NULL\n", - " .. .. .. ..$ y : NULL\n", - " .. .. .. ..$ covar: NULL\n", - " .. ..$ sumstats :List of 5\n", - " .. .. ..$ betahat : Named num [1:9970] 0.00479 0.00573 0.09947 0.02035 0.02035 ...\n", - " .. .. .. ..- attr(*, \"names\")= chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n", - " .. .. ..$ sebetahat: Named num [1:9970] 0.0563 0.0568 0.0791 0.0551 0.0551 ...\n", - " .. .. .. ..- attr(*, \"names\")= chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n", - " .. .. ..$ z_scores : Named num [1:9970] 0.085 0.101 1.258 0.369 0.369 ...\n", - " .. .. .. ..- attr(*, \"names\")= chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n", - " .. .. ..$ p_values : Named num [1:9970] 0.932 0.92 0.208 0.712 0.712 ...\n", - " .. .. .. ..- attr(*, \"names\")= chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n", - " .. .. ..$ q_values : Named num [1:9970] 0.922 0.92 0.633 0.859 0.859 ...\n", - " .. .. .. ..- attr(*, \"names\")= chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n", - " .. ..$ sample_names : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. ..$ top_loci :'data.frame':\t3 obs. of 9 variables:\n", - " .. .. ..$ variant_id : chr [1:3] \"chr11:85179671:G:A\" \"chr11:85266101:G:C\" \"chr11:85314573:G:A\"\n", - " .. .. ..$ betahat : num [1:3] -0.743 -0.747 -0.747\n", - " .. .. ..$ sebetahat : num [1:3] 0.147 0.147 0.147\n", - " .. .. ..$ z : num [1:3] -5.06 -5.1 -5.1\n", - " .. .. ..$ maf : num [1:3] 0.0235 0.0233 0.0233\n", - " .. .. ..$ pip : num [1:3] 0.292 0.349 0.349\n", - " .. .. ..$ cs_coverage_0.95: num [1:3] 1 1 1\n", - " .. .. ..$ cs_coverage_0.7 : num [1:3] 1 1 1\n", - " .. .. ..$ cs_coverage_0.5 : num [1:3] 0 1 1\n", - " .. ..$ susie_result_trimmed :List of 12\n", - " .. .. ..$ pip : num [1:9970] 0.000431 0.00043 0.000673 0.000431 0.000431 ...\n", - " .. .. ..$ sets :List of 5\n", - " .. .. .. ..$ cs :List of 1\n", - " .. .. .. .. ..$ L1: int [1:3] 653 910 1069\n", - " .. .. .. ..$ purity :'data.frame':\t1 obs. of 3 variables:\n", - " .. .. .. .. ..$ min.abs.corr : num 1\n", - " .. .. .. .. ..$ mean.abs.corr : num 1\n", - " .. .. .. .. ..$ median.abs.corr: num 1\n", - " .. .. .. ..$ cs_index : int 1\n", - " .. .. .. ..$ coverage : num 0.989\n", - " .. .. .. ..$ requested_coverage: num 0.95\n", - " .. .. ..$ cs_corr : logi NA\n", - " .. .. ..$ sets_secondary :List of 2\n", - " .. .. .. ..$ coverage_0.7:List of 2\n", - " .. .. .. .. ..$ sets :List of 5\n", - " .. .. .. .. .. ..$ cs :List of 1\n", - " .. .. .. .. .. .. ..$ L1: int [1:3] 653 910 1069\n", - " .. .. .. .. .. ..$ purity :'data.frame':\t1 obs. of 3 variables:\n", - " .. .. .. .. .. .. ..$ min.abs.corr : num 1\n", - " .. .. .. .. .. .. ..$ mean.abs.corr : num 1\n", - " .. .. .. .. .. .. ..$ median.abs.corr: num 1\n", - " .. .. .. .. .. ..$ cs_index : int 1\n", - " .. .. .. .. .. ..$ coverage : num 0.989\n", - " .. .. .. .. .. ..$ requested_coverage: num 0.7\n", - " .. .. .. .. ..$ cs_corr: logi NA\n", - " .. .. .. ..$ coverage_0.5:List of 2\n", - " .. .. .. .. ..$ sets :List of 5\n", - " .. .. .. .. .. ..$ cs :List of 1\n", - " .. .. .. .. .. .. ..$ L1: int [1:2] 910 1069\n", - " .. .. .. .. .. ..$ purity :'data.frame':\t1 obs. of 3 variables:\n", - " .. .. .. .. .. .. ..$ min.abs.corr : num 1\n", - " .. .. .. .. .. .. ..$ mean.abs.corr : num 1\n", - " .. .. .. .. .. .. ..$ median.abs.corr: num 1\n", - " .. .. .. .. .. ..$ cs_index : int 1\n", - " .. .. .. .. .. ..$ coverage : num 0.697\n", - " .. .. .. .. .. ..$ requested_coverage: num 0.5\n", - " .. .. .. .. ..$ cs_corr: logi NA\n", - " .. .. ..$ alpha : num [1:8, 1:9970] 1.52e-07 6.35e-05 6.05e-05 5.93e-05 5.99e-05 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : NULL\n", - " .. .. .. .. ..$ : chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n", - " .. .. ..$ lbf_variable : num [1:8, 1:9970] -1.697 -0.402 -0.437 -0.451 -0.444 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : NULL\n", - " .. .. .. .. ..$ : chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n", - " .. .. ..$ V : num [1:8] 0.019 0.00097 0.0011 0.00115 0.00112 ...\n", - " .. .. ..$ niter : int 10\n", - " .. .. ..$ max_L : int 8\n", - " .. .. ..$ X_column_scale_factors: Named num [1:9970] 0.517 0.512 0.366 0.528 0.528 ...\n", - " .. .. .. ..- attr(*, \"names\")= chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n", - " .. .. ..$ mu : num [1:8, 1:9970] 0.00227 0.00643 0.00672 0.00683 0.00677 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : NULL\n", - " .. .. .. .. ..$ : chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n", - " .. .. ..$ mu2 : num [1:8, 1:9970] 0.000638 0.000432 0.000455 0.000464 0.00046 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : NULL\n", - " .. .. .. .. ..$ : chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n", - " .. .. ..- attr(*, \"class\")= chr \"susie\"\n", - " .. ..$ total_time_elapsed : 'proc_time' Named num [1:5] 32.189 0.339 43.517 0 0\n", - " .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n", - " .. ..$ region_info :List of 3\n", - " .. .. ..$ region_coord:'data.frame':\t1 obs. of 3 variables:\n", - " .. .. .. ..$ chrom: chr \"11\"\n", - " .. .. .. ..$ start: int 85957174\n", - " .. .. .. ..$ end : int 86069881\n", - " .. .. ..$ grange :'data.frame':\t1 obs. of 3 variables:\n", - " .. .. .. ..$ chrom: chr \"11\"\n", - " .. .. .. ..$ start: int 84957175\n", - " .. .. .. ..$ end : int 87360000\n", - " .. .. ..$ region_name : chr [1:2] \"ENSG00000073921\" \"ENSG00000073921\"\n", - " .. ..$ preset_variants_result:List of 9\n", - " .. .. ..$ susie_fitted :List of 22\n", - " .. .. .. ..$ alpha : num [1:8, 1:6454] 0.00015 0.000148 0.000148 0.000148 0.000149 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : NULL\n", - " .. .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. ..$ mu : num [1:8, 1:6454] 0.000134 0.000169 0.000177 0.000169 0.000152 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : NULL\n", - " .. .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. ..$ mu2 : num [1:8, 1:6454] 5.39e-05 6.83e-05 7.19e-05 6.85e-05 6.13e-05 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : NULL\n", - " .. .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. ..$ Xr : num [1:150] 0.000696 -0.000919 -0.002423 -0.00093 0.000305 ...\n", - " .. .. .. ..$ KL : num [1:8] 0.0345 0.0445 0.047 0.0446 0.0395 ...\n", - " .. .. .. ..$ lbf : num [1:8] 0.000194 0.000317 0.000353 0.000319 0.000253 ...\n", - " .. .. .. ..$ lbf_variable : num [1:8, 1:6454] -0.0334 -0.0427 -0.045 -0.0428 -0.0381 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : NULL\n", - " .. .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. ..$ sigma2 : num 0.124\n", - " .. .. .. ..$ V : num [1:8] 5.76e-05 7.44e-05 7.87e-05 7.46e-05 6.61e-05 ...\n", - " .. .. .. ..$ pi : num [1:6454] 0.000155 0.000155 0.000155 0.000155 0.000155 ...\n", - " .. .. .. ..$ null_index : num 0\n", - " .. .. .. ..$ correct_zR_discrepancy:List of 5\n", - " .. .. .. .. ..$ to_correct : logi FALSE\n", - " .. .. .. .. ..$ outlier_index : logi(0) \n", - " .. .. .. .. ..$ is_init : logi TRUE\n", - " .. .. .. .. ..$ outlier_stabilize : num 5\n", - " .. .. .. .. ..$ outlier_stable_count: num 0\n", - " .. .. .. ..$ force_iterate : logi FALSE\n", - " .. .. .. ..$ converged : logi TRUE\n", - " .. .. .. ..$ elbo : num [1:3] -56.5 -56.5 -56.5\n", - " .. .. .. ..$ niter : int 3\n", - " .. .. .. ..$ intercept : num -0.00273\n", - " .. .. .. ..$ fitted : Named num [1:150] 0.000696 -0.000919 -0.002423 -0.00093 0.000305 ...\n", - " .. .. .. .. ..- attr(*, \"names\")= chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. .. .. ..$ sets :List of 3\n", - " .. .. .. .. ..$ cs : NULL\n", - " .. .. .. .. ..$ coverage : NULL\n", - " .. .. .. .. ..$ requested_coverage: num 0.95\n", - " .. .. .. ..$ pip : Named num [1:6454] 0.0012 0.00124 0.0012 0.0012 0.0012 ...\n", - " .. .. .. .. ..- attr(*, \"names\")= chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. ..$ X_column_scale_factors: Named num [1:6454] 0.664 0.438 0.677 0.702 0.677 ...\n", - " .. .. .. .. ..- attr(*, \"names\")= chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. ..$ time_elapsed : 'proc_time' Named num [1:5] 0.851 0.008 0.861 0 0\n", - " .. .. .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n", - " .. .. .. ..- attr(*, \"class\")= chr \"susie\"\n", - " .. .. ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. ..$ analysis_script : chr \"options(warn=1)\\nlibrary(pecotmr)\\nphenotype_files = c(\\\"/restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_prot\"| __truncated__\n", - " .. .. ..$ other_quantities :List of 1\n", - " .. .. .. ..$ dropped_samples:List of 3\n", - " .. .. .. .. ..$ X : NULL\n", - " .. .. .. .. ..$ y : NULL\n", - " .. .. .. .. ..$ covar: NULL\n", - " .. .. ..$ sumstats :List of 5\n", - " .. .. .. ..$ betahat : Named num [1:6454] 0.0029 0.06935 0.01237 -0.00911 0.01237 ...\n", - " .. .. .. .. ..- attr(*, \"names\")= chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. ..$ sebetahat: Named num [1:6454] 0.0438 0.0661 0.0429 0.0414 0.0429 ...\n", - " .. .. .. .. ..- attr(*, \"names\")= chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. ..$ z_scores : Named num [1:6454] 0.0662 1.0485 0.288 -0.22 0.288 ...\n", - " .. .. .. .. ..- attr(*, \"names\")= chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. ..$ p_values : Named num [1:6454] 0.947 0.294 0.773 0.826 0.773 ...\n", - " .. .. .. .. ..- attr(*, \"names\")= chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. ..$ q_values : Named num [1:6454] 0.998 0.892 0.991 0.998 0.991 ...\n", - " .. .. .. .. ..- attr(*, \"names\")= chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. ..$ sample_names : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. .. ..$ susie_result_trimmed:List of 12\n", - " .. .. .. ..$ pip : num [1:6454] 0.0012 0.00124 0.0012 0.0012 0.0012 ...\n", - " .. .. .. ..$ sets :List of 3\n", - " .. .. .. .. ..$ cs : NULL\n", - " .. .. .. .. ..$ coverage : NULL\n", - " .. .. .. .. ..$ requested_coverage: num 0.95\n", - " .. .. .. ..$ cs_corr : logi NA\n", - " .. .. .. ..$ sets_secondary :List of 2\n", - " .. .. .. .. ..$ coverage_0.7:List of 2\n", - " .. .. .. .. .. ..$ sets :List of 3\n", - " .. .. .. .. .. .. ..$ cs : NULL\n", - " .. .. .. .. .. .. ..$ coverage : NULL\n", - " .. .. .. .. .. .. ..$ requested_coverage: num 0.7\n", - " .. .. .. .. .. ..$ cs_corr: logi NA\n", - " .. .. .. .. ..$ coverage_0.5:List of 2\n", - " .. .. .. .. .. ..$ sets :List of 3\n", - " .. .. .. .. .. .. ..$ cs : NULL\n", - " .. .. .. .. .. .. ..$ coverage : NULL\n", - " .. .. .. .. .. .. ..$ requested_coverage: num 0.5\n", - " .. .. .. .. .. ..$ cs_corr: logi NA\n", - " .. .. .. ..$ alpha : num [1:8, 1:6454] 0.00015 0.000148 0.000148 0.000148 0.000149 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : NULL\n", - " .. .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. ..$ lbf_variable : num [1:8, 1:6454] -0.0334 -0.0427 -0.045 -0.0428 -0.0381 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : NULL\n", - " .. .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. ..$ V : num [1:8] 5.76e-05 7.44e-05 7.87e-05 7.46e-05 6.61e-05 ...\n", - " .. .. .. ..$ niter : int 3\n", - " .. .. .. ..$ max_L : int 8\n", - " .. .. .. ..$ X_column_scale_factors: Named num [1:6454] 0.664 0.438 0.677 0.702 0.677 ...\n", - " .. .. .. .. ..- attr(*, \"names\")= chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. ..$ mu : num [1:8, 1:6454] 0.000134 0.000169 0.000177 0.000169 0.000152 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : NULL\n", - " .. .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. ..$ mu2 : num [1:8, 1:6454] 5.39e-05 6.83e-05 7.19e-05 6.85e-05 6.13e-05 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : NULL\n", - " .. .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. ..- attr(*, \"class\")= chr \"susie\"\n", - " .. .. ..$ total_time_elapsed : 'proc_time' Named num [1:5] 10.991 0.012 11.046 0 0\n", - " .. .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n", - " .. .. ..$ region_info :List of 3\n", - " .. .. .. ..$ region_coord:'data.frame':\t1 obs. of 3 variables:\n", - " .. .. .. .. ..$ chrom: chr \"11\"\n", - " .. .. .. .. ..$ start: int 85957174\n", - " .. .. .. .. ..$ end : int 86069881\n", - " .. .. .. ..$ grange :'data.frame':\t1 obs. of 3 variables:\n", - " .. .. .. .. ..$ chrom: chr \"11\"\n", - " .. .. .. .. ..$ start: int 84957175\n", - " .. .. .. .. ..$ end : int 87360000\n", - " .. .. .. ..$ region_name : chr [1:2] \"ENSG00000073921\" \"ENSG00000073921\"\n", - "```\n", - "\n", - "* `*.univariate_data.rds`\n", - "```\n", - "> str(readRDS(\"output/mnm_regression/susie_twas/data/test_susie_twas.chr11_ENSG00000073921.univariate_data.rds\"))\n", - "List of 1\n", - " $ ENSG00000073921:List of 10\n", - " ..$ residual_Y :List of 1\n", - " .. ..$ test_pheno: num [1:150, 1] 0.1241 0.2981 -0.6054 0.0435 0.0388 ...\n", - " .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " ..$ residual_X :List of 1\n", - " .. ..$ : num [1:150, 1:9970] -0.2386 -0.8293 0.4173 0.7408 -0.0231 ...\n", - " .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. .. .. ..$ : chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n", - " ..$ residual_Y_scalar: num 1\n", - " ..$ residual_X_scalar: num 1\n", - " ..$ dropped_sample :List of 3\n", - " .. ..$ X :List of 1\n", - " .. .. ..$ : chr(0) \n", - " .. ..$ Y :List of 1\n", - " .. .. ..$ : chr \"strand\"\n", - " .. ..$ covar:List of 1\n", - " .. .. ..$ : chr(0) \n", - " ..$ maf :List of 1\n", - " .. ..$ : Named num [1:9970] 0.33 0.327 0.103 0.343 0.343 ...\n", - " .. .. ..- attr(*, \"names\")= chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n", - " ..$ X : num [1:150, 1:9970] 0 0 1 1 1 0 0 1 1 1 ...\n", - " .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. .. ..$ : chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n", - " ..$ chrom : chr \"chr11\"\n", - " ..$ grange : chr [1:2] \"85957174\" \"86069881\"\n", - " ..$ X_variance :List of 1\n", - " .. ..$ : Named num [1:9970] 0.267 0.262 0.134 0.279 0.279 ...\n", - " .. .. ..- attr(*, \"names\")= chr [1:9970] \"chr11:84957209:G:C\" \"chr11:84957879:A:G\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" ...\n", - "```\n", - "\n", - "* `*.univariate_twas_weights.rds`\n", - "```\n", - "> str(readRDS(\"/restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/mnm_regression/susie_twas/twas_weights/test_susie_twas.chr11_ENSG00000073921.univariate_twas_weights.rds\"))\n", - "List of 1\n", - " $ ENSG00000073921:List of 1\n", - " ..$ test_pheno_ENSG00000073921:List of 7\n", - " .. ..$ susie_weights_intermediate:List of 4\n", - " .. .. ..$ mu : num [1:8, 1:6454] 0.000134 0.000169 0.000177 0.000169 0.000152 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : NULL\n", - " .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. ..$ lbf_variable : num [1:8, 1:6454] -0.0334 -0.0427 -0.045 -0.0428 -0.0381 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : NULL\n", - " .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. ..$ X_column_scale_factors: Named num [1:6454] 0.664 0.438 0.677 0.702 0.677 ...\n", - " .. .. .. ..- attr(*, \"names\")= chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. ..$ pip : Named num [1:6454] 0.0012 0.00124 0.0012 0.0012 0.0012 ...\n", - " .. .. .. ..- attr(*, \"names\")= chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. ..$ twas_weights :List of 6\n", - " .. .. ..$ enet_weights : num [1:6454, 1] 0 0 0 0 0 0 0 0 0 0 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. ..$ lasso_weights : num [1:6454, 1] 0 0 0 0 0 0 0 0 0 0 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. ..$ bayes_r_weights: num [1:6454, 1] 4.31e-05 4.73e-04 2.44e-04 -8.48e-05 2.30e-04 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. ..$ bayes_l_weights: num [1:6454, 1] 2.04e-05 3.66e-04 1.03e-04 -1.54e-04 3.47e-04 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. ..$ mrash_weights : num [1:6454, 1] 6.01e-05 2.05e-04 7.02e-05 -4.71e-05 7.02e-05 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. ..$ susie_weights : num [1:6454, 1] 2.55e-07 5.89e-06 1.00e-06 -7.53e-07 1.00e-06 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. ..$ twas_predictions :List of 6\n", - " .. .. ..$ enet_predicted : num [1:150, 1] 0 0 0 0 0 0 0 0 0 0 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. ..$ lasso_predicted : num [1:150, 1] 0 0 0 0 0 0 0 0 0 0 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. ..$ bayes_r_predicted: num [1:150, 1] 0.1971 0.1333 -0.0176 0.1234 0.2194 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. ..$ bayes_l_predicted: num [1:150, 1] 0.20151 0.13565 -0.00533 0.13823 0.23318 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. ..$ mrash_predicted : num [1:150, 1] 0.08183 0.05465 -0.00379 0.05346 0.09283 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. ..$ susie_predicted : num [1:150, 1] 0.003423 0.001808 0.000304 0.001797 0.003032 ...\n", - " .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. ..$ twas_cv_result :List of 4\n", - " .. .. ..$ sample_partition:'data.frame':\t150 obs. of 2 variables:\n", - " .. .. .. ..$ Sample: chr [1:150] \"sample108\" \"sample52\" \"sample103\" \"sample29\" ...\n", - " .. .. .. ..$ Fold : int [1:150] 1 1 1 1 1 1 1 1 1 1 ...\n", - " .. .. ..$ prediction :List of 4\n", - " .. .. .. ..$ bayes_r_predicted: num [1:150, 1] 0.3606 -0.0263 0.0687 0.0989 0.1158 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. .. ..$ bayes_l_predicted: num [1:150, 1] 0.2274 -0.0389 0.0379 0.0513 0.0706 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. .. ..$ mrash_predicted : num [1:150, 1] 0.0692 -0.0021 0.0183 0.0231 0.0353 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. .. ..$ susie_predicted : num [1:150, 1] 0.05794 -0.00105 -0.01131 0 0 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : chr [1:150] \"sample0\" \"sample1\" \"sample10\" \"sample100\" ...\n", - " .. .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. ..$ performance :List of 4\n", - " .. .. .. ..$ bayes_r_performance: num [1, 1:6] 0.022483 0.000505 -0.006248 0.784784 0.40514 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. .. .. .. ..$ : chr [1:6] \"corr\" \"rsq\" \"adj_rsq\" \"pval\" ...\n", - " .. .. .. ..$ bayes_l_performance: num [1, 1:6] -8.04e-03 6.47e-05 -6.69e-03 9.22e-01 3.91e-01 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. .. .. .. ..$ : chr [1:6] \"corr\" \"rsq\" \"adj_rsq\" \"pval\" ...\n", - " .. .. .. ..$ mrash_performance : num [1, 1:6] 0.05206 0.00271 -0.00403 0.52697 0.35394 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. .. .. .. ..$ : chr [1:6] \"corr\" \"rsq\" \"adj_rsq\" \"pval\" ...\n", - " .. .. .. ..$ susie_performance : num [1, 1:6] -0.03878 0.0015 -0.00524 0.6375 0.35463 ...\n", - " .. .. .. .. ..- attr(*, \"dimnames\")=List of 2\n", - " .. .. .. .. .. ..$ : chr \"ENSG00000073921\"\n", - " .. .. .. .. .. ..$ : chr [1:6] \"corr\" \"rsq\" \"adj_rsq\" \"pval\" ...\n", - " .. .. ..$ time_elapsed : 'proc_time' Named num [1:5] 188.023 0.581 189.175 0 0\n", - " .. .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n", - " .. ..$ total_time_elapsed : 'proc_time' Named num [1:5] 242.761 0.972 249.471 0.013 0.024\n", - " .. .. ..- attr(*, \"names\")= chr [1:5] \"user.self\" \"sys.self\" \"elapsed\" \"user.child\" ...\n", - " .. ..$ variant_names : chr [1:6454] \"chr11:84957209:G:C\" \"chr11:84958236:C:A\" \"chr11:84958954:T:C\" \"chr11:84959602:T:C\" ...\n", - " .. ..$ region_info :List of 3\n", - " .. .. ..$ region_coord:'data.frame':\t1 obs. of 3 variables:\n", - " .. .. .. ..$ chrom: chr \"11\"\n", - " .. .. .. ..$ start: int 85957174\n", - " .. .. .. ..$ end : int 86069881\n", - " .. .. ..$ grange :'data.frame':\t1 obs. of 3 variables:\n", - " .. .. .. ..$ chrom: chr \"11\"\n", - " .. .. .. ..$ start: int 84957175\n", - " .. .. .. ..$ end : int 87360000\n", - " .. .. ..$ region_name : chr [1:2] \"ENSG00000073921\" \"ENSG00000073921\"\n", - "```\n" - ] - }, - { - "cell_type": "markdown", - "id": "397a4dc3-1e4c-4845-ad8b-beed40f1f408", - "metadata": {}, - "source": [ - "## Minimal Working Example Steps" - ] - }, - { - "cell_type": "markdown", - "id": "3bca0962-ae57-422c-a79f-892ef3b7f7ae", - "metadata": {}, - "source": [ - "### i. [Run the Fine-Mapping and TWAS with SuSiE](https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_methods/mnm_regression.html#i-susie-with-twas-weights-from-multiple-methods)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "dff2c979-52a1-4ec8-a0be-51b289327d37", - "metadata": {}, - "outputs": [], - "source": [ - "sos run pipeline/mnm_regression.ipynb susie_twas \\\n", - " --name test_susie_twas \\\n", - " --genoFile output/genotype_by_chrom/wgs.merged.plink_qc.1.bed \\\n", - " --phenoFile output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt \\\n", - " --covFile output/covariate/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \\\n", - " --customized-association-windows reference_data/TAD/TADB_enhanced_cis.bed \\\n", - " --phenotype-names test_pheno \\\n", - " --max-cv-variants 5000 --ld_reference_meta_file data/ld_meta_file_with_bim.tsv \\\n", - " --region-name ENSG00000049246 ENSG00000054116 ENSG00000116678 \\\n", - " --save-data \\\n", - " --cwd output/mnm_regression/susie_twas" - ] - }, - { - "cell_type": "markdown", - "id": "1c046946-a374-4a63-93ff-3c71c923c627", - "metadata": {}, - "source": [ - "```\n", - "INFO: Running get_analysis_regions: \n", - "Loading customized association analysis window from reference_data/TAD/TADB_enhanced_cis.bed\n", - "INFO: get_analysis_regions is completed.\n", - "INFO: get_analysis_regions output: regional_data\n", - "INFO: Running susie_twas: \n", - "INFO: susie_twas (index=0) is completed.\n", - "INFO: susie_twas (index=1) is completed.\n", - "INFO: susie_twas (index=2) is completed.\n", - "INFO: susie_twas output: /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/mnm_regression/susie_twas/fine_mapping/test_susie_twas.chr1_ENSG00000049246.univariate_bvsr.rds /restricted/projectnb/xqtl/xqtl_protocol/toy_xqtl_protocol/output/mnm_regression/susie_twas/twas_weights/test_susie_twas.chr1_ENSG00000049246.univariate_twas_weights.rds... (6 items in 3 groups)\n", - "INFO: Workflow susie_twas (ID=w487c342397d09ccb) is executed successfully with 2 completed steps and 4 completed substeps.\n", - "```" - ] - }, - { - "cell_type": "markdown", - "id": "cc7691e6-954c-4dc1-bac3-25d56026ab96", - "metadata": {}, - "source": [ - "## Anticipated Results" - ] - }, - { - "cell_type": "markdown", - "id": "10a535c4-8121-4c89-9958-a8ce88a970b6", - "metadata": {}, - "source": [ - "Univariate finemapping will produce a file containing results for the top hits and a file containing twas weights produced by susie. " - ] - }, - { - "cell_type": "markdown", - "id": "199a2637-cc0c-4aff-b4ea-a2546936f073", - "metadata": {}, - "source": [ - "`ROSMAP_mega_eQTL.chr11_ENSG00000073921.univariate_bvrs.rds`:\n", - "* For each gene of interest and phenotype, this file contains:\n", - " 1. susie_fitted\n", - " 2. variant_names\n", - " 3. analysis_script\n", - " 4. other quantities - information on dropped samples, if any\n", - " 5. summary statistics (beta and se) for each variant\n", - " 6. sample names\n", - " 7. summary statistics for top loci\n", - " 8. susie results trimmed - includes pip, sets, cs_corr, sets_secondary, alpha, lbf_variable, V, niter, X_column_scale_factors, mu, mu2\n", - " 9. total time elapsed\n", - " 10. region info\n", - " 11. preset_variants_results\n", - "\n", - "`ROSMAP_mega_eQTL.chr11_ENSG00000073921.univariate_data.rds` (from the --save-data argument):\n", - "* For each gene of interest, contains residuals for each sample and phenotype\n", - "* see [pecotmr code](https://github.com/statfungen/pecotmr/blob/68d87ca1d0a059022bf4e55339621cbddc8993cc/R/file_utils.R#L461) for description at `load_regional_association_data` function\n", - "\n", - "`ROSMAP_mega_eQTL.chr11_ENSG00000073921.univariate_twas_weights.rds`:\n", - "* For each gene of interest and phenotype, this file contains:\n", - " 1. twas_weights - weights for enet, lasso, bayes r, mrash and susie methods\n", - " 2. twas_predictions - twas predictions for enet, lasso, bayes r, mrash and susie methods\n", - " 3. twas_cv_result\n", - " 4. total_time_elapsed\n", - " 5. variant_names\n", - " 6. region_info\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "id": "894acfcd-54f2-416e-adf7-9432a321ec6c", - "metadata": {}, - "outputs": [], - "source": [] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.10.12" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} From ebc1c5c8295019b938dd37ef56a93d9daa7379e7 Mon Sep 17 00:00:00 2001 From: Jenny Empawi <80464805+jaempawi@users.noreply.github.com> Date: Fri, 24 Apr 2026 17:03:25 -0400 Subject: [PATCH 2/2] Update with peak as input --- ...nivariate_fine_mapping_twas_vignette.ipynb | 292 ++++++++++++++++++ 1 file changed, 292 insertions(+) create mode 100644 code/mnm_analysis/univariate_fine_mapping_twas_vignette.ipynb diff --git a/code/mnm_analysis/univariate_fine_mapping_twas_vignette.ipynb b/code/mnm_analysis/univariate_fine_mapping_twas_vignette.ipynb new file mode 100644 index 000000000..a7751d11e --- /dev/null +++ b/code/mnm_analysis/univariate_fine_mapping_twas_vignette.ipynb @@ -0,0 +1,292 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Univariate Fine-Mapping and TWAS with SuSiE\n", + "\n", + "## Overview\n", + "\n", + "This pipeline performs per-region univariate fine-mapping with SuSiE and computes TWAS\n", + "weights, via the `susie_twas` workflow of `pipeline/mnm_regression.ipynb`. For each\n", + "region the workflow fits SuSiE on the covariate-residualized phenotype × genotype\n", + "matrix and then cross-validates five prediction models (**enet**, **lasso**,\n", + "**bayes_r**, **mrash**, **susie**) to pick TWAS weights.\n", + "\n", + "SuSiE searches for up to five independent signals per region; if all five are found,\n", + "the cap is raised. SuSiE itself does not accept covariates, so covariates are regressed\n", + "out of both the phenotype and the genotype matrix beforehand.\n", + "\n", + "### What is a \"region\"?\n", + "\n", + "A region is **one row in the phenotype manifest** (`--phenoFile`) plus its association\n", + "window (from `--customized-association-windows`, or a symmetric `--cis-window` around\n", + "`start/end`). One `susie_twas` call fits one SuSiE + TWAS model per region.\n", + "\n", + "* **Gene-based phenotypes.** Manifest row = one gene; window = gene's containing TAD (or\n", + " ± cis-window around the TSS).\n", + "* **Peak-based phenotypes (caQTL, mQTL).** Manifest row = one peak/CpG; window = the\n", + " peak's containing [extended-TAD](./../reference_data/generalized_TADB.html). A peak is\n", + " 2–5 kb and cannot serve as its own window.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 1 -- Prepare the phenotype manifest and association-windows BED\n", + "\n", + "The pipeline expects two derived files per experiment:\n", + "\n", + "* `pheno_manifest.tsv` — 5 columns (`#chr start end ID path`), one row per region.\n", + " `path` points to a bgzipped+tabixed phenotype BED restricted to that region.\n", + "* `association_windows.bed` — 4 columns (`chr start end ID`), one row per region.\n", + " `start/end` define the cis search window; `ID` must match a row in the manifest.\n", + "\n", + "For gene-based phenotypes these files are produced by the standard phenotype\n", + "preprocessing + [TADB generation](./../reference_data/generalized_TADB.html) steps.\n", + "\n", + "For peak-based phenotypes (caQTL, mQTL) you can derive both from the raw peak BED with\n", + "the helper below; it renames the peak-ID header, splits the multi-sample BED into one\n", + "bgz+tbi per peak, writes the manifest, and annotates each peak with its smallest\n", + "containing extended-TAD." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "bash code/misc/preprocess/prepare_peak_inputs.sh \\\n", + " \\\n", + " \\\n", + " \n", + "\n", + "# Produces:\n", + "# /pheno_manifest.tsv\n", + "# /association_windows.bed\n", + "# /peaks_split/.bed.gz{,.tbi}\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Step 2 -- Run `susie_twas`\n", + "\n", + "**Gene-based example** (3 genes, default bulk-RNA-seq preprocessing):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sos run pipeline/mnm_regression.ipynb susie_twas \\\n", + " --name test_susie_twas \\\n", + " --cwd output/mnm_regression/susie_twas \\\n", + " --genoFile output/genotype_by_chrom/wgs.merged.plink_qc.1.bed \\\n", + " --phenoFile output/phenotype/phenotype_by_chrom_for_cis/bulk_rnaseq.phenotype_by_chrom_files.region_list.txt \\\n", + " --covFile output/covariate/bulk_rnaseq_tmp_matrix.low_expression_filtered.outlier_removed.tmm.expression.covariates.wgs.merged.plink_qc.plink_qc.prune.pca.Marchenko_PC.gz \\\n", + " --customized-association-windows reference_data/TAD/TADB_enhanced_cis.bed \\\n", + " --phenotype-names test_pheno \\\n", + " --max-cv-variants 5000 \\\n", + " --ld_reference_meta_file data/ld_meta_file_with_bim.tsv \\\n", + " --region-name ENSG00000049246 ENSG00000054116 ENSG00000116678 \\\n", + " --save-data\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "**Peak-based example** (caQTL; 10 chr22 peaks, 1,287 samples):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sos run pipeline/mnm_regression.ipynb susie_twas \\\n", + " --name example_10peaks_twas \\\n", + " --cwd output/susie_twas_out \\\n", + " --genoFile input/example.chr22.bed \\\n", + " --phenoFile output/inputs_ready/pheno_manifest.tsv \\\n", + " --covFile input/example_covariates.tsv \\\n", + " --customized-association-windows output/inputs_ready/association_windows.bed \\\n", + " --phenotype-names caQTL \\\n", + " --max-cv-variants 5000 \\\n", + " --save-data\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Expected log (both cases):\n", + "\n", + "```\n", + "INFO: Running get_analysis_regions:\n", + "Loading customized association analysis window from \n", + "INFO: get_analysis_regions is completed.\n", + "INFO: get_analysis_regions output: regional_data\n", + "INFO: Running susie_twas:\n", + "INFO: susie_twas (index=0) is completed.\n", + "INFO: susie_twas (index=1) is completed.\n", + "...\n", + "INFO: susie_twas output: /fine_mapping/._.univariate_bvsr.rds\n", + " /twas_weights/._.univariate_twas_weights.rds\n", + " ... (2 × N_regions items)\n", + "INFO: Workflow susie_twas is executed successfully.\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Command Interface" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sos run pipeline/mnm_regression.ipynb susie_twas -h\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Flags consumed by `susie_twas`:\n", + "\n", + "* `--genoFile` — path to a PLINK1 `.bed` file (pass the `.bed` path, not the prefix).\n", + " FID/IID in the `.fam` must match the phenotype/covariate sample IDs.\n", + "* `--phenoFile` — 5-column `#chr start end ID path` manifest. One row per region; `path`\n", + " points at a bgzipped+tabixed phenotype BED restricted to that region. Pass multiple\n", + " files for multi-condition analyses.\n", + "* `--covFile` — tab-delimited, literal `ID` header in column 1; covariate names in rows,\n", + " sample IDs across the header row.\n", + "* `--customized-association-windows` — 4-column `chr start end ID` BED. Required unless\n", + " `--cis-window` is non-negative.\n", + "* `--cis-window` — symmetric cis radius (bp) around each region's `start/end`. Default\n", + " `-1` (forces the customized-windows path).\n", + "* `--phenotype-names` — condition name(s), one per `--phenoFile`.\n", + "* `--max-cv-variants` — cap on variants used in cross-validation (default 5,000).\n", + "* `--ld_reference_meta_file` — summary-stats / RSS mode only; omit for individual-level.\n", + "* `--region-name` — restrict to one or more region IDs for sanity checks.\n", + "* `--save-data` — also emit `*.univariate_data.rds` with the residualized X/Y matrices.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Output format\n", + "\n", + "Per region the workflow writes two RDS files (three with `--save-data`), keyed by\n", + "region `ID` and phenotype condition.\n", + "\n", + "### `._.univariate_bvsr.rds`\n", + "\n", + "* `variant_names` — variant IDs kept after QC.\n", + "* `analysis_script` — the R script that was `source()`d (provenance).\n", + "* `other_quantities$dropped_samples` — per-matrix sample drops.\n", + "* `sumstats` — `betahat`, `sebetahat`, `z_scores`, `p_values`, `q_values`.\n", + "* `sample_names` — samples actually used.\n", + "* `top_loci` — data frame: `variant_id, betahat, sebetahat, z, maf, pip, cs_coverage_0.95 / 0.7 / 0.5`.\n", + "* `susie_result_trimmed` — trimmed SuSiE fit: `pip, sets, cs_corr, sets_secondary, alpha, lbf_variable, V, niter, X_column_scale_factors, mu, mu2`.\n", + "* `total_time_elapsed` — `proc_time`.\n", + "* `region_info` — `region_coord` (chrom/start/end), `grange` (expanded window), `region_name` (the manifest `ID`).\n", + "* `preset_variants_result` — SuSiE fit on the preset-variant subset used for TWAS.\n", + "\n", + "### `._.univariate_twas_weights.rds`\n", + "\n", + "* `susie_weights_intermediate` — `mu, lbf_variable, X_column_scale_factors, pip` on the preset variant set.\n", + "* `twas_weights` — per-method weights: `enet_weights, lasso_weights, bayes_r_weights, bayes_l_weights, mrash_weights, susie_weights`.\n", + "* `twas_predictions` — same six methods, each a (sample × condition) matrix.\n", + "* `twas_cv_result` — `sample_partition` (fold assignment), `prediction`, `performance` (per method: `corr, rsq, adj_rsq, pval, ...`), `time_elapsed`.\n", + "* `total_time_elapsed`, `variant_names`, `region_info`.\n", + "\n", + "### `._.univariate_data.rds` (`--save-data`)\n", + "\n", + "* `residual_Y`, `residual_X` — phenotype/genotype residualized for covariates.\n", + "* `residual_Y_scalar`, `residual_X_scalar` — rescaling factors.\n", + "* `dropped_sample` — per-matrix drops.\n", + "* `maf` — named numeric (`variant_id → maf`).\n", + "* `X` — raw genotype matrix (sample × variant).\n", + "* `chrom`, `grange`, `X_variance`.\n", + "\n", + "See [`load_regional_association_data` in pecotmr](https://github.com/statfungen/pecotmr/blob/68d87ca1d0a059022bf4e55339621cbddc8993cc/R/file_utils.R#L461)\n", + "for the full struct.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Expected Results\n", + "\n", + "Quick sanity check — inspect cross-validation performance for the first region:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "Rscript -e '\n", + "fp <- \"output/mnm_regression/susie_twas/twas_weights/test_susie_twas.chr1_ENSG00000049246.univariate_twas_weights.rds\"\n", + "x <- readRDS(fp)\n", + "cat(\"methods:\", names(x[[1]][[1]]$twas_weights), \"\\n\")\n", + "print(x[[1]][[1]]$twas_cv_result$performance)\n", + "'\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Interpretation:\n", + "\n", + "* `twas_weights$` — use for TWAS (`Z ≈ weights · LD · GWAS z-scores`).\n", + "* `twas_cv_result$performance` — pick the method with highest `adj_rsq` before applying.\n", + "* `top_loci` in `univariate_bvsr.rds` — lead variants and credible-set coverages.\n", + "* `susie_result_trimmed$sets$cs` — indices of variants in each 95% credible set.\n", + "\n", + "### Common pitfalls\n", + "\n", + "* **Sample-ID drift.** The phenotype BED's header columns (5+), the bfile FID/IID, and\n", + " the covariate TSV header must all be drawn from the same set. Sos does not warn on\n", + " missing samples — it silently drops them and may report zero overlap.\n", + "* **Empty regions.** If a region has no variants in its association window (e.g. the\n", + " manifest points at a window with no bfile coverage) sos marks it completed with an\n", + " empty output. Verify the expected file count in the `INFO: susie_twas output:` line.\n", + "* **`--cis-window` vs `--customized-association-windows`.** Either/or. If both are\n", + " passed the customized windows win. Pass `--cis-window -1` to force the customized\n", + " path explicitly.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Bash", + "language": "bash", + "name": "bash" + }, + "language_info": { + "name": "bash" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} \ No newline at end of file