diff --git a/DESCRIPTION b/DESCRIPTION
index b140565..dd15a13 100644
--- a/DESCRIPTION
+++ b/DESCRIPTION
@@ -12,7 +12,10 @@ Authors@R: c(
 person(given = "Gao", family = "Wang", email = "wang.gao@columbia.edu", role = c("aut", "cph"))
 )
 Maintainer: Xuewei Cao
-Description: A multi-task learning approach to variable selection regression with highly correlated predictors and sparse effects, based on frequentist statistical inference. It provides statistical evidence to identify which subsets of predictors have non-zero effects on which subsets of response variables, motivated and designed for colocalization analysis of multiple genetic association studies.
+Description: A multi-task learning approach to variable selection regression with highly correlated predictors and sparse effects,
+ based on frequentist statistical inference. It provides statistical evidence to identify which subsets of predictors have non-zero
+ effects on which subsets of response variables, motivated and designed for colocalization analysis of multiple genetic association studies.
+ The ColocBoost model is described in Cao etc (2025) .
 Encoding: UTF-8
 LazyDataCompression: xz
 LazyData: true
diff --git a/R/data.R b/R/data.R
index 4724c18..4646eef 100644
--- a/R/data.R
+++ b/R/data.R
@@ -12,6 +12,8 @@
 #' @source The Ind_5traits dataset contains 5 simulated phenotypes alongside corresponding genotype matrices.
 #' The dataset is specifically designed for evaluating and demonstrating the capabilities of ColocBoost in multi-trait colocalization analysis
 #' with individual-level data. See Cao etc. 2025 for details.
+#' Due to the file size limitation of CRAN release, this is a subset of simulated data.
+#' See full dataset in colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
 #'
 #' @family colocboost_data
 "Ind_5traits"
@@ -30,6 +32,8 @@
 #' where it is directly derived from the Ind_5traits dataset using marginal association.
 #' The dataset is specifically designed for evaluating and demonstrating the capabilities of ColocBoost
 #' in multi-trait colocalization analysis with summary association data. See Cao etc. 2025 for details.
+#' Due to the file size limitation of CRAN release, this is a subset of simulated data.
+#' See full dataset in colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
 #'
 #' @family colocboost_data
 "Sumstat_5traits"
@@ -48,7 +52,8 @@
 #' }
 #' @source The Heterogeneous_Effect dataset contains 2 simulated phenotypes alongside corresponding genotype matrices.
 #' There are two causal variants, both of which have heterogeneous effects on two traits.
-#' See Figure 2b in Cao etc. 2025 for details.
+#' Due to the file size limitation of CRAN release, this is a subset of simulated data to generate Figure 2b in Cao etc. 2025.
+#' See full dataset in colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
 #'
 #' @family colocboost_data
 "Heterogeneous_Effect"
@@ -67,7 +72,8 @@
 #' }
 #' @source The Weaker_GWAS_Effect dataset contains 2 simulated phenotypes alongside corresponding genotype matrices.
 #' There are two causal variants, one of which has a weaker effect on the focal trait compared to the other trait.
-#' See Figure 2b in Cao etc. 2025 for details.
+#' Due to the file size limitation of CRAN release, this is a subset of simulated data to generate Figure 2b in Cao etc. 2025.
+#' See full dataset in colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
 #'
 #' @family colocboost_data
 "Weaker_GWAS_Effect"
@@ -86,7 +92,8 @@
 #' }
 #' @source The Non_Causal_Strongest_Marginal dataset contains 2 simulated phenotypes alongside corresponding genotype matrices.
 #' There are two causal variants, but the strongest marginal association is not a causal variant.
-#' See Figure 2b in Cao etc. 2025 for details.
+#' Due to the file size limitation of CRAN release, this is a subset of simulated data to generate Figure 2b in Cao etc. 2025.
+#' See full dataset in colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
 #'
 #' @family colocboost_data
 "Non_Causal_Strongest_Marginal"
diff --git a/README.md b/README.md
index 8c441c3..ee5696d 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,7 @@
 [![Codecov test coverage](https://codecov.io/gh/StatFunGen/colocboost/branch/main/graph/badge.svg)](https://codecov.io/gh/StatFunGen/colocboost?branch=main)
 [![CRAN Version](https://www.r-pkg.org/badges/version/colocboost)](https://cran.r-project.org/package=colocboost)
-![](man/figures/colocboost.png)
+![](man/figures/colocboost.jpg)
 This R package implements ColocBoost---motivated and designed for colocalization analysis of multiple genetic association studies---as a multi-task learning approach to variable selection regression with highly correlated predictors and sparse effects, based on frequentist statistical inference. It provides statistical evidence to identify which subsets of predictors have non-zero effects on which subsets of response variables.
diff --git a/data/Heterogeneous_Effect.rda b/data/Heterogeneous_Effect.rda
index 4bab8bb..a0dc02c 100644
Binary files a/data/Heterogeneous_Effect.rda and b/data/Heterogeneous_Effect.rda differ
diff --git a/data/Ind_5traits.rda b/data/Ind_5traits.rda
index c2e6117..90c01ed 100644
Binary files a/data/Ind_5traits.rda and b/data/Ind_5traits.rda differ
diff --git a/data/Non_Causal_Strongest_Marginal.rda b/data/Non_Causal_Strongest_Marginal.rda
index 1ca17c0..ed6920d 100644
Binary files a/data/Non_Causal_Strongest_Marginal.rda and b/data/Non_Causal_Strongest_Marginal.rda differ
diff --git a/data/Sumstat_5traits.rda b/data/Sumstat_5traits.rda
index 45adba2..a2b90e9 100644
Binary files a/data/Sumstat_5traits.rda and b/data/Sumstat_5traits.rda differ
diff --git a/data/Weaker_GWAS_Effect.rda b/data/Weaker_GWAS_Effect.rda
index 8330d4a..706f429 100644
Binary files a/data/Weaker_GWAS_Effect.rda and b/data/Weaker_GWAS_Effect.rda differ
diff --git a/inst/WORDLIST b/inst/WORDLIST
index 82fadaf..f35fb3c 100644
--- a/inst/WORDLIST
+++ b/inst/WORDLIST
@@ -43,6 +43,7 @@ colocalized
 conda
 de
 decayrate
+doi
 eQTL
 grey
 iteratively
@@ -67,6 +68,7 @@ rcond
 recalibrate
 recalibrated
 reconciliate
+repo
 rss
 sQTL
 subsampled
diff --git a/man/Heterogeneous_Effect.Rd b/man/Heterogeneous_Effect.Rd
index a8de914..66ecf60 100644
--- a/man/Heterogeneous_Effect.Rd
+++ b/man/Heterogeneous_Effect.Rd
@@ -18,7 +18,8 @@ A list with 3 elements
 \source{
 The Heterogeneous_Effect dataset contains 2 simulated phenotypes alongside corresponding genotype matrices.
 There are two causal variants, both of which have heterogeneous effects on two traits.
-See Figure 2b in Cao etc. 2025 for details.
+Due to the file size limitation of CRAN release, this is a subset of simulated data to generate Figure 2b in Cao etc. 2025.
+See full dataset in colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
 }
 \usage{
 Heterogeneous_Effect
diff --git a/man/Ind_5traits.Rd b/man/Ind_5traits.Rd
index 4026c98..79ad10d 100644
--- a/man/Ind_5traits.Rd
+++ b/man/Ind_5traits.Rd
@@ -19,6 +19,8 @@ A list with 3 elements
 The Ind_5traits dataset contains 5 simulated phenotypes alongside corresponding genotype matrices.
 The dataset is specifically designed for evaluating and demonstrating the capabilities of ColocBoost in multi-trait colocalization analysis
 with individual-level data. See Cao etc. 2025 for details.
+Due to the file size limitation of CRAN release, this is a subset of simulated data.
+See full dataset in colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
 }
 \usage{
 Ind_5traits
diff --git a/man/Non_Causal_Strongest_Marginal.Rd b/man/Non_Causal_Strongest_Marginal.Rd
index 53f654b..132270e 100644
--- a/man/Non_Causal_Strongest_Marginal.Rd
+++ b/man/Non_Causal_Strongest_Marginal.Rd
@@ -18,7 +18,8 @@ A list with 3 elements
 \source{
 The Non_Causal_Strongest_Marginal dataset contains 2 simulated phenotypes alongside corresponding genotype matrices.
 There are two causal variants, but the strongest marginal association is not a causal variant.
-See Figure 2b in Cao etc. 2025 for details.
+Due to the file size limitation of CRAN release, this is a subset of simulated data to generate Figure 2b in Cao etc. 2025.
+See full dataset in colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
 }
 \usage{
 Non_Causal_Strongest_Marginal
diff --git a/man/Sumstat_5traits.Rd b/man/Sumstat_5traits.Rd
index 3ba28c8..e54ce60 100644
--- a/man/Sumstat_5traits.Rd
+++ b/man/Sumstat_5traits.Rd
@@ -19,6 +19,8 @@ The Sumstat_5traits dataset contains 5 simulated summary statistics,
 where it is directly derived from the Ind_5traits dataset using marginal association.
 The dataset is specifically designed for evaluating and demonstrating the capabilities of ColocBoost
 in multi-trait colocalization analysis with summary association data. See Cao etc. 2025 for details.
+Due to the file size limitation of CRAN release, this is a subset of simulated data.
+See full dataset in colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
 }
 \usage{
 Sumstat_5traits
diff --git a/man/Weaker_GWAS_Effect.Rd b/man/Weaker_GWAS_Effect.Rd
index 87ba36b..0920810 100644
--- a/man/Weaker_GWAS_Effect.Rd
+++ b/man/Weaker_GWAS_Effect.Rd
@@ -18,7 +18,8 @@ A list with 3 elements
 \source{
 The Weaker_GWAS_Effect dataset contains 2 simulated phenotypes alongside corresponding genotype matrices.
 There are two causal variants, one of which has a weaker effect on the focal trait compared to the other trait.
-See Figure 2b in Cao etc. 2025 for details.
+Due to the file size limitation of CRAN release, this is a subset of simulated data to generate Figure 2b in Cao etc. 2025.
+See full dataset in colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
 }
 \usage{
 Weaker_GWAS_Effect
diff --git a/man/figures/colocboost.jpg b/man/figures/colocboost.jpg
new file mode 100644
index 0000000..9dc02c1
Binary files /dev/null and b/man/figures/colocboost.jpg differ
diff --git a/man/figures/colocboost.png b/man/figures/colocboost.png
deleted file mode 100644
index 6c976f8..0000000
Binary files a/man/figures/colocboost.png and /dev/null differ
diff --git a/man/figures/missing_representation.png b/man/figures/missing_representation.png
index ad21a5c..ec4bc1d 100644
Binary files a/man/figures/missing_representation.png and b/man/figures/missing_representation.png differ
diff --git a/vignettes/Ambiguous_Colocalization.Rmd b/vignettes/Ambiguous_Colocalization.Rmd
index 50fe8ad..3604bac 100644
--- a/vignettes/Ambiguous_Colocalization.Rmd
+++ b/vignettes/Ambiguous_Colocalization.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
 collapse = TRUE,
- comment = "#>"
+ comment = "#>",
+ dpi = 80
 )
 ```
diff --git a/vignettes/Disease_Prioritized_Colocalization.Rmd b/vignettes/Disease_Prioritized_Colocalization.Rmd
index d39cf7f..714f390 100644
--- a/vignettes/Disease_Prioritized_Colocalization.Rmd
+++ b/vignettes/Disease_Prioritized_Colocalization.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
 collapse = TRUE,
- comment = "#>"
+ comment = "#>",
+ dpi = 80
 )
 ```
@@ -33,10 +34,10 @@ To get started, load both `Ind_5traits` and `Sumstat_5traits` datasets into your
 - Note that `LD` could be calculated from the `X` data in the `Ind_5traits` dataset, but it is not included in the `Sumstat_5traits` dataset.
 ### Causal variant structure
-The dataset features two causal variants with indices 644 and 2289.
+The dataset features two causal variants with indices 194 and 589.
-- Causal variant 644 is associated with traits 1, 2, 3, and 4.
-- Causal variant 2289 is associated with traits 2, 3, and 5 (summary level data).
+- Causal variant 194 is associated with traits 1, 2, 3, and 4.
+- Causal variant 589 is associated with traits 2, 3, and 5 (summary level data).
 ```{r load-mixed-data}
 # Load example data
@@ -54,6 +55,8 @@ For analyze a specific one type of data, you can refer to the following tutorials
 [Individual Level Data Colocalization](https://statfungen.github.io/colocboost/articles/Individual_Level_Colocalization.html) and [Summary Level Data Colocalization](https://statfungen.github.io/colocboost/articles/Summary_Level_Colocalization.html).
+Due to the file size limitation of CRAN release, this is a subset of simulated data. See full dataset in [colocboost paper repo](https://github.com/StatFunGen/colocboost-paper).
+
 # 2. ColocBoost in disease-agnostic mode
@@ -78,9 +81,6 @@ res <- colocboost(X = X, Y = Y, sumstat = sumstat, LD = LD)
 # Identified CoS
 res$cos_details$cos$cos_index
-
-# Plotting the results
-colocboost_plot(res)
 ```
 ### Results Interpretation
diff --git a/vignettes/FineBoost_Special_Case.Rmd b/vignettes/FineBoost_Special_Case.Rmd
index 8c19276..af081be 100644
--- a/vignettes/FineBoost_Special_Case.Rmd
+++ b/vignettes/FineBoost_Special_Case.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
 collapse = TRUE,
- comment = "#>"
+ comment = "#>",
+ dpi = 50
 )
 ```
diff --git a/vignettes/Individual_Level_Colocalization.Rmd b/vignettes/Individual_Level_Colocalization.Rmd
index 864c58a..111dd51 100644
--- a/vignettes/Individual_Level_Colocalization.Rmd
+++ b/vignettes/Individual_Level_Colocalization.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
 collapse = TRUE,
- comment = "#>"
+ comment = "#>",
+ dpi = 80
 )
 ```
@@ -38,10 +39,10 @@ The dataset is specifically designed to evaluate and demonstrate the capabilitie
 - `true_effect_variants`: True effect variants indices for each trait.
 ### Causal variant structure
-The dataset features two causal variants with indices 644 and 2289.
+The dataset features two causal variants with indices 194 and 589.
-- Causal variant 644 is associated with traits 1, 2, 3, and 4.
-- Causal variant 2289 is associated with traits 2, 3, and 5.
+- Causal variant 194 is associated with traits 1, 2, 3, and 4.
+- Causal variant 589 is associated with traits 2, 3, and 5.
 This structure creates a realistic scenario where multiple traits are influenced by different but overlapping sets of genetic variants.
@@ -53,6 +54,7 @@ names(Ind_5traits)
 Ind_5traits$true_effect_variants
 ```
+Due to the file size limitation of CRAN release, this is a subset of simulated data. See full dataset in [colocboost paper repo](https://github.com/StatFunGen/colocboost-paper).
 # 2. Matched individual level input $X$ and $Y$
diff --git a/vignettes/Interpret_ColocBoost_Output.Rmd b/vignettes/Interpret_ColocBoost_Output.Rmd
index 36ac3c7..1c6571e 100644
--- a/vignettes/Interpret_ColocBoost_Output.Rmd
+++ b/vignettes/Interpret_ColocBoost_Output.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
 collapse = TRUE,
- comment = "#>"
+ comment = "#>",
+ dpi = 80
 )
 ```
@@ -24,10 +25,10 @@ library(colocboost)
 ## 1. Summarize ColocBoost results
 ### Causal variant structure
-The dataset features two causal variants with indices 644 and 2289.
+The dataset features two causal variants with indices 194 and 589.
-- Causal variant 644 is associated with traits 1, 2, 3, and 4.
-- Causal variant 2289 is associated with traits 2, 3, and 5.
+- Causal variant 194 is associated with traits 1, 2, 3, and 4.
+- Causal variant 589 is associated with traits 2, 3, and 5.
 ```{r run-colocboost}
 # Loading the Dataset
@@ -258,7 +259,7 @@ data(Ind_5traits)
 data(Heterogeneous_Effect)
 X <- Ind_5traits$X[1:3]
 Y <- Ind_5traits$Y[1:3]
-X1 <- Heterogeneous_Effect$X[,1:3000]
+X1 <- Heterogeneous_Effect$X
 Y1 <- Heterogeneous_Effect$Y[,1,drop=F]
 res <- colocboost(X = c(X, list(X1)), Y = c(Y, list(Y1)), output_level = 2)
 names(res$ucos_details)
diff --git a/vignettes/LD_Free_Colocalization.Rmd b/vignettes/LD_Free_Colocalization.Rmd
index 730558d..6011eee 100644
--- a/vignettes/LD_Free_Colocalization.Rmd
+++ b/vignettes/LD_Free_Colocalization.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
 collapse = TRUE,
- comment = "#>"
+ comment = "#>",
+ dpi = 80
 )
 ```
@@ -45,6 +46,7 @@ ColocBoost provides diagnostic warnings for assessing the consistency of the sum
 - Estimated residual variance of the model is negative or greater than phenotypic variance (`rtr < 0` or `rtr > var_y`; see details in Supplementary Note S3.5.2).
 - Change in trait-specific profile log-likelihood according to a CoS is negative (see details in Supplementary Note S3.5.3).
+- The trait-specific gradient boosting model fails to converge.
 ### Example of including LD mismatch
@@ -58,8 +60,8 @@ data("Ind_5traits")
 LD <- get_cormat(Ind_5traits$X[[1]])
 # Change sign of Z-score for 1% of variants for each trait by including mismatched LD
-set.seed(1)
-miss_prop <- 0.01
+set.seed(123)
+miss_prop <- 0.005
 sumstat <- lapply(Sumstat_5traits$sumstat, function(ss){
 p <- nrow(ss)
 pos_miss <- sample(1:p, ceiling(miss_prop * p))
diff --git a/vignettes/Partial_Overlap_Variants.Rmd b/vignettes/Partial_Overlap_Variants.Rmd
index ea06466..6c90f9d 100644
--- a/vignettes/Partial_Overlap_Variants.Rmd
+++ b/vignettes/Partial_Overlap_Variants.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
 collapse = TRUE,
- comment = "#>"
+ comment = "#>",
+ dpi = 80
 )
 ```
@@ -29,8 +30,8 @@ library(colocboost)
 We create an example data from `Ind_5traits` with two causal variants, 644 and 2289, but each of them is only partially overlapping across traits.
-- Causal variant 644 is associated with traits 1, 3, and 4, but is missing in trait 2.
-- Causal variant 2289 is associated with traits 2 and 3, but is missing in trait 5.
+- Causal variant 194 is associated with traits 1, 3, and 4, but is missing in trait 2.
+- Causal variant 589 is associated with traits 2 and 3, but is missing in trait 5.
 This structure creates a realistic scenario in which multiple traits from different datasets are not fully overlapping, and the causal variants are not shared across all traits.
@@ -41,8 +42,8 @@ X <- Ind_5traits$X
 Y <- Ind_5traits$Y
 # Create causal variants with potentially LD proxies
-causal_1 <- c(200:1000)
-causal_2 <- c(2000:2600)
+causal_1 <- c(100:350)
+causal_2 <- c(450:650)
 # Create missing data
 X[[2]] <- X[[2]][, -causal_1, drop = FALSE]
diff --git a/vignettes/Summary_Statistics_Colocalization.Rmd b/vignettes/Summary_Statistics_Colocalization.Rmd
index 30fd224..1a79c8f 100644
--- a/vignettes/Summary_Statistics_Colocalization.Rmd
+++ b/vignettes/Summary_Statistics_Colocalization.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
 collapse = TRUE,
- comment = "#>"
+ comment = "#>",
+ dpi = 80
 )
 ```
@@ -33,10 +34,10 @@ The dataset is specifically designed to evaluate and demonstrate the capabilitie
 - Note that `LD` could be calculated from the `X` data in the `Ind_5traits` dataset, but it is not included in the `Sumstat_5traits` dataset.
 ### Causal variant structure
-The dataset features two causal variants with indices 644 and 2289.
+The dataset features two causal variants with indices 194 and 589.
-- Causal variant 644 is associated with traits 1, 2, 3, and 4.
-- Causal variant 2289 is associated with traits 2, 3, and 5.
+- Causal variant 194 is associated with traits 1, 2, 3, and 4.
+- Causal variant 589 is associated with traits 2, 3, and 5.
 This structure creates a realistic scenario in which multiple traits are influenced by different but overlapping sets of genetic variants.
@@ -47,6 +48,8 @@ names(Sumstat_5traits)
 Sumstat_5traits$true_effect_variants
 ```
+Due to the file size limitation of CRAN release, this is a subset of simulated data. See full dataset in [colocboost paper repo](https://github.com/StatFunGen/colocboost-paper).
+
 ### Important data format for summary data
 `sumstat` must include the following columns:
@@ -135,7 +138,7 @@ It allows for efficient analysis without redundancy.
 ```{r superset-LD}
 # Create sumstat with different number of variants - remove 100 variants in each sumstat
 LD_superset <- LD
-sumstat <- lapply(Sumstat_5traits$sumstat, function(x) x[-sample(1:nrow(x), 100), , drop = FALSE])
+sumstat <- lapply(Sumstat_5traits$sumstat, function(x) x[-sample(1:nrow(x), 20), , drop = FALSE])
 # Run colocboost
 res <- colocboost(sumstat = sumstat, LD = LD_superset)
@@ -203,9 +206,6 @@ res <- colocboost(effect_est = effect_est, effect_se = effect_se, effect_n = eff
 # Identified CoS
 res$cos_details$cos$cos_index
-
-# Plotting the results
-colocboost_plot(res)
 ```
 See more details about data format to implement LD-free ColocBoost and LD-mismatch diagnosis in [LD mismatch and LD-free Colocalization](https://statfungen.github.io/colocboost/articles/LD_Free_Colocalization.html)).
diff --git a/vignettes/Visualization_ColocBoost_Output.Rmd b/vignettes/Visualization_ColocBoost_Output.Rmd
index 41b1495..1ffd7d9 100644
--- a/vignettes/Visualization_ColocBoost_Output.Rmd
+++ b/vignettes/Visualization_ColocBoost_Output.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
 collapse = TRUE,
- comment = "#>"
+ comment = "#>",
+ dpi = 80
 )
 ```
@@ -22,10 +23,10 @@ library(colocboost)
 **Causal variants (simulated)**
-The dataset features two causal variants with indices 644 and 2289.
+The dataset features two causal variants with indices 194 and 589.
-- Causal variant 644 is associated with traits 1, 2, 3, and 4.
-- Causal variant 2289 is associated with traits 2, 3, and 5.
+- Causal variant 194 is associated with traits 1, 2, 3, and 4.
+- Causal variant 589 is associated with traits 2, 3, and 5.
 ```{r run-colocboost}
 # Loading the Dataset
@@ -68,7 +69,7 @@ There are several advanced options available for customizing the plot by deepeni
 You can specify a zoom-in region by providing a `grange` argument, which is a vector indicating the indices of the region to be zoomed in.
 ```{r zoomin-plot}
-colocboost_plot(res, grange = c(1:1000))
+colocboost_plot(res, grange = c(1:400))
 ```
@@ -124,7 +125,7 @@ data(Ind_5traits)
 data(Heterogeneous_Effect)
 X <- Ind_5traits$X[1:3]
 Y <- Ind_5traits$Y[1:3]
-X1 <- Heterogeneous_Effect$X[,1:3000]
+X1 <- Heterogeneous_Effect$X
 Y1 <- Heterogeneous_Effect$Y[,1,drop=F]
 # Run colocboost