StatFunGen · xueweic · Apr 24, 2025
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -12,7 +12,10 @@ Authors@R: c(
   person(given = "Gao", family = "Wang", email = "wang.gao@columbia.edu", role = c("aut", "cph"))
   )
 Maintainer: Xuewei Cao <xc2270@cumc.columbia.edu>
-Description: A multi-task learning approach to variable selection regression with highly correlated predictors and sparse effects, based on frequentist statistical inference. It provides statistical evidence to identify which subsets of predictors have non-zero effects on which subsets of response variables, motivated and designed for colocalization analysis of multiple genetic association studies.
+Description: A multi-task learning approach to variable selection regression with highly correlated predictors and sparse effects, 
+  based on frequentist statistical inference. It provides statistical evidence to identify which subsets of predictors have non-zero 
+  effects on which subsets of response variables, motivated and designed for colocalization analysis of multiple genetic association studies.
+  The ColocBoost model is described in Cao et al. (2025) <doi:10.1101/2025.04.17.25326042>.
 Encoding: UTF-8
 LazyDataCompression: xz
 LazyData: true

diff --git a/R/data.R b/R/data.R
@@ -12,6 +12,8 @@
 #' @source The Ind_5traits dataset contains 5 simulated phenotypes alongside corresponding genotype matrices.
 #' The dataset is specifically designed for evaluating and demonstrating the capabilities of ColocBoost in multi-trait colocalization analysis 
 #' with individual-level data. See Cao etc. 2025 for details.
+#' Due to the CRAN file size limit, this is a subset of the simulated data.
+#' See the full dataset in the colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
 #' 
 #' @family colocboost_data
 "Ind_5traits"
@@ -30,6 +32,8 @@
 #' where it is directly derived from the Ind_5traits dataset using marginal association. 
 #' The dataset is specifically designed for evaluating and demonstrating the capabilities of ColocBoost 
 #' in multi-trait colocalization analysis with summary association data. See Cao etc. 2025 for details. 
+#' Due to the CRAN file size limit, this is a subset of the simulated data.
+#' See the full dataset in the colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
 #' 
 #' @family colocboost_data
 "Sumstat_5traits"
@@ -48,7 +52,8 @@
 #' }
 #' @source The Heterogeneous_Effect dataset contains 2 simulated phenotypes alongside corresponding genotype matrices.
 #' There are two causal variants, both of which have heterogeneous effects on two traits.
-#' See Figure 2b in Cao etc. 2025 for details. 
+#' Due to the CRAN file size limit, this is a subset of the simulated data used to generate Figure 2b in Cao et al. 2025.
+#' See the full dataset in the colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
 #' 
 #' @family colocboost_data
 "Heterogeneous_Effect"
@@ -67,7 +72,8 @@
 #' }
 #' @source The Weaker_GWAS_Effect dataset contains 2 simulated phenotypes alongside corresponding genotype matrices.
 #' There are two causal variants, one of which has a weaker effect on the focal trait compared to the other trait.
-#' See Figure 2b in Cao etc. 2025 for details. 
+#' Due to the CRAN file size limit, this is a subset of the simulated data used to generate Figure 2b in Cao et al. 2025.
+#' See the full dataset in the colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
 #' 
 #' @family colocboost_data
 "Weaker_GWAS_Effect"
@@ -86,7 +92,8 @@
 #' }
 #' @source The Non_Causal_Strongest_Marginal dataset contains 2 simulated phenotypes alongside corresponding genotype matrices.
 #' There are two causal variants, but the strongest marginal association is not a causal variant.
-#' See Figure 2b in Cao etc. 2025 for details. 
+#' Due to the CRAN file size limit, this is a subset of the simulated data used to generate Figure 2b in Cao et al. 2025.
+#' See the full dataset in the colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
 #' 
 #' @family colocboost_data
 "Non_Causal_Strongest_Marginal"

diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 [![Codecov test coverage](https://codecov.io/gh/StatFunGen/colocboost/branch/main/graph/badge.svg)](https://codecov.io/gh/StatFunGen/colocboost?branch=main)
 [![CRAN Version](https://www.r-pkg.org/badges/version/colocboost)](https://cran.r-project.org/package=colocboost)
 
-![](man/figures/colocboost.png)
+![](man/figures/colocboost.jpg)
 
 
 This R package implements ColocBoost---motivated and designed for colocalization analysis of multiple genetic association studies---as a multi-task learning approach to variable selection regression with highly correlated predictors and sparse effects, based on frequentist statistical inference. It provides statistical evidence to identify which subsets of predictors have non-zero effects on which subsets of response variables.

diff --git a/data/Heterogeneous_Effect.rda b/data/Heterogeneous_Effect.rda
diff --git a/data/Ind_5traits.rda b/data/Ind_5traits.rda
diff --git a/data/Non_Causal_Strongest_Marginal.rda b/data/Non_Causal_Strongest_Marginal.rda
diff --git a/data/Sumstat_5traits.rda b/data/Sumstat_5traits.rda
diff --git a/data/Weaker_GWAS_Effect.rda b/data/Weaker_GWAS_Effect.rda
diff --git a/inst/WORDLIST b/inst/WORDLIST
@@ -43,6 +43,7 @@ colocalized
 conda
 de
 decayrate
+doi
 eQTL
 grey
 iteratively
@@ -67,6 +68,7 @@ rcond
 recalibrate
 recalibrated
 reconciliate
+repo
 rss
 sQTL
 subsampled

diff --git a/man/Heterogeneous_Effect.Rd b/man/Heterogeneous_Effect.Rd
diff --git a/man/Ind_5traits.Rd b/man/Ind_5traits.Rd
diff --git a/man/Non_Causal_Strongest_Marginal.Rd b/man/Non_Causal_Strongest_Marginal.Rd
diff --git a/man/Sumstat_5traits.Rd b/man/Sumstat_5traits.Rd
diff --git a/man/Weaker_GWAS_Effect.Rd b/man/Weaker_GWAS_Effect.Rd
diff --git a/man/figures/colocboost.jpg b/man/figures/colocboost.jpg
diff --git a/man/figures/colocboost.png b/man/figures/colocboost.png
diff --git a/man/figures/missing_representation.png b/man/figures/missing_representation.png
diff --git a/vignettes/Ambiguous_Colocalization.Rmd b/vignettes/Ambiguous_Colocalization.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
   collapse = TRUE,
-  comment = "#>"
+  comment = "#>",
+  dpi = 80
 )
 ```
 

diff --git a/vignettes/Disease_Prioritized_Colocalization.Rmd b/vignettes/Disease_Prioritized_Colocalization.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
   collapse = TRUE,
-  comment = "#>"
+  comment = "#>",
+  dpi = 80
 )
 ```
 
@@ -33,10 +34,10 @@ To get started, load both `Ind_5traits` and `Sumstat_5traits` datasets into your
 - Note that `LD` could be calculated from the `X` data in the `Ind_5traits` dataset, but it is not included in the `Sumstat_5traits` dataset.
 
 ### Causal variant structure
-The dataset features two causal variants with indices 644 and 2289.
+The dataset features two causal variants with indices 194 and 589.
 
-- Causal variant 644 is associated with traits 1, 2, 3, and 4.
-- Causal variant 2289 is associated with traits 2, 3, and 5 (summary level data).
+- Causal variant 194 is associated with traits 1, 2, 3, and 4.
+- Causal variant 589 is associated with traits 2, 3, and 5 (summary level data).
 
 ```{r load-mixed-data}
 # Load example data
@@ -54,6 +55,8 @@ For analyze a specific one type of data, you can refer to the following
 tutorials [Individual Level Data Colocalization](https://statfungen.github.io/colocboost/articles/Individual_Level_Colocalization.html) and 
 [Summary Level Data Colocalization](https://statfungen.github.io/colocboost/articles/Summary_Level_Colocalization.html).
 
+Due to the CRAN file size limit, this is a subset of the simulated data. See the full dataset in the [colocboost paper repo](https://github.com/StatFunGen/colocboost-paper).
+
 
 # 2. ColocBoost in disease-agnostic mode
 
@@ -78,9 +81,6 @@ res <- colocboost(X = X, Y = Y, sumstat = sumstat, LD = LD)
 
 # Identified CoS
 res$cos_details$cos$cos_index
-
-# Plotting the results
-colocboost_plot(res)
 ```
 
 ### Results Interpretation

diff --git a/vignettes/FineBoost_Special_Case.Rmd b/vignettes/FineBoost_Special_Case.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
   collapse = TRUE,
-  comment = "#>"
+  comment = "#>",
+  dpi = 50
 )
 ```
 

diff --git a/vignettes/Individual_Level_Colocalization.Rmd b/vignettes/Individual_Level_Colocalization.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
   collapse = TRUE,
-  comment = "#>"
+  comment = "#>",
+  dpi = 80
 )
 ```
 
@@ -38,10 +39,10 @@ The dataset is specifically designed to evaluate and demonstrate the capabilitie
 - `true_effect_variants`: True effect variants indices for each trait.
 
 ### Causal variant structure
-The dataset features two causal variants with indices 644 and 2289.
+The dataset features two causal variants with indices 194 and 589.
 
-- Causal variant 644 is associated with traits 1, 2, 3, and 4.
-- Causal variant 2289 is associated with traits 2, 3, and 5.
+- Causal variant 194 is associated with traits 1, 2, 3, and 4.
+- Causal variant 589 is associated with traits 2, 3, and 5.
 
 This structure creates a realistic scenario where multiple traits are influenced by different but overlapping sets of genetic variants.
 
@@ -53,6 +54,7 @@ names(Ind_5traits)
 Ind_5traits$true_effect_variants
 ```
 
+Due to the CRAN file size limit, this is a subset of the simulated data. See the full dataset in the [colocboost paper repo](https://github.com/StatFunGen/colocboost-paper).
 
 # 2. Matched individual level input $X$ and $Y$
 

diff --git a/vignettes/Interpret_ColocBoost_Output.Rmd b/vignettes/Interpret_ColocBoost_Output.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
   collapse = TRUE,
-  comment = "#>"
+  comment = "#>",
+  dpi = 80
 )
 ```
 
@@ -24,10 +25,10 @@ library(colocboost)
 ## 1. Summarize ColocBoost results
 
 ### Causal variant structure
-The dataset features two causal variants with indices 644 and 2289.
+The dataset features two causal variants with indices 194 and 589.
 
-- Causal variant 644 is associated with traits 1, 2, 3, and 4.
-- Causal variant 2289 is associated with traits 2, 3, and 5.
+- Causal variant 194 is associated with traits 1, 2, 3, and 4.
+- Causal variant 589 is associated with traits 2, 3, and 5.
 
 ```{r run-colocboost}
 # Loading the Dataset
@@ -258,7 +259,7 @@ data(Ind_5traits)
 data(Heterogeneous_Effect)
 X <- Ind_5traits$X[1:3]
 Y <- Ind_5traits$Y[1:3]
-X1 <- Heterogeneous_Effect$X[,1:3000]
+X1 <- Heterogeneous_Effect$X
 Y1 <- Heterogeneous_Effect$Y[,1,drop=F]
 res <- colocboost(X = c(X, list(X1)), Y = c(Y, list(Y1)), output_level = 2)
 names(res$ucos_details)

diff --git a/vignettes/LD_Free_Colocalization.Rmd b/vignettes/LD_Free_Colocalization.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
   collapse = TRUE,
-  comment = "#>"
+  comment = "#>",
+  dpi = 80
 )
 ```
 
@@ -45,6 +46,7 @@ ColocBoost provides diagnostic warnings for assessing the consistency of the sum
 - Estimated residual variance of the model is negative or greater than phenotypic variance (`rtr < 0` or `rtr > var_y`; 
 see details in Supplementary Note S3.5.2). 
 - Change in trait-specific profile log-likelihood according to a CoS is negative (see details in Supplementary Note S3.5.3).
+- The trait-specific gradient boosting model fails to converge.
 
 
 ### Example of including LD mismatch
@@ -58,8 +60,8 @@ data("Ind_5traits")
 LD <- get_cormat(Ind_5traits$X[[1]])
 
 # Change sign of Z-score for 1% of variants for each trait by including mismatched LD
-set.seed(1)
-miss_prop <- 0.01 
+set.seed(123)
+miss_prop <- 0.005 
 sumstat <- lapply(Sumstat_5traits$sumstat, function(ss){
   p <- nrow(ss)
   pos_miss <- sample(1:p, ceiling(miss_prop * p))

diff --git a/vignettes/Partial_Overlap_Variants.Rmd b/vignettes/Partial_Overlap_Variants.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
   collapse = TRUE,
-  comment = "#>"
+  comment = "#>",
+  dpi = 80
 )
 ```
 
@@ -29,8 +30,8 @@ library(colocboost)
 
 We create an example data from `Ind_5traits` with two causal variants, 644 and 2289, but each of them is only partially overlapping across traits.
 
-- Causal variant 644 is associated with traits 1, 3, and 4, but is missing in trait 2.
-- Causal variant 2289 is associated with traits 2 and 3, but is missing in trait 5.
+- Causal variant 194 is associated with traits 1, 3, and 4, but is missing in trait 2.
+- Causal variant 589 is associated with traits 2 and 3, but is missing in trait 5.
 
 This structure creates a realistic scenario in which multiple traits from different datasets are not fully overlapping, and the causal variants are not shared across all traits.
 
@@ -41,8 +42,8 @@ X <- Ind_5traits$X
 Y <- Ind_5traits$Y
 
 # Create causal variants with potentially LD proxies
-causal_1 <- c(200:1000)
-causal_2 <- c(2000:2600)
+causal_1 <- c(100:350)
+causal_2 <- c(450:650)
 
 # Create missing data
 X[[2]] <- X[[2]][, -causal_1, drop = FALSE]

diff --git a/vignettes/Summary_Statistics_Colocalization.Rmd b/vignettes/Summary_Statistics_Colocalization.Rmd
@@ -10,7 +10,8 @@ vignette: >
 ```{r, include = FALSE}
 knitr::opts_chunk$set(
   collapse = TRUE,
-  comment = "#>"
+  comment = "#>",
+  dpi = 80
 )
 ```
 
@@ -33,10 +34,10 @@ The dataset is specifically designed to evaluate and demonstrate the capabilitie
 - Note that `LD` could be calculated from the `X` data in the `Ind_5traits` dataset, but it is not included in the `Sumstat_5traits` dataset.
 
 ### Causal variant structure
-The dataset features two causal variants with indices 644 and 2289.
+The dataset features two causal variants with indices 194 and 589.
 
-- Causal variant 644 is associated with traits 1, 2, 3, and 4.
-- Causal variant 2289 is associated with traits 2, 3, and 5.
+- Causal variant 194 is associated with traits 1, 2, 3, and 4.
+- Causal variant 589 is associated with traits 2, 3, and 5.
 
 This structure creates a realistic scenario in which multiple traits are influenced by different but overlapping sets of genetic variants.
 
@@ -47,6 +48,8 @@ names(Sumstat_5traits)
 Sumstat_5traits$true_effect_variants
 ```
 
+Due to the CRAN file size limit, this is a subset of the simulated data. See the full dataset in the [colocboost paper repo](https://github.com/StatFunGen/colocboost-paper).
+
 ### Important data format for summary data
 `sumstat` must include the following columns:
 
@@ -135,7 +138,7 @@ It allows for efficient analysis without redundancy.
 ```{r superset-LD}
 # Create sumstat with different number of variants - remove 100 variants in each sumstat
 LD_superset <- LD
-sumstat <- lapply(Sumstat_5traits$sumstat, function(x) x[-sample(1:nrow(x), 100), , drop = FALSE])
+sumstat <- lapply(Sumstat_5traits$sumstat, function(x) x[-sample(1:nrow(x), 20), , drop = FALSE])
 
 # Run colocboost
 res <- colocboost(sumstat = sumstat, LD = LD_superset)
@@ -203,9 +206,6 @@ res <- colocboost(effect_est = effect_est, effect_se = effect_se, effect_n = eff
 
 # Identified CoS
 res$cos_details$cos$cos_index
-
-# Plotting the results
-colocboost_plot(res)
 ```
 
 See more details about data format to implement LD-free ColocBoost and LD-mismatch diagnosis in [LD mismatch and LD-free Colocalization](https://statfungen.github.io/colocboost/articles/LD_Free_Colocalization.html)).