Merged
5 changes: 4 additions & 1 deletion DESCRIPTION
@@ -12,7 +12,10 @@ Authors@R: c(
person(given = "Gao", family = "Wang", email = "wang.gao@columbia.edu", role = c("aut", "cph"))
)
Maintainer: Xuewei Cao <xc2270@cumc.columbia.edu>
-Description: A multi-task learning approach to variable selection regression with highly correlated predictors and sparse effects, based on frequentist statistical inference. It provides statistical evidence to identify which subsets of predictors have non-zero effects on which subsets of response variables, motivated and designed for colocalization analysis of multiple genetic association studies.
+Description: A multi-task learning approach to variable selection regression with highly correlated predictors and sparse effects,
+based on frequentist statistical inference. It provides statistical evidence to identify which subsets of predictors have non-zero
+effects on which subsets of response variables, motivated and designed for colocalization analysis of multiple genetic association studies.
+The ColocBoost model is described in Cao etc (2025) <doi:10.1101/2025.04.17.25326042>.
Encoding: UTF-8
LazyDataCompression: xz
LazyData: true
13 changes: 10 additions & 3 deletions R/data.R
@@ -12,6 +12,8 @@
#' @source The Ind_5traits dataset contains 5 simulated phenotypes alongside corresponding genotype matrices.
#' The dataset is specifically designed for evaluating and demonstrating the capabilities of ColocBoost in multi-trait colocalization analysis
#' with individual-level data. See Cao etc. 2025 for details.
+#' Due to the file size limitation of CRAN release, this is a subset of simulated data.
+#' See full dataset in colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
#'
#' @family colocboost_data
"Ind_5traits"
@@ -30,6 +32,8 @@
#' where it is directly derived from the Ind_5traits dataset using marginal association.
#' The dataset is specifically designed for evaluating and demonstrating the capabilities of ColocBoost
#' in multi-trait colocalization analysis with summary association data. See Cao etc. 2025 for details.
+#' Due to the file size limitation of CRAN release, this is a subset of simulated data.
+#' See full dataset in colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
#'
#' @family colocboost_data
"Sumstat_5traits"
@@ -48,7 +52,8 @@
#' }
#' @source The Heterogeneous_Effect dataset contains 2 simulated phenotypes alongside corresponding genotype matrices.
#' There are two causal variants, both of which have heterogeneous effects on two traits.
-#' See Figure 2b in Cao etc. 2025 for details.
+#' Due to the file size limitation of CRAN release, this is a subset of simulated data to generate Figure 2b in Cao etc. 2025.
+#' See full dataset in colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
#'
#' @family colocboost_data
"Heterogeneous_Effect"
@@ -67,7 +72,8 @@
#' }
#' @source The Weaker_GWAS_Effect dataset contains 2 simulated phenotypes alongside corresponding genotype matrices.
#' There are two causal variants, one of which has a weaker effect on the focal trait compared to the other trait.
-#' See Figure 2b in Cao etc. 2025 for details.
+#' Due to the file size limitation of CRAN release, this is a subset of simulated data to generate Figure 2b in Cao etc. 2025.
+#' See full dataset in colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
#'
#' @family colocboost_data
"Weaker_GWAS_Effect"
@@ -86,7 +92,8 @@
#' }
#' @source The Non_Causal_Strongest_Marginal dataset contains 2 simulated phenotypes alongside corresponding genotype matrices.
#' There are two causal variants, but the strongest marginal association is not a causal variant.
-#' See Figure 2b in Cao etc. 2025 for details.
+#' Due to the file size limitation of CRAN release, this is a subset of simulated data to generate Figure 2b in Cao etc. 2025.
+#' See full dataset in colocboost paper repo \url{https://github.com/StatFunGen/colocboost-paper}.
#'
#' @family colocboost_data
"Non_Causal_Strongest_Marginal"
2 changes: 1 addition & 1 deletion README.md
@@ -1,7 +1,7 @@
[![Codecov test coverage](https://codecov.io/gh/StatFunGen/colocboost/branch/main/graph/badge.svg)](https://codecov.io/gh/StatFunGen/colocboost?branch=main)
[![CRAN Version](https://www.r-pkg.org/badges/version/colocboost)](https://cran.r-project.org/package=colocboost)

-![](man/figures/colocboost.png)
+![](man/figures/colocboost.jpg)


This R package implements ColocBoost---motivated and designed for colocalization analysis of multiple genetic association studies---as a multi-task learning approach to variable selection regression with highly correlated predictors and sparse effects, based on frequentist statistical inference. It provides statistical evidence to identify which subsets of predictors have non-zero effects on which subsets of response variables.
Binary file modified data/Heterogeneous_Effect.rda
Binary file not shown.
Binary file modified data/Ind_5traits.rda
Binary file not shown.
Binary file modified data/Non_Causal_Strongest_Marginal.rda
Binary file not shown.
Binary file modified data/Sumstat_5traits.rda
Binary file not shown.
Binary file modified data/Weaker_GWAS_Effect.rda
Binary file not shown.
2 changes: 2 additions & 0 deletions inst/WORDLIST
@@ -43,6 +43,7 @@ colocalized
conda
de
decayrate
+doi
eQTL
grey
iteratively
@@ -67,6 +68,7 @@ rcond
recalibrate
recalibrated
reconciliate
+repo
rss
sQTL
subsampled
3 changes: 2 additions & 1 deletion man/Heterogeneous_Effect.Rd

2 changes: 2 additions & 0 deletions man/Ind_5traits.Rd

3 changes: 2 additions & 1 deletion man/Non_Causal_Strongest_Marginal.Rd

2 changes: 2 additions & 0 deletions man/Sumstat_5traits.Rd

3 changes: 2 additions & 1 deletion man/Weaker_GWAS_Effect.Rd

Binary file added man/figures/colocboost.jpg
Binary file removed man/figures/colocboost.png
Binary file not shown.
Binary file modified man/figures/missing_representation.png
3 changes: 2 additions & 1 deletion vignettes/Ambiguous_Colocalization.Rmd
@@ -10,7 +10,8 @@ vignette: >
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
-comment = "#>"
+comment = "#>",
+dpi = 80
)
```

14 changes: 7 additions & 7 deletions vignettes/Disease_Prioritized_Colocalization.Rmd
@@ -10,7 +10,8 @@ vignette: >
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
-comment = "#>"
+comment = "#>",
+dpi = 80
)
```

@@ -33,10 +34,10 @@ To get started, load both `Ind_5traits` and `Sumstat_5traits` datasets into your
- Note that `LD` could be calculated from the `X` data in the `Ind_5traits` dataset, but it is not included in the `Sumstat_5traits` dataset.

### Causal variant structure
-The dataset features two causal variants with indices 644 and 2289.
+The dataset features two causal variants with indices 194 and 589.

-- Causal variant 644 is associated with traits 1, 2, 3, and 4.
-- Causal variant 2289 is associated with traits 2, 3, and 5 (summary level data).
+- Causal variant 194 is associated with traits 1, 2, 3, and 4.
+- Causal variant 589 is associated with traits 2, 3, and 5 (summary level data).

```{r load-mixed-data}
# Load example data
@@ -54,6 +55,8 @@ For analyze a specific one type of data, you can refer to the following
tutorials [Individual Level Data Colocalization](https://statfungen.github.io/colocboost/articles/Individual_Level_Colocalization.html) and
[Summary Level Data Colocalization](https://statfungen.github.io/colocboost/articles/Summary_Level_Colocalization.html).

+Due to the file size limitation of CRAN release, this is a subset of simulated data. See full dataset in [colocboost paper repo](https://github.com/StatFunGen/colocboost-paper).
+

# 2. ColocBoost in disease-agnostic mode

@@ -78,9 +81,6 @@ res <- colocboost(X = X, Y = Y, sumstat = sumstat, LD = LD)

# Identified CoS
res$cos_details$cos$cos_index
-
-# Plotting the results
-colocboost_plot(res)
```

### Results Interpretation
3 changes: 2 additions & 1 deletion vignettes/FineBoost_Special_Case.Rmd
@@ -10,7 +10,8 @@ vignette: >
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
-comment = "#>"
+comment = "#>",
+dpi = 50
)
```

10 changes: 6 additions & 4 deletions vignettes/Individual_Level_Colocalization.Rmd
@@ -10,7 +10,8 @@ vignette: >
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
-comment = "#>"
+comment = "#>",
+dpi = 80
)
```

@@ -38,10 +39,10 @@ The dataset is specifically designed to evaluate and demonstrate the capabilitie
- `true_effect_variants`: True effect variants indices for each trait.

### Causal variant structure
-The dataset features two causal variants with indices 644 and 2289.
+The dataset features two causal variants with indices 194 and 589.

-- Causal variant 644 is associated with traits 1, 2, 3, and 4.
-- Causal variant 2289 is associated with traits 2, 3, and 5.
+- Causal variant 194 is associated with traits 1, 2, 3, and 4.
+- Causal variant 589 is associated with traits 2, 3, and 5.

This structure creates a realistic scenario where multiple traits are influenced by different but overlapping sets of genetic variants.

@@ -53,6 +54,7 @@ names(Ind_5traits)
Ind_5traits$true_effect_variants
```

+Due to the file size limitation of CRAN release, this is a subset of simulated data. See full dataset in [colocboost paper repo](https://github.com/StatFunGen/colocboost-paper).

# 2. Matched individual level input $X$ and $Y$

11 changes: 6 additions & 5 deletions vignettes/Interpret_ColocBoost_Output.Rmd
@@ -10,7 +10,8 @@ vignette: >
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
-comment = "#>"
+comment = "#>",
+dpi = 80
)
```

@@ -24,10 +25,10 @@ library(colocboost)
## 1. Summarize ColocBoost results

### Causal variant structure
-The dataset features two causal variants with indices 644 and 2289.
+The dataset features two causal variants with indices 194 and 589.

-- Causal variant 644 is associated with traits 1, 2, 3, and 4.
-- Causal variant 2289 is associated with traits 2, 3, and 5.
+- Causal variant 194 is associated with traits 1, 2, 3, and 4.
+- Causal variant 589 is associated with traits 2, 3, and 5.

```{r run-colocboost}
# Loading the Dataset
@@ -258,7 +259,7 @@ data(Ind_5traits)
data(Heterogeneous_Effect)
X <- Ind_5traits$X[1:3]
Y <- Ind_5traits$Y[1:3]
-X1 <- Heterogeneous_Effect$X[,1:3000]
+X1 <- Heterogeneous_Effect$X
Y1 <- Heterogeneous_Effect$Y[,1,drop=F]
res <- colocboost(X = c(X, list(X1)), Y = c(Y, list(Y1)), output_level = 2)
names(res$ucos_details)
8 changes: 5 additions & 3 deletions vignettes/LD_Free_Colocalization.Rmd
@@ -10,7 +10,8 @@ vignette: >
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
-comment = "#>"
+comment = "#>",
+dpi = 80
)
```

@@ -45,6 +46,7 @@ ColocBoost provides diagnostic warnings for assessing the consistency of the sum
- Estimated residual variance of the model is negative or greater than phenotypic variance (`rtr < 0` or `rtr > var_y`;
see details in Supplementary Note S3.5.2).
- Change in trait-specific profile log-likelihood according to a CoS is negative (see details in Supplementary Note S3.5.3).
+- The trait-specific gradient boosting model fails to converge.


### Example of including LD mismatch
@@ -58,8 +60,8 @@ data("Ind_5traits")
LD <- get_cormat(Ind_5traits$X[[1]])

# Flip the sign of the Z-score for 0.5% of variants in each trait to mimic mismatched LD
-set.seed(1)
-miss_prop <- 0.01
+set.seed(123)
+miss_prop <- 0.005
sumstat <- lapply(Sumstat_5traits$sumstat, function(ss){
p <- nrow(ss)
pos_miss <- sample(1:p, ceiling(miss_prop * p))
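Note for readers of this hunk: the chunk is cut off by the diff view, so the rest of the function body is not visible here. The idea it sets up (flip the sign of the Z-score for a small random subset of variants so that the summary statistics disagree with the LD matrix) can be sketched in a self-contained way as follows. The toy data frame `ss` and its column name `z` are assumptions made for illustration only; they are not taken from `Sumstat_5traits$sumstat`.

```r
# Minimal sketch, not the vignette's actual code.
# `ss` is a hypothetical stand-in for one summary-statistics table.
set.seed(123)
ss <- data.frame(
  variant = paste0("v", 1:1000),
  z = rnorm(1000)
)

miss_prop <- 0.005
p <- nrow(ss)

# ceiling() guarantees that at least one variant is picked, even for small p
pos_miss <- sample(1:p, ceiling(miss_prop * p))

# Flip the sign only at the sampled positions; everything else is untouched
flipped <- ss
flipped$z[pos_miss] <- -flipped$z[pos_miss]

# Number of variants whose Z-score sign was flipped
length(pos_miss)
```

With a smaller `miss_prop`, fewer variants disagree with the LD matrix, which presumably keeps the example closer to a mild mismatch on the reduced CRAN dataset.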
11 changes: 6 additions & 5 deletions vignettes/Partial_Overlap_Variants.Rmd
@@ -10,7 +10,8 @@ vignette: >
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
-comment = "#>"
+comment = "#>",
+dpi = 80
)
```

@@ -29,8 +30,8 @@ library(colocboost)

We create an example dataset from `Ind_5traits` with two causal variants, 194 and 589, each of which only partially overlaps across traits.

-- Causal variant 644 is associated with traits 1, 3, and 4, but is missing in trait 2.
-- Causal variant 2289 is associated with traits 2 and 3, but is missing in trait 5.
+- Causal variant 194 is associated with traits 1, 3, and 4, but is missing in trait 2.
+- Causal variant 589 is associated with traits 2 and 3, but is missing in trait 5.

This structure creates a realistic scenario in which multiple traits from different datasets are not fully overlapping, and the causal variants are not shared across all traits.

@@ -41,8 +42,8 @@ X <- Ind_5traits$X
Y <- Ind_5traits$Y

# Create causal variants with potentially LD proxies
-causal_1 <- c(200:1000)
-causal_2 <- c(2000:2600)
+causal_1 <- c(100:350)
+causal_2 <- c(450:650)

# Create missing data
X[[2]] <- X[[2]][, -causal_1, drop = FALSE]
16 changes: 8 additions & 8 deletions vignettes/Summary_Statistics_Colocalization.Rmd
@@ -10,7 +10,8 @@ vignette: >
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
-comment = "#>"
+comment = "#>",
+dpi = 80
)
```

@@ -33,10 +34,10 @@ The dataset is specifically designed to evaluate and demonstrate the capabilitie
- Note that `LD` could be calculated from the `X` data in the `Ind_5traits` dataset, but it is not included in the `Sumstat_5traits` dataset.

### Causal variant structure
-The dataset features two causal variants with indices 644 and 2289.
+The dataset features two causal variants with indices 194 and 589.

-- Causal variant 644 is associated with traits 1, 2, 3, and 4.
-- Causal variant 2289 is associated with traits 2, 3, and 5.
+- Causal variant 194 is associated with traits 1, 2, 3, and 4.
+- Causal variant 589 is associated with traits 2, 3, and 5.

This structure creates a realistic scenario in which multiple traits are influenced by different but overlapping sets of genetic variants.

@@ -47,6 +48,8 @@ names(Sumstat_5traits)
Sumstat_5traits$true_effect_variants
```

+Due to the file size limitation of CRAN release, this is a subset of simulated data. See full dataset in [colocboost paper repo](https://github.com/StatFunGen/colocboost-paper).
+
### Important data format for summary data
`sumstat` must include the following columns:

@@ -135,7 +138,7 @@ It allows for efficient analysis without redundancy.
```{r superset-LD}
# Create sumstat with different numbers of variants - remove 20 variants from each sumstat
LD_superset <- LD
-sumstat <- lapply(Sumstat_5traits$sumstat, function(x) x[-sample(1:nrow(x), 100), , drop = FALSE])
+sumstat <- lapply(Sumstat_5traits$sumstat, function(x) x[-sample(1:nrow(x), 20), , drop = FALSE])

# Run colocboost
res <- colocboost(sumstat = sumstat, LD = LD_superset)
@@ -203,9 +206,6 @@ res <- colocboost(effect_est = effect_est, effect_se = effect_se, effect_n = eff

# Identified CoS
res$cos_details$cos$cos_index
-
-# Plotting the results
-colocboost_plot(res)
```

See more details about data format to implement LD-free ColocBoost and LD-mismatch diagnosis in [LD mismatch and LD-free Colocalization](https://statfungen.github.io/colocboost/articles/LD_Free_Colocalization.html)).
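
Note on the `superset-LD` hunk in this file: each summary-statistics table drops a random set of variants while the LD matrix keeps the full set, so the LD matrix stays a superset of every table. A minimal sketch of that setup on toy data is below. The identity LD matrix, the `variant` and `z` column names, and the sizes are all assumptions for illustration; the sketch deliberately does not call `colocboost()`.

```r
# Minimal sketch on toy data, not the vignette's actual code.
set.seed(1)
p <- 50
ids <- paste0("v", seq_len(p))

# Hypothetical LD matrix covering all p variants (identity for simplicity)
LD_superset <- diag(p)
dimnames(LD_superset) <- list(ids, ids)

# Three hypothetical summary-statistics tables, each missing 20 random variants
sumstat <- lapply(1:3, function(i) {
  ss <- data.frame(variant = ids, z = rnorm(p))
  ss[-sample(seq_len(nrow(ss)), 20), , drop = FALSE]
})

# Check that every remaining variant is still present in the LD matrix
all_covered <- all(vapply(
  sumstat,
  function(ss) all(ss$variant %in% rownames(LD_superset)),
  logical(1)
))
all_covered
```

Dropping fewer variants per table (20 rather than 100) fits the reduced dataset, which has far fewer variants than the original simulation.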