28 changes: 28 additions & 0 deletions R/README.md
@@ -0,0 +1,28 @@

## Data

- `data.R`: Contains example datasets for testing and demonstration.

## Source code structure for developers

The implementation of the ColocBoost algorithm roughly follows the structure of the ColocBoost paper.
That is, we introduce a multi-task regression problem for L traits, followed by the dynamic coupling strategy with the SEC learner,
and finally assembly and inference. Implementation-wise:


1. `colocboost.R` implements the main interface function that users interact with directly.

2. `colocboost_workhorse.R`: The core interface of the dynamic coupling strategy with the SEC learner.
- `colocboost_check_update_jk.R`: The strategy for determining the best update variant for a subset of traits.
- `colocboost_update.R`: The single effect learner/coupler (SEC) for the best update variant and traits.
- `colocboost_one_causal.R`: The special case of ColocBoost under the per-trait-per-causal assumption, with or without LD information.

3. `colocboost_assemble.R` implements the core interface for assembling the SEC learners from step 2 and performing inference on them.
- `colocboost_assemble_cos.R`: The function that creates 95% CoS for different colocalization events.
- `colocboost_assemble_ucos.R`: The function that creates 95% CS for trait-specific effects.
- `colocboost_inference.R`: Post-inference functions, including the modularity-based hierarchical clustering method, removal of spurious signals, the definition of colocalization evidence, etc.
- `colocboost_utils.R`: Utility functions, including refinement of colocalization confidence sets from different SEC learners, among other utilities.
- `colocboost_output.R`: Formats and exports analysis results.

4. `colocboost_plot.R` implements various visualization options for ColocBoost results.

16 changes: 16 additions & 0 deletions _pkgdown.yml
@@ -3,6 +3,22 @@ url: https://statfungen.github.io/colocboost
template:
bootstrap: 5

navbar:
left:
- text: "Home"
href: index.html
- text: "News"
href: articles/announcements.html
- text: "Installation"
href: articles/installation.html
- text: "Tutorials"
href: articles/index.html
- text: "Functions"
href: reference/index.html
right:
- icon: fa-github
href: https://github.com/StatFunGen/colocboost

reference:
- title: "Example Data"
desc: "Example datasets for demonstration and testing"
43 changes: 43 additions & 0 deletions index.md
@@ -0,0 +1,43 @@
[![Codecov test coverage](https://codecov.io/gh/StatFunGen/colocboost/branch/main/graph/badge.svg)](https://codecov.io/gh/StatFunGen/colocboost?branch=main)
[![CRAN Version](https://www.r-pkg.org/badges/version/colocboost)](https://cran.r-project.org/package=colocboost)

![](man/figures/colocboost.png)


This R package implements ColocBoost --- motivated and designed for colocalization analysis of multiple genetic association studies --- as a multi-task learning approach to variable selection regression with highly correlated predictors and sparse effects, based on frequentist statistical inference. It provides statistical evidence to identify which subsets of predictors have non-zero effects on which subsets of response variables.

## Quick Start

### CRAN Installation
Install released versions from CRAN - pre-built packages are available on macOS and Windows

```r
install.packages("colocboost")
```

### GitHub Installation
Install the development version from GitHub

```r
devtools::install_github("StatFunGen/colocboost")
```

For detailed installation guidance, please refer to [https://statfungen.github.io/colocboost/articles/installation.html](https://statfungen.github.io/colocboost/articles/installation.html).



## Tutorial Website

Learn how to perform colocalization analysis with step-by-step examples. For detailed tutorials and use cases, explore our FIXME (link).


## Citation

If you use ColocBoost in your research, please cite:

Cao X, Sun H, Feng R, Mazumder R, Najar CFB, Li YI, de Jager PL, Bennett D, The Alzheimer's Disease Functional Genomics Consortium, Dey KK, Wang G. (2025+). Integrative multi-omics QTL colocalization maps regulatory architecture in aging human brain. bioRxiv. [https://doi.org/](https://doi.org/)


## License

This package is released under the MIT License.
Binary file added man/figures/colocboost.png
104 changes: 64 additions & 40 deletions vignettes/Individual_Level_Colocalization.Rmd
@@ -14,38 +14,55 @@ knitr::opts_chunk$set(
)
```

This vignette demonstrates how to perform colocalization analysis using individual-level data with multiple traits in ColocBoost, specifically focusing on the `Ind_5traits` dataset included in the package.
This vignette demonstrates how to perform multi-trait colocalization analysis using individual-level data in ColocBoost,
specifically focusing on the `Ind_5traits` dataset included in the package.


```{r setup}
library(colocboost)
```


## Example Data – Ind_5traits
# 1. The `Ind_5traits` Dataset


The `Ind_5traits` dataset contains 5 simulated phenotypes alongside corresponding genotype matrices.
The dataset is specifically designed to facilitate the identification of causal variants for complex traits.
The `Ind_5traits` dataset contains 5 simulated phenotypes alongside corresponding genotype matrices.
The dataset is specifically designed for evaluating and demonstrating the capabilities of ColocBoost in multi-trait colocalization analysis with individual-level data.

- `X`: A list of genotype matrices for different outcomes.
- `Y`: A list of phenotype vectors for different outcomes.
- `true_effect_variants`: True effect variants indices for each trait.

- `X`: A list of genotype matrices for different outcomes. Each matrix has dimensions (individuals × variables).
- `Y`: A list of phenotype vectors for different outcomes. Each vector has dimensions (individuals × 1).
- `true_effect_variants`: Indices of the true effect variables for each trait.
### Causal variant structure
The dataset features two causal variants with indices 644 and 2289.

- Causal variant 644 is associated with traits 1, 2, 3, and 4.
- Causal variant 2289 is associated with traits 2, 3, and 5.

This structure creates a realistic scenario where multiple traits are influenced by different but overlapping sets of genetic variants.
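
The overlap between traits is easier to see when the structure is inverted from "trait to variants" into "variant to traits". The sketch below uses base R only and a hand-written list that mirrors the description above; the element names (`trait1`, ..., `trait5`) are an assumption made for illustration and may differ from the names used in `Ind_5traits$true_effect_variants`.

```r
# Hypothetical reconstruction of the causal structure described above.
# The real object is Ind_5traits$true_effect_variants; names here are assumed.
true_effect_variants <- list(
  trait1 = c(644),
  trait2 = c(644, 2289),
  trait3 = c(644, 2289),
  trait4 = c(644),
  trait5 = c(2289)
)

# Flatten the list into a two-column data frame:
# `values` holds the variant index, `ind` holds the trait name.
long <- stack(true_effect_variants)

# Group trait names by variant index to see which traits share each variant.
traits_by_variant <- split(as.character(long$ind), long$values)
traits_by_variant
```

Each element of `traits_by_variant` lists the traits that share one causal variant, which is the set a colocalization analysis would ideally recover.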


```{r load-example-data}
# Load the example data
# Loading the Dataset
data(Ind_5traits)
names(Ind_5traits)
Ind_5traits$true_effect_variants
```

There are two

## Multiple X and Multiple Y
# 2. Run ColocBoost (Basic usage)


The preferred format for colocalization analysis in ColocBoost using individual level data is where genotype (X) and phenotype (Y) data are properly matched.

- **Basic format**: `X` and `Y` are organized as lists, matched by trait index,
- `(X[1], Y[1])` contains individuals for trait 1,
- `(X[2], Y[2])` contains individuals for trait 2,
- And so on for each trait under analysis.
- **Cross-trait flexibility**:
- There is no requirement for the same individuals across different traits. This allows for the analysis of traits with different sample sizes.
- This is particularly useful when you have a large dataset with many traits and want to focus on specific individuals for each trait.
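
As a rough illustration of the list format described above, the following base R sketch simulates three traits with different sample sizes and checks that each `(X[[i]], Y[[i]])` pair is matched by individual. The sample sizes, variant count, and effect size are made up for the example, and the check is only a sanity test one might run before an analysis, not part of the ColocBoost API.

```r
set.seed(1)

# Assumed toy setting: three traits measured on different numbers of individuals.
n_per_trait <- c(100, 80, 120)
p <- 10  # number of variants in each genotype matrix

# One genotype matrix per trait (individuals x variants), coded 0/1/2.
X <- lapply(n_per_trait, function(n) {
  matrix(rbinom(n * p, size = 2, prob = 0.3), nrow = n, ncol = p)
})

# One phenotype vector per trait; variant 3 has a small simulated effect.
Y <- lapply(X, function(x) as.numeric(0.5 * x[, 3] + rnorm(nrow(x))))

# Each pair must be matched within a trait, but not across traits.
rows_match <- mapply(function(x, y) nrow(x) == length(y), X, Y)
stopifnot(all(rows_match))
vapply(X, nrow, integer(1))
```

The final line reports a different number of individuals for each trait, which is the situation the cross-trait flexibility is meant to support.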

The default and preferred format for colocalization analysis in ColocBoost is where genotype (X) and phenotype (Y) data are matched by individual:

```{r multiple-matched}
# Extract genotype (X) and phenotype (Y) data
@@ -55,19 +72,28 @@ Y <- Ind_5traits$Y
# Run colocboost with matched data
res <- colocboost(X = X, Y = Y)

# View results
str(res)
# Identified CoS
res$cos_details$cos$cos_index
```

Key requirements for this format:
- Both X and Y must have the same number of rows (individuals)
- Individuals must be in the same order in both matrices
- Covariates Z should also have the same number of rows
- To analyze multiple traits, you can loop over the columns of Y or use the ColocMultiBoost function

## Single X (List) and Multiple Y (Matrix Form)
### Results Interpretation

For comprehensive tutorials on result interpretation and advanced visualization techniques, please visit our documentation portal at FIXME (link).


# 3. Run ColocBoost (Advanced usage)

## 3.1. Single genotype matrix

When studying multiple traits with a common genotype matrix, such as gene expression in different tissues or cell types,
we provide the interface for one single genotype matrix with multiple phenotypes.
This is particularly useful when the same individuals are used for different traits, allowing for efficient analysis without redundancy.

- **Input Format**:
- `X` is a single matrix containing genotype data for all individuals.
- `Y` can be either i) an N × L matrix, or ii) a list of phenotype vectors for L traits.
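
The two accepted forms of `Y` carry the same information, so one can be derived from the other. The sketch below uses simulated data and base R only; the object names (`X_single`, `Y_list`, `Y_mat`) and dimensions are assumptions for illustration, and the `colocboost()` call is left as a comment because it depends on the installed package.

```r
set.seed(2)

# Assumed toy dimensions: N individuals, L traits, p variants.
N <- 50
L <- 3
p <- 8

# A single genotype matrix shared by all traits (individuals x variants).
X_single <- matrix(rbinom(N * p, size = 2, prob = 0.3), nrow = N, ncol = p)

# Form ii): a list of L phenotype vectors, each of length N.
Y_list <- lapply(seq_len(L), function(l) rnorm(N))

# Form i): the same phenotypes as an N x L matrix, one column per trait.
Y_mat <- do.call(cbind, Y_list)

# With a shared genotype matrix, every trait must cover the same N individuals.
stopifnot(nrow(Y_mat) == nrow(X_single))
dim(Y_mat)

# Either form could then be passed with the single matrix, e.g.
# res <- colocboost(X = X_single, Y = Y_mat)
```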

When you want to focus on a single variant across multiple traits:

```{r single-x}
# Extract a single SNP (as a vector)
@@ -76,43 +102,41 @@ X_single <- X[[1]] # First SNP for all individuals
# Run colocboost
res <- colocboost(X = X, Y = Y)

# View results for the first trait
str(res)
# Identified CoS
res$cos_details$cos$cos_index
```

Key aspects of this approach:
- You can keep X as a single-column matrix or vector
- For multiple traits, you can loop over the columns of Y or use dedicated functions
- Covariates are applied consistently across all analyses
## 3.2. Arbitrary X and Y with dictionary provided

When studying multiple traits with arbitrary genotype matrices for different traits,
we also provide the interface for arbitrary genotype matrices with multiple phenotypes.
This particularly benefits meta-analysis across heterogeneous datasets where, for different subsets of traits,
genotype data comes from different genotyping platforms or sequencing technologies.

## Multiple X and Multiple Y – Dictionary Provided
- **Input Format**:
- `X` is a list of genotype matrices.
- `Y` is a list of phenotype vectors.
- `dict_YX` is a dictionary matrix that maps each index of `Y` to the index of its genotype matrix in `X`.

When you need to map between different X and Y variables using a dictionary:

```{r dictionary-mapped}
# Create a simple dictionary for demonstration purposes
X_two <- X[c(1,3)]
# Traits 1 and 2 are matched to the first genotype matrix;
# traits 3, 4, and 5 are matched to the third genotype matrix.
X_arbitrary <- X[c(1, 3)]
dict_YX <- cbind(1:5, c(1, 1, 2, 2, 2))

# Display the dictionary
dict_YX

# Run colocboost
res <- colocboost(X = X_two, Y = Y, dict_YX = dict_YX)
res <- colocboost(X = X_arbitrary, Y = Y, dict_YX = dict_YX)

# View results for the first trait
str(res)
# Identified CoS
res$cos_details$cos$cos_index
```
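
Before running an analysis it can help to confirm that the dictionary is consistent with the inputs. The following is a minimal base R sketch of such a check, assuming the layout used above (first column indexes `Y`, second column indexes `X`); the variables `n_traits` and `n_genotypes` are hypothetical stand-ins for `length(Y)` and `length(X_arbitrary)`, and the check is not something the package is known to require.

```r
# Same dictionary as in the example: column 1 indexes Y, column 2 indexes X.
dict_YX <- cbind(1:5, c(1, 1, 2, 2, 2))

# Hypothetical stand-ins for length(Y) and length(X_arbitrary).
n_traits <- 5
n_genotypes <- 2

# Every trait appears exactly once, and every X index points at an existing matrix.
valid <- ncol(dict_YX) == 2 &&
  setequal(dict_YX[, 1], seq_len(n_traits)) &&
  !any(duplicated(dict_YX[, 1])) &&
  all(dict_YX[, 2] %in% seq_len(n_genotypes))
stopifnot(valid)

# Group trait indices by the genotype matrix they are mapped to.
traits_by_X <- split(dict_YX[, 1], dict_YX[, 2])
traits_by_X
```

Here `traits_by_X` groups traits 1 and 2 under the first genotype matrix and traits 3, 4, and 5 under the second, matching the mapping described in the comments of the example chunk.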

Key features of dictionary-based mapping:
- Allows you to organize and filter your data based on metadata
- Provides a structured way to connect SNPs to genes or other features
- Can incorporate genomic positions, functional annotations, etc.
- Particularly useful for large-scale analyses with many variants and traits


## Conclusion

ColocBoost provides flexible methods for individual-level colocalization analysis across multiple formats. By working directly with raw genotype and phenotype data, you gain greater statistical power and more detailed insights compared to summary statistics-based approaches.
ColocBoost provides flexible methods for individual-level colocalization analysis across multiple formats.
By working directly with raw genotype and phenotype data, you gain greater statistical power and more detailed insights compared to summary statistics-based approaches.

For more advanced usage and detailed explanations, please refer to: