diff --git a/R/README.md b/R/README.md
new file mode 100644
index 0000000..94d525e
--- /dev/null
+++ b/R/README.md
@@ -0,0 +1,28 @@
+
+## Data
+
+ - `data.R`: Contains example datasets for testing and demonstration.
+
+## Source code structure for developers
+
+The implementation of the ColocBoost algorithm roughly follows the structure of the ColocBoost paper.
+That is, we introduce a multi-task regression problem for L traits, followed by the dynamic coupling strategy with the SEC learner,
+and finally the assembly and inference steps. Implementation-wise,
+
+
+1. `colocboost.R` implements the main interface function that users interact with directly.
+
+2. `colocboost_workhorse.R`: The core interface of the dynamic coupling strategy with the SEC learner.
+    - `colocboost_check_update_jk.R`: The strategy to determine the best update variant for a subset of traits.
+    - `colocboost_update.R`: The single effect learner/coupler (SEC) for the best update variant and traits.
+    - `colocboost_one_causal.R`: The special case of ColocBoost under the per-trait-per-causal assumption, with or without LD information.
+
+3. `colocboost_assemble.R` implements the core interface for assembling the SEC learners from 2 and performing inference on them.
+    - `colocboost_assemble_cos.R`: The function to create 95% CoS for different colocalization events.
+    - `colocboost_assemble_ucos.R`: The function to create 95% CS for trait-specific effects.
+    - `colocboost_inference.R`: Post-inference functions, including the modularity-based hierarchical clustering method, removal of spurious signals, the definition of colocalization evidence, etc.
+    - `colocboost_utils.R`: Utility functions, including refinement of colocalization confidence sets from different SECs and other utilities.
+    - `colocboost_output.R`: Formats and exports analysis results.
+
+4. `colocboost_plot.R` implements various options for visualizing ColocBoost results.
+
diff --git a/_pkgdown.yml b/_pkgdown.yml
index 2015fba..2ffe0fa 100644
--- a/_pkgdown.yml
+++ b/_pkgdown.yml
@@ -3,6 +3,22 @@ url: https://statfungen.github.io/colocboost
 template:
   bootstrap: 5
 
+navbar:
+  left:
+    - text: "Home"
+      href: index.html
+    - text: "News"
+      href: articles/announcements.html
+    - text: "Installation"
+      href: articles/installation.html
+    - text: "Tutorials"
+      href: articles/index.html
+    - text: "Functions"
+      href: reference/index.html
+  right:
+    - icon: fa-github
+      href: https://github.com/StatFunGen/colocboost
+
 reference:
 - title: "Example Data"
   desc: "Example datasets for demonstration and testing"
diff --git a/index.md b/index.md
new file mode 100644
index 0000000..a76123f
--- /dev/null
+++ b/index.md
@@ -0,0 +1,43 @@
+[![Codecov test coverage](https://codecov.io/gh/StatFunGen/colocboost/branch/main/graph/badge.svg)](https://codecov.io/gh/StatFunGen/colocboost?branch=main)
+[![CRAN Version](https://www.r-pkg.org/badges/version/colocboost)](https://cran.r-project.org/package=colocboost)
+
+![](man/figures/colocboost.png)
+
+
+This R package implements ColocBoost --- motivated and designed for colocalization analysis of multiple genetic association studies --- as a multi-task learning approach to variable selection regression with highly correlated predictors and sparse effects, based on frequentist statistical inference. It provides statistical evidence to identify which subsets of predictors have non-zero effects on which subsets of response variables.
+
+## Quick Start
+
+### CRAN Installation
+Install the released version from CRAN; pre-built packages are available for macOS and Windows.
+
+```r
+install.packages("colocboost")
+```
+
+### GitHub Installation
+Install the development version from GitHub.
+
+```r
+devtools::install_github("StatFunGen/colocboost")
+```
+
+For detailed installation guidance, please refer to [https://statfungen.github.io/colocboost/articles/installation.html](https://statfungen.github.io/colocboost/articles/installation.html).
+
+
+
+## Tutorial Website
+
+Learn how to perform colocalization analysis with step-by-step examples. For detailed tutorials and use cases, explore our FIXME (link).
+
+
+## Citation
+
+If you use ColocBoost in your research, please cite:
+
+Cao X, Sun H, Feng R, Mazumder R, Najar CFB, Li YI, de Jager PL, Bennett D, The Alzheimer's Disease Functional Genomics Consortium, Dey KK, Wang G. (2025+). Integrative multi-omics QTL colocalization maps regulatory architecture in aging human brain. bioRxiv. [https://doi.org/](https://doi.org/)
+
+
+## License
+
+This package is released under the MIT License.
diff --git a/man/figures/colocboost.png b/man/figures/colocboost.png
new file mode 100644
index 0000000..6c976f8
Binary files /dev/null and b/man/figures/colocboost.png differ
diff --git a/vignettes/Individual_Level_Colocalization.Rmd b/vignettes/Individual_Level_Colocalization.Rmd
index d269ff1..467d9d3 100644
--- a/vignettes/Individual_Level_Colocalization.Rmd
+++ b/vignettes/Individual_Level_Colocalization.Rmd
@@ -14,7 +14,8 @@ knitr::opts_chunk$set(
 )
 ```
 
-This vignette demonstrates how to perform colocalization analysis using individual-level data with multiple traits in ColocBoost, specifically focusing on the `Ind_5traits` dataset included in the package.
+This vignette demonstrates how to perform multi-trait colocalization analysis using individual-level data in ColocBoost,
+specifically focusing on the `Ind_5traits` dataset included in the package.
 
 ```{r setup}
@@ -22,30 +23,46 @@ library(colocboost)
 ```
 
-## Example Data – Ind_5traits
+# 1. The `Ind_5traits` Dataset
 
-The `Ind_5traits` dataset contains 5 simulated phenotypes alongside corresponding genotype matrices. 
-The dataset is specifically designed to facilitate the identification of causal variants for complex traits.
+The `Ind_5traits` dataset contains 5 simulated phenotypes alongside corresponding genotype matrices. The dataset is specifically designed for evaluating and demonstrating the capabilities of ColocBoost in multi-trait colocalization analysis with individual-level data.
+
+- `X`: A list of genotype matrices for different outcomes.
+- `Y`: A list of phenotype vectors for different outcomes.
+- `true_effect_variants`: True effect variant indices for each trait.
 
-- `X`: A list of genotype matrices for different outcomes. Each matrix with a dimension of (individuals × variables)
-- `Y`: A list of phenotype vectors for different outcomes. Each matrix with a dimension of (individuals × 1)
-- `true_effect_variants`: True effect variable indices for each trait.
+### Causal variant structure
+The dataset features two causal variants with indices 644 and 2289.
+
+- Causal variant 644 is associated with traits 1, 2, 3, and 4.
+- Causal variant 2289 is associated with traits 2, 3, and 5.
+
+This structure creates a realistic scenario where multiple traits are influenced by different but overlapping sets of genetic variants.
 
 ```{r load-example-data}
-# Load the example data
+# Load the dataset
 data(Ind_5traits)
 names(Ind_5traits)
 Ind_5traits$true_effect_variants
 ```
-There are two
 
-## Multiple X and Multiple Y
+# 2. Run ColocBoost (Basic usage)
+
+
+The preferred input format for colocalization analysis with individual-level data in ColocBoost is one where genotype (X) and phenotype (Y) data are properly matched.
+
+- **Basic format**: `X` and `Y` are organized as lists, matched by trait index:
+    - `(X[[1]], Y[[1]])` contains the data for the individuals of trait 1,
+    - `(X[[2]], Y[[2]])` contains the data for the individuals of trait 2,
+    - and so on for each trait under analysis.
+- **Cross-trait flexibility**:
+    - The same individuals are not required across different traits. This allows for the analysis of traits with different sample sizes.
+    - This is particularly useful when you have a large dataset with many traits and want to focus on specific individuals for each trait.
 
-The default and preferred format for colocalization analysis in ColocBoost is where genotype (X) and phenotype (Y) data are matched by individual:
 
 ```{r multiple-matched}
 # Extract genotype (X) and phenotype (Y) data
@@ -55,19 +72,28 @@ Y <- Ind_5traits$Y
 # Run colocboost with matched data
 res <- colocboost(X = X, Y = Y)
 
-# View results
-str(res)
+# Identified CoS
+res$cos_details$cos$cos_index
 ```
 
-Key requirements for this format:
-- Both X and Y must have the same number of rows (individuals)
-- Individuals must be in the same order in both matrices
-- Covariates Z should also have the same number of rows
-- To analyze multiple traits, you can loop over the columns of Y or use the ColocMultiBoost function
 
-## Single X (List) and Multiple Y (Matrix Form)
+### Results Interpretation
+
+For comprehensive tutorials on result interpretation and advanced visualization techniques, please visit our documentation portal at FIXME (link).
+
+
+# 3. Run ColocBoost (Advanced usage)
+
+## 3.1. Single genotype matrix
+
+When studying multiple traits with a common genotype matrix, such as gene expression in different tissues or cell types,
+ColocBoost provides an interface that takes a single genotype matrix with multiple phenotypes.
+This is particularly useful when the same individuals are used for different traits, allowing for efficient analysis without redundancy.
+
+- **Input Format**:
+    - `X` is a single matrix containing genotype data for all individuals.
+    - `Y` can be either (i) an N × L matrix (N individuals, L traits) or (ii) a list of phenotype vectors for the L traits.
 
-When you want to focus on a single variant across multiple traits:
 
 ```{r single-x}
 # Extract a single SNP (as a vector)
@@ -76,43 +102,41 @@ X_single <- X[[1]]  # First SNP for all individuals
 # Run colocboost
 res <- colocboost(X = X, Y = Y)
 
-# View results for the first trait
-str(res)
+# Identified CoS
+res$cos_details$cos$cos_index
 ```
 
-Key aspects of this approach:
-- You can keep X as a single-column matrix or vector
-- For multiple traits, you can loop over the columns of Y or use dedicated functions
-- Covariates are applied consistently across all analyses
+## 3.2. Arbitrary X and Y with dictionary provided
+
+When studying multiple traits with arbitrary genotype matrices for different traits,
+ColocBoost also provides an interface for arbitrary genotype matrices with multiple phenotypes.
+This particularly benefits meta-analysis across heterogeneous datasets where, for different subsets of traits,
+genotype data come from different genotyping platforms or sequencing technologies.
 
-## Multiple X and Multiple Y – Dictionary Provided
+- **Input Format**:
+    - `X` is a list of genotype matrices.
+    - `Y` is a list of phenotype vectors.
+    - `dict_YX` is a two-column dictionary matrix that maps each index of `Y` (first column) to the index of its genotype matrix in `X` (second column).
 
-When you need to map between different X and Y variables using a dictionary:
 
 ```{r dictionary-mapped}
 # Create a simple dictionary for demonstration purposes
-X_two <- X[c(1,3)]
+X_arbitrary <- X[c(1,3)] # traits 1 and 2 are matched to the first genotype matrix; traits 3, 4, and 5 to the third genotype matrix
 dict_YX = cbind(c(1:5), c(1,1,2,2,2))
 
 # Display the dictionary
 dict_YX
 
 # Run colocboost
-res <- colocboost(X = X_two, Y = Y, dict_YX = dict_YX)
+res <- colocboost(X = X_arbitrary, Y = Y, dict_YX = dict_YX)
 
-# View results for the first trait
-str(res)
+# Identified CoS
+res$cos_details$cos$cos_index
 ```
 
-Key features of dictionary-based mapping:
-- Allows you to organize and filter your data based on metadata
-- Provides a structured way to connect SNPs to genes or other features
-- Can incorporate genomic positions, functional annotations, etc.
-- Particularly useful for large-scale analyses with many variants and traits
-
 ## Conclusion
 
-ColocBoost provides flexible methods for individual-level colocalization analysis across multiple formats. By working directly with raw genotype and phenotype data, you gain greater statistical power and more detailed insights compared to summary statistics-based approaches.
+ColocBoost provides flexible methods for individual-level colocalization analysis across multiple formats.
+By working directly with raw genotype and phenotype data, you gain greater statistical power and more detailed insights compared to summary statistics-based approaches.
 
-For more advanced usage and detailed explanations, please refer to: