diff --git a/_pkgdown.yml b/_pkgdown.yml index 6af6147..d4a3c28 100644 --- a/_pkgdown.yml +++ b/_pkgdown.yml @@ -27,16 +27,16 @@ articles: - Interpret_ColocBoost_Output - Visualization_ColocBoost_Output - Partial_Overlap_Variants + - ColocBoost_Wrapper_Pipeline + - LD_Free_Colocalization + - ColocBoost_Diagnostics + - FineBoost_Special_Case - title: internal contents: - announcements - installation - - ColocBoost_tutorial_diagnostic - - LD_Free_Colocalization - - ColocBoost_Wrapper - Pairwise_Colocalization - - FineBoost_Special_Case reference: - title: "Example Data" diff --git a/vignettes/ColocBoost_Diagnostics.Rmd b/vignettes/ColocBoost_Diagnostics.Rmd new file mode 100644 index 0000000..da9d886 --- /dev/null +++ b/vignettes/ColocBoost_Diagnostics.Rmd @@ -0,0 +1,49 @@ +--- +title: "Diagnostic for ColocBoost" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Diagnostic for ColocBoost} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + + +This vignette demonstrates how to perform multi-trait colocalization analysis using diagnosic details in ColocBoost, +including model parameters, model fitting, and model performance metrics. There is `diagnostic_details` in ColocBoost output when setting `output_level = 3`: + + +```{r setup} +library(colocboost) +# Run colocboost with diagnostic details +data(Ind_5traits) +res <- colocboost(X = Ind_5traits$X, Y = Ind_5traits$Y, output_level = 3) +names(res) +``` + +# 1. Diagnostic details of the model parameters + +- **`cb_model_para`**: parameters used in fitting ColocBoost model. + + +```{r cb-model} +names(res$diagnostic_details$cb_model_para) +``` + + +# 2. Diagnostic details of the model fitting + +- **`cb_model`**: trait-specific proximity gradient boosting model, including proximity weight at each iteration, residual after gradient boosting, et al. +- **`weights_paths``**: individual trait-specific weights for each iteration. + +```{r cb-model-para} +names(res$diagnostic_details$cb_model) +names(res$diagnostic_details$cb_model$ind_outcome_1) +``` + diff --git a/vignettes/ColocBoost_Wrapper.Rmd b/vignettes/ColocBoost_Wrapper.Rmd deleted file mode 100644 index e69de29..0000000 diff --git a/vignettes/ColocBoost_Wrapper_Pipeline.Rmd b/vignettes/ColocBoost_Wrapper_Pipeline.Rmd new file mode 100644 index 0000000..412641c --- /dev/null +++ b/vignettes/ColocBoost_Wrapper_Pipeline.Rmd @@ -0,0 +1,138 @@ +--- +title: "ColocBoost Wrapper Pipeline" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{ColocBoost Wrapper Pipeline} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + + +This vignette demonstrates how to use the ColocBoost wrapper pipeline to perform colocalization analysis using `colocboost`. +See more details about functions in package `pecotmr` with [link](https://github.com/StatFunGen/pecotmr/tree/main) and +`colocboost_pipeline` with [link](https://github.com/StatFunGen/pecotmr/blob/main/R/colocboost_pipeline.R). +See more etails about input data preparation in `xqtl_protocal` with [https://statfungen.github.io/xqtl-protocol/code/mnm_analysis/mnm_methods/colocboost.html]. + + +# 1. Loading Data use `colocboost_analysis_pipeline` function + + +This function harmonizes the input data and prepares it for colocalization analysis. +In this section, we introduce how to load the regional data required for the ColocBoost analysis using the `load_multitask_regional_data` function. +This function loads a mixture data sets for a specific region, including individual-level data (genotype, phenotype, covariate data) +or summary statistics (sumstats, LD). Run \code{load_regional_univariate_data} and \code{load_rss_data} multiple times for different datasets + + +Below are the input parameters for this function: + +## 1.1. Loading individual level data from multiple corhorts + + +- **`region`** (required): A string of `chr:start-end` for the phenotype region you want to analyzed. +- **`genotype_list`**: A vector of path for PLINK bed files containing genotype data. +- **`phenotype_list`**: A vector of path for phenotype file names. +- **`covariate_list`**: A vector of path for covariate file names corresponding to the phenotype file vector. +- **`conditions_list_individual`**: A vector of strings representing different conditions or groups. +- **`match_geno_pheno`**: A vector of indices of phenotypes matched to genotype if multiple genotype PLINK files are used. +- **`maf_cutoff`**: Minimum minor allele frequency (MAF) cutoff. Default is 0. +- **`mac_cutoff`**: Minimum minor allele count (MAC) cutoff. Default is 0. +- **`xvar_cutoff`**: Minimum variance cutoff. Default is 0. +- **`imiss_cutoff`**: Maximum individual missingness cutoff. Default is 0. +- **`association_window`**: A string of `chr:start-end` for the association analysis window (cis or trans). If not provided, all genotype data will be loaded. +- **`extract_region_name`**: A list of vectors of strings (e.g., gene ID `ENSG00000269699`) to subset the information when there are multiple regions available. Default is `NULL`. +- **`region_name_col`**: Column name containing the region name. Default is `NULL`. +- **`keep_indel`**: Logical indicating whether to keep insertions/deletions (INDELs). Default is `TRUE`. +- **`keep_samples`**: A vector of sample names to keep. Default is `NULL`. +- **`phenotype_header`**: Number of rows to skip at the beginning of the transposed phenotype file (default is 4 for `chr`, `start`, `end`, and `ID`). +- **`scale_residuals`**: Logical indicating whether to scale residuals. Default is `FALSE`. +- **`tabix_header`**: Logical indicating whether the tabix file has a header. Default is `TRUE`. + + +**Illustrated example** + +The following example demonstrates how to set up an input with 3 phenotypes and 2 cohorts, where the first cohort has 2 phenotypes and the second cohort has 1 phenotype. + +```{r, data-loader-individual} +# Example of loading individual-level data +region = "chr1:1000000-2000000" +genotype_list = c("plink_cohort1.1.bed", "plink_cohort1.2.bed") +phenotype_list = c("phenotype1_cohort1.bed.gz", "phenotype2_cohort1.bed.gz", "phenotype1_cohort2.bed.gz") +covariate_list = c("covariate1_cohort1.bed.gz", "covariate2_cohort1.bed.gz", "covariate1_cohort2.bed.gz") +conditions_list_individual = c("phenotype1_cohort1", "phenotype2_cohort1", "phenotype1_cohort2") +match_geno_pheno = c(1,1,2) # indices of phenotypes matched to genotype +association_window = "chr1:1000000-2000000" # same as region for cis-analysis + +# Following parameters need to be set according to your data +maf_cutoff = 0.01 +mac_cutoff = 10 +xvar_cutoff = 0 +imiss_cutoff = 0.9 + +# More advanced parameters see pecotmr::load_multitask_regional_data() +``` + + + +## 1.2. Loading summary statistics from multiple corhorts or data set + +- **`sumstat_path_list`**: A vector of file paths to the summary statistics. +- **`column_file_path_list`**: A vector of file paths to the column mapping files. +- **`LD_meta_file_path_list`**: A vector of paths to LD metadata files. LD metadata is a data frame specifying LD blocks with columns "chrom", "start", "end", and "path". "start" and "end" denote the positions of LD blocks, and "path" is the path of each LD block, optionally including bim file paths. +- **`match_LD_sumstat`**: Logical indicating whether to match LD blocks with summary statistics. +- **`conditions_list_sumstat`**: A vector of strings representing different sumstats. +- **`n_samples`**: User-specified sample size. If unknown, set as 0 to retrieve from the sumstat file. +- **`n_cases`**: User-specified number of cases. +- **`n_controls`**: User-specified number of controls. +- **`region`**: The region where tabix is used to subset the input dataset. +- **`extract_sumstats_region_name`**: User-specified gene/phenotype name used to further subset the phenotype data. +- **`sumstats_region_name_col`**: Filter this specific column for the `extract_sumstats_region_name`. +- **`comment_string`**: Comment sign in the column mapping file, default is `#`. +- **`extract_coordinates`**: Optional data frame with columns "chrom" and "pos" for specific coordinates extraction. + +**Illustrated example** + +The following example demonstrates how to set up an input with 2 summary and one LD reference. + +```{r, data-loader-sumstat} +# Example of loading summary statistics +sumstat_path_list = c("sumstat1.tsv.gz", "sumstat2.tsv.gz") +column_file_path_list = c("mapping_columns_1.yml", "mapping_columns_2.yml") +LD_meta_file_path_list = "ld_meta_file.tsv" +covariate_list = c("covariate1_cohort1.bed.gz", "covariate2_cohort1.bed.gz", "covariate1_cohort2.bed.gz") +conditions_list_sumstat = c("sumstat_1", "sumstat_2") + +# Following parameters need to be set according to your data +n_samples = c(0, 0) +n_cases = c(10000, 20000) +n_controls = c(20000, 40000) + +# More advanced parameters see pecotmr::load_multitask_regional_data() +``` + + + +# Perform ColocBoost using `colocboost_analysis_pipeline` function + +In this section, we perform the colocalization analysis using the `colocboost_analysis_pipeline` function. Below are the input parameters for this function: + + +- **`region_data`**: The output of the `load_multitask_regional_data` function, which contains harmonized summary statistics and LD matrices for the region of interest. +- **`focal_trait`**: Name of trait if performing disease prioritized ColocBoost. +- **`event_filters`**: A list of patterns for filtering events based on context names. +Example: for sQTL, list(type_pattern = ".*clu_(\\d+_[+-?]).*", valid_pattern = "clu_(\\d+_[+-?]):PR:", exclude_pattern = "clu_(\\d+_[+-?]):IN:"). +- **`maf_cutoff`**: A scalar to remove variants with maf < maf_cutoff, default is 0.005. +- **`pip_cutoff_to_skip_ind`**: A vector of cutoff values for skipping analysis based on PIP values for each context. Default is 0. +- **`pip_cutoff_to_skip_sumstat`**: A vector of cutoff values for skipping analysis based on PIP values for each sumstat. Default is 0. +- **`qc_method`**: Quality control method to use. Options are "rss_qc", "dentist", or "slalom" (default: "rss_qc"). +- **`impute`**: Logical; if TRUE, performs imputation for outliers identified in the analysis (default: TRUE). +- **`impute_opts`**: A list of imputation options including rcond, R2_threshold, and minimum_ld (default: list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5)). + + + diff --git a/vignettes/ColocBoost_tutorial_diagnostic.Rmd b/vignettes/ColocBoost_tutorial_diagnostic.Rmd deleted file mode 100644 index 7ed04b9..0000000 --- a/vignettes/ColocBoost_tutorial_diagnostic.Rmd +++ /dev/null @@ -1,34 +0,0 @@ ---- -title: "ColocBoost Tutortial (Diagnostic)" -output: rmarkdown::html_vignette -vignette: > - %\VignetteIndexEntry{ColocBoost Tutortial (Diagnostic)} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r, include = FALSE} -knitr::opts_chunk$set( - collapse = TRUE, - comment = "#>" -) -``` - -ColocBoost: Multi-omics xQTL colocalization improves the discovery of causal variants for complex diseases. - - - -```{r setup} -library(colocboost) -``` - - -## Diagnostic for the post-processing to filter the small effects - -## Diagnostic for add-hoc colocalized sets based on between-purity - -## Diagnostic for checking the updates in each gradient-boosting iteration - - - - diff --git a/vignettes/FineBoost_Special_Case.Rmd b/vignettes/FineBoost_Special_Case.Rmd index cf2216f..7efa73b 100644 --- a/vignettes/FineBoost_Special_Case.Rmd +++ b/vignettes/FineBoost_Special_Case.Rmd @@ -27,6 +27,9 @@ library(colocboost) # 1. Fine-mapping with individual-level data +In this section, we demonstrate how to perform fine-mapping using individual-level genotype (`X`) and phenotype (`Y`) data. +This approach directly uses the raw data to identify causal variants. + ```{r load-example-indiviudal} # Load example data @@ -41,6 +44,8 @@ colocboost_plot(res) # 2. Fine-mapping with summary statistics +This section demonstrates fine-mapping analysis using summary statistics and a proper LD matrix. + ```{r load-example-sumstat} # Load example data data(Sumstat_5traits) @@ -53,6 +58,9 @@ colocboost_plot(res) # 3. LD-free fine-mapping with one causal variant assumption +In scenarios where LD information is unavailable, FineBoost can still perform fine-mapping under the assumption of a single causal variant. +This approach is less computationally intensive but assumes that only one variant in a region is causal. + ```{r ld-free} # Load example data res <- colocboost(sumstat = sumstat) diff --git a/vignettes/LD_Free_Colocalization.Rmd b/vignettes/LD_Free_Colocalization.Rmd index e69de29..6f8d330 100644 --- a/vignettes/LD_Free_Colocalization.Rmd +++ b/vignettes/LD_Free_Colocalization.Rmd @@ -0,0 +1,90 @@ +--- +title: "LD mismatch and LD-free Colocalization" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{LD mismatch and LD-free Colocalization} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +``` + + +This vignette demonstrates LD mismatch diagnosis in the `colocboost` package and how to perform LD-free colocalization analysis, +when some traits completely lack LD information or share only partial variant coverage with other traits. + + +```{r setup} +library(colocboost) +``` + +# 1. LD mismatch diagnosis + +ColocBoost provides the diagnostic warnings for LD matrix is not compatible with the summary statistics. + +- Estimated residual variance of the model is negative or greater than phenotypic variance (`rtr < 0` or `rtr > var_y`; +see details in Supplementary Note S3.5.2). +- Change in trait-specific profile log-likelihood according to a CoS is negative (see details in Supplementary Note S3.5.3). + +In this example, we create a simulated dataset with LD mismatch by changing the sign of Z-scores for 1% variants of each trait. + +```{r LD-mismatch} +# Create a simulated dataset with LD mismatch +data("Sumstat_5traits") +data("Ind_5traits") +LD <- get_cormat(Ind_5traits$X[[1]]) + +# Change sign of Z-score for 1% variants of each trait by including mismatched with LD +set.seed(1) +miss_prop <- 0.01 +sumstat <- lapply(Sumstat_5traits$sumstat, function(ss){ + p <- nrow(ss) + pos_miss <- sample(1:p, ceiling(miss_prop*p)) + ss$z[pos_miss] <- -ss$z[pos_miss] + return(ss) +}) +``` + +There may two types of warnings when running `colocboost` with LD mismatch: + +```{r LD-mismatch-runcode} +res <- colocboost(sumstat = sumstat, LD = LD) +``` + + +Note that the warning messages are not errors, and the analysis will still be performed. But the results may not be reliable due to the LD mismatch. +Additional, the computational time may be longer than expected, as the algorithm may take longer to converge due to the LD mismatch. + + +# 2. LD-free and LD-mismatch colocalization analysis + +ColocBoost can perform LD-free colocalization analysis when some traits completely lack LD information with one causal per trait per region assumption. +There are two ways to perform LD-free and LD-mismatch colocalization analysis: + +- **LD-mismatch** (recommended): when LD matrix is esimated from out-of sample LD reference (smaller sample size or different population), +summary statistics are not compatible with the LD matrix but the LD matrix is still informative for indicating the correlations among different variants. +We recommended to perform colocalization analysis with LD matrix and only 1 iteration of gradient boosting. +In this case, the LD matrix is only used in checking the equivalence among the trait-specific best update variants. +Specially, using the LD matrix for colocalization analysis can help to improve the accuracy of the results. + +```{r LD-mismatch-one-iter} +res_mismatch <- colocboost(sumstat = sumstat, LD = LD, M=1) +``` + + +- **LD-free**: when some traits completely lack LD information, the LD matrix is not required. + +```{r LD-free} +res_free <- colocboost(sumstat = sumstat) +``` + + + + + + diff --git a/vignettes/Pairwise_Colocalization.Rmd b/vignettes/Pairwise_Colocalization.Rmd index e69de29..7b8d6dc 100644 --- a/vignettes/Pairwise_Colocalization.Rmd +++ b/vignettes/Pairwise_Colocalization.Rmd @@ -0,0 +1,15 @@ +--- +title: "Pairwise Colocalization with Flexible Input Formats" +output: rmarkdown::html_vignette +vignette: > + %\VignetteIndexEntry{Pairwise Colocalization with Flexible Input Formats} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +```