diff --git a/vignettes/ColocBoost_Wrapper_Pipeline.Rmd b/vignettes/ColocBoost_Wrapper_Pipeline.Rmd index 6b5a5a4..d9b8cee 100644 --- a/vignettes/ColocBoost_Wrapper_Pipeline.Rmd +++ b/vignettes/ColocBoost_Wrapper_Pipeline.Rmd @@ -15,7 +15,7 @@ knitr::opts_chunk$set( ``` -This vignette demonstrates how to use the bioinformatics pipeline for ColocBoost to perform colocalization analysis with `colocboost` for multiple individual-level datasets and/or summary statistics datasets. +This vignette demonstrates how to use the bioinformatics pipeline for ColocBoost to perform colocalization analysis with `colocboost`. - See more details about functions in the package `pecotmr` with [link](https://github.com/StatFunGen/pecotmr/tree/main) and `colocboost_pipeline` with [link](https://github.com/StatFunGen/pecotmr/blob/main/R/colocboost_pipeline.R). @@ -24,58 +24,47 @@ This vignette demonstrates how to use the bioinformatics pipeline for ColocBoost # 1. Loading Data using `colocboost_analysis_pipeline` function -This function harmonizes the input data and prepares it for colocalization analysis. - +This function harmonizes the input data and prepares it for colocalization analysis. In this section, we introduce how to load the regional data required for the ColocBoost analysis using the `load_multitask_regional_data` function. -This function loads mixed datasets for a specific region, including individual-level data (genotype, phenotype, covariate data), summary statistics -(sumstats, LD), or a combination of both. It runs `load_regional_univariate_data` for each individual-level dataset and `load_rss_data` for each summary statistics dataset. -The output is a list with `individual_data` and `sumstat_data` components, where `individual_data` is a list of individual-level data and `sumstat_data` is a list of summary statistics data. -This list is then passed to the `colocboost_analysis_pipeline` function for the colocalization analysis. +This function loads mixed datasets for a specific region, including individual-level data (genotype, phenotype, covariate data) +or summary statistics (sumstats, LD). Run `load_regional_univariate_data` and `load_rss_data` multiple times for different datasets. -Below are the input parameters for this function for loading individual-level data: +Below are the input parameters for this function: ## 1.1. Loading individual-level data from multiple cohorts -inputs: -- **`region`**: String ; Genomic region of interest in the format of `chr:start-end` for the phenotype region you want to analyze. -- **`genotype_list`**: Character vector; Paths for PLINK bed files containing genotype data (do NOT include .bed suffix). -- **`phenotype_list`**: Character vector; Paths for phenotype file names. -- **`covariate_list`**: Character vector; Paths for covariate file names for each phenotype. Must have the same length as the phenotype file vector. -- **`conditions_list_individual`**: Character vector; Strings representing different conditions or groups used for naming. Must have the same length as the phenotype file vector. -- **`match_geno_pheno`**: Integer vector; Indices of phenotypes matched to genotype if multiple genotype PLINK files are used. For each phenotype file in `phenotype_list`, the index of the genotype file in `genotype_list` it matches with. -- **`association_window`**: String; Genomic region of interest in the format of `chr:start-end` for the association analysis window of variants to test (cis or trans). If not provided, all genotype data will be loaded. -- **`extract_region_name`**: List of character vectors; Phenotype names (e.g., gene ID `ENSG00000269699`) to subset the phenotype data when there are multiple phenotypes availible in the region. Must have the same length as the phenotype file vector. Default is `NULL`, which will use all phenotypes in the region. -- **`region_name_col`**: Integer; 1-based index of the column containing the region name (i.e. 4 for gene ID in a bed file). Required if `extract_region_name` is not `NULL`, or if multiple phenotypes fall into the same region in one phenotype file -- **`keep_indel`**: Logical; indicating whether to keep insertions/deletions (INDELs). Default is `TRUE`. -- **`keep_samples`**: Character vector; Sample names to keep. Default is `NULL`. Currently only supports keeping the same samples from all genotype and phenotype files. -- **`maf_cutoff`**: Numeric; Minimum minor allele frequency (MAF) cutoff. Default is 0. -- **`mac_cutoff`**: Numeric; Minimum minor allele count (MAC) cutoff. Default is 0. -- **`xvar_cutoff`**: Numeric; Minimum genotype variance cutoff. Default is 0. -- **`imiss_cutoff`**: Numeric; Maximum individual missingness cutoff. Default is 0. - -outputs: -- **`region_data`**: List (with `individual_data`, `sumstat_data`); Output of the `load_multitask_regional_data` function. If only individual-level data is loaded, `sumstat_data` will be `NULL`. +- **`region`** (required): A string of `chr:start-end` for the phenotype region you want to analyze. +- **`genotype_list`**: A vector of paths for PLINK bed files containing genotype data. +- **`phenotype_list`**: A vector of paths for phenotype file names. +- **`covariate_list`**: A vector of paths for covariate file names corresponding to the phenotype file vector. +- **`conditions_list_individual`**: A vector of strings representing different conditions or groups. +- **`match_geno_pheno`**: A vector of indices of phenotypes matched to genotype if multiple genotype PLINK files are used. +- **`maf_cutoff`**: Minimum minor allele frequency (MAF) cutoff. Default is 0. +- **`mac_cutoff`**: Minimum minor allele count (MAC) cutoff. Default is 0. +- **`xvar_cutoff`**: Minimum variance cutoff. Default is 0. +- **`imiss_cutoff`**: Maximum individual missingness cutoff. Default is 0. +- **`association_window`**: A string of `chr:start-end` for the association analysis window (cis or trans). If not provided, all genotype data will be loaded. +- **`extract_region_name`**: A list of vectors of strings (e.g., gene ID `ENSG00000269699`) to subset the information when there are multiple regions available. Default is `NULL`. +- **`region_name_col`**: Column name containing the region name. Default is `NULL`. +- **`keep_indel`**: Logical indicating whether to keep insertions/deletions (INDELs). Default is `TRUE`. +- **`keep_samples`**: A vector of sample names to keep. Default is `NULL`. -**Indivudual-level data loading example** +**Illustrated example** -The following example demonstrates how to set up input data with 3 phenotypes and 2 cohorts. The first cohort has 2 phenotypes and the second cohort has 1 phenotype. The first phenotype has 2 genes and the second phenotype has 1 gene. +The following example demonstrates how to set up input data with 3 phenotypes and 2 cohorts, where the first cohort has 2 phenotypes and the second cohort has 1 phenotype. ```{r, data-loader-individual} # Example of loading individual-level data region = "chr1:1000000-2000000" -genotype_list = c("plink_cohort1.1", "plink_cohort1.2") +genotype_list = c("plink_cohort1.1.bed", "plink_cohort1.2.bed") phenotype_list = c("phenotype1_cohort1.bed.gz", "phenotype2_cohort1.bed.gz", "phenotype1_cohort2.bed.gz") covariate_list = c("covariate1_cohort1.bed.gz", "covariate2_cohort1.bed.gz", "covariate1_cohort2.bed.gz") conditions_list_individual = c("phenotype1_cohort1", "phenotype2_cohort1", "phenotype1_cohort2") -match_geno_pheno = c(1,1,2) -association_window = "chr1:1000000-2000000" # set to be the same as region for cis-analysis -extract_region_name = list(c("ENSG00000269699, ENSG00000789633"), c("ENSG00000269699"), c("ENSG00000269699", "ENSG00000789633")) -region_name_col = 4 -keep_indel = TRUE -keep_samples = c("SAMPLE1", "SAMPLE2", "SAMPLE3") +match_geno_pheno = c(1,1,2) # indices of phenotypes matched to genotype +association_window = "chr1:1000000-2000000" # same as region for cis-analysis # Following parameters need to be set according to your data maf_cutoff = 0.01 @@ -84,197 +73,67 @@ xvar_cutoff = 0 imiss_cutoff = 0.9 # More advanced parameters see pecotmr::load_multitask_regional_data() - -region_data_individual <- load_multitask_regional_data( - region = region, - genotype_list = genotype_list, - phenotype_list = phenotype_list, - covariate_list = covariate_list, - conditions_list_individual = conditions_list_individual, - match_geno_pheno = match_geno_pheno, - association_window = association_window, - region_name_col = region_name_col, - extract_region_name = extract_region_name, - keep_indel = keep_indel, - keep_samples = keep_samples, - maf_cutoff = maf_cutoff, - mac_cutoff = mac_cutoff, - xvar_cutoff = xvar_cutoff, - imiss_cutoff = imiss_cutoff -) - ``` ## 1.2. Loading summary statistics from multiple cohorts or datasets -inputs: -- **`sumstat_path_list`**: Character vector; Paths to the summary statistics. -- **`column_file_path_list`**: Character vector; Paths to the column mapping files. See below for expected format. -- **`LD_meta_file_path_list`**: Character vector; Paths to LD metadata files. See below for expected format. -- **`conditions_list_sumstat`**: Character vector; Strings representing different sumstats used for naming. Must have the same length as the sumstat file vector. -- **`match_LD_sumstat`**: List of character vectors; Mapping each LD metadata file to the summary-statistics conditions to pair with it. Length must equal `LD_meta_file_path_list`. Each element is a character vector of names present in `conditions_list_sumstat`. If omitted or an element is empty, defaults to all conditions for the first LD. -- **`association_window`**: String; Genomic region of interest in the format of `chr:start-end` for the association analysis window of variants to test (cis or trans). If not provided, all genotype data will be loaded. -- **`n_samples`**: Integer vector; Sample size. Set a 0 if `n_cases`/`n_controls` are passed explicitly. If unknown, set as 0 and include `n_samples` column in the column mapping file to retrieve from the sumstat file. -- **`n_cases`**: Integer vector; Number of cases. Set a 0 if `n_samples` is passed explicitly. If unknown, set as 0 and include `n_cases` column in the column mapping file to retrieve from the sumstat file. -- **`n_controls`**: Integer vector; Number of controls. Set a 0 if `n_samples` is passed explicitly. If unknown, set as 0 and include `n_controls` column in the column mapping file to retrieve from the sumstat file. - -outputs: -- **`region_data`**: List (with `individual_data`, `sumstat_data`); Output of the `load_multitask_regional_data` function. If only summary statistics data is loaded, `individual_data` will be `NULL`. - -**Summary statistics loading example** +- **`sumstat_path_list`**: A vector of file paths to the summary statistics. +- **`column_file_path_list`**: A vector of file paths to the column mapping files. +- **`LD_meta_file_path_list`**: A vector of paths to LD metadata files. LD metadata is a data frame specifying LD blocks with columns "chrom", "start", "end", and "path". +"start" and "end" denote the positions of LD blocks, and "path" is the path of each LD block, optionally including `bim` file paths. +- **`match_LD_sumstat`**: Logical indicating whether to match LD blocks with summary statistics. +- **`conditions_list_sumstat`**: A vector of strings representing different sumstats. +- **`n_samples`**: User-specified sample size. If unknown, set as 0 to retrieve from the sumstat file. +- **`n_cases`**: User-specified number of cases. +- **`n_controls`**: User-specified number of controls. +- **`region`**: The region where tabix is used to subset the input dataset. +- **`extract_sumstats_region_name`**: User-specified gene/phenotype name used to further subset the phenotype data. +- **`sumstats_region_name_col`**: Filter this specific column for the `extract_sumstats_region_name`. +- **`comment_string`**: Comment sign in the column mapping file, default is `#`. +- **`extract_coordinates`**: Optional data frame with columns "chrom" and "pos" for specific coordinates extraction. + +**Illustrated example** The following example demonstrates how to set up input data with 2 summary statistics and one LD reference. ```{r, data-loader-sumstat} # Example of loading summary statistics sumstat_path_list = c("sumstat1.tsv.gz", "sumstat2.tsv.gz") -column_file_path_list = c("column_mapping_sumstat1.yml", "column_mapping_sumstat2.yml") -LD_meta_file_path_list = c("ld_meta_file.tsv") +column_file_path_list = c("mapping_columns_1.yml", "mapping_columns_2.yml") +LD_meta_file_path_list = "ld_meta_file.tsv" +covariate_list = c("covariate1_cohort1.bed.gz", "covariate2_cohort1.bed.gz", "covariate1_cohort2.bed.gz") conditions_list_sumstat = c("sumstat_1", "sumstat_2") -match_LD_sumstat = c("sumstat_1", "sumstat_2") -association_window = "chr1:1000000-2000000" # Following parameters need to be set according to your data -n_samples = c(300000, 0) -n_cases = c(0, 20000) -n_controls = c(0, 40000) - +n_samples = c(0, 0) +n_cases = c(10000, 20000) +n_controls = c(20000, 40000) # More advanced parameters see pecotmr::load_multitask_regional_data() - -region_data_sumstat <- load_multitask_regional_data( - sumstat_path_list = sumstat_path_list, - column_file_path_list = column_file_path_list, - LD_meta_file_path_list = LD_meta_file_path_list, - conditions_list_sumstat = conditions_list_sumstat, - match_LD_sumstat = match_LD_sumstat, - association_window = association_window, - n_samples = n_samples, - n_cases = n_cases, - n_controls = n_controls -) -``` - - - -**Expected format for column mapping file** -The column mapping file is YAML (`.yml`) with key: value pairs mapping your input column names to the standardized names expected by the loader. -Required columns are `chrom`, `pos`, `A1`, and `A2`, and either `z` or `beta` and `sebeta`. -Either 'n_case' and 'n_control' or 'n_samples' can be passed as part of the column mapping, but will be overwritten by the n_cases and n_controls or n_samples parameterspassed explicitly. -```yaml -# required -chrom: chromosome -pos: position -A1: effect_allele -A2: non_effect_allele - -z: zscore -# or -beta: beta -sebeta: sebeta - -# optional, will be overridden by n_samples or n_cases and n_controls if passed explicitly -n_case: NCASE -n_control: NCONTROL -# or -n_sample: N ``` -**Expected format for LD metadata file** -LD files sould be in the format generated by for instance `plink --r squared`, then xz compressed. -The LD metadata file is a tab-separated file with the following columns: -- `chrom`: chromosome -- `start`: start position -- `end`: end position -- `ld_path, bim_path`: path to the LD block file and bim file -``` -1 1000000 2000000 ld_block_1.ld.gz,ld_block_1.bim -``` - # 2. Perform ColocBoost using `colocboost_analysis_pipeline` function -In this section, we load region data for a combination of individual-level and summary statistics data, then perform the colocalization analysis using the `colocboost_analysis_pipeline` function. -The colocalization analysis can be run in any one of three modes, or in a combination of these modes (names assume that individual-level data is xQTL data and summary statistics data is GWAS data): +In this section, we perform the colocalization analysis using the `colocboost_analysis_pipeline` function. Below are the input parameters for this function: -- **`xQTL-only mode`**: Only perform colocalization analysis on the individual-level data. Summary statistics data is not used. -- **`joint GWAS mode`**: Perform colocalization analysis in disease-agnostic mode on the individual-level and summary statistics data together. -- **`separate GWAS mode`**: Perform colocalization analysis in disease-prioritized mode on the the individual-level data and each summary statistics dataset separately, treating each summary statistics dataset as the focal trait. -inputs: -- **`region_data`**: List (with `individual_data`, `sumstat_data`); Output of the `load_multitask_regional_data` function. -- **`focal_trait`**: String; For xQTL-only mode, the name of the trait to perform disease-prioritized ColocBoost, from `conditions_list_individual`. If not provided, xQTL-only mode will be run without disease-prioritized mode. -- **`event_filters`**: List of character vectors; Patterns for filtering events based on context names. +- **`region_data`**: The output of the `load_multitask_regional_data` function. +- **`focal_trait`**: Name of the trait if performing disease-prioritized ColocBoost. +- **`event_filters`**: A list of patterns for filtering events based on context names. Example: for sQTL, `list(type_pattern = ".*clu_(\\d+_[+-?]).*", valid_pattern = "clu_(\\d+_[+-?]):PR:", exclude_pattern = "clu_(\\d+_[+-?]):IN:")`. -- **`maf_cutoff`**: Numeric; Minor allele frequency cutoff. Default is 0.005. -- **`pip_cutoff_to_skip_ind`**: Integer vector; Cutoff values for skipping analysis based on pre-screening with single-effect SuSiE (L=1). Context is skipped if none of the variants in the context have PIP values greater than the cutoff. Default is 0 (does not run single-effect SuSiE). Passing a negative value sets the cutoff to 3/number of variants. -- **`pip_cutoff_to_skip_sumstat`**: Integer vector; Cutoff values for skipping analysis based on pre-screening with single-effect SuSiE (L=1). Sumstat is skipped if none of the variants in the sumstat have PIP values greater than the cutoff. Default is 0 (does not run single-effect SuSiE). Passing a negative value sets the cutoff to 3/number of variants. -- **`qc_method`**: String; Quality control method to use. Options are "rss_qc", "dentist", or "slalom". Default is `rss_qc`. -- **`impute`**: Logical; if TRUE, performs imputation for outliers identified in the analysis. Default is `TRUE`. -- **`impute_opts`**: List of lists; Imputation options including rcond, R2_threshold, and minimum_ld. Default is `list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5)`. -- **`xqtl_coloc`**: Logical; if TRUE, performs xQTL-only mode. Default is `TRUE`. -- **`joint_gwas`**: Logical; if TRUE, performs joint GWAS mode, mapping all individual-level and sumstat data together.Default is `FALSE`. -- **`separate_gwas`**: Logical; if TRUE, runs separate GWAS mode, where each sumstat dataset is analyzed separately with all individual-level data, treating each sumstat as the focal trait in disease-prioritized mode. Default is `FALSE`. - -outputs: -- **`colocboost_results`**: List of colocboost objects (with `xqtl_coloc`, `joint_gwas`, `separate_gwas`); Output of the `colocboost_analysis_pipeline` function. If the mode is not run, the corresponding element will be `NULL`. - -```{r, colocboost-analysis} -# load in individual-level and sumstat data -region_data_combined <- load_multitask_regional_data( - region = region, - genotype_list = genotype_list, - phenotype_list = phenotype_list, - covariate_list = covariate_list, - conditions_list_individual = conditions_list_individual, - match_geno_pheno = match_geno_pheno, - association_window = association_window, - region_name_col = region_name_col, - extract_region_name = extract_region_name, - keep_indel = keep_indel, - keep_samples = keep_samples, - maf_cutoff = maf_cutoff, - mac_cutoff = mac_cutoff, - xvar_cutoff = xvar_cutoff, - imiss_cutoff = imiss_cutoff, - sumstat_path_list = sumstat_path_list, - column_file_path_list = column_file_path_list, - LD_meta_file_path_list = LD_meta_file_path_list, - conditions_list_sumstat = conditions_list_sumstat, - match_LD_sumstat = match_LD_sumstat, - n_samples = n_samples, - n_cases = n_cases, - n_controls = n_controls -) +- **`maf_cutoff`**: A scalar to remove variants with maf < maf_cutoff, default is 0.005. +- **`pip_cutoff_to_skip_ind`**: A vector of cutoff values for skipping analysis based on PIP values for each context. Default is 0. +- **`pip_cutoff_to_skip_sumstat`**: A vector of cutoff values for skipping analysis based on PIP values for each sumstat. Default is 0. +- **`qc_method`**: Quality control method to use. Options are "rss_qc", "dentist", or "slalom" (default: "rss_qc"). +- **`impute`**: Logical; if TRUE, performs imputation for outliers identified in the analysis (default: TRUE). +- **`impute_opts`**: A list of imputation options including rcond, R2_threshold, and minimum_ld (default: `list(rcond = 0.01, R2_threshold = 0.6, minimum_ld = 5)`). -maf_cutoff = 0.01 -pip_cutoff_to_skip_ind = rep(0, length(phenotype_list)) -pip_cutoff_to_skip_sumstat = rep(0, length(sumstat_path_list)) -qc_method = "rss_qc" - -# run colocboost analysis -colocboost_results <- colocboost_analysis_pipeline( - region_data_combined, - maf_cutoff = maf_cutoff, - pip_cutoff_to_skip_ind = pip_cutoff_to_skip_ind, - pip_cutoff_to_skip_sumstat = pip_cutoff_to_skip_sumstat, - qc_method = qc_method, - xqtl_coloc = TRUE, - joint_gwas = TRUE, - separate_gwas = TRUE -) -# visualize results for xQTL-only mode -colocboost_plot(colocboost_results$xqtl_coloc) - -# visualize results for joint GWAS mode -colocboost_plot(colocboost_results$joint_gwas) - -# visualize results for separate GWAS mode -for (i in 1:length(colocboost_results$separate_gwas)) { - colocboost_plot(colocboost_results$separate_gwas[[i]]) -} +```{r, colocboost-analysis} +# region_data <- load_multitask_regional_data(...) +# res <- colocboost_analysis_pipeline(region_data) ``` \ No newline at end of file