MSexplorerWorkflow

We developed and employed a comprehensive post-processing workflow designed to analyze metagenomic data processed with Kraken2, Bracken, and KrakenBio. The workflow begins with three decontamination steps aimed at minimizing technical and environmental contaminants, ensuring higher data integrity for downstream analyses.

Following decontamination, we performed alpha and beta diversity analyses, as well as stacked bar plots, to facilitate a global visualization of microbial community structure and composition. To identify differentially abundant species (DAS), we utilized two complementary statistical approaches: limma, which employs a linear modeling framework after voom transformation of count data, and LEfSe (Linear discriminant analysis Effect Size), which identifies features characterizing statistical differences between groups.

The results from both DAS tools were subsequently integrated to enhance robustness and biological interpretability. Notably, LEfSe can be executed through two methods: (i) via a pre-configured Docker container, which provides a streamlined and reproducible environment, or (ii) through manual installation involving Python, Conda, and LEfSe and execution of the accompanying Bash script.

Finally, we computed alpha diversity metrics and applied LEAPS (Learning with Embedded and Projected Subspaces) using these metrics as input, enabling feature selection and predictive modeling grounded in microbial diversity.

Requirements

The following tools are strictly necessary for the successful execution of the post-processing workflow.

Docker

Docker must be installed on your machine; for more information, see https://docs.docker.com/engine/installation/.

Ensure your user has the rights to run docker (without the use of sudo). To create the docker group and add your user:

  • Create the docker group.
  $ sudo groupadd docker
  • Add your user to the docker group.
  $ sudo usermod -aG docker $USER
  • Log out and log back in so that your group membership is re-evaluated.

Before running any scripts, make sure that all required R libraries listed in the Package.R file are installed. This script automatically checks for and installs all necessary dependencies in your R environment. The installation process may take some time and requires an active internet connection. If you are working on a laptop, keep it plugged in to avoid interruptions.

Once you have all the necessary files, begin with the DataImport step. Here, you must set the correct paths to the BIOM and metadata files. Ensure that each metadata column is assigned the correct data type. You may also need to clean the BIOM file column names—for example, by removing suffixes like "_bracken_species"—to ensure they match the sample names in the metadata file.

You can optionally remove samples that are not relevant to your analysis. While this step was necessary for our study, it is not required for general use. After this step, an .RDS file will be created for each taxonomic kingdom, which will be used in subsequent analyses.

  • The BIOM file should be the result of merging multiple Bracken output files. The column names in the BIOM file must match the original filenames used during merging.
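The column-name cleanup mentioned above (e.g., removing suffixes like "_bracken_species") can be sketched as follows; the suffix, sample names, and helper function are illustrative, not part of the actual DataImport script:

```python
# Illustrative sketch: strip tool-added suffixes from BIOM column names so
# they match the sample names in the metadata file. The suffix and sample
# names below are examples, not taken from the actual dataset.

def clean_column_names(columns, suffix="_bracken_species"):
    """Remove a trailing suffix from each column name, if present."""
    return [c[:-len(suffix)] if c.endswith(suffix) else c for c in columns]

biom_columns = ["MS01_bracken_species", "MS02_bracken_species", "CTRL01"]
metadata_samples = {"MS01", "MS02", "CTRL01"}

cleaned = clean_column_names(biom_columns)
missing = [c for c in cleaned if c not in metadata_samples]
print(cleaned)   # names now match the metadata sample IDs
print(missing)   # [] -> every column has a matching metadata entry
```

Checking for unmatched names (as done with `missing` above) is a quick way to catch typos before merging the BIOM table with the metadata.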

In order to install LEfSe, you need Python 2, and depending on your operating system this can be a challenge, since Python 2 has been deprecated for years. The following steps walk through its installation on Debian 12 (the procedure should be similar on other Unix-like systems):

First and foremost, you need Conda installed on your machine. To do this, simply follow their Guide

Now, install python2.7 and create a symbolic link called python2 (not necessary, but very handy) using the following two commands:

sudo apt-get install python2.7
sudo ln -s /usr/bin/python2.7 /usr/bin/python2

Create a Conda environment for LEfSe that explicitly uses Python 2.7, then activate it:

conda create -n lefse_env python=2.7 -y
conda activate lefse_env

Now, add the defaults, conda-forge, and bioconda channels, and then install LEfSe with the last command:

conda config --add channels defaults
conda config --add channels conda-forge
conda config --add channels bioconda
conda install -c bioconda lefse -y

You are now done! The following lines are an example of how to use LEfSe:

conda activate lefse_env # only if the environment is not already active
lefse-format_input.py file_input file_input.in -c 1 -s -1 -u 2 -o 1000000
run_lefse.py -l 2  file_input.in file_input.res

For any additional information or issues, please refer to the SegataLab GitHub page or the bioBakery tutorial.

IMPORTANT: to avoid running into errors, it is recommended to install all the libraries listed in Package.R. One last thing: every path is relative, so the paths do not need to be modified unless you want to use a custom input.

This script takes as input the biom file created by bracken and the metadata of your files in CSV format. Once uploaded, the user must define a type for each column. This step is only necessary if the metadata differs from the one provided in this script. Additionally, if a custom input is used, it may be necessary to change the OTU table column names.

Lastly, for this dataset, we remove the naïve patients, who will not appear in the subsequent scripts.

Once completed, 4 RDS files will be created:

  • Bacteria
  • Archaea
  • Eukaryota
  • All 3 together

In this step, we take as input the combined RDS generated by the previous step and pass it through DESeq, a statistical tool for analyzing DNA sequencing data. It identifies differentially expressed features between conditions by modeling the counts of sequencing reads and accounting for variability in the data. Finally, we separate the 3 kingdoms to create one RDS file for each of them.

This step takes as input the three RDS files created in the previous step and three lists of microorganisms associated with humans, for which we identified sources using the Encyclopedia of Life. Additionally, we annotated each list to indicate whether a microorganism was found in only one source or in more than one. For archaea, the process is slightly more complex due to the limited studies available. The few sources, combined with decontamination at the phylum level, leave too many contaminants. To address this, we consulted an expert to obtain a more specific list of genera.

This script takes the aforementioned RDS files, removes controls and rows with ambiguous phylum/genus annotations (such as NULL values), and computes the prevalence for each feature, adding taxonomy and row counts. Lastly, it removes all phyla/genera not present in the tables and applies a relative abundance filter (>0.001) to eliminate additional potential contaminants. The program then saves this updated table as an RDS file (one for each domain).
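The relative abundance filter described above can be illustrated with a small sketch; the counts, the keep-if-above-threshold-in-any-sample rule, and the helper names are assumptions for demonstration, not the script's actual code:

```python
# Minimal sketch of a relative-abundance filter: features whose relative
# abundance never exceeds 0.001 in any sample are treated as potential
# contaminants and dropped. Counts below are invented.

def relative_abundance(counts):
    """Convert a {taxon: [counts per sample]} table to relative abundances."""
    n_samples = len(next(iter(counts.values())))
    totals = [sum(v[i] for v in counts.values()) for i in range(n_samples)]
    return {t: [c / tot if tot else 0.0 for c, tot in zip(row, totals)]
            for t, row in counts.items()}

def filter_taxa(counts, threshold=0.001):
    """Keep taxa whose relative abundance exceeds the threshold in >=1 sample."""
    rel = relative_abundance(counts)
    return {t: row for t, row in counts.items()
            if any(r > threshold for r in rel[t])}

counts = {"Bacteroides": [900, 800], "Prevotella": [99, 199], "RareTaxon": [1, 1]}
kept = filter_taxa(counts)
print(sorted(kept))  # ['Bacteroides', 'Prevotella'] -> RareTaxon is dropped
```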

In these four scripts, we focused on identifying and generating graphs that would not only be explanatory but also easy to interpret. The primary objective of this analysis was to provide a comprehensive description of the population we identified, highlighting any significant differences between patients undergoing glucocorticoid treatment and those who were not. By leveraging visual tools, we aimed to make complex data more accessible, allowing for a clearer understanding of the variations within the population.

Our approach was to select graph types that could effectively capture the nuances of the data while maintaining clarity. In particular, we were interested in how glucocorticoid treatment might influence the observed patterns, both at the individual and group levels. This allowed us to assess the treatment’s impact on various biological and clinical factors in a manner that was both statistically rigorous and visually intuitive.

Through the use of these graphs, we sought to make the results not only informative but also actionable, enabling further hypothesis testing or clinical applications. Ultimately, the goal was to ensure that the findings could be easily understood and communicated to a broad audience, whether researchers, clinicians, or stakeholders involved in the study.

This script takes as input the manually decontaminated RDS file and utilizes ggplot2 to create three distinct graphs. These graphs will be saved as RDS files and serve to visually demonstrate the differences in alpha diversity across various conditions. Specifically, the graphs will illustrate:

  • The comparison in alpha diversity between healthy individuals and those with multiple sclerosis (MS).
  • The impact of glucocorticoid (GC) treatment on the microbial diversity.
  • The correlation between alpha diversity and the number of days since glucocorticoid treatment was administered.

By using ggplot2, the script ensures that these plots are not only informative but also aesthetically clear, allowing for easy interpretation of the relationships between the biological variables and treatment conditions. The resulting RDS files can be used for further analysis or incorporated into reports to communicate the findings effectively.

Similar to the alpha diversity analysis, this script takes the manually decontaminated RDS file as input and uses ggplot2 to generate a beta diversity graph. The graph will be created based on two key factors:

  • The comparison between healthy individuals and those diagnosed with multiple sclerosis (MS).
  • The effect of glucocorticoid (GC) treatment on the microbial community.

The resulting beta diversity plot will visually represent the dissimilarities between samples, providing insights into how these two conditions (health status and GC treatment) influence the microbiome structure. The use of ggplot2 ensures the graph is not only informative but also clear and easy to interpret for further analysis. The final outputs will be two RDS files.
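As a rough illustration of what a beta-diversity computation does, here is a hand-rolled Bray-Curtis dissimilarity (a common microbiome choice; the metric actually used by the script is not stated above), on invented counts:

```python
# Hedged sketch: Bray-Curtis dissimilarity between two samples' count
# vectors. 0 means identical composition, 1 means no shared abundance.

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 1 - 2 * shared / total if total else 0.0

healthy = [10, 0, 5, 5]   # toy counts for one healthy sample
ms      = [2, 8, 5, 5]    # toy counts for one MS sample
print(round(bray_curtis(healthy, ms), 3))  # 0.4
print(bray_curtis(healthy, healthy))       # 0.0 -> identical samples
```

A matrix of such pairwise dissimilarities is what ordination plots (e.g., PCoA) are built on.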

Stackbars are designed to visually represent the top 10 most abundant species found within each domain, making it easier to compare species distributions across different datasets. To construct these stackbars in a clear and comprehensible manner, we created an array that includes each of the top 10 species, along with an additional category labeled 'Other.' This category accounts for the sum of all species that do not appear in the top 10, ensuring that no relevant data is omitted while maintaining visual clarity.

To enhance readability and consistency, we standardized the color scheme by assigning the same shade of grey to the 'Other' category across all stackbars. This allows viewers to quickly differentiate between the most abundant species and the aggregated remaining species. Additionally, our array structure enables us to assign a distinct, predetermined color to each species. By ensuring that each species is consistently represented by a unique color, we improve both the interpretability and overall comprehensibility of the visualization. This approach makes it easier to compare results across different domains, identify patterns, and extract meaningful insights from the data.
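The top-10-plus-'Other' construction can be sketched as follows; the data and function name are illustrative:

```python
# Sketch of the top-N + 'Other' aggregation used for the stacked bar plots:
# the N most abundant species keep their own (fixed) colors, and everything
# else is summed into a grey 'Other' category. Data and N are illustrative.

def top_n_with_other(abundances, n=10):
    """Return the n most abundant species plus an aggregated 'Other' entry."""
    ranked = sorted(abundances.items(), key=lambda kv: kv[1], reverse=True)
    top = dict(ranked[:n])
    other = sum(v for _, v in ranked[n:])
    if other:
        top["Other"] = other
    return top

abund = {"sp%02d" % i: 100 - i for i in range(15)}  # 15 toy species
stack = top_n_with_other(abund, n=10)
print(len(stack))      # 11 -> top 10 species plus 'Other'
print(stack["Other"])  # 440 -> sum of the five remaining species
```

Because nothing is discarded, the bar totals still sum to the original abundances, which is what keeps the stacked bars comparable across domains.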

This step creates a visually harmonious and easy-to-read composition by integrating the previously generated graphs into a unified structure. It achieves this result by merging identical labels and efficiently managing space to arrange all elements into a structured table of parallel components.

To provide a brief recap, the composition includes a total of nine graphs, categorized as follows:

Bacteria:

  • P1: Stack bar chart for bacterial species
  • P2: Alpha diversity graphs for the bacterial category
  • P3: Beta diversity graphs for the bacterial category

Archaea:

  • P4: Stack bar chart for archaeal species
  • P5: Alpha diversity graphs for the archaeal category
  • P6: Beta diversity graphs for the archaeal category

Eukaryota:

  • P7: Stack bar chart for eukaryotic species
  • P8: Alpha diversity graphs for the eukaryotic category
  • P9: Beta diversity graphs for the eukaryotic category

This structured approach enhances clarity, allowing for a more intuitive comparison across different domains while ensuring a cohesive and well-organized visual representation.

LEfSe is the acronym of Linear discriminant analysis Effect Size, and it is used to determine the features most likely to explain differences between classes. It was implemented as follows:

The PreLefSe script is an R-based preprocessing tool designed to prepare data for LEfSe (Linear Discriminant Analysis Effect Size) analysis. It takes as input:

  • Supervised decontaminated data (processed in Step 2)
  • Metadata (created in Step 3)

The script extracts the taxa table from the decontaminated file and assigns levels to a key categorical field, which will be used in the LEfSe pipeline. Based on this selected field, the script generates a structured table where:

  • The first row contains the chosen metadata field.
  • The second row lists the patient IDs.
  • The following rows contain phylogenetic classifications, followed by the relative abundance of each species.

Generated Tables and Field Selection

For this study, six different metadata fields are used, resulting in a total of 10 tables per domain, categorized as follows:

Primary Comparisons (Across All Patients)

  • category – Comparison between Healthy vs. MS patients.
  • gc_treatment – Comparison between GC-positive vs. GC-negative patients.

Subset Analyses (Separate Tables for GC-Positive and GC-Negative Patients)

These four additional metadata fields are analyzed separately for GC-positive and GC-negative groups, leading to eight tables in total:

  • lesion_burden – Analysis based on lesion burden classification.
  • bone_marrow_lesion – Presence or absence of bone marrow lesions.
  • gadolinium_contrast – Gadolinium contrast enhancement patterns.
  • subtenorial_lesion – Lesions located in the subtentorial region.

This structured approach ensures that each comparison is appropriately categorized, facilitating meaningful insights from the LEfSe analysis.
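A minimal sketch of the table layout described above (first row: metadata field; second row: patient IDs; then taxonomy rows with abundances); the field values, sample names, and the "sample_id" row label are invented for illustration:

```python
# Hypothetical sketch of the table assembled by the PreLefSe step before it
# is written as a tab-separated LEfSe input file. All names are invented.

def build_lefse_table(field_name, classes, sample_ids, taxa):
    """Assemble rows for a LEfSe input file (tab-separated when written)."""
    rows = [[field_name] + classes, ["sample_id"] + sample_ids]
    for taxon, abundances in taxa.items():
        rows.append([taxon] + [str(a) for a in abundances])
    return rows

rows = build_lefse_table(
    "gc_treatment", ["positive", "negative"], ["MS01", "MS02"],
    {"Bacteria|Bacteroidetes|Bacteroides": [0.41, 0.12]},
)
print("\n".join("\t".join(r) for r in rows))
```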

The lefseEX script is a Bash-based automation tool designed to streamline the execution of LEfSe without manual intervention. It automatically runs the necessary commands to process the input data, perform statistical analysis, and generate results.

Requirements

The LEfSe Conda environment must be activated before running the script. If not activated, the script will not function correctly.

Customization Options

Users can easily modify input and output locations by adjusting the following script variables:

  • DIRECTORY – Path to the input data folder.
  • OUT1 – Path for intermediate output files.
  • OUT2 – Path for final LEfSe results.

By automating the LEfSe pipeline, this script enhances efficiency, reducing manual errors and ensuring reproducibility across analyses.

In order to execute this file, you must be in the postProcess folder (the base of the project), and the command must be:

conda activate lefse_env
bash Workflow/Step4_DiscriminantAnalysis/Lefse/Docker_LefseElaborator/lefseEx.sh

The next few steps, executed automatically by lefseEX, are the following:

To refine the LEfSe results, which could not handle special characters correctly, we developed a Python script that processes the pre-LEfSe output to extract species-level taxa. The script reads the pre-LEfSe output file (in tab-delimited format), skips the initial header lines, and extracts the species names from the taxonomic annotations (which are provided in a concatenated format). These species names are then sorted alphabetically, and the refined list is saved to a separate file for further analysis.
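A hedged sketch of that extraction step, assuming a tab-delimited layout with two header lines and pipe-concatenated taxonomy in the first column (the real file layout may differ):

```python
# Sketch of the species-extraction step: read tab-delimited pre-LEfSe lines,
# skip the header lines, pull the species name out of the pipe-concatenated
# taxonomy, and return the sorted, de-duplicated result.

def extract_species(lines, n_header=2):
    """Return sorted species names from pipe-concatenated taxonomy rows."""
    species = set()
    for line in lines[n_header:]:
        taxonomy = line.split("\t")[0]
        species.add(taxonomy.split("|")[-1])   # last field = species level
    return sorted(species)

lines = [
    "gc_treatment\tpositive\tnegative",
    "sample_id\tMS01\tMS02",
    "Bacteria|Firmicutes|Faecalibacterium_prausnitzii\t0.2\t0.1",
    "Bacteria|Bacteroidetes|Bacteroides_fragilis\t0.4\t0.3",
]
print(extract_species(lines))
```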

This step involves the initial manipulation of the LEfSe output, focusing on the following tasks:

  • Trimming unimportant species.
  • Filtering out species that do not meet the significance criteria.
  • Converting the phylogenetic tree format – changing the format of the tree from pipe-separated (|) to tab-separated (\t) in order to create a TSV file that can later be merged with limma results.

Customization

To modify the input and output folder paths, simply adjust the variables input_folder and output_folder inside the script. As in the previous step, the script must be run from the postProcess folder.

Expected Output

Once the script has been run, it will output a series of files containing the phylogenetic paths of the relevant species. These files are essential for the subsequent analyses and will be used for further processing, including integration with the limma results.
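The pipe-to-tab conversion at the heart of this step can be sketched in a few lines; the input rows are illustrative:

```python
# Minimal sketch of the tree-format conversion: LEfSe reports taxonomy as a
# pipe-separated path, which is rewritten as tab-separated columns so the
# result can be merged with the limma tables as a TSV.

def pipe_to_tsv(lefse_lines):
    """Convert pipe-separated taxonomy paths into tab-separated rows."""
    return [line.replace("|", "\t") for line in lefse_lines]

res = ["Bacteria|Bacteroidetes|Bacteroides_fragilis",
       "Bacteria|Firmicutes|Faecalibacterium_prausnitzii"]
for row in pipe_to_tsv(res):
    print(row)
```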

To address issues with special characters in the LEfSe results, we developed a Python script that corrects species names by referencing a dictionary of validated names. This dictionary is created by the create_dict script, which extracts and normalizes species names from the pre-LEfSe output. The correction script reads the LEfSe taxonomy file, normalizes the species names (removing non-alphabetic characters and converting to lowercase), and replaces any erroneous names with the correct counterparts from the dictionary. The resulting corrected file ensures that all species names are properly formatted and standardized, facilitating more accurate downstream analysis.
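A minimal sketch of the dictionary-based correction, assuming the normalization rule stated above (lowercase, non-alphabetic characters removed); the species names are illustrative:

```python
# Sketch of the name-correction step: species names are normalized and
# looked up in a dictionary built from the validated pre-LEfSe names.

import re

def normalize(name):
    """Reduce a species name to lowercase letters only, for matching."""
    return re.sub(r"[^a-z]", "", name.lower())

def build_dict(valid_names):
    """Map each normalized form to its validated spelling."""
    return {normalize(n): n for n in valid_names}

def correct(name, lookup):
    """Replace an erroneous name with its validated counterpart, if known."""
    return lookup.get(normalize(name), name)

lookup = build_dict(["Bacteroides_fragilis"])
print(correct("Bacteroides.fragilis", lookup))  # Bacteroides_fragilis
```

Names with no match in the dictionary are left untouched, so the correction is conservative.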

[Step 4 Docker - lefseEx]

The inner workings of the Docker-based LEfSe implementation are identical to the manual method, but it significantly simplifies installation and execution. To install the Docker container, simply run:

docker pull qbioturin/lefse:0.1

To execute the analysis, navigate to the directory containing the folder(s) with the sample data and run the following command:

docker run -v ./:/input_files/ -it qbioturin/lefse:0.1 bash Scripts/lefseEx.sh folder_name

Here, folder_name refers to the subdirectory within the current directory that contains the sample(s) to be analyzed. Alternatively, the folder name can be omitted to run the analysis on all eligible folders within the current directory.

[Step 4 LEfSe - Output]

The output of the LEfSe analysis was organized into two main directories. The first, named step2, contains the complete output generated by LEfSe, including all detected features and associated statistical metrics. The second directory, final_name, includes only the subset of features classified at the species level that were identified by LEfSe as differentially abundant between the study groups, based on the specified significance thresholds.

Limma is a powerful tool for data analysis, particularly suited for linear models and differential expression analysis in omics data. It leverages a systematic approach to identify significant changes between conditions by applying statistical models.

Input and Processing

Limma takes as input the supervised decontaminated data (from previous steps) and performs the following steps:

  • Normalization – The data is normalized to adjust for technical variation and ensure comparability across samples.
  • Log Transformation – The data is then log-transformed (base 2) to stabilize variance and make it more suitable for linear modeling.

Metadata Integration

The same metadata used in the LEfSe analysis is applied here, enabling the creation of a set of 10 tables per domain. These tables are generated for each of the following categories, mirroring the structure of LEfSe:

  • Primary comparisons (e.g., Healthy vs. MS, GC-positive vs. GC-negative)
  • Subset analyses (e.g., lesion burden, bone marrow lesion, gadolinium contrast, subtentorial lesion)

Output

The result of the Limma analysis will be a set of tables that highlight differential expression across domains, helping to identify key species or features that exhibit significant differences under various conditions. These tables serve as the foundation for further interpretation and integration with other tools and analysis pipelines.
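As an illustration of the kind of transformation applied before linear modeling, here is a log2 counts-per-million sketch (the real workflow uses limma/voom in R; CPM normalization and the pseudocount are assumptions for demonstration):

```python
# Hedged sketch: normalize a sample's counts to library size (counts per
# million, one common choice) and log2-transform with a pseudocount, which
# stabilizes variance before linear modeling.

import math

def log2_cpm(counts, pseudocount=1.0):
    """Per-sample counts-per-million, then log2(x + pseudocount)."""
    total = sum(counts)
    return [math.log2(c / total * 1e6 + pseudocount) for c in counts]

sample = [90, 9, 1]   # toy raw counts for three taxa in one sample
print([round(v, 2) for v in log2_cpm(sample)])
```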

This script takes the output from LEfSe and Limma and merges them into 10 distinct lists. The process works as follows:

  • The script retrieves all the species identified by LEfSe and Limma for each category (e.g., gc_treatment, lesion_burden, etc.), without filtering out species that appear in only one of the two results.
  • For each of the 10 categories (corresponding to the primary comparisons and subset analyses), it merges the species lists from both LEfSe and Limma into a single comprehensive list.
  • The script then looks up the abundance values of these species in the supervised decontaminated data for the relevant cohort (e.g., for the GC comparison, only patients with positive gc_treatment are selected) and the specific domain, assigning the appropriate abundance values to each species.
  • The final output consists of 10 RDS files, each containing the full list of species for a given category, along with their corresponding abundance values.

This process ensures that all species are considered, providing a complete view of the species distribution across various conditions, regardless of whether they appear in both the LEfSe and Limma results.
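The union-style merge can be sketched as follows; species names and abundances are invented:

```python
# Sketch of the merge step: for each category, take the union of the species
# reported by LEfSe and by limma (no species is dropped for appearing in only
# one tool), then attach abundances from the decontaminated table.

def merge_das(lefse_species, limma_species, abundance_table):
    """Union both species lists and look up each species' abundances."""
    merged = sorted(set(lefse_species) | set(limma_species))
    return {sp: abundance_table.get(sp) for sp in merged}

abund = {"A": [0.1, 0.2], "B": [0.3, 0.1], "C": [0.05, 0.4]}
result = merge_das(["A", "B"], ["B", "C"], abund)
print(sorted(result))  # ['A', 'B', 'C'] -> union, duplicates collapsed
```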

This script processes the DAS generated during the merge DAS step and creates a series of heatmaps designed to identify significant clusters. These clusters play a crucial role in refining the search for key species across different comparative analyses. By visually highlighting areas of interest, the heatmaps help researchers better interpret patterns and distributions within the dataset.

To run this script, the DAS files produced in Step 5 are required as input. The output consists of a series of PDF documents presenting important elements, such as gadolinium and subtentorial regions, each represented with a distinct color. The color scale is carefully structured to enhance interpretability: white indicates zero presence, while progressively more saturated hues correspond to increasing concentrations. This visual representation allows for a more intuitive understanding of the data, making it easier to detect significant variations and trends.

To calculate and store various alpha diversity indices for different taxonomic groups across multiple datasets, we developed the createTab function. This function takes as input the differential abundance results (DAS) from the previous steps for the Lesion, Spinal Cord, Gadolinium, and Subtentorial comparisons.

The function filters the baseline data (filtered_baselines_decB_table) according to specific taxonomic names from each DAS dataset. It then computes the Shannon, Simpson, and Evenness (EH) indices for each condition, and stores the results in separate tables. These indices are calculated for each condition across the filtered taxonomic data, ensuring only relevant taxa are included. The results are saved as CSV files, named according to the diversity index and dataset, facilitating further downstream analysis.
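The three indices named above can be written down directly from their standard definitions; the exact Simpson variant (1 − Σp²) and the evenness formula EH = H / ln(S) are assumptions based on common usage, and the community is invented:

```python
# Sketch of the three alpha-diversity indices, computed from a vector of
# relative abundances: Shannon H, Simpson (1 - sum p^2 variant), and
# Shannon evenness EH = H / ln(S).

import math

def shannon(p):
    """Shannon index H = -sum p * ln(p) over nonzero abundances."""
    return -sum(x * math.log(x) for x in p if x > 0)

def simpson(p):
    """Simpson diversity, 1 - sum p^2 variant."""
    return 1 - sum(x * x for x in p)

def evenness(p):
    """Shannon evenness EH = H / ln(S), S = observed richness."""
    s = sum(1 for x in p if x > 0)
    return shannon(p) / math.log(s) if s > 1 else 0.0

p = [0.25, 0.25, 0.25, 0.25]  # perfectly even 4-species community
print(round(shannon(p), 4))   # 1.3863 -> ln(4)
print(round(simpson(p), 4))   # 0.75
print(round(evenness(p), 4))  # 1.0 -> maximal evenness
```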

In this step, the script LEAPS.R performs alpha diversity analysis and linear regression modeling on decontaminated microbiome datasets (Bacteria, Archaea, and Eukaryotes). The main steps include:

  1. Data Loading & Merging: Reads in preprocessed .rds files and merges them into a unified phyloseq object.

  2. Sample Filtering: Keeps only samples with IDs starting with "MS" (likely patient samples).

  3. Alpha Diversity Calculation: Computes richness metrics (Observed, Shannon, Simpson, Chao1) using estimate_richness().

  4. Metadata Processing: Merges sample metadata with calculated diversity indices and standardizes data types (e.g., factors, numerics, dates).

  5. Regression Analysis: Uses the LRM_microbiome() function to perform exhaustive linear regression modeling for selected diversity outcomes based on a specified set of clinical and demographic predictors.

Output: Saves model summaries (for models with intercepts) for each diversity metric to individual .txt files in the specified output folder.

The function called LRM_microbiome() performs linear regression modeling on microbiome-related data. Given a metadata dataframe, a set of outcome variables, and predictor variables, it:

  1. Validates the presence of all specified variables;

  2. handles factor variables by generating dummy encodings;

  3. uses exhaustive subset selection (via regsubsets) to identify the best models with and without intercepts;

  4. selects either the model with the lowest p-value (if none are significant) or the one with the highest adjusted R-squared (if at least one is significant);

  5. computes model summaries and variance inflation factors (VIF) for multicollinearity assessment.

It returns a structured list of regression results for each outcome variable.
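As a rough, pure-Python sketch of the procedure above (dummy encoding plus exhaustive subset selection; the real function uses R's regsubsets, p-values, adjusted R-squared, and VIF, none of which are reproduced here), selecting subsets by plain R² on toy data:

```python
# Hedged sketch of exhaustive subset selection for a linear model: encode a
# factor as dummy columns, enumerate every predictor subset, fit OLS on each
# via the normal equations, and keep the subset with the best R^2.

from itertools import combinations

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c and M[c][c]:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * bb for a, bb in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols_r2(X_cols, y):
    """Fit y ~ intercept + X via normal equations; return R^2."""
    X = [[1.0] + [col[i] for col in X_cols] for i in range(len(y))]
    k = len(X[0])
    XtX = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    Xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(k)]
    beta = solve(XtX, Xty)
    yhat = [sum(b * xi for b, xi in zip(beta, row)) for row in X]
    ybar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - ybar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

def dummy_encode(values):
    """One dummy column per level except the first (reference) level."""
    levels = sorted(set(values))[1:]
    return {lv: [1.0 if v == lv else 0.0 for v in values] for lv in levels}

# Toy data: Shannon index as outcome, one numeric and one factor predictor.
shannon = [1.2, 1.5, 1.1, 1.9, 2.0, 1.4]
predictors = {"age": [30.0, 45.0, 28.0, 60.0, 62.0, 40.0]}
predictors.update({"sex_" + k: v
                   for k, v in dummy_encode(list("MFMFFM")).items()})

best = max(
    (cols for r in range(1, len(predictors) + 1)
     for cols in combinations(predictors, r)),
    key=lambda cols: ols_r2([predictors[c] for c in cols], shannon),
)
print(best)
```

Selecting by raw R² always favors the full model; the real implementation's use of adjusted R-squared and p-values is precisely what guards against that.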
