This repository contains an R-based downstream analysis workflow for 16S rRNA amplicon sequencing data, developed during my PhD research.
The workflow is designed primarily for host-associated low-microbial biomass samples and makes extensive use of negative controls and mock communities to assess contamination, detection limits, and data plausibility.
It also uses environmental controls, i.e. samples from adjacent biological niches, to compare the taxonomic and phylogenetic profiles of the different sample types.
This is a research workflow rather than a polished software package, and is shared for transparency, reuse, and reproducibility.
The analysis assumes that upstream processing has already been performed using QIIME2. The R script performs the following major steps:
- Import of QIIME2 outputs
  - Feature table
  - Taxonomy assignments (SILVA and/or Greengenes)
  - Rooted phylogenetic tree
  - Sample metadata
  - Construction of a phyloseq object
- Initial quality control
  - Removal of non-bacterial features
  - Filtering of implausible or low-prevalence taxa
  - Basic inspection of sequencing depth and feature counts
- Contamination assessment
  - Identification and removal of contaminant ASVs using decontam
  - Use of negative controls where available
- Mock community analysis
  - Evaluation of mock community composition
  - Determination of minimum detection thresholds using:
    - Cell-based dilution series
    - DNA-based logarithmic mock standards
- Final filtering
  - Removal of features below empirically determined thresholds
  - Generation of the final analysis-ready dataset
  - Analysis of alpha rarefaction and sampling depth
- Diversity analyses
  - Alpha diversity
  - Beta diversity (ordination-based analyses)
- Taxonomic summaries
  - Bar plots and other relative abundance summaries
  - Taxonomic composition across sample groups
- Comparative niche analysis
  - Comparison of target samples to environmental or adjacent niches
- Phylogenetic visualisation
  - Phylograms and tree-based representations of selected taxa
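To make the two filtering steps above more concrete, here is a minimal base-R sketch on a toy count table. It is not the code in main_analysis.R: the table values, the taxonomy labels, the variable names, and the 1% relative-abundance cut-off (`min_rel_abundance`) are all assumptions chosen for illustration. In the actual workflow the threshold comes from the mock community analysis, and equivalent filtering on a phyloseq object would normally use phyloseq's own subsetting functions.

```r
# Toy ASV count table: rows are ASVs, columns are samples (all values invented).
counts <- matrix(
  c(500, 300, 400,
      2,   0,   1,
     40,  55,  60,
     10,  12,   9,
     98, 100,  99),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("ASV", 1:5), c("S1", "S2", "S3"))
)

# Toy taxonomy table keyed by ASV ID (labels are illustrative only).
taxonomy <- data.frame(
  Kingdom = c("Bacteria", "Bacteria", "Eukaryota", "Bacteria", "Bacteria"),
  Order   = c("Lactobacillales", "Lactobacillales", NA,
              "Rickettsiales", "Staphylococcales"),
  Family  = c("Lactobacillaceae", "Streptococcaceae", NA,
              "Mitochondria", "Staphylococcaceae"),
  row.names = paste0("ASV", 1:5),
  stringsAsFactors = FALSE
)

# Step 1: keep bacterial features and drop organelle-derived sequences.
# %in% is used instead of == so that NA ranks evaluate to FALSE, not NA.
is_bacteria  <- taxonomy$Kingdom %in% "Bacteria"
is_organelle <- taxonomy$Order %in% "Chloroplast" |
  taxonomy$Family %in% "Mitochondria"
counts_bact <- counts[is_bacteria & !is_organelle, , drop = FALSE]

# Step 2: zero out counts below a per-sample relative-abundance threshold.
# ASSUMPTION: 0.01 is a placeholder, not an empirically determined value.
min_rel_abundance <- 0.01
rel_abund <- sweep(counts_bact, 2, colSums(counts_bact), "/")
filtered <- counts_bact
filtered[rel_abund < min_rel_abundance] <- 0

# Drop ASVs that no longer have any counts in any sample.
filtered <- filtered[rowSums(filtered) > 0, , drop = FALSE]

print(filtered)
```

Whether a sub-threshold count should be zeroed per sample, as here, or the whole feature removed is a dataset-specific decision; the sketch picks one option only to show the mechanics.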
This workflow is designed to be run interactively in R (e.g. in RStudio), executing sections sequentially and inspecting outputs as they are generated.
Several steps (particularly initial QC, contamination assessment, and mock-based thresholding) are intentionally not fully automated, as they require dataset-specific judgement and biological plausibility checks.
While the full script can be sourced end-to-end, users are strongly encouraged to step through the analysis and review intermediate results before proceeding to downstream filtering and diversity analyses.
The workflow expects the following inputs:
- QIIME2 feature table (.qza)
- Rooted phylogenetic tree (.qza)
- Taxonomy assignments (SILVA and/or Greengenes)
- Sample metadata file (TSV format)
Details on required columns in the metadata file are described in the script comments.
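Before running the analysis it can help to check the metadata file up front. The sketch below is one way to do that in base R. The column names (`sample_id`, `sample_type`, `is_neg_control`), the helper `check_metadata()`, and the inline TSV are hypothetical stand-ins; the actual required columns are the ones listed in the script comments.

```r
# HYPOTHETICAL required columns; replace with those listed in the script comments.
required_cols <- c("sample_id", "sample_type", "is_neg_control")

# Inline TSV standing in for the real metadata file (values invented).
meta_tsv <- "sample_id\tsample_type\tis_neg_control
S1\ttarget\tFALSE
S2\tenvironmental\tFALSE
NC1\tnegative_control\tTRUE"

metadata <- read.delim(text = meta_tsv, stringsAsFactors = FALSE)

# Stop with an informative message if columns are missing or IDs are duplicated.
check_metadata <- function(meta, required, id_col = "sample_id") {
  missing_cols <- setdiff(required, colnames(meta))
  if (length(missing_cols) > 0) {
    stop("Metadata is missing required column(s): ",
         paste(missing_cols, collapse = ", "))
  }
  if (anyDuplicated(meta[[id_col]]) > 0) {
    stop("Duplicate sample IDs found in column '", id_col, "'")
  }
  invisible(TRUE)
}

check_metadata(metadata, required_cols)
```

For a real file, `read.delim(text = meta_tsv)` would be replaced by `read.delim()` on the path to the TSV.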
    analysis/
        main_analysis.R      # Main analysis script
    config/
        config_example.R     # Example configuration file (paths & parameters)
    data/
        README.md            # Place input data here (not tracked)
    output/
        README.md            # Analysis outputs are written here
- Clone the repository
- Copy config/config_example.R to config/config.R
- Edit file paths and dataset-specific parameters
- Run the analysis script:
source("analysis/main_analysis.R")
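As an illustration of the copy-and-edit configuration step, the sketch below writes a stand-in config file to a temporary directory and loads it into its own environment. The parameter names and paths (`feature_table_path`, `metadata_path`, `decontam_threshold`) are hypothetical and are not taken from config_example.R, and main_analysis.R may load its configuration differently.

```r
# Write a stand-in config file to a temporary directory so the sketch runs anywhere.
config_dir <- file.path(tempdir(), "config")
dir.create(config_dir, showWarnings = FALSE, recursive = TRUE)
config_path <- file.path(config_dir, "config.R")

writeLines(c(
  "# HYPOTHETICAL parameter names, not those in config_example.R",
  "feature_table_path <- 'data/table.qza'",
  "metadata_path      <- 'data/metadata.tsv'",
  "decontam_threshold <- 0.1"
), config_path)

# Fail early with a clear message if the config has not been created yet.
if (!file.exists(config_path)) {
  stop("Config not found: copy config/config_example.R to config/config.R")
}

# Load the settings into a dedicated environment instead of the global workspace.
cfg <- new.env()
source(config_path, local = cfg)

print(ls(cfg))
```

Keeping the settings in a separate environment makes it easy to see which values came from the config when stepping through the analysis interactively.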
- This workflow was developed for low-biomass 16S datasets
- Some filtering steps and mock-based thresholds are dataset-specific
- Sections of the script that may require manual intervention or adaptation are clearly marked
- Users are encouraged to read the script comments carefully before reuse
This workflow draws on methods, ideas, and code patterns from multiple sources, including but not limited to:
- Callahan et al. (DADA2)
- Davis et al. (decontam)
- F1000Research microbiome analysis guidelines
- QIIME2 documentation and tutorials
Any adaptations, errors, or interpretations are my own.