stackr: an R package for the analysis of GBS/RADseq data



This is the development page of stackr. If you want to help, see the Contributions section.

Use stackr to: import, explore, manipulate, filter, impute, visualize and export your GBS/RADseq data

  • Import/Export your GBS/RADseq data with the various supported genomic file formats: tidy, wide, VCF, PLINK, genepop, genind, genlight, hierfstat, gtypes, betadiv, dadi and the haplotype file produced by STACKS. Easy integration with other software or R packages like [adegenet](https://github.com/thibautjombart/adegenet), [strataG](https://github.com/EricArcher/strataG.devel/tree/master/strataG.devel), [hierfstat](https://github.com/jgx65/hierfstat), [pegas](https://github.com/emmanuelparadis/pegas), [poppr](https://github.com/grunwaldlab/poppr) and assigner. Conversion functions are integrated with important filters, blacklist and whitelist.

  • Explore and filter important variable characteristics and statistics:

    • missing data,
    • read depth (coverage) of alleles and genotypes,
    • genotype likelihood,
    • genotyped individuals and populations,
    • minor allele frequency (local and global MAF),
    • observed heterozygosity (Het obs) and inbreeding coefficient (Fis),
    • duplicate individuals and mixed samples.
  • Filter: Most genomic analyses look for patterns and trends with various statistics. Bias, noise and outliers can have bounded influence on estimators and interfere with polymorphism discovery. Avoid bad data exploration and control the impact of filters on your downstream genetic analysis. Alleles, genotypes, markers, individuals and populations can be filtered and/or selected in several ways.

  • Map-independent imputation of missing genotypes/alleles using Random Forest or the most frequent category.

  • Visualization: ggplot2-based plotting for publication-ready figures.
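
Of the two imputation strategies listed above, replacing a missing genotype with the most frequent category is the simpler one, and it can be sketched in a few lines of base R. This is only an illustration, not stackr's implementation: the helper `impute_most_frequent()` and the toy genotype matrix are hypothetical, and imputing per marker across all individuals (rather than within each population) is an assumption.

```r
# Hypothetical helper (not part of stackr): replace NA in one marker
# with that marker's most frequent observed genotype.
impute_most_frequent <- function(genotypes) {
  observed <- genotypes[!is.na(genotypes)]
  if (length(observed) == 0) return(genotypes)  # nothing to impute from
  counts <- table(observed)
  # On ties, which.max() picks the first category in sorted order
  most_frequent <- names(counts)[which.max(counts)]
  genotypes[is.na(genotypes)] <- most_frequent
  genotypes
}

# Toy data: rows = individuals, columns = markers (filled column by column)
geno <- matrix(
  c("0/0", "0/1", NA,    "0/0",
    "1/1", NA,    "1/1", "0/1"),
  nrow = 4,
  dimnames = list(paste0("ind", 1:4), c("snp1", "snp2"))
)

# Impute each marker (column) independently
imputed <- apply(geno, 2, impute_most_frequent)
```

Random Forest imputation uses information from the other markers as well, so it is generally the more informative option when the data allow it; the mode-based approach is mostly useful as a fast baseline.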

Installation

To try out the dev version of stackr, copy/paste the code below:

# Install or load the package **devtools**
if (!require("devtools")) install.packages("devtools") # to install
library(devtools) # to load
# Install **stackr**
# devtools::install_github("thierrygosselin/stackr") # to install without vignettes
devtools::install_github("thierrygosselin/stackr", build_vignettes = TRUE)  # to install WITH vignettes
library(stackr) # to load

Prerequisite - Suggestions - Troubleshooting

  • Parallel computing: Follow the steps in this vignette to install an OpenMP enabled randomForestSRC package to do imputations in parallel.
  • Installation problem: see this vignette
  • Windows users: to have stackr run in parallel use parallelsugar. Easy to install and use (instructions).
  • For a better experience in stackr and in R in general, I recommend using RStudio. The R GUI is unstable with functions using parallel (more info). Below, the combination of packages and how I install/load them:
if (!require("pacman")) install.packages("pacman")
library("pacman")
pacman::p_load(devtools, reshape2, ggplot2, stringr, stringi, plyr, dplyr, tidyr, readr, purrr, data.table, ape, adegenet, parallel, lazyeval, randomForestSRC, hierfstat, strataG)
# install_github("thierrygosselin/stackr", build_vignettes = TRUE) # uncomment to install
library("stackr")

Vignettes and examples

Inside R:

browseVignettes("stackr") # To browse vignettes
vignette("vignette_vcf2dadi") # To open specific vignette

Vignettes are in development, check periodically for updates.

Citation:

To get the citation, inside R:

citation("stackr")

New features

Change log, version, new features and bug history now live in the [NEWS.md file](https://github.com/thierrygosselin/stackr/blob/master/NEWS.md)

v.0.3.4

  • updated documentation
  • bug fix in summary_haplotypes introduced by the new version of dplyr::distinct (v.0.5.0)
  • calculation of Pi is now done in parallel inside summary_haplotypes

v.0.3.3

  • tidy_genomic_data: added a check that throws an error when pop.levels != the pop.id in strata
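
The kind of check described above can be sketched as follows. This is an illustration only, not the code inside tidy_genomic_data: the helper name `check_pop_levels()`, the `POP_ID` column name, and reading "!=" as "not the same set of populations" are all assumptions.

```r
# Hypothetical helper: stop when the requested population levels do not
# match the population ids found in the strata.
check_pop_levels <- function(pop.levels, strata) {
  pop.id <- unique(as.character(strata$POP_ID))
  if (!setequal(pop.levels, pop.id)) {
    mismatched <- union(setdiff(pop.levels, pop.id), setdiff(pop.id, pop.levels))
    stop(
      "pop.levels and the populations in strata differ: ",
      paste(mismatched, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Toy strata (column names are assumptions)
strata <- data.frame(
  INDIVIDUALS = c("ind1", "ind2", "ind3"),
  POP_ID = c("north", "south", "south")
)

check_pop_levels(c("south", "north"), strata)  # same set: passes silently

# A mismatch raises an error; capture the message instead of stopping
result <- tryCatch(
  check_pop_levels(c("north", "east"), strata),
  error = function(e) conditionMessage(e)
)
```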

v.0.3.2

  • genomic_converter, including all the vcf2... functions, can now use phased/unphased genotypes. Some pyRAD VCF files (e.g. v.3.0.64) mix GT formats with / and |, e.g. missing GT = ./. and genotyped individuals = 0|0. I'm not sure this follows the VCF specification, but stackr can now read those VCF files.
  • vcf2dadi is more user-friendly for scientists with in- and out-group metadata, using STACKS or not.
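
Reading a GT column that mixes `/` and `|` mostly comes down to splitting on either separator. The base R sketch below shows the idea; it is not stackr's parser. The helper `split_gt()` is hypothetical and assumes diploid genotypes.

```r
# Hypothetical helper: split VCF GT strings on "/" (unphased) or "|" (phased)
# and return a two-column character matrix of alleles; "." becomes NA.
# Assumes every genotype is diploid.
split_gt <- function(gt) {
  alleles <- do.call(rbind, strsplit(gt, "[/|]"))
  alleles[alleles == "."] <- NA
  colnames(alleles) <- c("allele1", "allele2")
  alleles
}

# Mixed separators, as in the pyRAD example: missing = ./. and genotyped = 0|0
gt <- c("./.", "0|0", "0/1", "1|1")
alleles <- split_gt(gt)
```

Note that this sketch discards the phase information; a converter that needs it would have to record which separator each genotype used.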

For previous news: [NEWS.md file](https://github.com/thierrygosselin/stackr/blob/master/NEWS.md)

Roadmap of future developments:

  • Until publication stackr will change rapidly (see contributions below for bug reports).
  • Updated filters: more efficient, interactive and visualization included: in progress
  • Better integration with other GBS/RADseq approaches, besides STACKS: in progress
  • Integrated converter function to input and output several file formats: done
  • Workflow tutorial that links functions and points to specific vignettes to further explore some problems: in progress
  • Integration of several functions with STACKS and DArT database.
  • Use Shiny and ggvis when subplots or facets become available...
  • Suggestions?

Contributions:

This package has been developed in the open, and it wouldn’t be nearly as good without your contributions. There are a number of ways you can help me make this package even better:

  • If you don’t understand something, please let me know.
  • Your feedback on what is confusing or hard to understand is valuable.
  • If you spot a typo, feel free to edit the underlying page and send a pull request.

New to pull requests on GitHub? The process is very easy:

  • Click "Edit this page" on the sidebar.
  • Make the changes using GitHub's in-page editor and save.
  • Submit a pull request and include a brief description of your changes.
  • “Fixing typos” is perfectly adequate.

GBS workflow

The stackr package currently fits at the end of the GBS workflow. Below, a flow chart using [STACKS](http://catchenlab.life.illinois.edu/stacks/) and other software. You can use the [STACKS](http://catchenlab.life.illinois.edu/stacks/) workflow [used in the Bernatchez lab](https://github.com/enormandeau/stacks_workflow).

stackr workflow

Currently under construction. Come back soon!

Table 1: Quality control and filtering RAD/GBS data

Parameter Libraries/Seq.Lanes Allele Genotype Individual Sampling sites Populations Globally
Quality x
Coverage x x
Genotype Likelihood x
Prop. Genotyped x x x x
MAF x x x
HET x
FIS x
SNP number/reads x

Step 1 is a quality assurance step. We need to modify the data to play with it efficiently in R. To have reliable summary statistics, you first need good coverage of your alleles to call your genotypes, good genotype likelihood, enough individuals in each sampling site and enough putative populations with your markers...

Step 2 is where the actual work is done to remove artifactual and uninformative markers based on summary statistics of your markers.
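
As a rough sketch of what filtering on marker summary statistics involves, the base R snippet below computes, for each marker, the proportion of missing genotypes, the minor allele frequency, the observed heterozygosity and Fis, then keeps the markers that pass two example thresholds. It does not use stackr's filter functions. The toy genotype matrix (coded as counts of the alternate allele), the threshold values, and the simple estimator Fis = 1 - Het obs / Het exp are assumptions chosen for illustration.

```r
# Toy genotypes: rows = individuals, columns = markers,
# coded as the number of alternate alleles (0, 1, 2); NA = missing.
geno <- cbind(
  snp1 = c(0, 1, 1, 2, 0, 1),
  snp2 = c(0, 0, 0, 0, 0, 0),      # monomorphic: uninformative
  snp3 = c(NA, NA, NA, 1, 0, NA),  # mostly missing
  snp4 = c(0, 2, 0, 2, 2, 0)       # no heterozygotes
)

# Summary statistics for one marker
marker_stats <- function(g) {
  n <- sum(!is.na(g))                    # genotyped individuals
  p <- sum(g, na.rm = TRUE) / (2 * n)    # alternate allele frequency
  het_obs <- mean(g == 1, na.rm = TRUE)  # observed heterozygosity
  het_exp <- 2 * p * (1 - p)             # expected under Hardy-Weinberg
  c(
    missing = mean(is.na(g)),
    maf = min(p, 1 - p),
    het_obs = het_obs,
    fis = if (het_exp > 0) 1 - het_obs / het_exp else NA_real_
  )
}

# One row per marker, one column per statistic
stats <- t(apply(geno, 2, marker_stats))

# Example thresholds (arbitrary, for illustration only)
keep <- stats[, "missing"] <= 0.2 & stats[, "maf"] >= 0.05
filtered <- geno[, keep, drop = FALSE]
```

Inspecting `stats` before choosing thresholds is the point of the exploration step: here snp4 passes both filters even though it has no heterozygotes at all, which a filter on Het obs or Fis would flag.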
