
The Bio++ Project

Laurent Guéguen edited this page Nov 22, 2021 · 4 revisions

The aim of the Bio++ Project is to provide re-usable code for the rapid development of robust applications in the fields of sequence analysis, phylogenetics, molecular evolution and population genetics.

Bio++ is written in C++ and follows an extensible, object-oriented design.

Available methods

Sequence analysis

  • Sequence and Site objects, with support for various alphabets (DNA, RNA, proteins, codons, any 'Word' of a given size).
  • Several containers available for inner storage, with several implementations. Support for alignments.
  • Various I/O formats supported: Fasta, Mase, Clustal, Phylip, DCSE, GenBank (sequence only), etc.
  • Sequence manipulation: truncation, concatenation, sub-sequences, etc.
  • In silico molecular biology: (reverse) transcription, translation, replication.
  • Several genetic codes available: standard and mitochondrial (vertebrates, echinoderms, yeast and other invertebrates).
  • Amino acids properties: volume, polarity and charge + physico-chemical distance (Miyata and Grantham) + import from any AAIndex entry.
  • Consensus sequences.
  • Pairwise alignment.
  • Similarity score computation.
  • Sequence bootstrap.
  • Homogeneity test (Bowker's test).
  • NGS tools: sequence quality scores, file formats (Phred, FastQ, MAF, etc).
  • etc.
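
To make the "in silico molecular biology" operations above concrete, here is a minimal standalone sketch of reverse complementation, transcription and translation with the standard genetic code. It deliberately does not use Bio++ classes: the function names (`reverseComplement`, `transcribe`, `translate`, `baseIndex`) are hypothetical and chosen for illustration only, and the sketch handles only unambiguous upper-case nucleotides.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Index of a nucleotide in the T, C, A, G ordering used by the NCBI
// genetic code tables; returns -1 for any other character.
int baseIndex(char b) {
  switch (b) {
    case 'T': case 'U': return 0;
    case 'C': return 1;
    case 'A': return 2;
    case 'G': return 3;
    default:  return -1;
  }
}

// Reverse complement of a DNA string; unknown characters become 'N'.
std::string reverseComplement(const std::string& dna) {
  std::string out(dna.rbegin(), dna.rend());
  for (char& c : out) {
    switch (c) {
      case 'A': c = 'T'; break;
      case 'T': c = 'A'; break;
      case 'C': c = 'G'; break;
      case 'G': c = 'C'; break;
      default:  c = 'N'; break;
    }
  }
  return out;
}

// Transcription of the coding strand: every T is replaced by U.
std::string transcribe(const std::string& dna) {
  std::string rna = dna;
  for (char& c : rna) {
    if (c == 'T') c = 'U';
  }
  return rna;
}

// Translation with the standard genetic code (NCBI table 1).
// Codons containing unknown characters are translated as 'X';
// trailing bases that do not form a full codon are ignored.
std::string translate(const std::string& seq) {
  static const std::string code =
      "FFLLSSSSYY**CC*W"   // first base T
      "LLLLPPPPHHQQRRRR"   // first base C
      "IIIMTTTTNNKKSSRR"   // first base A
      "VVVVAAAADDEEGGGG";  // first base G
  assert(code.size() == 64);
  std::string protein;
  for (std::size_t i = 0; i + 3 <= seq.size(); i += 3) {
    const int a = baseIndex(seq[i]);
    const int b = baseIndex(seq[i + 1]);
    const int c = baseIndex(seq[i + 2]);
    protein += (a < 0 || b < 0 || c < 0) ? 'X' : code[16 * a + 4 * b + c];
  }
  return protein;
}
```

For example, `translate("ATGGCCTAA")` yields `"MA*"` (methionine, alanine, stop). Bio++ itself models alphabets and genetic codes as objects, which also covers the mitochondrial codes listed above; this sketch hard-codes the standard table only.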

Phylogenetics and molecular evolution

Data structure and IO

  • Phylogenetic trees.
  • IO from newick files, with support for multiple entries.
  • Support for NHX and Nexus formats.

Substitution models

  • Nucleotide models: JC, K80, T92, F84, HKY85, TN93, GTR, L95, RN95, RN95s
  • Protein models: JC, DSO78, JTT92, Coala, LG08, LG10 set of models, LLG08, LGL08 CAT set of models, WAG01 + any PAML-formatted model description, with the possibility to estimate equilibrium frequencies.
  • Codon models: MG94, YN98, GY94, DFP07, RELAX, SENCA, CORA, KCM9, KCM17 + models from PAML suite + user-defined + others "a la carte".
  • Word or codon models with multiple substitutions of many kinds.
  • Markov-modulated models (a.k.a. covarion models): TS98 and Gal01.
  • Mixed models, including PAML's M1, M2, M3, M7, M8, M10, protein models LGL08 CAT and LLG08, any mixture of values from one or several parameters in a model, and any mixture of several models.
  • Models with compulsory substitutions (of any kind, or from a given set of possible substitutions) on branches, built on the basis of "simple" models.
  • Extension of "simple" models with specific rate multipliers for given sets of substitutions.
  • Model including gaps: RE08.
  • Models on discrete numeric alphabets, Binary model, Equiprobable model
  • Global clock tree likelihood processes.
  • Either stationary or non-stationary processes, available through root frequency definitions.
  • Virtually any kind of non-homogeneous process is supported!
  • Support for rate-across-sites processes, with virtually any probability distribution, allowing for invariant classes.
  • Scenarios of paths among mixed models along the branches of the trees.
  • Heterogeneity of processes along alignments: partitions, HMM, auto-correlation.
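
As a small worked example of what a substitution model computes, the sketch below evaluates the transition probabilities of the JC nucleotide model, the simplest entry in the list above. Under JC, with branch length t measured in expected substitutions per site, the probability of staying in the same state is 1/4 + 3/4·exp(-4t/3) and the probability of ending in any given other state is 1/4 - 1/4·exp(-4t/3). The function name `jc69Probability` is hypothetical and the code is independent of the Bio++ model classes.

```cpp
#include <cassert>
#include <cmath>

// Transition probability under the Jukes-Cantor (JC) model.
// sameState: true for P(i -> i), false for P(i -> j) with j != i.
// t: branch length in expected substitutions per site (must be >= 0).
double jc69Probability(bool sameState, double t) {
  assert(t >= 0.0);
  const double e = std::exp(-4.0 * t / 3.0);
  return sameState ? 0.25 + 0.75 * e : 0.25 - 0.25 * e;
}
```

Each row of the resulting 4×4 matrix sums to one (one "same" entry plus three "different" entries), and all entries tend to the equilibrium frequency 1/4 as t grows. Richer models such as HKY85 or GTR generally need a matrix exponential of the rate matrix rather than a closed form like this one.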

Molecular evolution tools

  • Parameter and branch length estimation under maximum likelihood.
  • Ancestral state reconstruction: marginal likelihood.
  • (Weighted) substitution mapping (number of events, duration of states per site and/or per branch).
  • Sequence simulation under any substitution process (from the simplest to heterogeneous, mixed (with scenarios), site-heterogeneous, with several trees, ...).
  • Posterior probabilities of submodels in analyses that use mixed models.

Population genetics

  • A new file format to deal with codominant markers and bio-sequence data for individuals.
  • Import and export methods with various population genetics software.
  • Specific containers for polymorphism data.
  • Diversity and polymorphism statistics for codominant and sequence data.
  • Estimation of Wright F-statistics and pairwise genetic distance on codominant markers.
  • Statistics on synonymous and non-synonymous sites for coding sequences.
  • Various 'Neutrality' statistics on sequence data (Tajima, Fu and Li, Rand and Kann ...).
  • Various measures of linkage disequilibrium.
  • etc.
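
To illustrate the kind of diversity statistics mentioned above for sequence data, here is a minimal standalone sketch that computes the number of segregating sites S, the nucleotide diversity π (mean number of pairwise differences) and Watterson's estimator θ_W = S / a₁, with a₁ = Σ_{i=1}^{n-1} 1/i. The `DiversityStats` struct and the `diversity` function are hypothetical names for this example, not Bio++ API; the sketch assumes an ungapped alignment of at least two sequences of equal length and reports per-sequence (not per-site) values.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct DiversityStats {
  std::size_t segregatingSites;  // S: number of polymorphic columns
  double pi;                     // mean number of pairwise differences
  double thetaW;                 // Watterson's estimator, S / a1
};

DiversityStats diversity(const std::vector<std::string>& aln) {
  const std::size_t n = aln.size();
  assert(n >= 2);
  const std::size_t len = aln[0].size();
  for (const std::string& s : aln) {
    assert(s.size() == len);
  }

  // A site is segregating if any sequence differs from the first one.
  std::size_t S = 0;
  for (std::size_t site = 0; site < len; ++site) {
    bool polymorphic = false;
    for (std::size_t i = 1; i < n; ++i) {
      if (aln[i][site] != aln[0][site]) polymorphic = true;
    }
    if (polymorphic) ++S;
  }

  // Sum of differences over all unordered pairs of sequences.
  double totalDiff = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    for (std::size_t j = i + 1; j < n; ++j) {
      for (std::size_t site = 0; site < len; ++site) {
        if (aln[i][site] != aln[j][site]) totalDiff += 1.0;
      }
    }
  }
  const double pairs = static_cast<double>(n) * static_cast<double>(n - 1) / 2.0;

  // a1 is the (n-1)-th harmonic number.
  double a1 = 0.0;
  for (std::size_t i = 1; i < n; ++i) {
    a1 += 1.0 / static_cast<double>(i);
  }

  return DiversityStats{S, totalDiff / pairs, static_cast<double>(S) / a1};
}
```

For the three sequences `AAAA`, `AAAT` and `AATT`, this gives S = 2, π = 4/3 and θ_W = 2 / 1.5 = 4/3. Neutrality statistics such as Tajima's D are built from the difference between these two estimators.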

Genomics tools

  • Efficient parser for read sequences with quality scores (FastQ).
  • Efficient customizable parser for genome alignments in MAF format.
  • Classes and tools for handling features (GFF and GTF).
  • etc.

Numerical calculus

Using Bio++ classes:

  • Numerical tools: extended functions (log, factorial, etc.)
  • Vector tools: element-wise functions, statistics (mean, var, sd, correlation, information theory)
  • Classes for matrices implementation.
  • Linear algebra: eigen decomposition, LU decomposition, inversion, etc.
  • Random number generation: Quick & Dirty (32 bits only), Wichmann and Hill, Knuth. Samplers from probability distributions (uniform, normal, gamma, etc.).
  • Function object implementation, with first and second order derivatives.
  • Numerical derivatives computation.
  • Optimization algorithms: golden section search, Brent's algorithm, Powell's method and the downhill simplex method, as well as derivative-based methods such as conjugate gradient and Newton's method. These methods are implemented as objects, using the event-driven Optimizer interface (works with Function objects).
  • Statistics: DataTable object, with I/O from CSV files, probability distributions, sampling and simulations.
  • Kernel density methods.
  • etc.
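
As an illustration of the simplest derivative-free method in the list above, the following is a minimal standalone sketch of golden section search for a unimodal function of one variable. It is written against `std::function` rather than the Bio++ Function and Optimizer classes, and the name `goldenSectionMin` and its parameters are assumptions made for this example.

```cpp
#include <cassert>
#include <cmath>
#include <functional>

// Golden section search: returns an approximate minimizer of f on [a, b],
// assuming f is unimodal on that interval. The bracket shrinks by a factor
// of about 0.618 per iteration until it is narrower than tol.
double goldenSectionMin(const std::function<double(double)>& f,
                        double a, double b, double tol = 1e-8) {
  assert(a < b && tol > 0.0);
  const double invPhi = (std::sqrt(5.0) - 1.0) / 2.0;  // 1 / golden ratio
  double c = b - invPhi * (b - a);
  double d = a + invPhi * (b - a);
  double fc = f(c);
  double fd = f(d);
  while (b - a > tol) {
    if (fc < fd) {
      // The minimum lies in [a, d]; the old c becomes the new d.
      b = d;
      d = c;
      fd = fc;
      c = b - invPhi * (b - a);
      fc = f(c);
    } else {
      // The minimum lies in [c, b]; the old d becomes the new c.
      a = c;
      c = d;
      fc = fd;
      d = a + invPhi * (b - a);
      fd = f(d);
    }
  }
  return 0.5 * (a + b);
}
```

Calling `goldenSectionMin([](double x) { return (x - 2.0) * (x - 2.0); }, 0.0, 5.0)` returns a value close to 2. Only one new function evaluation is needed per iteration, which is the main appeal of the method when the function (for instance a likelihood) is expensive to compute.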

Access to Eigen3 classes, with Bio++-specific management of underflow.

Utils

  • Files: working on file paths, getting file extensions and names, testing existence, open and store in string arrays, etc.
  • Text: convert text to any other type and vice versa, remove spaces, tokenize, switch between upper/lower case, etc.
  • Applications: read options from a file or command line
  • etc.

Graphical components for GUI development

These classes are developed using the Qt library.

  • Tree canvas and controllers

The PhyView phylogenetic editor is the first program to use the library. It offers features such as tree editing, rerooting, branch length editing and subtree sampling, and allows data to be associated with a tree.
