Skip to content

Tree likelihood

Laurent Guéguen edited this page Dec 28, 2021 · 9 revisions

Computing the likelihood of a sequence alignment

Two algorithms exist for computing the likelihood of a sequence alignment given a phylogenetic tree and model of sequence evolution. These algorithms are called "simple recursive" (R) and "double recursive" (DR). They are described in detail in Felsenstein's book. The R algorithm uses less memory than the DR one, and is faster as it involves only one recursion to compute the likelihood (forward recursion, aka from the leaves to the root). The DR algorithm however, enables the direct computation of conditional likelihoods everywhere in the tree using a forward and backward (ie from the root to the leaves) recursion. DR recursion needs more memory than R recursion, but is useful to fetch branch specific posterior values, for the fast computation of ancestral sequences and substitution mapping, while this would involve additional recursion with a R algorithm. It is also useful to fasten computations that involve topology change, like nearest neighbor interchange (NNI).

Since Bio++ v3, all likelihood calculations are performed in a data flow, through LikelihoodCalculationSingleProcess. All objects necessary to all the computations (models, branch lengths, roots, Forward and Backward operators, are set at the construction, from a given SubstitutionProcess. All objects with data are empty, and will be filled and updated on the fly. So there is no distinction between R and DR implementations: R (aka Forward) algorithm is used while computing the likelihoods at the root, and parts of DR (aka Backward) are computed when needed.

The PhyloLikelihoods classes allow the computation of the likelihood of one or several data sets, according to one or several substitution processes. Each computation (process,data) involves a specific LikelihoodCalculationSingleProcess, and several objects will share similar objects from the DataFlow in they are in the same Context. In Bio++, this process is described in a class implementing the SubstitutionProcess interface, as described in the next section.

All parameters from the SubstitutionProcess are COPIED in a DataFlow Context when used in a LikelihoodCalculationSingleProcess. So when the values of those DF parameters are changed (for example during an optimization process), the original parameters are NOT changed. To access to the new values, it is necessary to get them through the LikelihoodCalculationSingleProcess. As so, the getSubstitutionProcess() method in LikelihoodCalculationSingleProcess accesses to the constant SubstitutionProcess, which parameters are not updated during the optimization. To update it, it has to be done explicitly. For example:

Context context;

auto process=std::make_shared<RateAcrossSitesSubstitutionProcess>(model, rDist, partree.release());
auto lik = std::make_shared<LikelihoodCalculationSingleProcess>(context, *aln, *process);
PhyloLikelihood* treeLik = new SingleProcessPhyloLikelihood(context, lik);

treeLik = PhylogeneticsApplicationTools::optimizeParameters(treeLik, treeLik->getParameters(), bpppopstats.getParams(), "", true, true, 2);

process->matchParametersValues(lik->getParameters());

Substitution process

The substitution process describes how sequences evolve along a phylogenetic tree. Several types of modeling are supported, via distinct implementations, which all share the same SubstitutionProcess interface. Whatever the complexity of the process, it will always require a phylogenetic tree to be provided. In Bio++, trees are implemented in the PhyloTree class. Such trees can have branch lengths, which are actually parameters of the substitution process. To ease the manipulation of these parameters, the SubstitutionProcess classes require a ParametrizableTree object, which can be instanciated from a TreeTemplate object:

unique_ptr<PhyloTree> tree(reader.parenthesisToPhyloTree("newik tree description"));
auto parTree = std::make_shared<ParametrizablePhyloTree>(*tree);

The SubstitutionProcess classes are tightly bound to SubstitutionModel classes. They will typically own one or more instances of those. While the SubstitutionModel class computes the so-called transition probabilities between all states of a model (defined after an alphabet), the SubstitutionProcess class will provide such probabilities on each branch of an underlying tree. The SubstitutionProcess class will notably be in charge of computing the probabilities only when necessary (with a call to the corresponding method of the appropriate SubstitutionModel class), and store the result temporarily. The process has to listen to any parameter change and make sure the computed probabilities are always correct. The SubstitutionProcess class also allows the computation of other related values, such as the derivatives of the probabilities according to time, the generator of the Markov model, the equilibrium frequencies, and so on.

Several implementations of SubstitutionProcess are available, and are described in the following subsections.

The three next classes own the objects defining the process, and then inherit from interface AutonomousSubstitutionProcess (which inherits from SubstitutionProcess interface).

Simple substitution process

The SimpleSubstitutionProcess class implements the simplest model of sequence evolution, defined by a single SubstitutionModel object passed as argument. The process assumes that all branches and all sites share the same substitution model (model homogeneous in time and in sites).

RAS substitution process

The RateAcrossSitesSubstitutionProcess class implements Yang's Rate-across-sites model of rate variation. It is very similar to the SimpleSubstitutionProcess, but takes as argument as extra parameter a RateDistribution object, describing the discrete distribution of rates to use (typically a discretized gamma distribution).

Non-homogeneous substitution process

In the NonHomogeneousSubstitutionProcess, different models are applied to branches of the tree, specific root frequencies distribution can be used, and distributions of substitution rates across sites.

SubstitutionProcessCollectionMember

The management of several substitution processes that share objects (models, trees, root frequencies, ...) can be done through SubstitutionProcessCollection, which will own all these objects, plus processes known as SubstitutionProcessCollectionMembers in which objects are referenced through numbers from the SubstitutionProcessCollection. Typically these processes are non-homogeneous).

References

  • Felsenstein, J. 2004. Inferring Phylogenies. Sinauer Associates, Sunderland, Mass.

Clone this wiki locally