Phylogenetics

Phylogenetic Trees

Phylogenetics reconstructs evolutionary relationships among organisms, genes, or proteins. A phylogenetic tree is a graph-theoretical structure representing these relationships.

Tree Terminology

Taxa/Leaves: Terminal nodes representing observed sequences or species
Internal nodes: Hypothetical ancestors
Branches/Edges: Connect nodes; branch lengths represent evolutionary distance (substitutions per site) or time
Topology: The branching pattern independent of branch lengths
Rooted tree: Has a designated root representing the most recent common ancestor; implies direction of evolution
Unrooted tree: Shows relationships without specifying ancestry direction
Clade/Monophyletic group: An ancestor and all its descendants
Polytomy: A node with more than two descendants (hard = true simultaneous divergence; soft = unresolved)

For n taxa, the number of unrooted bifurcating trees is (2n-5)!! = (2n-5)(2n-7)...3*1. For 10 taxa: 2,027,025 trees. For 50 taxa: ~2.84 x 10^76. Exhaustive search is impossible for even moderate numbers of taxa.

Rooting Methods

Outgroup rooting: Include a known outgroup taxon; root placed on the branch separating outgroup from ingroup
Midpoint rooting: Root placed at the midpoint of the longest path (assumes approximate molecular clock)
Minimal ancestor deviation (MAD): Minimizes deviation from molecular clock across all root positions

Distance-Based Methods

Distance Matrix Construction

Compute pairwise evolutionary distances between all sequence pairs. Observed sequence differences (p-distance) underestimate true distances due to multiple substitutions at the same site. Correction models account for this:

Jukes-Cantor: d = -(3/4)ln(1 - 4p/3); assumes equal base frequencies and substitution rates
Kimura 2-parameter: Distinguishes transitions (rate alpha) from transversions (rate beta)
For proteins: Poisson correction, gamma distance, or empirical matrix-based distances

UPGMA (Unweighted Pair Group Method with Arithmetic Mean)

Agglomerative clustering that assumes a molecular clock (ultrametric tree):

Find the pair with smallest distance; merge them
Recompute distances to the merged group as arithmetic means
Repeat until one cluster remains

Produces a rooted tree. Major limitation: The molecular clock assumption is frequently violated, leading to incorrect topologies when rates vary across lineages.

Neighbor Joining (NJ)

Does not assume a molecular clock. Uses a rate-corrected distance criterion:

Compute Q-matrix: Q(i,j) = (n-2)d(i,j) - sum_k d(i,k) - sum_k d(j,k)
Join the pair with minimum Q value
Compute branch lengths and update distance matrix
Repeat until three nodes remain

Properties: O(n^3) time, produces an unrooted tree, consistent estimator under most distance models. Widely used for guide tree construction in MSA and as a starting tree for ML searches.

Maximum Parsimony

Finds the tree(s) requiring the fewest character state changes (substitutions) to explain the observed data.

Fitch Algorithm (Unordered Characters)

For a given tree topology, compute the parsimony score in O(nk) time (n taxa, k sites):

Bottom-up (post-order): At each internal node, take the intersection of child state sets if non-empty; otherwise take the union and increment the score
Top-down (pre-order): Assign states to internal nodes

Weighted Parsimony (Sankoff Algorithm)

Allows different costs for different types of changes via a cost matrix. Uses DP on the tree; each internal node stores the minimum cost of each state.

Tree Search

Since exhaustive enumeration is impossible for large datasets, heuristic search strategies are used:

Stepwise addition: Add taxa one at a time to all possible positions; keep best
Branch swapping: NNI (nearest-neighbor interchange), SPR (subtree pruning and regrafting), TBR (tree bisection and reconnection)
NNI explores O(n) neighbors, SPR O(n^2), TBR O(n^3) per step

Criticisms of parsimony: Susceptible to long-branch attraction (LBA); does not model rate variation; inconsistent under certain conditions (Felsenstein zone).

Maximum Likelihood (ML)

Finds the tree (topology + branch lengths + model parameters) that maximizes the probability of observing the data.

Substitution Models

Nucleotide models form a hierarchy (nested by parameter constraints):

Model	Free Rate Parameters	Base Frequencies
JC69	0 (all rates equal)	Equal
K80 (K2P)	1 (ti/tv ratio)	Equal
HKY85	1 (ti/tv ratio)	Unequal
GTR	5 (6 rate parameters, 1 fixed)	Unequal

Rate heterogeneity across sites:

+G (Gamma): Rates drawn from a discrete gamma distribution with shape parameter alpha; alpha < 1 indicates strong rate variation. Typically 4 rate categories.
+I (Invariant sites): Fraction of sites that cannot change
+G+I: Combined model (GTR+G+I is a common default)

Protein substitution models: WAG, LG, JTT, Dayhoff, cpREV (chloroplast), mtREV (mitochondrial). Model selection via AIC, BIC, or likelihood ratio tests (ModelTest-NG, ModelFinder in IQ-TREE).

Likelihood Computation

For a given tree T with branch lengths and model M, the likelihood for site i:

L_i = sum over all ancestral state assignments of P(data at leaves | ancestral states, T, M)

Computed efficiently using Felsenstein's pruning algorithm (post-order tree traversal). Total log-likelihood: ln L = sum_i ln L_i. Complexity: O(nks^2) per tree evaluation (n taxa, k sites, s states).

Tree Search in ML

Starting tree: NJ or parsimony
Branch length optimization: Newton-Raphson or Brent's method for each branch
Topology search: NNI, SPR, or TBR rearrangements; accept moves that increase likelihood
Stochastic perturbation: Random NNI/SPR to escape local optima

ML Software

RAxML-NG: Fast ML tree search with rapid bootstrapping; optimized for large datasets; uses vectorized likelihood kernels (AVX2/AVX-512)
IQ-TREE 2: Stochastic tree search with perturbation; ultrafast bootstrap (UFBoot2); integrated ModelFinder; supports partition models, mixture models, polymorphism-aware models
PhyML: Classic ML implementation with NNI and SPR search

Bootstrapping

Non-parametric bootstrap assesses branch support:

Resample alignment columns with replacement to create pseudoreplicates
Infer a tree for each pseudoreplicate
Report the fraction of bootstrap trees containing each bipartition

Typically 100-1000 replicates. Ultrafast bootstrap (UFBoot): Uses RELL approximation to evaluate bootstrap replicates without full tree search; 1000+ replicates in minutes.

SH-aLRT: Approximate likelihood ratio test; faster alternative to bootstrap; values >80% generally indicate reliable branches.

Bayesian Phylogenetics

Computes the posterior distribution of trees: P(T, theta | D) proportional to P(D | T, theta) * P(T) * P(theta).

Markov Chain Monte Carlo (MCMC)

Since the posterior cannot be computed analytically, MCMC sampling is used:

Propose new tree/parameter states via perturbation operators
Accept/reject proposals according to the Metropolis-Hastings ratio
After burn-in, the chain of sampled trees approximates the posterior distribution
Posterior probabilities of clades = fraction of post-burn-in trees containing them

Convergence diagnostics: Multiple independent runs; compare split frequencies (average standard deviation of split frequencies < 0.01); ESS (effective sample size) > 200 for all parameters; trace plots for stationarity.

Bayesian Software

MrBayes: General-purpose Bayesian inference; Metropolis-coupled MCMC (MC^3: cold chain + heated chains for better mixing); supports partitioned models, mixed models, topology constraints.

BEAST/BEAST2: Specialized for molecular clock and divergence time estimation:

Strict clock: Single rate across the tree
Relaxed clocks: Uncorrelated lognormal or exponential rate variation among branches
Tip dating: Incorporates sampling dates (essential for rapidly evolving pathogens)
Tree priors: Coalescent (constant, skyline, skygrid), birth-death, fossilized birth-death
Calibration: Fossil calibrations as node age priors; geological events

Coalescent Theory

The coalescent models gene genealogies backward in time as lineages coalesce (merge) in ancestral populations.

Basic Coalescent

For n lineages in a population of effective size N_e, the rate of coalescence is n(n-1)/(4N_e) per generation (diploid). The expected time to the most recent common ancestor of n lineages is 4N_e(1 - 1/n) generations. Gene trees can differ from species trees due to incomplete lineage sorting (ILS).

Multispecies Coalescent

Models gene tree/species tree discordance:

ASTRAL: Summary method; finds the species tree maximizing the number of shared quartet topologies with input gene trees. Statistically consistent under the multispecies coalescent. O(nk^2) where k is the number of taxa.
StarBEAST3: Full Bayesian co-estimation of gene trees and species tree under the multispecies coalescent
*BEAST: Joint inference in BEAST2 framework

Gene Tree Discordance Sources

Beyond ILS: hybridization/introgression (detected by D-statistics/ABBA-BABA, PhyloNetworks), horizontal gene transfer (common in prokaryotes), gene duplication and loss (reconciliation methods), recombination.

Molecular Clock and Divergence Dating

The molecular clock hypothesis posits that molecular evolution occurs at a roughly constant rate. While strict clocks are rarely justified, relaxed clock models allow rate variation and enable divergence time estimation when calibrated with external information (fossils, biogeography, mutation rate estimates).

Key considerations: Choice of clock model, calibration prior specification (uniform vs. lognormal vs. exponential bounds), interaction between tree prior and clock model, and proper assessment of uncertainty in age estimates.

Practical Phylogenetic Workflow

Compile orthologous sequences; assess with alignment quality checks
Multiple sequence alignment (MAFFT, MUSCLE)
Alignment trimming (trimAl) or masking unreliable regions
Model selection (ModelFinder, ModelTest-NG)
Tree inference (IQ-TREE, RAxML-NG, or MrBayes/BEAST for dating)
Support assessment (bootstrap, posterior probabilities, SH-aLRT)
Tree visualization (FigTree, iTOL, ggtree)
Biological interpretation in the context of known taxonomy and biogeography