Phylogenetics
Phylogenetic Trees
Phylogenetics reconstructs evolutionary relationships among organisms, genes, or proteins. A phylogenetic tree is a graph-theoretical structure representing these relationships.
Tree Terminology
- Taxa/Leaves: Terminal nodes representing observed sequences or species
- Internal nodes: Hypothetical ancestors
- Branches/Edges: Connect nodes; branch lengths represent evolutionary distance (substitutions per site) or time
- Topology: The branching pattern independent of branch lengths
- Rooted tree: Has a designated root representing the most recent common ancestor; implies direction of evolution
- Unrooted tree: Shows relationships without specifying ancestry direction
- Clade/Monophyletic group: An ancestor and all its descendants
- Polytomy: A node with more than two descendants (hard = true simultaneous divergence; soft = unresolved)
Counting Trees
For n taxa, the number of unrooted bifurcating trees is (2n-5)!! = (2n-5)(2n-7)...3*1. For 10 taxa: 2,027,025 trees. For 50 taxa: ~2.84 x 10^76. Exhaustive search is impossible for even moderate numbers of taxa.
Rooting Methods
- Outgroup rooting: Include a known outgroup taxon; root placed on the branch separating outgroup from ingroup
- Midpoint rooting: Root placed at the midpoint of the longest path (assumes approximate molecular clock)
- Minimal ancestor deviation (MAD): Minimizes deviation from molecular clock across all root positions
Distance-Based Methods
Distance Matrix Construction
Compute pairwise evolutionary distances between all sequence pairs. Observed sequence differences (p-distance) underestimate true distances due to multiple substitutions at the same site. Correction models account for this:
- Jukes-Cantor: d = -(3/4)ln(1 - 4p/3); assumes equal base frequencies and substitution rates
- Kimura 2-parameter: Distinguishes transitions (rate alpha) from transversions (rate beta)
- For proteins: Poisson correction, gamma distance, or empirical matrix-based distances
UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
Agglomerative clustering that assumes a molecular clock (ultrametric tree):
- Find the pair with smallest distance; merge them
- Recompute distances to the merged group as arithmetic means
- Repeat until one cluster remains
Produces a rooted tree. Major limitation: The molecular clock assumption is frequently violated, leading to incorrect topologies when rates vary across lineages.
Neighbor Joining (NJ)
Does not assume a molecular clock. Uses a rate-corrected distance criterion:
- Compute Q-matrix: Q(i,j) = (n-2)d(i,j) - sum_k d(i,k) - sum_k d(j,k)
- Join the pair with minimum Q value
- Compute branch lengths and update distance matrix
- Repeat until three nodes remain
Properties: O(n^3) time, produces an unrooted tree, consistent estimator under most distance models. Widely used for guide tree construction in MSA and as a starting tree for ML searches.
Maximum Parsimony
Finds the tree(s) requiring the fewest character state changes (substitutions) to explain the observed data.
Fitch Algorithm (Unordered Characters)
For a given tree topology, compute the parsimony score in O(nk) time (n taxa, k sites):
- Bottom-up (post-order): At each internal node, take the intersection of child state sets if non-empty; otherwise take the union and increment the score
- Top-down (pre-order): Assign states to internal nodes
Weighted Parsimony (Sankoff Algorithm)
Allows different costs for different types of changes via a cost matrix. Uses DP on the tree; each internal node stores the minimum cost of each state.
Tree Search
Since exhaustive enumeration is impossible for large datasets, heuristic search strategies are used:
- Stepwise addition: Add taxa one at a time to all possible positions; keep best
- Branch swapping: NNI (nearest-neighbor interchange), SPR (subtree pruning and regrafting), TBR (tree bisection and reconnection)
- NNI explores O(n) neighbors, SPR O(n^2), TBR O(n^3) per step
Criticisms of parsimony: Susceptible to long-branch attraction (LBA); does not model rate variation; inconsistent under certain conditions (Felsenstein zone).
Maximum Likelihood (ML)
Finds the tree (topology + branch lengths + model parameters) that maximizes the probability of observing the data.
Substitution Models
Nucleotide models form a hierarchy (nested by parameter constraints):
| Model | Free Rate Parameters | Base Frequencies | |---|---|---| | JC69 | 0 (all rates equal) | Equal | | K80 (K2P) | 1 (ti/tv ratio) | Equal | | HKY85 | 1 (ti/tv ratio) | Unequal | | GTR | 5 (6 rate parameters, 1 fixed) | Unequal |
Rate heterogeneity across sites:
- +G (Gamma): Rates drawn from a discrete gamma distribution with shape parameter alpha; alpha < 1 indicates strong rate variation. Typically 4 rate categories.
- +I (Invariant sites): Fraction of sites that cannot change
- +G+I: Combined model (GTR+G+I is a common default)
Protein substitution models: WAG, LG, JTT, Dayhoff, cpREV (chloroplast), mtREV (mitochondrial). Model selection via AIC, BIC, or likelihood ratio tests (ModelTest-NG, ModelFinder in IQ-TREE).
Likelihood Computation
For a given tree T with branch lengths and model M, the likelihood for site i:
L_i = sum over all ancestral state assignments of P(data at leaves | ancestral states, T, M)
Computed efficiently using Felsenstein's pruning algorithm (post-order tree traversal). Total log-likelihood: ln L = sum_i ln L_i. Complexity: O(nks^2) per tree evaluation (n taxa, k sites, s states).
Tree Search in ML
- Starting tree: NJ or parsimony
- Branch length optimization: Newton-Raphson or Brent's method for each branch
- Topology search: NNI, SPR, or TBR rearrangements; accept moves that increase likelihood
- Stochastic perturbation: Random NNI/SPR to escape local optima
ML Software
- RAxML-NG: Fast ML tree search with rapid bootstrapping; optimized for large datasets; uses vectorized likelihood kernels (AVX2/AVX-512)
- IQ-TREE 2: Stochastic tree search with perturbation; ultrafast bootstrap (UFBoot2); integrated ModelFinder; supports partition models, mixture models, polymorphism-aware models
- PhyML: Classic ML implementation with NNI and SPR search
Bootstrapping
Non-parametric bootstrap assesses branch support:
- Resample alignment columns with replacement to create pseudoreplicates
- Infer a tree for each pseudoreplicate
- Report the fraction of bootstrap trees containing each bipartition
Typically 100-1000 replicates. Ultrafast bootstrap (UFBoot): Uses RELL approximation to evaluate bootstrap replicates without full tree search; 1000+ replicates in minutes.
SH-aLRT: Approximate likelihood ratio test; faster alternative to bootstrap; values >80% generally indicate reliable branches.
Bayesian Phylogenetics
Computes the posterior distribution of trees: P(T, theta | D) proportional to P(D | T, theta) * P(T) * P(theta).
Markov Chain Monte Carlo (MCMC)
Since the posterior cannot be computed analytically, MCMC sampling is used:
- Propose new tree/parameter states via perturbation operators
- Accept/reject proposals according to the Metropolis-Hastings ratio
- After burn-in, the chain of sampled trees approximates the posterior distribution
- Posterior probabilities of clades = fraction of post-burn-in trees containing them
Convergence diagnostics: Multiple independent runs; compare split frequencies (average standard deviation of split frequencies < 0.01); ESS (effective sample size) > 200 for all parameters; trace plots for stationarity.
Bayesian Software
MrBayes: General-purpose Bayesian inference; Metropolis-coupled MCMC (MC^3: cold chain + heated chains for better mixing); supports partitioned models, mixed models, topology constraints.
BEAST/BEAST2: Specialized for molecular clock and divergence time estimation:
- Strict clock: Single rate across the tree
- Relaxed clocks: Uncorrelated lognormal or exponential rate variation among branches
- Tip dating: Incorporates sampling dates (essential for rapidly evolving pathogens)
- Tree priors: Coalescent (constant, skyline, skygrid), birth-death, fossilized birth-death
- Calibration: Fossil calibrations as node age priors; geological events
Coalescent Theory
The coalescent models gene genealogies backward in time as lineages coalesce (merge) in ancestral populations.
Basic Coalescent
For n lineages in a population of effective size N_e, the rate of coalescence is n(n-1)/(4N_e) per generation (diploid). The expected time to the most recent common ancestor of n lineages is 4N_e(1 - 1/n) generations. Gene trees can differ from species trees due to incomplete lineage sorting (ILS).
Multispecies Coalescent
Models gene tree/species tree discordance:
- ASTRAL: Summary method; finds the species tree maximizing the number of shared quartet topologies with input gene trees. Statistically consistent under the multispecies coalescent. O(nk^2) where k is the number of taxa.
- StarBEAST3: Full Bayesian co-estimation of gene trees and species tree under the multispecies coalescent
- *BEAST: Joint inference in BEAST2 framework
Gene Tree Discordance Sources
Beyond ILS: hybridization/introgression (detected by D-statistics/ABBA-BABA, PhyloNetworks), horizontal gene transfer (common in prokaryotes), gene duplication and loss (reconciliation methods), recombination.
Molecular Clock and Divergence Dating
The molecular clock hypothesis posits that molecular evolution occurs at a roughly constant rate. While strict clocks are rarely justified, relaxed clock models allow rate variation and enable divergence time estimation when calibrated with external information (fossils, biogeography, mutation rate estimates).
Key considerations: Choice of clock model, calibration prior specification (uniform vs. lognormal vs. exponential bounds), interaction between tree prior and clock model, and proper assessment of uncertainty in age estimates.
Practical Phylogenetic Workflow
- Compile orthologous sequences; assess with alignment quality checks
- Multiple sequence alignment (MAFFT, MUSCLE)
- Alignment trimming (trimAl) or masking unreliable regions
- Model selection (ModelFinder, ModelTest-NG)
- Tree inference (IQ-TREE, RAxML-NG, or MrBayes/BEAST for dating)
- Support assessment (bootstrap, posterior probabilities, SH-aLRT)
- Tree visualization (FigTree, iTOL, ggtree)
- Biological interpretation in the context of known taxonomy and biogeography