Sequence Alignment
Foundations
Sequence alignment identifies regions of similarity between biological sequences (DNA, RNA, protein) that may reflect functional, structural, or evolutionary relationships. Alignment is a core operation in bioinformatics, underpinning homology detection, phylogenetics, variant calling, and structure prediction.
Scoring Schemes
An alignment maps positions in one sequence to positions in another, allowing matches, mismatches, and gaps (insertions/deletions). The quality of an alignment is quantified by a scoring function:
Score = sum of substitution scores for aligned positions + gap penalties
The choice of scoring parameters dramatically affects alignment sensitivity and specificity.
Pairwise Alignment
Needleman-Wunsch (Global Alignment)
Aligns two sequences end-to-end using dynamic programming. Given sequences of length m and n, construct an (m+1) x (n+1) matrix F where F(i,j) is the optimal score aligning the first i characters of sequence 1 with the first j of sequence 2.
Recurrence:
F(i,j) = max { F(i-1, j-1) + s(a_i, b_j), // match/mismatch
F(i-1, j) + gap_penalty, // gap in sequence 2
F(i, j-1) + gap_penalty } // gap in sequence 1
Time and space complexity: O(mn). Traceback from F(m,n) recovers the optimal alignment. Hirschberg's algorithm reduces space to O(min(m,n)) using divide-and-conquer while maintaining O(mn) time.
Smith-Waterman (Local Alignment)
Finds the highest-scoring local subsequence alignment. Modification: add a zero option to the recurrence (F(i,j) >= 0), and traceback starts from the maximum scoring cell, terminating at any cell with score 0.
This is essential for finding conserved domains within otherwise divergent sequences, or detecting homologous regions between sequences of different lengths.
Gap Penalty Models
- Linear: gamma(g) = -gd, where d is the per-position penalty. Computationally simple but biologically unrealistic (opening a gap is costlier than extending it).
- Affine: gamma(g) = -d - (g-1)e, with gap-open penalty d and gap-extend penalty e (d > e). Requires three DP matrices (M, Ix, Iy). Standard in most tools.
- Convex/logarithmic: gamma(g) = -d - e*ln(g). More biologically realistic but computationally expensive; rarely used in practice.
Substitution Matrices
PAM (Point Accepted Mutation)
Derived by Dayhoff from closely related protein sequences (>85% identity). PAM1 represents 1% divergence. Higher PAM matrices (PAM250) are obtained by matrix exponentiation, modeling longer evolutionary distances. PAM matrices assume a Markov model of amino acid substitution.
Entries are log-odds scores: s(i,j) = log2(q_ij / (p_i * p_j)), where q_ij is the observed substitution frequency and p_i, p_j are background frequencies.
BLOSUM (BLocks Substitution Matrix)
Derived from ungapped aligned blocks in the BLOCKS database. BLOSUM62 uses sequences clustered at 62% identity to reduce sampling bias. Unlike PAM, each BLOSUM matrix is computed directly from observed data at the target evolutionary distance.
| Matrix | Best For | Evolutionary Distance | |---|---|---| | BLOSUM80 | Closely related sequences | Short | | BLOSUM62 | General purpose (default) | Medium | | BLOSUM45 | Distant homologs | Long | | PAM120 | Close relationships | Short | | PAM250 | Distant relationships | Long |
DNA scoring: Simpler matrices suffice; +1/-1 or +2/-3 match/mismatch schemes are common. Transition/transversion-aware matrices improve sensitivity for divergent sequences.
Heuristic Alignment Methods
Exact DP algorithms are too slow for database searches (O(mn) per query-database pair). Heuristic methods sacrifice guaranteed optimality for speed.
BLAST (Basic Local Alignment Search Tool)
BLAST achieves near-linear time through a seed-and-extend strategy:
- Seeding: Identify exact word matches (default: 11 for nucleotide, 3 for protein) between query and database
- Extension: Extend seeds in both directions using ungapped alignment until score drops below threshold X
- Evaluation: Score extensions; merge nearby HSPs (High-Scoring Pairs); perform gapped alignment on promising hits
- Statistics: E-value = Kmn * e^(-lambda*S), where K and lambda are Karlin-Altschul parameters, m and n are sequence/database lengths, S is the raw score
BLAST variants: blastn (nt vs nt), blastp (protein vs protein), blastx (translated nt vs protein), tblastn (protein vs translated nt), tblastx (translated vs translated), PSI-BLAST (iterative profile search), DELTA-BLAST (domain-enhanced).
E-value interpretation: Expected number of alignments with score >= S by chance. E < 1e-5 typically indicates homology; values depend on database size.
DIAMOND
Optimized for protein/translated alignment against large databases. Uses double indexing (seeds from both query and reference), spaced seeds for sensitivity, and SIMD-accelerated ungapped extension. 100-1000x faster than BLAST with comparable sensitivity.
Other Fast Aligners
- MMseqs2: Iterative k-mer matching with prefiltering; clustering and searching
- LAST: Adaptive seeds based on sequence composition; handles genomes with repetitive elements
- minimap2: Minimizer-based seeding; fast genome-to-genome and long-read alignment
Multiple Sequence Alignment (MSA)
MSA extends pairwise alignment to three or more sequences simultaneously. The exact solution via DP is O(L^N) for N sequences of length L, making it intractable for more than a handful of sequences.
Progressive Alignment
The most common heuristic approach:
- Compute all pairwise distances (or similarity scores)
- Build a guide tree (typically NJ or UPGMA)
- Align sequences progressively following the tree topology, from leaves to root
Limitation: Errors in early alignments propagate ("once a gap, always a gap"). Guide tree quality strongly influences final alignment quality.
ClustalOmega
Uses mBed for O(N log N) guide tree construction (embedding sequences into a vector space via distances to seed sequences), then HHalign for profile-profile alignment at each internal node. Scales to millions of sequences.
MUSCLE (v5 / MUSCLE5)
Uses ensemble alignment strategy: generates multiple alternative alignments by perturbing parameters and guide trees, then selects the best by column confidence scoring. Earlier versions (v3) used iterative refinement with kmer-distance guide trees.
MAFFT
Offers multiple algorithms trading speed for accuracy:
- FFT-NS-1/2: Fast Fourier transform for rapid homology detection; progressive
- L-INS-i: Iterative refinement with local pairwise alignment (most accurate for <200 sequences)
- E-INS-i: For sequences with multiple conserved domains separated by long gaps
- G-INS-i: Global iterative refinement
Consistency-Based Methods
T-Coffee: Combines multiple pairwise alignment libraries (local + global) into a position-specific scoring library. The consistency principle: if position i in seq A aligns with position k in seq C, and k aligns with position j in seq B, this provides evidence that i should align with j. Produces highly accurate alignments but is slow for large datasets.
ProbCons: Uses pair-HMMs to compute posterior match probabilities, applies consistency transformation, then progressive alignment weighted by these probabilities.
Profile Methods and HMMs
Position-Specific Scoring Matrices (PSSMs)
A PSSM (or position-weight matrix) encodes the amino acid preferences at each position of an MSA. For position j and amino acid a:
PSSM(j, a) = log2(f_ja / p_a)
where f_ja is the observed (or pseudocount-adjusted) frequency and p_a is the background frequency. PSI-BLAST iteratively constructs PSSMs from search results to detect remote homologs.
Profile Hidden Markov Models (HMMs)
Profile HMMs model MSA columns as a sequence of match (M), insert (I), and delete (D) states:
- Match states: Emit amino acids according to position-specific distributions
- Insert states: Model insertions between conserved positions
- Delete states: Silent states modeling gaps in the alignment
The forward algorithm computes P(sequence|model) in O(LK) time (L = sequence length, K = model length). The Viterbi algorithm finds the most probable state path (equivalent to alignment).
HMMER
The standard tool for profile HMM searching. HMMER3 uses:
- MSV (Multi-Segment Viterbi): Fast SIMD-accelerated filter using ungapped diagonal model
- Viterbi filter: Gapped alignment filter
- Forward algorithm: Full probabilistic scoring for surviving sequences
- E-value calibration: Fits score distributions to determine statistical significance
HMMER powers the Pfam database of protein domain families and is central to genome annotation pipelines.
Alignment Quality and Benchmarking
Benchmarks
- BAliBASE: Curated reference alignments for proteins based on structural superposition
- PREFAB: Pairwise reference alignments from structure
- HomFam: Large protein family alignments
Quality Metrics
- Sum-of-pairs score (SP): Fraction of correctly aligned residue pairs
- Total column score (TC): Fraction of entirely correctly aligned columns
- Modeler score: Fraction of reference pairs recovered
Practical Considerations
- Alignment quality degrades below ~20% sequence identity (the "twilight zone")
- Structure-based alignment is the gold standard for distant homologs
- Alignment filtering (trimAl, Gblocks) removes poorly aligned regions before downstream analysis
- Codon-aware alignment (MACSE, PRANK) is essential for selection analyses
Statistical Significance
The score distribution of random local alignments follows an extreme value distribution (EVD/Gumbel distribution) with parameters K and lambda determined by scoring matrix and sequence composition. The E-value framework allows meaningful comparison of alignment scores across different database sizes and query lengths, distinguishing true homology from chance similarity.