Sequence Alignment

Foundations

Sequence alignment identifies regions of similarity between biological sequences (DNA, RNA, protein) that may reflect functional, structural, or evolutionary relationships. Alignment is a core operation in bioinformatics, underpinning homology detection, phylogenetics, variant calling, and structure prediction.

Scoring Schemes

An alignment maps positions in one sequence to positions in another, allowing matches, mismatches, and gaps (insertions/deletions). The quality of an alignment is quantified by a scoring function:

Score = sum of substitution scores for aligned positions + gap penalties

The choice of scoring parameters dramatically affects alignment sensitivity and specificity.

Pairwise Alignment

Needleman-Wunsch (Global Alignment)

Aligns two sequences end-to-end using dynamic programming. Given sequences of length m and n, construct an (m+1) x (n+1) matrix F where F(i,j) is the optimal score aligning the first i characters of sequence 1 with the first j of sequence 2.

Recurrence:

F(i,j) = max { F(i-1, j-1) + s(a_i, b_j),   // match/mismatch
               F(i-1, j)   + gap_penalty,      // gap in sequence 2
               F(i,   j-1) + gap_penalty }      // gap in sequence 1

Time and space complexity: O(mn). Traceback from F(m,n) recovers the optimal alignment. Hirschberg's algorithm reduces space to O(min(m,n)) using divide-and-conquer while maintaining O(mn) time.

Smith-Waterman (Local Alignment)

Finds the highest-scoring local subsequence alignment. Modification: add a zero option to the recurrence (F(i,j) >= 0), and traceback starts from the maximum scoring cell, terminating at any cell with score 0.

This is essential for finding conserved domains within otherwise divergent sequences, or detecting homologous regions between sequences of different lengths.

Gap Penalty Models

Linear: gamma(g) = -gd, where d is the per-position penalty. Computationally simple but biologically unrealistic (opening a gap is costlier than extending it).
Affine: gamma(g) = -d - (g-1)e, with gap-open penalty d and gap-extend penalty e (d > e). Requires three DP matrices (M, Ix, Iy). Standard in most tools.
Convex/logarithmic: gamma(g) = -d - e*ln(g). More biologically realistic but computationally expensive; rarely used in practice.

Substitution Matrices

PAM (Point Accepted Mutation)

Derived by Dayhoff from closely related protein sequences (>85% identity). PAM1 represents 1% divergence. Higher PAM matrices (PAM250) are obtained by matrix exponentiation, modeling longer evolutionary distances. PAM matrices assume a Markov model of amino acid substitution.

Entries are log-odds scores: s(i,j) = log2(q_ij / (p_i * p_j)), where q_ij is the observed substitution frequency and p_i, p_j are background frequencies.

BLOSUM (BLocks Substitution Matrix)

Derived from ungapped aligned blocks in the BLOCKS database. BLOSUM62 uses sequences clustered at 62% identity to reduce sampling bias. Unlike PAM, each BLOSUM matrix is computed directly from observed data at the target evolutionary distance.

Matrix	Best For	Evolutionary Distance
BLOSUM80	Closely related sequences	Short
BLOSUM62	General purpose (default)	Medium
BLOSUM45	Distant homologs	Long
PAM120	Close relationships	Short
PAM250	Distant relationships	Long

DNA scoring: Simpler matrices suffice; +1/-1 or +2/-3 match/mismatch schemes are common. Transition/transversion-aware matrices improve sensitivity for divergent sequences.

Heuristic Alignment Methods

Exact DP algorithms are too slow for database searches (O(mn) per query-database pair). Heuristic methods sacrifice guaranteed optimality for speed.

BLAST (Basic Local Alignment Search Tool)

BLAST achieves near-linear time through a seed-and-extend strategy:

Seeding: Identify exact word matches (default: 11 for nucleotide, 3 for protein) between query and database
Extension: Extend seeds in both directions using ungapped alignment until score drops below threshold X
Evaluation: Score extensions; merge nearby HSPs (High-Scoring Pairs); perform gapped alignment on promising hits
Statistics: E-value = Kmn * e^(-lambda*S), where K and lambda are Karlin-Altschul parameters, m and n are sequence/database lengths, S is the raw score

BLAST variants: blastn (nt vs nt), blastp (protein vs protein), blastx (translated nt vs protein), tblastn (protein vs translated nt), tblastx (translated vs translated), PSI-BLAST (iterative profile search), DELTA-BLAST (domain-enhanced).

E-value interpretation: Expected number of alignments with score >= S by chance. E < 1e-5 typically indicates homology; values depend on database size.

DIAMOND

Optimized for protein/translated alignment against large databases. Uses double indexing (seeds from both query and reference), spaced seeds for sensitivity, and SIMD-accelerated ungapped extension. 100-1000x faster than BLAST with comparable sensitivity.

Other Fast Aligners

MMseqs2: Iterative k-mer matching with prefiltering; clustering and searching
LAST: Adaptive seeds based on sequence composition; handles genomes with repetitive elements
minimap2: Minimizer-based seeding; fast genome-to-genome and long-read alignment

Multiple Sequence Alignment (MSA)

MSA extends pairwise alignment to three or more sequences simultaneously. The exact solution via DP is O(L^N) for N sequences of length L, making it intractable for more than a handful of sequences.

Progressive Alignment

The most common heuristic approach:

Compute all pairwise distances (or similarity scores)
Build a guide tree (typically NJ or UPGMA)
Align sequences progressively following the tree topology, from leaves to root

Limitation: Errors in early alignments propagate ("once a gap, always a gap"). Guide tree quality strongly influences final alignment quality.

ClustalOmega

Uses mBed for O(N log N) guide tree construction (embedding sequences into a vector space via distances to seed sequences), then HHalign for profile-profile alignment at each internal node. Scales to millions of sequences.

MUSCLE (v5 / MUSCLE5)

Uses ensemble alignment strategy: generates multiple alternative alignments by perturbing parameters and guide trees, then selects the best by column confidence scoring. Earlier versions (v3) used iterative refinement with kmer-distance guide trees.

MAFFT

Offers multiple algorithms trading speed for accuracy:

FFT-NS-1/2: Fast Fourier transform for rapid homology detection; progressive
L-INS-i: Iterative refinement with local pairwise alignment (most accurate for <200 sequences)
E-INS-i: For sequences with multiple conserved domains separated by long gaps
G-INS-i: Global iterative refinement

Consistency-Based Methods

T-Coffee: Combines multiple pairwise alignment libraries (local + global) into a position-specific scoring library. The consistency principle: if position i in seq A aligns with position k in seq C, and k aligns with position j in seq B, this provides evidence that i should align with j. Produces highly accurate alignments but is slow for large datasets.

ProbCons: Uses pair-HMMs to compute posterior match probabilities, applies consistency transformation, then progressive alignment weighted by these probabilities.

Profile Methods and HMMs

Position-Specific Scoring Matrices (PSSMs)

A PSSM (or position-weight matrix) encodes the amino acid preferences at each position of an MSA. For position j and amino acid a:

PSSM(j, a) = log2(f_ja / p_a)

where f_ja is the observed (or pseudocount-adjusted) frequency and p_a is the background frequency. PSI-BLAST iteratively constructs PSSMs from search results to detect remote homologs.

Profile Hidden Markov Models (HMMs)

Profile HMMs model MSA columns as a sequence of match (M), insert (I), and delete (D) states:

Match states: Emit amino acids according to position-specific distributions
Insert states: Model insertions between conserved positions
Delete states: Silent states modeling gaps in the alignment

The forward algorithm computes P(sequence|model) in O(LK) time (L = sequence length, K = model length). The Viterbi algorithm finds the most probable state path (equivalent to alignment).

HMMER

The standard tool for profile HMM searching. HMMER3 uses:

MSV (Multi-Segment Viterbi): Fast SIMD-accelerated filter using ungapped diagonal model
Viterbi filter: Gapped alignment filter
Forward algorithm: Full probabilistic scoring for surviving sequences
E-value calibration: Fits score distributions to determine statistical significance

HMMER powers the Pfam database of protein domain families and is central to genome annotation pipelines.

Alignment Quality and Benchmarking

Benchmarks

BAliBASE: Curated reference alignments for proteins based on structural superposition
PREFAB: Pairwise reference alignments from structure
HomFam: Large protein family alignments

Quality Metrics

Sum-of-pairs score (SP): Fraction of correctly aligned residue pairs
Total column score (TC): Fraction of entirely correctly aligned columns
Modeler score: Fraction of reference pairs recovered

Practical Considerations

Alignment quality degrades below ~20% sequence identity (the "twilight zone")
Structure-based alignment is the gold standard for distant homologs
Alignment filtering (trimAl, Gblocks) removes poorly aligned regions before downstream analysis
Codon-aware alignment (MACSE, PRANK) is essential for selection analyses

Statistical Significance

The score distribution of random local alignments follows an extreme value distribution (EVD/Gumbel distribution) with parameters K and lambda determined by scoring matrix and sequence composition. The E-value framework allows meaningful comparison of alignment scores across different database sizes and query lengths, distinguishing true homology from chance similarity.