Genome Annotation

Overview

Genome annotation identifies and characterizes functional elements within a genome assembly. Structural annotation locates genes and their components (exons, introns, UTRs, regulatory regions). Functional annotation assigns biological roles to predicted gene products. Annotation quality directly impacts all downstream biological interpretation.

Gene Prediction

Ab Initio Gene Prediction

Ab initio methods predict genes solely from sequence composition using statistical models trained on known genes from the target organism or related species.

Signal sensors detect specific sequence motifs:

Splice sites (canonical GT-AG donor-acceptor dinucleotides; GC-AG as a minor class), modeled by position weight matrices or neural networks
Translation initiation sites (Kozak consensus: GCC(A/G)CCATGG)
Polyadenylation signals (AATAAA and variants)
Promoter elements (TATA box, CpG islands, CCAAT box)
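
A position weight matrix for a signal sensor can be sketched as follows. This is a minimal illustration, not the model used by any particular gene finder: the six training sites, the uniform background, and the pseudocount are assumptions chosen for the example, and real models are trained on thousands of annotated sites.

```python
import math

# Hypothetical toy set of aligned donor sites (3 exonic + 6 intronic bases).
TRAINING_SITES = [
    "CAGGTAAGT",
    "AAGGTGAGT",
    "CAGGTAAGA",
    "TAGGTAAGT",
    "CAGGTGAGC",
    "GAGGTAAGT",
]
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # assumed uniform


def build_pwm(sites, pseudocount=1.0):
    """Return one log2-odds dictionary per alignment column."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append(
            {b: math.log2((counts[b] / total) / BACKGROUND[b]) for b in "ACGT"}
        )
    return pwm


def score_site(pwm, window):
    """Sum the log-odds of each base in a window as wide as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))


def scan_donors(pwm, sequence, threshold=0.0):
    """Yield (position, score) for every window scoring above the threshold."""
    width = len(pwm)
    for i in range(len(sequence) - width + 1):
        score = score_site(pwm, sequence[i:i + width])
        if score > threshold:
            yield i, score


pwm = build_pwm(TRAINING_SITES)
consensus_score = score_site(pwm, "CAGGTAAGT")   # matches the training consensus
unrelated_score = score_site(pwm, "TTTCCCAAA")   # unrelated sequence
candidates = list(scan_donors(pwm, "TTTCAGGTAAGTTT"))
```

A positive score means the window looks more like the training sites than like background; the threshold trades sensitivity against false positives.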

Content sensors distinguish coding from non-coding regions:

Codon usage frequency (3-periodic Markov chains capture codon bias)
Hexamer composition differs between coding and intergenic regions
GC content variation across isochores
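
The hexamer content sensor can be illustrated with a log-odds score. This is a sketch under strong assumptions: the two training sequences are invented toy data, the model ignores reading frame, and production gene finders use frame-aware Markov models trained on large gene sets.

```python
import math
from collections import Counter

K = 6  # hexamer length

# Hypothetical toy training data; real models use thousands of sequences.
CODING_TRAIN = ["ATGGCCGCCGCCGCCGCCGCCGCCGCCTAA"]
NONCODING_TRAIN = ["TTTTAAAATTTTAAAATTTTAAAATTTTAA"]


def train_hexamer_model(sequences, pseudocount=1.0):
    """Count overlapping hexamers; keep what is needed for smoothed frequencies."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq[i:i + K] for i in range(len(seq) - K + 1))
    total = sum(counts.values()) + pseudocount * 4 ** K
    return counts, total, pseudocount


def hexamer_prob(model, hexamer):
    """Smoothed relative frequency of one hexamer under a trained model."""
    counts, total, pseudocount = model
    return (counts[hexamer] + pseudocount) / total


def coding_log_odds(seq, coding_model, noncoding_model):
    """Mean log2 ratio of coding to non-coding hexamer frequency across seq."""
    hexamers = [seq[i:i + K] for i in range(len(seq) - K + 1)]
    if not hexamers:
        return 0.0
    ratios = [
        math.log2(hexamer_prob(coding_model, h) / hexamer_prob(noncoding_model, h))
        for h in hexamers
    ]
    return sum(ratios) / len(ratios)


coding_model = train_hexamer_model(CODING_TRAIN)
noncoding_model = train_hexamer_model(NONCODING_TRAIN)
gc_rich_score = coding_log_odds("GCCGCCGCCGCC", coding_model, noncoding_model)
at_rich_score = coding_log_odds("AAAATTTTAAAA", coding_model, noncoding_model)
```

Positive scores indicate coding-like composition and negative scores indicate non-coding-like composition, relative to the two training sets.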

Gene model integration: Hidden Markov Models (HMMs) combine signal and content sensors into a unified probabilistic framework. States represent exon types (initial, internal, terminal, single), introns, and intergenic regions. The Viterbi algorithm finds the most probable gene parse, that is, the single highest-probability path of states through the sequence.
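
The Viterbi decoding step can be sketched on a deliberately tiny model. The two states and all probabilities below are assumptions made for illustration; real gene finders use generalized HMMs with many more states, explicit length distributions, and frame tracking.

```python
import math

# Hypothetical two-state model: GC-rich "coding" versus AT-rich "intergenic".
STATES = ("intergenic", "coding")
START = {"intergenic": 0.9, "coding": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "coding": 0.1},
    "coding": {"intergenic": 0.1, "coding": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(sequence):
    """Return the most probable state path for a non-empty DNA sequence."""
    # scores[s] holds the best log-probability of any path ending in state s.
    scores = {
        s: math.log(START[s]) + math.log(EMIT[s][sequence[0]]) for s in STATES
    }
    backpointers = []
    for base in sequence[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in s at this position.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


path = viterbi("ATATATAT" + "GCGCGCGCGC" + "ATATATAT")
```

With these parameters the GC-rich stretch is long enough that entering and leaving the coding state costs less than emitting ten G/C bases from the intergenic state, so the path switches state at both composition boundaries.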

Key tools:

Augustus: Generalized HMM (GHMM) with flexible state architecture; can incorporate hints from external evidence
GeneMark-ES/EP+: Self-training HMM; GeneMark-ES estimates parameters from the genome itself and requires no pre-existing training data, while GeneMark-EP+ additionally uses hints from cross-species protein alignments
SNAP: Simple and fast HMM gene finder
GlimmerHMM: Interpolated Markov models with decision trees for splice sites

Homology-Based Gene Prediction

Uses similarity to known proteins or transcripts from other organisms to guide gene structure prediction.

Protein-to-genome alignment:

Exonerate: Dynamic-programming alignment with splice-aware models (e.g., protein2genome) that allow introns in the genomic sequence
GeMoMa: Homology-based using annotated reference gene models; transfers intron-exon structures
miniprot: Fast protein-to-genome aligner optimized for large genomes

Transcript-to-genome alignment:

BLAT: Rapid genomic alignment using indexed k-mers; splice-aware
GMAP: Genomic mapping and alignment of mRNA/EST sequences
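
The indexed k-mer seeding idea behind aligners such as BLAT can be sketched as follows. This is a simplified illustration only: it indexes non-overlapping genome k-mers and looks up every overlapping query k-mer, but omits hit clustering, extension, and splice-site handling. The sequences and the value of k are assumptions for the example.

```python
from collections import defaultdict


def build_kmer_index(genome, k):
    """Index non-overlapping k-mers (tiles) of the genome by their start position."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return index


def find_seed_hits(query, index, k):
    """Look up every overlapping query k-mer; return (query_pos, genome_pos) pairs."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], ()):
            hits.append((qpos, gpos))
    return hits


genome = "ACGTACGGTTCAGGCTAAGCTT"   # hypothetical genome fragment
query = "GTTCAGGCTA"                # hypothetical transcript fragment
index = build_kmer_index(genome, k=4)
hits = find_seed_hits(query, index, k=4)
diagonals = {gpos - qpos for qpos, gpos in hits}
```

Hits that share a diagonal (genome position minus query position) are candidates for extension into a full alignment; a jump between diagonals along a transcript suggests an intron.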

Evidence-Based and Hybrid Approaches

MAKER: Iterative annotation pipeline that integrates ab initio predictions, protein homology, and transcript evidence using a weighted evidence combination. Produces annotation edit distance (AED) quality scores. MAKER can iteratively retrain ab initio predictors using its own output.

BRAKER: Combines GeneMark and Augustus with RNA-seq and/or protein evidence for fully automated training and prediction.

GALBA: Uses miniprot protein alignments to train Augustus for annotation of novel genomes.

RNA-seq-Based Annotation

RNA-seq provides direct evidence of expressed genes, alternative splicing, and transcript boundaries.

Read Alignment

STAR (Spliced Transcripts Alignment to a Reference):

Two-pass alignment: first pass discovers novel junctions; second pass uses the complete junction database
Uses suffix array-based seed finding followed by seed clustering and stitching
Handles reads spanning multiple exon junctions
Very fast (>100M reads/hour on a multi-core server) but memory-intensive (~30 GB of RAM for the human genome)

HISAT2:

Graph-based FM index incorporating known SNPs and splice sites into the reference
Hierarchical indexing: global whole-genome index + local indexes for rapid alignment
Lower memory footprint (~8 GB for human) than STAR
Successor to TopHat2/Bowtie2 pipeline

Transcript Assembly

StringTie:

Network flow algorithm on a splice graph: nodes are exons, edges are observed junctions or intron-spanning read pairs
Iteratively extracts the heaviest (best-supported) path from the splice graph and uses a maximum-flow computation to estimate that isoform's abundance
Quantifies expression as TPM (transcripts per million) and FPKM
Merge mode combines assemblies from multiple samples for unified annotation
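
The two expression units can be illustrated with their standard definitions. This is a sketch from fragment counts and transcript lengths; the transcript names and numbers are invented, and assemblers derive these values from their own coverage estimates rather than from a simple count table.

```python
def fpkm(counts, lengths_bp):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total_fragments = sum(counts.values())
    return {
        tx: counts[tx] * 1e9 / (lengths_bp[tx] * total_fragments) for tx in counts
    }


def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalized rates rescaled to sum to 1e6."""
    rates = {tx: counts[tx] / lengths_bp[tx] for tx in counts}
    rate_sum = sum(rates.values())
    return {tx: rate * 1e6 / rate_sum for tx, rate in rates.items()}


# Hypothetical example: equal fragment counts, fourfold difference in length.
counts = {"tx1": 500, "tx2": 500}
lengths_bp = {"tx1": 1000, "tx2": 4000}

fpkm_values = fpkm(counts, lengths_bp)
tpm_values = tpm(counts, lengths_bp)
```

TPM values always sum to one million within a sample, which makes relative proportions easier to compare across samples than FPKM.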

Scallop: Splice graph decomposition that preserves phasing paths (read-supported chains of exons) while aiming to minimize both coverage deviation and the number of predicted transcripts.

Trinity (de novo): For organisms without a reference genome; constructs de Bruijn graphs from RNA-seq reads, partitions into transcript components, and reconstructs full-length transcripts.
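
The de Bruijn graph idea can be sketched in a few lines. This is a toy illustration, not Trinity's implementation: it ignores sequencing errors, k-mer abundance, reverse complements, and branching caused by alternative isoforms. The reads and k are assumptions for the example.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each observed k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unambiguous(graph, start):
    """Extend a contig from start while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (next_node,) = graph[node]
        if next_node in visited:  # stop instead of looping around a cycle
            break
        contig += next_node[-1]
        visited.add(next_node)
        node = next_node
    return contig


# Hypothetical overlapping reads drawn from one short transcript.
reads = ["ATGGCGTAC", "GCGTACGTT", "ACGTTAGC"]
graph = build_de_bruijn(reads, k=5)
contig = walk_unambiguous(graph, "ATGG")
```

Where a node has more than one successor the walk stops; in real data such branches correspond to alternative splicing, paralogs, or errors and must be resolved using read support.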

Long-Read Transcript Sequencing

PacBio Iso-Seq and ONT direct RNA-seq capture full-length transcripts without assembly:

Eliminates assembly artifacts and ambiguous isoform reconstruction
Directly reveals alternative splicing, fusion transcripts, poly(A) site usage
Lower throughput; often combined with short-read data for quantification
Tools: IsoQuant, FLAIR, TALON

Functional Annotation

Gene Ontology (GO)

Structured vocabulary organized as a directed acyclic graph (DAG) with three root domains:

Molecular Function: Biochemical activity (e.g., "protein kinase activity")
Biological Process: Higher-level cellular program (e.g., "apoptotic process")
Cellular Component: Subcellular localization (e.g., "mitochondrial matrix")

GO annotations carry evidence codes indicating how the association was derived: EXP (inferred from experiment), IDA (direct assay), IMP (mutant phenotype), IEA (inferred from electronic annotation; not manually reviewed and usually treated as lowest confidence). Enrichment analysis (topGO, clusterProfiler) identifies terms over-represented in a gene set relative to a background set.
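
Over-representation of a single term is commonly tested with a one-sided hypergeometric (Fisher) test. The sketch below shows only that core calculation with invented counts; it is not the procedure of topGO or clusterProfiler, which also handle the DAG structure and multiple-testing correction.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """P(X >= k) when drawing n genes from N, of which K carry the term.

    k: study-set genes annotated with the term
    n: study-set size
    K: background genes annotated with the term
    N: background size
    """
    upper = min(n, K)
    favorable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favorable / comb(N, n)


# Hypothetical counts: 8 of 50 study genes carry a term found in 100 of 10000 genes.
p_value = enrichment_p_value(k=8, n=50, K=100, N=10000)
```

Because many terms are tested at once, raw p-values from this calculation need correction (for example Benjamini-Hochberg) before interpretation.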

KEGG (Kyoto Encyclopedia of Genes and Genomes)

Pathway maps linking genes to metabolic and signaling pathways. KEGG Orthology (KO) numbers provide a hierarchical classification. KEGG Mapper and BlastKOALA assign KO numbers to protein sequences.

InterPro and Domain Annotation

InterProScan searches protein sequences against signatures from more than a dozen InterPro member databases, including:

Pfam: Protein families defined by profile HMMs
PROSITE: Patterns and profiles for functional sites
CDD: Conserved Domain Database (NCBI; searched with RPS-BLAST against position-specific scoring matrices)
SMART: Signaling and extracellular domains
PANTHER: Protein families for evolutionary and functional classification
SUPERFAMILY: Structural domain assignments (SCOP-based)

Other Functional Annotation

eggNOG-mapper: Fast orthology assignment and functional annotation using precomputed eggNOG clusters
SignalP/TargetP: Signal peptide and subcellular targeting prediction
TMHMM/DeepTMHMM: Transmembrane helix prediction
dbCAN: Carbohydrate-active enzyme annotation (CAZymes)
antiSMASH: Biosynthetic gene cluster prediction for secondary metabolites

Non-Coding RNA Annotation

Infernal/Rfam: Covariance models (profile stochastic context-free grammars, SCFGs) for structured RNA families; detects tRNAs, rRNAs, snoRNAs, riboswitches
tRNAscan-SE: Specialized tRNA detection combining covariance models with heuristic filters
miRDeep2: MicroRNA prediction from small RNA-seq data using precursor structure
RNAmmer/barrnap: Ribosomal RNA prediction

Repetitive Element Annotation

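Repeats should be identified and masked (usually soft-masked, i.e., lowercased) before gene prediction so that transposable-element ORFs and low-complexity sequence do not produce spurious gene models: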

RepeatModeler: De novo repeat family identification using RECON and RepeatScout
RepeatMasker: Classifies and masks repeats against RepBase/Dfam libraries
EDTA: Comprehensive de novo TE annotation pipeline
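
The difference between soft and hard masking can be shown directly. This is a minimal sketch assuming repeat coordinates are already available as 0-based, half-open intervals; the function names are hypothetical and are not part of any of the tools above.

```python
def soft_mask(sequence, repeat_intervals):
    """Lowercase bases inside repeats; sequence content is preserved."""
    masked = list(sequence.upper())
    for start, end in repeat_intervals:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = masked[i].lower()
    return "".join(masked)


def hard_mask(sequence, repeat_intervals):
    """Replace bases inside repeats with N; sequence content is lost."""
    masked = list(sequence.upper())
    for start, end in repeat_intervals:
        for i in range(max(start, 0), min(end, len(masked))):
            masked[i] = "N"
    return "".join(masked)


soft = soft_mask("ACGTACGTAC", [(2, 5)])
hard = hard_mask("ACGTACGTAC", [(2, 5)])
```

Soft masking is generally preferred for annotation because aligners and gene finders can still extend through a lowercased region when a genuine gene overlaps a repeat.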

Comparative Genomics

Orthology and Paralogy

Orthologs: Genes in different species diverged by speciation (expected to retain similar function)
Paralogs: Genes diverged by duplication within a species (may neofunctionalize or subfunctionalize)
OrthoFinder: Graph-based ortholog clustering using normalized BLAST scores and MCL; infers orthogroups, gene trees, and rooted species tree
BUSCO: Uses curated single-copy ortholog sets for completeness assessment
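
A much simpler heuristic than OrthoFinder's graph clustering is the reciprocal best hit (RBH) criterion, sketched here to make the idea of pairwise ortholog candidates concrete. The gene names and bit scores are invented, and RBH is a swapped-in technique for illustration: it misses many-to-many relationships that orthogroup methods recover.

```python
def best_hits(scores):
    """Map each query to its highest-scoring subject.

    scores: dict mapping (query, subject) -> alignment bit score.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each gene is the other's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Hypothetical bit scores between species A and species B proteins.
a_vs_b = {("a1", "b1"): 500, ("a1", "b2"): 300, ("a2", "b2"): 450, ("a3", "b2"): 200}
b_vs_a = {("b1", "a1"): 510, ("b2", "a2"): 440, ("b2", "a3"): 190}

rbh_pairs = reciprocal_best_hits(a_vs_b, b_vs_a)
```

Here a3 is excluded because its best hit b2 prefers a2, which is the typical signature of a paralog or a more diverged family member.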

Synteny Analysis

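Synteny analysis examines the conservation of gene content and order between genomes (strictly, synteny means loci sharing a chromosome, while collinearity means conserved order):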

Microsynteny: Local conservation of a few genes; used for ortholog validation
Macrosynteny: Chromosome-scale conservation; reveals ancestral genome organization
Tools: MCScanX (collinear block detection using DAGchainer-like algorithm), GENESPACE (integrates orthology and synteny), SynVisio/SynMap (visualization)

Synteny analysis reveals whole-genome duplications (WGDs), chromosomal rearrangements (inversions, translocations, fusions), and the evolutionary history of genome structure.
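
Collinear block detection can be approximated by chaining ortholog anchors that increase in position in both genomes. The sketch below uses a simple quadratic dynamic program on invented gene-order indices; it is not the MCScanX algorithm, and it ignores gap penalties, inverted blocks, and multiple chromosomes.

```python
def longest_collinear_chain(anchors):
    """Return the longest chain of anchors strictly increasing in both genomes.

    anchors: list of (index_in_genome_A, index_in_genome_B) for ortholog pairs.
    """
    anchors = sorted(anchors)
    if not anchors:
        return []
    best_len = [1] * len(anchors)
    prev = [-1] * len(anchors)
    for i, (a_i, b_i) in enumerate(anchors):
        for j in range(i):
            a_j, b_j = anchors[j]
            if a_j < a_i and b_j < b_i and best_len[j] + 1 > best_len[i]:
                best_len[i] = best_len[j] + 1
                prev[i] = j
    # Trace back from the anchor that ends the longest chain.
    i = max(range(len(anchors)), key=best_len.__getitem__)
    chain = []
    while i != -1:
        chain.append(anchors[i])
        i = prev[i]
    return chain[::-1]


# Hypothetical anchors: four collinear genes plus two that break the order.
anchors = [(1, 10), (2, 11), (3, 50), (4, 12), (5, 13), (6, 2)]
block = longest_collinear_chain(anchors)
```

Anchors left out of the chain, such as (3, 50) and (6, 2) here, are candidates for transposed genes, misassigned orthologs, or members of a different block.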

Whole-Genome Alignment

Progressive Cactus: Reference-free whole-genome aligner constructing a cactus graph representing all rearrangements between multiple genomes. Basis for pangenome construction.

minimap2/AnchorWave: Pairwise genome alignment handling structural variation and WGDs.

Annotation Standards and Pipelines

Community Standards

INSDC feature table: Standardized feature types and qualifiers for GenBank/ENA/DDBJ submissions
GFF3: Standard format for genome annotations; hierarchical parent-child relationships (gene -> mRNA -> exon/CDS)
Annotation Edit Distance (AED): Measures concordance between a predicted gene model and its supporting evidence on a 0 to 1 scale; AED = 0 is perfect agreement and AED = 1 means no evidence support
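
AED is commonly defined at the base level as 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the prediction and SP is the fraction of predicted bases covered by evidence. The sketch below applies that definition to exon intervals; the coordinates are invented, and annotation pipelines may compute it with additional details not shown here.

```python
def covered_bases(intervals):
    """Expand 1-based inclusive (start, end) intervals, as in GFF3, to a base set."""
    bases = set()
    for start, end in intervals:
        bases.update(range(start, end + 1))
    return bases


def annotation_edit_distance(prediction_exons, evidence_exons):
    """Base-level AED between predicted exons and aligned evidence."""
    predicted = covered_bases(prediction_exons)
    evidence = covered_bases(evidence_exons)
    if not predicted or not evidence:
        return 1.0  # nothing to compare counts as no support
    overlap = len(predicted & evidence)
    sensitivity = overlap / len(evidence)
    specificity = overlap / len(predicted)
    return 1.0 - (sensitivity + specificity) / 2


# Hypothetical single-exon model shifted by 50 bp relative to its evidence.
aed_shifted = annotation_edit_distance([(1, 100)], [(51, 150)])
aed_identical = annotation_edit_distance([(1, 100), (201, 300)], [(1, 100), (201, 300)])
aed_disjoint = annotation_edit_distance([(1, 100)], [(500, 600)])
```

Set-based expansion is adequate for single genes; genome-scale use would call for interval arithmetic instead of materializing every base.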

Production Pipelines

NCBI Eukaryotic Genome Annotation Pipeline: Combines Gnomon gene prediction with RefSeq transcript/protein evidence
Ensembl Gene Annotation: Multi-evidence pipeline producing the Ensembl gene sets
Prokka/Bakta: Rapid prokaryotic annotation integrating Prodigal gene prediction with functional databases

Annotation is iterative: initial automated predictions are refined over time as new evidence (RNA-seq, proteomics, manual curation) becomes available. Community annotation platforms such as Apollo (formerly WebApollo) enable collaborative manual curation at scale.