Genome Annotation

Overview

Genome annotation identifies and characterizes functional elements within a genome assembly. Structural annotation locates genes and their components (exons, introns, UTRs, regulatory regions). Functional annotation assigns biological roles to predicted gene products. Annotation quality directly impacts all downstream biological interpretation.

Gene Prediction

Ab Initio Gene Prediction

Ab initio methods predict genes solely from sequence composition using statistical models trained on known genes from the target organism or related species.

Signal sensors detect specific sequence motifs:

  • Splice sites (GT-AG donor-acceptor, GC-AG minor), modeled by position weight matrices or neural networks
  • Translation initiation sites (Kozak consensus: GCC(A/G)CCATGG)
  • Polyadenylation signals (AATAAA and variants)
  • Promoter elements (TATA box, CpG islands, CCAAT box)
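A position weight matrix for a signal sensor can be sketched as follows. This is a minimal illustration, not the model used by any particular gene finder: the five aligned donor sites are invented toy data, the background is assumed uniform, and the names build_pwm, score and best_site are hypothetical.

```python
import math

# Toy aligned donor sites: 3 exonic bases followed by 6 intronic bases.
# These sequences are made up for illustration only.
TRAINING_SITES = [
    "CAGGTAAGT",
    "AAGGTGAGT",
    "CAGGTAAGA",
    "TAGGTAAGG",
    "CAGGTGAGT",
]
# Assumption: uniform background base frequencies.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def build_pwm(sites, pseudocount=1.0):
    """Return one log2-odds dictionary per alignment column."""
    pwm = []
    for i in range(len(sites[0])):
        counts = {base: pseudocount for base in "ACGT"}
        for site in sites:
            counts[site[i]] += 1
        total = sum(counts.values())
        pwm.append(
            {b: math.log2((counts[b] / total) / BACKGROUND[b]) for b in "ACGT"}
        )
    return pwm


def score(pwm, window):
    """Sum the per-column log-odds for a window as wide as the PWM."""
    return sum(column[base] for column, base in zip(pwm, window))


def best_site(pwm, seq):
    """Slide the PWM along seq and return (best_score, start_index)."""
    width = len(pwm)
    return max(
        (score(pwm, seq[i:i + width]), i) for i in range(len(seq) - width + 1)
    )


pwm = build_pwm(TRAINING_SITES)
print(best_site(pwm, "TTTTCAGGTAAGTCCCC"))
```

A real signal sensor would be trained on thousands of annotated sites and would usually apply a score threshold rather than taking only the single best window.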

Content sensors distinguish coding from non-coding regions:

  • Codon usage frequency (3-periodic Markov chains capture codon bias)
  • Hexamer composition differs between coding and intergenic regions
  • GC content variation across isochores
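The idea behind a content sensor can be illustrated with a deliberately simple stand-in: an in-frame codon-frequency log-odds score, rather than a full 3-periodic Markov chain. The training sequence is invented toy data, the null model is assumed to be uniform over the 64 codons, and the function names are hypothetical.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(p) for p in product("ACGT", repeat=3)]


def train_codon_model(cds_list, pseudocount=1.0):
    """Log2-odds of each codon under in-frame usage vs. a uniform null."""
    counts = Counter()
    for cds in cds_list:
        for i in range(0, len(cds) - 2, 3):
            counts[cds[i:i + 3]] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {
        c: math.log2(((counts[c] + pseudocount) / total) / (1 / 64))
        for c in CODONS
    }


def frame_score(model, seq, frame):
    """Sum codon log-odds reading seq in the given frame (0, 1 or 2).

    Assumes seq contains only A, C, G and T.
    """
    return sum(model[seq[i:i + 3]] for i in range(frame, len(seq) - 2, 3))


# Toy training CDS (hypothetical), read in frame 0.
model = train_codon_model(["ATGGCTGCTGAAGCTGCTAAAGCTGAATAA"])
candidate = "GCTGAAGCTAAAGCT"
for frame in range(3):
    print(frame, round(frame_score(model, candidate, frame), 2))
```

The correct reading frame should score higher than the shifted frames because its codons match the trained usage. Production gene finders use higher-order, frame-specific Markov models trained on many genes.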

Gene model integration: Hidden Markov Models (HMMs) combine signal and content sensors into a unified probabilistic framework. States represent exon types (initial, internal, terminal, single), introns, and intergenic regions. The Viterbi algorithm finds the single most probable path through these states, which corresponds to the most probable gene parse of the sequence.
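The Viterbi decoding step can be sketched on a toy two-state HMM. This is far simpler than the generalized HMMs used by real gene finders: there are only "intergenic" and "coding" states, no exon/intron structure, and all probabilities are invented for illustration.

```python
import math

# Toy model: every probability below is an assumption, not a trained value.
STATES = ("intergenic", "coding")
START = {"intergenic": 0.9, "coding": 0.1}
TRANS = {
    "intergenic": {"intergenic": 0.9, "coding": 0.1},
    "coding": {"intergenic": 0.1, "coding": 0.9},
}
EMIT = {
    "intergenic": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for seq (log-space Viterbi)."""
    # score[s] = best log-probability of any path ending in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in s at this position
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (
                score[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        backpointers.append(pointers)
        score = new_score
    # Trace back from the best final state
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


seq = "ATATATATAT" + "GCGCGCGCGC" + "ATATATATAT"
print("".join("C" if s == "coding" else "-" for s in viterbi(seq)))
```

Working in log space avoids numerical underflow on long sequences, which matters once the same recursion is applied to whole chromosomes.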

Key tools:

  • Augustus: Generalized HMM (GHMM) with flexible state architecture; can incorporate hints from external evidence
  • GeneMark-ES/EP+: Self-training HMM that estimates parameters from the genome itself; does not require pre-existing training data
  • SNAP: Simple and fast HMM gene finder
  • GlimmerHMM: Interpolated Markov models with decision trees for splice sites

Homology-Based Gene Prediction

Uses similarity to known proteins or transcripts from other organisms to guide gene structure prediction.

Protein-to-genome alignment:

  • Exonerate: Smith-Waterman-based alignment allowing introns in genomic sequence; splice-aware models
  • GeMoMa: Homology-based using annotated reference gene models; transfers intron-exon structures
  • miniprot: Fast protein-to-genome aligner optimized for large genomes

Transcript-to-genome alignment:

  • BLAT: Rapid genomic alignment using indexed k-mers; splice-aware
  • GMAP: Genomic mapping and alignment of mRNA/EST sequences

Evidence-Based and Hybrid Approaches

MAKER: Iterative annotation pipeline that integrates ab initio predictions, protein homology, and transcript evidence using a weighted evidence combination. Produces annotation edit distance (AED) quality scores. MAKER can iteratively retrain ab initio predictors using its own output.

BRAKER: Combines GeneMark and Augustus with RNA-seq and/or protein evidence for fully automated training and prediction.

GALBA: Uses miniprot protein alignments to train Augustus for annotation of novel genomes.

RNA-seq-Based Annotation

RNA-seq provides direct evidence of expressed genes, alternative splicing, and transcript boundaries.

Read Alignment

STAR (Spliced Transcripts Alignment to a Reference):

  • Two-pass alignment: first pass discovers novel junctions; second pass uses the complete junction database
  • Uses suffix array-based seed finding followed by seed clustering and stitching
  • Handles reads spanning multiple exon junctions
  • Very fast (>100M reads/hour) but memory-intensive (~30 GB for human genome)

HISAT2:

  • Graph-based FM index incorporating known SNPs and splice sites into the reference
  • Hierarchical indexing: global whole-genome index + local indexes for rapid alignment
  • Lower memory footprint (~8 GB for human) than STAR
  • Successor to TopHat2/Bowtie2 pipeline

Transcript Assembly

StringTie:

  • Network flow algorithm on a splice graph: nodes are exons, edges are observed junctions or intron-spanning read pairs
  • Iteratively extracts the heaviest (best-supported) path from the splice graph and uses maximum flow to estimate that isoform's abundance, then removes the assigned reads and repeats
  • Quantifies expression as TPM (transcripts per million) and FPKM
  • Merge mode combines assemblies from multiple samples for unified annotation
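The two expression units mentioned above can be computed from read counts and transcript lengths as in the sketch below. It uses the standard textbook definitions and is not StringTie's internal estimation procedure; the counts and lengths are invented, and real tools typically use effective rather than annotated lengths.

```python
def fpkm(counts, lengths):
    """Fragments per kilobase of transcript per million mapped fragments."""
    total = sum(counts.values())
    return {t: counts[t] * 1e9 / (lengths[t] * total) for t in counts}


def tpm(counts, lengths):
    """Transcripts per million: length-normalize first, then rescale to 1e6."""
    rate = {t: counts[t] / lengths[t] for t in counts}
    scale = sum(rate.values())
    return {t: rate[t] / scale * 1e6 for t in counts}


# Hypothetical fragment counts and transcript lengths (bp).
counts = {"tx1": 1000, "tx2": 1000, "tx3": 2000}
lengths = {"tx1": 1000, "tx2": 2000, "tx3": 2000}

print(tpm(counts, lengths))
print(fpkm(counts, lengths))
```

TPM values always sum to one million within a sample, which makes relative proportions easier to compare across samples than FPKM.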

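Scallop: Splice-graph decomposition that preserves the phasing paths implied by reads spanning multiple exons while seeking to minimize both the number of predicted transcripts and the deviation from observed read coverage.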

Trinity (de novo): For organisms without a reference genome; constructs de Bruijn graphs from RNA-seq reads, partitions into transcript components, and reconstructs full-length transcripts.

Long-Read Transcript Sequencing

PacBio Iso-Seq and ONT direct RNA-seq capture full-length transcripts without assembly:

  • Eliminates assembly artifacts and ambiguous isoform reconstruction
  • Directly reveals alternative splicing, fusion transcripts, poly(A) site usage
  • Lower throughput; often combined with short-read data for quantification
  • Tools: IsoQuant, FLAIR, TALON

Functional Annotation

Gene Ontology (GO)

Structured vocabulary organized as a directed acyclic graph (DAG) with three root domains:

  • Molecular Function: Biochemical activity (e.g., "protein kinase activity")
  • Biological Process: Higher-level cellular program (e.g., "apoptotic process")
  • Cellular Component: Subcellular localization (e.g., "mitochondrial matrix")

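GO annotations carry evidence codes indicating the type of support behind each gene-term association: EXP (inferred from experiment), IDA (inferred from direct assay), IMP (inferred from mutant phenotype), IEA (inferred from electronic annotation; not manually reviewed and generally treated as the lowest confidence). Enrichment analysis (topGO, clusterProfiler) identifies terms that are over-represented in a gene set relative to a background set.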
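The statistic underlying most over-representation analyses is a one-sided hypergeometric test, sketched below. This is a bare illustration under stated assumptions: it ignores the GO graph structure and multiple-testing correction that dedicated enrichment tools handle, and the function name and example numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: genes in the background
    K: background genes annotated with the term
    n: genes in the study set
    k: study-set genes annotated with the term
    """
    upper = min(n, K)
    favourable = sum(
        comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)
    )
    return favourable / comb(N, n)


# Hypothetical example: 40 of 200 study genes carry a term that
# annotates 500 of 20000 background genes.
print(hypergeom_enrichment_p(k=40, n=200, K=500, N=20000))
```

Because thousands of terms are tested at once, p-values from a test like this need multiple-testing correction (for example Benjamini-Hochberg) before interpretation.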

KEGG (Kyoto Encyclopedia of Genes and Genomes)

Pathway maps linking genes to metabolic and signaling pathways. KEGG Orthology (KO) numbers provide a hierarchical classification. KEGG Mapper and BlastKOALA assign KO numbers to protein sequences.

InterPro and Domain Annotation

InterProScan integrates 14+ member databases:

  • Pfam: Protein families defined by profile HMMs
  • PROSITE: Patterns and profiles for functional sites
  • CDD: Conserved Domain Database (NCBI, searched with RPS-BLAST)
  • SMART: Signaling and extracellular domains
  • PANTHER: Protein families for evolutionary and functional classification
  • SUPERFAMILY: Structural domain assignments (SCOP-based)

Other Functional Annotation

  • eggNOG-mapper: Fast orthology assignment and functional annotation using precomputed eggNOG clusters
  • SignalP/TargetP: Signal peptide and subcellular targeting prediction
  • TMHMM/DeepTMHMM: Transmembrane helix prediction
  • dbCAN: Carbohydrate-active enzyme annotation (CAZymes)
  • antiSMASH: Biosynthetic gene cluster prediction for secondary metabolites

Non-Coding RNA Annotation

  • Infernal/Rfam: Covariance models (profile SCFGs) for structured RNA families; detects tRNAs, rRNAs, snoRNAs, riboswitches
  • tRNAscan-SE: Specialized tRNA detection combining covariance models with heuristic filters
  • miRDeep2: MicroRNA prediction from small RNA-seq data using precursor structure
  • RNAmmer/barrnap: Ribosomal RNA prediction

Repetitive Element Annotation

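Repeats should be identified and masked (typically soft-masked, i.e. converted to lowercase) before gene prediction so that transposable-element open reading frames are not predicted as host genes: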

  • RepeatModeler: De novo repeat family identification using RECON and RepeatScout
  • RepeatMasker: Classifies and masks repeats against RepBase/Dfam libraries
  • EDTA: Comprehensive de novo TE annotation pipeline
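Masking itself is a simple transformation once repeat coordinates are known. The sketch below shows soft-masking (lowercasing repeat bases, which many gene predictors can read) and hard-masking (replacing them with N). The mask function and the 0-based half-open interval convention are assumptions made for this example, not the output format of any specific tool.

```python
def mask(seq, intervals, soft=True):
    """Mask repeat intervals in seq.

    intervals: iterable of (start, end), 0-based and half-open (BED-style).
    soft=True lowercases the masked bases; soft=False replaces them with N.
    """
    chars = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = chars[i].lower() if soft else "N"
    return "".join(chars)


print(mask("ACGTACGTAC", [(2, 5)]))              # soft-masked
print(mask("ACGTACGTAC", [(2, 5)], soft=False))  # hard-masked
```

Soft-masking keeps the underlying sequence available, so a gene that partially overlaps a repeat can still be predicted; hard-masking discards that information.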

Comparative Genomics

Orthology and Paralogy

  • Orthologs: Genes in different species diverged by speciation (expected to retain similar function)
  • Paralogs: Genes diverged by duplication within a species (may neofunctionalize or subfunctionalize)
  • OrthoFinder: Graph-based ortholog clustering using normalized BLAST scores and MCL; infers orthogroups, gene trees, and rooted species tree
  • BUSCO: Uses curated single-copy ortholog sets for completeness assessment
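For intuition about pairwise ortholog detection, the classic reciprocal-best-hit (RBH) heuristic is sketched below. This is a much simpler technique than OrthoFinder's score normalization and MCL clustering, and it cannot represent one-to-many relationships created by duplication. The gene names and bit scores are invented.

```python
def best_hits(scores):
    """scores: {(query, subject): bit_score}. Return best subject per query."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (gene_a, gene_b) pairs that are each other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical all-vs-all search results between species A and B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 150, ("a2", "b2"): 180, ("a3", "b2"): 90}
b_vs_a = {("b1", "a1"): 195, ("b2", "a2"): 175, ("b2", "a1"): 140}

print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

Here a3 has a best hit in species B, but that gene's own best hit is a2, so a3 is left unpaired. Cases like this are where graph-based and tree-based methods add value.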

Synteny Analysis

Synteny analysis examines the conservation of gene content and gene order (collinearity) between genomes:

  • Microsynteny: Local conservation of a few genes; used for ortholog validation
  • Macrosynteny: Chromosome-scale conservation; reveals ancestral genome organization
  • Tools: MCScanX (collinear block detection using DAGchainer-like algorithm), GENESPACE (integrates orthology and synteny), SynVisio/SynMap (visualization)
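Collinear block detection can be pictured as chaining ortholog anchors that appear in the same order in both genomes. The sketch below is a simplified dynamic-programming chain with a maximum gap, not MCScanX's actual scoring scheme. It handles only same-orientation blocks, and the anchor coordinates (gene ranks along each chromosome) are invented.

```python
def longest_collinear_chain(anchors, max_gap=5):
    """Return the longest chain of anchors increasing in both genomes.

    anchors: list of (rank_in_A, rank_in_B) ortholog pairs, where rank is the
    gene's position in gene order along a chromosome.
    max_gap: largest allowed rank difference between consecutive anchors.
    """
    anchors = sorted(anchors)
    if not anchors:
        return []
    best_len = [1] * len(anchors)
    prev = [-1] * len(anchors)
    for j, (a_j, b_j) in enumerate(anchors):
        for i in range(j):
            a_i, b_i = anchors[i]
            collinear = 0 < a_j - a_i <= max_gap and 0 < b_j - b_i <= max_gap
            if collinear and best_len[i] + 1 > best_len[j]:
                best_len[j] = best_len[i] + 1
                prev[j] = i
    # Trace back from the anchor that ends the longest chain
    end = max(range(len(anchors)), key=best_len.__getitem__)
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    return chain[::-1]


anchors = [(1, 10), (2, 11), (3, 12), (4, 3), (5, 14), (9, 2)]
print(longest_collinear_chain(anchors))
```

Anchors that break the shared order, such as (4, 3) here, are skipped. Detecting inverted blocks would require running the same chaining with one genome's ranks reversed.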

Synteny analysis reveals whole-genome duplications (WGDs), chromosomal rearrangements (inversions, translocations, fusions), and the evolutionary history of genome structure.

Whole-Genome Alignment

Progressive Cactus: Reference-free whole-genome aligner constructing a cactus graph representing all rearrangements between multiple genomes. Basis for pangenome construction.

minimap2/AnchorWave: Pairwise genome alignment handling structural variation and WGDs.

Annotation Standards and Pipelines

Community Standards

  • INSDC feature table: Standardized feature types and qualifiers for GenBank/ENA/DDBJ submissions
  • GFF3: Standard format for genome annotations; hierarchical parent-child relationships (gene -> mRNA -> exon/CDS)
  • Annotation Edit Distance (AED): Measures concordance between a predicted gene model and its supporting evidence on a scale from 0 to 1; AED = 0 is perfect agreement and AED = 1 indicates no evidence support
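A nucleotide-level AED calculation can be sketched as follows, using the definition commonly given for MAKER: AED = 1 - (SN + SP) / 2, where SN is the fraction of evidence bases covered by the prediction and SP is the fraction of predicted bases supported by evidence. The function names and the example coordinates are hypothetical, and real pipelines aggregate several evidence types before scoring.

```python
def covered_bases(exons):
    """exons: list of (start, end), 1-based inclusive as in GFF3."""
    bases = set()
    for start, end in exons:
        bases.update(range(start, end + 1))
    return bases


def aed(predicted_exons, evidence_exons):
    """Annotation edit distance between a prediction and its evidence."""
    predicted = covered_bases(predicted_exons)
    evidence = covered_bases(evidence_exons)
    if not predicted or not evidence:
        return 1.0  # no evidence (or no prediction): no support at all
    overlap = len(predicted & evidence)
    sensitivity = overlap / len(evidence)    # evidence recovered
    specificity = overlap / len(predicted)   # prediction supported
    return 1.0 - (sensitivity + specificity) / 2.0


print(aed([(1, 100), (201, 300)], [(1, 100), (201, 300)]))
print(aed([(1, 100)], [(51, 150)]))
```

Storing every base in a set keeps the sketch short but is memory-hungry for long genes; an interval-arithmetic version would scale better.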

Production Pipelines

  • NCBI Eukaryotic Genome Annotation Pipeline: Combines Gnomon gene prediction with RefSeq transcript/protein evidence
  • Ensembl Gene Annotation: Multi-evidence pipeline producing the Ensembl gene sets
  • Prokka/Bakta: Rapid prokaryotic annotation integrating Prodigal gene prediction with functional databases

Annotation is iterative: initial automated predictions are refined over time as new evidence (RNA-seq, proteomics, manual curation) becomes available. Web-based curation platforms such as Apollo (formerly WebApollo) enable collaborative manual curation at scale.