Genome Assembly

Sequencing and Assembly Pipeline

Sequencing Technologies

Genome assembly reconstructs contiguous genomic sequences from shorter sequenced fragments. The characteristics of sequencing platforms fundamentally shape assembly algorithms.

Illumina (Short-Read Sequencing)

Sequencing by synthesis on a flow cell. Cluster generation amplifies single molecules via bridge PCR. Fluorescently labeled reversible terminators are incorporated one base at a time, imaged, then cleaved.

  • Read length: 75-300 bp (paired-end: 2x150 most common)
  • Error rate: ~0.1-1% (substitution-dominated, quality decreases toward 3' end)
  • Throughput: Up to 6 Tb per run (NovaSeq 6000)
  • Insert sizes: 300-800 bp (standard), 2-20 kb (mate-pair libraries)
  • Cost: ~$5-10 per Gb

Pacific Biosciences (PacBio)

Single-molecule real-time (SMRT) sequencing in zero-mode waveguides. Polymerase incorporates fluorescent nucleotides observed in real time.

  • CLR (Continuous Long Reads): 10-100 kb, ~10-15% error rate (mostly random insertions and deletions)
  • HiFi (Circular Consensus Sequencing): Multiple passes around a circularized template; 10-25 kb reads that are at least 99% accurate (Q20) by definition and typically around 99.9% (Q30)
  • Throughput: Roughly 15-30 Gb of HiFi data per SMRT Cell on Sequel II/IIe; about 90 Gb per SMRT Cell on Revio
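
The Q-values quoted for read accuracy are Phred scores, defined as Q = -10 * log10(p) where p is the per-base error probability. As a small worked example (the function names are illustrative, not taken from any sequencing toolkit), the conversion in both directions looks like this:

```python
import math


def phred_to_error(q: float) -> float:
    """Convert a Phred quality score Q into a per-base error probability."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Convert a per-base error probability into a Phred quality score."""
    return -10 * math.log10(p)


# Print the accuracy implied by a few common quality thresholds.
for q in (20, 30, 40):
    print(f"Q{q}: accuracy = {1 - phred_to_error(q):.4%}")
```

Each step of 10 in Q is a tenfold drop in error rate, so Q20 corresponds to 99% accuracy and Q30 to 99.9%.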

Oxford Nanopore Technologies (ONT)

Measures ionic current changes as single-stranded DNA translocates through a protein nanopore. Basecalling is performed by neural-network basecallers such as Bonito and Dorado.

  • Read length: Routinely 10-100 kb; ultra-long reads >1 Mb possible
  • Error rate: ~3-5% for older chemistries (systematic homopolymer errors); lower with R10.4 chemistry and duplex sequencing
  • Throughput: 50-100 Gb (PromethION flow cell)
  • Unique features: Direct RNA sequencing, real-time base modification detection, adaptive sampling

Other Technologies

  • Hi-C: Chromatin conformation capture; provides long-range (Mb-scale) proximity information for scaffolding and phasing
  • Linked reads (10x Genomics, discontinued): Short reads that share a barcode when they derive from the same long DNA molecule
  • Optical mapping (Bionano): Restriction enzyme site patterns on long DNA molecules

Assembly Graph Theory

Overlap-Layout-Consensus (OLC)

Classical approach suited for long reads (Sanger, PacBio CLR, Nanopore):

  1. Overlap: Find all pairwise overlaps between reads using all-vs-all comparison (minimap2, DALIGNER). Store as overlap graph where nodes = reads, edges = overlaps.
  2. Layout: Identify a consistent path through the overlap graph that visits each read once (a Hamiltonian path formulation, NP-hard in general). Remove transitive edges and resolve bubbles caused by heterozygosity and sequencing errors.
  3. Consensus: Generate consensus sequence from the layout using multiple alignment or statistical models.

Complexity: The overlap step is O(n^2) in reads; heuristic filtering (minimizers, seed-and-chain) makes this tractable. The layout phase must handle repeat-induced ambiguities.
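
The three OLC steps can be illustrated on error-free toy reads. This is a minimal sketch, not how minimap2 or DALIGNER work: it uses exact suffix-prefix matching in a naive all-vs-all loop (the O(n^2) case mentioned above), a greedy walk as the layout, and simple concatenation instead of a real consensus step. All function names are hypothetical.

```python
from itertools import permutations


def suffix_prefix_overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` equal to a prefix of `b` (0 if < min_len)."""
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def build_overlap_graph(reads, min_len=3):
    """Naive all-vs-all comparison: nodes are reads, edges carry overlap lengths."""
    graph = {}
    for a, b in permutations(reads, 2):
        olen = suffix_prefix_overlap(a, b, min_len)
        if olen:
            graph.setdefault(a, {})[b] = olen
    return graph


def greedy_layout(reads, graph):
    """Walk from a read with no incoming overlap, always taking the longest overlap.

    Assumes such a start read exists (no cycles), which holds for this toy input.
    """
    targets = {b for edges in graph.values() for b in edges}
    current = next(r for r in reads if r not in targets)
    contig, visited = current, {current}
    while current in graph:
        nxt, olen = max(graph[current].items(), key=lambda kv: kv[1])
        if nxt in visited:
            break
        contig += nxt[olen:]  # append only the non-overlapping part
        visited.add(nxt)
        current = nxt
    return contig


reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
graph = build_overlap_graph(reads)
contig = greedy_layout(reads, graph)
print(contig)
```

Real assemblers replace the exact matching with seeded, error-tolerant alignment, and replace the greedy walk with graph simplification, because repeats make the longest overlap misleading.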

De Bruijn Graph Assembly

Dominant approach for short-read assembly:

  1. Decompose reads into all k-mers (subsequences of length k)
  2. Construct a de Bruijn graph: nodes are (k-1)-mers, edges represent k-mers connecting consecutive (k-1)-mers
  3. Find an Eulerian path through the graph (polynomial time, unlike the Hamiltonian path in OLC)
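
The three steps above can be sketched on a toy example. This assumes error-free reads, a single strand (no reverse complements), and a graph that actually contains an Eulerian path. Real assemblers meet none of these conditions and emit unambiguous paths as contigs instead. The function names are illustrative.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Each distinct k-mer becomes an edge from its (k-1)-mer prefix to its suffix."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])
    graph = defaultdict(list)
    for kmer in sorted(kmers):
        graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes the graph has an Eulerian path."""
    out_deg = {u: len(vs) for u, vs in graph.items()}
    in_deg = defaultdict(int)
    for vs in graph.values():
        for v in vs:
            in_deg[v] += 1
    nodes = set(out_deg) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any.
    start = next(
        (n for n in sorted(nodes) if out_deg.get(n, 0) - in_deg.get(n, 0) == 1),
        min(nodes),
    )
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())  # dead end: node is finalized
    return path[::-1]


def spell(path):
    """Reconstruct the sequence spelled by a path of (k-1)-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ATGGCG", "GGCGTA", "CGTACC"]
contig = spell(eulerian_path(build_de_bruijn(reads, k=4)))
print(contig)
```

Hierholzer's algorithm runs in time linear in the number of edges, which is the polynomial-time advantage over the Hamiltonian formulation.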

Key parameters and challenges:

  • k-mer size: Small k increases connectivity but introduces ambiguity at repeats; large k resolves repeats but fragments the graph at low-coverage regions. Typical range: 21-127.
  • Multi-k assembly: SPAdes uses multiple k values and merges resulting graphs
  • Error handling: Erroneous k-mers create tips (dead-end paths) and bubbles; removed by coverage-based filtering
  • Repeat resolution: Repeats longer than k create tangles; paired-end information and coverage depth help resolve them
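
Coverage-based filtering can be illustrated by counting k-mers and keeping only those seen often enough. The sketch below is a simplification: the threshold `min_count` is an arbitrary illustrative value, and real assemblers filter on graph structure (tips and bubbles) as well as raw counts.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def solid_kmers(counts, min_count=2):
    """Keep k-mers observed at least `min_count` times; rarer ones are likely errors."""
    return {kmer for kmer, c in counts.items() if c >= min_count}


# Five error-free copies of a read plus one copy with a single substitution (G -> T).
reads = ["ATGGCGTACC"] * 5 + ["ATGGCTTACC"]
counts = count_kmers(reads, k=4)
solid = solid_kmers(counts, min_count=2)

erroneous = sorted(set(counts) - solid)
print(erroneous)
```

A single substitution creates up to k erroneous k-mers, each at low coverage, which is why such errors show up as low-coverage tips and bubbles in the graph.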

String Graph

A variant of OLC in which reads are nodes, but contained reads and transitive overlaps are removed, yielding a compact graph. Used by assemblers such as SGA and hifiasm. Conceptually cleaner than de Bruijn graphs for long reads, since it avoids k-mer decomposition artifacts.
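
Transitive reduction can be sketched as follows. If read r1 overlaps r2, r2 overlaps r3, and r1 also overlaps r3, the r1-to-r3 edge adds no information and is dropped. This toy version ignores overlap lengths and read orientation, both of which a real string graph construction has to track. The function name is hypothetical.

```python
def remove_transitive_edges(graph):
    """Drop edge a->c whenever a->b and b->c both exist.

    `graph` maps each read to the set of reads it overlaps on its right side.
    """
    reduced = {u: set(vs) for u, vs in graph.items()}
    for a, successors in graph.items():
        for b in successors:
            for c in graph.get(b, ()):
                if c != b and c in successors:
                    reduced[a].discard(c)  # a->c is implied by a->b->c
    return reduced


overlaps = {"r1": {"r2", "r3"}, "r2": {"r3"}, "r3": set()}
string_graph = remove_transitive_edges(overlaps)
print(string_graph)
```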

Assemblers

SPAdes (Short Reads)

Multi-k de Bruijn graph assembler with error correction:

  1. BayesHammer: Read error correction using Bayesian subclustering of k-mer neighborhoods
  2. Graph construction: Build de Bruijn graphs for multiple k values; merge into a unified assembly graph
  3. Paired-end resolution: Use paired reads and coverage to resolve repeats
  4. Mismatch and chimera correction: Post-assembly polishing

Variants: metaSPAdes (metagenomes), rnaSPAdes (transcriptomes), coronaSPAdes (coronavirus genomes).

Hifiasm (HiFi Long Reads)

Produces phased assemblies from PacBio HiFi reads:

  1. All-vs-all overlap using minimizer-based seeding
  2. Haplotype-aware string graph construction: preserves both haplotypes as separate paths
  3. Graph cleaning: Removes errors while retaining genuine heterozygosity
  4. Trio binning (optional): Uses parental short reads to phase haplotypes

Produces near-complete, highly accurate diploid assemblies. Telomere-to-telomere assemblies of many chromosomes are increasingly achievable for human-sized genomes given sufficient HiFi and ultra-long ONT coverage.

Other Notable Assemblers

  • Canu: OLC assembler for noisy long reads; correction + trimming + assembly pipeline
  • Flye: Repeat graph assembly for long reads; handles high-error data well
  • MEGAHIT: Memory-efficient de Bruijn graph assembler for metagenomes using succinct data structures
  • Verkko: Hybrid assembler combining HiFi and ultra-long ONT reads for telomere-to-telomere assembly
  • wtdbg2: Fuzzy de Bruijn graph approach for ultra-fast long-read assembly

Scaffolding and Gap Filling

Scaffolding

Scaffolding orders and orients contigs into larger scaffolds using long-range information:

  • Mate-pair/linked reads: Bridge gaps of known approximate size
  • Hi-C: Chromatin proximity ligation data; SALSA2, 3D-DNA, YaHS use contact frequency (decaying as ~1/distance) to order contigs into chromosome-scale scaffolds
  • Optical maps: Provide restriction site pattern scaffolding (Bionano Solve)
  • Genetic/physical maps: Independent positional information

Gap Filling

Fills N-runs in scaffolds using reads spanning the gap (GapFiller, TGS-GapCloser for long reads).
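
Before gaps can be filled they have to be located. A minimal sketch for finding N-runs in a scaffold sequence, using only the standard library (the function name and the `min_len` parameter are illustrative, not part of GapFiller or TGS-GapCloser):

```python
import re


def find_gaps(scaffold: str, min_len: int = 1):
    """Return (start, end) half-open coordinates of each run of N/n bases."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", scaffold)
        if m.end() - m.start() >= min_len
    ]


scaffold = "ACGTNNNNNACGTTNNACG"
gaps = find_gaps(scaffold)
gap_fraction = sum(end - start for start, end in gaps) / len(scaffold)
print(gaps, f"{gap_fraction:.1%}")
```

The same coordinates give the gap content of a scaffold as the summed gap length divided by the scaffold length.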

Polishing

Correct residual errors in consensus sequences:

  • Short-read polishing: Pilon maps Illumina reads back to assembly; corrects SNPs, small indels
  • Long-read polishing: Medaka (ONT assemblies); for PacBio, DeepConsensus improves the accuracy of HiFi reads themselves before assembly
  • Iterative polishing: Multiple rounds may be needed, especially for ONT assemblies

Assembly Quality Assessment

Contiguity Metrics

  • N50: The contig/scaffold length such that contigs of this length or longer cover 50% of the total assembly. Higher is better. NG50 normalizes to expected genome size.
  • L50: The number of contigs needed to reach 50% of total assembly length
  • Largest contig: Single longest assembled sequence
  • Total span: Sum of all contig lengths; should approximate expected genome size
  • Gap content: Percentage of N bases in scaffolds
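
N50, L50, and NG50 all come from the same cumulative sum over contigs sorted from longest to shortest. A small sketch follows; the function name and the toy contig lengths are made up for illustration.

```python
def nx_lx(lengths, fraction=0.5, genome_size=None):
    """Return (Nx, Lx) for the given fraction.

    With `genome_size` set, the target is a fraction of the expected genome
    size instead of the assembly span, which gives NGx and LGx.
    """
    total = genome_size if genome_size is not None else sum(lengths)
    target = total * fraction
    running = 0
    for count, length in enumerate(sorted(lengths, reverse=True), start=1):
        running += length
        if running >= target:
            return length, count
    return None, None  # assembly too short to reach the target (possible for NGx)


contigs = [100, 80, 60, 40, 20]  # total span 300

n50, l50 = nx_lx(contigs)                      # target 150: 100 + 80 reaches it
ng50, lg50 = nx_lx(contigs, genome_size=400)   # target 200: needs 100 + 80 + 60
print(n50, l50, ng50, lg50)
```

Here N50 is 80 with L50 of 2, while NG50 drops to 60 because the assembly covers only 300 of the expected 400 bases. This is why NG50 is the fairer metric when comparing assemblies of different total span.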

Completeness Metrics

BUSCO (Benchmarking Universal Single-Copy Orthologs): Searches the assembly for conserved single-copy genes expected in the taxonomic lineage. Reports the percentage of genes that are Complete (single-copy plus duplicated), Fragmented, and Missing. A high-quality assembly typically shows >95% complete BUSCOs.

k-mer completeness (Merqury): Compares k-mer spectrum of the assembly to that of raw reads. Measures QV (consensus quality value), completeness, and phasing accuracy without requiring a reference genome.
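
The idea behind k-mer-based evaluation can be sketched with plain Python sets. This is a simplified illustration, not Merqury's implementation: it ignores k-mer multiplicity and reverse complements, and it treats every read k-mer as trustworthy, whereas Merqury restricts completeness to reliable k-mers. The QV formula assumes that an assembly k-mer absent from the reads marks an error, and that the per-base probability of being correct is the k-th root of the shared fraction.

```python
import math


def kmer_set(seqs, k):
    """All distinct k-mers in a collection of sequences (single strand only)."""
    return {s[i:i + k] for s in seqs for i in range(len(s) - k + 1)}


def kmer_completeness(assembly_kmers, read_kmers):
    """Fraction of read k-mers that are present in the assembly."""
    return len(assembly_kmers & read_kmers) / len(read_kmers)


def consensus_qv(assembly_kmers, read_kmers, k):
    """Phred-scaled consensus quality estimated from k-mers missing in the reads."""
    total = len(assembly_kmers)
    shared = len(assembly_kmers & read_kmers)
    if shared == total:
        return math.inf  # no assembly-only k-mers observed
    p_correct = (shared / total) ** (1 / k)
    return -10 * math.log10(1 - p_correct)


k = 4
read_kmers = kmer_set(["ATGGCGTACC"], k)
assembly_kmers = kmer_set(["ATGGCGTACT"], k)  # last base differs from the reads

completeness = kmer_completeness(assembly_kmers, read_kmers)
qv = consensus_qv(assembly_kmers, read_kmers, k)
print(f"completeness = {completeness:.3f}, QV = {qv:.1f}")
```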

Correctness Assessment

  • QUAST: Compares assembly to a reference genome; reports misassemblies, mismatches, indels, NGA50
  • Inspector: Long-read based assembly evaluation detecting structural and base errors
  • Read mapping rate: Fraction of reads that align back to the assembly

Metagenome Assembly

Metagenome assembly reconstructs genomes from mixed microbial communities, introducing unique challenges:

  • Uneven coverage: Dominant species at 1000x, rare species at 0.1x
  • Inter-species repeats: Conserved genes create chimeric assemblies between related organisms
  • Strain variation: Closely related strains create tangles in the assembly graph

Approaches

  • Co-assembly: Pool samples for better coverage; risk of chimeras increases
  • Sample-specific assembly: Reduces chimeras but may fragment low-abundance genomes
  • Binning: Group contigs into metagenome-assembled genomes (MAGs) using composition (tetranucleotide frequency, GC content) and differential coverage across samples (MetaBAT2, MaxBin2, CONCOCT). DAS Tool integrates multiple binning results.
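
The composition signal used in binning can be illustrated by computing a tetranucleotide frequency vector and GC content per contig. This is a minimal sketch with hypothetical function names. Actual binners typically merge each tetramer with its reverse complement and combine the result with coverage across samples, which is omitted here.

```python
from collections import Counter
from itertools import product

# All 256 tetramers over the DNA alphabet, in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(seq: str):
    """Normalized counts of each of the 256 tetramers (ambiguous bases are skipped)."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    total = sum(counts[t] for t in TETRAMERS)
    return [counts[t] / total if total else 0.0 for t in TETRAMERS]


def gc_content(seq: str) -> float:
    """Fraction of G and C bases in the sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


contig = "ACGTACGTAC"
freqs = tetranucleotide_frequencies(contig)
gc = gc_content(contig)
print(len(freqs), gc)
```

Contigs from the same genome tend to have similar vectors, so clustering these features together with per-sample coverage groups contigs into candidate MAGs.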

MAG Quality Standards (MIMAG)

  • High quality: >90% completeness, <5% contamination, presence of 23S, 16S, and 5S rRNA genes and at least 18 tRNAs
  • Medium quality: ≥50% completeness, <10% contamination
  • Assessed by CheckM/CheckM2 using lineage-specific marker gene sets
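
The tiers above translate directly into a small classifier. This sketch assumes completeness and contamination are given as percentages (as reported by CheckM-style tools), uses "at least 18 tRNAs" and "at least 50% completeness" as the boundary conventions, and returns a hypothetical "below medium" label for anything outside the two tiers listed here.

```python
def mimag_quality(completeness: float, contamination: float,
                  has_all_rrnas: bool = False, trna_count: int = 0) -> str:
    """Classify a MAG using the completeness/contamination tiers listed above.

    `has_all_rrnas` means 23S, 16S, and 5S rRNA genes were all detected.
    """
    if (completeness > 90 and contamination < 5
            and has_all_rrnas and trna_count >= 18):
        return "high"
    if completeness >= 50 and contamination < 10:
        return "medium"
    return "below medium"  # illustrative label, not a MIMAG term


print(mimag_quality(96.2, 1.3, has_all_rrnas=True, trna_count=20))
print(mimag_quality(96.2, 1.3, has_all_rrnas=False, trna_count=20))
print(mimag_quality(42.0, 3.0))
```

The second call shows a common outcome for short-read MAGs: completeness and contamination pass the high-quality thresholds, but missing rRNA genes keep the bin at medium quality.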

Current State and Challenges

The T2T (Telomere-to-Telomere) Consortium published the first gap-free human genome assembly (T2T-CHM13) in 2022, resolving centromeres, segmental duplications, and rDNA arrays. Key remaining challenges include polyploid genome assembly (e.g., hexaploid wheat), highly repetitive genomes, population-scale pangenome construction (e.g., with Minigraph-Cactus), and real-time assembly for clinical and field applications.