7 min read
On this page

Genomic Data Analysis

Variant Calling

Variant calling identifies differences between a sequenced genome and a reference genome, encompassing single nucleotide variants (SNVs), small insertions/deletions (indels), and structural variants (SVs).

GATK (Genome Analysis Toolkit) Best Practices

The standard pipeline for germline short variant discovery:

  1. Read preprocessing: Map reads with BWA-MEM2; sort and mark duplicates (Picard/sambamba). Duplicates arise from PCR amplification and must be flagged to avoid inflated confidence.
  2. Base Quality Score Recalibration (BQSR): Machine learning recalibration of base qualities using known variant sites (dbSNP) as truth. Models error as a function of read group, quality score, machine cycle, and dinucleotide context.
  3. HaplotypeCaller: Local de novo assembly of haplotypes in active regions. Constructs a de Bruijn graph from reads, extracts candidate haplotypes, realigns reads to haplotypes via pair-HMM, and genotypes using Bayesian model. Outputs per-sample GVCFs.
  4. Joint genotyping (GenomicsDBImport + GenotypeGVCFs): Combine GVCFs across samples for population-level calling. Joint calling leverages population-level information for rare variant detection.
  5. Variant Quality Score Recalibration (VQSR): Gaussian mixture model trained on known true variants (HapMap, 1000G) and false positives. Annotations used: QualByDepth, StrandOddsRatio, FisherStrand, MappingQuality, ReadPosRankSum. Assigns a VQSLOD score and sensitivity tranche.

DeepVariant

CNN-based variant caller from Google. Encodes read pileups as multi-channel images (read bases, quality scores, mapping quality, strand, etc.) and classifies each candidate site as homozygous reference, heterozygous, or homozygous variant. Achieves higher accuracy than GATK for Illumina and PacBio HiFi data. PEPPER-Margin-DeepVariant extends to ONT long reads.

Structural Variant Calling

SVs (>50 bp): deletions, duplications, inversions, translocations, insertions.

  • Short-read: Delly, Manta, LUMPY (use split reads, discordant pairs, read depth signals)
  • Long-read: Sniffles2, cuteSV, SVIM (use split alignments and within-read signatures)
  • Ensemble approaches: SURVIVOR merges calls from multiple callers; reduces false positives

Genome-Wide Association Studies (GWAS)

GWAS tests associations between genetic variants and phenotypic traits across large populations.

Statistical Framework

For each variant, test the null hypothesis of no association. Linear model for quantitative traits: y = Xbeta + Ggamma + epsilon, where y is the phenotype vector, X contains covariates (age, sex, principal components), G is the genotype vector (0/1/2 for additive model), and gamma is the effect size.

For binary traits (case/control): logistic regression. Test statistic: Wald test, score test, or likelihood ratio test. P-values must survive Bonferroni correction for ~1M independent tests (genome-wide significance threshold: p < 5 x 10^-8).

Key Considerations

  • Population stratification: Confounding by ancestry differences between cases and controls. Corrected using genomic principal components (top 10-20 PCs as covariates) or linear mixed models (BOLT-LMM, SAIGE)
  • Linkage disequilibrium (LD): Non-random association of alleles at nearby loci. Causal variant is often not the most significant SNP; fine-mapping (FINEMAP, SuSiE, CAVIAR) identifies credible causal sets
  • Imputation: Infer ungenotyped variants using LD patterns from reference panels (1000 Genomes, TOPMed). MINIMAC4, IMPUTE5. Dramatically increases variant density.
  • Manhattan plot: -log10(p-value) vs. genomic position; peaks indicate associated loci
  • QQ plot: Expected vs. observed p-value distribution; genomic inflation factor lambda should be ~1.0

Post-GWAS Analysis

  • LD score regression: Distinguishes polygenicity from population stratification; estimates heritability (h^2_SNP)
  • Mendelian randomization: Uses genetic variants as instrumental variables to infer causal relationships between exposures and outcomes
  • Colocalization (coloc, eCAVIAR): Tests whether GWAS and eQTL signals share a causal variant
  • Pathway/gene-set analysis: MAGMA, FUMA aggregate SNP-level signals to gene/pathway level

Population Genetics

Hardy-Weinberg Equilibrium (HWE)

In an idealized population (infinite size, random mating, no selection/mutation/migration), allele and genotype frequencies remain constant. For biallelic locus with allele frequencies p and q = 1-p:

Expected genotype frequencies: p^2 (AA), 2pq (Aa), q^2 (aa)

HWE testing (chi-squared or exact test) identifies genotyping errors, population stratification, or loci under selection. Variants grossly deviating from HWE in controls are typically filtered from GWAS.

Population Differentiation (FST)

Wright's fixation index measures allele frequency differences between populations:

FST = (H_T - H_S) / H_T

where H_T is expected heterozygosity in the total population and H_S is the average within subpopulations. FST = 0 indicates no differentiation; FST = 1 indicates complete fixation of different alleles. Weir-Cockerham estimator is the standard for multi-locus FST.

Principal Component Analysis (PCA)

PCA on the genotype matrix (individuals x variants) reveals population structure. Leading PCs typically correspond to major ancestry axes. Computed efficiently using randomized SVD (PLINK2, FlashPCA2) on LD-pruned variants. Projection of new samples onto pre-computed PCs enables ancestry assignment.

Admixture Analysis

Model-based clustering (ADMIXTURE, STRUCTURE): Assumes K ancestral populations; estimates per-individual ancestry proportions (Q matrix) and per-population allele frequencies (P matrix) via maximum likelihood (ADMIXTURE) or MCMC (STRUCTURE). Cross-validation error determines optimal K.

Selection Scans

  • iHS (integrated haplotype score): Detects incomplete selective sweeps from extended haplotype homozygosity
  • Tajima's D: Compares nucleotide diversity estimators; negative values suggest positive/purifying selection or population expansion
  • PBS (Population Branch Statistic): Identifies lineage-specific selection using three-population FST comparisons

RNA-seq Analysis

Quantification

  • Alignment-based: STAR/HISAT2 alignment + featureCounts/HTSeq for gene-level counts
  • Pseudo-alignment: Salmon, kallisto map reads to transcriptome without full alignment; EM algorithm for isoform quantification. Orders of magnitude faster.

Differential Expression

DESeq2:

  • Models raw counts with a negative binomial distribution: K_ij ~ NB(mu_ij, alpha_i)
  • Dispersion estimation: gene-wise estimates shrunk toward a fitted trend (empirical Bayes)
  • Size factors for normalization (median-of-ratios method)
  • Wald test or likelihood ratio test for differential expression
  • Independent hypothesis weighting and Benjamini-Hochberg FDR correction

edgeR:

  • Also uses negative binomial model
  • Trimmed mean of M-values (TMM) normalization
  • Empirical Bayes moderation of dispersions (tagwise toward common/trended)
  • Exact test (pairwise) or quasi-likelihood F-test (complex designs)
  • glmQLFit provides robust inference for small sample sizes

limma-voom: Transforms count data using the voom method (variance modeling at the observational level), then applies the limma linear modeling framework with empirical Bayes moderation. Particularly effective for complex experimental designs.

Downstream Analysis

  • Gene set enrichment: GSEA (ranked list), over-representation analysis (hypergeometric test), fgsea (fast GSEA)
  • Pathway analysis: clusterProfiler, ReactomePA, DAVID
  • Weighted gene co-expression network analysis (WGCNA): Identifies modules of co-expressed genes; relates modules to traits

Single-Cell RNA-seq (scRNA-seq)

Experimental Platforms

  • Droplet-based (10x Genomics Chromium): 3' or 5' capture; thousands to hundreds of thousands of cells; shallow sequencing per cell (~1000-5000 genes detected per cell)
  • Plate-based (Smart-seq2/3): Full-length transcripts; hundreds of cells; deeper per-cell coverage
  • Spatial transcriptomics: 10x Visium, MERFISH, Slide-seq; gene expression with spatial context

Seurat (R)

Standard R toolkit for scRNA-seq analysis:

  1. QC filtering: Remove low-quality cells (low gene count, high mitochondrial %) and doublets (DoubletFinder, Scrublet)
  2. Normalization: LogNormalize (library size scaling + log), SCTransform (regularized negative binomial regression; stabilizes variance)
  3. Feature selection: Highly variable genes (HVGs) using variance-stabilizing transformation
  4. Dimensionality reduction: PCA on HVGs; select significant PCs (elbow plot, JackStraw)
  5. Batch correction/integration: CCA + MNN anchors (Seurat v3+); Harmony; scVI
  6. Clustering: Shared nearest-neighbor (SNN) graph construction + Louvain/Leiden community detection
  7. Visualization: UMAP (preferred) or t-SNE for 2D embedding (non-linear dimensionality reduction; for visualization only, not for clustering)
  8. Marker gene identification: Wilcoxon rank-sum, MAST, or logistic regression per cluster
  9. Cell type annotation: Manual (canonical markers), automated (SingleR, scType, CellTypist)

Scanpy (Python)

Python equivalent to Seurat with analogous workflow. Built on AnnData objects (observations x variables with layers and annotations). Integrates with scvi-tools for probabilistic models.

Advanced scRNA-seq Analyses

  • Trajectory/pseudotime: Monocle3 (principal graph), RNA velocity (La Manno/scVelo: spliced/unspliced ratio indicates future state), CellRank (combines RNA velocity with transition probabilities)
  • Cell-cell communication: CellChat, CellPhoneDB (ligand-receptor interaction inference)
  • Multi-modal: CITE-seq (RNA + surface proteins), ATAC-seq + RNA (multiome), spatial + scRNA integration

Polygenic Risk Scores (PRS)

PRS aggregate the effects of many genetic variants into a single score predicting disease risk or trait value:

PRS_i = sum_j beta_j * G_ij

where beta_j is the effect size of variant j (from GWAS summary statistics) and G_ij is the genotype of individual i at variant j.

Construction Methods

  • Clumping + thresholding (C+T): LD-clump variants, select by p-value threshold; simple but suboptimal
  • LDpred2: Bayesian shrinkage of effect sizes using an LD reference panel; models polygenicity via spike-and-slab prior
  • PRS-CS: Continuous shrinkage prior with global-local Bayesian regression
  • SBayesR: Mixture of normal distributions prior; scalable

Evaluation and Challenges

  • Incremental R^2 or AUC: PRS performance measured by variance explained beyond covariates
  • Portability: PRS trained on European ancestry GWAS perform poorly in non-European populations due to LD pattern differences and allele frequency variation
  • Clinical utility: Must demonstrate meaningful risk stratification; ethical considerations around genetic determinism and health disparities