Protein Structure Prediction and Analysis

Structural Hierarchy

Primary Structure

The linear amino acid sequence, determined by the gene sequence; the starting point for all structural prediction. The twenty standard amino acids have diverse physicochemical properties. One common grouping: hydrophobic (A, V, L, I, M, F, W, P), polar (S, T, N, Q, Y, C), positively charged (K, R, H), negatively charged (D, E), and glycine (conformationally flexible; its side chain is a single hydrogen atom).
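The grouping above can be encoded as a lookup table and used to summarize a sequence. This is a minimal sketch: the name class_composition and the example peptide are hypothetical, and the table simply mirrors the grouping listed above (other references place Y, C, H, or P differently).

```python
from collections import Counter

# Grouping copied from the text above; other schemes classify some residues differently.
CLASSES = {
    "hydrophobic": "AVLIMFWP",
    "polar": "STNQYC",
    "positive": "KRH",
    "negative": "DE",
    "glycine": "G",
}
AA_TO_CLASS = {aa: name for name, letters in CLASSES.items() for aa in letters}


def class_composition(seq):
    """Return the fraction of residues falling in each physicochemical class."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    # Raises KeyError on letters outside the 20 standard amino acids.
    counts = Counter(AA_TO_CLASS[aa] for aa in seq)
    return {name: counts.get(name, 0) / len(seq) for name in CLASSES}


if __name__ == "__main__":
    print(class_composition("MKTAYIAKQRGDE"))  # hypothetical peptide
```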

Secondary Structure

Local backbone conformations stabilized by hydrogen bonds between backbone amide (N-H) and carbonyl (C=O) groups:

Alpha-helix: 3.6 residues per turn, hydrogen bond between residue i and i+4. Right-handed. Common in membrane-spanning regions.
Beta-sheet: Extended strands connected by hydrogen bonds. Parallel or antiparallel orientation. Forms the core of many protein domains.
Turns and loops: Connect regular secondary structures. Beta-turns involve 4 residues. Loops are variable in conformation and often functionally important (binding sites, active sites).
Coil: Irregular structure without repeating pattern

Ramachandran plot: Maps backbone dihedral angles (phi, psi) for each residue. Most residues cluster in allowed regions corresponding to alpha-helix, beta-sheet, and left-handed helix conformations. Glycine has expanded allowed regions; proline is restricted.
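As a rough illustration of what a Ramachandran plot is built from, the sketch below computes a signed dihedral angle from four points using the usual atan2 formulation. Phi is the dihedral C(i-1), N(i), CA(i), C(i) and psi is N(i), CA(i), C(i), N(i+1). The function names are hypothetical, and coordinates are assumed to be (x, y, z) tuples in a right-handed frame.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in degrees about the p1-p2 bond."""
    b1 = _sub(p1, p0)
    b2 = _sub(p2, p1)
    b3 = _sub(p3, p2)
    n1 = _cross(b1, b2)  # normal of the plane through p0, p1, p2
    n2 = _cross(b2, b3)  # normal of the plane through p1, p2, p3
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


def phi_psi(c_prev, n, ca, c, n_next):
    """Backbone (phi, psi) for one residue from its five defining atoms."""
    return dihedral(c_prev, n, ca, c), dihedral(n, ca, c, n_next)
```

Plotting the resulting (phi, psi) pairs for every residue of a structure gives the Ramachandran plot.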

Tertiary Structure

The three-dimensional fold of a single polypeptide chain, determined by:

Hydrophobic effect (dominant driving force for folding)
Hydrogen bonds, salt bridges, van der Waals interactions
Disulfide bonds (covalent, stabilizing in oxidizing environments)

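Levinthal's paradox: A protein of 100 residues has ~10^47 possible conformations (assuming about 3 backbone states per residue, 3^100 is roughly 5 x 10^47), yet folds in milliseconds to seconds. Folding therefore cannot be an exhaustive search; it follows a funnel-shaped energy landscape that biases the chain toward the native state.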
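The arithmetic behind the paradox is easy to reproduce. The sketch below assumes 3 conformational states per residue and a sampling rate of 10^13 conformations per second; both numbers are illustrative assumptions commonly used for this back-of-the-envelope estimate, not measured values.

```python
import math

n_residues = 100
states_per_residue = 3   # assumption: about 3 backbone states per residue
sampling_rate = 1e13     # assumption: conformations sampled per second

n_conformations = states_per_residue ** n_residues
seconds = n_conformations / sampling_rate
years = seconds / (365.25 * 24 * 3600)

print(f"conformations: ~10^{math.log10(n_conformations):.1f}")
print(f"exhaustive search time: ~10^{math.log10(years):.1f} years")
```

Even at this optimistic sampling rate, exhaustive enumeration would take many orders of magnitude longer than the age of the universe.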

Quaternary Structure

Assembly of multiple polypeptide chains (subunits). Homo-oligomers (identical subunits) and hetero-oligomers. Governed by the same forces as tertiary structure plus specific interface interactions.

Experimental Structure Determination

X-ray crystallography: Grow protein crystal, diffract X-rays, solve phase problem. Resolution typically 1.5-3.0 Angstroms. Represents a static, crystal-packing-influenced structure.
Cryo-electron microscopy (cryo-EM): Flash-freeze sample, image with electron beam, single-particle reconstruction. Resolution revolution: now routinely <3 Angstroms for large complexes. No crystallization required.
NMR spectroscopy: Solution-state structure via inter-atomic distance restraints (NOEs). Limited to small-medium proteins (<40 kDa typically). Provides dynamics information.

Secondary Structure Prediction

Three-state prediction (helix H, strand E, coil C). Modern methods achieve ~85% per-residue accuracy (Q3).

PSIPRED: Two-stage neural network using PSI-BLAST PSSM features; robust and widely used
DSSP: Assigns secondary structure to experimentally determined structures using hydrogen bond geometry (the ground truth for training predictors)
NetSurfP-3.0: Protein language model-based prediction of secondary structure, solvent accessibility, and disorder
JPred4: Secondary structure prediction with the JNet neural network, using alignment profiles built from the query sequence
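Q3 is simply the fraction of residues whose predicted three-state label matches the DSSP-derived label. The sketch below assumes one common reduction of the eight DSSP states to three (H, G, I to helix; E, B to strand; everything else to coil); other reductions exist and shift reported accuracies slightly. Function names are hypothetical.

```python
# One common 8-state to 3-state reduction (an assumption; alternatives exist).
DSSP_TO_3STATE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def reduce_dssp(dssp_string):
    """Map a DSSP 8-state string to H/E/C; unlisted states become coil."""
    return "".join(DSSP_TO_3STATE.get(state, "C") for state in dssp_string)


def q3(predicted, observed):
    """Per-residue three-state accuracy between two equal-length strings."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)


if __name__ == "__main__":
    observed = reduce_dssp("HHHHTTEEEE-S")   # hypothetical DSSP assignment
    print(q3("HHHHCCEEEECC", observed))
```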

Homology Modeling (Comparative Modeling)

When a homolog with known structure exists (template), build a model by:

Template identification: BLAST, HHpred (profile-profile), HMMER against PDB
Target-template alignment: Critical step; errors propagate to the model. Profile-profile (HHalign) or structure-based (if multiple templates).
Model building: Copy template backbone; rebuild loops and side chains
Refinement: Energy minimization, molecular dynamics relaxation
Evaluation: DOPE score, Ramachandran analysis, ProSA, MolProbity

MODELLER

Standard tool for homology modeling. Uses spatial restraints derived from the alignment and template structure:

Distance restraints from aligned residue pairs
Dihedral angle restraints from template backbone
Statistical potential restraints from known structures
Restraints satisfied by conjugate gradient optimization followed by molecular dynamics with simulated annealing

Accuracy: Reliable above ~30% sequence identity (safe homology zone). In the 20-30% twilight zone, alignment quality is the primary bottleneck. Below 20%, template detection becomes unreliable.
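The identity thresholds above can be applied directly to a pairwise alignment. This is a minimal sketch: the function names are hypothetical, and percent identity is computed over columns where both sequences have a residue, which is only one of several denominators in use.

```python
def percent_identity(aligned_target, aligned_template):
    """Percent identity over columns where neither sequence has a gap."""
    if len(aligned_target) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    pairs = [
        (a, b)
        for a, b in zip(aligned_target, aligned_template)
        if a != "-" and b != "-"
    ]
    if not pairs:
        raise ValueError("no aligned residue pairs")
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)


def modeling_zone(identity):
    """Classify an identity value using the thresholds quoted in the text."""
    if identity >= 30.0:
        return "safe homology zone"
    if identity >= 20.0:
        return "twilight zone"
    return "below twilight zone"
```

For example, modeling_zone(percent_identity(target_row, template_row)) gives a quick first indication of how much care the alignment step will need.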

Rosetta

A comprehensive macromolecular modeling suite using Monte Carlo sampling with an energy function combining physics-based and knowledge-based terms.

Energy Function (REF2015)

Components: van der Waals (Lennard-Jones), hydrogen bonding (orientation-dependent), electrostatics (Coulomb with distance-dependent dielectric), solvation (implicit solvent, Lazaridis-Karplus), backbone torsion preferences (Ramachandran), rotamer probabilities (Dunbrack library), reference energies per amino acid type.

Ab Initio Structure Prediction (pre-AlphaFold)

Fragment assembly approach:

Generate 3-mer and 9-mer fragment libraries from known structures matching sequence profile
Monte Carlo assembly: iteratively replace backbone fragments, evaluate energy
Low-resolution (centroid) stage for global fold search, then full-atom refinement
Cluster decoys; select cluster centers as predictions

Effective for small proteins (<150 residues) with simple topologies. Limited by sampling for larger proteins.
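The fragment-insertion loop can be sketched in a few lines. This is a toy illustration of the Metropolis Monte Carlo idea only, not Rosetta's implementation: the backbone is reduced to a list of (phi, psi) pairs, the fragment library and energy function are supplied by the caller, and all names are hypothetical.

```python
import math
import random


def fragment_assembly(torsions, fragment_library, energy,
                      n_steps=1000, kT=1.0, seed=0):
    """Toy Monte Carlo fragment insertion over a list of (phi, psi) pairs.

    fragment_library maps a start index to a list of candidate fragments,
    each a list of (phi, psi) pairs that fits inside the chain.
    Returns the lowest-energy conformation seen and its energy.
    """
    rng = random.Random(seed)
    current = list(torsions)
    e_current = energy(current)
    best, e_best = list(current), e_current
    starts = sorted(fragment_library)
    for _ in range(n_steps):
        # Pick a window and a fragment, then splice the fragment in.
        start = rng.choice(starts)
        fragment = rng.choice(fragment_library[start])
        trial = list(current)
        trial[start:start + len(fragment)] = fragment
        e_trial = energy(trial)
        delta = e_trial - e_current
        # Metropolis criterion: always accept downhill, sometimes uphill.
        if delta <= 0 or rng.random() < math.exp(-delta / kT):
            current, e_current = trial, e_trial
            if e_current < e_best:
                best, e_best = list(current), e_current
    return best, e_best
```

In a real protocol the energy would be a centroid-level score, the temperature would typically be adjusted during the run, and many independent trajectories would be clustered afterwards.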

Other Rosetta Applications

RosettaDock: Protein-protein docking with local and global search
RosettaDesign: Fixed-backbone and flexible-backbone protein design
RoseTTAFold: Deep learning structure prediction (three-track architecture)
Rosetta loop modeling: KIC (kinematic closure) for loop conformation sampling

AlphaFold

AlphaFold2 Architecture

DeepMind's AlphaFold2 (2020) achieved near-experimental accuracy in CASP14, fundamentally transforming structural biology.

Input Processing:

MSA construction: JackHMMER against UniRef90 and MGnify, HHblits against BFD and Uniclust30; iterative searches build deep alignments
Template search: HHsearch against PDB70
MSA representation: N_seq x N_res x c_m tensor (initialized from clustered MSA features)
Pair representation: N_res x N_res x c_z tensor (initialized from relative positional encoding and template features)

Evoformer Module (48 blocks):

MSA row-wise attention: Within each sequence (row), residue positions attend to one another; the attention logits are biased by the pair representation
MSA column-wise attention: Within each alignment column, sequences attend to one another (captures coevolutionary patterns)
Pair representation update: Outer product mean from MSA representation; triangular multiplicative updates (modeling that if residues i-j and j-k are close, i-k likely is too); triangular self-attention
Transition blocks: Feed-forward layers between attention operations
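The core contraction of the triangular multiplicative update can be sketched with plain nested lists. This shows only the sum over the third residue k for a single scalar channel; the real layer also applies layer normalization, learned linear projections, and gating, all of which are omitted here. Function names are hypothetical.

```python
def triangle_update_outgoing(a, b):
    """Outgoing-edge form: out[i][j] = sum_k a[i][k] * b[j][k]."""
    n = len(a)
    return [
        [sum(a[i][k] * b[j][k] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def triangle_update_incoming(a, b):
    """Incoming-edge form: out[i][j] = sum_k a[k][i] * b[k][j]."""
    n = len(a)
    return [
        [sum(a[k][i] * b[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]
```

Each updated edge (i, j) therefore aggregates evidence from the two other edges of every triangle (i, j, k), which is how the pair representation is pushed toward geometric consistency.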

Structure Module:

Operates on single representation + pair representation
Invariant Point Attention (IPA): Attention mechanism operating in 3D coordinate frames attached to each residue; invariant to global rotations and translations
Iteratively updates backbone frames (rotation + translation per residue) and side chain torsion angles
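Applied for 8 iterations with shared weights; the full network (Evoformer plus structure module) is additionally recycled, 3 times by default, feeding its outputs back in as inputs to refine predictions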

Loss Function:

FAPE (Frame Aligned Point Error): Primary structural loss comparing predicted and true atom positions after aligning them in each residue's local frame
Auxiliary losses: distogram prediction, masked MSA prediction, pTM (predicted TM-score) head, experimentally resolved head
Confidence: pLDDT (predicted Local Distance Difference Test) per residue; PAE (Predicted Aligned Error) per residue pair
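AlphaFold model files in PDB format store pLDDT in the B-factor column, so per-residue confidence can be read with fixed-column parsing. The sketch below assumes standard PDB ATOM record columns and bins scores with the commonly quoted bands (>90, 70-90, 50-70, <50); the function names are hypothetical.

```python
def plddt_band(score):
    """Bin a pLDDT value into the commonly quoted confidence bands."""
    if score > 90:
        return "very high"
    if score >= 70:
        return "confident"
    if score >= 50:
        return "low"
    return "very low"


def per_residue_plddt(pdb_lines):
    """Map (chain ID, residue number) to pLDDT, read from CA atoms."""
    scores = {}
    for line in pdb_lines:
        # Fixed PDB columns: atom name 13-16, chain 22, resSeq 23-26, B-factor 61-66.
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            key = (line[21], int(line[22:26]))
            scores[key] = float(line[60:66])
    return scores


def fraction_confident(scores, threshold=70.0):
    """Fraction of residues with pLDDT at or above the threshold."""
    if not scores:
        raise ValueError("no residues found")
    return sum(s >= threshold for s in scores.values()) / len(scores)
```

A typical use is to pass the lines of a downloaded model file to per_residue_plddt and mask out residues in the "low" and "very low" bands before downstream analysis.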

pLDDT Interpretation

pLDDT Range	Interpretation
>90	Very high confidence; backbone and side chains reliable
70-90	Confident; backbone reliable, some side chain uncertainty
50-70	Low confidence; may indicate genuine disorder or modeling difficulty
<50	Very low; likely disordered or poorly predicted

AlphaFold3

Released 2024. Key advances:

Unified architecture for proteins, nucleic acids, small molecules, ions, post-translational modifications
Diffusion-based structure generation (replaces explicit frame updates)
Pairformer replaces Evoformer (operates on pair and single representations; MSA processing is reduced to a small separate module)
Cross-attention for ligand and nucleic acid features
Improved multimer prediction

AlphaFold Database

More than 200 million predicted structures covering most of UniProt. Accessible via UniProt integration and a programmatic API. Per-residue pLDDT and PAE are provided with each model and should guide decisions about which regions are usable.

Protein-Protein Docking

Predicts the structure of protein complexes from individual component structures.

Computational Approaches

Rigid-body docking: FFT-based shape complementarity search (ZDOCK, ClusPro). Generates thousands of poses; scoring functions rank by shape, electrostatics, desolvation.
Flexible docking: HADDOCK (data-driven, uses experimental restraints from NMR, cross-linking, mutagenesis); RosettaDock (Monte Carlo with backbone/side chain flexibility)
AlphaFold-Multimer: Co-folding of protein complexes using paired MSAs; state-of-the-art for many interaction types

Scoring and Evaluation

CAPRI criteria: Acceptable/medium/high quality based on interface RMSD (i-RMSD), ligand RMSD (L-RMSD), and fraction of native contacts (fnat)
DockQ: Single quality score combining i-RMSD, L-RMSD, and fnat
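DockQ averages fnat with two scaled RMSD terms, each of the form 1 / (1 + (RMSD / d)^2), using d = 1.5 Angstroms for i-RMSD and d = 8.5 Angstroms for L-RMSD. The sketch below follows that published definition and the usual cutoffs (0.23, 0.49, 0.80) for mapping scores to CAPRI-like classes; the function names are hypothetical, and the three input quantities are assumed to have been computed elsewhere.

```python
def dockq(fnat, irmsd, lrmsd):
    """DockQ score from fnat (0-1), interface RMSD and ligand RMSD (Angstroms)."""

    def scaled(rmsd, d):
        # Maps an RMSD to (0, 1]; equals 0.5 when rmsd == d.
        return 1.0 / (1.0 + (rmsd / d) ** 2)

    return (fnat + scaled(irmsd, 1.5) + scaled(lrmsd, 8.5)) / 3.0


def dockq_class(score):
    """Map a DockQ score to a CAPRI-like quality class."""
    if score >= 0.80:
        return "high"
    if score >= 0.49:
        return "medium"
    if score >= 0.23:
        return "acceptable"
    return "incorrect"


if __name__ == "__main__":
    s = dockq(fnat=0.62, irmsd=1.8, lrmsd=5.0)  # hypothetical model
    print(round(s, 3), dockq_class(s))
```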

Molecular Dynamics (MD) Simulations

Simulates the time evolution of a molecular system by numerically integrating Newton's equations of motion.
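A minimal sketch of the numerical integration, using the velocity Verlet scheme on a one-dimensional particle. Production MD codes use this or closely related integrators (for example leapfrog) on full 3D systems; the function name and the harmonic test force here are illustrative assumptions.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equations for one particle in 1D.

    force is a callable returning the force at position x.
    Returns the trajectory as a list of (position, velocity) tuples.
    """
    f = force(x)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((x, v))
    return trajectory


if __name__ == "__main__":
    # Harmonic oscillator with k = m = 1: total energy should stay near 0.5.
    traj = velocity_verlet(1.0, 0.0, lambda x: -x, 1.0, 0.01, 1000)
    x, v = traj[-1]
    print(0.5 * v * v + 0.5 * x * x)
```

The scheme is time-reversible and conserves energy well over long runs, which is why integrators of this family are preferred over simple Euler steps.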

Force Fields

Parameterized energy functions for bonded (bonds, angles, dihedrals) and non-bonded (van der Waals, electrostatic) interactions:

AMBER (ff19SB): Widely used for proteins; OPC water model
CHARMM (C36m): Comprehensive parameter sets for proteins, lipids, nucleic acids, small molecules
OPLS-AA: Optimized for liquid-state properties
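The functional forms shared by these force fields are simple; the parameter sets are what differ. The sketch below shows a harmonic bond term and the two non-bonded pair terms. Units are assumed to be kcal/mol, Angstroms, and elementary charges; the Coulomb constant is approximate and differs slightly between packages, and some packages include a factor of 1/2 in the harmonic term. All names and example parameter values are hypothetical.

```python
COULOMB_CONST = 332.06  # kcal*Angstrom/(mol*e^2), approximate


def harmonic_bond(r, k, r0):
    """Bond stretching energy k * (r - r0)^2 (no 1/2 prefactor here)."""
    return k * (r - r0) ** 2


def lennard_jones(r, epsilon, sigma):
    """12-6 Lennard-Jones energy; minimum of -epsilon at r = 2^(1/6) * sigma."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def coulomb(r, q1, q2, dielectric=1.0):
    """Coulomb energy between two point charges."""
    return COULOMB_CONST * q1 * q2 / (dielectric * r)


def nonbonded_pair_energy(r, epsilon, sigma, q1, q2):
    """Sum of van der Waals and electrostatic terms for one atom pair."""
    return lennard_jones(r, epsilon, sigma) + coulomb(r, q1, q2)


if __name__ == "__main__":
    print(nonbonded_pair_energy(3.5, 0.15, 3.2, 0.4, -0.4))  # hypothetical pair
```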

Simulation Protocol

System setup: Solvate in explicit water, add ions for charge neutrality
Energy minimization: Remove steric clashes
Equilibration: NVT (constant temperature via thermostat: Langevin, Nose-Hoover) then NPT (constant pressure via barostat: Berendsen, Parrinello-Rahman)
Production: Typical timestep 2 fs (with SHAKE/LINCS constraints on bonds involving hydrogen atoms); 100 ns to microsecond timescales are routine
Analysis: RMSD, RMSF, radius of gyration, hydrogen bond analysis, PCA, free energy calculations
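Two of the analysis quantities above can be sketched without any trajectory library. Note the assumptions: rmsd here does not perform optimal superposition (the frames are assumed to be already aligned to the reference), and radius_of_gyration weights all atoms equally rather than by mass. Function names are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two pre-aligned coordinate lists of (x, y, z) tuples."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equal length")
    sq = sum(
        (xa - xb) ** 2
        for a, b in zip(coords_a, coords_b)
        for xa, xb in zip(a, b)
    )
    return math.sqrt(sq / len(coords_a))


def radius_of_gyration(coords):
    """Unweighted radius of gyration of a list of (x, y, z) tuples."""
    if not coords:
        raise ValueError("empty coordinate list")
    n = len(coords)
    center = [sum(c[d] for c in coords) / n for d in range(3)]
    sq = sum((c[d] - center[d]) ** 2 for c in coords for d in range(3))
    return math.sqrt(sq / n)
```

In practice, each frame is first superimposed on the reference (for example with the Kabsch algorithm) before computing RMSD, which is what the analysis tools bundled with MD packages do.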

Software

GROMACS: High performance, free, extensive GPU acceleration
AMBER: Commercial/academic; strong force field development tradition
OpenMM: Python-scriptable, GPU-accelerated, flexible
NAMD: Scalable to very large systems on supercomputers

Enhanced Sampling

Standard MD often becomes trapped in local energy minima. Enhanced sampling methods:

Replica exchange (REMD): Multiple replicas at different temperatures exchange configurations
Metadynamics: Adds history-dependent bias potential along collective variables to escape barriers
Accelerated MD/GaMD: Boost potential reduces energy barriers
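Steered MD/umbrella sampling: Apply external forces or harmonic restraints along a reaction coordinate; for umbrella sampling, recover the PMF by combining the windows with WHAM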
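For temperature replica exchange, a swap between replicas i and j is accepted with probability min(1, exp[(beta_i - beta_j)(E_i - E_j)]), where beta = 1 / (k_B T) and E is the potential energy. The sketch below implements only this acceptance test; the function names are hypothetical, and energies are assumed to be in kcal/mol.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def exchange_probability(energy_i, temp_i, energy_j, temp_j):
    """Metropolis acceptance probability for swapping two replicas."""
    beta_i = 1.0 / (K_B * temp_i)
    beta_j = 1.0 / (K_B * temp_j)
    exponent = (beta_i - beta_j) * (energy_i - energy_j)
    # A non-negative exponent means the swap is always accepted.
    return 1.0 if exponent >= 0 else math.exp(exponent)


def attempt_exchange(energy_i, temp_i, energy_j, temp_j, rng=random):
    """Return True if the swap between the two replicas is accepted."""
    p = exchange_probability(energy_i, temp_i, energy_j, temp_j)
    return rng.random() < p
```

Swaps are always accepted when the hotter replica currently holds the lower-energy configuration, which is how low-energy states found at high temperature migrate down the temperature ladder.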

Applications

Protein folding mechanisms, conformational changes, ligand binding/unbinding kinetics, membrane protein dynamics, allosteric mechanisms, drug residence time estimation, free energy perturbation for relative binding affinities.