Protein Structure Prediction and Analysis
Structural Hierarchy
Primary Structure
The linear amino acid sequence. Determined by the gene sequence; the starting point for all structural prediction. Twenty standard amino acids with diverse physicochemical properties: hydrophobic (A, V, L, I, M, F, W, P), polar (S, T, N, Q, Y, C), positively charged (K, R, and H, which is only partially protonated near physiological pH), negatively charged (D, E), and glycine (conformationally flexible; its side chain is a single hydrogen atom).
Secondary Structure
Local backbone conformations stabilized by hydrogen bonds between backbone amide (N-H) and carbonyl (C=O) groups:
- Alpha-helix: 3.6 residues per turn, hydrogen bond between residue i and i+4. Right-handed. Common in membrane-spanning regions.
- Beta-sheet: Extended strands connected by hydrogen bonds. Parallel or antiparallel orientation. Forms the core of many protein domains.
- Turns and loops: Connect regular secondary structures. Beta-turns involve 4 residues. Loops are variable in conformation and often functionally important (binding sites, active sites).
- Coil: Irregular structure without repeating pattern
Ramachandran plot: Maps backbone dihedral angles (phi, psi) for each residue. Most residues cluster in allowed regions corresponding to alpha-helix, beta-sheet, and left-handed helix conformations. Glycine has expanded allowed regions; proline is restricted.
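The phi and psi angles plotted on a Ramachandran diagram are ordinary four-atom dihedrals: phi is defined by C(i-1), N(i), CA(i), C(i), and psi by N(i), CA(i), C(i), N(i+1). A minimal sketch of the dihedral calculation, assuming coordinates are supplied as NumPy arrays in Angstroms (the function name `dihedral` is illustrative, not from any particular library):

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle (degrees) defined by four points."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Project b0 and b2 onto the plane perpendicular to the central bond b1
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.degrees(np.arctan2(y, x))

# For residue i (atom coordinates as length-3 arrays):
#   phi = dihedral(C_prev, N, CA, C)
#   psi = dihedral(N, CA, C, N_next)
```

Plotting the resulting (phi, psi) pairs for every residue of a structure gives its Ramachandran plot.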
Tertiary Structure
The three-dimensional fold of a single polypeptide chain, determined by:
- Hydrophobic effect (dominant driving force for folding)
- Hydrogen bonds, salt bridges, van der Waals interactions
- Disulfide bonds (covalent, stabilizing in oxidizing environments)
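Levinthal's paradox: Assuming about three accessible conformations per residue, a protein of 100 residues has 3^100 (roughly 5 x 10^47) possible conformations, yet it folds in milliseconds to seconds. Folding therefore cannot be an exhaustive search; it follows a funnel-shaped energy landscape that biases the chain toward the native state.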
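The arithmetic behind the paradox is easy to check. The sketch below assumes three conformations per residue and a sampling rate of 10^13 conformations per second; both numbers are illustrative assumptions, not measured values.

```python
# Assumption: 3 accessible backbone conformations per residue
n_residues = 100
states_per_residue = 3
n_conformations = states_per_residue ** n_residues  # exact Python integer

# Assumption: one conformation sampled every 0.1 ps (1e13 per second)
sampling_rate_per_s = 1e13
seconds_per_year = 3.156e7
search_years = n_conformations / sampling_rate_per_s / seconds_per_year

print(f"{n_conformations:.2e} conformations")
print(f"{search_years:.1e} years for an exhaustive search")
```

Even under these generous assumptions the exhaustive search time exceeds the age of the universe by many orders of magnitude.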
Quaternary Structure
Assembly of multiple polypeptide chains (subunits). Homo-oligomers (identical subunits) and hetero-oligomers. Governed by the same forces as tertiary structure plus specific interface interactions.
Experimental Structure Determination
- X-ray crystallography: Grow protein crystal, diffract X-rays, solve phase problem. Resolution typically 1.5-3.0 Angstroms. Represents a static, crystal-packing-influenced structure.
- Cryo-electron microscopy (cryo-EM): Flash-freeze sample, image with electron beam, single-particle reconstruction. Resolution revolution: now routinely <3 Angstroms for large complexes. No crystallization required.
- NMR spectroscopy: Solution-state structure via inter-atomic distance restraints (NOEs). Limited to small-medium proteins (<40 kDa typically). Provides dynamics information.
Secondary Structure Prediction
Three-state prediction (helix H, strand E, coil C). Modern methods achieve ~85% per-residue accuracy (Q3).
- PSIPRED: Two-stage neural network using PSI-BLAST PSSM features; robust and widely used
- DSSP: Assigns secondary structure to experimentally determined structures using hydrogen bond geometry (the ground truth for training predictors)
- NetSurfP-3.0: Protein language model-based prediction of secondary structure, solvent accessibility, and disorder
- JPred4: Web server running the JNet neural network algorithm on PSI-BLAST PSSM and HMM profiles built from a multiple sequence alignment
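Q3 is simply the percentage of residues whose predicted three-state label matches the DSSP-derived label. A minimal sketch follows; the 8-to-3 state reduction shown is one common convention and is an assumption here, since published benchmarks differ in how they map DSSP states.

```python
# Assumed 8-state to 3-state reduction (conventions vary between studies)
DSSP_TO_3STATE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}

def reduce_dssp(dssp: str) -> str:
    """Map a DSSP 8-state string to helix (H), strand (E), coil (C)."""
    return "".join(DSSP_TO_3STATE.get(s, "C") for s in dssp)

def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues with matching three-state labels."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)

observed = reduce_dssp("HHHGEEBT-S")   # -> "HHHHEEECCC"
print(q3_accuracy("HHHHEECCCC", observed))
```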
Homology Modeling (Comparative Modeling)
When a homolog with known structure exists (template), build a model by:
- Template identification: BLAST, HHpred (profile-profile), HMMER against PDB
- Target-template alignment: Critical step; errors propagate to the model. Profile-profile (HHalign) or structure-based (if multiple templates).
- Model building: Copy template backbone; rebuild loops and side chains
- Refinement: Energy minimization, molecular dynamics relaxation
- Evaluation: DOPE score, Ramachandran analysis, ProSA, MolProbity
MODELLER
Standard tool for homology modeling. Uses spatial restraints derived from the alignment and template structure:
- Distance restraints from aligned residue pairs
- Dihedral angle restraints from template backbone
- Statistical potential restraints from known structures
- Restraints satisfied by conjugate gradient optimization followed by molecular dynamics with simulated annealing
Accuracy: Reliable above ~30% sequence identity (safe homology zone). In the 20-30% twilight zone, alignment quality is the primary bottleneck. Below 20%, template detection becomes unreliable.
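The identity thresholds above can be applied directly to a pairwise alignment. This sketch assumes identity is computed over columns where neither sequence has a gap (other denominators are also in use), and the zone labels are illustrative names for the ranges quoted in the text.

```python
def percent_identity(aln_target: str, aln_template: str) -> float:
    """Percent identity over aligned, gap-free columns of a pairwise alignment."""
    if len(aln_target) != len(aln_template):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aln_target, aln_template)
             if a != "-" and b != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(a == b for a, b in pairs) / len(pairs)

def modeling_zone(identity: float) -> str:
    """Classify identity into the ranges described in the text."""
    if identity >= 30.0:
        return "safe homology zone"
    if identity >= 20.0:
        return "twilight zone"
    return "below twilight zone"

ident = percent_identity("MKT-AYIAKQR", "MKSLAYLA-QR")
print(round(ident, 1), modeling_zone(ident))
```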
Rosetta
A comprehensive macromolecular modeling suite using Monte Carlo sampling with an energy function combining physics-based and knowledge-based terms.
Energy Function (REF2015)
Components: van der Waals (Lennard-Jones), hydrogen bonding (orientation-dependent), electrostatics (Coulomb with distance-dependent dielectric), solvation (implicit solvent, Lazaridis-Karplus), backbone torsion preferences (Ramachandran), rotamer probabilities (Dunbrack library), reference energies per amino acid type.
Ab Initio Structure Prediction (pre-AlphaFold)
Fragment assembly approach:
- Generate 3-mer and 9-mer fragment libraries from known structures matching sequence profile
- Monte Carlo assembly: iteratively replace backbone fragments, evaluate energy
- Low-resolution (centroid) stage for global fold search, then full-atom refinement
- Cluster decoys; select cluster centers as predictions
Effective for small proteins (<150 residues) with simple topologies. Limited by sampling for larger proteins.
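The Monte Carlo assembly step can be illustrated with a toy model. The sketch below is not Rosetta code: the conformation is a plain list of torsion angles, the fragment library is a hand-written list of 3-residue tuples, and the energy function is a hypothetical stand-in supplied by the caller. Only the Metropolis acceptance rule and the fragment replacement move mirror the procedure described above.

```python
import math
import random

def metropolis_accept(delta_e, kT, rng):
    """Accept downhill moves always, uphill moves with Boltzmann probability."""
    return delta_e <= 0 or rng.random() < math.exp(-delta_e / kT)

def fragment_assembly(energy, fragments, n_res, frag_len=3,
                      steps=2000, kT=1.0, seed=0):
    """Toy fragment-insertion Monte Carlo over a list of torsion angles."""
    rng = random.Random(seed)
    conf = [0.0] * n_res                  # arbitrary starting conformation
    e = energy(conf)
    best, best_e = list(conf), e
    for _ in range(steps):
        start = rng.randrange(n_res - frag_len + 1)   # insertion window
        frag = rng.choice(fragments)                  # candidate fragment
        trial = conf[:start] + list(frag) + conf[start + frag_len:]
        e_trial = energy(trial)
        if metropolis_accept(e_trial - e, kT, rng):
            conf, e = trial, e_trial
            if e < best_e:
                best, best_e = list(conf), e
    return best, best_e

# Hypothetical energy: penalize deviation from a helix-like angle of -60 degrees
toy_energy = lambda angles: sum((a + 60.0) ** 2 for a in angles) / 100.0
toy_fragments = [(-60.0, -60.0, -60.0), (-120.0, -120.0, -120.0), (0.0, 0.0, 0.0)]
best, best_e = fragment_assembly(toy_energy, toy_fragments, n_res=9)
```

In a real protocol many independent trajectories would be run and the resulting decoys clustered, as in the final step of the list above.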
Other Rosetta Applications
- RosettaDock: Protein-protein docking with local and global search
- RosettaDesign: Fixed-backbone and flexible-backbone protein design
- RoseTTAFold: Deep learning structure prediction (three-track architecture)
- Rosetta loop modeling: KIC (kinematic closure) for loop conformation sampling
AlphaFold
AlphaFold2 Architecture
DeepMind's AlphaFold2 (2020) achieved near-experimental accuracy in CASP14, fundamentally transforming structural biology.
Input Processing:
- MSA construction: JackHMMER against UniRef90 and MGnify, HHblits against BFD and Uniclust30; iterative searches build deep alignments
- Template search: HHsearch against PDB70
- MSA representation: N_seq x N_res x c_m tensor (pair-weighted)
- Pair representation: N_res x N_res x c_z tensor (initialized from relative positional encoding and template features)
Evoformer Module (48 blocks):
- MSA row-wise attention: Each row attends to other positions, informed by pair representation (biased attention)
- MSA column-wise attention: Each column attends to other sequences (captures coevolutionary patterns)
- Pair representation update: Outer product mean from MSA representation; triangular multiplicative updates (modeling that if residues i-j and j-k are close, i-k likely is too); triangular self-attention
- Transition blocks: Feed-forward layers between attention operations
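The core of the triangular multiplicative update is a contraction over a third residue k. The NumPy sketch below shows only that gated contraction for the "outgoing edges" variant; it omits the layer normalization, output projection, and output gate of the published algorithm, and the weight matrices are random placeholders rather than trained parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def triangle_multiply_outgoing(z, w_a, w_b, w_ga, w_gb):
    """Simplified outgoing-edge update: out[i, j] = sum_k a[i, k] * b[j, k].

    z has shape (N_res, N_res, c); each weight matrix has shape (c, c).
    """
    a = sigmoid(z @ w_ga) * (z @ w_a)   # gated projection of edges i-k
    b = sigmoid(z @ w_gb) * (z @ w_b)   # gated projection of edges j-k
    # Edge i-j is updated from every pair of edges (i-k, j-k) sharing node k
    return np.einsum("ikc,jkc->ijc", a, b)

rng = np.random.default_rng(0)
n_res, c = 4, 3
z = rng.normal(size=(n_res, n_res, c))
weights = [rng.normal(size=(c, c)) for _ in range(4)]
update = triangle_multiply_outgoing(z, *weights)
print(update.shape)
```

Because every edge i-j sees all edges that complete a triangle with it, the pair representation is pushed toward geometrically consistent distances.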
Structure Module:
- Operates on single representation + pair representation
- Invariant Point Attention (IPA): Attention mechanism operating in 3D coordinate frames attached to each residue; its output is invariant to global rotations and translations of the structure
- Iteratively updates backbone frames (rotation + translation per residue) and side chain torsion angles
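- The structure module applies 8 layers with shared weights; separately, the whole network is recycled (3 recycling iterations by default), feeding its outputs back as inputs to refine predictions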
Loss Function:
- FAPE (Frame Aligned Point Error): Primary structural loss comparing predicted and true atom positions after expressing them in the local frame of each residue
- Auxiliary losses: distogram prediction, masked MSA prediction, pTM (predicted TM-score) head, experimentally resolved head
- Confidence: pLDDT (predicted Local Distance Difference Test) per residue; PAE (Predicted Aligned Error) per residue pair
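The frame aligned point error loss listed above can be sketched in a few lines of NumPy. This is a simplified illustration, assuming one rigid frame (rotation matrix plus translation) per residue and one point set; the clamp of 10 Angstroms, the scale of 10 Angstroms, and the small epsilon follow the published description, but the function itself is not taken from the AlphaFold codebase.

```python
import numpy as np

def fape(R_pred, t_pred, x_pred, R_true, t_true, x_true,
         clamp=10.0, scale=10.0, eps=1e-4):
    """Simplified frame aligned point error.

    R_*: (N_frames, 3, 3) rotations, t_*: (N_frames, 3) translations,
    x_*: (N_points, 3) atom positions. Distances in Angstroms.
    """
    def to_local(R, t, x):
        # Express every point j in every frame i: R_i^T (x_j - t_i)
        diff = x[None, :, :] - t[:, None, :]
        return np.einsum("iba,ijb->ija", R, diff)

    delta = to_local(R_pred, t_pred, x_pred) - to_local(R_true, t_true, x_true)
    dist = np.sqrt(np.sum(delta ** 2, axis=-1) + eps)
    return float(np.mean(np.minimum(dist, clamp)) / scale)
```

Because each point is compared in local frames, the loss is unchanged when the whole prediction is rotated or translated, so no global superposition is needed.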
pLDDT Interpretation
| pLDDT Range | Interpretation |
|---|---|
| >90 | Very high confidence; backbone and side chains reliable |
| 70-90 | Confident; backbone reliable, some side chain uncertainty |
| 50-70 | Low confidence; may indicate genuine disorder or modeling difficulty |
| <50 | Very low; likely disordered or poorly predicted |
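These ranges translate into a small helper for summarizing a model. The sketch assumes per-residue pLDDT values are already available as a list of floats; how the exact boundary values 50, 70, and 90 are assigned is a choice made here, since the ranges above do not specify it.

```python
from collections import Counter

def plddt_category(plddt: float) -> str:
    """Map a pLDDT score (0-100) to a confidence category."""
    if plddt > 90:
        return "very high"
    if plddt >= 70:
        return "confident"
    if plddt >= 50:
        return "low"
    return "very low"

def confidence_summary(scores):
    """Fraction of residues falling in each confidence category."""
    counts = Counter(plddt_category(s) for s in scores)
    return {cat: n / len(scores) for cat, n in counts.items()}

print(confidence_summary([95.2, 88.1, 72.4, 61.0, 34.9]))
```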
AlphaFold3
Released 2024. Key advances:
- Unified architecture for proteins, nucleic acids, small molecules, ions, post-translational modifications
- Diffusion-based structure generation (replaces explicit frame updates)
- Pairformer replaces Evoformer (operates on the pair and single representations; MSA processing is reduced to a much smaller upstream module)
- Cross-attention for ligand and nucleic acid features
- Improved multimer prediction
AlphaFold Database
More than 200 million predicted structures covering most of UniProt. Accessible via UniProt integration and a programmatic API. Per-residue and pairwise confidence scores (pLDDT, PAE) should guide decisions about which regions of a model are usable.
Protein-Protein Docking
Predicts the structure of protein complexes from individual component structures.
Computational Approaches
- Rigid-body docking: FFT-based shape complementarity search (ZDOCK, ClusPro). Generates thousands of poses; scoring functions rank by shape, electrostatics, desolvation.
- Flexible docking: HADDOCK (data-driven, uses experimental restraints from NMR, cross-linking, mutagenesis); RosettaDock (Monte Carlo with backbone/side chain flexibility)
- AlphaFold-Multimer: Co-folding of protein complexes using paired MSAs; state-of-the-art for many interaction types
Scoring and Evaluation
- CAPRI criteria: Acceptable/medium/high quality based on interface RMSD (i-RMSD), ligand RMSD (L-RMSD), and fraction of native contacts (fnat)
- DockQ: Single quality score combining i-RMSD, L-RMSD, and fnat
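DockQ averages fnat with two scaled RMSD terms so that all three components fall between 0 and 1. A minimal sketch using the scaling constants from the original DockQ formulation (8.5 Angstroms for L-RMSD and 1.5 Angstroms for i-RMSD); the quality cutoffs in `dockq_quality` are the commonly quoted DockQ thresholds and are stated here from memory, so treat them as assumptions to verify.

```python
def dockq(fnat: float, irmsd: float, lrmsd: float,
          d_lrmsd: float = 8.5, d_irmsd: float = 1.5) -> float:
    """DockQ score in [0, 1] from fnat, i-RMSD and L-RMSD (Angstroms)."""
    def scaled(rmsd, d):
        # Maps RMSD 0 -> 1 and large RMSD -> 0
        return 1.0 / (1.0 + (rmsd / d) ** 2)
    return (fnat + scaled(lrmsd, d_lrmsd) + scaled(irmsd, d_irmsd)) / 3.0

def dockq_quality(score: float) -> str:
    """Assumed correspondence between DockQ and CAPRI-style quality classes."""
    if score >= 0.80:
        return "high"
    if score >= 0.49:
        return "medium"
    if score >= 0.23:
        return "acceptable"
    return "incorrect"

score = dockq(fnat=0.5, irmsd=1.5, lrmsd=8.5)
print(score, dockq_quality(score))
```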
Molecular Dynamics (MD) Simulations
Simulates the time evolution of a molecular system by numerically integrating Newton's equations of motion.
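Numerical integration of Newton's equations is commonly done with the velocity Verlet scheme. The sketch below applies it to a one-dimensional harmonic oscillator in arbitrary units, which is an illustrative stand-in for a real force field; production MD engines implement the same idea for millions of atoms.

```python
def velocity_verlet(x, v, force, mass, dt, n_steps):
    """Integrate Newton's equations of motion for one degree of freedom."""
    f = force(x)
    trajectory = [(x, v)]
    for _ in range(n_steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
        trajectory.append((x, v))
    return trajectory

# Harmonic oscillator with spring constant k (arbitrary units)
k, mass = 1.0, 1.0
traj = velocity_verlet(x=1.0, v=0.0, force=lambda x: -k * x,
                       mass=mass, dt=0.01, n_steps=1000)
x_end, v_end = traj[-1]
energy_end = 0.5 * mass * v_end ** 2 + 0.5 * k * x_end ** 2
print(x_end, energy_end)
```

The total energy stays close to its initial value of 0.5, which is the property that makes this class of integrator suitable for long simulations.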
Force Fields
Parameterized energy functions for bonded (bonds, angles, dihedrals) and non-bonded (van der Waals, electrostatic) interactions:
- AMBER (ff19SB): Widely used for proteins; OPC water model
- CHARMM (C36m): Comprehensive parameter sets for proteins, lipids, nucleic acids, small molecules
- OPLS-AA: Optimized for liquid-state properties
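The non-bonded part of these force fields is typically a 12-6 Lennard-Jones term plus a Coulomb term for each atom pair. A minimal sketch in kcal/mol, Angstroms, and elementary charges; the Coulomb constant of about 332.06 is an approximate value (packages differ in the last digits), and the parameters in the usage line are made up for illustration.

```python
COULOMB_CONSTANT = 332.06  # kcal*Angstrom/(mol*e^2), approximate assumed value

def lennard_jones(r: float, epsilon: float, sigma: float) -> float:
    """12-6 Lennard-Jones energy; minimum of -epsilon at r = 2**(1/6) * sigma."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 ** 2 - sr6)

def coulomb(r: float, q_i: float, q_j: float) -> float:
    """Coulomb energy between two point charges in vacuum."""
    return COULOMB_CONSTANT * q_i * q_j / r

def nonbonded_energy(r, epsilon, sigma, q_i, q_j):
    """Sum of van der Waals and electrostatic pair energies."""
    return lennard_jones(r, epsilon, sigma) + coulomb(r, q_i, q_j)

# Hypothetical parameters for a single atom pair
print(nonbonded_energy(r=3.5, epsilon=0.15, sigma=3.4, q_i=0.4, q_j=-0.4))
```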
Simulation Protocol
- System setup: Solvate in explicit water, add ions for charge neutrality
- Energy minimization: Remove steric clashes
- Equilibration: NVT (constant temperature via thermostat: Langevin, Nose-Hoover) then NPT (constant pressure via barostat: Berendsen, Parrinello-Rahman)
- Production: Typical timestep 2 fs (with SHAKE/LINCS constraints on bonds involving hydrogen atoms); 100 ns to microsecond timescales routine
- Analysis: RMSD, RMSF, radius of gyration, hydrogen bond analysis, PCA, free energy calculations
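Two of the analyses listed above, RMSD and radius of gyration, reduce to short NumPy routines. The sketch assumes coordinates are (N, 3) arrays for a single frame and uses the Kabsch algorithm for optimal superposition before computing RMSD; the radius of gyration here is unweighted, which assumes equal atomic masses.

```python
import numpy as np

def kabsch_rmsd(P, Q):
    """RMSD between (N, 3) coordinate sets after optimal rigid superposition."""
    P = P - P.mean(axis=0)                 # remove translation
    Q = Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(P.T @ Q)      # SVD of the covariance matrix
    d = np.sign(np.linalg.det(Vt.T @ U.T)) # guard against improper rotation
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    P_rot = P @ R.T                        # rotate P onto Q
    return float(np.sqrt(np.mean(np.sum((P_rot - Q) ** 2, axis=1))))

def radius_of_gyration(X):
    """Unweighted radius of gyration of an (N, 3) coordinate set."""
    centered = X - X.mean(axis=0)
    return float(np.sqrt(np.mean(np.sum(centered ** 2, axis=1))))
```

Applied frame by frame against a reference structure, `kabsch_rmsd` yields the RMSD time series usually plotted to judge equilibration.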
Software
- GROMACS: High performance, free, extensive GPU acceleration
- AMBER: Commercial/academic; strong force field development tradition
- OpenMM: Python-scriptable, GPU-accelerated, flexible
- NAMD: Scalable to very large systems on supercomputers
Enhanced Sampling
Standard MD often becomes trapped in local energy minima on accessible timescales. Enhanced sampling methods address this:
- Replica exchange (REMD): Multiple replicas at different temperatures exchange configurations
- Metadynamics: Adds history-dependent bias potential along collective variables to escape barriers
- Accelerated MD/GaMD: Boost potential reduces energy barriers
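- Steered MD/umbrella sampling: Apply external forces or restraining potentials along a reaction coordinate; the PMF is recovered from umbrella sampling windows via WHAM (steered MD pulling work is typically analyzed with Jarzynski's equality)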
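For temperature replica exchange, a swap between replicas i and j is accepted with the Metropolis probability min(1, exp[(beta_i - beta_j)(E_i - E_j)]), where beta = 1/(k_B T). A minimal sketch with energies in kcal/mol; the function name is illustrative and the temperatures and energies in the usage lines are made-up values.

```python
import math

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)

def remd_swap_probability(e_i: float, t_i: float,
                          e_j: float, t_j: float) -> float:
    """Acceptance probability for exchanging configurations of two replicas.

    e_i, e_j: potential energies (kcal/mol); t_i, t_j: temperatures (K).
    """
    beta_i = 1.0 / (K_B * t_i)
    beta_j = 1.0 / (K_B * t_j)
    exponent = (beta_i - beta_j) * (e_i - e_j)
    return 1.0 if exponent >= 0 else math.exp(exponent)

# Colder replica holds the higher energy: swap is always accepted
print(remd_swap_probability(-100.0, 300.0, -105.0, 310.0))
# Colder replica holds the lower energy: swap accepted only sometimes
print(remd_swap_probability(-105.0, 300.0, -100.0, 310.0))
```

Closely spaced temperatures keep this probability reasonably high, which is why the number of replicas needed grows with system size.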
Applications
Protein folding mechanisms, conformational changes, ligand binding/unbinding kinetics, membrane protein dynamics, allosteric mechanisms, drug residence time estimation, free energy perturbation for relative binding affinities.