7 min read
On this page

Computational Drug Discovery

Drug-Target Interaction

Drug discovery begins with identifying molecular targets (proteins, nucleic acids) causally linked to disease, then finding compounds that modulate their activity.

Target Identification and Validation

  • Genetic evidence: GWAS hits, Mendelian disease genes, loss-of-function intolerance (pLI scores), CRISPR screens
  • Expression data: Differential expression in disease vs. normal tissue; single-cell resolution reveals cell-type specificity
  • Druggability assessment: Does the target have a binding pocket amenable to small molecule binding? Druggability prediction via DoGSiteScorer, FPocket, SiteMap

Binding Site Analysis

Binding pockets are characterized by:

  • Volume and shape: Solvent-accessible surface, pocket depth, enclosure
  • Physicochemical properties: Hydrophobicity, hydrogen bond donors/acceptors, electrostatic potential
  • Druggability scores: Empirical scoring based on pocket properties (Lipinski-like criteria for pockets)
  • Cryptic sites: Allosteric pockets revealed only upon ligand binding or conformational change; detected by MD simulations or fragment screening

Molecular Docking

Predicts the binding pose and affinity of a small molecule (ligand) within a protein binding site.

Docking Algorithm Components

Search algorithm: Explores ligand conformational and positional space within the binding site.

  • Systematic search: Incremental construction (DOCK), exhaustive rotation/translation
  • Stochastic methods: Genetic algorithms (AutoDock, GOLD), Monte Carlo (ICM), simulated annealing
  • Fragment-based: Grow ligand incrementally within the site (FlexX)

Scoring function: Estimates binding affinity for each pose.

  • Force field-based: Sum of van der Waals, electrostatic, desolvation terms (AutoDock)
  • Empirical: Weighted sum of interaction terms (hydrogen bonds, hydrophobic contacts, rotatable bonds) calibrated against experimental binding data (ChemPLP in GOLD, GlideScore)
  • Knowledge-based: Statistical potentials derived from known protein-ligand complexes (PMF, DrugScore)
  • Machine learning: RF-Score, OnionNet-2; trained on PDBbind experimental binding affinities

AutoDock / AutoDock Vina

AutoDock4: Lamarckian genetic algorithm with grid-based energy evaluation. Precomputes affinity maps for each atom type on a 3D grid. Semi-empirical free energy scoring function.

AutoDock Vina: Faster successor using iterated local search with gradient-based optimization. Empirical scoring function with steric, hydrophobic, hydrogen bonding, and rotatable bond penalty terms. Multithreaded; widely used for medium-throughput docking.

AutoDock-GPU: GPU-accelerated AutoDock4 enabling large-scale virtual screening.

Docking Limitations

  • Scoring accuracy: Correlation between docking scores and experimental binding affinity is typically R ~ 0.4-0.6; scores are better for pose ranking than affinity prediction
  • Protein flexibility: Most docking treats the receptor as rigid or with limited flexibility (soft potentials, rotamer libraries, ensemble docking)
  • Water molecules: Bridging water molecules mediate many protein-ligand interactions; typically neglected
  • Entropy: Conformational entropy changes upon binding are poorly captured

Virtual Screening

Computational selection of candidate compounds from large libraries for experimental testing.

Structure-Based Virtual Screening (SBVS)

  1. Prepare protein structure (add hydrogens, assign charges, define binding site)
  2. Prepare compound library (enumerate tautomers, protonation states, 3D conformers)
  3. Dock all compounds; rank by score
  4. Post-filter: clustering, visual inspection, pharmacokinetic filters
  5. Select top compounds (typically 100-1000 from millions) for experimental assay

Ultra-large library docking: Libraries >1 billion molecules (Enamine REAL, virtual combinatorial libraries). V-FLOW, VirtualFlow, and ZINC-22/Docking@Home enable massively parallel docking on cloud/HPC.

Ligand-Based Virtual Screening

When no target structure is available, use known active compounds as reference:

  • Pharmacophore modeling: Abstract 3D arrangement of chemical features (H-bond donors/acceptors, hydrophobic centers, aromatic rings, charge). Screen libraries for pharmacophore matches (Phase, Catalyst)
  • Shape similarity: Overlay 3D molecular shapes (ROCS, USR)
  • 2D similarity: Tanimoto coefficient on molecular fingerprints

Quantitative Structure-Activity Relationships (QSAR)

QSAR models predict biological activity from molecular descriptors.

Descriptors

  • Physicochemical: MW, logP, TPSA, HBD/HBA count, rotatable bonds
  • Topological: Connectivity indices, Wiener index, Balaban J index
  • Electronic: Partial charges, HOMO/LUMO energies, polarizability
  • 3D: Shape descriptors, WHIM, GETAWAY, pharmacophore fingerprints

Modeling Methods

  • Linear: Multiple linear regression, partial least squares (PLS)
  • Non-linear: Random forest, gradient boosting (XGBoost), support vector machines (SVM)
  • Deep learning: Graph neural networks (GNNs) on molecular graphs; message passing neural networks (MPNN, SchNet, DimeNet); transformer-based (MolBERT, ChemBERTa)

Validation

  • Internal: k-fold cross-validation, leave-one-out
  • External: Held-out test set with temporal or scaffold splits (not random, to avoid data leakage from analog series)
  • Applicability domain: Ensure predictions are made within the chemical space of training data (leverage, similarity thresholds)
  • OECD principles: Defined endpoint, unambiguous algorithm, applicability domain, appropriate measures of fit and predictivity, mechanistic interpretation where possible

ADMET Prediction

Predict absorption, distribution, metabolism, excretion, and toxicity properties computationally to prioritize drug candidates.

Key Properties

| Property | Significance | Computational Approach | |---|---|---| | Solubility | Oral bioavailability | ESOL model, ML on logS | | Permeability | Membrane crossing (Caco-2) | PAMPA prediction, PSA-based rules | | CYP inhibition | Drug-drug interactions | Classification models (CYP 1A2, 2C9, 2C19, 2D6, 3A4) | | hERG inhibition | Cardiac toxicity risk | QSAR models, docking to hERG channel | | Metabolic stability | Half-life, dosing frequency | Site-of-metabolism prediction (SMARTCyp, XenoSite) | | BBB permeation | CNS drug access | Multi-task neural networks | | Mutagenicity | Ames test prediction | Derek Nexus, structural alerts |

Lipinski's Rule of Five

Empirical guidelines for oral bioavailability: MW <= 500, logP <= 5, HBD <= 5, HBA <= 10. Violations reduce likelihood of oral absorption. Extensions: Veber rules (rotatable bonds <= 10, TPSA <= 140), beyond Rule of 5 (bRo5) for larger molecules.

Integrated Platforms

  • SwissADME: Web-based physicochemical and pharmacokinetic property prediction
  • pkCSM: ML-based ADMET prediction using graph-based signatures
  • ADMETlab 2.0: Comprehensive ADMET property prediction

De Novo Drug Design

Generate novel molecules with desired properties from scratch.

Generative Models

Variational Autoencoders (VAEs): Encode molecules (SMILES strings or graphs) into a continuous latent space; decode latent vectors into molecules. Latent space enables interpolation and optimization. SMILES-based (Gomez-Bombarelli, 2018) or graph-based (JT-VAE: junction tree encoding ensures chemical validity).

Generative Adversarial Networks (GANs): Generator creates molecules; discriminator distinguishes real from generated. ORGAN adds a reinforcement learning objective for property optimization. Mode collapse is a practical challenge.

Autoregressive Models: Generate molecules token-by-token (SMILES) or atom-by-atom (graph). Language model architectures (RNN, Transformer) naturally handle sequential generation. REINVENT uses transfer learning from a pre-trained generative model, fine-tuned via RL for desired properties.

Diffusion Models: Generate 3D molecular structures through iterative denoising. EDM (Equivariant Diffusion Models), DiffSBDD (structure-based drug design conditioned on binding pocket geometry). Produce 3D coordinates directly.

Reinforcement Learning for Drug Design

Frame molecular generation as a sequential decision process:

  • State: Partially constructed molecule
  • Action: Add atom/fragment, form bond
  • Reward: Multi-objective function combining predicted activity, synthetic accessibility (SA score), drug-likeness (QED), novelty
  • Policy: Neural network (RNN or GNN) trained via REINFORCE or PPO

Synthetic Accessibility

Generated molecules must be synthetically feasible. SA scores (Ertl, 2009) estimate synthetic difficulty from fragment contributions. Retrosynthetic analysis tools (ASKCOS, IBM RXN, Syntheseus) plan synthesis routes using template-based or template-free neural models.

Molecular Fingerprints

Binary or count vectors encoding molecular structure for similarity searching and ML.

Types

  • ECFP (Extended Connectivity Fingerprints): Circular fingerprints generated by Morgan algorithm. Each atom initializes with an identifier; iteratively updated by incorporating neighbor information up to a specified radius (ECFP4 = radius 2). Hashed to fixed-length bit vectors (1024 or 2048 bits). The most widely used fingerprint in cheminformatics.
  • MACCS keys: 166 predefined structural keys (presence/absence of specific substructures)
  • Topological (Daylight): Path-based fingerprints encoding all linear paths up to a given length
  • RDKit fingerprint: Topological fingerprint implementation in RDKit
  • Atom-pair and topological torsion: Encode pairs/triplets of atoms with topological distance

Similarity Metrics

Tanimoto coefficient: T(A,B) = |A intersection B| / |A union B|. Standard for binary fingerprints. T > 0.85 on ECFP4 generally indicates structural analogs with similar activity (the "similarity principle").

RDKit

Open-source cheminformatics toolkit (Python/C++). Core capabilities:

  • Molecular I/O: Parse SMILES, SDF, MOL2; generate 2D/3D coordinates (ETKDG algorithm)
  • Descriptor calculation: >200 descriptors; Morgan fingerprints; pharmacophore fingerprints
  • Substructure search: SMARTS pattern matching
  • Chemical transformations: Reaction SMARTS for virtual reactions
  • Conformer generation: Distance geometry with force field optimization (MMFF94, UFF)
  • Visualization: 2D depiction, highlighting, alignment

Drug Repurposing

Identify new therapeutic indications for existing drugs, reducing development time and cost.

Computational Approaches

  • Network-based: Drug-target-disease networks; proximity of drug targets to disease genes in protein-protein interaction networks (Guney et al.)
  • Signature matching: Compare drug-induced gene expression changes (L1000/CMap) with disease expression signatures; anticorrelated signatures suggest therapeutic potential
  • Knowledge graph embedding: Embed drugs, targets, diseases in a shared vector space (TransE, RotatE, ComplEx); predict missing links. DRKG, Hetionet.
  • Clinical data mining: Electronic health records, adverse event databases (FAERS); identify unexpected beneficial effects of prescribed drugs
  • Molecular similarity: Screen approved drug libraries against new targets using docking or fingerprint similarity

Success Stories

Thalidomide (repurposed for multiple myeloma), sildenafil (hypertension to erectile dysfunction), remdesivir (Ebola to COVID-19). Computational repurposing identified baricitinib for COVID-19 using knowledge graph approaches.