Computational Drug Discovery

Drug-Target Interaction

Drug discovery begins with identifying molecular targets (proteins, nucleic acids) causally linked to disease, then finding compounds that modulate their activity.

Target Identification and Validation

Genetic evidence: GWAS hits, Mendelian disease genes, loss-of-function intolerance (pLI scores), CRISPR screens
Expression data: Differential expression in disease vs. normal tissue; single-cell resolution reveals cell-type specificity
Druggability assessment: Does the target have a binding pocket amenable to small molecule binding? Druggability prediction via DoGSiteScorer, FPocket, SiteMap

Binding Site Analysis

Binding pockets are characterized by:

Volume and shape: Solvent-accessible surface, pocket depth, enclosure
Physicochemical properties: Hydrophobicity, hydrogen bond donors/acceptors, electrostatic potential
Druggability scores: Empirical scoring based on pocket properties (Lipinski-like criteria for pockets)
Cryptic sites: Allosteric pockets revealed only upon ligand binding or conformational change; detected by MD simulations or fragment screening

Molecular Docking

Predicts the binding pose and affinity of a small molecule (ligand) within a protein binding site.

Docking Algorithm Components

Search algorithm: Explores ligand conformational and positional space within the binding site.

Systematic search: Incremental construction (DOCK), exhaustive rotation/translation
Stochastic methods: Genetic algorithms (AutoDock, GOLD), Monte Carlo (ICM), simulated annealing
Fragment-based: Grow ligand incrementally within the site (FlexX)

Scoring function: Estimates binding affinity for each pose.

Force field-based: Sum of van der Waals, electrostatic, desolvation terms (AutoDock)
Empirical: Weighted sum of interaction terms (hydrogen bonds, hydrophobic contacts, rotatable bonds) calibrated against experimental binding data (ChemPLP in GOLD, GlideScore)
Knowledge-based: Statistical potentials derived from known protein-ligand complexes (PMF, DrugScore)
Machine learning: RF-Score, OnionNet-2; trained on PDBbind experimental binding affinities

AutoDock / AutoDock Vina

AutoDock4: Lamarckian genetic algorithm with grid-based energy evaluation. Precomputes affinity maps for each atom type on a 3D grid. Semi-empirical free energy scoring function.

AutoDock Vina: Faster successor using iterated local search with gradient-based optimization. Empirical scoring function with steric, hydrophobic, hydrogen bonding, and rotatable bond penalty terms. Multithreaded; widely used for medium-throughput docking.

AutoDock-GPU: GPU-accelerated AutoDock4 enabling large-scale virtual screening.

Docking Limitations

Scoring accuracy: Correlation between docking scores and experimental binding affinity is typically R ~ 0.4-0.6; scores are better for pose ranking than affinity prediction
Protein flexibility: Most docking treats the receptor as rigid or with limited flexibility (soft potentials, rotamer libraries, ensemble docking)
Water molecules: Bridging water molecules mediate many protein-ligand interactions; typically neglected
Entropy: Conformational entropy changes upon binding are poorly captured

Virtual Screening

Computational selection of candidate compounds from large libraries for experimental testing.

Structure-Based Virtual Screening (SBVS)

Prepare protein structure (add hydrogens, assign charges, define binding site)
Prepare compound library (enumerate tautomers, protonation states, 3D conformers)
Dock all compounds; rank by score
Post-filter: clustering, visual inspection, pharmacokinetic filters
Select top compounds (typically 100-1000 from millions) for experimental assay

Ultra-large library docking: Libraries >1 billion molecules (Enamine REAL, virtual combinatorial libraries). V-FLOW, VirtualFlow, and ZINC-22/Docking@Home enable massively parallel docking on cloud/HPC.

Ligand-Based Virtual Screening

When no target structure is available, use known active compounds as reference:

Pharmacophore modeling: Abstract 3D arrangement of chemical features (H-bond donors/acceptors, hydrophobic centers, aromatic rings, charge). Screen libraries for pharmacophore matches (Phase, Catalyst)
Shape similarity: Overlay 3D molecular shapes (ROCS, USR)
2D similarity: Tanimoto coefficient on molecular fingerprints

Quantitative Structure-Activity Relationships (QSAR)

QSAR models predict biological activity from molecular descriptors.

Descriptors

Physicochemical: MW, logP, TPSA, HBD/HBA count, rotatable bonds
Topological: Connectivity indices, Wiener index, Balaban J index
Electronic: Partial charges, HOMO/LUMO energies, polarizability
3D: Shape descriptors, WHIM, GETAWAY, pharmacophore fingerprints

Modeling Methods

Linear: Multiple linear regression, partial least squares (PLS)
Non-linear: Random forest, gradient boosting (XGBoost), support vector machines (SVM)
Deep learning: Graph neural networks (GNNs) on molecular graphs; message passing neural networks (MPNN, SchNet, DimeNet); transformer-based (MolBERT, ChemBERTa)

Validation

Internal: k-fold cross-validation, leave-one-out
External: Held-out test set with temporal or scaffold splits (not random, to avoid data leakage from analog series)
Applicability domain: Ensure predictions are made within the chemical space of training data (leverage, similarity thresholds)
OECD principles: Defined endpoint, unambiguous algorithm, applicability domain, appropriate measures of fit and predictivity, mechanistic interpretation where possible

ADMET Prediction

Predict absorption, distribution, metabolism, excretion, and toxicity properties computationally to prioritize drug candidates.

Key Properties

Property	Significance	Computational Approach
Solubility	Oral bioavailability	ESOL model, ML on logS
Permeability	Membrane crossing (Caco-2)	PAMPA prediction, PSA-based rules
CYP inhibition	Drug-drug interactions	Classification models (CYP 1A2, 2C9, 2C19, 2D6, 3A4)
hERG inhibition	Cardiac toxicity risk	QSAR models, docking to hERG channel
Metabolic stability	Half-life, dosing frequency	Site-of-metabolism prediction (SMARTCyp, XenoSite)
BBB permeation	CNS drug access	Multi-task neural networks
Mutagenicity	Ames test prediction	Derek Nexus, structural alerts

Lipinski's Rule of Five

Empirical guidelines for oral bioavailability: MW <= 500, logP <= 5, HBD <= 5, HBA <= 10. Violations reduce likelihood of oral absorption. Extensions: Veber rules (rotatable bonds <= 10, TPSA <= 140), beyond Rule of 5 (bRo5) for larger molecules.

Integrated Platforms

SwissADME: Web-based physicochemical and pharmacokinetic property prediction
pkCSM: ML-based ADMET prediction using graph-based signatures
ADMETlab 2.0: Comprehensive ADMET property prediction

De Novo Drug Design

Generate novel molecules with desired properties from scratch.

Generative Models

Variational Autoencoders (VAEs): Encode molecules (SMILES strings or graphs) into a continuous latent space; decode latent vectors into molecules. Latent space enables interpolation and optimization. SMILES-based (Gomez-Bombarelli, 2018) or graph-based (JT-VAE: junction tree encoding ensures chemical validity).

Generative Adversarial Networks (GANs): Generator creates molecules; discriminator distinguishes real from generated. ORGAN adds a reinforcement learning objective for property optimization. Mode collapse is a practical challenge.

Autoregressive Models: Generate molecules token-by-token (SMILES) or atom-by-atom (graph). Language model architectures (RNN, Transformer) naturally handle sequential generation. REINVENT uses transfer learning from a pre-trained generative model, fine-tuned via RL for desired properties.

Diffusion Models: Generate 3D molecular structures through iterative denoising. EDM (Equivariant Diffusion Models), DiffSBDD (structure-based drug design conditioned on binding pocket geometry). Produce 3D coordinates directly.

Reinforcement Learning for Drug Design

Frame molecular generation as a sequential decision process:

State: Partially constructed molecule
Action: Add atom/fragment, form bond
Reward: Multi-objective function combining predicted activity, synthetic accessibility (SA score), drug-likeness (QED), novelty
Policy: Neural network (RNN or GNN) trained via REINFORCE or PPO

Synthetic Accessibility

Generated molecules must be synthetically feasible. SA scores (Ertl, 2009) estimate synthetic difficulty from fragment contributions. Retrosynthetic analysis tools (ASKCOS, IBM RXN, Syntheseus) plan synthesis routes using template-based or template-free neural models.

Molecular Fingerprints

Binary or count vectors encoding molecular structure for similarity searching and ML.

Types

ECFP (Extended Connectivity Fingerprints): Circular fingerprints generated by Morgan algorithm. Each atom initializes with an identifier; iteratively updated by incorporating neighbor information up to a specified radius (ECFP4 = radius 2). Hashed to fixed-length bit vectors (1024 or 2048 bits). The most widely used fingerprint in cheminformatics.
MACCS keys: 166 predefined structural keys (presence/absence of specific substructures)
Topological (Daylight): Path-based fingerprints encoding all linear paths up to a given length
RDKit fingerprint: Topological fingerprint implementation in RDKit
Atom-pair and topological torsion: Encode pairs/triplets of atoms with topological distance

Similarity Metrics

Tanimoto coefficient: T(A,B) = |A intersection B| / |A union B|. Standard for binary fingerprints. T > 0.85 on ECFP4 generally indicates structural analogs with similar activity (the "similarity principle").

RDKit

Open-source cheminformatics toolkit (Python/C++). Core capabilities:

Molecular I/O: Parse SMILES, SDF, MOL2; generate 2D/3D coordinates (ETKDG algorithm)
Descriptor calculation: >200 descriptors; Morgan fingerprints; pharmacophore fingerprints
Substructure search: SMARTS pattern matching
Chemical transformations: Reaction SMARTS for virtual reactions
Conformer generation: Distance geometry with force field optimization (MMFF94, UFF)
Visualization: 2D depiction, highlighting, alignment

Drug Repurposing

Identify new therapeutic indications for existing drugs, reducing development time and cost.

Computational Approaches

Network-based: Drug-target-disease networks; proximity of drug targets to disease genes in protein-protein interaction networks (Guney et al.)
Signature matching: Compare drug-induced gene expression changes (L1000/CMap) with disease expression signatures; anticorrelated signatures suggest therapeutic potential
Knowledge graph embedding: Embed drugs, targets, diseases in a shared vector space (TransE, RotatE, ComplEx); predict missing links. DRKG, Hetionet.
Clinical data mining: Electronic health records, adverse event databases (FAERS); identify unexpected beneficial effects of prescribed drugs
Molecular similarity: Screen approved drug libraries against new targets using docking or fingerprint similarity

Success Stories

Thalidomide (repurposed for multiple myeloma), sildenafil (hypertension to erectile dysfunction), remdesivir (Ebola to COVID-19). Computational repurposing identified baricitinib for COVID-19 using knowledge graph approaches.