Molecular Dynamics and Binding Free Energy Protocol

Hit Validation by MD Simulation, MM-PBSA, and ΔG Calculation — Methodology, Parameters, and Scientific Basis

Overview

Molecular docking identifies candidate compounds based on shape complementarity and an approximated scoring function. It is fast and accurate enough to narrow a library of millions down to a shortlist of tens, but the scoring function is a deliberate simplification: it ignores explicit solvent, protein and ligand flexibility on timescales beyond a single binding event, and the entropic cost of binding. For compounds that survive docking, these omissions matter. A compound that scores well in a rigid-receptor calculation may prove unstable once the binding site breathes, or its apparent affinity may be dominated by a favorable pose geometry that the protein cannot actually maintain in solution.

Autochem's hit validation pipeline addresses this gap by simulating the top docking hits using all-atom molecular dynamics (MD) in explicit solvent, and then deriving a rigorous estimate of the binding free energy (ΔG) from those trajectories using Molecular Mechanics / Poisson-Boltzmann Surface Area (MM-PBSA) analysis. The result is a ranked list of compounds supported by physics-based free energy estimates and dynamic stability metrics — information that is directly interpretable by medicinal chemists and that correlates far better with experimental binding data than docking scores alone.


Scientific Background

Why Molecular Dynamics?

Proteins are not rigid. The binding site of a drug target undergoes conformational fluctuations on timescales from picoseconds (side-chain rotations) to microseconds and beyond (domain movements, loop openings). A compound that binds well to the crystallographic conformation may fail to maintain that contact pattern as the pocket breathes in solution, or it may bind more tightly than the static structure suggests because an induced-fit mechanism closes the pocket around it. Neither effect is visible in a single rigid-receptor docking calculation.

MD simulation propagates the protein–ligand complex forward in time by numerically integrating Newton's equations of motion under a classical force field. At each time step, forces on every atom are computed from bonded terms (bond stretching, angle bending, dihedral torsion) and non-bonded terms (van der Waals interactions and electrostatics). Because the simulation is conducted in explicit solvent with physiologically realistic salt concentrations, the protein, ligand, and water molecules all respond to each other, and the trajectory captures the ensemble of thermally accessible conformations under biologically relevant conditions.

MD-based hit validation has a long track record in drug discovery. Clinically approved compounds including the HIV protease inhibitors and kinase inhibitors from programs at Novartis, Merck, and Roche were characterized by MD as part of their lead optimization. Landmark academic studies established that MD-derived stability metrics — particularly the root-mean-square deviation (RMSD) of the ligand relative to its initial docked pose — are strong predictors of whether a compound will bind experimentally.[1][2] MD also exposes false positives from docking: compounds that adopt an improbable geometry in the rigid receptor will relax to an unfavorable pose or unbind entirely on nanosecond timescales, a failure mode that the subsequent MM-PBSA step will detect as a large, unfavorable ΔG.

Why MM-PBSA?

The gold standard for computing absolute binding free energies is alchemical free energy perturbation (FEP) or thermodynamic integration (TI). These methods are highly accurate, but they require weeks of compute time per compound and are practical only for late-stage candidate optimization of a small number of molecules.

MM-PBSA (and its related variant MM-GBSA) was introduced by Kollman and coworkers in 2000[3] as a practical compromise between docking scores and full FEP calculations. The method computes the binding free energy as the average over a set of snapshots extracted from the MD trajectory:

ΔG_bind = ⟨G_complex⟩ − ⟨G_protein⟩ − ⟨G_ligand⟩

where each free energy term is decomposed as:

G = E_MM + G_solv − T·S_conf

The molecular mechanics energy (EMM) comprises bonded terms (bonds, angles, dihedrals) and non-bonded terms (electrostatics and van der Waals interactions in vacuum). The solvation free energy (Gsolv) is split into a polar component, computed by solving the Poisson–Boltzmann (PB) equation with a finite-difference solver, and a nonpolar component, estimated from the solvent-accessible surface area. The conformational entropy (−TSconf) is estimated using the Interaction Entropy (IE) method, which derives the entropic cost of ligand binding from fluctuations of the protein–ligand interaction energy over the trajectory.[4] IE is computationally cheaper than quasiharmonic analysis and has been shown to produce results in better agreement with experimental binding affinities for drug-like compounds.

MM-PBSA has been validated across hundreds of protein–ligand systems and consistently yields binding free energies with errors of 1–3 kcal/mol relative to experiment when applied carefully — well within the range needed to distinguish potent from weak binders in a lead optimization campaign.[5][6] Crucially, MM-PBSA is fast enough to run on the same trajectory used for MD hit validation, requiring no additional simulation.


Pipeline Overview

The MD hit validation and ΔG calculation workflow consists of six sequential stages, each implemented as an independent distributed pipeline:

  1. System Preparation — force field assignment, ligand parameterization, complex assembly
  2. Simulation Box Setup — solvation and ionization
  3. Energy Minimization — geometric relaxation
  4. Equilibration — NVT and NPT phases to reach thermodynamic equilibrium
  5. Production MD — the main simulation from which trajectory data are collected
  6. Analysis — trajectory processing, RMSD/RMSF metrics, MM-PBSA, ΔG estimation, and per-residue energy decomposition

Each stage is implemented as an Apache Beam pipeline running on Google Cloud Dataflow. Multiple protein–ligand complexes (by default, the top 10 compounds by docking score) are processed in parallel, with each complex progressing through the pipeline independently.


Stage 1: System Preparation

1a. Protein Force Field Assignment with AMBER99SB-ILDN

The prepared receptor structure — already fixed, protonated at pH 7.4, and energy-minimized in the docking pipeline — is processed by gmx pdb2gmx to assign force field parameters and generate GROMACS topology files. The default protein force field is AMBER99SB-ILDN, an improved variant of the widely used AMBER99SB force field. The ILDN suffix denotes optimized side-chain torsion potentials for isoleucine (I), leucine (L), aspartate (D), and asparagine (N), reparameterized against NMR scalar coupling data to correct systematic errors in their rotamer distributions.[7] These corrections are important for binding site residues, where side-chain orientation directly governs ligand recognition.

The -ignh flag instructs pdb2gmx to discard and replace the hydrogen atoms added during receptor preparation in the docking pipeline, and to re-add them according to AMBER99SB-ILDN's internal hydrogen-placement rules, ensuring consistency with the new force field's charge model.

1b. Water Model: TIP3P

The TIP3P (Transferable Intermolecular Potential 3-Point) water model is used throughout the simulation.[8] TIP3P was designed for condensed-phase simulations and is the water model on which the AMBER force field family was validated; using it with AMBER99SB-ILDN ensures that protein–water interactions are treated on a consistent electrostatic basis. TIP3P reproduces the bulk density and diffusion coefficient of liquid water well at 300–320 K, which spans the physiological temperature range used in the simulations.

1c. Ligand Parameterization with GAFF2 and AM1-BCC Charges

Ligands require their own force field, because AMBER99SB-ILDN covers only the 20 standard amino acids. Each ligand is parameterized using the General AMBER Force Field 2 (GAFF2),[9] a general-purpose organic small molecule force field designed to be compatible with AMBER protein force fields. GAFF2 covers a broad range of drug-relevant chemical space, including aromatic rings, halogens, heterocycles, and common functional groups.

Partial atomic charges are assigned using the AM1-BCC (Austin Model 1 with Bond Charge Corrections) method.[10] AM1-BCC uses semi-empirical AM1 quantum chemistry to calculate the electronic wavefunction, then applies bond charge corrections derived from fits to high-level quantum mechanical data. The resulting charge set produces electrostatic potentials in close agreement with HF/6-31G* RESP charges — the reference standard for AMBER parameterization — at a fraction of the computational cost, making it tractable for the batch processing of multiple ligands in parallel.

The parameterization is performed by ACPYPE (Antechamber Python Parser Interface), which wraps the Antechamber tool from the AMBER suite to generate GROMACS-compatible topology (.itp) and structure files. Ligand PDBQT files from the docking stage are converted to SDF format using OpenBabel prior to ACPYPE processing.

1d. Complex Assembly

The prepared protein and each ligand are merged into a single complex structure using BioPython's PDB I/O tools. Chain conflicts are resolved by assigning the ligand a unique chain letter. The ligand topology is incorporated into the system-wide topol.top file via an #include directive inserted immediately after the protein force field include, ensuring that GROMACS can resolve all bonded and non-bonded parameters for the full complex.


Stage 2: Simulation Box Setup, Solvation, and Ionization

2a. Simulation Box

The protein–ligand complex is centered in a triclinic periodic simulation box using gmx editconf, with a minimum distance of 1.0 nm between any heavy atom of the solute and the nearest box face. The triclinic geometry reduces the number of solvent molecules required by approximately 30% compared to a cubic box of equivalent solute-to-face distance, giving shorter runtimes without compromising physical validity. As long as the minimum image convention is satisfied (the box is large enough that the solute does not interact with its own periodic image), the box geometry does not affect the dynamics.

2b. Solvation

The box is filled with TIP3P water molecules using gmx solvate with the spc216.gro pre-equilibrated water box as the solvent template. spc216.gro provides a realistic initial water geometry at 300 K, avoiding the long equilibration times that would be needed if water molecules were placed randomly.

2c. Ionization

Physiological ionic strength is established by adding Na⁺ and Cl⁻ ions to a concentration of 0.15 M using gmx genion. This matches normal human plasma ionic strength and is important for screening long-range electrostatics between charged surface residues and for representing ions present at the binding sites of many drug targets (kinases, ion channels, nucleic-acid binding proteins). The -neutral flag ensures that enough counterions are added to neutralize the net charge of the complex before the target ionic strength is reached, preventing artifacts from a net-charged periodic system.


Stage 3: Energy Minimization

Before dynamics can begin, the system must be brought to a local energy minimum. The combined operations of force field assignment, water placement, and ion insertion introduce bond strain, angular distortions, and steric clashes that would cause atomic forces to be unrealistically large if dynamics were started immediately.

gmx grompp and gmx mdrun are used to run steepest descent minimization for up to 50,000 steps, stopping early if the maximum force drops below 10 kJ/mol/nm. Steepest descent is used rather than the more sophisticated L-BFGS algorithm because it is more robust in the presence of severe clashes that arise after solvation, converging reliably even when the initial potential energy is very high. No position restraints are applied during energy minimization — the protein, ligand, and water molecules are all free to relax, which allows the solvent shell around the ligand and perturbed side chains to find their optimal geometry.


Stage 4: Equilibration

Equilibration is the most critical and most often underappreciated stage in an MD protocol. Running the production simulation before the system has reached thermodynamic equilibrium would produce trajectories dominated by the initial conditions, making computed observables meaningless. The equilibration is performed in two sequential phases: NVT (constant number of particles, volume, and temperature) followed by NPT (constant number of particles, pressure, and temperature).

4a. NVT Equilibration — Temperature Control at Constant Volume

The NVT phase runs for 50,000 steps at a 2 fs time step, corresponding to 100 ps of simulation. The purpose is to bring the system to the target temperature while keeping the volume fixed, so that the box density set during solvation is not perturbed before the thermostat has stabilized.

Temperature: The default target temperature is 310 K (37 °C), body temperature for human targets. For projects where the target organism has a different physiological temperature, the temperature can be set accordingly before equilibration; once set, it is locked for all subsequent stages to ensure physical consistency.

Thermostat — V-rescale: Temperature is controlled with the velocity rescaling thermostat (V-rescale),[11] a modification of the Berendsen thermostat that adds a stochastic correction term to reproduce the correct canonical (NVT) distribution of kinetic energies. Berendsen coupling alone produces the wrong kinetic energy distribution for the protein and solvent separately, leading to equipartition errors that can bias the dynamics of flexible biomolecules. V-rescale corrects this while retaining the numerical stability and fast convergence of Berendsen coupling.

Two temperature coupling groups are defined — the Protein + Ligand group and Water + Ions — each coupled independently with a time constant τT = 0.1 ps. Separate coupling prevents the "hot solvent / cold solute" artifact that arises when energy flows slowly between protein and solvent degrees of freedom.

Constraints — LINCS: All bonds involving hydrogen atoms are constrained using the LINCS (LINear Constraint Solver) algorithm[12] with lincs_iter = 1 and lincs_order = 4. Constraining hydrogen bonds allows the use of a 2 fs time step (compared to 1 fs without constraints) by removing the fastest bond vibrations from the equation of motion; those vibrations carry little thermodynamic weight but would otherwise limit the time step.

Electrostatics — Particle Mesh Ewald: Long-range electrostatic interactions are computed using Particle Mesh Ewald (PME) summation[13] with a real-space cutoff of 1.2 nm, cubic spline interpolation (pme_order = 4), and a FFT grid spacing of 0.16 nm. PME treats long-range electrostatics exactly within a periodic system by decomposing the electrostatic sum into a short-range real-space contribution and a long-range reciprocal-space contribution. Using a plain cutoff in a charged biomolecular system introduces errors of several kcal/mol, which would render any downstream free energy calculation unreliable.

Van der Waals: Short-range vdW interactions use a cutoff of 1.2 nm with a force-switch modifier that smoothly decays the vdW forces to zero between 1.0 and 1.2 nm, rather than truncating them abruptly. Abrupt truncation introduces discontinuities that produce systematic errors in pressure and density calculations.

Position Restraints: A flat-bottom position restraint (-DPOSRES) is applied to protein and ligand heavy atoms during both equilibration phases, preventing the solute from drifting while water molecules equilibrate around it.

Initial velocities are assigned from a Maxwell–Boltzmann distribution at 310 K using a random seed, providing each simulation with a statistically independent starting point.

The NVT molecular dynamics parameter file (.mdp) uses the following key settings:

ParameterValueDescription
integratormdLeap-frog integrator
nsteps50,000100 ps total at 2 fs/step
dt0.002 ps2 fs time step
tcouplV-rescaleModified Berendsen thermostat
tau_t0.1 psTemperature coupling time constant
ref_t310 KTarget temperature (physiological)
pcouplnoNo pressure coupling in NVT
constraintsh-bondsLINCS on all bonds to hydrogen
coulombtypePMEParticle Mesh Ewald
rcoulomb1.2 nmElectrostatic real-space cutoff
rvdw1.2 nmvdW cutoff with force-switch at 1.0 nm
fourierspacing0.16 nmPME reciprocal-space grid spacing
gen_velyesVelocities from Maxwell–Boltzmann

4b. NPT Equilibration — Pressure Control at Constant Temperature

After temperature has been equilibrated in the NVT phase, the simulation continues at constant pressure to allow the box volume to converge to its equilibrium value. The NPT phase runs for 50,000 steps (100 ps) at the same temperature and with the same thermostat, electrostatic, and vdW settings. Position restraints remain active.

Pressure coupling — Berendsen barostat: Pressure is controlled with the Berendsen barostat at 1.0 bar (atmospheric pressure) with an isothermal compressibility of 4.5 × 10⁻⁵ bar⁻¹ (the experimentally measured value for liquid water at 310 K) and a coupling time constant of τP = 2.0 ps. Berendsen coupling is acceptable during equilibration because the goal is rapid convergence to the correct average density, not correct sampling of the NPT ensemble. The Berendsen barostat is switched to Parrinello-Rahman in the production phase, where correct sampling of pressure fluctuations is essential for thermodynamic accuracy.

ParameterValueDescription
pcouplBerendsenPressure coupling for fast density convergence
pcoupltypeisotropicUniform scaling of all box vectors
tau_p2.0 psPressure coupling time constant
ref_p1.0 barAtmospheric reference pressure
compressibility4.5 × 10⁻⁵ bar⁻¹Isothermal compressibility of water at 310 K
refcoord_scalingcomScale center-of-mass coordinates with box

Stage 5: Production MD Simulation

5a. Simulation Parameters

The production run is launched from the final NPT structure and checkpoint, ensuring continuity of positions, velocities, and box geometry from equilibration. The key parameters are:

ParameterValueRationale
Time step (dt)2 fsMaximum stable step with LINCS H-bond constraints
Default simulation length20 nsSufficient for binding site stability assessment and MM-PBSA sampling
Temperature310 K (locked from equilibration)Physiological, consistent with equilibration
ThermostatV-rescale, τT = 0.1 psCorrect canonical sampling
BarostatParrinello-Rahman, τP = 2.0 psCorrect NPT ensemble statistics
Reference pressure1.0 barAtmospheric
Isothermal compressibility4.5 × 10⁻⁵ bar⁻¹Water at 310 K
ElectrostaticsPME, rcoul = 1.2 nmLong-range accuracy
vdWForce-switch, rvdw = 1.2 nmSmooth truncation
ConstraintsLINCS, h-bonds2 fs time step
Saved compressed frames250Trajectory sampling for analysis and visualization

The switch from Berendsen to Parrinello-Rahman barostat[14] for production is deliberate and important. The Berendsen barostat gives correct average volumes but incorrect pressure fluctuations. Parrinello-Rahman produces the correct isothermal-isobaric (NPT) ensemble — including the correct distribution of volume fluctuations — which is essential when computing thermodynamic averages from the trajectory. This distinction matters for MM-PBSA: entropy estimates and solvation free energies are sensitive to the ensemble statistics.

5b. GPU Acceleration

Production runs are GPU-accelerated using NVIDIA CUDA. Short-range non-bonded interactions (-nb gpu), PME reciprocal space calculations (-pme gpu), and bonded interactions (-bonded gpu) are all offloaded to the GPU, with CPU–GPU workload balancing enabled (-pin on). CUDA graph capture (GMX_CUDA_GRAPH=1) reduces CPU kernel launch overhead. These settings allow modern GPUs to achieve 50–200 ns/day for typical drug-like protein–ligand systems (50–100 kDa protein, MW 300–600 Da ligand), making 20 ns simulations feasible within hours rather than days.

5c. Simulation Length and Convergence

Twenty nanoseconds represents a balance between computational cost and physical rigor. For most protein–ligand complexes with molecular weights in the 50–100 kDa range:

  • Binding site side chains equilibrate within the first 1–2 ns.
  • Ligand RMSD converges to a stable plateau within 2–5 ns for genuine binders.
  • MM-PBSA estimates derived from the plateau region of a 10–20 ns trajectory typically have standard deviations of 0.5–1.5 kcal/mol when averaged over 100 frames.

Simulations can be extended in increments using checkpoint files (gmx convert-tpr -extend), allowing users to run longer simulations for flexible targets, large binding sites, or when convergence diagnostics suggest the trajectory has not yet reached steady state. Extended trajectories are concatenated with the original and re-analyzed to update all metrics.


Stage 6: Trajectory Analysis, MM-PBSA, and ΔG Estimation

6a. Trajectory Post-Processing

Before any analysis, the raw production trajectory is post-processed to remove artifacts of the periodic boundary conditions (PBC):

  1. PBC jump removal (gmx trjconv -pbc nojump): Prevents molecules from appearing to teleport across the box boundary when they cross a periodic image.
  2. Molecule reconstitution (gmx trjconv -pbc mol -ur compact): Reassembles molecules split across box boundaries, which occurs for flexible loops that extend beyond the simulation cell.
  3. Rotational and translational fitting (gmx trjconv -fit rot+trans): Removes overall translational and rotational diffusion of the protein complex by fitting each frame to the initial structure. This is necessary because RMSD and RMSF calculations measure internal structural changes, not rigid-body motion of the whole complex.

All three steps operate on the Protein + Ligand group, which contains the protein and ligand heavy atoms but excludes solvent and ions. A separate visualization trajectory is generated by downsampling to approximately 10 frames per nanosecond for interactive rendering in Molstar.

6b. RMSD Analysis and Equilibration Onset Detection

The root-mean-square deviation (RMSD) of both the protein backbone (Cα atoms) and the ligand heavy atoms relative to the initial structure is calculated from the fitted trajectory using gmx rms.

Two RMSD curves are computed:

  • Protein Cα RMSD: Measures the overall conformational stability of the receptor. Large, sustained drifts (e.g., RMSD > 3 Å) indicate major conformational rearrangements.
  • Ligand heavy-atom RMSD: The most direct indicator of binding mode stability. Ligands that maintain their docked pose show RMSD < 1–2 Å throughout the trajectory; unstable ligands exhibit increasing RMSD before potentially unbinding or adopting an alternative pose.

A key output is the automated detection of the equilibration onset — the frame at which the RMSD stabilizes into a plateau. The algorithm divides the trajectory into overlapping windows of 1/5 of the total frame count, computes the standard deviation within each window, and identifies the first sequence of five or more consecutive windows whose standard deviation falls below 50% of the baseline (computed from the first 30% of windows). The frame at the beginning of this plateau is recorded as the start of the production-phase ensemble. All downstream MM-PBSA calculations draw frames exclusively from this equilibrated portion, not from the initial transient during which the system is still relaxing.

This adaptive onset detection is important for accuracy: including pre-equilibration frames in the MM-PBSA ensemble biases computed average energies toward the initial structure rather than the true equilibrium ensemble, which would systematically distort the ΔG estimate.

6c. RMSF Analysis

The root-mean-square fluctuation (RMSF) per residue measures the time-averaged atomic mobility at each position along the protein sequence. High RMSF values in the binding site indicate a flexible, dynamic pocket; low values indicate a rigid, well-defined site. RMSF maps are presented for all simulated complexes as smoothed plots (Savitzky-Golay filter, window length 11, polynomial order 3), enabling rapid visual identification of mobile regions outside the ligand's footprint versus mobile regions within the contact interface — the latter being a potential source of pose instability.

6d. MM-PBSA Calculation

Frame Selection

MM-PBSA is run using gmx_MMPBSA v1.6.3,[15] a Python tool that automates the conversion of GROMACS trajectories and topologies into AMBER format for energy calculation. The analysis is performed on up to 100 frames sampled from the equilibrated portion of the trajectory (from the automatically detected plateau onset to the end of the simulation). The frame interval is calculated dynamically as (end_frame − start_frame) / 100, ensuring uniform coverage of the equilibrated ensemble regardless of trajectory length. Using 100 frames has been shown in benchmark studies to be sufficient to achieve convergence in MM-PBSA estimates to within 0.5–1 kcal/mol for most protein–ligand systems.[6]

Poisson-Boltzmann Solvation

The polar solvation contribution is computed by solving the linearized Poisson-Boltzmann equation numerically on a finite-difference grid. The key parameters are:

ParameterValueDescription
ipb2Full nonlinear PB dielectric model
inp1Nonpolar solvation by decomposition
indi1.0Internal (protein) dielectric constant
exdi80.0External (water) dielectric constant
prbrad1.4 ÅSolvent probe radius
radiopt1Optimized atomic radii
scale2.0Grid spacing = 0.5 Å
nfocus2Two-stage electrostatic focusing
bcopt5Coulombic boundary conditions

The internal dielectric constant (indi = 1.0) is set to the vacuum value, reflecting the assumption that the protein interior is modeled explicitly by the atomic force field (AMBER99SB-ILDN), with no additional electronic polarizability beyond what is included in the fixed partial charges. A value of 1.0 is the most physically consistent choice when the protein is represented at all-atom resolution with fixed charges.[5]

The external dielectric constant (exdi = 80.0) corresponds to bulk water at physiological temperature. The large dielectric discontinuity at the protein–solvent boundary is the physical basis of the PB model: polar groups buried in the protein interior pay a solvation penalty because they are removed from the high-dielectric water environment.

Two-stage electrostatic focusing (nfocus = 2, fscale = 8) computes the electrostatic potential first on a coarse grid (grid spacing 4 Å) that extends far from the solute to handle boundary conditions accurately, then on a fine grid (0.5 Å) centered on the solute. This focusing procedure is the standard approach in PB calculations for biomolecules and improves accuracy at the protein surface and active site without requiring an impractically large fine grid.

The nonpolar solvation energy is computed using the decomposition approach (inp = 1):[16] the cavity formation energy is estimated from the solvent-accessible surface area using a surface tension of 0.0378 kcal/mol·Å² and an offset of −0.5692 kcal/mol, and the dispersion interaction between solute and solvent is computed separately using a solvent probe radius of 0.557 Å. This decomposed nonpolar model is more physically rigorous than the simpler SASA-only model and improves agreement with explicit-solvent free energy calculations.

Force Fields for AMBER Topology

gmx_MMPBSA converts the GROMACS trajectory and topology into AMBER format for energy re-evaluation. The protein parameters are mapped to the oldff/leaprc.ff99SB (AMBER99SB) force field and the ligand to leaprc.gaff, consistent with the GAFF2 parameters used during the MD simulation. This consistency between the MD force field and the MM-PBSA re-evaluation force field is essential: using different force fields for the simulation and the energy calculation would introduce systematic errors in the computed binding energies.

Interaction Entropy

The entropic cost of binding is estimated using the Interaction Entropy (IE) method introduced by Duan et al.:[4]

−TΔS_bind ≈ −kT ln ⟨exp(−β ΔE_int)⟩

where ΔEint is the fluctuation of the protein–ligand interaction energy around its mean, and the average is taken over the trajectory. IE is computed in trajectory segments of 25% of the total frame count, providing an estimate of the entropic contribution that converges faster than quasiharmonic analysis and is better suited to the 20 ns simulation timescale.

Per-Residue Energy Decomposition

In addition to the total binding free energy, the pipeline computes a per-residue contribution to ΔG for all residues within 6 Å of the ligand. The decomposition uses the pairwise interaction scheme, which assigns to each residue the sum of its vdW and electrostatic interactions with the ligand weighted by the ligand distance.

Per-residue decomposition is a powerful tool for medicinal chemistry: it identifies which amino acids in the binding site contribute the most favorable energy toward binding, and — equally important — which residues contribute unfavorable energy (positive ΔG contributions) that could be addressed by modifying the ligand's functional groups. The decomposition output is provided in CSV format for quantitative analysis.


Summary of Reported Metrics

After the full pipeline completes, the following metrics are reported for each simulated complex:

MetricDescriptionInterpretation
Protein Cα RMSDTime series, mean ± SDReceptor conformational stability
Ligand RMSDTime series, mean ± SDBinding mode retention throughout simulation
RMSF per residuePer-residue time-averaged fluctuationBinding site flexibility
Equilibration onsetFrame where RMSD plateau beginsQuality of equilibration for MM-PBSA sampling
ΔGbind (MM-PBSA)Mean ± SD in kcal/molPredicted binding free energy
ΔH (MM energy)Mean electrostatic + vdW contributionsEnthalpic driving force
ΔGpolar solvationMean PB polar solvation penaltyDesolvation cost
ΔGnonpolar solvationMean cavity + dispersion termHydrophobic contribution
−TΔS (IE entropy)Interaction Entropy estimateEntropic cost of binding
Per-residue ΔGDecomposition for all residues within 6 ÅHot-spot identification for medicinal chemistry
Final-frame structurePDB of last simulation frameVisual inspection of bound pose after simulation
Visualization trajectoryDownsampled XTC for interactive viewingReal-time trajectory playback in Molstar

Summary of Tools

StageToolPurpose
Protein force fieldGROMACS pdb2gmx (AMBER99SB-ILDN)Protein topology generation
Water modelTIP3P / spc216.groExplicit solvent
Ligand chargesAM1-BCC (via ACPYPE / Antechamber)Partial charge assignment
Ligand force fieldGAFF2Small molecule parameters
Complex assemblyBioPython PDBIOProtein–ligand complex construction
Simulation boxGROMACS editconf (triclinic, 1.0 nm)Periodic boundary conditions
SolvationGROMACS solvateWater placement
IonizationGROMACS genion (0.15 M NaCl)Physiological ionic strength
Energy minimizationGROMACS mdrun (steepest descent, 50k steps)Geometric relaxation
NVT equilibrationGROMACS mdrun (V-rescale, 100 ps, 310 K)Temperature equilibration
NPT equilibrationGROMACS mdrun (Berendsen, 100 ps, 1 bar)Pressure / density equilibration
Production MDGROMACS mdrun (Parrinello-Rahman, 20 ns)Trajectory generation
GPU accelerationNVIDIA CUDA via GROMACSNB, PME, and bonded interactions on GPU
Trajectory processingGROMACS trjconvPBC removal, centering, fitting
RMSD / RMSFGROMACS rms / rmsfStability and flexibility metrics
Plateau detectionAdaptive windowed std-dev algorithmEquilibration onset identification
MM-PBSAgmx_MMPBSA v1.6.3 (PB + IE)Binding free energy
PB solverAMBER PB (linearized, finite-difference)Polar solvation free energy
EntropyInteraction Entropy methodEntropic contribution to ΔG
Per-residue decompositiongmx_MMPBSA pairwise (within 6 Å)Hot-spot identification
Distributed computeApache Beam / Google Cloud DataflowParallel multi-complex execution

Infrastructure

Each protein–ligand complex is processed as an independent Dataflow job, allowing all selected hit compounds to be simulated in parallel. Equilibration and production stages run on GPU-equipped Dataflow workers provisioned on demand; trajectory analysis and MM-PBSA run on CPU workers. Trajectory files (XTC, TRR), topology files, and analysis results (XVG, CSV, PNG) are stored in Google Cloud Storage. Binding free energy results, frame counts, simulation lengths, and RMSD statistics are written to Cloud SQL (MySQL) and surfaced in the Autochem interface.

References

  1. Perez-Benito, L. et al. (2019). Predicting binding free energies of drugs in molecular dynamics simulations: The effects of ligand protonation states. Frontiers in Chemistry, 7, 198. https://doi.org/10.3389/fchem.2019.00198
  2. Gaieb, Z. et al. (2018). D3R Grand Challenge 2: Blind prediction of protein–ligand poses, affinity rankings, and relative binding free energies. Journal of Computer-Aided Molecular Design, 32, 1–20. https://doi.org/10.1007/s10822-017-0088-4
  3. Kollman, P.A. et al. (2000). Calculating structures and free energies of complex molecules: combining molecular mechanics and continuum models. Accounts of Chemical Research, 33(12), 889–897. https://doi.org/10.1021/ar000033j
  4. Duan, L. et al. (2016). Interaction entropy: a new paradigm for highly efficient and reliable computation of protein–ligand binding free energy. Journal of the American Chemical Society, 138(17), 5722–5728. https://doi.org/10.1021/jacs.6b02682
  5. Genheden, S. and Ryde, U. (2015). The MM/PBSA and MM/GBSA methods to estimate ligand-binding affinities. Expert Opinion on Drug Discovery, 10(5), 449–461. https://doi.org/10.1517/17460441.2015.1032936
  6. Wang, E. et al. (2019). End-point binding free energy calculation with MM/PBSA and MM/GBSA: strategies and applications in drug design. Chemical Reviews, 119(16), 9478–9508. https://doi.org/10.1021/acs.chemrev.9b00055
  7. Lindorff-Larsen, K. et al. (2010). Improved side-chain torsion potentials for the Amber ff99SB protein force field. Proteins: Structure, Function, and Bioinformatics, 78(8), 1950–1958. https://doi.org/10.1002/prot.22711
  8. Jorgensen, W.L., Chandrasekhar, J., Madura, J.D., Impey, R.W., Klein, M.L. (1983). Comparison of simple potential functions for simulating liquid water. Journal of Chemical Physics, 79(2), 926–935. https://doi.org/10.1063/1.445869
  9. Wang, J., Wolf, R.M., Caldwell, J.W., Kollman, P.A., Case, D.A. (2004). Development and testing of a general amber force field. Journal of Computational Chemistry, 25(9), 1157–1174. https://doi.org/10.1002/jcc.20035
  10. Jakalian, A., Jack, D.B., Bayly, C.I. (2002). Fast, efficient generation of high-quality atomic charges. AM1-BCC model: II. Parameterization and validation. Journal of Computational Chemistry, 23(16), 1623–1641. https://doi.org/10.1002/jcc.10128
  11. Bussi, G., Donadio, D., Parrinello, M. (2007). Canonical sampling through velocity rescaling. Journal of Chemical Physics, 126(1), 014101. https://doi.org/10.1063/1.2408420
  12. Hess, B., Bekker, H., Berendsen, H.J.C., Fraaije, J.G.E.M. (1997). LINCS: A linear constraint solver for molecular simulations. Journal of Computational Chemistry, 18(12), 1463–1472. https://doi.org/10.1002/(SICI)1096-987X(199709)18:12<1463::AID-JCC4>3.0.CO;2-H
  13. Essmann, U. et al. (1995). A smooth particle mesh Ewald method. Journal of Chemical Physics, 103(19), 8577–8593. https://doi.org/10.1063/1.470117
  14. Parrinello, M. and Rahman, A. (1981). Polymorphic transitions in single crystals: a new molecular dynamics method. Journal of Applied Physics, 52(12), 7182–7190. https://doi.org/10.1063/1.328693
  15. Valdés-Tresanco, M.S., Valdés-Tresanco, M.E., Valiente, P.A., Moreno, E. (2021). gmx_MMPBSA: A new tool to perform end-state free energy calculations with GROMACS. Journal of Chemical Theory and Computation, 17(10), 6281–6291. https://doi.org/10.1021/acs.jctc.1c00645
  16. Tan, C., Tan, Y.H., Luo, R. (2007). Implicit nonpolar solvent models. Journal of Physical Chemistry B, 111(42), 12263–12274. https://doi.org/10.1021/jp073399n