Overview
Autochem is a computational drug discovery platform developed by Pauling.AI. Its core workflow takes a protein target — identified by a PDB ID, UniProt accession, gene symbol, or an uploaded structure file — and returns a ranked list of candidate small molecules predicted to bind that target.
The entire pipeline is orchestrated by an AI agent that calls individual microservices in sequence. Each service is implemented as a distributed Apache Beam pipeline, capable of running at scale on Google Cloud Dataflow. The design principle throughout is to invest heavily in preparation quality: errors introduced early compound through every downstream step, so each stage exists to reduce a specific class of artifact or uncertainty that would otherwise corrupt docking scores.
Pipeline Steps
Step 1: Target Identification and Validation
Before downloading anything, the input identifier is validated against three databases in priority order: RCSB PDB, UniProt, and HGNC. This normalization step determines the identifier type (pdb_id, uniprot_accession, uniprot_entry_name, or gene_symbol) and ensures all downstream steps operate on a confirmed, canonical identifier. This prevents silent failures where a mistyped or ambiguous identifier would propagate through the pipeline and produce meaningless results.
Step 2: Protein Metadata Download
Once the identifier is validated, protein metadata is fetched from UniProt. This step resolves cross-references between databases and records gene names, organism, functional annotations, and known isoforms. Having this metadata before structure download allows the pipeline to select the biologically relevant chain and to flag cases where multiple orthologues or isoforms exist.
Step 3: Protein Structure Acquisition
The 3D structure is downloaded in one of two ways depending on the identifier type.
Experimental structures (RCSB PDB): The pipeline fetches the PDB record and chain information, downloads the mmCIF file alongside the PDB file, converts mmCIF to PDB when no PDB file is available using the Gemmi library, and splits multi-model structures (e.g., NMR ensembles) into individual conformations for independent evaluation. It also queries the OPM (Orientations of Proteins in Membranes) database to flag membrane proteins, which require special handling, and fetches binding site annotations from NCBI for each chain to inform pocket selection downstream.
Starting from an experimental structure is strongly preferred when one is available. X-ray crystal structures and cryo-EM structures capture the true three-dimensional arrangement of atoms in the binding site, including water molecules, metal ions, and co-crystallized ligands that can reveal the binding mode directly. The resolution of the structure is an important quality indicator: higher-resolution structures (e.g., below 2 Å) provide more reliable atomic coordinates for docking.
Predicted structures (AlphaFold DB): For targets without an experimentally determined structure, the pipeline fetches the predicted structure from the AlphaFold Protein Structure Database. AlphaFold 2 [1] uses a deep learning architecture trained on the Protein Data Bank to predict atomic-resolution structures with accuracy approaching experimental methods for many protein families. Per-residue confidence scores (pLDDT) are available in the AlphaFold output and can flag regions where the predicted geometry is uncertain — these regions, particularly disordered loops, may need extra scrutiny before docking.
Step 4: Receptor Preparation — Fixing, Reducing, and Protonating
This is the most critical preparation stage. Raw PDB files from public databases are almost always unsuitable for docking in their downloaded form. The problems are systematic:
- Missing hydrogens. X-ray crystallography does not resolve hydrogen atoms at typical resolutions, so deposited PDB files contain only heavy atoms. However, hydrogen atoms are essential for docking: they participate in hydrogen bonds, alter the shape and electrostatic surface of the binding site, and are required by AutoDock-family scoring functions to compute partial charges. Docking without hydrogens would miss all hydrogen-bond interactions.
- Missing atoms and residues. Flexible loops and termini are often disordered in crystal structures and absent from the PDB file. If a missing loop borders the binding site, leaving it absent creates an artificial opening that allows ligands to dock in geometrically impossible poses.
- Non-standard residues. Modified amino acids (phosphoserine, selenomethionine, etc.) and crystallization artifacts must be converted to their standard equivalents or removed, because force fields do not have parameters for them.
- Incorrect protonation states. The protonation state of titratable residues (histidine, aspartate, glutamate, lysine, arginine, cysteine, tyrosine) depends on the local pH and electrostatic environment. Deposited structures are solved at a range of pH conditions and contain no explicit proton information. Using the wrong protonation state at the active site can shift docking scores by several kcal/mol because it changes the shape of the electrostatic potential and the hydrogen-bond donor/acceptor pattern of the binding site.
The fix_reduce_protonate pipeline addresses all of these with a four-stage protocol:
1. PDBFixer
PDBFixer (part of the OpenMM ecosystem) is applied first to resolve structural completeness issues. It adds missing heavy atoms using template-based modeling, resolves missing terminal residues, replaces non-standard residues with their standard equivalents, and removes heterogens (co-crystallized ligands, buffer molecules, water outside the binding site). This stage also performs an initial, coarse hydrogen placement that is pH-aware, setting the stage for more precise protonation in later steps.
2. Reduce (TRIM mode)
After PDBFixer, the Reduce tool [2] is applied to optimize hydrogen placement for key titratable and ambiguous groups. Reduce uses an all-atom contact analysis — examining small-probe contact dots between hydrogen atoms and neighboring atoms — to determine the orientation that maximizes favorable interactions and minimizes clashes. This is especially important for histidine (which can be protonated at Nδ, Nε, or both, and can also act as a hydrogen-bond donor or acceptor), and for asparagine and glutamine (whose amide groups can be flipped 180° without changing heavy-atom positions but dramatically changes the hydrogen-bond geometry). The TRIM variant of Reduce then removes the hydrogens it added while preserving their optimized orientations as a guide for the next step. This avoids double-protonation issues when PDB2PQR adds its own hydrogens in the following step.
3. PDB2PQR with the SWANSON force field
PDB2PQR [3] is the core protonation engine. It assigns protonation states at the target pH using empirical pKa models and adds all hydrogen atoms accordingly. The pipeline uses pH 7.4 (physiological pH) as the default, which ensures that results reflect binding under conditions relevant to in vivo activity. For example, at pH 7.4 most aspartate and glutamate residues are deprotonated (negatively charged), most lysine and arginine residues are protonated (positively charged), and histidine residues occupy a range of states depending on their local environment.
The SWANSON force field is used specifically because it assigns partial charges that are compatible with the AutoDock/Vina scoring function. AutoDock-family scoring relies on the Gasteiger charge model for electrostatic and hydrogen-bond terms; using charges parameterized for a different force field (e.g., AMBER or CHARMM) would create a systematic mismatch between the receptor charges used in scoring and the charges the scoring function was trained on, degrading the accuracy of predicted binding affinities.
PDB2PQR outputs a PQR file — a variant of PDB format that includes per-atom partial charges and radii, which are required by the docking engine.
4. OpenBabel format conversions
OpenBabel converts the PQR file back to standard PDB format and then to PDBQT format. PDBQT is the native input format for AutoDock-family docking engines. Beyond the charge assignments from PDB2PQR, the PDBQT format encodes torsional flexibility information: every rotatable bond in a ligand is annotated as a BRANCH/ENDBRANCH block, and the docking engine samples different torsion angles during its search. For the receptor, PDBQT is typically treated as rigid (side-chain flexibility is not sampled in standard virtual screening, as it would be prohibitively expensive at library scale).
Step 5: Receptor Energy Minimization
Even after fixing and protonation, the protein geometry carries artifacts. Replacing non-standard residues with standard templates, adding missing atoms, and converting protonation states all perturb bond lengths, angles, and dihedral angles away from equilibrium values. These distortions manifest as local steric clashes and bond strain that would bias docking results: a strained residue in the binding site would artificially penalize ligands that approach it, producing false negatives.
The receptor_minimization pipeline applies energy minimization using OpenBabel with the MMFF94 (Merck Molecular Force Field) [4]. MMFF94 is a well-parameterized general organic force field covering proteins, organic molecules, and most ligand-relevant functional groups. The minimization runs a two-stage algorithm — steepest descent first to escape large energy barriers quickly, followed by conjugate gradient for efficient convergence near the local minimum. This relaxes strain and resolves clashes without driving the structure away from the experimental geometry, because the minimization is applied only to the regions modified during preparation and the overall fold is preserved.
Step 6: Structural Quality Control with MolProbity
Before committing to a docking calculation that may take hours, the pipeline evaluates the quality of the prepared receptor using MolProbity [5]. MolProbity is the standard tool in structural biology for assessing model quality, and it is used by the wwPDB validation pipeline for all deposited structures.
The pipeline calls a MolProbity server with four sequential analyses:
- Oneline analysis — Computes an overall MolProbity score, which is a log-weighted combination of clashscore and the percentage of Ramachandran outliers and rotamer outliers. It is expressed as a percentile relative to structures of comparable resolution in the PDB, so lower scores indicate a higher-quality structure.
- Clashscore — Counts the number of serious steric overlaps (> 0.4 Å) per 1000 atoms. High clashscores indicate that atoms are occupying the same space, which means the force field or preparation step introduced incorrect geometry. A high clashscore in the binding site is particularly damaging: it will sterically occlude ligand poses and produce false negatives, or it will create artificial voids that accept poses that would not fit the real protein.
- Geometry (Ramachandran analysis) — Evaluates backbone φ/ψ dihedral angles against the expected distributions derived from ultra-high-resolution crystal structures. Residues in disallowed regions of the Ramachandran plot have strained backbones that indicate local errors in the model.
- Residue analysis (rotamer quality) — Assesses whether side-chain conformations match preferred rotamer distributions. Rotamer outliers in or near the binding site directly affect the shape and electrostatic character of the pocket presented to docking.
When multiple conformations are available (e.g., from NMR ensembles or from multiple models produced during receptor preparation), the pipeline automatically selects the conformation with the best (lowest) MolProbity score as the representative structure for docking.
Step 7: Binding Pocket Prediction with P2Rank
With a validated, high-quality receptor structure in hand, the pipeline identifies druggable binding sites using P2Rank (version 2.5) [6]. P2Rank is a machine learning-based pocket predictor that uses a random forest model trained on the structural and physicochemical properties of protein surfaces. Unlike purely geometric methods that identify concave surface regions, P2Rank learns which surface features are associated with observed ligand binding in the PDB training set, making its predictions more chemically meaningful.
P2Rank operates on the protein surface as sampled by solvent-accessible surface (SAS) points. For each point, it computes a feature vector encoding local geometry (curvature, burial depth), physicochemical properties (hydrophobicity, charge, hydrogen-bond donor/acceptor capacity), and evolutionary conservation. The random forest outputs a per-point probability of belonging to a binding pocket; nearby high-probability points are clustered into pocket predictions, each ranked by a combined score.
The pipeline then:
- Extracts the center coordinates of each predicted pocket (center of mass of its SAS points).
- Calculates a bounding box using MDAnalysis, sized to enclose all SAS points belonging to the pocket plus a 2 Å margin. This box defines the search space for docking — the region of the receptor within which UniDock will sample ligand poses.
- Enforces a maximum box volume of 27,000 ų (the limit imposed by UniDock), scaling oversized boxes down while preserving their center and aspect ratio.
- Checks new predictions against previously stored sites (within a 5 Å center tolerance) to avoid writing duplicate pockets when a structure is re-analyzed.
The box size calculation is a critical accuracy factor in docking: too small a box misses valid binding poses at the periphery of the pocket; too large a box wastes GPU search budget on empty space and increases the probability of sampling nonspecific poses. The SAS-point-based sizing strikes a balance between these extremes.
Step 8: Pocket Selection
The agent presents identified pockets to the user through a 3D visualization overlaid on the protein structure and a summary table of scores and positions. Users can select a pocket manually based on biological knowledge — for example, choosing a known active site from the literature, or preferring a cryptic allosteric site over the orthosteric site. Alternatively, the agent can automatically select the top-ranked pocket by P2Rank score for fully automated workflows.
The selected pocket's center coordinates and box dimensions are passed directly to the docking engine.
Step 9: Virtual Screening with UniDock
The docking step screens a library of small molecules against the selected binding site using UniDock [8], a GPU-accelerated molecular docking engine derived from AutoDock Vina [9]. UniDock reimplements Vina's Monte Carlo–based conformational search on GPU hardware, enabling batch processing of thousands of ligands simultaneously. This reduces the wall-clock time for large-scale virtual screening from days (CPU-based Vina) to hours, without sacrificing scoring accuracy relative to the original Vina implementation.
Ligand pre-filtering
Before docking, the compound library is filtered at the database level to remove molecules unlikely to be viable drug candidates:
- Drug-likeness (QED ≥ 0.2): The Quantitative Estimate of Drug-likeness (QED) [7] combines multiple Lipinski-style descriptors (molecular weight, lipophilicity, hydrogen-bond donors/acceptors, polar surface area, rotatable bonds, aromatic rings, structural alerts) into a single score from 0 to 1. A threshold of 0.2 removes obvious non-drug-like compounds while retaining fragments and lead-like molecules.
- Nanoparticles: Carbon-only structures (fullerenes, nanotubes, etc.) are excluded because they are not viable drug candidates and can produce artifactually high docking scores due to their compact, symmetric geometry.
- Metabolites: Endogenous metabolites are excluded to focus the screen on exogenous drug-like molecules.
Two-stage docking strategy
A key design decision in the pipeline is the use of a two-stage docking protocol that balances throughput against accuracy.
Stage 1 — Fast screening across the full library: UniDock runs in fast search mode across the entire filtered compound library. In fast mode, the Monte Carlo search uses a reduced number of steps and evaluations per ligand, enabling very high throughput. Each ligand's best-scoring pose is evaluated by Vina's hybrid scoring function, which combines a knowledge-based term derived from protein–ligand crystal structure data with a physics-based empirical term. Ligands scoring worse than −3.0 kcal/mol are discarded, retaining only those with meaningful predicted affinity for the next stage. Ligands are represented as PDBQT files, where the torsional flexibility of each rotatable bond is explicitly encoded.
Stage 2 — Detail re-docking of top candidates: The top candidates from Stage 1 (by default, 100 compounds for a requested output of 10 final hits, i.e., a 10× enrichment factor) are re-docked using UniDock in detail search mode. Detail mode allocates significantly more Monte Carlo steps and evaluations per ligand, allowing a more thorough exploration of conformational space and a more reliable identification of the true binding mode.
Critically, Stage 2 uses PDBQT files with explicit hydrogen atoms rather than the implicit-hydrogen representation used in Stage 1. Explicit hydrogens improve scoring accuracy for hydrogen-bond interactions because they allow the scoring function to evaluate the precise geometry of donor–acceptor pairs, including the directionality of N–H and O–H bonds. The tradeoff is that explicit-hydrogen structures have significantly more atoms (typically 1.5–2× more), requiring atom-budget-aware adaptive batching to prevent GPU out-of-memory errors.
The final top-N compounds are selected by Vina score after detail re-docking. Results are stored in the database with docking scores, SDF coordinates of the docked pose, and cross-references to the receptor structure and binding site.
Step 10: Docking Quality Control with PoseBusters
High docking scores are a necessary but not sufficient condition for a compound to be a viable hit. Docking engines optimize a scoring function that is an approximation of the true binding free energy, and they can produce poses that score well but are physically impossible. Common artifacts include: ligands passing through the protein surface due to incomplete clash detection, internal steric clashes within the ligand that the scoring function did not penalize sufficiently, and valence violations from format conversion errors.
The final step runs PoseBusters [10] on all docked poses. PoseBusters applies a battery of deterministic, physics-based checks that include:
- Molecular sanitization: Correct valence, no charged atoms, no radicals.
- Ligand bond geometry: Bond lengths, bond angles, and torsion angles within expected chemical ranges.
- Internal clashes: No steric overlaps between atoms within the ligand.
- Receptor clashes: No steric overlaps between the docked ligand and the protein heavy atoms.
- Binding site proximity: The ligand center of mass is within the declared binding site, not docked to a remote surface patch.
- Flat ring systems: Aromatic rings are planar, as required by their electronic structure.
PoseBusters was originally developed to audit the output of AI-based structure prediction methods (including recent diffusion-based docking models), but its checks are equally applicable to physics-based docking output. Its value here is as a filter against scoring-function artifacts: a compound that passes all PoseBusters checks is a physically reasonable pose, and discordance between a good docking score and a PoseBusters failure is a signal that the pose warrants manual inspection before experimental follow-up.
Summary of Tools
| Step | Tool | Purpose | Reference |
|---|---|---|---|
| Structure acquisition | RCSB PDB | Download experimental 3D structures | — |
| Structure acquisition | AlphaFold DB | Download predicted 3D structures | [1] |
| Metadata | UniProt | Protein metadata and cross-references | — |
| Metadata | NCBI | Chain annotations and binding sites | — |
| Format conversion | Gemmi | mmCIF to PDB conversion | — |
| Membrane check | OPM | Identify membrane proteins | — |
| Structure fixing | PDBFixer | Missing atoms, non-standard residues | — |
| Hydrogen optimization | Reduce | Hydrogen flipping and orientation | [2] |
| Protonation | PDB2PQR (SWANSON FF) | pH-aware protonation and charge assignment | [3] |
| Format conversion | OpenBabel | PQR→PDB, PDB→PDBQT conversions | — |
| Energy minimization | OpenBabel (MMFF94) | Receptor geometry relaxation | [4] |
| Quality control | MolProbity | Structure validation and ranking | [5] |
| Pocket prediction | P2Rank v2.5 | ML-based binding pocket identification | [6] |
| Ligand filtering | QED | Drug-likeness pre-filter | [7] |
| Docking | UniDock | GPU-accelerated virtual screening | [8] |
| Docking scoring function | AutoDock Vina | Scoring function underlying UniDock | [9] |
| Pose quality control | PoseBusters | Physics-based docking pose validation | [10] |
| Distributed compute | Apache Beam / Dataflow | Scalable pipeline execution | — |
Infrastructure
All pipelines run as Apache Beam jobs on Google Cloud Dataflow. The docking step uses GPU workers provisioned on demand; the rest of the pipeline runs on CPU workers. Persistent data (structures, sites, docking results) is stored in Cloud SQL (MySQL), and large files (PDB, SDF, PDBQT, NDJSON) are stored in Google Cloud Storage.
References
- Jumper, J. et al. (2021). Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873), 583–589.https://doi.org/10.1038/s41586-021-03819-2
- Word, J.M., Lovell, S.C., Richardson, J.S., Richardson, D.C. (1999). Visualizing and quantifying molecular goodness-of-fit: small-probe contact dots with explicit hydrogen atoms. Journal of Molecular Biology, 285(4), 1711–1733.[DOI pending]
- Dolinsky, T.J., Nielsen, J.E., McCammon, J.A., Baker, N.A. (2004). PDB2PQR: an automated pipeline for the setup of Poisson–Boltzmann electrostatics calculations. Nucleic Acids Research, 32(suppl. 2), W665–W667.https://doi.org/10.1093/nar/gkh381
- Halgren, T.A. (1996). Merck molecular force field. I. Basis, form, scope, parameterization, and performance of MMFF94. Journal of Computational Chemistry, 17(5–6), 490–519.https://doi.org/10.1002/(SICI)1096-987X(199604)17:5/6<490::AID-JCC1>3.0.CO;2-P
- Williams, C.J. et al. (2018). MolProbity: more and better reference data for improved all-atom structure validation. Protein Science, 27(1), 293–315.https://doi.org/10.1002/pro.3330
- Krivák, R. and Hoksza, D. (2018). P2Rank: machine learning based tool for rapid and accurate prediction of ligand binding sites from protein structure. Journal of Cheminformatics, 10, 39.https://doi.org/10.1186/s13321-018-0285-8
- Bickerton, G.R., Paolini, G.V., Besnard, J., Marechal, S., Hopkins, A.L. (2012). Quantifying the chemical beauty of drugs. Nature Chemistry, 4, 90–98.https://doi.org/10.1038/nchem.1243
- Yu, Y. et al. (2023). Uni-Dock: GPU-accelerated docking enables ultralarge virtual screening. Journal of Chemical Theory and Computation, 19(11), 3336–3345.https://doi.org/10.1021/acs.jctc.2c01145
- Trott, O. and Olson, A.J. (2010). AutoDock Vina: improving the speed and accuracy of docking with a new scoring function, efficient optimization, and multithreading. Journal of Computational Chemistry, 31(2), 455–461.https://doi.org/10.1002/jcc.21334
- Buttenschoen, M., Morris, G.M., Deane, C.M. (2024). PoseBusters: AI-based docking methods fail to generate physically valid poses or generalise to novel sequences. Chemical Science, 15, 3130–3139.https://doi.org/10.1039/d3sc04185a