    March 25, 2026

    Applying Real-World Drug Discovery Experience in the OpenADMET ExpansionRx Blind Challenge: A Summary of Top Submissions

    The first OpenADMET ExpansionRx webinar series featured presentations from two top-performing participants: Jason Wang (Merck) and David Baldini (EMD Serono). Both groups ranked in the top tier of the challenge leaderboard, which drew over 370 teams and evaluated predictions across 25,000 experimental measurements spanning 7,000+ compounds.

    This post summarizes the modeling approaches, key findings, and failure modes reported by each team.


    Background

    The ExpansionRx blind challenge provided one of the largest publicly available ADME datasets to date. Participants predicted a range of ADME endpoints including solubility, permeability (CACO-2 Papp and efflux), LogD, and species-specific clearance.

    Submissions were evaluated on a partial leaderboard during the challenge period and against a final held-out test set at the end of the challenge.

    Merck (Ranked 2nd Overall)

    Approach: Multitask ChemProp with Proprietary Data Augmentation

    Jason Wang's team built a multitask message-passing neural network using ChemProp v2.2.1. The model ingested SMILES strings with no additional descriptors and trained across 40 endpoints simultaneously using a 90/10 stratified random split.

    Key architectural and training decisions:

    • Loss function: MAE loss was substituted for the standard MSE, consistent with the challenge evaluation metric.
    • Ensemble: Each submission used an ensemble of 3–5 models to reduce variance, particularly relevant for low-data endpoints.
    • Data weighting: The team applied endpoint-level and compound-level weights in the loss function to emphasize the ~2,000 ExpansionRx training compounds rather than the ~700,000 internal Merck compounds. Weights were determined empirically rather than via hyperparameter search.
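The endpoint- and compound-level weighting described above can be sketched as a weighted, masked MAE over a sparse multitask label matrix. This is a minimal numpy sketch under stated assumptions; the function name and array layout are illustrative, not ChemProp's actual loss API:

```python
import numpy as np

def weighted_multitask_mae(preds, targets, task_weights, sample_weights, mask):
    """Weighted masked MAE across endpoints.

    preds, targets : (n_samples, n_tasks) arrays; missing labels are masked.
    task_weights   : (n_tasks,) endpoint-level weights.
    sample_weights : (n_samples,) compound-level weights, e.g. upweighting
                     the ~2,000 challenge compounds relative to internal ones.
    mask           : (n_samples, n_tasks) boolean, True where a label exists.
    """
    err = np.abs(preds - targets)
    # combine compound-level and endpoint-level weights, zeroing missing labels
    w = sample_weights[:, None] * task_weights[None, :] * mask
    return float((err * w).sum() / w.sum())
```

Upweighting a sample scales every labeled endpoint for that compound, which is how a small in-domain training set can dominate the loss despite a much larger internal corpus.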

    When Proprietary Data Helped

    For solubility and CACO-2 efflux, appending internal Merck data to the training set produced measurable reductions in MAE relative to models trained without it. The team attributed this to high assay consistency between internal and challenge measurements, and sufficient internal data volume to reduce overfitting.

    When Proprietary Data Did Not Help

    For LogD and CACO-2 Papp, models trained exclusively on ExpansionRx challenge data outperformed those augmented with internal Merck data. For LogD specifically, the team reported a consistent performance gap: internal models achieved R² ~0.9 on time-split internal test sets but only ~0.75 on the challenge test set. The team identified probable assay differences (HPLC method vs. shake flask) as a contributing factor.

    For CACO-2 Papp, a fine-tuned Kermit foundation model (developed in collaboration with NVIDIA, trained without Merck data) also outperformed the augmented multitask model, suggesting that domain-specific chemical space alignment matters more than dataset size for certain endpoints.

    Overfitting Mitigation

    With fewer than 20 models submitted across the full challenge period, the team deliberately avoided iterative hyperparameter tuning against the partial leaderboard. Final full test set performance was comparable to or better than partial leaderboard performance for most endpoints, indicating limited test set overfitting.

    EMD Serono (Top Performer)

    Approach: Internally Developed Foundation Models via MATCHA

    David Baldini's team entered the challenge as a testbed for MATCHA (Modeling and AI Toolkit for Chemistry and Healthcare Applications), an internal framework that wraps multiple neural network architectures under a unified scikit-learn-style interface. The framework supports multitask pre-training, fine-tuning, and ensembling across graph-based architectures.

    Pre-Training Data

    Two pre-training configurations were used:

    • Generalist model: ~10 million compounds, ~5,000 endpoints, combining public and internal data across ADME, binding affinity, RDKit descriptors, and quantum mechanical (QM) properties.

    • ADME-focused model: Literature-curated ADME data, internal assay data (~80,000 compounds, 19 assays, ~400,000 measurements), and the ExpansionRx training set.

    Notably, the team found that publicly available data was more chemically similar to the ExpansionRx challenge space than their internal data, though internal data contributed greater structural diversity.

    Architectures

    Four graph-based architectures were evaluated: ChemProp, Graph Isomorphism Network (GIN), Gated GCN, and MGPS (a graph transformer). To improve diversity along a local vs. global inductive bias axis:

    • Gated GCN was combined with virtual nodes to facilitate long-range message passing.
    • GIN was augmented with Laplacian positional encodings to provide molecular context beyond local neighborhoods.
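As an illustration of the Laplacian positional encodings mentioned above, the standard construction takes the k smallest nontrivial eigenvectors of the symmetric normalized graph Laplacian as extra node features. A minimal numpy sketch, assuming an undirected molecular graph with no isolated nodes (not the team's implementation):

```python
import numpy as np

def laplacian_pos_encoding(adj, k=2):
    """k smallest nontrivial eigenvectors of the normalized Laplacian.

    adj : (n, n) symmetric adjacency matrix of a molecular graph.
    Returns an (n, k) array of per-node positional features.
    """
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, deg ** -0.5, 0.0)
    # L = I - D^{-1/2} A D^{-1/2}
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(lap)  # eigenvalues in ascending order
    # skip the trivial eigenvector at eigenvalue ~0
    return vecs[:, 1:k + 1]
```

Because these eigenvectors encode the graph's global geometry, concatenating them to atom features gives a locally biased GNN like GIN some view beyond its message-passing neighborhood.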

    Graph transformers (MGPS) demonstrated stronger performance for permeability endpoints, consistent with their capacity to model long-range structural dependencies. ChemProp remained competitive across most endpoints.

    Key Training Decisions

    • Curriculum learning: Task weights were annealed during pre-training. QM and RDKit descriptor tasks were weighted highly early in training and reduced midway through; ADME and binding affinity tasks were upweighted in later training phases. This prevented the pre-trained representation from being dominated by trivial-to-learn tasks (e.g., molecular weight prediction).
    • Differential learning rates during fine-tuning: Earlier backbone layers used lower learning rates; the MLP head used higher learning rates.
    • Cross-validation ensembling: N models were fine-tuned on N different training splits with distinct random initializations; each model's out-of-fold set was used for early stopping.
    • Auxiliary tasks: Jazzy descriptors were incorporated as auxiliary regression targets during fine-tuning for permeability endpoints, conditioning the latent space without requiring conformer generation at inference.
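The curriculum-learning schedule (QM and descriptor tasks weighted highly early, ADME and affinity tasks upweighted later) can be sketched as a per-task weight anneal. The linear interpolation and midpoint switch below are illustrative assumptions, not the reported schedule:

```python
def curriculum_weight(epoch, total_epochs, start_w, end_w, switch_frac=0.5):
    """Hold a task's loss weight at start_w for the first part of training,
    then linearly anneal toward end_w.
    E.g. a QM task might go 1.0 -> 0.1 while an ADME task goes 0.1 -> 1.0."""
    t = epoch / total_epochs
    if t < switch_frac:
        return start_w
    frac = (t - switch_frac) / (1.0 - switch_frac)
    return start_w + frac * (end_w - start_w)

# a QM descriptor task fades out while an ADME task ramps up
qm_w = [curriculum_weight(e, 100, 1.0, 0.1) for e in (0, 50, 75, 100)]
adme_w = [curriculum_weight(e, 100, 0.1, 1.0) for e in (0, 50, 75, 100)]
```

Applying such a schedule inside the multitask loss keeps easy, abundant targets (e.g. molecular weight) from dominating the learned representation late in pre-training.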

    What Did Not Work

    • SMILES-based foundation models underperformed graph-based equivalents, likely due to insufficient unsupervised pre-training data and the absence of masked language modeling or autoregressive objectives.
    • Scaling pre-training data beyond ~10 million compounds produced diminishing returns, with some evidence of degraded performance as QM-derived labels increasingly dominated the training signal.
    • Scaling model size beyond ~40 million parameters was not straightforward; GNNs showed instability at depth.
    • Complex ensembling strategies (e.g., stacking) did not outperform equal-weight prediction averaging.
    • Hyperparameter optimization yielded inconsistent gains, attributed partly to small validation set sizes in low-data regimes.

    MLOps as a Prerequisite

    A recurring theme in Baldini's presentation was the primacy of reproducible experiment tracking and modular infrastructure. The team used MLflow throughout and cited the ability to run pre-training and fine-tuning jobs in parallel as a prerequisite for the volume of experimentation conducted. The inference pipeline was designed to avoid conformer generation, keeping deployment latency practical.

    Cross-Cutting Observations

    Several findings were consistent across both teams:

    Proprietary data does not universally improve performance. Assay differences between internal and external data can introduce systematic bias that reduces model accuracy on held-out challenge data. Dataset size is not a reliable proxy for dataset relevance.

    Multitask models are generally competitive with single-task models, and in correlated endpoint groups (e.g., cross-species clearance, transporter endpoints) they offer a meaningful advantage. However, negative transfer is observed when uncorrelated tasks or mismatched assay data are included.

    Ranking performance is the operationally relevant metric for most ADME use cases. Both teams noted that accurate absolute value prediction may be less important than correct compound ordering for prioritization decisions.
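To make the distinction concrete: a model with a constant systematic offset (such as a cross-assay bias) has poor absolute error but perfect rank ordering. A minimal Spearman sketch, assuming no tied values (so rank-of-rank recovers the ranks; a production implementation should handle ties):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation = Pearson correlation of the ranks.
    Valid only when there are no tied values."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

truth = np.array([-1.2, 0.3, 1.1, 2.4])  # e.g. measured LogD values
preds = truth + 0.8                      # constant systematic offset
# MAE is 0.8, yet compound ordering is fully preserved (spearman = 1.0),
# so the model is still useful for prioritization decisions
```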

    Overfitting to partial leaderboards is a real risk in blind challenge settings. Controlled submission rates and diverse pre-training data were the primary mitigations cited.

    Watch the presentations given by Alex Rich (Inductive Bio, top-ranked overall submission) and Daniel Crusius (deepmirror).
