    March 25, 2026

    Applying a Focused Modeling Strategy in the OpenADMET ExpansionRx Blind Challenge: Lessons from Top Performers

    The second OpenADMET ExpansionRx webinar featured presentations from the first- and fourth-place finishers: Alex Rich (Inductive Bio) and Daniel Crusius (deepmirror). Both teams competed under pseudonyms — Pebble and campfire-capillary, respectively — and represent the startup segment of a top-10 cohort that spanned large pharma, academia, and biotech.

    This post summarizes their modeling approaches, data handling decisions, and post-challenge analyses.

    OpenADMET + ExpansionRx webinar: Focused modeling strategies (Alex Rich and Daniel Crusius)


    Background

    The ExpansionRx challenge provided one of the largest lead-optimization-like datasets available in the public domain: 7,000+ compounds, 25,000+ measurements, 103 model reports submitted, and over 4,000 total submissions from approximately 370 groups. Endpoints included kinetic solubility, Caco-2 permeability (Papp and efflux), LogD, plasma protein binding, and species-specific microsomal clearance.

    Inductive Bio (Ranked 1st Overall)

    Approach: Consortium-Backed GNN Ensemble with Staged Fine-Tuning

    Inductive Bio builds ADME models backed by a pre-competitive data consortium composed of partner IP-protected data, internal experiments, and curated public literature and patent data. Their models use graph neural network representations, ensembling across multiple GNN architectures operating on 2D molecular inputs.

    Models are trained in conceptually related task groups rather than a single massively multitask model. For example, Caco-2 Papp and efflux ratio were trained jointly, alongside permeability assay data from MDCK and auxiliary tasks including LogD. Training followed a staged approach: models were first trained on the full consortium data, then fine-tuned to the ExpansionRx program specifically.

    Hyperparameter optimization was based on a temporal split of the ExpansionRx training set, using compound ID as a proxy for time.
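    This kind of split can be sketched as follows. The compound ID format and field names here are hypothetical stand-ins for the actual ExpansionRx identifiers; the only assumption carried over from the post is that a numeric ID suffix tracks registration order:

```python
# Hypothetical sketch: temporal train/validation split using compound ID
# as a time proxy. IDs like "EXP-000123" are invented for illustration.

def temporal_split(records, frac_train=0.8):
    """Sort records by the numeric suffix of their compound ID (a proxy
    for registration time), then split chronologically."""
    ordered = sorted(records, key=lambda r: int(r["compound_id"].split("-")[-1]))
    cut = int(len(ordered) * frac_train)
    return ordered[:cut], ordered[cut:]

# Toy data, deliberately out of order to show the sort.
data = [{"compound_id": f"EXP-{i:06d}"} for i in (5, 1, 9, 3, 7)]
train, test = temporal_split(data, frac_train=0.6)
```

The later compounds land in the holdout, mimicking the prospective setting in which a model trained on earlier chemistry must predict newer designs.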

    Effect of External Data and Auxiliary Tasks

    After the competition, the team ran a four-condition ablation study to quantify the contribution of external data and auxiliary tasks:

    • Full model (baseline)
    • Auxiliary tasks removed (no LogD or pKa in permeability or binding models)
    • All external data removed; ExpansionRx-only multitask training
    • Nine individual single-task models

    Removing auxiliary tasks produced measurable MAE increases, particularly for Caco-2 and binding endpoints — consistent with the known relationship between LogD, pKa, lipophilicity, and membrane partitioning. Removing all external data produced a larger performance drop across all endpoints, with single-task training showing further degradation for endpoints that had been grouped in multitask configurations. Even within a single program's data, multitask training provided a measurable benefit.

    The team noted that an open question remains: whether grouped multitask training (as practiced by Inductive Bio and EMD Serono) outperforms massively multitask training (as practiced by Merck). This was identified as a priority area for future investigation.

    Assay Data Curation: A Kinetic Solubility Case Study

    The team identified a data quality issue in the kinetic solubility endpoint that had a direct impact on model performance.

    During exploratory data analysis, two clusters were observed in the solubility distribution: one near the expected ceiling of ~300 µM and a second, unexpected cluster near 50–100 µM. This lower cluster appeared predominantly in the first ~1,800 compounds in the dataset and was absent in later compounds.

    The hypothesis: some early compounds were run at a lower maximum assay concentration (~100 µM), while later compounds were standardized to 300 µM. Compounds in the lower cluster with values between 40 and 110 µM were likely at the assay ceiling and would have measured higher had they been tested at the higher concentration.

    The team imputed solubility values of 250 µM for these compounds under that assumption. This correction produced a measurable reduction in MAE on the final test set.

    The broader point: data quality issues of this type are common in both public and proprietary datasets. Identifying them requires exploratory analysis, knowledge of assay protocols, and collaboration with experimental colleagues — and addressing them can matter as much or more than model architecture choices.

    Connecting ADME Predictions to In Vivo Exposure

    The team proposed a forward-looking challenge for the field: using ML-predicted ADME properties as inputs to physiologically based pharmacokinetic (PBPK) models to estimate in vivo exposure and effective dose.

    As an initial proof of concept using the ExpansionRx test set, they selected the ~325 compounds with complete measured ADME data required for a one-compartment PBPK model and estimated minimum unbound plasma concentration (Cmin) for a once-daily dose. They then repeated the same calculation using their ML-predicted ADME values.
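    For intuition, the steady-state trough for a one-compartment model with first-order elimination and instantaneous absorption can be computed as below. This is a simplified stand-in, not the team's actual PBPK workflow, and every parameter value is invented for illustration:

```python
import math

def unbound_cmin_ss(dose_mg, f_oral, vd_l, cl_l_per_h, fu, tau_h=24.0):
    """Steady-state trough concentration (Cmin) for a one-compartment model
    with first-order elimination, scaled by fraction unbound (fu).
    Assumes instantaneous absorption; units: mg, L, L/h -> mg/L."""
    k = cl_l_per_h / vd_l                       # elimination rate constant (1/h)
    cmax_ss = (f_oral * dose_mg / vd_l) / (1 - math.exp(-k * tau_h))
    cmin_ss = cmax_ss * math.exp(-k * tau_h)    # trough just before the next dose
    return fu * cmin_ss

# Invented once-daily dosing scenario.
c = unbound_cmin_ss(dose_mg=100, f_oral=0.5, vd_l=70, cl_l_per_h=5, fu=0.1)
```

The key inputs (clearance, volume of distribution, oral bioavailability, fraction unbound) map onto the measured or predicted ADME endpoints; swapping in ML predictions for any of them propagates their error through the same equation.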

    The two Cmin estimates had a Spearman correlation of 0.77 and a ~2.7-fold error. Critically, this error was comparable to the individual endpoint model performance, suggesting that ADME prediction errors do not compound catastrophically when propagated through a mechanistic model.

    The team framed this as supporting evidence — not validation — and noted that actual validation would require blinded in vivo preclinical data. A more detailed writeup of this analysis is planned for the Inductive Bio blog.

    deepmirror (Ranked 4th Overall)

    Approach: Endpoint-Specific Model Selection with Diverse Architecture Ensemble

    deepmirror operates a drug design platform that builds ADME and potency models from three data sources: program-specific uploaded data, curated public data, and proprietary alliance data from contributing customers and partners. Their competition approach centered on systematic evaluation of multiple modeling strategies across endpoints, with final model selection driven by internal holdout and leaderboard performance.

    Data Analysis and Split Strategy

    Prior to modeling, the team examined temporal trends in the training data using compound ID as a time proxy. Two relevant observations:

    • Solubility values showed a distribution shift in the first ~15% of compounds, consistent with the assay concentration issue identified by Inductive Bio. Removing these compounds improved holdout and leaderboard performance.
    • HLM clearance data had a large gap toward the end of the training set, creating uneven data availability across time-based splits.

    The team used temporal splits as their primary internal holdout strategy, citing simplicity and alignment with the challenge evaluation structure. Both single-task temporal splits and multitask temporal splits were applied depending on endpoint, with tradeoffs in each approach for endpoints with sparse or time-skewed data.

    Modeling Approaches

    Five model types were evaluated:

    • Single-task GNN models
    • Classic multitask GNN models (predicting correlated endpoints jointly)
    • Source-stratified multitask models (a single endpoint split by data source as separate output nodes — used for LogD)
    • Dependent models (ML-predicted LogD injected as an input feature for downstream endpoints)
    • Stacked ensembles (predictions from one model used as additional input to a correction model)
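    The dependent-model pattern from the list above can be illustrated with a toy sketch in which a stand-in LogD predictor feeds the downstream feature vector. Both models here are placeholders (a fixed linear function and a plain list of descriptors), not the team's actual GNNs or tree ensembles:

```python
# Toy sketch of a "dependent model": the upstream LogD prediction is
# appended as an extra input feature for a downstream (e.g. clearance) model.
# Both models are hypothetical stand-ins.

def predict_logd(features):
    """Stand-in upstream LogD model: a fixed linear combination of descriptors."""
    return 0.8 * features[0] - 0.2 * features[1]

def clearance_features(raw_features):
    """Augment raw descriptors with the ML-predicted LogD as a dependent feature."""
    return raw_features + [predict_logd(raw_features)]

x = [3.0, 1.5]                     # invented descriptor vector
augmented = clearance_features(x)  # the downstream model consumes this vector
```

The design choice is that the downstream model never sees measured LogD at inference time — only the upstream prediction — so train/test feature distributions stay consistent.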

    Ensembling and bagging were applied consistently and produced improvements across most endpoints. Tree-based models implemented through the AutoGluon framework performed on par with or better than GNNs for clearance and permeability endpoints, particularly when bagged and ensembled. CheMeleon was used as a baseline across all endpoints.

    Endpoint-Specific Findings

    For LogD and kinetic solubility, multitask ChemProp models with external data and custom MAE loss performed best. Ablation studies on LogD showed a consistent trend: each additional data source reduced test set MAE. For solubility, external data provided less marginal benefit on the full training set but showed clear value when the training set was subsampled — as few as 50 program-specific compounds combined with external data produced functionally useful models.

    For clearance, tree-based AutoGluon models augmented with predicted LogD as an input feature outperformed or matched GNN approaches. Using ML-predicted LogD as a dependent feature also improved the HLM clearance and Caco-2 permeability models.

    For mouse plasma protein binding, ensembling a focused ChemProp multitask model (trained on the three mouse binding endpoints only) with a graph transformer produced the strongest endpoint-level performance — with no external data required.

    Key Takeaways

    • Classical descriptor and fingerprint models, when ensembled, remain competitive with GNN approaches for clearance and permeability endpoints.
    • Hyperparameter optimization produced limited gains and frequently failed to transfer from the internal holdout to the leaderboard, consistent with overfitting risk in low-data regimes.
    • The best-performing competition model for solubility was optimized for high-solubility compounds, where most of the data density exists. This produced a lower leaderboard MAE but worse performance at low solubility values — a potentially meaningful limitation in practical drug discovery use cases where flagging low-solubility compounds is the primary objective.

    Cross-Cutting Observations

    Several themes were consistent across both presentations and align with findings from the first webinar:

    External data and multitask training consistently improve performance, particularly for endpoints where related physicochemical properties (LogD, pKa) provide transferable learning signal. The benefit is most pronounced when program-specific data is limited.

    Data curation precedes modeling in practical impact. Both teams identified and addressed assay-related data quality issues in the solubility endpoint. In both cases, correcting for the issue improved final test set performance.

    Ensemble diversity matters more than ensemble size. Combining three to four architecturally distinct models consistently outperformed larger ensembles of similar models or more complex stacking schemes.

    Leaderboard optimization does not guarantee practical utility. Models tuned for challenge metrics may underperform on the lower end of dynamic ranges — precisely where some endpoints carry the most actionable information.

    Program-specific fine-tuning generalizes quickly. Both teams reported that small quantities of program-specific data (on the order of 50 compounds) meaningfully improved predictions when combined with well-curated external data.
