From the desks of Peter Gedeck, Jonathan Bisson, and Barry Bunin
Everyone claims their models are the best in the world. They are all lying. Or, put more gently and accurately, no single QSAR (quantitative structure-activity relationship) model is significantly better than the others across all of drug discovery. Within some limited context, certain methods can and do perform better than others. Typically, state-of-the-art models can improve a 0.1% high-throughput screening assay hit rate to a 1% to 10% hit rate. The important thing is having the data and a platform that continuously improves as the data get richer and the corresponding models are refined.
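To put the hit-rate figures above in perspective, here is a minimal sketch of the arithmetic, expressed as an enrichment factor (the model-guided hit rate divided by the baseline hit rate). The function names and the 10,000-compound screen size are illustrative assumptions, not anything taken from CDD Vault.

```python
def enrichment_factor(model_hit_rate: float, baseline_hit_rate: float) -> float:
    """Ratio of the hit rate among model-selected compounds to the baseline hit rate."""
    if baseline_hit_rate <= 0:
        raise ValueError("baseline_hit_rate must be positive")
    return model_hit_rate / baseline_hit_rate


def expected_hits(n_compounds: int, hit_rate: float) -> float:
    """Expected number of hits when screening n_compounds at a given hit rate."""
    return n_compounds * hit_rate


if __name__ == "__main__":
    baseline = 0.001  # the 0.1% baseline hit rate quoted in the text
    for model_rate in (0.01, 0.10):  # the 1% and 10% model-guided hit rates
        ef = enrichment_factor(model_rate, baseline)
        print(f"{model_rate:.0%} vs {baseline:.1%}: about {ef:.0f}-fold enrichment")

    # Hypothetical screen of 10,000 compounds (an assumed size, for illustration only)
    print(expected_hits(10_000, baseline), expected_hits(10_000, 0.01))
```

In other words, the quoted improvement corresponds to roughly a 10-fold to 100-fold enrichment over random screening, which is useful but a long way from every prediction being a hit.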
An assay has variability with error bars, so models will too. An enzyme assay is a model of a cell response, which is a model of an animal response, which is a model of a human response. So a model of that assay is a model of a model of a model of a model of the real system.
CDD Vault has implemented an automated, extensible regression model that generally works reasonably well, gives an indication of the error range, and requires no fine-tuning. The models are only generated when there is a statistically useful signal. There are no claims about the CDD Vault Inference model being the best in the world. However, having a model that is automatically updated as new data are generated, and automatically run on any new molecule, is useful, especially when predictions and experiments can be compared side by side on the same molecule and on similar bioisosteres. This is now available for use. Caveat emptor.
That said, a model can be useful. However, each assay model will have its own statistics, training set, test set, and domain of applicability. And the usefulness of the predictions is a function of numerous variables that will be different in each case. Other questions are germane: can this molecule be synthesized with materials in house or must it be outsourced, is it available off the shelf, how many analogs can be made in parallel, is it a huge library or a multi-year FTE project? The utility of any model is a function of the assay and the data. It is also a function of each organization’s ability to use it, based on specific criteria: how reliable is my assay, how many FTEs do we have on this project, how creative are my scientists, what’s the mechanism of action of the drug, how predictive of human disease is the model?
So what really matters is not the absolute accuracy of the model, which will vary as a function of the experimental data, but rather how easy it is for experimentalists and computational scientists to see predictions and experiments side by side.
CDD has reduced the overhead of that step to zero.
Now, for every experiment run within a statistically reasonable range, standard SAR (structure-activity relationship) models are created automatically, with zero clicks, and continuously updated and improved.
Importantly, these models are not only automatically generated, but they are automatically run.
And run against any compound. For example, compounds already in your Vault. For example, generative bioisosteres, which are by definition new IP and similar to your training compounds. For example, any molecule you can imagine and draw gets scored in milliseconds against the models generated from your assay data. Automatically. Instantly.
This all happens securely within your private CDD Vault, of course.
The CDD Vault zero-click, fully automated inference models provide predictions with error bars in the same units as the experimental values, for easy side-by-side comparison of prediction accuracy. Previous models, by contrast, reported unitless values (often ranging above and below zero), which are less intuitive to experimentalists accustomed to IC50 and pIC50 frameworks. Importantly, the framework improves over time, as algorithms evolve.
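To illustrate why matching units matter, here is a small sketch that converts a predicted pIC50 with an error bar into an IC50 range in nanomolar, using the standard definition pIC50 = -log10(IC50 in mol/L). The function names and example numbers are hypothetical, and this is not a description of how CDD Vault computes or reports its error bars.

```python
def pic50_to_ic50_nm(pic50: float) -> float:
    """Convert pIC50 (negative log10 of IC50 in mol/L) to IC50 in nanomolar."""
    # IC50 [M] = 10 ** (-pIC50); multiplying by 1e9 gives nM, i.e. 10 ** (9 - pIC50).
    return 10 ** (9.0 - pic50)


def ic50_interval_nm(pic50: float, pic50_error: float) -> tuple[float, float, float]:
    """Return (low, center, high) IC50 values in nM for a pIC50 +/- error prediction."""
    if pic50_error < 0:
        raise ValueError("pic50_error must be non-negative")
    center = pic50_to_ic50_nm(pic50)
    # A higher pIC50 means a more potent compound, hence a lower IC50.
    low = pic50_to_ic50_nm(pic50 + pic50_error)
    high = pic50_to_ic50_nm(pic50 - pic50_error)
    return low, center, high


if __name__ == "__main__":
    # Hypothetical prediction: pIC50 of 7.0 with an error bar of 0.5 log units
    low, center, high = ic50_interval_nm(7.0, 0.5)
    print(f"IC50 ~ {center:.0f} nM (range {low:.1f} to {high:.1f} nM)")
```

Note that an error bar that is symmetric in pIC50 becomes asymmetric once expressed as IC50, which is one reason it helps to view predictions and measurements in the same units.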
This capability was too valuable to hide, so we are making Inference Models available for the entire scientific community in CDD Vault as part of the core offering.
Generative bioisosteres are part of the AI module add-on, along with other models such as Boltz-2, AlphaFold2, and ESMFold. The automatic, zero-click inference models are part of the core platform and are now available to everyone. Check it out, use it, and give us feedback on V1.0.
The models will continuously improve with your data.