Beware Researchers! Challenges Navigating Commercial and Public Databases

From the desk of CDD CSO Sean Ekins, M.Sc., Ph.D., D.Sc.

Each phase of a drug discovery project is directly affected by the quality and availability of relevant data. There is no question that well-curated chemical structures and linked bioactivity data enable drug discovery projects. This is true whether you are managing data within a single organization, sharing it between groups, or exploring all publicly accessible data. This gets to the core of all we do at CDD.

Within the last decade, and most notably within the last few years, there has been an explosion of chemical information available in commercial and public databases, including PubChem and ChEMBL. Researchers once depended on the commercial database CAS SciFinder; this is no longer the case. This abundance has led to a number of challenges when searching for necessary information and performing medicinal chemistry due diligence. It is not straightforward to ask relatively simple questions like:

  • “Is there already biological data associated with this molecule?”
  • “Is this lead compound desirable?”
  • “Is this compound novel?”

Information Overload

For one, no single database is going to give you the answer. You must search many to resolve a question.

From our recent efforts to collect and evaluate the NIH Probes with Dr. Christopher Lipinski and our collaborations with Dr. Chris Southan, Dr. Antony Williams and Dr. Alex Clark, we have been able to describe some of the extreme difficulties apparent when navigating public and commercial databases. Trouble arises as text abstraction of patents expands and brings in prophetic compounds, which have no experimental data. The vendor deposition of virtual make-on-demand compounds that have never been made and a lack of intersection between different databases add to the difficulties encountered when performing due diligence. A final issue of note is the lack of standardization in the description of bioassay data, which makes direct comparison challenging, an issue we have previously discussed. Taken together, the whole is greater than the sum of the parts. Where we could probably cope with one issue, we are instead overwhelmed with many. It is therefore harder to make sense of the data retrieved from these large databases, from both a legal and a scientific perspective, if indeed appropriate data can be identified. Our recent publication was featured on Derek Lowe’s In the Pipeline blog: Our disorganized piles of chemical information.
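To make the overlap problem concrete, here is a minimal sketch of the kind of cross-database check that due diligence calls for. It assumes you have already exported records from several sources and keyed them on a shared structure identifier. The source names, compound keys, and the `has_data` flag are all hypothetical placeholders, not real database contents or any particular database's API.

```python
from collections import defaultdict

# Hypothetical records: (source, compound key, has measured bioactivity data).
# Source names and keys are placeholders, not real database contents.
RECORDS = [
    ("source_a", "KEY-1", True),
    ("source_b", "KEY-1", False),
    ("source_a", "KEY-2", False),
    ("source_c", "KEY-2", False),
    ("source_c", "KEY-3", False),
]


def summarize(records):
    """Group records by compound key and report coverage across sources."""
    sources = defaultdict(set)
    measured = defaultdict(bool)
    for source, key, has_data in records:
        sources[key].add(source)
        # A compound counts as measured if any source reports data for it.
        measured[key] = measured[key] or has_data
    return {
        key: {"sources": sorted(found), "has_data": measured[key]}
        for key, found in sources.items()
    }


def single_source_without_data(summary):
    """Return keys seen in only one source with no measured data attached."""
    return sorted(
        key
        for key, info in summary.items()
        if len(info["sources"]) == 1 and not info["has_data"]
    )


summary = summarize(RECORDS)
flagged = single_source_without_data(summary)
```

Compounds flagged this way are only candidates for a closer look: a record that appears in one source with no experimental data might be a prophetic or make-on-demand entry, but it might equally be a genuine compound that other databases have not yet picked up.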

Solving these considerable problems will take action by the community, increased collaboration, and communication. It is likely that we have only uncovered the tip of the iceberg. We are keen to know what other challenges you face in navigating commercial and public databases; there might be solutions we could provide. Our interest in this came from a chance discussion and has already expanded into the resulting papers. We are glad to be a part of the conversation, helping to bring attention to these important issues, and offering some potential solutions. Let us know if there are high-quality public datasets that you would like to be able to mine alongside your private data within the security of the CDD Vault. We are keen to hear your thoughts. Please contact us.

In case you are not yet familiar with it, the CDD Vault is a hosted database for secure management and sharing of chemical and biological data. It lets you collaborate with internal or external partners through an easy-to-use web interface. CDD also has a large collection of public data that you can mine side-by-side with your own private data.