Why Is Publicly Available Data Important And How Is It Being Used?
October 22, 2018
The Future of Open Source Data & Drug Discovery
Based on a discussion with Ashley Farley, from the Bill and Melinda Gates Foundation Open Access Group; Andrew Leach from EMBL-EBI, the provider of chEMBL; Evan Bolton from the NCBI/NLM of the NIH, the provider of PubChem; and Frank Cole with CDD.
This is Part-1 of our 2-part series on open-source data and its impact on drug discovery. Read Part-2, What Does The Future Of Open Source Data Hold For Drug Discovery? to learn more.
There is great interest in increasing the quantity and quality of shared drug discovery information.
More open access data aids researchers throughout the drug discovery process, prevents them from “reinventing the wheel” and helps with research focused on neglected diseases.
In 2006, CDD introduced a public section to the otherwise private CDD Vault.
There are over 2.5 million compounds and associated data on the CDD platform, which is open to others who want to share and publish data. The Vault's public section can be reached through the public access area of the CDD website.
Additionally, the structures available in CDD public data are also visible in PubChem, with links back to the bioactivity data in CDD public. CDD continues to add and expand the publicly available data that exists within CDD Vault.
The value of large, open-source databases to society is huge and, as a scientific community, we are constantly improving the state of these databases.
The large, highly curated chEMBL database and the massive PubChem repository for chemical and biological data are two more examples of public drug discovery databases that researchers can access.
It would be great to get to a point where all competitive drug discovery information can be made available so that those who are trying to discover, whether it be the citizen scientist or a large multinational drug discovery company, can access it.
As we move toward this end goal of more open and accessible information, it’s important to optimize systems for both computers and humans to be able to access the data.
As published in the Journal of Cheminformatics, the most significant immediate beneficiaries of open data are chemical algorithms, which are capable of absorbing data and presenting concise insights to working chemists on a scale that could not be achieved by traditional publication methods.
But achieving the benefits of these digital chemical algorithms, which can synthesize and present data insights quickly, will require a paradigm shift in the way individual scientists translate their data into digital form.
Currently, most scientists enter their data in a way that is designed for presentation to humans rather than consumption by machine learning algorithms. Scientists have to add extra annotation to text and figures to make this data consumable by algorithms, and the effort this takes is off-putting.
One solution to this issue, published by CDD, is a hybrid system that combines machine learning based on natural language processing with a simplified user interface designed to help scientists curate their data with minimum effort.
Removing the barrier for scientists to record their data in a way that data algorithms can interpret is a first step toward creating a massive and open searchable database that can be accessed by scientists everywhere.
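To make the gap between human-oriented and machine-readable data concrete, here is a minimal sketch in Python. It is not the hybrid system described above; it is only a simple regular expression that turns a free-text sentence into a structured record. The names Measurement and parse_measurement, the list of endpoints, and the sample sentence are all assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Factors for converting common concentration units to nanomolar.
UNIT_TO_NM = {"pM": 1e-3, "nM": 1.0, "uM": 1e3, "mM": 1e6}

# Matches phrases such as "IC50 = 12 nM" or "IC50 of 2.5 uM".
# Illustrative only: real assay text is far more varied than this.
PATTERN = re.compile(
    r"\b(IC50|EC50|Ki|Kd)\s*(?:=|of)\s*([0-9]*\.?[0-9]+)\s*(pM|nM|uM|mM)\b"
)


@dataclass
class Measurement:
    """Hypothetical structured form of one reported activity value."""
    endpoint: str    # e.g. "IC50"
    value_nm: float  # value normalised to nanomolar


def parse_measurement(text: str) -> Optional[Measurement]:
    """Return the first activity value found in the text, or None."""
    match = PATTERN.search(text)
    if match is None:
        return None
    endpoint, value, unit = match.groups()
    return Measurement(endpoint, float(value) * UNIT_TO_NM[unit])


if __name__ == "__main__":
    sentence = "Compound 7 showed an IC50 of 2.5 uM against the kinase."
    print(parse_measurement(sentence))
```

A sentence written for a human reader carries the same facts as the structured record, but only the record can be searched, compared, and aggregated by software without further interpretation.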
As a society, it would be very beneficial to capture all that biocuration and make sure that we're constantly building on top of it, as opposed to reinventing the wheel or doing the same thing over and over again.
Here, we discuss how researchers can make the most of the databases that are currently available, and the type of information that can be found in both chEMBL and PubChem.
Access to a vast knowledge base of information so you know you aren’t “reinventing the wheel”
PubChem is best known as an archive, but it's also a knowledge base.
PubChem is helpful for those researchers who are interested, not just in chemical information, but also in the known biological activities of a particular compound.
The data in PubChem is integrated with the entire scope of what researchers are looking for: genes, genomes, and literature, as well as physical properties and toxicity.
Literature, basic chemical data, and biological data for a huge base of chemicals are all searchable via PubChem.
chEMBL is a manually curated database of bioactive molecules.
Many of these are related to drug discovery, which is where much of the data originates.
But, there are other types of bioactive molecules in the database as well, including small molecules, peptides, and therapeutic antibodies.
In chEMBL, there is a certain amount of bespoke curation, to enhance the quality of the information within the database.
Additionally, chEMBL shares data with resources such as PubChem.
The sharing and accessibility of data are key.
Otherwise, researchers end up in the situation where they keep reinventing the wheel and wasting time, instead of making more discoveries in an expedited fashion.
Find answers to your questions by searching for data about specific biological targets
chEMBL contains data about molecules and their activities against biological targets.
So, within the chEMBL database, it’s possible to ask a question such as, “show me all of the bioactive molecules you have against this particular protein, or indeed against this family of proteins.”
Then, drawing on the data about those proteins, which is often reported in publications and other documents, chEMBL will display the requested information.
That allows follow-up questions to be asked about those molecules.
A researcher can ask questions such as, “What are the selectivities of the targets? What is the status of these molecules? Are they in clinical trials? Are they marketed drugs?”
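As a rough sketch of how such questions could be asked programmatically, the Python below builds request URLs for the public chEMBL web services. The function names and the target identifier are placeholders chosen for illustration, and the code only constructs URLs; it does not contact the service. It assumes the activity and molecule resources accept the filters shown, so check the current chEMBL API documentation before relying on it.

```python
from urllib.parse import urlencode

# Base address of the chEMBL REST web services.
CHEMBL_API = "https://www.ebi.ac.uk/chembl/api/data"


def activities_for_target_url(target_chembl_id: str,
                              standard_type: str = "IC50",
                              limit: int = 20) -> str:
    """Build a URL asking for activities recorded against one target."""
    params = {
        "target_chembl_id": target_chembl_id,
        "standard_type": standard_type,
        "limit": limit,
    }
    return f"{CHEMBL_API}/activity.json?{urlencode(params)}"


def molecule_url(molecule_chembl_id: str) -> str:
    """Build a URL for one molecule record.

    The returned record is expected to include a max_phase field,
    which speaks to the clinical or marketed status of the molecule.
    """
    return f"{CHEMBL_API}/molecule/{molecule_chembl_id}.json"


if __name__ == "__main__":
    # "CHEMBL203" is an example identifier; substitute the target of interest.
    print(activities_for_target_url("CHEMBL203"))
    print(molecule_url("CHEMBL25"))
```

Fetching the first URL with any HTTP client would return a page of activity records, and each record's molecule identifier can then be passed to the second function for the follow-up questions about status.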
In PubChem, researchers can use the data view to gather information about a biological target.
If a researcher is interested in a particular target, there's no sense in having to look at thousands of different sets of experiments that have been run.
Instead, with data view, a single page aggregates everything into one document, and that content can be downloaded.
It becomes more actionable.
Additionally, instead of just viewing one possible target, with aggregator-based pages it’s possible to look at a set of targets.
For example, a researcher may be interested not in a single GPCR but in all GPCRs, or not in a single potassium ion channel receptor but in all potassium ion channel receptors.
The aggregator pages allow for these broad questions to be asked.
This gives researchers access to the content they need, which they can download and then use in their own research.
Search a huge number of molecules and substructures all in one place
In addition to searching these large databases by biological target, it’s also possible to run molecule queries.
For example, in chEMBL it’s possible to search for a particular compound of interest or to perform substructure-based queries.
Once the compounds of interest are identified, a researcher can retrieve additional bioactivity data or other information they want.
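A substructure-based query of this kind could be sketched as follows. The helper name substructure_url is hypothetical, and the code assumes that the chEMBL web services expose a substructure resource that takes a SMILES string in the URL path. It only builds the request URL. Percent-encoding the SMILES is a precaution for characters such as parentheses; some complex SMILES may need a different request style, so treat this as a starting point.

```python
from urllib.parse import quote

# Base address of the chEMBL REST web services.
CHEMBL_API = "https://www.ebi.ac.uk/chembl/api/data"


def substructure_url(smiles: str, limit: int = 20) -> str:
    """Build a URL asking for molecules that contain the given substructure."""
    # safe="" makes quote() encode every reserved character in the SMILES.
    encoded = quote(smiles, safe="")
    return f"{CHEMBL_API}/substructure/{encoded}.json?limit={limit}"


if __name__ == "__main__":
    # Acetic acid fragment, written as SMILES, used purely as an example.
    print(substructure_url("CC(=O)O"))
```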
What’s key is that it’s all integrated into this core resource.
Within PubChem, there are about 95 million small molecule chemicals integrated inside the system and almost 30 million chemicals that have some degree of annotation.
When looking up a molecule in PubChem, the database will summarize pretty much everything that's known about a particular chemical, because it’s integrated with a number of other knowledge bases.
PubChem has over 600 contributors of content.
PubChem is focused on putting all the data in one spot, so that researchers can look at the available information in a single place rather than having to navigate tens of different sites.
Plus, researchers can see exactly where that content came from and then link back to that other website if interested in getting more information.
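For readers who prefer to look up a molecule from a script, here is a minimal sketch that builds a request URL for PubChem's PUG REST interface. The function name compound_properties_url and the default property list are choices made for this example, and the code does not contact PubChem; it only shows the shape of a name-based property lookup.

```python
from urllib.parse import quote
from typing import Sequence

# Base address of the PubChem PUG REST interface.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_properties_url(
    name: str,
    properties: Sequence[str] = ("MolecularFormula", "MolecularWeight"),
) -> str:
    """Build a URL asking PubChem for selected properties of a named compound."""
    # Encode the name so that spaces and other reserved characters are safe.
    encoded_name = quote(name, safe="")
    property_list = ",".join(properties)
    return f"{PUG_REST}/compound/name/{encoded_name}/property/{property_list}/JSON"


if __name__ == "__main__":
    print(compound_properties_url("aspirin"))
```

Requesting that URL with any HTTP client should return a small JSON document with the requested properties, which can then be combined with data from other sources.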
How do we get to the point where we can use these high-powered tools and large databases to answer questions through open data? It’s a lot of work to get there, and there are tough questions and current barriers that we have to work through to ensure future success in this area. Being able to innovate and speed up this kind of drug discovery is one of our goals, so it’s important that we continue to drive forward in making drug discovery data more widely available.
This blog is authored by members of the CDD Vault community. CDD Vault is a hosted drug discovery informatics platform that securely manages both private and external biological and chemical data. It provides core functionality including chemical registration, structure activity relationship, chemical inventory, and electronic lab notebook capabilities.