It’s unknown how information will be accessible or useful in the future.
So, how do we prepare to ensure the long-term value of open source data? Or, how do we get the most value out of the open source data that is available today?
Well, first there is data science.
Data science needs huge amounts of information in order to work, so more and more information needs to be available for the repositories to gather and distribute.
There also needs to be high-quality information, which highlights the need for curation.
Content can be curated based on the science, or curated as a function of time. For example, if you think of an experiment that was run back in the 1980s, would you trust that experiment or would you rerun it today?
Science is constantly evolving, so having high-quality information made available will help ensure the value of open source data well into the future.
The major value of open source data that still needs to be fully realized is the possibility of having everything that happened before available to you, so that it can fuel discoveries going forward at a faster rate than was ever possible before.
It would be very, very helpful, but we need appropriate metadata.
We need high-quality information for that process.
We need biocuration to make that happen, and we need to pull it all together.
ChEMBL's key project for a number of months now has been a completely redesigned web interface that is better for users.
Along with the interface redesign, there are a number of behind-the-scenes changes relating to the way the data is set up.
Then, looking more broadly, there are questions about the different types of data that could be incorporated into ChEMBL.
For example, recent exploration is looking at how bioactivity data from patents might be extracted and added to the ChEMBL database.
Additionally, as experimental platforms change the types of data and the scale at which data is generated, the way data is stored and found will need to change as well.
It’s possible that all this will then lead to new ways in which these data feed into databases, including the applications of AI and machine learning.
If you think about the dynamic landscape of what's happening with data science around web technologies and beyond, it's really a very, very interesting time and the next 3-5 years will be even more interesting.
At PubChem, there are oftentimes tens of thousands, or hundreds of thousands, of literature links to a single chemical. Figuring out how this massive amount of information can best be summarized is one focus of future changes at PubChem.
One near-future change being implemented in PubChem to address this is a view called co-occurrence, where you can find other chemicals that are often mentioned together with a given chemical.
It will also be possible to view diseases related to a chemical (diseases it may treat or cause), to give you some sense of the types of disease that are commonly associated with that chemical. A similar co-occurrence view will also be available for genes and proteins.
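To make the co-occurrence idea concrete, here is a minimal sketch in Python. It is not PubChem's implementation, which is not described here; it simply counts how often other chemicals are mentioned in the same article as a query chemical. The toy data, the `ARTICLE_MENTIONS` name, and the `co_occurring` function are all hypothetical and exist only for illustration.

```python
from collections import Counter

# Hypothetical toy data (an assumption, not real PubChem content):
# article ID -> set of chemicals mentioned in that article.
ARTICLE_MENTIONS = {
    "article-1": {"aspirin", "ibuprofen", "caffeine"},
    "article-2": {"aspirin", "ibuprofen"},
    "article-3": {"aspirin", "warfarin"},
    "article-4": {"caffeine", "theobromine"},
}


def co_occurring(query, mentions, top_n=3):
    """Return up to top_n (chemical, count) pairs for chemicals that are
    mentioned in the same articles as the query chemical."""
    counts = Counter()
    for chemicals in mentions.values():
        if query in chemicals:
            # Count every other chemical mentioned alongside the query.
            counts.update(chemicals - {query})
    # Sort by descending count, then by name, so ties are ordered predictably.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    for chemical, count in co_occurring("aspirin", ARTICLE_MENTIONS):
        print(f"{chemical}: mentioned together in {count} article(s)")
```

The same counting approach would carry over to a disease or gene view by swapping the sets of chemicals for sets of diseases or genes, although a production system working across hundreds of thousands of literature links would likely need pre-computed counts rather than a scan like this.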
The thought here is that a researcher could ask questions relative to a given disease and find out the types of information that we know about that disease, relative to PubChem.
Then, that person would be able to investigate the bioactivities relative to this disease, the other genes and targets associated with this disease, the chemicals that may treat or cause this disease, and all the articles that back this information up.
The idea is to stitch together the fabric and the ecosystem of all available data, including where each piece originates.
As you start to think about the ecosystem that chemists, biologists, drug discovery scientists, pharmacologists, toxicologists, and environmental scientists all care about, the next step is to wrap it all together in a package that they can then access and download.
In the future, there will be a shift away from analysis-type tools and toward data views and pre-computed information that are in line with what users are hoping for, or trying to find out, because there is just too much content for a human to figure this out anymore. Using data science approaches can make things a little more obvious to the interactive user.
The future looks positive. As more information becomes available and more metadata is made available, everybody wins. These and other changes will allow researchers to access that content and do more with it.
As long as researchers can find what they need, we all win because we make more and better discoveries, faster.
As technology advances and scientists are able to produce more and more data at faster and faster rates, the accessibility of that data becomes paramount. Questions about what is needed to sustain the future of open source data and how to improve the usability of data are key.
This blog is authored by members of the CDD Vault community. CDD Vault is a hosted drug discovery informatics platform that securely manages both private and external biological and chemical data. It provides core functionality including chemical registration, structure-activity relationship, chemical inventory, and electronic lab notebook capabilities.