Thomas Sander heads the drug discovery informatics department at Idorsia Pharmaceuticals Ltd. and leads the team behind DataWarrior and the OpenMolecules platform.
DataWarrior* and the OpenMolecules.org suite were created to provide a utilitarian platform of cheminformatics tools for synthetic and medicinal chemists. Dr. Sander kindly agreed to give us this interview at the Idorsia headquarters in Basel, Switzerland.
Asking the questions from CDD are Neil Chapman and Mariana Vaschetto.
1. Thomas, before we start to talk about DataWarrior, tell me a little bit about your career to date.
By education I am an organic chemist. During my seventh year at school we started to have chemistry classes and soon I had made up my mind to study chemistry. Four years later, while still at school, I had an opportunity to access the local university's Tektronix graphics computers. I was very intrigued by the computer world and started to learn some programming. Very soon I owned my own simple 8-bit computer; these had just started to become affordable for people like me. Two years later I started studying chemistry in Marburg, and during the years to come I never lost interest in computer programming. Later, when I was working on my diploma in organic chemistry, a friend and I founded a chess database software company. For the next three years my attention was divided between chemistry and software development. After finishing my PhD thesis in organic chemistry I wanted to combine software engineering and chemistry. I left the company and spent one post-doc year with Prof. J. B. Hendrickson at Brandeis University in Waltham, Massachusetts, USA, where I developed a fast and interactive reaction search system. Then, in 1993, I joined a small team at Roche in Basel to develop software for drug discovery. Five years later I left Roche and joined the recently founded start-up Actelion to build up the drug discovery informatics environment. When Actelion was taken over by Johnson & Johnson in 2017, Actelion's former drug discovery department, along with some clinical development and service staff, was split off as a new, reasonably well funded company: Idorsia Pharmaceuticals.
2. What is your role within Idorsia Pharmaceuticals?
Currently, I am leading the 'Scientific Computing' group within drug discovery, which develops algorithms and software to make use of the wealth of internal and external data related to drug discovery.
3. Idorsia Pharmaceuticals is a relatively new company that basically split from Actelion Pharmaceuticals in the first half of 2017. Tell me a little about that and if it has changed software development within your group.
At Actelion we had been a team of 12 people, of which 9 were actively developing scientific software that covered most of the drug discovery processes. Roughly, the software fell into a number of categories: equipment management, bio-sample management, compound management, chemical and biological data acquisition, electronic notebooks, analytics, high-throughput screening, automated image analysis, chem- and bioinformatics, data visualisation, etc. We also maintained most of the database and application servers driving the software landscape. After the de-merger, Idorsia's drug discovery department continues to function as it had before at Actelion. However, for our team there was a slight change: in order to free some of our resources to focus on more scientific aspects, we moved the responsibility for routine application development and maintenance to our colleagues in the global IT department. This involved about half of our production systems.
4. Can you tell me about the background to DataWarrior? Why was it developed, how was it developed?
The DataWarrior story started in the year 2002, when Actelion was still a very young company. We had built an Oracle-based drug discovery database containing experimental in-house data including chemical structures, batch information, research projects, biological assays and their results. We had also installed nightly processes that would extract, for every scientific project, all related chemical structures and biological results into a project-specific ChemFinder database. These allowed project members to relate structural features to assay results. However, we missed proper data visualisation functionality combined with cheminformatics algorithms. In order to provide such functionality, we first looked into Spotfire as a potential solution. However, its prohibitive pricing at that time, its limitation to Windows, and the technical difficulty of extending it with cheminformatics functionality finally drove us to a different approach. We decided to develop our own solution in the Java programming language. Within four weeks we had a prototype with zoomable, Cartesian 2D- and 3D-views, a structure grid view and row filters on alphanumerical cells as well as on chemical structures. This could be done so quickly because we had already developed a cheminformatics toolkit in Java, which provided substructure search and a descriptor-based similarity search. The 3D-view was built on the Jmol 3D-graphics engine.
5. DataWarrior is available as a free download. What drove the decision to provide it at no cost?
DataWarrior is closely connected to the underlying cheminformatics toolkit, which we had earlier released as the open-source project 'OpenChemLib'. This release was motivated by short-term and long-term reasons. In the short term, we engaged in various collaborations with universities where our toolkit's source code provided the cheminformatics foundation; an open-source platform was often a precondition for our academic partners. One example is the chemical structure search on all Wikipedia molecules, a joint activity involving Peter Ertl (Novartis), Luc Patiny (EPFL) and ourselves.
I consider the long-term goal even more important. During the last two decades a couple of open-source cheminformatics platforms were established and gained momentum from the support of a growing community. It would only be a matter of time before some open platform would outpace any Actelion-internal development activities on our proprietary platform. By then we would be forced to replace our outdated engine. Effectively, that would mean replacing any chemistry software built on the original platform. In order to prevent such a scenario, our only hope was to establish our toolkit as one of multiple standards and to get external people on board. Since we were already late when we released OpenChemLib, we needed a way to advertise it. We considered DataWarrior to be our best option for advertising its underlying cheminformatics toolkit.
There is also another answer to it. We are a pharmaceutical company and not a software company. It is just not our business to provide professional support and run a software sales force. Additionally, being part of Idorsia's scientific drug discovery community, we are asked to publish and to increase the reputation of the company. For software engineers, publishing means publishing source code. Idorsia builds many of its scientific applications in-house, so publishing open-source projects gets this message out and helps to attract top scientific software engineers.
6. DataWarrior is a very popular program and it is able to interface with a number of databases, including CDD Vault. Tell me a little about the interfacing.
DataWarrior's access to the ChEMBL database and to the Crystallography Open Database (COD) is implemented through plain HTTP requests to the respective servers. All query options, which include substructure and similarity queries, are encoded as text strings and sent to the server. Both servers are pure Java-based HTTP servers built on top of the SimpleFramework, which provides a multithreaded communication engine as a lean jar file. Both servers keep their entire databases in memory and distribute every structure search request across all cores of the server hardware. When chemical structures are returned as results, they are encoded as OpenChemLib ID codes, minimising the network traffic. Retrieving the Wikipedia compounds is much simpler. Once a day the server generates a complete new list of all chemical structures known to Wikipedia. DataWarrior then downloads the entire list, also as ID codes. The source code to access ChEMBL, COD and Wikipedia is part of the DataWarrior source code, and this functionality is part of the public DataWarrior installation.
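As an editorial illustration of the pattern described above (query options packed into a text string, then an in-memory collection searched in parallel slices), here is a minimal Python sketch. It is not the DataWarrior or server code, which is written in Java. The parameter names `type` and `fragment`, the toy records, and the substring test standing in for a real substructure match are all assumptions made for this example. Note also that CPython threads do not give true multi-core speed-up for CPU-bound work; the sketch only shows how the work is partitioned.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from urllib.parse import parse_qs, urlencode

# Toy in-memory "database" of (id, structure string) records.
# Hypothetical data: a real server would hold parsed molecules instead.
DATABASE = [
    (1, "CCO"),
    (2, "CC(=O)O"),
    (3, "c1ccccc1"),
    (4, "CCOC(=O)C"),
    (5, "CCN"),
]


def encode_query(search_type, fragment):
    """Client side: pack the query options into one text string."""
    return urlencode({"type": search_type, "fragment": fragment})


def decode_query(query_string):
    """Server side: unpack the text string into a dict of single values."""
    return {key: values[0] for key, values in parse_qs(query_string).items()}


def matches(structure, fragment):
    """Stand-in for a real substructure match (assumption: substring test)."""
    return fragment in structure


def search_chunk(chunk, fragment):
    """Search one slice of the in-memory records and return matching ids."""
    return [rid for rid, structure in chunk if matches(structure, fragment)]


def search(query_string, workers=None):
    """Decode the query, split the records into slices, search them in parallel."""
    query = decode_query(query_string)
    if query.get("type") != "substructure":
        raise ValueError("this sketch only handles substructure queries")
    workers = workers or os.cpu_count() or 1
    # One interleaved slice of the records per worker.
    chunks = [DATABASE[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(lambda c: search_chunk(c, query["fragment"]), chunks)
        return sorted(rid for hits in partial for rid in hits)
```

For example, `search(encode_query("substructure", "C(=O)"))` returns the ids of the two toy records whose strings contain that fragment.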
The CDD Vault access is done in a different way. In order to simplify the development of additional modules for accessing any alphanumerical or structural databases, DataWarrior has a plugin interface. Independently of the DataWarrior source code, this interface allows development of a plugin that opens a dialog to define alphanumerical and structural query conditions. These can then be sent to some kind of database, and the returned result may then be processed to populate a new DataWarrior table. All Java code that makes up a plugin is compiled into an independent jar file and put into the plugin folder of the DataWarrior installation. When DataWarrior is started, it checks for files in this folder and displays a menu item for every plugin. When the user selects that item, DataWarrior hands control to the plugin until it creates and populates a new DataWarrior window. The CDD Vault plugin uses this mechanism to retrieve and display the result of a CDD query. The CDD Vault plugin is an open-source project on GitHub and is maintained by CDD staff.
7. What is the current situation with DataWarrior? Are there plans to enhance it further?
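The start-up step described above (check a folder for jar files, offer one menu item per plugin) can be sketched in a few lines. This is an editorial illustration in Python, not DataWarrior's actual Java implementation. The function name `discover_plugins` and the convention of deriving the menu label from the file name are assumptions made for this example; the real application may take the label from the plugin itself.

```python
from pathlib import Path


def discover_plugins(plugin_dir):
    """Return (menu label, jar path) pairs for every .jar file in plugin_dir.

    Hypothetical convention: the menu label is the file name without ".jar".
    """
    folder = Path(plugin_dir)
    if not folder.is_dir():
        # No plugin folder means no plugin menu items.
        return []
    jars = sorted(
        p for p in folder.iterdir()
        if p.is_file() and p.suffix.lower() == ".jar"
    )
    return [(jar.stem, jar) for jar in jars]
```

A host application would then create one menu entry per returned pair and load the corresponding jar only when the user selects it.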
Certainly. I am fully committed to extending DataWarrior's functionality to meet upcoming needs. Some of the ideas include access to a commercial chemicals database, bioisostere replacement functionality with force field minimisation and consideration of synthetic feasibility, better reaction support, more graphical view options, and more flexible macro support with branching and variables. Unfortunately, our resources are very limited, so we need to compromise. In the past I often had to postpone bigger ideas for the sake of fixing small issues or streamlining existing functionality.
8. Do you have plans to develop additional software programs for external use?
In fact we maintain two other open-source software projects, "Orbit Image Analysis" and "Spirit Biobank". In addition, we consider publishing a new project in the field of Next Generation Sequencing.
9. What are the types of interesting scientific questions that can be asked with your software? What historical insights have they provided? What new types of problems can the software be applied to in the future?
I assume this question refers to our internally built drug discovery software. To be honest, I believe that the biggest impact on the drug discovery process has come from making many simple and some more complex tools that just enable a smooth workflow. Examples are a small tool to reserve a time slot on the NMR, a chemicals inventory that automatically places orders in the SAP system, and a chemical notebook with an embedded NMR viewer and a seamless connection to the chemicals inventory. The value of a software platform depends not only on which features are available but also on how easy it is to use these features and how well they are integrated. For instance, when browsing biological assay results, the associated IC50 curves, HCS images, or all compounds in the same experiment should be available with a mouse click. DataWarrior's macro functionality has also proved to be very useful: with it, expert users can define complex workflows which the less experienced can repeatedly run on updated data.
But you were asking for the more exciting scientific features of our software, probably the ones in the field of big data and machine learning. For instance, we run a server with about a quarter of a billion compounds in memory that can be substructure- or similarity-searched within a few seconds. We also use it for virtual screening with pharmacophore searches. We do natural language processing on PubMed abstracts to learn about gene-disease relationships. We further relate genes to compounds that are reported to be active in the respective targets. We also use an advanced imaging platform to handle, navigate, classify and process image content. We use a computing grid for pharmacophore searches, image processing, and ligand-protein docking. For the future we are just in the process of defining priorities. There is a strong interest in synthesis planning, bioisostere replacement and possibly in augmented reality to support discussions around ligand and target structures.
10. What are the outstanding technical challenges in cheminformatics that if solved would have the most impact on drug discovery?
If one could reliably predict biological activities, toxicity and pharmacological properties of a compound directly from its chemical structure, this would, of course, revolutionise the drug discovery process. However, despite the enormous hype about machine learning, I personally don't believe that we will see rapid progress in this field. We don't have much training data, chemical structures are not a proper input format for these methods, and we still have a limited understanding of the biochemical processes involved.
For me an overdue challenge is to improve the underlying concepts for molecular modelling. Molecular mechanics based force fields have not changed much during 30 years, while computing performance rose roughly by a factor of one million. Recent papers by Adrian Roitberg or Anatole von Lilienfeld seem to suggest that it should be possible to use machine learning techniques for the calculation of molecular energies and forces. These methods promise to reach an accuracy comparable to quantum mechanical methods while being almost as fast as conventional force fields. If, in addition, we could solve the problem of the influence of water, we would be a big step further.
*DataWarrior is a free cheminformatics program for data visualization and analysis. It combines dynamic graphical views and interactive row filtering with chemical intelligence. Scatter plots, box plots, bar charts and pie charts are used to visualize numerical and categorical data, and demonstrate trends across multiple scaffolds and compound substitution patterns.
DataWarrior is currently used in over one hundred countries, with a user base that is growing by approximately one thousand users per month.
For more details or to download DataWarrior go to www.openmolecules.org.
Please visit our blog post for the DataWarrior and CDD Vault integration.
This blog is authored by members of the CDD Vault community. CDD Vault is a hosted drug discovery informatics platform that securely manages both private and external biological and chemical data. It provides core functionality including chemical registration, structure activity relationship, chemical inventory, and electronic lab notebook capabilities.