Review
Informatics
Assessing the public landscape of clinical-stage pharmaceuticals through freely available online databases
Introduction
The recent rise in the number of drug discovery databases has prompted interest in analyzing the usage and adoption of each database and the interconnections and cross-fertilization that exist between them. Databases focusing on targets, proteins, metabolism and active pharmaceutical ingredients (APIs) have proliferated, particularly over the past 5 years. These databases are crucial resources for in silico drug discovery, for prioritizing repurposing opportunities and for identifying trends in the drug development enterprise. Drug repurposing has become increasingly attractive in recent years as a less expensive option with lower barriers to approval than traditional drug discovery and development. Therefore, it is of interest to assess the current public landscape of approved APIs and APIs that have been in clinical trials.
Assessing the public landscape of clinical-stage APIs involves comparison of various biopharmaceutical databases, which is a notoriously challenging task. Yonchev et al. [1] reported on the redundancies in PubChem and ChEMBL. The authors pointed out specific aspects that made comparing the two databases difficult. For example, in PubChem the same API can have multiple compound records. Differences in terminology and in the ways the databases are structured also impede straightforward comparison of database content. Moreover, owing to the compound submission model of PubChem, a researcher runs the risk of extracting duplicate compounds. Southan et al. compared chemical structures in ChEMBL, DrugBank, Human Metabolome Database and the Therapeutic Target Database in 2013 [2]. In 2018, Southan reviewed the largest chemical structure databases: PubChem, ChemSpider and UniChem, which contain 95, 63 and 154 million chemical structure records, respectively [3]. Here, Southan examined the databases’ contributing sources and found that sources common among databases could have substantial differences in chemical structure count. Fourches et al. [4] provided guidance on reviewing and comparing chemogenomic datasets, with suggestions for how to curate and clean chemical datasets, and discussed the importance of properly cleaning such datasets, including removing duplicates. Ambiguity in and across databases can confound efforts to model and analyze data.
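The duplicate-record problem described above can be illustrated with a minimal Python sketch. It assumes each record carries a standard InChIKey and groups records by the first 14-character block, which hashes the connectivity of the structure. This is only one possible deduplication key and is not necessarily the approach used by the cited authors or in this review. The function names, record identifiers and keys below are hypothetical placeholders, not real compounds.

```python
from collections import defaultdict


def connectivity_key(inchikey: str) -> str:
    """Return the first block of an InChIKey (the connectivity hash)."""
    return inchikey.strip().upper().split("-")[0]


def group_duplicates(records):
    """Group record IDs that share the same connectivity block.

    `records` is an iterable of (record_id, inchikey) pairs.
    """
    groups = defaultdict(list)
    for record_id, inchikey in records:
        groups[connectivity_key(inchikey)].append(record_id)
    return dict(groups)


# Placeholder records: the keys are synthetic and only mimic the InChIKey layout.
records = [
    ("rec-1", "AAAAAAAAAAAAAA-UHFFFAOYSA-N"),
    ("rec-2", "AAAAAAAAAAAAAA-BBBBBBBBSA-N"),  # same skeleton, different second block
    ("rec-3", "CCCCCCCCCCCCCC-UHFFFAOYSA-N"),
]

groups = group_duplicates(records)
# Keep only the groups that contain more than one record.
duplicates = {key: ids for key, ids in groups.items() if len(ids) > 1}
print(duplicates)
```

Grouping on the first block merges stereoisomers of the same skeleton, which may or may not be desirable; grouping on the full key would be stricter.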
Although not immediately obvious, one fact emerging from our studies was that each database has a distinct emphasis and target audience. Chemistry-based databases, such as PubChem and ChEMBL, contain large-scale record counts of compounds with potential medicinal uses. Other databases, such as DrugBank, focus on unique APIs, most of which convey some evidence of clinical interest. Several open databases specialize in specific subsets of drugs, such as approved or withdrawn medicines. In this review, we have selected databases that meet the following criteria: (i) they are public and freely available with downloadable data; (ii) they are compound oriented and contain clinically tested compounds; (iii) they have at least one peer-reviewed publication describing the content and construction of the database. The databases meeting these criteria are: ChEMBL [5], CRIB NME [6], DrugBank [7], DrugCentral [8], PubChem [9], repoDB [10], SuperDrug2 [11] and WITHDRAWN [12]. Additional databases are undoubtedly available through commercial sources on a subscription basis or as the result of extensive competitive intelligence, but these are not freely available and are therefore not included in our present analysis.
For the work herein, a short summary of each selected database is provided. The usage and adoption of each database are discussed by analyzing peer-reviewed publications citing each database. The relationships among the databases are examined according to the sources used for construction as well as the overlap in clinical-stage drug compounds in each database. Finally, we summarize the current number of pharmaceuticals in the public domain with evidence of in-human experience. Raw data files and code for this review can be found at: https://github.com/WUSTL-CRIB/clinical_databases_review.
Overview of selected databases
The past decade has witnessed an increase in publications citing public drug databases. In Fig. 1, we review the largest and most cited databases: PubChem, ChEMBL and DrugBank. Whereas there were only two mentions of PubChem, ChEMBL or DrugBank in the literature in 2004, the field began to grow soon thereafter. By 2010, the rate of annual citations was nearing 500 and would more than double within 2 years. The rate of new citations has continued to climb since then. At
Usage and adoption
To quantify the adoption and usage of each database, peer-reviewed publications were extracted from the Elsevier Scopus literature database. For the top-three most-cited drug databases (ChEMBL, DrugBank and PubChem) the database name itself was used as the search term to obtain a set of publications that cite the database. For the remaining databases, articles citing the main manuscript containing the database description were included for analysis.
Table 3 summarizes the adoption of the
Source analysis
We examined the relationships between sources used to create open drug databases along with external links to other databases to assess their interconnectivity. The sources of each database were compiled from publications describing the content and construction of the database. Online descriptions of data sources from the websites hosting each database were also included in this evaluation. A network graph of all sources and external links and their shared connections to each database is shown
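The source comparison described in this section can be sketched in Python as a bipartite mapping between databases and their sources. This is a minimal illustration under stated assumptions, not the method used in the review: the database and source names (`db_A`, `source_1` and so on) and both function names are hypothetical placeholders and do not describe any real database.

```python
from collections import defaultdict


def shared_sources(db_sources):
    """Return the sources used by more than one database.

    `db_sources` maps a database name to the set of its sources.
    """
    source_to_dbs = defaultdict(set)
    for db, sources in db_sources.items():
        for source in sources:
            source_to_dbs[source].add(db)
    return {s: sorted(dbs) for s, dbs in source_to_dbs.items() if len(dbs) > 1}


def edge_list(db_sources):
    """Flatten the mapping into (database, source) edges for a network graph."""
    return sorted((db, s) for db, sources in db_sources.items() for s in sources)


# Placeholder input; the names are illustrative only.
db_sources = {
    "db_A": {"source_1", "source_2"},
    "db_B": {"source_2", "source_3"},
    "db_C": {"source_2"},
}

shared = shared_sources(db_sources)
edges = edge_list(db_sources)
print(shared)
print(edges)
```

The edge list can be passed to any graph library or plotting tool to draw the network of databases and sources.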
Content analysis
To assess the uniqueness and overlap of each drug database, compounds with evidence of clinical testing were extracted from each database. Evidence of clinical testing included links to any clinical trial information or records labeled as ‘approved’ or ‘withdrawn’ in any database. Data formats varied among the numerous data sources; therefore, general coding ability in languages such as Python, Bash and Structured Query Language (SQL) was required to access and thoroughly evaluate all data
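Once each database has been reduced to a set of normalized compound identifiers, the uniqueness and overlap analysis comes down to set operations. The Python sketch below is a minimal illustration of that idea, not the authors' code. The function name `overlap_summary`, the database names and the compound identifiers are hypothetical placeholders.

```python
from itertools import combinations


def overlap_summary(db_compounds):
    """Summarize overlap among databases.

    `db_compounds` maps a database name to a set of compound identifiers.
    Returns (all compounds, compounds unique to each database,
    pairwise overlap counts).
    """
    all_ids = set().union(*db_compounds.values())
    unique_to = {}
    for db, ids in db_compounds.items():
        # Union of every other database's compounds.
        others = set().union(*(v for k, v in db_compounds.items() if k != db))
        unique_to[db] = ids - others
    pairwise = {
        (a, b): len(db_compounds[a] & db_compounds[b])
        for a, b in combinations(sorted(db_compounds), 2)
    }
    return all_ids, unique_to, pairwise


# Placeholder input; the identifiers are illustrative only.
db_compounds = {
    "db_A": {"c1", "c2", "c3"},
    "db_B": {"c2", "c3", "c4"},
    "db_C": {"c3", "c5"},
}

all_ids, unique_to, pairwise = overlap_summary(db_compounds)
print(len(all_ids), unique_to, pairwise)
```

The result depends entirely on how identifiers are normalized before comparison, so the same compound must map to the same identifier in every database.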
Discussion
Barriers to entry can impede the usefulness of compound databases for certain analytical efforts. For example, selecting drug compounds that have been through clinical testing is not a straightforward task in large compound databases (such as PubChem and ChEMBL). A handful of databases contain clinical-stage drugs but the extent to which clinical trial data are linked is limited. It is unclear, and very difficult to ascertain, which compounds are unique to a particular source because many of
Concluding remarks
In this study, we compiled >11 700 unique APIs with evidence of clinical testing. Online drug databases are crucial expeditors of in silico drug discovery and chemoinformatics, especially in academic, biotechnology startup and non-profit environments where high-cost subscription databases are undesirable. However, uncertainty about which compounds overlap and which are unique across common drug databases impedes comprehensive analysis.
Acknowledgments
Research reported in this publication was supported by the Washington University Institute of Clinical and Translational Sciences grant UL1TR002345, subaward TL1TR002344, from the National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health (NIH). The content is solely the responsibility of the authors and does not necessarily represent the official view of the NIH.
References (38)
Redundancy in two major compound databases. Drug Discov. Today (2018)
An overview of FDA-approved new molecular entities: 1827–2013. Drug Discov. Today (2014)
Oral druggable space beyond the rule of 5: insights from drugs and clinical candidates. Chem. Biol. (2014)
Dosing-time makes the poison: circadian regulation and pharmacotherapy. Trends Mol. Med. (2016)
Separating chemotherapy-related developmental neurotoxicity from cytotoxicity in monolayer and neurosphere cultures of human fetal brain cells. Toxicol. Vitr. (2016)
Comparing the chemical structure and protein content of ChEMBL, DrugBank, human metabolome database and the therapeutic target database. Mol. Inf. (2013)
Caveat Usor: assessing differences between major chemistry databases. ChemMedChem (2018)
Trust, but verify: on the importance of chemical structure curation in cheminformatics and QSAR modeling research. J. Chem. Inf. Model. (2010)
The ChEMBL database in 2017. Nucleic Acids Res. (2016)
DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res. (2017)