Elsevier

Drug Discovery Today

Volume 24, Issue 4, April 2019, Pages 1010-1016

Review
Informatics
Assessing the public landscape of clinical-stage pharmaceuticals through freely available online databases

https://doi.org/10.1016/j.drudis.2019.01.010

Highlights

  • The content of biopharmaceutical databases is dictated by the intended audience.

  • Databases share sources in their construction and often cite each other as sources.

  • No single database captures comprehensive information.

  • Each database has at least a small number of distinct clinical drug compounds.

Several public databases have emerged over the past decade to enable chemo- and bio-informatics research in the field of drug development. To a naive observer, as well as to many seasoned professionals, the differences among these drug databases are unclear. We assessed the availability of all pharmaceuticals with evidence of clinical testing (i.e., tested in at least a Phase I clinical trial) and highlight the major differences and similarities among public databases containing clinically tested pharmaceuticals. We reviewed a selection of the most recent and prominent databases, including ChEMBL, CRIB NME, DrugBank, DrugCentral, PubChem, repoDB, SuperDRUG2 and WITHDRAWN, and found that ∼11 700 unique active pharmaceutical ingredients with evidence of clinical testing are available in the public domain.

Introduction

The recent rise in the number of drug discovery databases has prompted interest in analyzing the usage and adoption of each database and the interconnections and cross-fertilization that exist between them. Databases focusing on targets, proteins, metabolism and active pharmaceutical ingredients (APIs) have proliferated, particularly over the past 5 years. These databases are crucial resources for in silico drug discovery, for prioritizing repurposing opportunities and for identifying trends in the drug development enterprise. Drug repurposing has become increasingly attractive in recent years as a less expensive option with lower barriers to approval than traditional drug discovery and development. Therefore, it is of interest to assess the current public landscape of approved APIs and APIs that have been in clinical trials.

Assessing the public landscape of clinical-stage APIs involves comparing various biopharmaceutical databases, which is a notoriously challenging task. Yonchev et al. [1] reported on the redundancies in PubChem and ChEMBL and pointed out specific aspects that made comparing the two databases difficult. For example, in PubChem the same API can have multiple compound records, and, owing to the compound submission model of PubChem, a researcher runs the risk of extracting duplicate compounds. Furthermore, differences in terminology and in the ways the databases are structured impede straightforward comparison of database content. In 2013, Southan et al. compared chemical structures in ChEMBL, DrugBank, the Human Metabolome Database and the Therapeutic Target Database [2]. In 2018, Southan reviewed the largest chemical structure databases: PubChem, ChemSpider and UniChem, which contain 95, 63 and 154 million chemical structure records, respectively [3]. There, Southan examined the databases’ contributing sources and found that sources common among databases could have substantial differences in chemical structure count. Fourches et al. [4] provided guidance on reviewing and comparing chemogenomic datasets, with suggestions for how to curate and clean chemical datasets, and discussed the importance of proper cleaning, including the removal of duplicates. Ambiguity in and across databases can confound efforts to model and analyze data.
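The duplicate-record problem described above can be illustrated with a minimal Python sketch. It assumes that every record carries a normalized structure identifier; the field names `structure_key` and `name`, and all record values, are hypothetical placeholders and are not taken from any of the databases or from the authors' code.

```python
from collections import defaultdict


def deduplicate(records, key_field="structure_key"):
    """Group records sharing a structure key; return one entry per key.

    Each entry maps the normalized key to the sorted set of names seen for it.
    Records without a key cannot be matched and are skipped.
    """
    groups = defaultdict(set)
    for record in records:
        key = record.get(key_field)
        if not key:
            continue
        # Normalize case and whitespace so trivially different keys still match.
        groups[key.strip().upper()].add(record["name"].strip().lower())
    return {key: sorted(names) for key, names in groups.items()}


# Hypothetical records: two submissions of the same compound, one distinct
# compound, and one record that lacks a structure key.
records = [
    {"name": "Drug A", "structure_key": "KEY-0001"},
    {"name": "drug a ", "structure_key": "key-0001"},
    {"name": "Drug B", "structure_key": "KEY-0002"},
    {"name": "Drug C", "structure_key": None},
]

unique = deduplicate(records)
print(len(unique), "unique compounds")
```

Matching on a structure identifier rather than on a name is a design choice made for this sketch; in practice the choice of identifier and the handling of salts and stereoisomers would need to be decided for each dataset.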

Although not immediately obvious, one fact emerging from our studies was that each database has a distinct emphasis and target audience. Chemistry-based databases, such as PubChem and ChEMBL, contain very large numbers of records of compounds with potential medicinal uses. Other databases, such as DrugBank, focus on unique APIs, most of which have some evidence of clinical interest. Several open databases specialize in specific subsets of drugs, such as approved or withdrawn medicines. In this review, we selected databases that meet the following criteria: (i) they are public and freely available with downloadable data; (ii) they are compound oriented and contain clinically tested compounds; and (iii) they have at least one peer-reviewed publication describing the content and construction of the database. The databases meeting these criteria are: ChEMBL [5], CRIB NME [6], DrugBank [7], DrugCentral [8], PubChem [9], repoDB [10], SuperDRUG2 [11] and WITHDRAWN [12]. Additional databases are undoubtedly available through commercial sources on a subscription basis or as the result of extensive competitive intelligence, but these are not freely available and are therefore not included in our present analysis.

In this review, a short summary of each selected database is provided. The usage and adoption of each database are discussed by analyzing peer-reviewed publications citing each database. The relationships among the databases are examined according to the sources used for their construction as well as the overlap in clinical-stage drug compounds among them. Finally, we summarize the current number of pharmaceuticals in the public domain with evidence of in-human experience. Raw data files and code for this review can be found at: https://github.com/WUSTL-CRIB/clinical_databases_review.

Overview of selected databases

The past decade has witnessed an increase in publications citing public drug databases. In Fig. 1, we review the largest and most cited databases: PubChem, ChEMBL and DrugBank. Whereas there were only two mentions of PubChem, ChEMBL or DrugBank in the literature in 2004, the field began to grow soon thereafter. By 2010, the rate of annual citations was nearing 500 and would more than double within 2 years. The rate of new citations has continued to climb since then. At

Usage and adoption

To quantify the adoption and usage of each database, peer-reviewed publications were extracted from the Elsevier Scopus literature database. For the top-three most-cited drug databases (ChEMBL, DrugBank and PubChem) the database name itself was used as the search term to obtain a set of publications that cite the database. For the remaining databases, articles citing the main manuscript containing the database description were included for analysis.

Table 3 summarizes the adoption of the

Source analysis

We examined the relationships between sources used to create open drug databases along with external links to other databases to assess their interconnectivity. The sources of each database were compiled from publications describing the content and construction of the database. Online descriptions of data sources from the websites hosting each database were also included in this evaluation. A network graph of all sources and external links and their shared connections to each database is shown
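One way to explore such source relationships is to invert a database-to-source mapping and look for sources used by more than one database. The Python sketch below is illustrative only: the database and source names are hypothetical placeholders, and the mapping does not reflect the actual sources of any database discussed in this review.

```python
from collections import defaultdict


def shared_sources(db_sources):
    """Return sources used by more than one database, with their users."""
    used_by = defaultdict(set)
    for database, sources in db_sources.items():
        for source in sources:
            used_by[source].add(database)
    return {s: sorted(dbs) for s, dbs in used_by.items() if len(dbs) > 1}


def databases_cited_as_sources(db_sources):
    """Return databases that appear in the source list of another database."""
    cited = set()
    for database, sources in db_sources.items():
        cited.update(s for s in sources if s in db_sources and s != database)
    return sorted(cited)


# Hypothetical mapping: each database lists the sources it was built from.
db_sources = {
    "db_x": {"registry_1", "db_y"},
    "db_y": {"registry_1", "label_set_1"},
    "db_z": {"label_set_1", "db_x"},
}

shared = shared_sources(db_sources)
cited = databases_cited_as_sources(db_sources)
print(shared)
print(cited)
```

The two helper functions correspond to the two kinds of connection mentioned above: sources shared between databases, and databases that serve as sources for one another.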

Content analysis

To assess the uniqueness and overlap of each drug database, compounds with evidence of clinical testing were extracted from each database. Evidence of clinical testing included links to any clinical trial information or records labeled as ‘approved’ or ‘withdrawn’ in any database. Data formats varied among the numerous data sources; therefore, general coding ability in languages such as Python, Bash and Structured Query Language (SQL) was required to access and thoroughly evaluate all data
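A minimal Python sketch of this kind of overlap analysis follows. It is not the authors' pipeline: the record fields (`structure_key`, `status`, `trial_ids`), the database names and all values are hypothetical, and the evidence rule simply mirrors the description above (an ‘approved’ or ‘withdrawn’ label, or any linked clinical trial).

```python
from itertools import combinations

CLINICAL_LABELS = {"approved", "withdrawn"}


def has_clinical_evidence(record):
    """True if the record is labeled approved/withdrawn or links to a trial."""
    status = (record.get("status") or "").lower()
    return status in CLINICAL_LABELS or bool(record.get("trial_ids"))


def clinical_keys(records):
    """Collect the structure keys of records with clinical evidence."""
    return {r["structure_key"] for r in records if has_clinical_evidence(r)}


def summarize_overlap(db_to_keys):
    """Return the union of keys, keys unique to each database, and pairwise overlap counts."""
    all_keys = set().union(*db_to_keys.values())
    unique = {}
    for name, keys in db_to_keys.items():
        others = set().union(*(k for n, k in db_to_keys.items() if n != name))
        unique[name] = keys - others
    pairwise = {
        (a, b): len(db_to_keys[a] & db_to_keys[b])
        for a, b in combinations(sorted(db_to_keys), 2)
    }
    return all_keys, unique, pairwise


# Hypothetical input: one database given as raw records, two as key sets.
databases = {
    "db_x": clinical_keys([
        {"structure_key": "K1", "status": "approved"},
        {"structure_key": "K2", "trial_ids": ["trial-1"]},
        {"structure_key": "K3", "status": "withdrawn"},
        {"structure_key": "K9", "status": "experimental"},
    ]),
    "db_y": {"K2", "K3", "K4"},
    "db_z": {"K3", "K5"},
}

all_keys, unique, pairwise = summarize_overlap(databases)
print(len(all_keys), "compounds in total")
print({name: sorted(keys) for name, keys in unique.items()})
print(pairwise)
```

Set operations keep the comparison simple once every database has been reduced to a common identifier; the harder part in practice is producing that identifier consistently across differently formatted sources.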

Discussion

Barriers to entry can limit the usefulness of compound databases for certain analytical efforts. For example, selecting drug compounds that have been through clinical testing is not a straightforward task in large compound databases (such as PubChem and ChEMBL). A handful of databases contain clinical-stage drugs, but the extent to which clinical trial data are linked is limited. It is unclear, and very difficult to ascertain, which compounds are unique to a particular source because many of

Concluding remarks

In this study, we compiled >11 700 unique APIs with evidence of clinical testing. Online drug databases are crucial enablers of in silico drug discovery and chemoinformatics, especially in academic, biotechnology startup and non-profit environments where high-cost subscription databases are undesirable. However, uncertainty about which compounds overlap and which are unique across common drug databases impedes comprehensive analysis.

Acknowledgments

Research reported in this publication was supported by the Washington University Institute of Clinical and Translational Sciences grant UL1TR002345, subaward TL1TR002344, from the National Center for Advancing Translational Sciences (NCATS) of the National Institutes of Health (NIH). The content is solely the responsibility of the authors and does not necessarily represent the official view of the NIH.

References (38)

  • O. Ursu. DrugCentral: online drug compendium. Nucleic Acids Res. (2016)

  • S. Kim. PubChem substance and compound databases. Nucleic Acids Res. (2015)

  • A.S. Brown et al. A standard database for drug repositioning. Sci. Data (2017)

  • V.B. Siramshetty. SuperDRUG2: a one stop resource for approved/marketed drugs. Nucleic Acids Res. (2017)

  • L.D. Gillespie. WITHDRAWN: interventions for preventing falls in elderly people. Cochrane Database Syst. Rev. (2009)

  • D.S. Wishart. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. (2006)

  • C. Southan. The IUPHAR/BPS Guide to PHARMACOLOGY in 2016: towards curated quantitative interactions between 1300 protein targets and 6000 ligands. Nucleic Acids Res. (2016)

  • M.D. Hanwell. Avogadro: an advanced semantic chemical editor, visualization, and analysis platform. J. Cheminf. (2012)

  • J. Inglese. Quantitative high-throughput screening: a titration-based approach that efficiently identifies biological activities in large chemical libraries. Proc. Natl. Acad. Sci. U. S. A. (2006)