ChEMBL Resources


Tuesday, 28 October 2014

Paper: PPDMs – A resource for mapping small molecule bioactivities from ChEMBL to Pfam-A protein domains


We've just published an Open Access paper in Bioinformatics on an approach to annotate the region of ligand binding within a target protein. This has many applications in the use of ChEMBL, in particular providing greater accuracy in mapping functional effects, improving ligand-based target prediction approaches, and reducing false positives in sequence/target searching of ChEMBL. Where next for this work? Annotating to a site-specific level would be a good thing to implement (think of HIV-1 RT, with its distinct nucleoside and non-nucleoside sites).

Here's the abstract...

Summary: PPDMs is a resource that maps small molecule bioactivities to protein domains from the Pfam-A collection of protein families. Small molecule bioactivities mapped to protein domains add important precision to approaches that use protein sequence searches alignments to assist applications in computational drug discovery and systems and chemical biology. We have previously proposed a mapping heuristic for a subset of bioactivities stored in ChEMBL with the Pfam-A domain most likely to mediate small molecule binding. We have since refined this mapping using a manual procedure. Here, we present a resource that provides up-to-date mappings and the possibility to review assigned mappings as well as to participate in their assignment and curation. We also describe how mappings provided through the PPDMs resource are made accessible through the main schema of the ChEMBL database.

Availability: The PPDMs resource and curation interface is available at https://www.ebi.ac.uk/chembl/research/ppdms/pfam_maps

The source-code for PPDMs is available under the Apache license at https://github.com/chembl/pfam_maps

Source code is available at https://github.com/chembl/pfam_map_loader to demonstrate the integration process with the main schema of ChEMBL.

Monday, 27 October 2014

Django model describing ChEMBL database.





TL;DR: We have just open sourced our Django ORM Model, which describes the ChEMBL relational database schema. This means you no longer need to write another line of SQL code to interact with the ChEMBL database. We think it is pretty cool and we are using it in the ChEMBL group to make our lives easier. Read on to find out more...



It is never a good idea to use SQL code directly in Python. Let's look at some basic examples that explain why:
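The original code listing is not present in this copy of the post. As a stand-in, here is a minimal sketch of the kind of code being discussed, assuming an in-memory SQLite database with two tables that borrow ChEMBL names but hold made-up rows. It only illustrates that a misspelled keyword inside an SQL string is invisible to Python and surfaces only when the database parses the statement.

```python
import sqlite3

# Tiny stand-in schema: table names mimic ChEMBL, the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE molecule_dictionary (molregno INTEGER PRIMARY KEY, pref_name TEXT);
    CREATE TABLE compound_structures (molregno INTEGER, canonical_smiles TEXT);
    INSERT INTO molecule_dictionary VALUES (1, 'ASPIRIN');
    INSERT INTO compound_structures VALUES (1, 'CC(=O)Oc1ccccc1C(=O)O');
""")

# The typo lives inside a string, so Python itself sees nothing wrong with it.
sql = """
    SELECT md.pref_name, cs.canonical_smiles
    FROM molecule_dictionary md
    JION compound_structures cs ON md.molregno = cs.molregno
"""

try:
    conn.execute(sql)
    error = None
except sqlite3.OperationalError as exc:
    # The mistake only shows up once the database tries to parse the statement.
    error = str(exc)

print(error)
```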


Can you see what is wrong with the code above? The SQL keyword `JOIN` was misspelled as `JION`. It is hard to spot quickly because most code highlighters apply Python syntax rules and ignore the contents of strings. In our case the string is very important, as it contains the SQL statement.

The problem above can be easily solved by using a simple Python SQL wrapper, such as edendb. Such a wrapper provides a set of functions to perform database operations, for example 'select', 'insert' and 'delete':


Now it's harder to make a typo in any of the SQL keywords, because they are exposed to Python and your IDE should warn you about the mistake.

OK, time for something harder. Can you find what's wrong here, assuming that this query is executed against the chembl_19 schema:
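The listing for this example is also missing from this copy. The sketch below is a hypothetical reconstruction, assuming a one-table SQLite stand-in for the chembl_19 `molecule_synonyms` table. The misspelled table name `molecule_synonims` is an assumption made for illustration; only the `synonims` column typo is stated in the post.

```python
import sqlite3

# Stand-in for one chembl_19 table, reduced to two columns and one made-up row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE molecule_synonyms (molregno INTEGER, synonyms TEXT)")
conn.execute("INSERT INTO molecule_synonyms VALUES (1, 'Aspirin')")


def try_query(sql):
    """Run a query and return its rows, or the database error message as a string."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        return str(exc)


# Both strings below are perfectly valid Python and syntactically valid SQL.
both_wrong = try_query("SELECT synonims FROM molecule_synonims")
column_wrong = try_query("SELECT synonims FROM molecule_synonyms")
correct = try_query("SELECT synonyms FROM molecule_synonyms")

print(both_wrong)
print(column_wrong)
print(correct)
```

Neither mistake is visible until the query reaches a database that knows the schema.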


Well, there are two errors. First of all, the `molecule_synonyms` table does not have a `synonims` column; the proper name is `synonyms`. Secondly, the table name itself is misspelled; the correct name is `molecule_synonyms`.

This kind of error is even harder to find because we are dealing with Python and SQL code that is syntactically correct. The problem is semantic, and in order to find it we need a good understanding of the underlying data model, in this case the chembl_19 schema. But the ChEMBL database schema is fairly complicated (341 columns spread over 52 tables); are we really supposed to know it all by heart? Let's leave this rhetorical question and proceed to the third example: how to query for compounds containing the substructure represented by the SMILES 'O=C(Oc1ccccc1C(=O)O)C':

For Oracle this would be:


And for Postgres:


As you can see, the two queries are different. The reasons for these differences are:
  1. Differences in Oracle and Postgres dialects
  2. Different chemical cartridges (Accelrys Direct and RDKit)
  3. Different names of auxiliary tables containing binary molecule objects
These queries are also more complicated than the previous examples as they require more table joins and they make calls to the chemical cartridge-specific functions.

The example substructure search queries described above are similar to those used by the ChEMBL web services, which are available on EBI servers (Oracle backend) and in the myChEMBL VM (PostgreSQL backend). Still, the web services work without any change to their code. How?

All of the problems highlighted in this blogpost can be solved by using a technique known as Object Relational Mapping (ORM). An ORM converts every table in the database (for example 'molecule_dictionary') into a Python class (MoleculeDictionary). It then becomes easy to list all available classes in a Python module (using the 'dir' function) and to check the fields of each class, which correspond to the columns of the SQL tables. This makes database programming easier and less error prone. The ORM also allows the code to work in a database-agnostic manner, which explains how we use the same codebase with Oracle and PostgreSQL backends.
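To make the table-to-class idea concrete, here is a deliberately tiny, hand-rolled sketch using only the standard library. It is not Django's API and not the ChEMBL model: the `Column` descriptor, the `filter` helper, the three-column table and the `molregno` value are assumptions made for illustration. It shows only the principle that field names become Python attributes, so a misspelled name is rejected before any SQL is sent.

```python
import sqlite3


class Column:
    """Descriptor that marks a class attribute as a mapped table column."""

    def __set_name__(self, owner, name):
        self.name = name

    def __get__(self, obj, owner=None):
        if obj is None:
            return self
        return obj._row[self.name]


class MoleculeDictionary:
    """Toy mapping for a 'molecule_dictionary' table (three columns only)."""

    table = "molecule_dictionary"
    molregno = Column()
    chembl_id = Column()
    pref_name = Column()

    def __init__(self, row):
        self._row = row

    @classmethod
    def columns(cls):
        # Class attributes keep their definition order.
        return [name for name, value in vars(cls).items() if isinstance(value, Column)]

    @classmethod
    def filter(cls, conn, **conditions):
        unknown = sorted(set(conditions) - set(cls.columns()))
        if unknown:
            # A misspelled field is rejected before any SQL is generated.
            raise AttributeError(f"{cls.__name__} has no field(s) {unknown}")
        sql = f"SELECT {', '.join(cls.columns())} FROM {cls.table}"
        if conditions:
            sql += " WHERE " + " AND ".join(f"{name} = ?" for name in conditions)
        rows = conn.execute(sql, tuple(conditions.values())).fetchall()
        return [cls(dict(zip(cls.columns(), row))) for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE molecule_dictionary (molregno INTEGER, chembl_id TEXT, pref_name TEXT)")
conn.execute("INSERT INTO molecule_dictionary VALUES (1, 'CHEMBL25', 'ASPIRIN')")

hits = MoleculeDictionary.filter(conn, pref_name="ASPIRIN")
print([hit.chembl_id for hit in hits])

try:
    MoleculeDictionary.filter(conn, pref_nmae="ASPIRIN")  # deliberate typo
except AttributeError as exc:
    print(exc)
```

A real ORM adds much more (relationships, dialect-specific SQL generation, lazy querysets), which is where the database-agnostic behaviour described above comes from.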

If this blogpost has convinced you to give the ORM approach a try, please take a look at our ChEMBL example also included in myChEMBL:

Wednesday, 22 October 2014

myChEMBL 19 Released



                     
We are very pleased to announce that the latest myChEMBL release, based on the ChEMBL 19 database, is now available to download. In addition to the extra data, you will also find a number of great new features. So what's new then?

More core chemoinformatics tools

We have included OSRA (Optical Structure Recognition), which is useful for extracting compound structures from images. OSRA can be accessed from the command line or through a very convenient web interface provided by Beaker (described below). We've also added OpenBabel, another great open source cheminformatics toolkit. This means you can now experiment with both RDKit and OpenBabel and use whichever you prefer.

ChEMBL Beaker

myChEMBL now ships with a local instance of the ChEMBL Beaker service. For those not familiar with Beaker, the service provides users with an array of chemoinformatics utilities via a RESTful API. Under the hood, Beaker uses RDKit and OSRA to carry out its methods. With the addition of Beaker to myChEMBL, users can now carry out the following tasks in a secure local environment:
  • Convert chemical structures between multiple formats
  • Extract compound information from images and pdfs
  • Generate compound images in raster (png) and vector (svg) forms
  • Generate HTML5 ready representation of compound structure
  • Generate compound fingerprints
  • Generate compound descriptors
  • Identify Maximum Common Substructure
  • Compound standardisation
  • Many more calculations

 

New IPython notebooks

We have written a number of new IPython notebooks, which focus on a range of cheminformatics and bioinformatics topics. The topics covered by the new notebooks include:
  • Introduction on how to use ChEMBL Beaker
  • Using the Django ORM to query the ChEMBL database
  • Introduction to BLAST and creation of a simple Druggability Score
  • Introduction to machine learning
  • Analysis of SureChEMBL data, focused on identifying the MCS core identified in a patent 
  • Extraction and analysis of ChEMBL ADME data 

We have also updated the underlying Ubuntu VM to 14.04 LTS, which also required us to make a number of changes to the myChEMBL installation. To see how these changes and new additions have affected a bare metal installation of myChEMBL, head over to the myChEMBL github repository.

 

Installation

There are 2 different ways we recommend for installing myChEMBL:
  1. Follow the instructions in the INSTALL file on the ftpsite. This will import the myChEMBL VM into VirtualBox
  2. Use Vagrant to install myChEMBL. See this earlier blogpost for more details, but the command to run is:
vagrant init chembl/myChEMBL && vagrant up

   If you already have myChEMBL_18 installed via Vagrant, instead of running 'vagrant box update', we strongly recommend running: 

vagrant box remove chembl/myChEMBL
vagrant init chembl/myChEMBL && vagrant up

Future plans

The myChEMBL resource is an evolving system and we are always looking to add new open source projects, tools and notebooks. We would be really interested to hear from users about what they would like to see in future myChEMBL releases, so please get in touch if you have any suggestions. (Just so you know, we already have a couple of ideas for myChEMBL 20).

We hope you find this myChEMBL update useful and if you spot any issues or have any questions let us know.

The myChEMBL Team

Thursday, 16 October 2014

New Drug Approvals 2014 - Pt. XII - Naloxegol (Movantik™)





ATC Code: A06AH03
Wikipedia: Naloxegol
ChEMBL: CHEMBL2219418

On September 16th the FDA approved Movantik (naloxegol, AZ-13337019) as an oral treatment for patients with opioid-induced constipation and chronic non-cancer pain.

Naloxegol
Naloxegol is an opioid receptor antagonist. Due to its similarity to noroxymorphone, a main metabolite of oxycodone, naloxegol is classed as a controlled substance. However, the FDA analysed its abuse potential and concluded that there was no risk of dependency.


Naloxegol



Mode of Action
Opioids are a class of drugs which are used to manage pain, but have a common side effect of reducing the motility of the gastrointestinal tract, making bowel movements difficult. Opioids work by binding to the mu-receptors (CHEMBL233, UniProt:P35372) in the central nervous system, thereby reducing pain. However, they are also able to bind to the mu-receptors in the gastrointestinal tract, hence causing opioid-induced constipation. 
Movantik is a peripherally-acting opioid receptor antagonist, which is able to prevent constipation by reducing this specific side effect of the opioids without affecting the efficacy of the pain management.

Clinical Trials
The clinical trials for this drug were carried out as part of the KODIAC clinical programme, comprising four studies. Tests showed that 44% and 41% of patients receiving 25 mg and 12.5 mg, respectively, experienced increased bowel movements, compared to just 29% who took the placebo. [Paper]


Indication and Warnings
This drug is for non-cancer related pain. Reported side effects include abdominal pain, diarrhea, headache and excessive gas in the stomach and/or intestinal area.
When used in conjunction with another peripherally-acting opioid antagonist, there is the chance of gastrointestinal perforation.
There is also the chance of withdrawal symptoms.
The drug is contraindicated for anyone who is also taking CYP3A4 (CHEMBL340, UniProt:P08684) inhibitors, such as clarithromycin (CHEMBL1741), as these will increase the exposure to naloxegol and could precipitate opioid withdrawal symptoms. [FDA]

Trade Names
Naloxegol was developed by AstraZeneca and is marketed under the trade name of Movantik. It is due for release during the first quarter of 2015.

New Drug Approvals 2014 - Pt. XI - Idelalisib (Zydelig™)







ATC Code: L01XX47
Wikipedia: Idelalisib
ChEMBL: CHEMBL2216870

On July 23rd the FDA approved Zydelig (idelalisib, GS-1101) as an orally-delivered drug to treat patients with three types of blood cancer:
Relapsed chronic lymphocytic leukemia (CLL)
Relapsed follicular B-cell, non-Hodgkin lymphoma  (FL)
Relapsed small lymphocytic lymphoma (SLL)

Blood cancer
The three main categories of blood cancer are leukemia, lymphoma and myeloma. Lymphoma is also split into two types: Hodgkin lymphoma and non-Hodgkin lymphoma. Both leukemia and myeloma occur in the bone marrow, whilst lymphoma is a cancer that is isolated to the lymphatic system. Acute leukemia is where there is an abundance of underdeveloped white blood cells that can’t function properly and chronic leukemia is where there are just far too many white blood cells, which is just as bad as having too few. Myeloma is where the plasma cells form tumours in the bone marrow.


Idelalisib
This drug is a phosphoinositide 3-kinase inhibitor, which works by blocking p110δ (CHEMBL3130, UniProt:O00329), the delta isoform of the phosphoinositide 3-kinase enzyme, encoded in humans by the PIK3CD gene. This isoform plays a role in B-cell development, proliferation and function and is expressed predominantly in leukocytes.

Idelalisib
Mode of action
Idelalisib works on patients by inhibiting the PI3 kinase delta isoform (PI3Kδ), which plays an important role in malignant lymphocyte survival. It is the delta and gamma forms that are specific to the hematopoietic system. This treatment impairs the normal tracking of CLL lymph nodes. It can be used in conjunction with Rituxan (rituximab), an existing blood cancer treatment, for relapsed CLL and on its own for FL and SLL.

Clinical trials
Clinical trials were carried out on 220 patients with relapsed CLL who were not healthy enough to receive cytotoxic therapy, due to co-existing medical conditions or damage from previous chemotherapy. Patients were administered either idelalisib plus rituximab or a placebo plus rituximab. Most of these patients were 65 years of age or older.
After 24 weeks, 93% of the group who had taken the combination treatment were disease progression-free, compared to only 46% of the group who had received the placebo and rituximab combination.
After 12 months, 90% of the dual drug combination group were alive, compared to 80% of the placebo-containing group. [NCI]

Indication and Warnings
This drug can be used in combination with rituximab or on its own, and is indicated for patients with relapsed conditions. There are several warnings for idelalisib, including hepatotoxicity, pneumonitis (fatal and serious), intestinal perforation and embryo-fetal toxicity. [FDA]

Trade Names
Idelalisib was developed by Gilead Sciences and is marketed under the name Zydelig.

Monday, 29 September 2014

The great US patent spike on SureChEMBL


Apparently, there was a huge spike of newly granted US patents released by the USPTO a few days ago. The reason?

In March 2013, US patent law changed. The ‘first to invent’ became ‘first inventor to file’ for patent protection purposes (see more on this here). As a result, a lot of people rushed to submit applications just before the change. Fast forward 18 months later (last week), a huge spike in USPTO granted patents is observed. 

Did SureChEMBL pick that up? See below the cumulative count plot of new patent documents:

And the corresponding compound count extracted from these patents:

For more information on SureChEMBL, see our previous posts.

George

Friday, 19 September 2014

SureChEMBL Available Now





Followers of the ChEMBL group's activities and this blog will be aware of our involvement in the migration of the previously commercially available SureChem chemistry patent system, to a new, free-for-all system, known as SureChEMBL. Today we are very pleased to announce that the migration process is complete and the SureChEMBL website is now online.

SureChEMBL provides the research community with the ability to search the patent literature using Lucene-based keyword queries and, much more importantly, chemistry-based queries. If you are not familiar with SureChEMBL, we recommend you review the content of these earlier blogposts here and here. SureChEMBL is a live system, which is continuously extracting chemical entities from the patent literature. The time it takes for a new chemical in the patent literature to become searchable in the SureChEMBL system is 1-2 days (WO patents can sometimes take a bit longer due to an additional reprocessing step). At time of writing this blogpost the number of unique compounds in SureChEMBL is 15,760,514, which have been extracted from 12,949,021 patents.

To get started using SureChEMBL, head over to the homepage, where you will be presented with a range of search methods and filters. The image below provides a brief overview of the search functionality offered by the system:




To provide an example of how to use the SureChEMBL website, let's assume you are interested in patents which contained structures similar (or identical) to Sildenafil in the claims section of the document and also mention the term PDE5 anywhere in the document. To run this search, go to the SureChEMBL homepage and carry out the following actions:
  1. Enter the term 'PDE5' in the search text box 
  2. Sketch in the structure of Sildenafil (or use the name look-up function)
  3. Change the search type to similarity (>85%) 
  4. Click the 'Claims' checkbox in the document filter section and 
  5. Hit 'Search' button


After clicking 'Search', you will be presented with a page which contains all compounds that match your search criteria:





From the compound results page above you then have the choice of either exporting the chemistry (all the compounds returned by the search) or viewing the patents associated with one or more of the selected compounds. For the selected compounds in this search, the associated patents (sorted by descending publication date) are:


 

From the patent document results page, you are able to export chemistry from all documents on display, view patent family information and view the chemistry-annotated, full text document. The claims section of the first patent (US-20140255433-A1) includes references to both sildenafil and PDE5:


 

The aim of this blogpost is to introduce the SureChEMBL system and not to provide a comprehensive review of all the functionality the system offers. This will be covered in future training sessions and webinars, which will be announced on this blog in the near future.

We would like to thank the people over at Digital Science, who were responsible for building the original SureChem system and supported its migration over to EMBL-EBI. In particular, we would like to thank Nicko Goncharoff, James Siddle and Richard Koks.

The system runs on the cloud - specifically on Amazon Web Services, a stable, secure and highly scalable way to deploy web applications. We need to keep a close eye on performance and patterns of usage over the coming weeks, to get an idea of how many servers, etc, we need for full deployment. In particular, we will throttle scripted access,  so please get in touch if you want to try anything like this, so you are not frustrated by slow performance, and we will try and accommodate your use case. There is also a download link on the homepage, so please explore this if you are interested.

We have an exciting roadmap for the future development of SureChEMBL, but if you have any priority requests, mail them to surechembl-help (at) ebi.ac.uk.

If you experience any issues with the system, or have any questions please get in touch.

Friday, 12 September 2014

Another Confusion in the Literature - Trust but Verify


Another kinase inhibitor mystery solved! I (jpo) feel like I'm turning into a rotund Hercule Poirot, except English and with a huge fat wiry 'tache.

The literature has quite a few references to a Trk inhibitor in phase 2 trials from Cephalon, CEP-2583; just do a Google search with something like "CEP-2583 clinical trial" and you'll come across a number of references to this, usually in the peer-reviewed literature. However, this is really the only place it occurs; try clinicaltrials.gov, Cephalon regulatory filings or the Teva pipeline, and you will find nothing.

However, this small letter to the Journal of Urology (DOI:10.1097/01.ju.0000138215.70709.9c) solves the mystery of the clinical candidate: it was a typo. It was meant to refer to CEP-2563/CEP-701 (a prodrug/drug pair), so a simple transcription error (8 for 6), followed by repetition across other reviews (which I guess trusted the primary source), gave a footprint to a clinically interesting asset that never was.

What to do with these sorts of errors in the literature? They are too minor for journals to ever worry about fixing, but arguably they should be fixed if brought to their attention. Of course, post-publication peer review should also fix this, so I'll make some comments on PubMed Commons when I have some time. It was good of Cephalon to correct the record in this way.

jpo, Bissan, and Krister

Tuesday, 9 September 2014

Papers: Literature text mining and extensions to UniChem


Two new papers from the group have just been published, both in the Journal of Cheminformatics, and of course both Open Access.

The first deals with some extensions to UniChem to allow far more flexible searches. The abstract is:

UniChem is a low-maintenance, fast and freely available compound identifier mapping service, recently made available on the Internet. Until now, the criterion of molecular equivalence within UniChem has been on the basis of complete identity between Standard InChIs. However, a limitation of this approach is that stereoisomers, isotopes and salts of otherwise identical molecules are not considered as related. Here, we describe how we have exploited the layered structural representation of the Standard InChI to create new functionality within UniChem that integrates these related molecular forms. The service, called ‘Connectivity Search’ allows molecules to be first matched on the basis of complete identity between the connectivity layer of their corresponding Standard InChIs, and the remaining layers then compared to highlight stereochemical and isotopic differences. Parsing of Standard InChI sub-layers permits mixtures and salts to also be included in this integration process. Implementation of these enhancements required simple modifications to the schema, loader and web application, but none of which have changed the original UniChem functionality or services. The scope of queries may be varied using a variety of easily configurable options, and the output is annotated to assist the user to filter, sort and understand the difference between query and retrieved structures. A RESTful web service output may be easily processed programmatically to allow developers to present the data in whatever form they believe their users will require, or to define their own level of molecular equivalence for their resource, albeit within the constraint of identical connectivity.
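The layered matching described in the abstract can be illustrated with a short sketch. This is not UniChem's implementation: it is a simplified string-level comparison, assuming that two Standard InChIs count as connectivity matches when their formula and `/c` layers agree, and it ignores complications such as mixtures, salts and repeated layer prefixes in isotopic or fixed-H layers. The function names are made up for this example.

```python
def inchi_layers(inchi):
    """Split a Standard InChI into its version, formula and prefixed layers."""
    version, formula, *rest = inchi.split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        # Keep the first occurrence of each prefix (the main layer comes first).
        layers.setdefault(layer[0], layer[1:])
    return layers


def same_connectivity(inchi_a, inchi_b):
    """True when formula and connectivity (/c) layers are identical."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    return a["formula"] == b["formula"] and a.get("c") == b.get("c")


def differing_layers(inchi_a, inchi_b):
    """List the prefixes of the remaining layers in which the two InChIs differ."""
    a, b = inchi_layers(inchi_a), inchi_layers(inchi_b)
    prefixes = sorted((set(a) | set(b)) - {"version", "formula", "c"})
    return [prefix for prefix in prefixes if a.get(prefix) != b.get(prefix)]


# Two alanine enantiomers: identical apart from the /m stereo sub-layer.
ala_1 = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
ala_2 = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(same_connectivity(ala_1, ala_2))
print(differing_layers(ala_1, ala_2))
```

An exact-identity comparison of the full strings would treat these two as unrelated, which is the limitation the Connectivity Search is described as addressing.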

The second deals with using text mining approaches to find papers that look like they could be abstracted into ChEMBL - that is they contain keywords enriched in medicinal chemistry and compound structure concepts. The abstract for this paper is:


The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier, which is able to successfully distinguish between publications that are ‘ChEMBL-like’ (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology make the ChEMBL corpus a unique resource for text mining.
The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9) respectively. Both workflows and models are freely available at: ftp://ftp.ebi.ac.uk/pub/databases/chembl/text-mining. These can be readily modified to include additional keyword constraints to further focus searches.
Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data.

%T UniChem: extension of InChI-based compound mapping to salt, connectivity and stereochemistry layers
%A J Chambers
%A M Davies
%A A Gaulton
%A G Papadatos
%A A Hersey
%A JP Overington
%J Journal of Cheminformatics 
%D 2014
%V 6:43  
%O doi:10.1186/s13321-014-0043-5
%O http://www.jcheminf.com/content/6/1/43

%T A document classifier for medicinal chemistry publications trained on the ChEMBL corpus
%A G Papadatos
%A GJP van Westen
%A S Croset
%A R Santos
%A S Trubian
%A JP Overington
%J Journal of Cheminformatics 
%D 2014
%V 6:40  
%O doi:10.1186/s13321-014-0040-8
%O http://www.jcheminf.com/content/6/1/40

Sunday, 7 September 2014

Structure Confirmation, Reproducibility of Research, and Literature Abstraction - Trust But Verify


The identity of compounds in the literature is often uncertain - with high profile cases of incorrect compounds being used in experiments - with subsequent difficulties in reconciling conflicting experiments or repeating results.

One example is the kinase inhibitor Bosutinib (SKI-606), where several compound vendors shipped a similar isomer of the actual drug. It will be unclear which data comes from the bona fide Bosutinib and which from the compound sold as Bosutinib by multiple vendors. More details on this case are in this post by Derek Lowe, and there is a definitive structural biology paper here. So this probably clouds a lot of the reported Bosutinib bioactivity data that's out there. Here the difference is in the substitution pattern on an aromatic ring (2,4-dichloro-5-methoxy vs 3,5-dichloro-4-methoxy); the mass is the same for these isomers.

A second example is this case (a clinical stage compound called TIC10) where there was a mix-up in the structure assigned to a compound in a patent. Since the substance had novel and potentially useful bioactivity and the annotated structure was wrong, this led to all sorts of complicated IP issues and shenanigans. Since this compound was also in a widely distributed and profiled NCI set, the literature ends up with data that is difficult to deconvolute.

There are probably many of these examples - and here are two more, both connected to clinical development stage protein kinase inhibitors from Exelixis.


The first is XL-765 also known as SAR-245409 and by the INN Voxtalisib - it is a PI3K inhibitor. Compound vendors have been selling 'XL-765' for some time, but the structure sold is (typically) not the correct XL-765 structure.

XL-765 is known to be this structure; multiple independent authoritative sources point to this identity. It's quite a simple, small molecule. The InChI key is RGHYDLZMTYDBDT-UHFFFAOYSA-N.
Several vendors have been selling a different structure as 'XL-765/SAR-245409'; as you will immediately note, it is a very different structure - not a positional or stereo-isomer, it will have very different mass, etc.



Some purchasers of this compound report peer-reviewed in vivo data. To be clear, this isn't just one company, it is all/most vendors of XL-765.


So there are a number of things that could have happened here:

i) The correct structure was supplied, but the structure on the label/on the website didn't match the real physical structure - i.e. a mislabelling.
ii) The incorrect structure was supplied, and the structure on the label/website was what was in the bottle. In our studies (at FIMM) the vendor supplied XL-765 has not behaved as a PI3K inhibitor, so the latter is more likely.

However, the incorrect XL-765 structure has propagated into public chemistry databases, and literature data for XL-765 is probably now largely suspect (with the exception of that from Exelixis themselves (and their collaborators)).

Here's the second new example - XL-147 (aka SAR-245408 and Pilaralisib) - again a PI3K inhibitor. The correct structure is QINPEPAQOBZPOF-UHFFFAOYSA-N.


Many vendors were selling XL-147 as the following compound. Here there is a lot more similarity between the actual and sold compounds, but they are still different, and any biology on the sold compound, although it may well be active in its own right, is not the biology of XL-147.



tl;dr Be careful if you are interested in purchasing or analysing activities and properties of Bosutinib, Pilaralisib and Voxtalisib. Be careful when loading names from vendor catalogs, and try to use more definitive authorities for compound synonyms. Be careful about relying on data from any public database.
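One practical way to act on this advice is to compare the InChIKey of whatever structure a catalog lists against a trusted reference key before loading the name. The sketch below assumes the two reference keys quoted in this post; the function name, the returned labels and the idea of holding references in a dictionary are illustrative choices, not an established tool.

```python
import re

# Standard InChIKey shape: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY_PATTERN = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")

# Reference keys as quoted in this post.
REFERENCE_KEYS = {
    "XL-765": "RGHYDLZMTYDBDT-UHFFFAOYSA-N",
    "XL-147": "QINPEPAQOBZPOF-UHFFFAOYSA-N",
}


def check_vendor_key(name, vendor_key):
    """Compare a catalog InChIKey with the reference key for the same name."""
    if not INCHIKEY_PATTERN.match(vendor_key):
        return "malformed key"
    reference = REFERENCE_KEYS.get(name)
    if reference is None:
        return "no reference"
    if vendor_key == reference:
        return "match"
    if vendor_key[:14] == reference[:14]:
        # Same first (skeleton) block, so the difference lies in later layers.
        return "same first block, other layers differ"
    return "mismatch"


print(check_vendor_key("XL-765", "RGHYDLZMTYDBDT-UHFFFAOYSA-N"))
print(check_vendor_key("XL-765", "QINPEPAQOBZPOF-UHFFFAOYSA-N"))
```

A key match only says that the drawn structures agree; it cannot tell you what is physically in the bottle, which still needs analytical confirmation.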


Krister and I would like to thank Willie Yuan at Chemietek for helping to clarify aspects of the confusing history for this compound.

jpo and Krister