Friday, 12 September 2014

Another Confusion in the Literature - Trust but Verify

Another kinase inhibitor mystery solved! I (jpo) feel like I'm turning into a rotund Hercule Poirot, except English and with a huge fat wiry 'tache.

The literature has quite a few references to a Trk inhibitor in phase 2 trials from Cephalon - CEP-2583; just do a Google search with something like "CEP-2583 clinical trial" and you'll come across a number of references to this, usually in the peer-reviewed literature. However, this is really the only place it occurs; try Cephalon regulatory filings or the Teva pipeline, and there is nothing.

However, this short letter to the Journal of Urology (DOI:10.1097/01.ju.0000138215.70709.9c) solves the mystery of the clinical candidate - it was a typo. It was meant to refer to CEP-2563/CEP-701 (a prodrug/drug pair), so a simple transcription error (8 for 6), followed by repetition across other reviews (which I guess trusted the primary source), gave a footprint to the existence of a clinically interesting asset that never was.

What to do with these sorts of errors in the literature? They are too minor for journals to ever worry about fixing, but arguably they should be fixed if brought to their attention. Of course, post-publication peer review should also fix this, so I'll make some comments on PubMed Commons when I have some time. It was good of Cephalon to correct the record in this way.

jpo, Bissan, and Krister

Tuesday, 9 September 2014

Papers: Literature text mining and extensions to UniChem

Two new papers from the group have just been published, both in Journal of Cheminformatics - and of course both Open Access.

The first deals with some extensions to UniChem to allow far more flexible searches. The abstract is:

UniChem is a low-maintenance, fast and freely available compound identifier mapping service, recently made available on the Internet. Until now, the criterion of molecular equivalence within UniChem has been on the basis of complete identity between Standard InChIs. However, a limitation of this approach is that stereoisomers, isotopes and salts of otherwise identical molecules are not considered as related. Here, we describe how we have exploited the layered structural representation of the Standard InChI to create new functionality within UniChem that integrates these related molecular forms. The service, called ‘Connectivity Search’ allows molecules to be first matched on the basis of complete identity between the connectivity layer of their corresponding Standard InChIs, and the remaining layers then compared to highlight stereochemical and isotopic differences. Parsing of Standard InChI sub-layers permits mixtures and salts to also be included in this integration process. Implementation of these enhancements required simple modifications to the schema, loader and web application, but none of which have changed the original UniChem functionality or services. The scope of queries may be varied using a variety of easily configurable options, and the output is annotated to assist the user to filter, sort and understand the difference between query and retrieved structures. A RESTful web service output may be easily processed programmatically to allow developers to present the data in whatever form they believe their users will require, or to define their own level of molecular equivalence for their resource, albeit within the constraint of identical connectivity.
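To make the layered matching described in the abstract a little more concrete, here is a minimal Python sketch. It is not the UniChem implementation: the function names are made up for illustration, the "connectivity key" is assumed to be the formula plus the /c layer, and the parser is deliberately naive (it does not handle salts or mixtures, and a layer prefix that appears twice, as can happen with isotopic layers, simply overwrites the earlier one).

```python
def split_layers(inchi):
    """Split a Standard InChI string into its formula and prefixed layers."""
    if not inchi.startswith("InChI=1S/"):
        raise ValueError("expected a Standard InChI string")
    parts = inchi.split("/")
    layers = {"formula": parts[1]}
    for part in parts[2:]:
        # Later layers start with a one-letter prefix such as c, h, t, m, s.
        # Simplification: a repeated prefix overwrites the earlier value.
        layers[part[0]] = part[1:]
    return layers


def connectivity_key(inchi):
    """Assumed matching key: formula plus the /c connectivity layer."""
    layers = split_layers(inchi)
    return (layers["formula"], layers.get("c", ""))


def differing_layers(inchi_a, inchi_b):
    """Return the sorted names of the layers that differ between two InChIs."""
    a, b = split_layers(inchi_a), split_layers(inchi_b)
    return sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))


# L-alanine, D-alanine and alanine with no stereochemistry specified
L_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
D_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"
FLAT_ALA = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

same_connectivity = connectivity_key(L_ALA) == connectivity_key(D_ALA)
stereo_difference = differing_layers(L_ALA, D_ALA)
```

Here the two enantiomers share a connectivity key, and the only layer that differs is the one carrying the stereo parity, which is the kind of annotation the abstract describes for retrieved structures.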

The second deals with using text mining approaches to find papers that look like they could be abstracted into ChEMBL - that is they contain keywords enriched in medicinal chemistry and compound structure concepts. The abstract for this paper is:

The large increase in the number of scientific publications has fuelled a need for semi- and fully automated text mining approaches in order to assist in the triage process, both for individual scientists and also for larger-scale data extraction and curation into public databases. Here, we introduce a document classifier, which is able to successfully distinguish between publications that are ‘ChEMBL-like’ (i.e. related to small molecule drug discovery and likely to contain quantitative bioactivity data) and those that are not. The unprecedented size of the medicinal chemistry literature collection, coupled with the advantage of manual curation and mapping to chemistry and biology make the ChEMBL corpus a unique resource for text mining.
The method has been implemented as a data protocol/workflow for both Pipeline Pilot (version 8.5) and KNIME (version 2.9) respectively. Both workflows and models are freely available at: These can be readily modified to include additional keyword constraints to further focus searches.
Large-scale machine learning document classification was shown to be very robust and flexible for this particular application, as illustrated in four distinct text-mining-based use cases. The models are readily available on two data workflow platforms, which we believe will allow the majority of the scientific community to apply them to their own data.

%T UniChem: extension of InChI-based compound mapping to salt, connectivity and stereochemistry layers
%A J Chambers
%A M Davies
%A A Gaulton
%A G Papadatos
%A A Hersey
%A JP Overington
%J Journal of Cheminformatics 
%D 2014
%V 6:43  
%O doi:10.1186/s13321-014-0043-5

%T A document classifier for medicinal chemistry publications trained on the ChEMBL corpus
%A G Papadatos
%A GJP van Westen
%A S Croset
%A R Santos
%A S Trubian
%A JP Overington
%J Journal of Cheminformatics 
%D 2014
%V 6:40  
%O doi:10.1186/s13321-014-0040-8

Sunday, 7 September 2014

Structure Confirmation, Reproducibility of Research, and Literature Abstraction - Trust But Verify

The identity of compounds in the literature is often uncertain - with high profile cases of incorrect compounds being used in experiments - with subsequent difficulties in reconciling conflicting experiments or repeating results.

One example is the kinase inhibitor Bosutinib (SKI-606) - where several compound vendors shipped a closely related isomer of the actual drug. It is now unclear which data come from the bona fide Bosutinib and which from the compound sold as Bosutinib by multiple vendors. More details on this case are in this post by Derek Lowe, and there is a definitive structural biology paper here. So this probably clouds a lot of the reported Bosutinib bioactivity data that's out there. Here the difference is in the substitution pattern on an aromatic ring (2,4-dichloro-5-methoxy vs 3,5-dichloro-4-methoxy); the mass is the same for these isomers.

A second example is this case (a clinical stage compound called TIC10) where there was a mix-up in the structure assigned to a compound in a patent; since the substance had novel and potentially useful bioactivity and the annotated structure was wrong, this led to all sorts of complicated IP issues and shenanigans. Since this compound was also in a widely distributed and profiled NCI set, the literature ends up with data that are difficult to deconvolute.

There are probably many of these examples - and here are two more, both connected to clinical development stage protein kinase inhibitors from Exelixis.

The first is XL-765 also known as SAR-245409 and by the INN Voxtalisib - it is a PI3K inhibitor. Compound vendors have been selling 'XL-765' for some time, but the structure sold is (typically) not the correct XL-765 structure.

XL-765 is known to be this structure - multiple independent authoritative sources point to this identity. It's quite a simple, small molecule. The InChI key is RGHYDLZMTYDBDT-UHFFFAOYNA-N.
Several vendors have been selling a different structure as 'XL-765/SAR-245409'; as you will immediately note, it is a very different structure - not a positional or stereo-isomer, and it will have a very different mass, etc.

Some purchasers of this compound report peer-reviewed in vivo data. To be clear, this isn't just one company, it is all/most vendors of XL-765.

So there are a number of things that could have happened here:

i) The correct structure was supplied, but the structure on the label/on the website didn't match the real physical structure - i.e. a mislabelling.
ii) The incorrect structure was supplied, and the structure on the label/website was what was in the bottle. In our studies (at FIMM) the vendor supplied XL-765 has not behaved as a PI3K inhibitor, so the latter is more likely.

However, the incorrect XL-765 structure has propagated into public chemistry databases, and literature data for XL-765 is probably now largely suspect (with the exception of that from Exelixis themselves (and their collaborators)).

Here's the second new example - XL-147 (aka SAR-245408 and Pilaralisib) - again a PI3K inhibitor. The correct structure is QINPEPAQOBZPOF-UHFFFAOYNA-N

Many vendors were selling XL-147 as the following compound. Here there is a lot more similarity between the actual and sold compounds, but they are still different, and any biology on the sold compound, although it may well be active in its own right, is not the same as that of XL-147.

tl;dr Be careful if you are interested in purchasing or analysing activities and properties of Bosutinib, Pilaralisib and Voxtalisib. Be careful in loading names from vendor catalogs, and try to use more definitive authorities for compound synonyms. Be careful with relying on things from any public databases.
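One cheap sanity check along these lines is to compare the InChI key attached to a catalogue entry with a key from a source you trust, before loading the name. The sketch below is only an illustration: the function name and the result labels are invented here, the two reference keys are the ones quoted in this post, and the vendor key in the usage lines is a made-up placeholder, not a real catalogue entry. It relies on the first 14-character block of an InChI key being a hash of the connectivity.

```python
# Reference keys as quoted in this post; extend with your own trusted sources.
REFERENCE_KEYS = {
    "XL-765": "RGHYDLZMTYDBDT-UHFFFAOYNA-N",
    "XL-147": "QINPEPAQOBZPOF-UHFFFAOYNA-N",
}


def compare_to_reference(name, vendor_key, reference_keys=REFERENCE_KEYS):
    """Classify a vendor InChI key against the trusted key for a compound name."""
    reference = reference_keys.get(name)
    if reference is None:
        return "no reference"
    if vendor_key == reference:
        return "match"
    # The first block of an InChI key hashes the connectivity (skeleton).
    if vendor_key.split("-")[0] == reference.split("-")[0]:
        return "same skeleton, other layers differ"
    return "different compound"


# Hypothetical vendor key, made up purely for this example.
made_up_vendor_key = "AAAAAAAAAAAAAA-UHFFFAOYNA-N"
verdict = compare_to_reference("XL-765", made_up_vendor_key)
```

A "different compound" verdict does not say which structure is right, only that the name should not be merged without checking the primary sources.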

Krister and I would like to thank Willie Yuan at Chemietek for helping to clarify aspects of the confusing history of this compound.

jpo and Krister

Tuesday, 2 September 2014

We're hiring! Web developer for NIH Illuminating the Druggable Genome (IDG) project

We got a prize today, so we are happy. What better way to celebrate, than to recruit someone new for the group. We have a position available for a developer to support web service development and integration for the Knowledge Management Centre part of the recently announced NIH Illuminating the Druggable Genome project, see this link for details of the job.

Closing deadline for applications is 12th October 2014.

Thursday, 28 August 2014

SureChEMBL Update 1

As announced in the previous SureChEMBL blogpost, the temporary holding page is now in place. So when users visit (or, you will be redirected to

For updates on the release of the new SureChEMBL site, please keep an eye on the ChEMBL-og.

Tuesday, 26 August 2014

SureChEMBL Coming Very Soon

In the coming weeks we will be very pleased to announce the release of the new SureChEMBL website. Since the beginning of the year, we have been working hard with the folks over at Digital Science, along with all the content and software providers, to get the system set up and running in our own Amazon Web Services controlled environment. As we approach the final stages of the transition, we will need to temporarily halt access to the original SureChem site. The reason for this minor disruption is to allow us to complete the testing of the additional functionality we have added to the SureChEMBL user interface.

We will use ChEMBL-og as the primary route of communicating with users, so if you want to be kept up to date, bookmark the site. We will also make ad hoc tweets about SureChEMBL on @johnpoverington, @georgeisyourman, @surechembl and @chembl.

SureChEMBL User Interface

Users familiar with the previous SureChem UI will find a lot in common with the new SureChEMBL UI. A summary of the changes and new features we have added to the SureChEMBL UI is provided below:
  • A user account is no longer required to access the system
  • All users will have access to ‘Pro’ account features, which include chemistry exports, PDF downloads and enhanced search filters
  • UniChem has been integrated and provides dynamic cross references to external chemical resources
  • The new SCHEMBL identifier is used throughout the interface.
  • Updated compound sketchers (Latest Marvin JS and JSME)
  • Rebranding of headers and footers and removing old SureChem references
We will be keeping an eye on usage of the UI, and don't know what to expect in terms of new users. We will then review scaling hardware to cope with the load now that the default 'Pro' system is open to all.

SCHEMBL Identifier

In line with ChEMBL IDs, all compounds in SureChEMBL have been given SCHEMBL identifiers. For example, SCHEMBL1353 corresponds to 2-(acetyloxy)benzoic acid, aka aspirin. The identifier can be used to access the SureChEMBL compound page and will be included in all SureChEMBL downloads.

SureChEMBL Data Content

The SureChEMBL pipeline has been running daily throughout the summer and has now processed and extracted an additional ~400,000 novel compounds from patents since SureChem’s pipeline freeze. At the time of writing (16:22 22/08/14), the SureChEMBL counts are:
  • Total number of compounds 15,668,22
  • Total number of annotated patents 12,888,125
The rate of novel compounds and annotated patents is truly staggering: There are approximately 80,000 compounds extracted from 50,000 patents that are added to the system every month. Moreover, the latency for a new patent document from its application date to becoming searchable in the system is only between 2 and 7 days, in most cases.

SureChEMBL and UniChem

The complete SureChEMBL structure repository has been added to UniChem (src_id=15)  and consists of 15.2M unique structures mapped to their SCHEMBL IDs. SureChEMBL updates will be added to UniChem on a weekly basis, so that UniChem will be up to date with novel patent chemistry.

SureChEMBL Data Access

Besides availability in UniChem, the complete SureChEMBL structure repository is provided as SD and TSV files on our ftp site:

It has to be emphasised here that this is the raw compound feed as extracted automatically from text and images and is provided without any further filtering or manual curation. This feed contains fragments, radicals, atoms with wrong valencies, polymers and other oddities, but if you are the sort of person who wants to use this raw data, you will know what to filter out and how.

The chemical registry rules between SureChEMBL and ChEMBL have not been fully aligned yet - they use fundamentally different toolkits - so there are sometimes multiple SCHEMBL ids for the same InChI - if you know this is an issue, you will know how to fix it for your local purposes if you download the data.
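If the duplicate identifiers matter for your local copy, one simple approach is to group the SCHEMBL ids by InChI and pick one representative per group. The sketch below assumes you have already read the download into (SCHEMBL id, InChI) pairs; the function names, the choice of the lowest-numbered id as representative, and every id in the sample data other than SCHEMBL1353 are assumptions made for illustration, not real registry content.

```python
from collections import defaultdict


def schembl_number(schembl_id):
    """Numeric part of an id such as 'SCHEMBL1353'."""
    return int(schembl_id[len("SCHEMBL"):])


def group_ids_by_inchi(rows):
    """Map each InChI to the sorted list of SCHEMBL ids registered for it."""
    groups = defaultdict(list)
    for schembl_id, inchi in rows:
        groups[inchi].append(schembl_id)
    return {inchi: sorted(ids, key=schembl_number) for inchi, ids in groups.items()}


def pick_representatives(rows):
    """Local convention (an assumption): keep the lowest-numbered id per InChI."""
    return {inchi: ids[0] for inchi, ids in group_ids_by_inchi(rows).items()}


ASPIRIN = "InChI=1S/C9H8O4/c1-6(10)13-8-5-3-2-4-7(8)9(11)12/h2-5H,1H3,(H,11,12)"
ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

# Sample rows: only SCHEMBL1353 comes from this post; the other ids are invented.
sample_rows = [
    ("SCHEMBL99999901", ASPIRIN),
    ("SCHEMBL1353", ASPIRIN),
    ("SCHEMBL99999902", ALANINE),
]

duplicates = {k: v for k, v in group_ids_by_inchi(sample_rows).items() if len(v) > 1}
representatives = pick_representatives(sample_rows)
```

Any other tie-break rule would work just as well; the point is only to make the choice explicit and repeatable in your own pipeline.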

Initially, the SureChEMBL files on the ftp site will be updated on a quarterly basis.

SureChEMBL Future Plans


Going forward we have many plans related to SureChEMBL, some of which are linked to our involvement in the Open PHACTS project. Our current plans include:
  • Extraction of biological entities from the patent literature
  • SureChEMBL API release 
  • Updated workflow tool integration (e.g. KNIME and Pipeline Pilot)
You will hear more about these plans over the coming year, but our top priority now is to deliver the new SureChEMBL user interface.

If you have any questions about the new SureChEMBL system and data please get in touch

Monday, 25 August 2014

The incredible expanding universe of amino acids - Part 1

There are 23 currently known, natural, genetically encoded amino acids - they are pictured above, ordered by the number of heavy atoms contained within them. There are the core 20, then the additional, more unusual ones selenocysteine, N-formyl-methionine and pyrrolysine - the latter two are used only in bacteria. Post-translationally, many further covalent modifications are found, for example phosphorylation of serine, threonine, tyrosine and histidine, but the above are the core building blocks of proteins; the incredible chemical diversity of the proteome can be thought of as 'edits' on this core genetic set.

All of the above are alpha amino acids, and all, with the exception of glycine, have defined stereochemistry at the alpha carbon (they are all L-amino acids). Three of the amino acids have defined chirality in their side chains (isoleucine, threonine and pyrrolysine). There are only six elements used within this set (Carbon, Hydrogen, Nitrogen, Oxygen, Sulphur and Selenium). For me, to think that (along with some non-genetically encoded cofactors, such as ATP, Zinc, etc.) all the chemistry going on in our bodies comes from this simplicity is amazing. Of course, most of this complexity of function comes from the fact that the amino acids can form polymers: from the 20 'common' genetically encoded amino acids, there are 20^2 (400) possible dipeptides, 20^3 (8,000) tripeptides, and so on; so the chemical space of peptides gets big, quickly - for a decapeptide there are over 10 trillion possible covalently distinct peptides (well more than that actually, if the presence of free thiol and connectivity isomers of disulphide bonds are considered). This diversity of peptides has been well explored, but we became interested some time ago in identifying other useful amino acids.
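The arithmetic behind these numbers is just the alphabet size raised to the peptide length. A small Python sketch (the function name is ours, and it counts linear sequences only, ignoring the disulphide and free thiol variants mentioned above):

```python
def count_linear_peptides(length, alphabet_size=20):
    """Number of distinct linear sequences of a given length."""
    return alphabet_size ** length


for n in (2, 3, 10):
    print(f"{n}-mers: {count_linear_peptides(n):,}")
# 2-mers: 400
# 3-mers: 8,000
# 10-mers: 10,240,000,000,000
```

Passing alphabet_size=23 gives the corresponding counts if the three more unusual encoded amino acids are included.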

You can go a long way with this set of building blocks, however, sometimes it is desirable to include other amino acids into drugs or proteins, for example the drug Desmopressin, a vasopressin receptor agonist has a deaminated N-terminus and the unnatural chiral form of Arginine (D-Arginine) at the eighth position - these improve drug properties compared to dosing with the natural peptide. Occasionally, more radically different amino acids are used (e.g. Aib, alpha-aminoisobutyric acid at position 2 of the clinical candidate Taspoglutide (a GLP-1 mimetic)).

Saturday, 16 August 2014

Citing ChEMBL, and Data DOIs

There are now multiple formats and ways to access the ChEMBL data, and we have recently assigned DOIs to all available versions of ChEMBL (and will archive these on the ftp server, permanently).

So when you publish use of ChEMBL, could you reference the following papers:

ChEMBL Database
A. Gaulton, L. Bellis, J. Chambers, M. Davies, A. Hersey, Y. Light, S. McGlinchey, R. Akhtar, A.P. Bento, B. Al-Lazikani, D. Michalovich, & J.P. Overington (2012) ‘ChEMBL: A Large-scale Bioactivity Database For Chemical Biology and Drug Discovery’ Nucleic Acids Res. Database Issue, 40 D1100-1107. DOI:10.1093/nar/gkr777 PMID:21948594

A.P. Bento, A. Gaulton, A. Hersey, L.J. Bellis, J. Chambers, M. Davies, F.A. Krüger, Y. Light, L. Mak, S. McGlinchey, M. Nowotka, G. Papadatos, R. Santos & J.P. Overington (2014) ‘The ChEMBL bioactivity database: an update’ Nucleic Acids Res. Database Issue, 42 D1083-1090. DOI:10.1093/nar/gkt1031 PMID:24214965

R. Ochoa, M. Davies, G. Papadatos, F. Atkinson and J.P. Overington (2014) 'myChEMBL: A virtual machine implementation of open data and cheminformatics tools' Bioinformatics 30 298-300. DOI:10.1093/bioinformatics/btt666 PMID:24262214

S. Jupp, J. Malone, J. Bolleman, M. Brandizi, M. Davies, L. Garcia, A. Gaulton, S. Gehant, C. Laibe, N. Redaschi, S.M Wimalaratne, M. Martin, N. Le Novère, H. Parkinson, E. Birney and A.M Jenkinson (2014) 'The EBI RDF Platform: Linked Open Data for the Life Sciences' Bioinformatics 30 1338-1339 DOI:10.1093/bioinformatics/btt765 PMID:24413672

Also please reference the version of ChEMBL you may have used in any published analyses, using the following DOIs:







Future releases will adhere to the following patterns. We will be modifying the attribution part of the ChEMBL license to require reporting of these DOIs in publications that use ChEMBL. We hope this will contribute to reproducibility of analyses.

Friday, 8 August 2014

Registry numbers in ChEMBL

The numbers are in - the public vote (N=69, so quite small) was overwhelmingly (in roughly a 3:1 ratio) in favour of including registry numbers in ChEMBL/UniChem, as you will see from the screenshot above. There was some discussion (see Google+ and ChEMBL-og comments for details), as well as some Twitter response (it's pretty easy to hunt down if you are really, really interested). So we will see what we can do.....

ChEMBL US Tour - an update

We've had a great response to our call for offers of venues to help us on a ChEMBL outreach tour, funded by the project. Things are shaping up pretty well, but we probably still have space for something in the Seattle area, and also space maybe for something in Philadelphia. We also will probably do both East and West coasts on the same trip, due to the very positive response.

Get in touch if you are in the north-west, or north-east!