ChEMBL Resources

Tuesday, 7 July 2015

ChEMBL @ Boston this August

A couple of us will be visiting Boston, MA for the ACS Meeting between 16 and 20 August. We'll be talking about SureChEMBL and ChEMBL. If you'd like to arrange a meeting/seminar or just go out for drinks and clams, just let me know.


Thursday, 4 June 2015

Resources for Computational Drug Discovery

We are again running the popular EMBL-EBI/Wellcome Trust Workshop on Resources for Computational Drug Discovery. It is being held on the Genome Campus near Cambridge between 2nd and 6th November 2015.

This will be the fourth year that we have run the course and each year we adapt it using feedback from previous participants.  It is designed for academic and industrial researchers who want to learn more about the principles and applications of computational chemical biology. There will be a chance to explore ways of accessing data as well as tools to analyse and visualise data.  There will also be lots of opportunities to apply these methods in practical, hands-on sessions and sessions will be led by academic as well as industrial tutors.

More information, including the detailed programme, can be found here along with details of how to apply here.

Don’t forget places are limited and the deadline for applications is Wednesday 5th August. 

Finally, this year there will be an added bonus in that the course participants can come along to the Campus Fireworks display.

Friday, 29 May 2015

Compound popularity contest

Have you ever wondered which compound is the most popular in ChEMBL? And by popular I don't mean the one which cracks the best jokes at dinner parties; I mean the compound with the largest number of structural analogues or nearest neighbours (NNs). This number also gives an indication of the sparsity or density of the chemical space around a compound and is a useful concept during hit expansion and lead optimisation. 

This number of course depends on the fingerprint, the hashing and folding parameters, the similarity coefficient and the threshold. So let's say 2048-bit RDKit Morgan fingerprints with a radius of 2 or 3 (equivalent to ECFP_4 or ECFP_6) and a Tanimoto threshold of 0.5. Why such a low threshold? For an explanation, see here and here.
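To make the neighbour count concrete, here is a toy sketch in plain Python. It is not how chemfp works internally; the 8-bit fingerprints and the compound names are made up for illustration, and it assumes that a pair counts as neighbours when its Tanimoto score is at or above the threshold.

```python
def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints (bitsets)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0
    return bin(a & b).count("1") / union


def neighbour_counts(fps: dict, threshold: float = 0.5) -> dict:
    """For each ID, count the other fingerprints scoring >= threshold."""
    ids = list(fps)
    counts = {i: 0 for i in ids}
    for x in range(len(ids)):
        # Similarity is symmetric, so each pair is only scored once
        for y in range(x + 1, len(ids)):
            if tanimoto(fps[ids[x]], fps[ids[y]]) >= threshold:
                counts[ids[x]] += 1
                counts[ids[y]] += 1
    return counts


# Made-up 8-bit fingerprints, purely for illustration
toy_fps = {
    "cmpd_A": 0b11110000,
    "cmpd_B": 0b11100000,
    "cmpd_C": 0b00001111,
    "cmpd_D": 0b11111000,
}
toy_counts = neighbour_counts(toy_fps)
for name, n in sorted(toy_counts.items(), key=lambda kv: -kv[1]):
    print(n, name)
```

The double loop is quadratic in the number of compounds, which is exactly why an optimised tool is needed at ChEMBL scale.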

To calculate this compound 'popularity', one would need to calculate the full similarity matrix of the 1.4M compounds in ChEMBL. This used to be prohibitively computationally expensive just a few years ago; nowadays, thanks to chemfp, it only takes 5 commands and a few hours to calculate the matrix and counts at a certain similarity threshold on a decent machine. Here's how it's done:

#install chemfp
pip install chemfp

#get the chembl sdf

#calculate rdkit morgan fps
rdkit2fps --morgan --radius 3 --id-tag "chembl_id" --errors report chembl_20.sdf.gz -o rdkit_chembl.fps

#calculate counts of neighbours at a given threshold, here 0.5
simsearch --threshold 0.5 --NxN -c rdkit_chembl.fps -o chembl_sim_matrix.txt

#sort by number of neighbours and find most 'popular' compound

sort -rn chembl_sim_matrix.txt | head -25

Here's a list of the most popular ones with the number of NNs:

1570 CHEMBL3037891
1540 CHEMBL2373012
1523 CHEMBL3086861
1461 CHEMBL2158309
1451 CHEMBL440810
1451 CHEMBL414360
1428 CHEMBL2401865
1425 CHEMBL408133
1425 CHEMBL354100
1425 CHEMBL344931
1425 CHEMBL310737
1425 CHEMBL303256
1425 CHEMBL302290
1425 CHEMBL27867
1425 CHEMBL137783
1403 CHEMBL1866472
1396 CHEMBL3102922
1392 CHEMBL3102921
1392 CHEMBL3102920
1387 CHEMBL91573
1387 CHEMBL70362
1387 CHEMBL601767
1387 CHEMBL533732
1387 CHEMBL503883
1387 CHEMBL477

Most of the most popular compounds are large, usually peptides, e.g. CHEMBL3037891. We'll leave it as an exercise for the reader why that is. But this is not always the case: for example, CHEMBL477 is adenosine and it genuinely has a lot of near neighbours in ChEMBL (including a lot of stereo variants).

And what about the least popular compounds, i.e. the ones with no neighbours above 0.5 Tanimoto? One would not expect a lot of them, since ChEMBL by design contains compounds from congeneric series as reported in the med. chem. literature. And yet, due to small size, symmetry (CHEMBL100050), possible mistakes, and peculiarities of fingerprint similarity, or due to being genuinely lonely (CHEMBL35416), they do exist.  
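If you'd rather pull the lonely compounds out programmatically than scroll to the bottom of the sorted file, something like the sketch below would do. It assumes the counts file has one "count, whitespace, ID" record per line and that any header lines start with '#'; check that against your own output before trusting it. The three example records use counts taken from this post.

```python
import io


def read_counts(handle):
    """Parse 'count<whitespace>id' lines into a dict, skipping '#' headers."""
    counts = {}
    for line in handle:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        count, compound_id = line.split()[:2]
        counts[compound_id] = int(count)
    return counts


def singletons(counts):
    """IDs with no neighbours at all at the chosen threshold."""
    return sorted(cid for cid, n in counts.items() if n == 0)


# Tiny stand-in for chembl_sim_matrix.txt (header line is an assumption)
example = io.StringIO(
    "#Count/1\n"
    "1387\tCHEMBL477\n"
    "0\tCHEMBL100050\n"
    "0\tCHEMBL35416\n"
)
example_counts = read_counts(example)
print(singletons(example_counts))
```

For the real file, replace the StringIO object with `open("chembl_sim_matrix.txt")`.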

Apart from popularity contests, a similarity/distance matrix of ChEMBL compounds is the first step for clustering and graph/community detection analysis.


PS: Thebacon has only 22 neighbours...

Friday, 15 May 2015

What's Going On?

I’ve been asked a lot by mail recently ‘What’s Going On?’ Well, here are some facts and some emotion.

So today is my last day at work here at EMBL-EBI. It’s been a fun and thrilling ride (for me at least), I’ve made lots of new friends, living life as an Open Data advocate and academic researcher, and most importantly having the privilege to lead the team here responsible for the ChEMBL database. It had been a long-term goal of mine to unlock large-scale bioactivity data from proprietary data silos and eye-wateringly expensive paywalls; so as US President George Bush famously said ‘Mission Accomplished!’. The impact of ChEMBL on academia, SMEs and large pharma has been great - and you can see the impact in new method development, but more importantly in potential new future drugs. My personal indebtedness to the Wellcome Trust for their support is immeasurable. An additional big shout out to Digital Science for their vision in donating the SureChEMBL platform to us.

I’ll be starting a new blog, covering my next adventure - artificial intelligence-enhanced drug discovery. The first few weeks at a new job, as I’m sure you know, are spent sorting out pencils, working out where the best coffee is hidden and most importantly navigating the office politics of the milk in the fridge, so when this blog starts up, I’ll tweet the url. For those of you interested in the ChEMBL group's activities, make sure you follow @ChEMBL and @SureChEMBL. If you want to see what I'm up to next, it's at @StratMed.

If any of you are ever in the West End of London (which to non-native Londoners actually means the centre) get in touch with me, and I’ll try to treat you to an orange mocha frappuccino.

Now for the bit you all actually care about….

  • Anne Hersey is taking over the ChEMBL Wellcome Trust Strategic Award grant for ChEMBL (which also covers SureChEMBL). Many of you will know Anne already, and know just what good news this is. Anne is also taking over the majority of our other grants and activities of the group, including our participation in IMI eTox, NIH IDG KMC, & GSK CTTV grants.
  • Jo McEntyre will be PI for EMBL-EBI on the IMI OpenPHACTS grant, although the majority of the work will be done by ChEMBL group staff. If you don’t know about the Open PHACTS platform, check out what they have done!
  • Ugis Sarkans will become PI for EMBL-EBI on the FP7 HeCaTos grant. The data content and modelling components will be done by ChEMBL group staff.

If you have general questions about ChEMBL or SureChEMBL first try the support email addresses and

Thursday, 7 May 2015

The Ying and Yang of Drugs and Targets

Today is election day in the UK, and so it's one's duty to vote, and then spend the rest of the day thinking about just how cool science is.

I was thinking the other day on the train that we don't always want drugs that are benign. Of course we always want drugs that are safe and predictable in action, but mechanistically we sometimes want to do harm, kill a cell, wipe it out entirely - let's call these malign. The socially responsible expression of this desire to obliterate life with drugs is in the development of antibiotics (against bacteria, viruses, fungi and parasites) and also anti-cancer agents. The other system we sometimes want to modulate to cause specific harm is the immune system - priming it to attack non-self (pathogen) cells, or that limbo-land of pseudo-self cancer cells.

Anyway, this thinking of drugs as either (simplistically) malign or benign was triggered by a mail from a new collaborator interested in antiparasite agents identified via drug repositioning. How could you come up with a prioritised and annotated set of current drugs (and their associated cognate targets) that are biased to killing certain cells?

There are many approaches to this problem; here is just one... Given that the majority of drugs try to do no harm, and restore normal function to the recipient, screening most 'benign' drugs would be expected a priori to do little. Do we really want to find a way to balance sugar metabolism in a plasmodium and cure its diabetes? Of course not, so screening known antidiabetic agents is less likely to reveal a cidal phenotype.

There are exceptions to all such half-assed approximations. The two most obvious are:
  • sporadic binding of a non-cidal drug/target to a new cidal target in the pathogen.
  • a benign-malign target switch occurs - so a context (organism) dependent switch happens - a nice example here is the statins, which were first isolated and characterised as antifungal natural products, and now of course are used as cholesterol-lowering agents in humans. It’s left as an exercise for the reader to answer why statins aren’t used as antifungals.

So, let’s have a high level go at classifying drugs as malign or benign in intent. I’m on a train, so it will need to be a quick and dirty approach…. Use the WHO ATC classification! Classes J, L and P ('Antiinfectives for systemic use', 'antineoplastic and immunomodulating agents', and 'anti-parasitic products, insecticides and repellents' respectively) are clearly enriched in malign drugs (and malign targets), and certain subclasses of other top level classes are also malign, e.g. D01, D06 and G01.

So, we can rapidly (and approximately) classify drugs into malign and benign classes using the ATC, and via simple extension produce sets of malign and benign targets. We can even address some of the more-likely benign to malign switches, for example using sequence similarity to identify similar architectures/mechanisms. As an example, quite a few anthelmintics work by binding to ion channels and causing paralysis, and finding ATC classes of ion-channel blockers is straightforward.
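The quick and dirty ATC rule can be sketched in a few lines of Python. The prefix list is just the classes named above (J, L, P, D01, D06, G01); the function name and the "everything else is benign" default are assumptions made for this illustration, and it inherits all the crudeness of the original approximation.

```python
# ATC prefixes named in the post as enriched in 'malign' drugs
MALIGN_PREFIXES = ("J", "L", "P", "D01", "D06", "G01")


def classify_atc(code: str) -> str:
    """Label an ATC code 'malign' or 'benign' by prefix alone."""
    code = code.strip().upper()
    return "malign" if code.startswith(MALIGN_PREFIXES) else "benign"


examples = {
    "J01CA04": "amoxicillin",
    "P01BA01": "chloroquine",
    "A10BA02": "metformin",
    "C10AA01": "simvastatin",
}
for atc, drug in examples.items():
    print(atc, drug, classify_atc(atc))
```

Note that simvastatin comes out as benign here, which is the human-context answer; the benign-malign switch for statins is precisely the kind of exception a prefix rule cannot see.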

Toodlepip, for ever, jpo!

Thursday, 30 April 2015

Easter egg hunt results.

As promised, we would like to provide the answer to the Easter egg hunt competition and announce the winner. Exactly seven hours after publishing the blog post we received a comment with the correct answer. The author of the comment was Matt Swain, who runs his blog about cheminformatics.

You can verify the correctness of his answer by visiting the password protected link in the original post. The password is:


Congratulations to Matt!

ChEMBL team

Thursday, 9 April 2015

Upcoming Webinars

We are pleased to announce a new round of resource-specific webinars that will be given in May and June 2015. These four webinars will cover UniChem, ChEMBL, MyChEMBL (the ChEMBL Virtual Machine) and the ChEMBL Web Services.

  • ChEMBL Web Services, 4pm BST, 17th June 2015,   Currently Postponed

For those of you who can't make these days/times, each 1 hour long webinar will be video recorded and will be available to watch on YouTube afterwards. Additionally, we will make the slides available for download.

The video for last month's SureChEMBL webinar can be found here (and part 2 here). 

For more information about the webinars, or to suggest other topics to cover, please contact

Wednesday, 1 April 2015

Easter egg hunt

Easter is coming, and for all those who don't know what to do with their spare time and fancy entering a little competition, we've prepared a small challenge.

Easter Egg?

In software development, an Easter egg is a funny (but harmless) and undocumented feature hidden from users in unusual places. Excel 97 has its Flight Simulator, Firefox has its about:robots address and Debian's apt-get has a moo command. The ChEMBL web services have now joined this list and we invite you to find their hidden feature and share it with others.

But why?

We would like to encourage you to look at the source code of our web services. Reading code is an essential developer skill, as it helps in understanding how the code works. This can lead to the development of new software and/or improvements to an existing codebase. After skimming through the code, hopefully you will agree that it is well written and easy to extend. Let us know if you disagree, either by emailing us or creating a GitHub issue. We promise, there are no dragons there, only an Easter egg, which should be easy to find after reading the code.


The first person to comment on this post with a URL revealing the Easter egg (web service URL, hmm hint maybe?), will be honoured with a mention in a future blog post. As a 'guarantee', that the Easter egg exists now, we provide the following password protected URL - we will reveal the password, the winner and more details in the follow up blog post. Do you think you can do it? Check it out!

Is this an April fool joke?

No. The joke is in having an Easter egg, not in finding it.

Monday, 23 March 2015

Compound names and The InterWeb

We have a long-standing interest in finding clinical stage compounds from the literature - and it turns out that the peer reviewed literature is pretty useless: by the time something is published and appears in print it is old news. Although reviews of particular areas or targets are useful in capturing a snapshot, they are not really useful in decision support - using data to inform future experiments and investment. So these things need to be databased and online to be of much use.

So, we have a pretty big set of clinical stage kinase inhibitors that we've gathered from a wide variety of sources - this is the subject of a paper we're currently writing up, so I won't bore you with how we got the data, well, not right now.

We've posted a couple of times before about the transition of names, or classes of names as compounds go through development and approval - a search of the ChEMBL-og will show you these. But here's something hot off the press.

Each row is a distinct kinase inhibitor, each column is a synonym or identifier for that compound. Columns 1 to 5 are research code type names (e.g. UK-92480), with the first column being the one we use as the primary identifier. Column 6 is the InChI key, Column 6.1 is the CAS Registry number, 7 is the USAN/INN, and 8 is a deprecated USAN/INN or another common trivial name. Columns 9 to 12 are trade names (in case you are wondering, row 12 is the Chinese tradename of a compound). The cells are coloured by the log(10) of the number of times the name occurs in a google search - pink high, blue low (or there is no synonym of that class for that compound). There are some false positives, where the name is unusually common, so it matches the name of something unrelated to a kinase inhibitor or therapeutic application.

It's interesting to see the diversity of names increasing as a compound becomes a launched drug, but also the broad coverage for many of these compounds with hits to the InChI key - and also the fraction of compounds for which structures are known.

The most frequent hits in google are to:

ALL-3      11000000
CIF         7400000
X-82        1930000
AT-877       623000
X-396        425000
DE-10        362000
AC-220       245000
R-112        226000
Imatinib     226000
Erlotinib    148000

Note this is without any filtering, or other fancy stuff, just raw search engine hit counts.
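For anyone wanting to reproduce the colour scale, the log(10) transform of the raw counts is a one-liner. The sketch below uses the hit counts listed above; the function name is made up, and returning None for a missing synonym is an assumption about how one might handle empty cells.

```python
import math

# Raw hit counts copied from the list above
hit_counts = {
    "ALL-3": 11000000,
    "CIF": 7400000,
    "X-82": 1930000,
    "AT-877": 623000,
    "X-396": 425000,
    "DE-10": 362000,
    "AC-220": 245000,
    "R-112": 226000,
    "Imatinib": 226000,
    "Erlotinib": 148000,
}


def colour_value(hits):
    """log10 of the hit count; None when there are no hits (or no synonym)."""
    return math.log10(hits) if hits > 0 else None


for name, hits in hit_counts.items():
    print(f"{name:<10}{hits:>10}{colour_value(hits):>7.2f}")
```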

Francis and jpo

Friday, 20 March 2015

The SureChEMBL map file is out

As many of you know, SureChEMBL taps into the wealth of knowledge hidden in the patent documents. More specifically, SureChEMBL extracts and indexes chemistry from the full-text patent corpus (EPO, WIPO and USPTO; JPO titles and abstracts only) by means of automated text- and image-mining, on a daily basis. We have recently hosted a webinar about it which turned out to be very popular - for those who missed it, the video and slides are here.

Besides the interface, SureChEMBL compound data can be accessed in various ways, such as UniChem and PubChem. The full compound dump is also available as a flat file download from our ftp server.

Since the release of the SureChEMBL interface last September, we have received numerous requests for a way to access compound and patent data in a batch way. Typical use-cases would include retrieving all compounds for a list of patent IDs, or vice versa, retrieving all patents where one or more compounds have been extracted from. As a result, we have now produced this so-called map file which connects SureChEMBL compounds and patents.

It is available here.
More information can be found in the README file.

What is this file?

There is a total of 216,892,266 rows in the map, each indicating a compound extracted from a specific section of a specific patent document. The format of the file is quite simple: it contains compound information (SCHEMBL ID, SMILES, InChI Key, corpus frequency), patent information (patent ID and publication date), and finally location information, such as the field ID and frequency. The field ID indicates the specific section in the patent where the compound was extracted from (1:Description, 2:Claims, 3:Abstract, 4:Title, 5:Image, 6:MOL attachment). The frequency is the number of times the compound was found in a given section of a given patent. More information on the format of the file is in the README file.

How many compounds and patents are there?

There are 187,958,584 unique patent-compound pairs, involving 14,076,090 unique compound IDs extracted from 3,585,233 EP, JP, WO and US patent documents - an average of ~52 compounds per patent. The patent coverage is from 1960 to 31-12-2014 inclusive.

Here's a breakdown of the patents in the map per year and patent authority:

Are these all the compounds and patents in SureChEMBL?

Technically, no - in practice, yes. We excluded chemically annotated patents that are not immediately relevant to life sciences, such as this one. For the filtering, we used a list of relevant IPCR and related patent classification codes. At the same time, we excluded too small, too large, too trivial compounds, along with non-organic and radical/fragment compounds.

Are these compounds genuinely claimed as novel in their respective patents?

Automated assessment of which compounds are important and relevant in a pharmaceutical patent is a field of research and one of our future plans. For now, the map file includes all extracted chemistry mentioned in all sections of a patent, subject to the filters listed in the previous section. A quick and effective trick to filter out trivial and/or uninformative compounds is to use the corpus frequency column and exclude everything with a value of more than, say, 1000. Note that, in this way, you will also exclude drug compounds such as sildenafil, which are casually mentioned in a lot of patents. You could also look for compounds mentioned only in the claims, description or images sections by filtering by the corresponding field ID.
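Here is a minimal sketch of that filtering trick in Python. It assumes the map file is tab-separated with the columns in the order listed earlier (SCHEMBL ID, SMILES, InChI key, corpus frequency, patent ID, publication date, field ID, frequency); please check the README for the actual layout. The column names, function name and toy rows are made up for illustration.

```python
import csv
import io

# Assumed column order; verify against the README before use
COLUMNS = ["schembl_id", "smiles", "inchi_key", "corpus_frequency",
           "patent_id", "publication_date", "field_id", "frequency"]


def filter_rows(handle, max_corpus_frequency=1000, field_ids=frozenset({1, 2, 5})):
    """Yield rows with corpus frequency <= cutoff found in the chosen sections.

    Default field IDs: 1 = Description, 2 = Claims, 5 = Image.
    """
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # blank or malformed line
        record = dict(zip(COLUMNS, row))
        try:
            corpus = int(record["corpus_frequency"])
            field = int(record["field_id"])
        except ValueError:
            continue  # most likely a header line
        if corpus <= max_corpus_frequency and field in field_ids:
            yield record


# Toy rows with placeholder identifiers, not real SureChEMBL data
toy_map = io.StringIO("\n".join("\t".join(r) for r in [
    ("SCHEMBL_A", "C", "KEY-A", "12", "US-1", "2014-01-01", "2", "1"),
    ("SCHEMBL_B", "C", "KEY-B", "50000", "US-1", "2014-01-01", "1", "4"),
    ("SCHEMBL_C", "C", "KEY-C", "3", "EP-2", "2013-06-30", "4", "1"),
    ("SCHEMBL_D", "C", "KEY-D", "1000", "EP-2", "2013-06-30", "5", "2"),
]))
kept = [r["schembl_id"] for r in filter_rows(toy_map)]
print(kept)
```

Because the function is a generator, it can stream through the full file without loading all the rows into memory.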

What can I do with this?

Well, you can start by 'grepping' for one or more patent IDs or SCHEMBL IDs or InChI keys, followed by further filtering. Many of you will choose to normalise the flat file into 3 database tables (say compounds, documents and doc_to_compound) for centralised access and easy querying.
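As a rough sketch of that normalisation, the snippet below loads a few rows into three SQLite tables using Python's built-in sqlite3 module. The table layout is only one possible design, the column order of the input rows is assumed to follow the format described earlier, and the toy rows use placeholder identifiers rather than real data.

```python
import sqlite3

SCHEMA = """
CREATE TABLE compounds (
    schembl_id TEXT PRIMARY KEY, smiles TEXT, inchi_key TEXT,
    corpus_frequency INTEGER);
CREATE TABLE documents (
    patent_id TEXT PRIMARY KEY, publication_date TEXT);
CREATE TABLE doc_to_compound (
    patent_id TEXT REFERENCES documents(patent_id),
    schembl_id TEXT REFERENCES compounds(schembl_id),
    field_id INTEGER, frequency INTEGER,
    PRIMARY KEY (patent_id, schembl_id, field_id));
"""


def load_rows(conn, rows):
    """Split each flat row across the three tables, ignoring repeats."""
    for (schembl_id, smiles, inchi_key, corpus_freq,
         patent_id, pub_date, field_id, freq) in rows:
        conn.execute("INSERT OR IGNORE INTO compounds VALUES (?, ?, ?, ?)",
                     (schembl_id, smiles, inchi_key, int(corpus_freq)))
        conn.execute("INSERT OR IGNORE INTO documents VALUES (?, ?)",
                     (patent_id, pub_date))
        conn.execute("INSERT OR REPLACE INTO doc_to_compound VALUES (?, ?, ?, ?)",
                     (patent_id, schembl_id, int(field_id), int(freq)))
    conn.commit()


# Toy rows with placeholder identifiers, not real SureChEMBL data
toy_rows = [
    ("SCHEMBL_X", "C", "KEY-X", "12", "US-1", "2014-01-01", "1", "3"),
    ("SCHEMBL_X", "C", "KEY-X", "12", "US-1", "2014-01-01", "2", "1"),
    ("SCHEMBL_X", "C", "KEY-X", "12", "EP-2", "2013-06-30", "2", "2"),
    ("SCHEMBL_Y", "CC", "KEY-Y", "7", "EP-2", "2013-06-30", "5", "1"),
]
conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
load_rows(conn, toy_rows)

# All patents a given compound was extracted from
patents = [row[0] for row in conn.execute(
    "SELECT DISTINCT patent_id FROM doc_to_compound "
    "WHERE schembl_id = ? ORDER BY patent_id", ("SCHEMBL_X",))]
print(patents)
```

For the full map you would want batched inserts and an index on inchi_key, but the table split is the same.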

For example, to find the patents the drug palbociclib has been extracted from:
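One way to do such a lookup, sketched in Python rather than grep: scan the file for rows whose InChI key matches and collect the patent IDs. This assumes a tab-separated file with the InChI key in the third column and the patent ID in the fifth, following the format described earlier; the function name and the toy lines are hypothetical, and "KEY-OF-INTEREST" stands in for the standard InChI key of palbociclib.

```python
def patents_for_inchi_key(lines, inchi_key):
    """Return the sorted, de-duplicated patent IDs for one InChI key."""
    patents = set()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Assumed layout: fields[2] = InChI key, fields[4] = patent ID
        if len(fields) > 4 and fields[2] == inchi_key:
            patents.add(fields[4])
    return sorted(patents)


# Toy lines with placeholder identifiers, not real SureChEMBL data
toy_lines = [
    "SCHEMBL_X\tC\tKEY-OF-INTEREST\t12\tUS-1\t2014-01-01\t1\t3\n",
    "SCHEMBL_X\tC\tKEY-OF-INTEREST\t12\tUS-1\t2014-01-01\t2\t1\n",
    "SCHEMBL_X\tC\tKEY-OF-INTEREST\t12\tEP-2\t2013-06-30\t2\t2\n",
    "SCHEMBL_Y\tCC\tOTHER-KEY\t7\tWO-3\t2012-03-15\t5\t1\n",
]
found = patents_for_inchi_key(toy_lines, "KEY-OF-INTEREST")
print(found)
```

Any iterable of lines works, so for the real download you can pass an open file handle (via gzip.open in text mode if the file is compressed).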

Any plans to update this map file?  

New patents and chemistry arrive and are stored to SureChEMBL every day. We are planning to release new versions and incremental updates of the map file every quarter, in sync with the update of the compound dump files.

I couldn’t find my compound / patent - this compound should not be there

Don’t forget this is an automated, live, high-throughput text-mining effort against an inherently noisy corpus such as patents. We are constantly working on improving data quality. If you find anything strange, let us know.

Can I join more metadata, such as patent assignee and title?

Obviously your first port of call would be the SureChEMBL website for patent metadata, but other services you may wish to use include the EPO web services for programmatic access.

Is there anything else?

Errr, yes. Watch this space for another post on storing and accessing live SureChEMBL data, behind your firewall. 

The SureChEMBL Team