Saturday, 16 August 2014

Citing ChEMBL, and Data DOIs

There are now multiple formats and ways to access the ChEMBL data, and we have recently assigned DOIs to all available versions of ChEMBL (and will archive these on the FTP server, so they remain available for repeating analyses, etc.).

So when you publish work that uses ChEMBL, please cite the following papers:

ChEMBL Database
A. Gaulton, L. Bellis, J. Chambers, M. Davies, A. Hersey, Y. Light, S. McGlinchey, R. Akhtar, A.P. Bento, B. Al-Lazikani, D. Michalovich & J.P. Overington (2012) ‘ChEMBL: A Large-scale Bioactivity Database For Chemical Biology and Drug Discovery’ Nucleic Acids Res. Database Issue, 40, D1100-D1107. DOI: 10.1093/nar/gkr777 PMID: 21948594

A.P. Bento, A. Gaulton, A. Hersey, L.J. Bellis, J. Chambers, M. Davies, F.A. Krüger, Y. Light, L. Mak, S. McGlinchey, M. Nowotka, G. Papadatos, R. Santos & J.P. Overington (2014) ‘The ChEMBL bioactivity database: an update’ Nucleic Acids Res., 42, D1083-D1090. DOI: 10.1093/nar/gkt1031 PMID: 24214965

R. Ochoa, M. Davies, G. Papadatos, F. Atkinson & J.P. Overington (2014) ‘myChEMBL: A virtual machine implementation of open data and cheminformatics tools’ Bioinformatics, 30(2), 298-300. DOI: 10.1093/bioinformatics/btt666 PMID: 24262214

S. Jupp, J. Malone, J. Bolleman, M. Brandizi, M. Davies, L. Garcia, A. Gaulton, S. Gehant, C. Laibe, N. Redaschi, S.M. Wimalaratne, M. Martin, N. Le Novère, H. Parkinson, E. Birney & A.M. Jenkinson (2014) ‘The EBI RDF Platform: Linked Open Data for the Life Sciences’ Bioinformatics, 30, 1338-1339. DOI: 10.1093/bioinformatics/btt765 PMID: 24413672

Also please reference the version of ChEMBL you used, using the following DOIs:







Friday, 8 August 2014

Registry numbers in ChEMBL

The numbers are in - the public vote (N=69, so quite small) was overwhelmingly (in roughly a 3:1 ratio) in favour of including registry numbers in ChEMBL/UniChem, as you will see from the screenshot above. There was some discussion (see the Google+ and ChEMBL-og comments for details), as well as some Twitter response (it's pretty easy to hunt down if you are really, really interested). So we will see what we can do...

ChEMBL US Tour - an update

We've had a great response to our call for offers of venues to help us on a ChEMBL outreach tour, funded by the project. Things are shaping up pretty well, but we probably still have space for something in the Seattle area, and maybe also for something in Philadelphia. Due to the very positive response, we will probably do both the East and West coasts on the same trip.

Get in touch if you are in the north-west, or north-east!


Monday, 4 August 2014

ChEMBL US Tour 2014

We have some specific funding to do some training and outreach for ChEMBL (including UniChem and SureChEMBL). We would like to set something up on either the East or West coasts - the map above is the Google Analytics view for the blog (remember, we don’t run any analytics on the ChEMBL site, we respect your privacy). Based on this there are a couple of realistic options, and we can realistically only do one of these this year.
  • West Coast - Seattle, Bay Area and San Diego, or
  • East Coast - RTP, DC, Philly, New York, New Jersey and Boston.

We are thinking of sometime in November or early December 2014.

In order to make this work, we would need a local coordinator to arrange rooms, advertise to interested local users, and so forth, and also some assistance with logistics planning. If you have access to special rates at hotels, that would be great.

We could run either a set of chalk-and-talk lectures at each location or, if you have a training room with computers, some workshops/hands-on training. We would typically cover:
  1. Introduction to ChEMBL, UniChem and SureChEMBL
  2. Application of ChEMBL in lead discovery and medicinal chemistry
  3. Patent searching in SureChEMBL
  4. Drugs and Targets in ChEMBL
  5. Using KNIME with ChEMBL
  6. Database schema and SQL querying, myChEMBL.

So, any interest in hosting us for a day, and what would you like to hear about? Please mail us!

If there is sufficient interest, we will look into which option (East or West coast) has the most potential meetings.

Once we’ve set something up, we’ll post an itinerary and further details on the ChEMBL-og.

Friday, 1 August 2014

Should CAS numbers be in ChEMBL and/or UniChem?

A very quick survey to add excitement to either your holiday or work-day! This is not one of those sucker links where a 0.24% complete progress bar appears on the second page; it's just a simple yes/no question on whether it's a good idea to add CAS registry numbers to ChEMBL and/or UniChem. No promises that we could deliver this, but depending on what you vote for, we will consider our options.

Update: Given the multiple channels out there, there are also comments on this on LinkedIn (in the ChUG - "ChEMBL User Group" group - why not join, if you're not already) and a couple on Google+.

Update 2: I'll let the poll run till the end of the week (Friday 8th August 2014) - and then write something up on the results.

Wednesday, 23 July 2014

ChEMBL_19 Released - Now with Crop Protection Data!

We are pleased to announce the release of ChEMBL_19. This version of the database was prepared on 3rd July 2014 and contains:
  • 1,637,862 compound records
  • 1,411,786 compounds (of which 1,404,752 have molfiles)
  • 12,843,338 bioactivities
  • 1,106,285 bioassays
  • 10,579 targets
  • 57,156 abstracted documents
Data can be downloaded from the ChEMBL FTP site. Please see the ChEMBL_19 release notes for full details of the changes in this release.

New crop protection data

We have now expanded the content of ChEMBL to include data relevant to crop protection research. Bioactivity data covering insecticides, fungicides and herbicides were extracted from a number of different journals such as J. Agric. Food. Chem., J. Pesticide Sci., Crop Protection and Pest Manag. Sci. The addition of this dataset to ChEMBL was funded by Syngenta. In total, more than 40K compound records and 245K activities were added in this dataset. These data are included in the 'Scientific Literature' data source and can be retrieved from the ChEMBL interface using the taxonomy browser ('Browse Targets' -> 'Taxonomy') or through assay keyword searches (e.g., 'insecticidal', 'herbicidal').

Other changes since the last release

New neglected disease data sets

ChEMBL_19 includes the following data sets:

  • MMV malaria box Plasmodium falciparum screening data deposited by Eisai
  • MMV malaria box Onchocerca lienalis screening data deposited by Northwick Park Institute for Medical Research
  • MMV malaria box Cryptosporidium parvum screening data deposited by the University of Vermont
  • Trypanosoma cruzi fenarimol series screening data deposited by Drugs for Neglected Diseases Initiative (DNDi)
  • Plasmodium falciparum screening data from the Open Source Malaria project.

Hepatotoxicity data

Hepatotoxicity information for more than 1,200 compounds has been extracted from the following publication, relating to the 14th edition of the Drug hepatotoxicity bibliographic database:

  • Biour M, Ben Salem C, Chazouillères O, Grangé JD, Serfaty L and Poupon R. [Drug-induced liver injury; fourteenth updated edition of the bibliographic database of liver injuries and related drugs]. Gastroenterol. Clin. Biol., 2004, 28(8-9), 720-759.

New journal coverage
We are now pleased to be able to include MedChemComm in our list of journals for routine data extraction. ChEMBL_19 includes 120 articles from this excellent journal, published between 2013 and 2014. We are most grateful to the RSC for access to the source journal material. We will post more on this exciting new partnership in a future blog post!

Interface enhancements

New compound sketcher:
The ligand search now provides ChemAxon's Marvin JS sketcher as default for substructure/similarity searches.

Cochrane Collaboration reviews and British National Formulary (BNF) entries:
For drugs, the compound report card now provides links to any available Cochrane reviews and entries in the British National Formulary.

UniChem cross references:
UniChem now contains two additional sources: NMRShiftDB and the LINCS program. Cross references to these databases (where the compound occurs in the relevant source) are now provided on compound report card pages.

As usual, contact us with any questions or feedback.

The ChEMBL Team

Monday, 14 July 2014

Research Code Distribution Across the Literature

As you will see from many previous blog posts, we like compound research codes (alternatively known as company codes). These are often the first and primary identifier for a compound in the literature, and reports of activity often pre-date disclosure of structure. They are of key importance in getting a view of the state of the art for progress against a target, etc., and are often the only way to search for bioactivity across a broad set of literature sources. They also have many other applications, such as competitive intelligence, investor/analyst data mining, etc.

Above is a plot of the frequency of occurrence of GSK codes in the full text of 808,378 Open Access EuropePMC articles - I’ve canonicalised them to account for differences in punctuation (“GSK-123456”, “GSK123456” and “GSK 123456”). If you do a Google search for the most common ones, you will easily see the importance of these compounds across drug discovery and pharmacology. For example, GSK-1521498 is a mu opioid receptor antagonist, with (as of today) 158,000 hits in Google.
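The punctuation canonicalisation mentioned above could be sketched roughly as follows. This is only an illustration, not the code actually used for the plot: the regular expression and the `canonicalise` function name are assumptions, and the pattern only handles the three spellings quoted in the text (hyphen, no separator, single space).

```python
import re

# Assumed pattern: "GSK", an optional hyphen or whitespace character, then digits.
GSK_CODE = re.compile(r"\bGSK[-\s]?(\d+)")

def canonicalise(text):
    """Return every GSK code found in text, rewritten as 'GSK-<digits>'."""
    return ["GSK-" + digits for digits in GSK_CODE.findall(text)]

mentions = "GSK-123456, GSK123456 and GSK 123456 all name one compound."
print(canonicalise(mentions))
```

Because only the digits are captured, all three spellings collapse to the same 'GSK-123456' string, which is what makes counting across articles possible.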

I chose GSK codes for a few reasons: 1) they are a large, ethical research company, with a commitment to publication of research; 2) they are local to us here in Hinxton, and it’s easy to ask questions of our local friends; 3) GSK don’t switch early company codes to development stage codes, or use a duality of names, both of which complicate simple analysis - a nice example here is the recently approved kinase inhibitor Alectinib, which goes under the following codes: RG-7853, CH-5424802, AF-802 and Ro-5424802. In fact there are some complexities - for example, on the GSK pipeline web-page they don't themselves use the GSK prefix, just integers; but there, in that context, it is implicit that they are GSK codes.

Oh, I also stripped salt/batch codes (so GSK-123456A became GSK-123456). There are quite a few out-of-range values (too small, or unbelievably big), and then there are various ‘mutations’ that occur, either in writing the manuscript or in typesetting, etc. For example, there are 4 occurrences of GSK-112012, a small typo of the far more frequent (221 occurrences) GSK-1120212, and it is easy to see how dropping a single digit could have caused this. To be clear, there will be a GSK-112012, it’s a valid name, but the likelihood is that references to it in the literature, unless supported by other evidence, are in fact about GSK-1120212. Interestingly, the occurrence rate of these mutants is about 10% of all unique GSK numbers (and this is a lower estimate: my first-pass attempt at finding them relied on the first few digits being correct; edit-distance-based clustering would be the place to start here, of course). However, some transcription errors do seem to be more common than others, as one would expect for strings containing mostly numbers (so GSK-123456 -> GSK-12346 is a lot more likely than GSK-123456 -> GSK-123q56). It’s likely that a set of 'real world' typos could readily be assembled and used to build modified ‘edit distances’ useful in cleaning up data. With such a high potential error rate, this could become critical in real-world use. Interestingly, these errors then propagate from paper to paper as they are copied from one source to another.
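As a rough illustration of the edit-distance idea, the sketch below flags a rare code when a far more frequent code sits within one edit of it. It uses plain Levenshtein distance rather than the typo-weighted distance speculated about above, and the function names, the `max_distance` threshold and the `ratio` threshold are all assumptions made for the example. The two counts are the ones quoted in the text.

```python
from collections import Counter

def edit_distance(a, b):
    """Classic Levenshtein distance (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def likely_typos(counts, max_distance=1, ratio=10):
    """Map each rare code to a much more frequent code within max_distance edits."""
    suspects = {}
    for rare, rare_n in counts.items():
        for common, common_n in counts.items():
            close = edit_distance(rare, common) <= max_distance
            if common_n >= ratio * rare_n and close:
                suspects[rare] = common
    return suspects

# Occurrence counts quoted in the post.
counts = Counter({"GSK-1120212": 221, "GSK-112012": 4})
print(likely_typos(counts))
```

A pairwise comparison like this is quadratic in the number of unique codes, so for a large set one would probably block on a shared prefix or length first; a flagged pair is also only a hint, since the rare code may be a perfectly valid compound.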

Below is the frequency distribution of each cleaned-up GSK code. It shows the classic log-normal/power-law distribution, with some compounds, very likely the most interesting, having the most data. They are also likely to be the most progressed towards becoming a real drug. The long tail is there too, and one would expect this long tail to be more likely to be full of errors than the commonly referred-to compounds.
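For readers who want to build a similar distribution from their own list of cleaned-up codes, a minimal sketch is given here. The `rank_frequency` name is an assumption, and the mentions are toy values made up for illustration, not data from EuropePMC.

```python
from collections import Counter

def rank_frequency(codes):
    """Return (rank, code, count) tuples, most frequently mentioned code first."""
    ranked = Counter(codes).most_common()
    return [(rank, code, n) for rank, (code, n) in enumerate(ranked, start=1)]

# Toy mentions, invented purely to show the shape of the output.
mentions = ["GSK-0000001"] * 5 + ["GSK-0000002"] * 2 + ["GSK-0000003"]
table = rank_frequency(mentions)

# Codes seen exactly once form the long tail discussed in the post.
singletons = sum(1 for _, _, n in table if n == 1)
print(table, singletons)
```

Plotting count against rank on log-log axes is the usual way to check by eye whether such a table looks power-law-like.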

And here is a frequency scatterplot. Many compounds are mentioned only once in the literature. This dual-domain ‘frequency spectrum’ (1: order in time, from the ordinal number in the research code, though not linear; and 2: frequency of mentions in the literature) is really interesting and useful, as future posts will outline. There is also another time domain at work here: the time of disclosure/publication.

This initial analysis is just for EuropePMC full-text content, but of course a similar analysis can be done across ChEMBL, SureChEMBL (for patents) and the internet (both in search engine indexes and, with more complexity and difficulty, across the dark web). Of course, this can be combined with the list of research codes, and the tracking across company mergers, that is part of ChEMBL as well.

Toodle-pip for now!

jpo and Jee-Hyub Kim (McEntyre group)

Wednesday, 9 July 2014

Conference: SLAS 2015 Call for abstracts

The July 28 deadline for podium abstract submissions for SLAS2015 is just a few weeks away. Please consider responding to this call to play an important role in the industry-leading event located at the intersection of science and technology. Abstract submissions from industry, academia and government professionals are welcome.

SLAS is currently accepting both podium and poster presentation abstracts for consideration towards the scientific program at SLAS2015. Program tracks include:
  • Assay Development and Screening
  • Automation and High-Throughput Technologies
  • Bioanalytical Techniques
  • Biomarker Development and Applications
  • Drug Target Strategies
  • Informatics
  • Micro/Nano Technologies
Visit the SLAS2015 Web site for detailed track descriptions.
Presenting at SLAS2015 positions both you and your organization as a leader in your field. Serving as a presenter is also sure to expand your professional network and provide you with qualified insights from peers that will improve your research moving forward. If you have expertise in the field of scientific automation - or an interesting story or case study to share - please consider this opportunity to contribute to the SLAS2015 scientific program.
The Tony B. Academic Travel Award
SLAS is proud to once again offer the Tony B. Academic Travel Award to facilitate the attendance of students and emerging academic professionals to present their scientific work at SLAS2015. Award recipients receive airfare or mileage reimbursement, conference registration and shared accommodations at SLAS2015. Undergraduate and graduate students, postdoctoral associates, and junior faculty are encouraged to apply.
The SLAS Innovation Award
All presenters selected to deliver podium presentations can nominate themselves to be considered for the 2015 SLAS Innovation Award. The final round of judging for this award will take place at SLAS2015 in Washington, DC. This prestigious award carries with it a $10,000 cash prize to recognize the top-rated podium presentation, among all those nominated, delivered at the SLAS Annual Conference.

Questions? Contact Amy McGorry of SLAS Headquarters by e-mail or by calling 1.630.256.7527 ext. 101.

Tuesday, 24 June 2014

Meeting - EMBL-EBI/Wellcome Trust Workshop on Resources for Computational Drug Discovery

It's that time of year again, when we advertise our fun and engaging course on computational drug discovery. This year it will be held from 17th to 21st November 2014, at the Wellcome Trust Genome Campus, Hinxton, Cambs, UK.

This workshop provides participants with the underlying principles of computational chemical biology and addresses how these methods are applied in the field of drug discovery. It will explore approaches to accessing data, combining different data types and introduce the tools available to assist analysis work. Practical sessions will guide participants in retrieving and analysing chemogenomic, proteomic and metabolomic data for target analysis. 

Target Audience 
This workshop is suitable for both academic and industrial researchers interested in drug discovery from a range of biological disciplines. An undergraduate level of biology is essential and participants should have a basic understanding of UNIX, programming and running simple scripts (such as Python).

Syllabus, Tools and Resources 
During this workshop you will learn about:
  • Approaches and strategy in computational drug discovery
  • EMBL-EBI chemical biology resources - ChEMBL & ChEBI
  • PDBe for structural models
  • ZINC purchasable compound database
  • canSAR drug discovery platform
  • Approaches to drug repositioning

Link to registration.

Deadline for applications is 1st August 2014.

Monday, 23 June 2014

myChEMBL on Bare Metal

myChEMBL is distributed as a Virtual Machine (VM), which is good because you can treat it like another file on your filesystem. It can be transmitted, copied, renamed, deleted, etc. The myChEMBL VM behaves like a sandbox, so software installed there can't harm your computer.

But there are sometimes costs associated with using a VM; for example, VMs are usually several percent slower than the host they are running on. There are also a number of scenarios where using a VM may not be optimal or even possible, for example:
  • You just want to enrich your existing machine with chemistry-related software
  • The only machine you have is itself virtual - VM provisioning software often prevents you from installing a VM within a VM
  • When performance is critical
In these cases you may not want the whole myChEMBL VM, only the software that it ships with.

Fortunately we have a script that automates the process of creating our customized VM. But not only that - we keep it publicly available along with all the other resources necessary to build myChEMBL! The main entry point is a bash script called '', which, when executed, performs the following steps:
  • Creates user called 'chembl' and adds it to sudoers list
  • Updates software distribution channels and upgrades OS
  • Installs common software libraries required by our tools
  • Installs python/ipython notebook/postgres DB
  • Sets up python environment using virtualenvwrapper/virtualenv
  • Downloads ChEMBL_18 data dump, and stores it in freshly created database
  • Installs RDKit and builds postgres cartridge
  • Installs and configures a web server and all resources that will be accessible via browser
  • Configures network
  • Adds some branding

How to use it?

Just run:

curl -s | bash 

You can optionally wrap it with 'time' to see how long it takes to execute:

time(curl -s | bash) 

It takes about 6 hours on our machines, but we have a fast internet connection. It could take 2-3 times longer for you, depending on your bandwidth and computer speed. The script is extremely verbose, so you will easily see what is being installed at any moment. Tip: you can redirect stdout/stderr to file(s) or even /dev/null to make it silent.

What takes the most time?

Depending on your configuration, these are the most time-consuming operations performed during execution of the script:
  • Creating fingerprints and indexes for chemistry cartridge
  • Downloading ChEMBL_18 dump from EBI's FTP (917 MB)
  • Compiling libraries

Currently the '' script has been tested only on Ubuntu. It should work on every standard Ubuntu release since 12.04 (and probably Linux Mint as well). It's possible that the script will work fine (after some minor tweaks) with Debian, since Ubuntu is based on it and they both use the same package manager. In future we are planning to make it work with other systems (CentOS and RedHat).

Furthermore, in order to execute this script you should have root privileges as it uses 'sudo' many times.

Is it safe?

What we are asking you to do follows the "curl pipe sh" pattern, which may be of some security concern.
We believe this is the fastest, most convenient and most elegant way for the majority of our users, and it should be fine if you trust:
  • Your internet connection (no man in the middle, which would be hard anyway since we are using https here).
  • Us at ChEMBL (we hope so!)
If you are not convinced, you can do:
  1. curl -o
  2. Carefully analyze contents, making sure there is no malware
  3. chmod +x
  4. ./
In that case, please note that this process can be recursive, since the script itself contains the "curl pipe sh" pattern many times.

Let us know if you have any follow-up questions about this post or about myChEMBL.