ChEMBL Resources


Monday, 14 July 2014

Research Code Distribution Across the Literature

As you will see from many previous blog posts, we like compound research codes (alternatively known as company codes). These are often the first and primary identifier for a compound in the literature, and reports of activity often pre-date disclosure of structure. They are of key importance in getting a view of the state of the art for progress against a target, etc., and are often the only way to search for bioactivity across a broad set of literature sources. They also have many other applications, such as competitive intelligence, investor/analyst data mining, etc.



Above is a plot of the frequency of occurrence of GSK codes in the full text of 808,378 Open Access EuropePMC articles - I’ve canonicalised them to account for differences in punctuation (“GSK-123456”, “GSK123456” and “GSK 123456”). If you do a Google search for the most common ones, you will easily see the importance of these compounds across drug discovery and pharmacology. For example, GSK-1521498 is a mu opioid receptor antagonist with (as of today) 158,000 hits in Google.
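The canonicalisation step can be sketched in a few lines of Python. This is a minimal illustration rather than the code used for the plot: the regular expression, the function names and the 'GSK-<digits>' canonical form are all assumptions, chosen only to cover the three punctuation variants mentioned above.

```python
import re
from collections import Counter

# Assumed pattern: "GSK", an optional hyphen or whitespace character, then digits.
GSK_CODE = re.compile(r"\bGSK[-\s]?(\d+)")


def canonicalise(text):
    """Return every GSK code found in text, rewritten as 'GSK-<digits>'."""
    return ["GSK-" + digits for digits in GSK_CODE.findall(text)]


def count_codes(documents):
    """Count canonical GSK codes across an iterable of article texts."""
    counts = Counter()
    for document in documents:
        counts.update(canonicalise(document))
    return counts


if __name__ == "__main__":
    # Made-up sentences, used only to exercise the three spellings.
    sample = ["GSK-123456 was potent.", "We used GSK123456 and GSK 123456."]
    print(count_codes(sample))
```

Counting per mention, as here, gives the frequency of occurrence; counting each code once per article would instead give the number of articles that mention it.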

I chose GSK codes for a few reasons: 1) GSK is a large, ethical research company with a commitment to publication of research; 2) they are local to us here in Hinxton, and it’s easy to ask questions of our local friends; 3) GSK don’t switch early company codes to development stage codes, or use a duality of names, either of which complicates simple analysis - a nice example here is the recently approved kinase inhibitor alectinib, which goes under the following codes: RG-7853, CH-5424802, AF-802 and Ro-5424802. In fact there are some complexities - for example, on the GSK pipeline web page they don't themselves use the GSK prefix, just integers; but here, in this context, it is implicit that they are GSK codes.

Oh, I also stripped salt/batch codes (so GSK-123456A became GSK-123456). There are quite a few out-of-range values (too small, or unbelievably big), and then there are various ‘mutations’ that occur, either in writing the manuscript or in typesetting, etc. For example, there are 4 occurrences of GSK-112012, which is a small typo away from the far more frequent (221 times) GSK-1120212, and it is easy to see how simply dropping a digit could have caused this. To be clear, there will be a GSK-112012, it’s a valid name, but the likelihood is that references to it in the literature, unless supported by other evidence, are in fact about GSK-1120212. Interestingly, the occurrence rate of these mutants is about 10% of all unique GSK numbers (and this is a lower estimate - my first-pass attempt at finding these relied on the first few digits being correct; edit-distance based clustering would be the place to start here, of course). However, some transcription errors do seem to be more common than others, as one would expect for strings containing mostly numbers (so GSK-123456 -> GSK-12346 is a lot more likely to happen than GSK-123456 becoming GSK-123q56). It’s likely that a set of 'real world' typos can readily be assembled and used to build modified ‘edit distances’ useful in cleaning up data. With such a high potential error rate, this could become critical in real-world use. Interestingly, these errors then propagate from paper to paper as they are copied from one source to another.
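As a starting point for the edit-distance idea, here is a hedged sketch in Python. It is not the first-pass method described above: the function names, the plain Levenshtein distance, the distance threshold of 1 and the 10-fold frequency ratio are all assumptions made for illustration. A rare code is flagged when a much more frequent code lies within one edit of it.

```python
import re


def strip_suffix(code):
    """Drop a trailing salt/batch letter suffix, e.g. 'GSK-123456A' -> 'GSK-123456'."""
    return re.sub(r"[A-Za-z]+$", "", code)


def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                       # deletion
                            curr[j - 1] + 1,                   # insertion
                            prev[j - 1] + (char_a != char_b)))  # substitution
        prev = curr
    return prev[-1]


def likely_typos(counts, max_distance=1, min_ratio=10):
    """Map each rare code to the most frequent near-identical code, if any."""
    suspects = {}
    for rare, n_rare in counts.items():
        best = None
        for common, n_common in counts.items():
            if n_common < min_ratio * n_rare:
                continue  # not frequent enough to be the presumed original
            if levenshtein(rare, common) <= max_distance:
                if best is None or n_common > counts[best]:
                    best = common
        if best is not None:
            suspects[rare] = best
    return suspects


if __name__ == "__main__":
    # The two counts quoted in the post.
    print(likely_typos({"GSK-1120212": 221, "GSK-112012": 4}))
```

The all-pairs comparison is quadratic in the number of unique codes, so a larger data set would need some form of blocking or indexing first. A weighted distance built from observed typos, as suggested above, would replace the uniform costs in `levenshtein`.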

Below is the frequency distribution of each cleaned-up GSK code. It shows the classic log-normal/power law distribution, with some compounds, very likely the most interesting ones, having the most data. They are also likely to be the most progressed towards becoming a real drug. The long tail is there too, and one would expect this long tail to be more likely to be full of errors than the commonly referred to compounds.



And here is a frequency scatterplot. Many compounds are mentioned only once in the literature. This dual-domain ‘frequency spectrum’ (1 - order in time, from the ordinal number in the research code, though this is not linear; and 2 - frequency of mentions in the literature) is really interesting and useful, as future posts will outline. There is also another time domain at work here - the time of disclosure/publication.



This initial analysis is just for EuropePMC full text content, but of course a similar analysis can be done across ChEMBL, SureChEMBL (for patents) and the internet (both via search engine indexes and, with more complexity and difficulty, across the dark web). Of course, this can be combined with the list of research codes, and the tracking across company mergers, that is part of ChEMBL as well.

Toodle-pip for now!

jpo and Jee-Hyub Kim (McEntyre group)

Wednesday, 9 July 2014

Conference: SLAS 2015 Call for abstracts


The July 28 deadline for podium abstract submissions for SLAS2015 is just a few weeks away. Please consider responding to this call to play an important role in the industry-leading event located at the intersection of science and technology. Abstract submissions from industry, academia and government professionals are welcome.

SLAS is currently accepting both podium and poster presentation abstracts for consideration towards the scientific program at SLAS2015. Program tracks include:
  • Assay Development and Screening
  • Automation and High-Throughput Technologies
  • Bioanalytical Techniques
  • Biomarker Development and Applications
  • Drug Target Strategies
  • Informatics
  • Micro/Nano Technologies
Visit the SLAS2015 Web site for detailed track descriptions.
Presenting at SLAS2015 positions both you and your organization as a leader in your field. Serving as a presenter is also sure to expand your professional network and provide you with qualified insights from peers that will improve your research moving forward. If you have expertise in the field of scientific automation - or an interesting story or case study to share - please consider this opportunity to contribute to the SLAS2015 scientific program.
The Tony B. Academic Travel Award
SLAS is proud to once again offer the Tony B. Academic Travel Award to facilitate the attendance of students and emerging academic professionals to present their scientific work at SLAS2015. Award recipients receive airfare or mileage reimbursement, conference registration and shared accommodations at SLAS2015. Undergraduate and graduate students, postdoctoral associates, and junior faculty are encouraged to apply. Click here for complete details on the Tony B. Academic Travel Award.
The SLAS Innovation Award
All presenters selected to deliver podium presentations can nominate themselves to be considered for the 2015 SLAS Innovation Award. The final round of judging for this award will take place at SLAS2015 in Washington, DC. This prestigious award carries with it a $10,000 cash prize to recognize the top-rated podium presentation, among all those nominated, delivered at the SLAS Annual Conference. Click here for complete details on the SLAS Innovation Award.

Questions? Contact Amy McGorry of SLAS Headquarters by e-mail or by calling 1.630.256.7527 ext. 101.

Tuesday, 24 June 2014

Meeting - EMBL-EBI/Wellcome Trust Workshop on Resources for Computational Drug Discovery


It's that time of year again, when we advertise our fun and engaging course on computational drug discovery. This year it will be held from 17th to 21st November 2014 at the Wellcome Trust Genome Campus, Hinxton, Cambs, UK.

This workshop provides participants with the underlying principles of computational chemical biology and addresses how these methods are applied in the field of drug discovery. It will explore approaches to accessing data, combining different data types and introduce the tools available to assist analysis work. Practical sessions will guide participants in retrieving and analysing chemogenomic, proteomic and metabolomic data for target analysis. 

Target Audience 
This workshop is suitable for both academic and industrial researchers interested in drug discovery from a range of biological disciplines. An undergraduate level of biology is essential and participants should have a basic understanding of UNIX, programming and running simple scripts (such as python).   

Syllabus, Tools and Resources 
During this workshop you will learn about:
- Approaches and strategy in computational drug discovery
- EMBL-EBI chemical biology resources - ChEMBL & ChEBI
- PDBe for structural models
- ZINC purchasable compound database
- canSAR drug discovery platform
- Approaches to drug repositioning

Link to registration.

Deadline for applications is 1st August 2014.

Monday, 23 June 2014

myChEMBL on Bare Metal




myChEMBL is distributed as a Virtual Machine (VM), which is good because you can treat it like another file on your filesystem. It can be transmitted, copied, renamed, deleted, etc. The myChEMBL VM behaves like a sandbox, so software installed there can't harm your computer.

But there are sometimes costs associated with using a VM: for example, VMs are usually several percent slower than the host they are running on. There are also a number of scenarios where using a VM may not be optimal or even possible, for example:
  • You just want to enrich your existing machine with chemistry-related software
  • The only machine you have is itself virtual - VM provisioning software often prevents you from installing a VM within a VM
  • When performance is critical
In these cases you may not want the whole myChEMBL VM, only the software that it ships with.

Fortunately, we have a script that automates the process of creating our customized VM. Not only that - we keep it publicly available, along with all the other resources necessary to build myChEMBL! The main entry point is a bash script called 'bootstrap.sh', which performs the following steps when executed:
  • Creates user called 'chembl' and adds it to sudoers list
  • Updates software distribution channels and upgrades OS
  • Installs common software libraries required by our tools
  • Installs python/ipython notebook/postgres DB
  • Sets up python environment using virtualenvwrapper/virtualenv
  • Downloads ChEMBL_18 data dump, and stores it in freshly created database
  • Installs RDKit and builds postgres cartridge
  • Installs and configures a web server and all resources that will be accessible via browser
  • Configures network
  • Adds some branding

How to use it?

Just run:

curl -s https://raw.githubusercontent.com/chembl/mychembl/master/bootstrap.sh | bash 

You can optionally wrap it with 'time' to see how long it takes to execute:

time(curl -s https://raw.githubusercontent.com/chembl/mychembl/master/bootstrap.sh | bash) 

It takes about 6 hours on our machines, but we have a fast internet connection. It could take 2-3 times longer for you, depending on your bandwidth and computer speed. The script is extremely verbose, so you will easily see what is being installed at any moment. Tip: you can redirect stdout/stderr to file(s) or even /dev/null to make it silent.

What takes the most time?

Depending on your configuration, these are the most time-consuming operations performed during execution of the script:
  • Creating fingerprints and indexes for chemistry cartridge
  • Downloading ChEMBL_18 dump from EBI's FTP (917 MB)
  • Compiling libraries

Requirements

So far the 'bootstrap.sh' script has been tested only on Ubuntu. It should work on every standard Ubuntu release since 12.04 (and probably Linux Mint as well). It's possible that the script will work fine (after some minor tweaks) on Debian, since Ubuntu is based on it and they both use the same package manager. In future we are planning to make it work with other systems (CentOS and Red Hat).

Furthermore, in order to execute this script you should have root privileges as it uses 'sudo' many times.

Is it safe?

What we are asking you to do follows the "curl pipe sh" pattern, which may be of some security concern.
We believe this is the fastest, most convenient and most elegant way for the majority of our users. It is safe as long as you trust:
  • Your internet connection (no man in the middle, which would be hard anyway since we are using https here).
  • github.com.
  • us at ChEMBL (we hope so!)
If you are not convinced, you can instead:
  1. curl -o bootstrap.sh https://raw.githubusercontent.com/chembl/mychembl/master/bootstrap.sh
  2. Carefully analyze contents, making sure there is no malware
  3. chmod +x bootstrap.sh
  4. ./bootstrap.sh
In that case, please note that this process can be recursive, since bootstrap.sh itself uses the "curl pipe sh" pattern many times.
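To make the manual review a little easier, the nested downloads can be listed with a short Python helper. This is only a sketch: the regular expression is a rough heuristic of our own (a line that calls curl or wget and pipes the result into sh or bash), the function name is hypothetical, and it is no substitute for actually reading the script.

```python
import re

# Assumed heuristic: a curl or wget call whose output is piped into sh or bash,
# optionally through sudo.
PIPE_TO_SHELL = re.compile(r"\b(curl|wget)\b.*\|\s*(sudo\s+)?(ba)?sh\b")


def find_piped_downloads(script_text):
    """Return (line_number, line) pairs that look like 'curl pipe sh' calls."""
    hits = []
    for number, line in enumerate(script_text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # skip comment lines (and the shebang)
        if PIPE_TO_SHELL.search(line):
            hits.append((number, line.strip()))
    return hits


if __name__ == "__main__":
    # Expects a copy of the script saved locally as bootstrap.sh.
    with open("bootstrap.sh") as handle:
        for number, line in find_piped_downloads(handle.read()):
            print(number, line)
```

Each URL reported this way points at a further script that would need the same inspection.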

Let us know if you have any follow up questions about this post or about myChEMBL.

Tuesday, 17 June 2014

How to install myChEMBL using two Vagrant commands

 

 

 

TL;DR

  1. install Vagrant and VirtualBox
  2. run vagrant init chembl/myChEMBL && vagrant up
  3. wait a bit...
  4. go to http://127.0.0.1:8000/
  5. enjoy!

What have I just done?


Vagrant is a tool for building and deploying complete development environments, and myChEMBL 18 now supports installation via Vagrant. We achieved this by first creating a myChEMBL Vagrant Box, which we then registered on the Vagrant Cloud. This allows users to install myChEMBL on their local system using two simple commands*:

vagrant init chembl/myChEMBL
vagrant up


*assumes you have vagrant installed

It's that simple! After you type this into your system console, the expected output should be similar to the one below:




What are those two commands doing?


The first command initializes the current directory to be a Vagrant environment by creating an initial Vagrantfile and prepopulates the config.vm.box setting in the created Vagrantfile. This happens immediately. This command should be executed only once.

The second one, when executed for the first time, downloads the myChEMBL box file from the public repository, unpacks it and stores the downloaded files in a dedicated directory on your machine, so you won't have to download it again. Then it configures the myChEMBL virtual machine and runs it using VirtualBox. The box file is big (7GB) so it will take some time to download and uncompress, but this will happen only once.

OK, so I've executed them successfully, now what?


Just open your favorite browser and type this URL: http://127.0.0.1:8000/. You should see the myChEMBL LaunchPad (we have seen Apache take a little while to fire up, so if you do not see the LaunchPad straight away, give it a couple more minutes).
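If you are scripting the installation, for example on a headless server, the "wait a bit" step can be automated. The following Python sketch polls the LaunchPad URL until it answers; the function name, the timeout and the polling interval are our own assumptions and are not part of myChEMBL or Vagrant.

```python
import time
import urllib.error
import urllib.request


def wait_for_launchpad(url="http://127.0.0.1:8000/", timeout=300.0, interval=5.0):
    """Poll url until it returns HTTP 200, or give up after timeout seconds."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # web server not ready yet (refused, reset or timed out)
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


if __name__ == "__main__":
    print("LaunchPad is up" if wait_for_launchpad() else "Gave up waiting")
```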

Can I ssh into the machine?

 

Yes, just type:

vagrant ssh

and you are there. You are logged in as a special 'vagrant' user with root privileges and no password. If you prefer to work as a standard 'chembl' user just type:

sudo su - chembl

Great, now how can I switch the machine off?


Do it by typing:

vagrant halt

You can bring it up again at any time using the command you already know:

vagrant up

which should execute much faster the second time, as everything is already downloaded.

Are there any software dependencies?


Yes, but not many. In order to use Vagrant you should install it. Our configuration currently supports only VirtualBox as a provider so you should install it as well.

What advantages does the myChEMBL Vagrant install offer over the standard myChEMBL installation?

  1. First of all, you don't have to remember the download URL of where the myChEMBL Vagrant box lives. All you need to know is a box name: 'chembl/myChEMBL'.   
  2. You don't have to think about configuring the VirtualBox image (e.g. network settings), as all of the configuration is stored in a Vagrantfile prepared by us.
  3. Using command line interface it is easy to deploy myChEMBL on multiple machines automatically. You can run it easily on your headless server.
  4. The download is smaller, so it will download faster. The box file size is 7.1GB compared to 8.5 GB of standard myChEMBL-18_0-disk1.vmdk file.
In future, we plan to support more providers, in particular Amazon Web Services and Docker.
We also plan to make provisioning more flexible so if needed you will be able to choose what Linux distribution should be used as a base for your myChEMBL (Red Hat and CentOS are the most important distros we would like to support).

Bugs and help


We've found a couple of minor bugs since the release of myChEMBL 18. They are all fixed now and the myChEMBL Vagrant box includes all the fixes. If you happen to find any other issue, whether related to myChEMBL or specific to our Vagrant distribution, please report it to mychembl[at]ebi.ac.uk.

You can always check for new updates using the following command:

vagrant box outdated

If new updates are available, you can update the box by running:

vagrant box update

Please remember that updating the box means downloading it from the remote repository, so it will take some time. Also note that this command will not magically update running Vagrant environments. If a Vagrant environment is already running, you'll have to destroy and recreate it to acquire the new updates in the box.

In our next myChEMBL-related blog post we will tell you how to install myChEMBL on a local VM or a 'bare metal' machine running Ubuntu (Hint: have a look at this file).

Thursday, 12 June 2014

A python client for accessing ChEMBL web services

Motivation
The ChEMBL Web Services provide simple, reliable programmatic access to the data stored in the ChEMBL database. RESTful API approaches are quite easy to master in most languages but still require writing a few lines of code. Additionally, it can be a challenging task to write a nontrivial application using REST without any examples. These factors were the motivation for us to write a small client library for accessing the web services from Python.

Why Python?
We chose this language because Python has become extremely popular (and is still growing in use) in scientific applications; there are several Open Source chemical toolkits available in this language, so the wealth of ChEMBL resources and the functionality of those toolkits can be easily combined. Moreover, Python is a very web-friendly language and we wanted to show how easily complex resource acquisition can be expressed in Python.


Reinventing the wheel?
There are already some libraries providing access to ChEMBL data via web services, for example bioservices. These are great, but they sometimes break when we change the schema or implementation details on our side. With the use of this client you can be sure that any future changes in the REST API will be immediately reflected in the client. But this is not the only thing we tried to achieve.


Features
During development of the Python client we had in mind all the best practices for creating such a library. We tried to make it as easy to use and as efficient as possible. Key features of this implementation are:
  • Caching results locally in sqliteDB for faster access
  • Bulk data retrieval
  • Parallel asynchronous requests to API during bulk retrieval
  • Ability to use SMILES containing URL-unsafe characters in exactly same way as safe SMILES
  • Lots of syntactic sugar
  • Extensive configuration so you can toggle caching, asynchronous requests, change timeouts, point the client to your local web services instance and much more.
These features are intended to make your code using the ChEMBL REST API as fast as possible (if you know of any faster way, drop us a line and we will try to include it in the code).
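To illustrate why URL-unsafe SMILES need special care, here is a short, hedged Python sketch. It does not show how the client handles this internally; it only demonstrates percent-encoding with the standard library. BASE_URL is a placeholder invented for the example and is not a real ChEMBL endpoint.

```python
from urllib.parse import quote, unquote

# Placeholder base URL for illustration only; NOT a documented ChEMBL endpoint.
BASE_URL = "https://example.org/api/compounds/smiles/"


def smiles_url(smiles):
    """Build a URL with the SMILES percent-encoded into a single path segment."""
    # safe="" makes quote() also encode "/", which SMILES uses for double-bond
    # stereochemistry and which would otherwise be read as a path separator.
    return BASE_URL + quote(smiles, safe="")


if __name__ == "__main__":
    for smiles in ["C#N", "F/C=C/F", "C[C@H](N)C(=O)O"]:
        print(smiles, "->", smiles_url(smiles))
```

Characters such as '#', '/' and '\' are common in SMILES but have a special meaning in URLs ('#', for instance, starts a fragment), so an unencoded SMILES in a GET request can be silently truncated or misrouted.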


Demo code
OK, so let's see the live code (all examples are included in tests.py file). If you are an ipython notebook fan (like most of us) and prefer reading from notebooks instead of github gists you can click here:


So many features, just for compounds! Let's see targets:


So far, so good! What about assays?


Can I get bioactivities as well?

It's completely optional to change any settings in the ChEMBL client. We believe that the default values we have chosen are optimal for most reasonable applications. However, if you would like to play with the settings, for example to make our client work with your local web services instance, this is possible:



GET or POST?




Benchmarks

We've decided to compare our client with the existing bioservices implementation. Before we describe the method and results, let's say a few words about the installation process. Both packages can be installed with pip, but bioservices is quite large (1.8MB) and requires dependencies not directly related to web retrieval (such as pandas or SOAPpy). Our client, on the other hand, is rather small (<0.5 MB) and requires requests, requests-cache and grequests.

To compare the two libraries, we decided to measure the time taken to retrieve the first thousand compounds, with IDs from CHEMBL1 to CHEMBL1000. We ignored 404 errors. This is how the code looks for our client:


And for bioservices:


Both snippets were run 5 times and the average time was computed.
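The averaging step can be sketched as follows. This is a generic timing harness written for illustration, not the benchmark script itself; `task` is a placeholder for either of the retrieval snippets wrapped in a function.

```python
import time


def average_runtime(task, repeats=5):
    """Run task() `repeats` times and return the mean wall-clock time in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        durations.append(time.perf_counter() - start)
    return sum(durations) / len(durations)


if __name__ == "__main__":
    # Stand-in workload; replace with the retrieval code being measured.
    print(average_runtime(lambda: sum(range(100000))))
```

When a local cache is involved, whether it is cleared between runs changes what is being measured, so that choice should be made explicitly.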


Results:
chembl client with cache: 4.5s
chembl client no cache: 6.7s
bioservices: 9m40s


which means that our client is roughly 86-129 times faster than the bioservices client (580 s against 6.7 s and 4.5 s respectively).


Installation and source code



Our client is already available at the Python Package Index (PyPI), so in order to install it on your machine, you should type:

sudo pip install chembl_webresource_client

or:

pip install chembl_webresource_client

if you are using virtualenv.

The source code is open and hosted on GitHub, so you can clone it by typing:

git clone https://github.com/chembl/chembl_webresource_client.git

install the latest development version using the following pip command:

sudo pip install git+https://github.com/chembl/chembl_webresource_client.git  
 
or just browse it online.


What about other programming languages?

Although we don't have plans to release a similar client library for other programming languages, examples highlighting the most important API features using Perl and Java are published on the Web Services homepage. And since the Web Services have CORS and JSONP support, we will publish JavaScript examples in the form of an interactive API browser, so stay tuned!


Michal

Tuesday, 10 June 2014

New Drug Approvals 2014 - Pt. X - Albiglutide (Eperzan™ or Tanzeum™)



Wikipedia: Albiglutide
ChEMBL: CHEMBL2107841

On April 15th the FDA approved Tanzeum (albiglutide) subcutaneous injection to improve glycemic control, along with diet and exercise, in adults with type 2 diabetes.

Type II diabetes

Type II diabetes is a metabolic disorder that is characterized by high blood sugar (hyperglycemia) due to insulin resistance or a relative lack of insulin. The disease affects millions of patients world-wide and can lead to long-term complications if blood glucose levels are not lowered: heart disease, strokes and kidney failure.

Albiglutide

The drug is a dipeptidyl peptidase-4-resistant glucagon-like peptide-1 dimer fused to human albumin.
Schematic representation of the albiglutide (EMA)

Mode of action

Traditionally, a decrease in the blood glucose level of affected patients is triggered using insulin injections. One alternative mechanism consists of indirectly stimulating insulin release using glucagon-like peptide-1 (GLP-1) or an analogue that activates the corresponding receptor.
GLP-1 receptor agonists are of particular interest, as they naturally stop stimulating insulin release when the plasma glucose concentration is in the fasting range, hence also preventing hypoglycemia in the patient.
The natural half-life of GLP-1 in human blood is less than 2 minutes, as the peptide is rapidly degraded by an enzyme called dipeptidyl peptidase-4. Albiglutide, on the other hand, is resistant to dipeptidyl peptidase-4 and has a half-life of four to seven days, considerably longer than that of the endogenous peptide and of the other GLP-1 analogue drugs (exenatide and liraglutide). This property makes it possible to reduce the number of injections in diabetic patients to biweekly or weekly instead of daily, hence considerably reducing treatment overheads.

Clinical trials

A series of eight clinical trials involving over 2,000 patients with type II diabetes demonstrated the safety and effectiveness of the drug. Patients showed improved HbA1c levels (hemoglobin A1c or glycosylated hemoglobin, a measure of blood sugar control). The most common side-effects observed were diarrhea, nausea, and injection site reactions.

Indication and warnings

Albiglutide can be used as a stand-alone therapy as well as in combination therapy (with metformin, glimepiride, pioglitazone, or insulin, for instance). The drug is not suited to treating type I diabetes and is not indicated for patients with increased ketones in their blood or urine. Albiglutide should be used only when diet and exercise therapies are not successful.
The drug has an FDA boxed warning, as cases of tumors of the thyroid gland have been observed in rodent studies with some other GLP-1 receptor agonists. The FDA further required post-marketing studies regarding dose, efficacy and safety in pediatric patients and for cardiovascular outcomes in patients with high baseline risk of cardiovascular disease.

Tradenames

The drug was invented by Human Genome Sciences and was developed in collaboration with GSK. Albiglutide is marketed as Eperzan in Europe and Tanzeum in the USA.

Recruitment: Two Biological Data Manager positions within the team


We have two positions in the group currently open, both working on bioassay data aspects of the ChEMBL database. One is funded by a Wellcome Trust Grant, and the other by the newly established EMBL-EBI/Sanger Institute/GSK Centre for Therapeutic Target Validation (CTTV). These are both great opportunities to work in a fun group committed to working on Open Data to help drug discovery. The work includes:
  • Curating and organising bioactivity data in the ChEMBL database in a structured way.
  • Mapping measured bioactivity data from the scientific literature to other biological entities (proteins, cell-lines etc) using a variety of biological resources and ontologies.
  • Storing the information in the ChEMBL database in a structured format.
  • Implementing curation pipelines for data from complex and phenotypic assays such as those from cell-based and whole organism studies.
  • Manually checking and annotating data by reference to the original data sources.
  • Developing semi-automated processes for identifying and correcting erroneous data.
Links to both positions are here - Post 1 and Post 2. Closing date is 6th July 2014, with interviews taking place shortly after for shortlisted candidates.

Monday, 9 June 2014

myChEMBL LaunchPad......Launched!

We are pleased to announce that the latest myChEMBL release (based on ChEMBL_18), is available to download. For users not familiar with myChEMBL, the aim of the project is to create an open platform, which combines public domain bioactivity data with open source web, database and cheminformatics technologies. More details about the project can be found in this paper and more details about a recent award it helped pick up can be found here.

Like the previous release, once you have installed the myChEMBL virtual machine, you will have access to an Ubuntu linux machine which comes preloaded with the ChEMBL data in an RDKit-enabled PostgreSQL database and the original myChEMBL web application. We have added a lot of new features and enhancements to the new myChEMBL release, which include:
  1. A local copy of the ChEMBL Web Services, which uses the local PostgreSQL database as a backend.
  2. A suite of interactive tutorials, created using IPython Notebooks. Topics covered include introductory material using the local RDKit and accessing the ChEMBL Web Services via its new python client. More complex topics include building predictive target models and multi-dimensional scaling (MDS) analysis of small molecule bioactivity data.
  3. The ability to run SQL queries against the ChEMBL database using the web-based SQL browser phpPgAdmin.
  4. Details on how to integrate the local myChEMBL PostgreSQL with KNIME.
  5. The file download size has reduced by more than half (down from 17GB to just over 8GB).
  6. The networking between the host machine and the myChEMBL VM is much easier to configure. 
To help users navigate around the platform, we have created the myChEMBL LaunchPad page, which provides quick links to each of the resources listed above as well as the original myChEMBL web application.



Most of the code for the system can be found on github; you should also expect a blog post on how to install a local myChEMBL instance using Vagrant in the next week or so.

We hope you find the myChEMBL system useful. Please get in touch, if you would like to report any problems or suggest any enhancements.


The myChEMBL Team


Wednesday, 4 June 2014

Novelty of a chemical structure


A brief post of a few thoughts about testing for chemical novelty, especially in the context of patent filing. It's a little bit odd, but interesting.

The concept of 'chemical novelty' is core to the filing of patent protection on pharmaceuticals, and as part of most patent filings, checks are done by the inventor to ensure that the invention is actually novel. People will search patent databases (historically these were largely commercial, but public resources such as PubChem & SureChEMBL also now contain significant amounts of patent derived data). They will then also need to search non-patent databases, since lots of chemicals are published without patents being filed. There is a good and accessible overview of the field and some databases here.

These datasources, and even more so the workflows, are heterogeneous and fragmented, and the broader the search, the more expensive it becomes. As a general rule, the resources that are built from the patent literature are well designed around the date of disclosure/publication of a chemical structure; public resources a lot less so, if at all. There are also at least two other occasions when this novelty checking is important: firstly during patent examination, where a patent examiner has a short amount of available time to perform novelty checks, and these checks are of course relative to the filing date of the invention; secondly by lawyers/other scientists who may try to wriggle around the constraints of the patent, trying to invalidate it by showing that the compound wasn't novel after all. Because there are often very large sums of money involved, people can become very determined and creative in looking for such 'prior art'! Publication can be anywhere in the world, so this adds to cost and complexity yet further.

Imagine now a free public, novelty checker, with 'strong' time-stamping of first 'publication' for all structures, and also great tautomer searching, correct treatment of parents/salts (in the field of patents, salt forms are often central to actual product properties and so are sometimes critically important). Go on, close your eyes and imagine just this, for a moment - feels good doesn't it?

There are a number of basic problems with implementing such a system, despite the huge cost savings and efficiencies in innovation it would no doubt bring. It would need to be done by a 'trusted third party' (so no one could pay to retrospectively add a compound with a retrospective publication time), and validated in some way (so the timestamps are 'provable' in some way - there are now cool informatics approaches to this). It would need to contain all previously 'published' compound structures, and have great internal provenance tracking (so where and when was the original source of this structure). It would also need to be big, probably of the order of a billion or so structures - this alone would make such a system out of reach for the vast majority of organisations. Of course, I am not considering Markush structures in this discussion - good luck with reliably enumerating these from patents, for the moment at least, but eventually you'll need to consider these as well! 
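One of the informatics approaches to 'provable' timestamps is to publish a cryptographic hash of the structure rather than the structure itself. The Python sketch below is a toy illustration under our own assumptions: the choice of a standard InChI string as the structure identifier, SHA-256 as the hash and the function names are all hypothetical, and a timestamp generated locally like this proves nothing until a trusted third party (or a public ledger) attests to when the hash was first seen.

```python
import hashlib
from datetime import datetime, timezone


def commitment(structure_identifier):
    """Return the SHA-256 hex digest of a structure identifier string."""
    return hashlib.sha256(structure_identifier.encode("utf-8")).hexdigest()


def make_record(structure_identifier):
    """Pair the digest with the current UTC time (a local, unattested timestamp)."""
    return {
        "sha256": commitment(structure_identifier),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    ethanol = "InChI=1S/C2H6O/c1-2-3/h3H,1-2H3"
    print(make_record(ethanol))
```

Anyone who later holds the structure can recompute the digest and compare it with the published one. Note that a bare hash only conceals the structure from someone who cannot enumerate plausible candidates, so a real system would need more than this.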

Lurking here also are the GDB databases, the elephant in the room, which in a few years could make the discussion of chemical novelty moot.

Remember that nice fuzzy feeling you had a few paragraphs back? It's gone, hasn't it? Welcome to the real, painful world! UniChem has some elements of such a novelty checking system - at least it's possible to establish a snapshot of its local chemical world at some arbitrary time point in the past, according to its own reference frame. Since you are only interested in identical structures, it can already do the required searches - no need for Tanimoto, substructures etc. for this particular use case. Maybe it needs some work on scalability, maybe it needs some work on Proof of Knowledge, etc.; maybe. But it's an interesting place to start thinking. At some point in the future, there will be the ability for you to run your own local instance of UniChem, with regular feeds of SureChEMBL structures, merged PubChem structures, etc. It's interesting to pose the question: just how much of the exemplified chemical universe could be catered for with reasonable investment in this problem?
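The exact-match use case is simple enough to sketch. The toy Python registry below is not UniChem and does not use its API; the class, its method names, the use of a standard InChIKey as the lookup key and the dates in the example are all assumptions made for illustration. It keeps the earliest known publication date for each structure and reports prior art relative to a filing date.

```python
from datetime import date


class NoveltyRegistry:
    """Toy store of the earliest known publication date per structure key."""

    def __init__(self):
        self._first_seen = {}  # structure key -> (publication date, source)

    def register(self, structure_key, published, source):
        """Record a disclosure, keeping only the earliest one per structure."""
        current = self._first_seen.get(structure_key)
        if current is None or published < current[0]:
            self._first_seen[structure_key] = (published, source)

    def prior_art(self, structure_key, filing_date):
        """Return (date, source) if the structure was published before filing_date."""
        entry = self._first_seen.get(structure_key)
        if entry is not None and entry[0] < filing_date:
            return entry
        return None


if __name__ == "__main__":
    registry = NoveltyRegistry()
    aspirin = "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"
    # Hypothetical dates and source names, purely for demonstration.
    registry.register(aspirin, date(2010, 5, 1), "source B")
    registry.register(aspirin, date(2005, 3, 1), "source A")
    print(registry.prior_art(aspirin, date(2014, 6, 4)))
```

The hard parts discussed above (trusted timestamps, provenance, salts and tautomers, and a billion or so structures) are exactly what this sketch leaves out.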

For me, there are a lot of really interesting and deep technical challenges here, but also the potential to radically change the cost structure of chemical (and specifically drug) invention, freeing more investment for the discovery process itself (yeah, I am an idealist).

Update: Here's a great article on how this sort of thing can be done right now at www.proofofexistence.com

jpo

The picture above is of a novel, written by one of my oldest friends (literally!). He writes under a pseudonym but in the interests of attribution, his ORCID is 0000-0001-5528-0087. It would make an excellent holiday read.