ChEMBL Resources

Saturday, 2 February 2013

Chemical data provenance and a big pile of mess that should be celebrated on National Mess Day

One of the growing problems for those working in the chemical database field is distinguishing between primary and secondary sources, especially if salt stripping, normalization, errors, and assignment of new IDs happen along the way. Here is a tale....

Alice is public-spirited and puts her chemical database of 1000 compounds in the public domain. She assigns a clear permissive license to her data, and encourages reuse by providing a suitable download format. Her hope is that people, somewhere, will see that she has these particular compounds, and that they will contact her and maybe collaborate with her.

Bob has a similarly public spirit, and has 500 compounds to share. He has the same hope that people will search his database and collaborate with him.

Carlos is interested in compounds containing a pyridine group, so he wants to search both Alice's and Bob's databases. He could either search the two databases separately (assuming that Alice and Bob have the facilities in their labs to run a database system), or he could use a system that Trent has built. Trent doesn't have any actual compounds, but he is a good programmer, and can integrate the data from Alice and Bob, so that Carlos doesn't need to do two searches and then merge the results. Trent's system is liked by quite a few people, and so others start to contact Trent with their data. Trent contributes a lot to the operation of the community: he is an independent broker of data, and simplifies the field by providing federated searches.

Alice and Bob are both busy people, and now some of the compounds they had are no longer available: Carlos found a few compounds, and so did a few of Carlos' friends, so these were sent in the post and disappeared from Alice's and Bob's collections. However, Alice and Bob also made some new compounds and wanted to share these too, since maybe they would help Dan discover new drugs for trypanosomiasis. Alice and Bob update their local databases to reflect the deleted and new chemicals. Since they like to know a little about the history of the chemicals that they had, and need to write reports about how many new compounds they have made over the past year, they provide versions of their databases, so they can track when new things appeared. Trent should of course keep an eye on Alice's and Bob's sites and update his system appropriately.

Trent largely works on his own though, and doesn't have a lot of time. In fact his employer wasn't really keen on this work, but tolerated Trent's activities, and managed to justify it under another grant. Of course when Trent first planned his system he wrote it without thinking about future updates. It was just a fun hack: build it once, merge this cool chemical data and show the value. All that code to administer the federated database is just a drag, and the users never see all that effort.

Trent gets a new job in a different city, and he and his partner adopt a child. This means Trent has even less time to keep things in his database up to date.

Dan is getting increasingly frustrated though, since Trent's system is now out of date. Although he relied on Trent having all of Alice's and Bob's data in one place, when he looks at it he sees it's old data, say five years out of date, and Alice has got a whole bunch of new compounds that no-one seems interested in, because they use Trent's system.

Erin sees this problem and tries to sort it out; she then takes Trent's data, and decides to add Frank and George's data, which is new to her system. But relying on Trent's versions of Alice's and Bob's data (and Trent had assigned new identifiers as part of his merging of the compounds from Alice and Bob) has meant that she is serving old data, and no-one can possibly work out where on earth anything has actually come from.

Trent gets a new job again, and his kid goes to kindergarten, so he gets stuck back into this project. He decides to build on top of Erin's work and adds new data from Henry and Irma.

James doesn't know too much about chemoinformatics, he's a lab biologist just needing some ideas for compounds to test in an assay he's developed that models DNA methylation in Aplysia, and thinks that Trent's pages contain everything.

You'll agree that this is a fictional story - but how far divorced from reality is it? We ourselves face big problems when we try to take secondary sources (those that consolidate data from other 'primary' sources), and work out what is new data and what is processed old data. There is a clear subset of ChEMBL that is primary in the sense that we are the first point of entry of the data onto the internet, and we actively curate this "primary ChEMBL" set - but we certainly can't curate other people's data, and have this survive refreshes from the primary sources. We also make statements like "yeah, we have PubChem structures in ChEMBL", but of course we took a subset, applied some filters, and there is some delay between us taking a snapshot and processing it into ChEMBL. So if you want the latest, most definitive data from PubChem, go to PubChem.

Over this year, we will address the provenance of data that we load into UniChem and ChEMBL, making it easier for our users to know exactly what versions of what they are using. If others are interested in working with us on some community mechanism for versioning and persistence of parent source identifiers, let us know. Continuing the explosion of alternate secondary, tertiary, and so forth identifiers for compounds is not a good way forward, for anyone.
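To make the idea of persisting parent source identifiers concrete, here is a minimal sketch of what a provenance record could look like. It is not how UniChem or ChEMBL store anything; the class name `SourceRecord`, its fields, the helper functions, and all the identifiers and version labels are hypothetical, invented for the Alice, Trent, and Erin story above.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class SourceRecord:
    """One compound entry as served by one source (hypothetical layout)."""
    source: str      # who serves this record, e.g. "Alice" or "Trent"
    source_id: str   # identifier assigned by that source
    version: str     # release or snapshot label of that source
    derived_from: Optional["SourceRecord"] = None  # parent record, if any


def primary(record: SourceRecord) -> SourceRecord:
    """Walk back through the parents to the original (primary) record."""
    while record.derived_from is not None:
        record = record.derived_from
    return record


def lineage(record: Optional[SourceRecord]) -> List[str]:
    """List every hop from this record back to the primary source."""
    chain = []
    while record is not None:
        chain.append(f"{record.source}:{record.source_id}@{record.version}")
        record = record.derived_from
    return chain


# Made-up identifiers: Erin took the compound from Trent, who took it from Alice.
alice = SourceRecord("Alice", "A-0042", "2008-01")
trent = SourceRecord("Trent", "T-000917", "v1", derived_from=alice)
erin = SourceRecord("Erin", "E-5531", "v3", derived_from=trent)

print(primary(erin).source)  # Alice
print(lineage(erin))
```

If every aggregator kept the parent identifier and version like this, rather than only minting a new identifier, James could see at a glance that Erin's entry is Alice's 2008 record at two removes.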

"Farm Fresh ChEMBL™" is always available from us, and there are a number of other places you can get fresh ChEMBL (a number of groups are already working hard on ChEMBL 15 integration into their systems). Some of the ChEMBLs out there aren't so fresh, so like every good consumer, check the "produced on" date!

The names have some deeper semantic meaning here (look at the Alice and Bob Wikipedia page - of course there are inevitably a few Chucks around as well, trying to lock data behind paywalls or search boxes).



Christopher Southan said...

Useful analogies to illustrate the big problem, but my specific comment relates to a detail. Could you delineate in some way the PubChem BioAssay data you import and filter? Submitting them "back" with a ChEMBL SID might be an option, and/or even a new source tag (ChEMBLPubChem?)

John Overington said...

This is easy for ChEMBL - you can select them as one of the Selected Bioactivity Sources (either include them or exclude them), so they're pretty easy to isolate. This link is just under the UK flag - sorry for the bad description of how to find it!

The only data taken into PubChem Bioassay (as far as we know) is the primary 'unique' ChEMBL content (again another ChEMBL source ID).

Christopher Southan said...

John do these numbers make sense?
"ChEMBL"[SourceName] = 760,340 CIDs, = 719,420 SIDs (why is CID > “your” SIDs?)

Latest SID date was 2012-07-25, thus assume ChEMBL 14

So assume 15 should be in PubChem next week-ish
ChEMBL 14 was 1,213,242 distinct cpds; therefore you pulled over ~ 453,000 cpds from BioAssay

In BioAssay (now) + dose response + confirmed = 2556 assays = 451,421 CIDs (probably a few more since Dec) = 193028 active (pivot query nearly brought Entrez to its knees but it coped!)

The union was 1,197,521 i.e. about right for ChEMBL 14 (JFTR the intersect was 14,238)

So while I can select and/or see these on the ChEMBL side ~ 40% of ChEMBL is unselectable (other than the clunky way above) on the PubChem side.

This is why it would be useful to have a way to do this (e.g. push new SIDs “back”). BTW what do you filter out from the pulled in BioAssay data?
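The figures quoted in this comment can be cross-checked with simple set arithmetic. The sketch below only reuses the counts given above; the variable names are made up for the illustration. Inclusion-exclusion on the two CID sets gives a union within 2 of the quoted 1,197,521, and the share of ChEMBL 14 not covered by the "ChEMBL" source query comes out a little under the "~ 40%" mentioned.

```python
# Counts copied from the comment above; variable names are illustrative only.
chembl_source_cids = 760_340   # "ChEMBL"[SourceName] CIDs in PubChem
bioassay_cids = 451_421        # dose response + confirmed BioAssay CIDs
intersect = 14_238             # CIDs found in both sets
chembl14_total = 1_213_242     # distinct compounds in ChEMBL 14

# Inclusion-exclusion: |A or B| = |A| + |B| - |A and B|
union = chembl_source_cids + bioassay_cids - intersect

# Compounds in ChEMBL 14 not accounted for by the "ChEMBL" source CIDs
pulled_from_bioassay = chembl14_total - chembl_source_cids
share_percent = round(100 * pulled_from_bioassay / chembl14_total, 1)

print(union)                 # 1197523
print(pulled_from_bioassay)  # 452902
print(share_percent)         # 37.3
```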

John Overington said...

It will take me a little time to work out what you have done!

But why would you want to just look at the ChEMBL data in PubChem - are there some cool things we are missing; what about trying the ChEMBL interface?

Best wishes,