ChEMBL Resources

Sunday, 3 February 2013

The Ontogeny and Evolution of Compound Names


Here's a post prompted by a wipe board discussion with a visitor the other day - they asked if I had ever written this down, and so here it is.

The naming of compounds is one of the most enabling and frustrating aspects of chemoinformatics - and getting a handle on compound naming conventions is one of the big gotchas for people entering the field from bioinformatics, where things are frankly a lot simpler (yes, I know it's not easy there - but it's a lot easier than in chemoinformatics).

It's interesting to view the naming of compounds as a timeline, and for those of you with a software bent, in the context of variable names and their scope.

I describe here a prototypical system - electronic lab-books and so forth will of course change the details. When a chemist makes a compound they know the 2-D structure (or they will work it out once they purify the product) and it has a lab-book reference. This could be something like 2341/23/4 (which could encode that the particular compound is labelled as compound 4 on page 23 of a lab book numbered 2341). Once the compound is purified and registered in a compound database at an institute it will typically get a research code - something like EBI-7821-1. There is usually some convention to the naming: EBI means it was made at the European Bioinformatics Institute (this is a made up example, we don't make compounds!), 7821 means it is the 7,821st compound made, and the -1 means that it is batch 1 - since when we make it again, it is likely to have a different purity, different contaminants, and potentially differing biological activity. These are all private names; the world doesn't see any of these.
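For the software-minded, here is a minimal sketch of how a research code of this shape could be pulled apart. It assumes the made-up STEM-SERIAL-BATCH convention of the EBI-7821-1 example; real companies each have their own formats, and the names `ResearchCode` and `parse_research_code` are invented for illustration.

```python
import re
from typing import NamedTuple, Optional


class ResearchCode(NamedTuple):
    stem: str             # originating organisation, e.g. "EBI"
    serial: int           # running compound number, e.g. 7821
    batch: Optional[int]  # batch number, None when it is not given


# Assumed convention: letters, hyphen, digits, optional hyphen plus batch digits.
_PATTERN = re.compile(r"^([A-Z]+)-(\d+)(?:-(\d+))?$")


def parse_research_code(code: str) -> ResearchCode:
    """Split a code such as 'EBI-7821-1' into stem, serial and batch."""
    match = _PATTERN.match(code.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised research code: {code!r}")
    stem, serial, batch = match.groups()
    return ResearchCode(stem, int(serial), int(batch) if batch else None)
```

Calling `parse_research_code("EBI-7821-1")` gives a stem of EBI, a serial of 7821 and a batch of 1; a code without a batch suffix, such as EBI-9437, parses with the batch left as `None`.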

Then a chemist writes up the work, and publishes it in a journal. For brevity (and other reasons) compounds are typically published with a paper identifier like 12d - it is just a tag, with the scope of the name limited to that particular publication, just like a local variable in a software function. There will be many other completely unrelated compounds called 12d spread across the literature, so this type of name is not useful for integration purposes.

If you write a patent on the compound, it will typically be named with its IUPAC name, and/or with a patent identifier number - e.g. example 27, again a local variable within a patent. There will be no relationship between the patent local name and the publication identifier. These are public names.
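To make the local-variable analogy concrete, here is a small sketch in which a paper or patent label only resolves when paired with the document it comes from. The document and structure identifiers are invented placeholders, not real records.

```python
from typing import Dict, List, Tuple

# Key: (document identifier, local label). Value: a structure identifier.
# Every identifier here is a made-up placeholder.
LocalName = Tuple[str, str]

local_names: Dict[LocalName, str] = {
    ("paper-A", "12d"): "structure-1",
    ("paper-B", "12d"): "structure-2",          # same label, unrelated compound
    ("patent-X", "example 27"): "structure-1",  # same compound as paper-A 12d
}


def resolve(document: str, label: str) -> str:
    """Look up a local label; it has no meaning outside its document."""
    return local_names[(document, label)]


def documents_using(label: str) -> List[str]:
    """List every document that happens to use the same local label."""
    return sorted(doc for (doc, lab) in local_names if lab == label)
```

The label 12d on its own is ambiguous (`documents_using("12d")` returns both papers), while the pair of document and label is not, which is why these names only integrate through the structure.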

So why don't people more generally use IUPAC names? Well, they are often long, difficult to pronounce without sounding like a fool, difficult to remember, difficult to type, and there are often many valid alternate IUPAC names for a particular compound - so they are usually avoided completely when talking about a compound. CAS numbers are a further alternative; the issue there is that a private organisation holds the naming rights, they are not public, there are tight restrictions on how many you can store, and using them is very expensive.

So far we have gone from a lab book reference (private), to a research code (private), to a local identifier (public) in a paper/patent, and for all of these we have a 2-D structure.

If the compound is sufficiently interesting and more people use it, and it appears in other publications, there comes a point when the research code is published and therefore becomes public. Occasionally this happens on first publication, but it's pretty unusual.



More usual though is that a research code will appear in the literature with some interesting biological properties - something like EBI-9437, which, imagine, makes stem cells differentiate into competent beta cells. The batch is not usually reported (referees would rightly be sniffy about batch specific effects, and by the stage that people report these results the synthesis, purification and assay variability will be sorted out). It is probably the case that the structure of EBI-9437 is already published, usually in a patent - it's just that you don't know which particular compound out of many hundreds it actually is. So for some of the most interesting recent compounds we know their research code (and usually where they come from, from the syntax of the name) but not the structure. Many people outside the organization that made EBI-9437 would like to know the structure.

At some point, usually at a big meeting or in a high profile publication, the structure of EBI-9437 will be shown to the world. At this point, it can be mapped back to all the previous names (on the basis of 2D structure).
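The mapping step can be sketched as a small registry in which names are linked only through a shared structure key. This is an illustration under assumptions, not how any particular database does it: `NameRegistry` and the key "KEY-1" are invented, and in practice the key would be some canonical structure identifier (an InChIKey, for example).

```python
from collections import defaultdict
from typing import Dict, Optional, Set


class NameRegistry:
    """Link compound names to each other through a shared structure key."""

    def __init__(self) -> None:
        self._structure_of: Dict[str, Optional[str]] = {}
        self._names_of: Dict[str, Set[str]] = defaultdict(set)

    def add(self, name: str, structure: Optional[str] = None) -> None:
        """Record a name; the structure may be unknown (None) at first."""
        self._structure_of[name] = structure
        if structure is not None:
            self._names_of[structure].add(name)

    def disclose(self, name: str, structure: str) -> None:
        """Attach a structure to a name once it is shown to the world."""
        self.add(name, structure)

    def synonyms(self, name: str) -> Set[str]:
        """Other names sharing this name's structure; empty if undisclosed."""
        structure = self._structure_of.get(name)
        if structure is None:
            return set()
        return self._names_of[structure] - {name}
```

Before disclosure, `synonyms("EBI-9437")` is empty even if the patent example with the same structure is already registered; after `disclose("EBI-9437", "KEY-1")` the research code and the patent example find each other.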

If the compound is commercially interesting, it could be licensed to another company, who often want to make it clear to the world that they control the compound, or to imply that it came from their own labs, so they may rename it. So in this case imagine our EBI-9437 becomes WTF-433932, after the rights to the compound are acquired by WTF Discovery Inc.

Many compounds stop here in terms of their progression of names - they may be too toxic for wide use, there may just not be enough money to develop them further, or they may have just been tried in the wrong assay. The 'luckier' compounds will typically progress to full clinical development, still using their research codes, and scientific documents submitted to regulators (as part of an IND or review package) typically use Research Codes - many of these documents are in the public domain.

When a compound gets to this stage, compound vendors offer the compound for sale, often assigning their own catalog numbers. These are easily confused with research codes, and it is an unfortunate feature that they are starting to contaminate both public chemical databases and, to some extent, the literature. In this variable scope view they are really just local variables applicable over a particular compound supplier's catalog.

For those that look good in development, where there is the potential that the compound will be used in a product, a formal non-proprietary name is assigned - nowadays there are two major naming systems, USAN and INN (both of these are public). I've discussed these in the past, both the pluses and the minuses of their respective systems, so I won't discuss them here further - do a search on the blog for some more pointers if you are interested. In our case, for the fictional WTF-433932, imagine it's a tyrosine kinase inhibitor and gets assigned the name faybelitinib. At this point, further discussion and publications on the compound should use this public USAN name. This is a great help, as it is a single unambiguous name for a compound, but it typically comes quite late in the lifecycle of a compound - and to get more information on the properties of a compound with a USAN name, you need to know the 2D structure, Research Code, patent ids, paper ids, etc. It is interesting that with the development of their own independent pharmaceutical discovery and development capabilities, nations such as China are assigning their own non-proprietary names, following stem conventions. These are not currently easily accessible though.

Finally, a compound will get a Trade or Brand Name. This is often geographically local in scope (so a drug will have different names in different markets, e.g. Glivec and Gleevec, or different names in the same geographical market for different diseases, e.g. Revatio and Viagra). So, for scientific purposes these are not reliably useful names, and should, in general, be avoided.

So what we have is a hierarchy of names of increasing scope and consistency as a compound progresses from round-bottomed flask to patient. Because these names of ever increasing scope and formality are assigned conditionally and at different times, tracking back to previous names is an essential task. Suffice it to say that Research Codes are where the action is.
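As a rough summary of the scope view, the name types described in this post can be written down as a small table of records. The field names and the wording of each scope are my own shorthand, not a formal classification.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class NameKind:
    label: str    # type of name
    public: bool  # whether the outside world sees it
    scope: str    # where the name is meaningful


# Listed roughly in the order a compound acquires them.
LIFECYCLE: List[NameKind] = [
    NameKind("lab-book reference", False, "one lab book"),
    NameKind("research code with batch", False, "one organisation"),
    NameKind("paper identifier", True, "one publication"),
    NameKind("patent example number", True, "one patent"),
    NameKind("published research code", True, "the literature"),
    NameKind("vendor catalog number", True, "one supplier's catalog"),
    NameKind("non-proprietary name (USAN/INN)", True, "global"),
    NameKind("trade name", True, "one market or indication"),
]


def public_kinds() -> List[str]:
    """Labels of the name types that are visible outside the organisation."""
    return [kind.label for kind in LIFECYCLE if kind.public]
```

Read top to bottom, the list runs from the flask towards the patient, with only the first two name types staying private.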

jpo

4 comments:

Peter Norman said...

Useful insights but a couple of other points you didn't cover John.

Sometimes the Tradename is assigned/registered before the INN is sought. This is not uncommon with some smaller companies but less common with larger ones.

Some companies also switch from a compound code number to a development compound number once the decision has been made to progress to development. The latter numbers are assigned randomly rather than sequentially.

Both Merck (L-123,456 to MK-9999) and AstraZeneca (AZ-123,456 to AZD-1111) follow this practice.

Christopher Southan said...

We covered some of this in http://www.ncbi.nlm.nih.gov/pubmed/23159359 but it's pretty congruent.

JFTR Peter: AZ's or the old AH's are not typically comma'd but other companies do this (just to make it even more of a mess)

My impression is GSK do not change nos in dev. Can anyone list the major ones that do or don't?

Peter Norman said...

From looking at company presentations and reports I believe that the following do not usually change compound IDs during development.

Pfizer, Eli Lilly, GSK, Sanofi, J&J, Bayer, BMS, and Novartis.

Novartis' format is rather different: NVS or NVP ABC-123 rather than a company identifier and a 6 or 7 digit number. The numbers may or may not be separated by a comma, while Bayer uses the format BAY 12-3456.

I think all other major companies switch to a 3 or 4 digit number in combination with a corporate identifier.

Roche can confuse the issue by often having both Roche (RG-1234) and Genentech (GDC-9999) development compound numbers as well as Ro-123456 numbers for the same compound.

Many companies that have specific development compound codes rarely provide details of compound codes in publications but occasionally some IDs may be found in patent applications.

John Overington said...

Just to follow up a little, some companies like Pfizer tend to assimilate the code of the acquired company/compound. There is also value in tracking this historical acquisition-based switching of research code stems. There is a table in ChEMBL with this info in, but not on the interface I'm afraid.

We used one of the indices at the back of the Merck Catalog for a seed, but it needed a lot of work and curation.

Knowing the ranges for the research codes would also be useful - so for example what was the highest GW code ever published?