Lots of helpful comments from people on my question about what to use as a good identifier of chemicals? I thought it might be useful to re-phrase what it was that I wanted because I think some of the comments, while important discussion points don’t really impinge directly on my current issue.
I have in mind a special type of page on our LaBLog system that will easily allow the generation of a post that describes a new bottle of material that comes into the lab. From a user perspective you want to enter the minimum amount of necessary information, probably a name, a company, perhaps a lot number and/or catalogue number to enable reordering. From the system perspective you want to try and grab as many different ways of describing the material as possible, including where appropriate SMILES, InChi, CML, or whatever. My question was, how do I provide a simple key that will enable the system to go off and find (if possible) these other identifiers. This isn’t really a database per se but a collection of descriptors on a page (although we would like to pull the data out and into a proper database at a later stage). CAS numbers are great because they are written on most bottles and are a well curated system. However I thought that the only way of converting from CAS to anything else was to go through a CAS service. Therefore I thought PubChem CID’s (or SID’s) might be a good way to do this.
So from my perspective a lot of the technical issues with substances versus chemicals versus structures aren’t so important. All I want is to, on a best efforts basis, pull down as many other descriptors as possible to expose in the post. For some things (e.g. yeast extract) the issues of substances versus compounds (not to mention supplier) get right out of hand (I am slightly bemused that it has a CAS number, and there are multiple SID’s in PubChem). Certainly it ain’t going to have an InChi. But if you try and get nothing it doesn’t really matter. Also we are dealing here with common materials. If as Dan Zaharevitz points out, we were dealing with compounds from synthetic chemists we would get into serious trouble, but in this case I think we could rely on our collaborating chemists to get InChi’s/SMILES/CML correct and use those directly. In the ideal Open Notebook Science world we would simply point to their lab books anyway.
So the fundamental issue for me; is there something written on the bottle of material that we can use as a convenient search key to pull down as many other descriptors as we can?
Now I am with Antony Williams on this, if CAS got its act together and made their numbers an open standard then that would be the best solution. It is curated and all pervasive as an identifier. Both Antony and Rich Apodaca have pointed out that I was wrong to say that CAS numbers aren’t in PubChem (and Rich pointed to two useful posts [1], [2] on how to get into PubChem using CAS numbers). So actually, my problem is probably solved by an application of Rich’s instructions on hacking PubChem (even if it turns out we have to download the entire database). The issue here is whether they will stay there or whether they may in the end get pulled.
I do think that for my purposes that PubChem CID’s and SID’s will do the job in this specific case. However as has been pointed out there are issues with reliability and curation. So I will accept that it is probably too early to start suggesting that suppliers label their bottles with PubChem IDs. This may happen anyway (Aldrich seem to have them in the catalogue at least; haven’t been able to check a bottle yet) in the longer term and I guess we have to wait and see what happens.
Peter Murray-Rust has also updated with a series of posts [1], [2], [3] around the issues of chemical substance identity, CAS, Wikipedia et al. Peter Suber has aggregated many of the related posts together. And Glyn Moody has called us to the barricades.