What to use as the primary key for chemicals?
We are in the process of rolling out the LaBLog system to the new bioscience laboratory within ISIS at the Rutherford Appleton Laboratory. Because this is a new lab we have a real opportunity to embed the system in the way we run the laboratory and the way we practise our science. One of the things we definitely want to do is to use it to maintain a catalogue of all our stocks of chemicals. This is important to us because we are a user laboratory and expect people to come in on a regular basis to do their experiments. This in turn means we need to keep track of everything they bring in to the lab and any safety implications. Thus we want to use our system to log in every bottle of material that comes into the lab.
Now, following on from my post about feeds, it is clear that we also want to provide a good range of searchable indexes so that people can tell what we are using. So we would ideally want to expose InChI, InChIKey, SMILES, perhaps CML, PubChem IDs, etc. These can all be converted from one to another using web services, so we don’t need to type all of them in manually. All that is required is a nice logging screen where we can drop in one type of index key, the size of the bottle, supplier, lot numbers, and perhaps a link to safety data. The real question is: which index key is easiest to input? For those of you in or near a laboratory I suggest an exercise. Go and pick up the nearest bottle of commodity stuff from a commercial supplier (i.e. not oligos or peptides). What is written on it? What is a nice short identifier that can consistently be found on pretty much any bottle of chemicals? For those unlucky people who don’t have a laboratory at their fingertips I have provided a clue below.
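To make the logging screen idea concrete, here is a minimal sketch of what one logged bottle might look like as a record. Everything in it is an assumption for illustration: the names `BottleRecord` and `log_bottle`, the choice of fields, and the in-memory list standing in for the catalogue are hypothetical and are not part of the LaBLog system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class BottleRecord:
    """One logged bottle. All field names are illustrative assumptions."""
    key_type: str                     # which identifier was typed in, e.g. "pubchem_cid"
    key_value: str                    # the identifier itself, kept as text
    amount: float                     # size of the bottle
    unit: str                         # unit for the amount, e.g. "g" or "mL"
    supplier: str
    lot_number: str
    safety_url: Optional[str] = None  # optional link to safety data


def log_bottle(catalogue: List[BottleRecord], record: BottleRecord) -> int:
    """Append a record to the catalogue and return its position."""
    catalogue.append(record)
    return len(catalogue) - 1


# Example: logging a 100 g bottle by a single typed-in key.
catalogue: List[BottleRecord] = []
position = log_bottle(
    catalogue,
    BottleRecord("pubchem_cid", "2244", 100.0, "g", "Example Supplier", "LOT-001"),
)
```

The point of the sketch is that only one identifier is typed in by hand; the other representations would be filled in later by a lookup rather than stored at entry time.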
The Chemical Abstracts Service number is the one identifier that can reasonably reliably be found on most commercially supplied substances. Yet, as described recently by Peter Murray-Rust and Antony Williams, you can’t look these up without paying for them. Indeed, by recording them for our own purposes (say, in a database of the compounds we have in the laboratory) we may be violating the terms of the license.
So what to do? Well, we can adopt another standard or standards. Jean-Claude Bradley argued in a comment on my recent post that InChIKey is the way to go, but for this specific purpose (logging materials in) this may be too much to type in many cases (certainly SMILES, InChI and CML would be). You can’t expect people to draw in the structure each time a compound comes in, particularly if we get into arguments about which precise salt of cAMP we are using today. What is required is a simple, relatively short number. This is what makes the CAS number so appealing: it is short, easily typed in, and printed on most bottles.
So, along with Peter, I think the answer is to use PubChem CID numbers. PubChem doesn’t use CAS numbers, and CAS actively lobbied the US government to limit the scope of PubChem. PubChem CIDs are relatively short, and there is a range of web services from which other descriptions can be retrieved (see e.g. PubChem Power User Gateway). The only thing that is missing is the addition of CIDs on bottles. If we can get wide enough agreement on this, I think the answer is to start writing to the suppliers. I would have thought it is no great effort on their part to add CIDs (or some other index, if there is something better) to the bottles, and it provides a lot of extra value for them. PubChem can provide links through to up-to-date safety data (without the potential legal issues that maintaining a database of MSDS forms with CAS numbers creates), it provides free access to a supplier index through which customers can find them, and it could also save them a small fortune in CAS license fees.
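As a rough illustration of how a typed-in CID could be turned into the other descriptions, the sketch below builds a lookup URL for PubChem's PUG REST interface. This is an assumption on my part: PUG REST is a URL-based interface that PubChem offers alongside the Power User Gateway mentioned above, and the function name `cid_property_url` and the default property list are hypothetical choices for this example. The code only constructs the URL; it does not make a network request.

```python
from urllib.parse import quote

# Base address of PubChem's PUG REST interface (assumed here as the lookup
# route; the post itself only mentions the Power User Gateway).
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def cid_property_url(cid, properties=("InChIKey", "InChI", "MolecularFormula")):
    """Build a URL that asks PubChem for selected properties of one CID."""
    # Reject anything that is not a positive integer (bool is excluded
    # explicitly because it is a subclass of int in Python).
    if isinstance(cid, bool) or not isinstance(cid, int) or cid <= 0:
        raise ValueError("a PubChem CID must be a positive integer")
    property_list = ",".join(quote(name) for name in properties)
    return f"{PUG_REST_BASE}/compound/cid/{cid}/property/{property_list}/JSON"


# Example: CID 2244 is aspirin.
url = cid_property_url(2244)
```

Fetching that URL (for instance with `urllib.request.urlopen`) should return JSON from which a logging screen could fill in the remaining identifier fields, so the person at the bench only ever types the short number.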
There is another side to this, which is that if there is a wholesale shift (or even the threat of a shift) away from CAS as the only provider of chemical indexing, then perhaps the ACS will wake up and realise that not only is this protectionism bad for chemistry, but it is bad for their business. The database of CAS numbers has no real value in its own right. It is only useful as a pointer to other information. If the ACS were to make the use and indexing of CAS numbers free, then it would be driving traffic to its own value-added services. The ACS needs to move into the 21st (or perhaps the 20th) century in terms of both its attitudes and business models. We often criticise the former, but without shifts in the latter there is a real risk of critical damage to an organisation that still has the potential to make a big contribution to the chemical sciences. If the major chemical suppliers were to start printing PubChem CIDs on their bottles, it might start to persuade the powers that be within the ACS that things need to change.
So, to finish: do people agree that CID is a good standard index to aggregate around? If so, we should start writing to the major chemical manufacturers, perhaps through open letters in the general literature (obviously not JACS), to suggest that they include these on their packaging. I’m up for drafting something if people are prepared to sign up to it.