March 8, 2008 / December 30, 2009 by Cameron Neylon

What to use as the primary key for chemicals?

We are in the process of rolling out the LaBLog system to the new bioscience laboratory within ISIS at the Rutherford Appleton Laboratory. Because this is a new lab we have a real opportunity to embed the system in the way we run the laboratory and the way we practise our science. One of the things we definitely want to do is to use it to maintain a catalogue of all our stocks of chemicals. This is important to us because we are a user laboratory and expect people to come in on a regular basis to do their experiments. This in turn means we need to keep track of everything they bring in to the lab and any safety implications. Thus we want to use our system to log in every bottle of material that comes into the lab.

Now, following on from my post about feeds, it is clear that we also want to provide a good range of searchable indexes so that people can tell what we are using. So we would ideally want to expose InChI, InChIKey, SMILES, perhaps CML, PubChem IDs, etc. These can all be converted from one to another using web services, so we don’t need to type all of them in manually. All that is required is a nice logging screen where we can drop in one type of index key, the size of the bottle, supplier, lot numbers, and perhaps a link to safety data. The real question is: which index key is easiest to input? For those of you in or near a laboratory I suggest an exercise. Go and pick up the nearest bottle of commodity stuff from a commercial supplier (i.e. not oligos or peptides). What is written on it? What is a nice short identifier that can consistently be found on pretty much any bottle of chemicals? For those unlucky people who don’t have a laboratory at their fingertips I have provided a clue below.
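As a rough illustration of what that logging screen would need to capture, here is a minimal sketch of a single log entry. The class name, field names, and example values are all hypothetical and are not part of the LaBLog system; the only things taken from the text above are the items to record (one index key, bottle size, supplier, lot number, optional safety link).

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class BottleEntry:
    """One bottle logged into the lab. All field names are hypothetical."""
    identifier_scheme: str                  # e.g. "CAS", "PubChem CID", "InChIKey"
    identifier_value: str                   # the single key typed in at the screen
    quantity: str                           # bottle size as printed, e.g. "100 g"
    supplier: str
    lot_number: str
    safety_data_url: Optional[str] = None   # optional link to safety data
    received: date = field(default_factory=date.today)

    def search_term(self) -> str:
        """Return a 'scheme:value' string that a search index could use."""
        return f"{self.identifier_scheme}:{self.identifier_value.strip()}"


# Example values are made up for illustration.
entry = BottleEntry("PubChem CID", " 2244 ", "100 g", "Example Supplier", "LOT-0001")
print(entry.search_term())
```

The point of the sketch is only that one typed-in identifier plus a handful of bottle-specific fields is enough; the other identifiers can be filled in later by lookup.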

The Chemical Abstracts Service (CAS) number is the one identifier that can be found reasonably reliably on most commercially supplied substances. Yet, as Peter Murray-Rust and Antony Williams described recently, you can’t look these up without paying for them. Indeed, by recording them for our own purposes (say, in a database of the compounds we have in the laboratory) we may be violating the terms of the license.

So what to do? Well, we can adopt another standard or standards. Jean-Claude Bradley argued in a comment on my recent post that InChIKey is the way to go, but for this specific purpose (logging materials in) it may be too much to type in many cases (SMILES, InChI and CML certainly would be). You can’t expect people to draw the structure each time a compound comes in, particularly if we get into arguments about which precise salt of cAMP we are using today. What is required is a simple, relatively short number. This is what makes the CAS number so appealing: it is short, easily typed in, and printed on most bottles.
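Part of what makes a short numeric identifier practical at a logging screen is that typing errors can be caught cheaply. The post does not discuss this, but CAS Registry Numbers carry a check digit: the final digit equals the weighted sum of the preceding digits (weights 1, 2, 3, ... counted from the right) modulo 10. A minimal sketch of that check follows; the function name is hypothetical, and it only tests the format and checksum, not whether the number is actually assigned to a substance.

```python
import re

# Two to seven digits, two digits, one check digit, separated by hyphens.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(text: str) -> bool:
    """Check the format and check digit of a CAS Registry Number string."""
    match = CAS_PATTERN.match(text.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)   # all digits except the check digit
    check = int(match.group(3))
    # Weight each digit by its position counted from the right, starting at 1.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check


print(is_valid_cas("7732-18-5"))   # water
print(is_valid_cas("7732-18-4"))   # same number with a wrong check digit
```

A check like this catches most single-digit typos at entry time without looking anything up, so it does not depend on access to any CAS service.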

So, along with Peter, I think the answer is to use PubChem CID numbers. PubChem doesn’t use CAS numbers, and CAS actively lobbied the US government to limit the scope of PubChem. PubChem CIDs are relatively short, and there is a range of web services from which other descriptions can be retrieved (see e.g. PubChem Power User Gateway). The only thing that is missing is CIDs printed on bottles. If we can get wide enough agreement on this, I think the answer is to start writing to the suppliers. I would have thought it is no great effort on their part to add CIDs (or, if there is something better, some other index) to the bottles, and it provides a lot of extra value for them. PubChem can provide links through to up-to-date safety data (without the potential legal issues that maintaining a database of MSDS forms with CAS numbers creates), it provides free access to a supplier index through which customers can find them, and it could also save them a small fortune in CAS license fees.
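To make the "type in a CID, retrieve the rest" idea concrete, here is a minimal sketch using PubChem's PUG REST interface. PUG REST is a later interface than the Power User Gateway mentioned above, so treat the URL layout and the JSON response shape (`PropertyTable` containing a `Properties` list) as assumptions to check against the current PubChem documentation. The function names are hypothetical.

```python
import json
from urllib.request import urlopen

# Base URL of PubChem's PUG REST service (assumed; verify against PubChem docs).
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cid, properties=("InChI", "InChIKey", "MolecularFormula")):
    """Build the request URL for selected properties of one compound."""
    return f"{PUG_REST}/compound/cid/{int(cid)}/property/{','.join(properties)}/JSON"


def parse_properties(payload):
    """Extract the first property row from a decoded JSON response."""
    return payload["PropertyTable"]["Properties"][0]


def fetch_properties(cid):
    """Fetch properties for a CID over the network (not called here)."""
    with urlopen(property_url(cid), timeout=30) as response:
        return parse_properties(json.load(response))


print(property_url(2244))
# To query the live service: fetch_properties(2244)
```

Building the URL and parsing the response are kept separate from the network call so that the logging screen could be tested offline and the lookup retried later if the service is unavailable.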

There is another side to this, which is that if there is a wholesale shift (or even the threat of a shift) away from CAS as the only provider of chemical indexing, then perhaps the ACS will wake up and realise that not only is this protectionism bad for chemistry, but it is bad for their business. The database of CAS numbers has no real value in its own right. It is only useful as a pointer to other information. If the ACS were to make the use and indexing of CAS numbers free, it would be driving traffic to its own value-added services. The ACS needs to move into the 21st (or perhaps the 20th) century in terms of both its attitudes and business models. We often criticise the former, but without shifts in the latter there is a real risk of critical damage to an organisation that still has the potential to make a big contribution to the chemical sciences. If the major chemical suppliers were to start printing PubChem CIDs on their bottles, it might start to persuade the powers that be within the ACS that things need to change.

So, to finish: do people agree that CID is a good standard index to aggregate around? If so, we should start writing to the major chemical manufacturers, perhaps through open letters in the general literature (obviously not JACS), to suggest that they include these on their packaging. I’m up for drafting something if people are prepared to sign up to it.

18 Replies to “What to use as the primary key for chemicals?”

Joerg Kurt Wegner says:

March 8, 2008 at 5:14 pm

I personally would prefer InChIKeys or ChemSpider IDs over PubChem identifiers. Take just the Ginkgolide example Antony discussed recently: http://www.chemspider.com/blog/how-big-is-the-challenge-of-curation-and-what-is-the-structure-of-ginkgolide-b.html
http://www.chemspider.com/blog/more-on-ginkgolide-b.html

In contrast to ChemSpider, PubChem has no curation options, so it is definitely the most suspect source, with no way to validate or correct molecular structures.
ChemSpiderMan says:

March 9, 2008 at 7:09 am

I have commented on your position on my blog. I have encouraged people to come here and read your post first. I believe more people should be seeing what you write about…I read with interest.

My comments are here: http://www.chemspider.com/blog/enforcing-copyright-of-cas-numbers.html

Great post, Cameron! I differ in my view of what we should do in the near future, but ultimately you might be right about the direction. There is work to be done technologically to move this forward, but don’t let anyone fool you…curation is a MUST or it will get worse.
Rich Apodaca says:

March 9, 2008 at 3:39 pm

> …you can’t look these up [CAS Numbers] without paying for them.

Not quite true. Not only can you look them up, you can download and repurpose large numbers of them through PubChem:

http://depth-first.com/articles/2007/05/21/simple-cas-number-lookup-with-pubchem

http://depth-first.com/articles/2006/09/29/hacking-pubchem-direct-access-with-ftp
Rich Apodaca says:

March 9, 2008 at 3:43 pm

PubChem CIDs may seem like the answer, but the real work in creating a numerical compound identifier is in keeping junk and duplicates out of the database backing it – something the PubChem team has been fairly quiet on:

http://depth-first.com/articles/2006/12/12/the-problem-with-ferrocene
ChemSpiderMan says:

March 9, 2008 at 6:12 pm

Rich is correct with his comments and in using ferrocene as an example. Most systems handle organics very well. As soon as you step into inorganics, organometallics, polymers, or Markush structures, things get difficult very quickly. Now, we could all rush off to resolve those issues right now, but I say get the organics under control first (apologies to other chemistries); there is much to learn about processes there which can translate over to the other chemistries. Also, regarding PubChem…think about their charter. It was to support the national screening initiative, NOT to become a repository for chemical vendors and act as a replacement for CAS. This flavor is creeping in as people want to push on CAS business practices. Is it appropriate to push PubChem into that battle when even the initial thrust of PubChem resulted in such negative press and turned into quite a debacle? Does the National Institutes of Health want to get into managing information around “other chemistry”? That’s a funding issue and a mandate issue to be discussed.
DrZZ says:

March 10, 2008 at 11:59 am

It is not completely clear from your post, but if you are literally asking about database primary keys, let me state from experience: NEVER USE EXTERNALLY MEANINGFUL THINGS AS INTERNAL DATABASE KEYS. It causes terrible problems for data integrity. It sounds more like you are asking how best to connect to structural data. I think it is overly optimistic to think that a single answer will work. In practical terms, you need to worry about how the structural information will come to you. If all that is given is a CAS number or a catalog number, that is what you have to work with, and you will have to design your workflow accordingly. You also have to recognize that not all compounds will have a PubChem CID or a CAS number. That may not be a problem if you are only using well-known and well-published compounds, but if you are going to be working with synthetic chemists, you will almost certainly have compounds that have not been registered. If you are going to use PubChem CID, you might want to think about the difference between the Compound ID and the Substance ID. There’s a difference between saying “the substance was obtained from Sigma-Aldrich catalog number XXXX” and “the substance has the same structure as Sigma-Aldrich catalog number XXXX”. Of course, if the source of the compound hasn’t submitted to PubChem, using SID is not an option. Along those lines, I would point out that ChemSpiderMan doesn’t have it quite right; although PubChem was never envisioned as a replacement for CAS, it was always envisioned as a potential repository for chemical vendors and anyone else that could submit chemical information. The notion that it was to be very narrowly focused on just the screening library was what CAS wanted it to be restricted to and what the NIH fought against. I’m not sure what he is referring to when he says “debacle”, but the development of PubChem has proceeded according to the original plan.
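The advice about keys can be sketched in a few lines. In this hypothetical registry (all class, method, and scheme names are illustrative, not taken from any real system), every material gets an internal surrogate ID that carries no meaning, and external identifiers such as a CID or a catalog number are stored as searchable attributes that may be missing, added later, or corrected.

```python
import itertools
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Material:
    material_id: int                      # internal surrogate key, no meaning
    name: str
    identifiers: dict = field(default_factory=dict)   # scheme -> value


class MaterialRegistry:
    """Hypothetical in-memory registry keyed by surrogate IDs."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._materials = {}
        self._index = {}                  # (scheme, value) -> set of material_ids

    def add(self, name: str, identifiers: Optional[dict] = None) -> int:
        material_id = next(self._next_id)
        self._materials[material_id] = Material(material_id, name)
        for scheme, value in (identifiers or {}).items():
            self.link(material_id, scheme, value)
        return material_id

    def link(self, material_id: int, scheme: str, value: str) -> None:
        """Attach or correct an external identifier without touching the key."""
        material = self._materials[material_id]
        old = material.identifiers.get(scheme)
        if old is not None:
            self._index[(scheme, old)].discard(material_id)
        material.identifiers[scheme] = value
        self._index.setdefault((scheme, value), set()).add(material_id)

    def find(self, scheme: str, value: str) -> list:
        return sorted(self._index.get((scheme, value), set()))


registry = MaterialRegistry()
known = registry.add("aspirin", {"PubChem CID": "2244"})
novel = registry.add("unregistered synthetic intermediate")  # no external ID yet
print(registry.find("PubChem CID", "2244") == [known])
```

An unregistered compound still gets a stable internal ID, and a wrong or changed external identifier can be fixed with `link` without breaking anything that refers to the material.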
Physchim62 says:

March 11, 2008 at 2:25 pm

I completely agree with DrZZ’s comments: you MUST not use CAS numbers as your primary database key! This will work, well, until the moment that it doesn’t work any more… which is the moment that you most need the system! As for the problems of curating CAS numbers, you should take a practical stance: do you use CAS databases to find the boiling point of acetone? Of course you don’t! Neither are CAS databases used to find the CAS numbers of most commercially available compounds. I can only hope that CAS’s stupidity and greed will not affect the normal progression of science.
Cameron Neylon says:

March 11, 2008 at 2:50 pm

I probably shouldn’t have used the word key. We are not intending to put together a database per se (we might later, but we certainly wouldn’t use CAS as the primary key). We really just want to assert that ‘Post ID X is about a material that is the same as CAS/InChI/PubChem SID/ChemSpider ID Y’, and the same for whatever identifiers we can find. It’s not a primary database key in any sense; it’s just the most practical search term that I’m after.
Duncan Hull says:

March 14, 2008 at 2:50 pm

What about KEGG?
