Who’s got the bottle?

I have had lots of helpful comments from people on my question about what to use as a good identifier of chemicals. I thought it might be useful to re-phrase what it was that I wanted, because I think some of the comments, while important discussion points, don’t really impinge directly on my current issue.

I have in mind a special type of page on our LaBLog system that will make it easy to generate a post describing a new bottle of material that comes into the lab. From a user perspective you want to enter the minimum amount of necessary information: probably a name, a company, and perhaps a lot number and/or catalogue number to enable reordering. From the system perspective you want to grab as many different ways of describing the material as possible, including, where appropriate, SMILES, InChI, CML, or whatever. My question was: how do I provide a simple key that will enable the system to go off and find (if possible) these other identifiers? This isn’t really a database per se but a collection of descriptors on a page (although we would like to pull the data out and into a proper database at a later stage). CAS numbers are great because they are written on most bottles and are a well-curated system. However, I thought that the only way of converting from CAS to anything else was to go through a CAS service. Therefore I thought PubChem CIDs (or SIDs) might be a good way to do this.

So from my perspective a lot of the technical issues with substances versus chemicals versus structures aren’t so important. All I want is to pull down, on a best-efforts basis, as many other descriptors as possible to expose in the post. For some things (e.g. yeast extract) the issues of substances versus compounds (not to mention supplier) get right out of hand (I am slightly bemused that it has a CAS number, and there are multiple SIDs in PubChem). Certainly it ain’t going to have an InChI. But if you try and get nothing, it doesn’t really matter. Also, we are dealing here with common materials. If, as Dan Zaharevitz points out, we were dealing with compounds from synthetic chemists we would get into serious trouble, but in that case I think we could rely on our collaborating chemists to get InChIs/SMILES/CML correct and use those directly. In the ideal Open Notebook Science world we would simply point to their lab books anyway.

So the fundamental issue for me is this: is there something written on the bottle of material that we can use as a convenient search key to pull down as many other descriptors as we can?
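Whatever key ends up being typed in from the bottle, it is worth catching typos before going off to search with it. As a minimal sketch, assuming the key is a CAS number, the standard CAS check-digit rule can be applied locally (the function name `is_valid_cas` is just an illustrative choice, not part of the LaBLog system):

```python
import re

# CAS Registry Numbers look like 7732-18-5: two to seven digits,
# two digits, then a single check digit.
CAS_PATTERN = re.compile(r"^(\d{2,7})-(\d{2})-(\d)$")


def is_valid_cas(text):
    """Return True if text is a well-formed CAS number with a correct check digit."""
    match = CAS_PATTERN.match(text.strip())
    if match is None:
        return False
    body = match.group(1) + match.group(2)
    check = int(match.group(3))
    # Weight the digits 1, 2, 3, ... starting from the rightmost digit
    # of the body; the weighted sum modulo 10 must equal the check digit.
    total = sum(weight * int(digit)
                for weight, digit in enumerate(reversed(body), start=1))
    return total % 10 == check


if __name__ == "__main__":
    for candidate in ("7732-18-5", "7732-18-4", "water"):
        print(candidate, is_valid_cas(candidate))
```

This only says the number is internally consistent; it says nothing about whether the number actually belongs to what is in the bottle.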

Now I am with Antony Williams on this: if CAS got its act together and made their numbers an open standard then that would be the best solution. It is curated and all-pervasive as an identifier. Both Antony and Rich Apodaca have pointed out that I was wrong to say that CAS numbers aren’t in PubChem (and Rich pointed to two useful posts [1], [2] on how to get into PubChem using CAS numbers). So actually, my problem is probably solved by an application of Rich’s instructions on hacking PubChem (even if it turns out we have to download the entire database). The issue here is whether the CAS numbers will stay in PubChem or whether they may in the end get pulled.
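To make the idea concrete, here is a rough sketch of a best-efforts lookup from a CAS number to other descriptors. It is not Rich’s method from the linked posts; it assumes PubChem’s PUG REST interface, where a CAS number can be searched as a compound name, and the helper names (`property_url`, `parse_properties`, `lookup`) are made up for illustration. If nothing comes back, it simply returns an empty list.

```python
import json
import urllib.error
import urllib.parse
import urllib.request

# Assumption: PubChem's PUG REST service; CAS numbers are matched as names/synonyms.
PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
PROPERTIES = ("MolecularFormula", "InChI", "InChIKey", "IUPACName")


def property_url(cas_number):
    """Build the PUG REST URL asking for descriptors of the named compound."""
    name = urllib.parse.quote(cas_number, safe="")
    return f"{PUG_REST}/compound/name/{name}/property/{','.join(PROPERTIES)}/JSON"


def parse_properties(payload):
    """Pull the list of property records out of a JSON reply; [] if unusable."""
    try:
        return list(json.loads(payload)["PropertyTable"]["Properties"])
    except (ValueError, KeyError, TypeError):
        return []


def lookup(cas_number, timeout=10):
    """Best-efforts lookup: any network or HTTP failure just yields []."""
    try:
        with urllib.request.urlopen(property_url(cas_number), timeout=timeout) as reply:
            return parse_properties(reply.read().decode("utf-8"))
    except (urllib.error.URLError, TimeoutError):
        return []


if __name__ == "__main__":
    # Needs network access; prints whatever PubChem returns for water's CAS number.
    for record in lookup("7732-18-5"):
        print(record)
```

A name search can return more than one compound record, and something like yeast extract may return none, which fits the "if you try and get nothing it doesn’t really matter" approach.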

I do think that, for my purposes, PubChem CIDs and SIDs will do the job in this specific case. However, as has been pointed out, there are issues with reliability and curation. So I will accept that it is probably too early to start suggesting that suppliers label their bottles with PubChem IDs. This may happen anyway in the longer term (Aldrich seem to have them in the catalogue at least; I haven’t been able to check a bottle yet), and I guess we have to wait and see what happens.

Peter Murray-Rust has also updated with a series of posts [1], [2], [3] around the issues of chemical substance identity, CAS, Wikipedia et al. Peter Suber has aggregated many of the related posts together. And Glyn Moody has called us to the barricades.

8 Replies to “Who’s got the bottle?”

  1. I disagree on a few things. First, CAS numbers function best as an index into the work done by CAS to extract information from the chemical literature. The work is very valuable and many people will find it quite worth the price they have to pay to access it, but I think there are too many practical and political problems in trying to make them serve more global (and open) purposes.

    Second, I really don’t understand the supposed problem with curation if substance IDs are used. You would be saying I obtained the compound from X and X has claimed the structure is Y and that fact is recorded in a publicly accessible database. Of course X could be wrong about the structure or use names or other identifiers that are inconsistent with other usages, but at least you could see that if it was in PubChem. Also, X could upload at least summaries of analytical data on lots of the compounds and a host of other data that is only relevant to the actual samples they distribute, not to the compound globally. I think it is vitally important that we get suppliers of compounds involved because they are the ones in the best position to provide the details of what exactly is in the bottles they ship. (Speaking of which, we are starting to get a reasonable amount of analytical data, so I should upload that as proof of principle.) I also think it is very important for researchers to recognize that what’s in the bottle doesn’t always line up with what is supposed to be in the bottle, and I think it is hard to understand that if you think of what’s in the bottle only by its chemical representation. For a brutal example of how wrong things can go, Google “MDMA retraction”.

  2. @DrZZ ‘I also think it is very important for researchers to recognize that what’s in the bottle doesn’t always line up with what is supposed to be in the bottle’

    Couldn’t agree more; that is why we are planning to log every bottle of stuff into our system individually. Each bottle gets its own URI. Part of our aim is actually to be able to make a precise (and machine-readable) assertion when something in that bottle is not what someone else asserts it to be.

    I take your point that using a system for something it’s not built for can cause problems. I think the concern that people had with PubChem is that there are misattributions and ambiguities but, as you say, that matters less as long as you can point at something that makes the assertion. The key point is being able to do that pointing (and that the assertion can be pinned down to an authority of some description; the triple has to be a ‘quad’). For my purposes the substance IDs are perfectly adequate, I think, especially as we use many things, like yeast extract, that aren’t identifiable chemicals.

    If we go down the SID route we will need buy-in from both suppliers and PubChem.
