The trouble with semantics…

8 September 2008 8 Comments

…is knowing what you mean…

I posted last week about the spontaneous CMLReact hackfest held around Peter Murray-Rust’s dining room table the day after Science Blogging in London. There were a number of interesting things that came out of the exercise for me. The first was that it would be relatively easy to design a moderately strict, but pretty standard, description format for a synthetic chemistry lab notebook that could be automatically scraped into CMLReact.

Automatic conversions from lab book to machine readable XML

CMLReact files have (roughly) three sections. In the first, all the molecules that are relevant to the description are described or, in the ideal semantic web world, pointed to at an external authority such as ChemSpider, PubChem, or another source. In the second section the relationships between input materials, solvents, products, and samples are described. In general all of these will be molecules which are referred to in the first section, but this is not absolutely required (and this will be important later). The final section describes observables, procedures, yields, and other descriptions of what happened or what was measured.
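To make the three-part layout concrete, here is a minimal sketch in Python. It is not the CMLReact schema: the class and field names (Molecule, Participant, Observation, ReactionRecord, unresolved_participants) are hypothetical, and methanol is used purely as a placeholder molecule. The point it illustrates is that an entry in the second section usually, but not necessarily, points back at a molecule in the first.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Molecule:
    """Section 1: a molecule, described inline or by pointing at an authority."""
    ref_id: str
    inchi: Optional[str] = None
    authority_url: Optional[str] = None


@dataclass
class Participant:
    """Section 2: the part something plays in the reaction."""
    role: str  # e.g. "reactant", "solvent", "reagent", "product", "sample"
    label: str
    molecule_ref: Optional[str] = None  # may be absent, e.g. an uncharacterised sample


@dataclass
class Observation:
    """Section 3: something that was done, seen, or measured."""
    kind: str  # e.g. "procedure", "yield", "temperature"
    text: str


@dataclass
class ReactionRecord:
    molecules: Dict[str, Molecule] = field(default_factory=dict)
    participants: List[Participant] = field(default_factory=list)
    observations: List[Observation] = field(default_factory=list)

    def add_molecule(self, molecule: Molecule) -> None:
        self.molecules[molecule.ref_id] = molecule

    def unresolved_participants(self) -> List[Participant]:
        """Participants that do not point at a molecule in section 1."""
        return [p for p in self.participants
                if p.molecule_ref not in self.molecules]


# Placeholder data for illustration only.
record = ReactionRecord()
record.add_molecule(Molecule("m1", inchi="InChI=1S/CH4O/c1-2/h2H,1H3"))  # methanol
record.participants.append(Participant("solvent", "methanol", molecule_ref="m1"))
record.participants.append(Participant("sample", "crude precipitate"))
record.observations.append(Observation("procedure", "stirred overnight"))

for p in record.unresolved_participants():
    print(f"{p.role}: {p.label} has no molecule entry")
```

Allowing molecule_ref to be empty is a deliberate choice in this sketch: a sample taken from the reaction can be recorded before anyone knows what is in it.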

If we take a look at the UsefulChem experiment that we converted to CMLReact you can see that most of this information is available in one form or another. The molecules are described via InChI/InChIKey at the bottom of the page. These could be used as they are to populate the molecules section. A little additional markup to distinguish between reactants, solvents, reagents, and products would make it possible to start populating the second section describing the relationships between these molecules.

The third section is the trickiest, and this will always be an 80:20 game. The object is to abstract as much information as can reasonably be garnered without putting in the vast amount of work required to get close to 100% retrieval. At the end of the day, if someone wants the real detail they can go back to the lab book. Peter has demonstrated text scraping tools that do a pretty good job of extracting a lot of this information. In combination with a bit of markup it is reasonable to expect that some basic information (amounts of reagents, yield, temperature of reaction, some descriptive terms) could be extracted. Again, getting 80-90% of a subset of regularly used terms would be very powerful.
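As a rough illustration of the 80:20 idea, the sketch below pulls amounts, temperatures, and a percentage yield out of free text with a few regular expressions. This is not one of Peter's tools, just a toy stand-in; the patterns, the extract function, and the sample sentence are all invented for the example, and real notebook text would need many more cases.

```python
import re

# Deliberately small set of patterns: enough for common phrasings, not all of them.
AMOUNT = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|g|mL|mmol|mol)\b")
TEMPERATURE = re.compile(r"(-?\d+(?:\.\d+)?)\s*°\s*C\b")
YIELD = re.compile(r"yield[^.\d]*?(\d+(?:\.\d+)?)\s*%", re.IGNORECASE)


def extract(text):
    """Return the quantities the patterns can find; anything else stays in the lab book."""
    return {
        "amounts": [(float(value), unit) for value, unit in AMOUNT.findall(text)],
        "temperatures_c": [float(value) for value in TEMPERATURE.findall(text)],
        "yields_percent": [float(value) for value in YIELD.findall(text)],
    }


# Invented notebook entry, for illustration only.
entry = ("Benzaldehyde (2.0 mmol) was added to methanol (5 mL) and stirred "
         "at 25 °C overnight. The precipitate was filtered and dried. Yield: 82%.")

facts = extract(entry)
print(facts)
```

Anything the patterns miss is simply not extracted, which is the trade being made: a useful subset cheaply, with the full detail left in the primary record.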

But what are we describing?

There is a problem with grabbing this descriptive information from the lab notebook, however, and it is a very general problem that I believe we need to grapple with urgently. There is a fundamental question as to what it is that this file is describing. Does it describe the plan of the experiment? The record of carrying out a specific example of this experiment? An ‘averaged’ description of a set of equivalent experiments? A general description of the reaction? Or a description of a model of what we expect or think is happening?

If you look closely at the current version of the CMLReact file you will see that the yield is expressed as a percentage with a standard deviation. This is actually describing the average of three independent reactions, but that is not made explicit anywhere in the file. Is this important? Well, I think it is, because it has an effect on what any outward links back to the lab book mean. There is a significant difference between ‘this link points to an example of this kind of reaction’ (which might in fact be significantly different in the details), ‘this link points to this exact experiment’, and ‘this link points to an index of relevant experimental results’. Those distinctions need to be encoded in the links, or perhaps more likely made explicit in the abstracted file.
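One way to picture what ‘made explicit in the abstracted file’ could look like is sketched below in Python. Nothing here is part of CMLReact: RecordKind, LinkRelation, YieldSummary, and summarise_yields are hypothetical names, the three yields are made-up numbers, and the URL is a placeholder. The sketch simply refuses to report a yield without also saying what kind of thing is being described and what the link back to the lab book means.

```python
from dataclasses import dataclass
from enum import Enum
from statistics import mean, stdev
from typing import List, Optional


class RecordKind(Enum):
    PLAN = "plan of an experiment"
    SINGLE_RUN = "record of one specific run"
    AVERAGE = "average over equivalent runs"
    GENERAL = "general description of the reaction"
    MODEL = "model of what we think is happening"


class LinkRelation(Enum):
    EXACT_EXPERIMENT = "points to this exact experiment"
    EXAMPLE_OF_KIND = "points to an example of this kind of reaction"
    INDEX_OF_RESULTS = "points to an index of relevant results"


@dataclass
class YieldSummary:
    kind: RecordKind
    n_runs: int
    mean_percent: float
    stdev_percent: Optional[float]  # undefined for a single run
    link_url: str
    link_relation: LinkRelation


def summarise_yields(yields_percent: List[float], link_url: str) -> YieldSummary:
    """Summarise one or more runs, stating what the summary and its link mean."""
    if not yields_percent:
        raise ValueError("at least one yield is needed")
    n = len(yields_percent)
    if n == 1:
        return YieldSummary(RecordKind.SINGLE_RUN, 1, yields_percent[0], None,
                            link_url, LinkRelation.EXACT_EXPERIMENT)
    # Sample standard deviation across the equivalent runs.
    return YieldSummary(RecordKind.AVERAGE, n, mean(yields_percent),
                        stdev(yields_percent), link_url,
                        LinkRelation.INDEX_OF_RESULTS)


# Made-up yields and a placeholder URL, for illustration only.
summary = summarise_yields([70.0, 75.0, 80.0], "https://example.org/notebook/index")
print(f"{summary.mean_percent:.1f} ± {summary.stdev_percent:.1f} % "
      f"({summary.kind.value}, n={summary.n_runs}; link {summary.link_relation.value})")
```

With the kind and the number of runs carried alongside the numbers, a reader (or a program) following the link knows whether to expect one experiment or a list of them.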

The CMLReact file is an abstraction of the experimental record. It is therefore important to make it clear what the level of abstraction is and what has been abstracted out of that description. This relates to the distinction I have made before between the flexibility required to record an experiment and the ability to use a more structured vocabulary to describe the experiment after it has happened. My impression is that people who work on developing these controlled vocabularies are focussed on description rather than recording and don’t often make the distinction between the two. There is also often a lack of distinction between describing an experiment and describing a model of what happened in that experiment. This is important because the model may need to be modified in the future, whereas the description of the experiment should be accurate.

Summary

My view remains that when recording an experiment the system used should be as flexible as possible. Structure can be added to this primary record when convenient, to make the process of abstracting from it to a controlled vocabulary easier. The primary goal for me, for the moment, remains making a human readable record available. The process of converting the primary record into a controlled vocabulary such as CMLReact or FuGE, or into a workflow system such as Taverna, should be enabled via domain specific automated or semi-automated tools. These should help users to structure their description of the experiment in a way that makes it more directly useful to them but maintains the links with the primary record. Where the same controlled vocabulary is used for more abstracted descriptions of studies, experiments, or the models that purport to describe them, this distinction must be made clear.

Semantics depends absolutely on being clear about what you are describing. There is absolutely no point in having absolute clarity about the description of an object if the nature of that object is fuzzy. Get it right and we could have a very sophisticated description of the scientific record. Get it wrong and that description could be at best unclear and at worst downright misleading.


  • Jim Procter

    As ever, a well reasoned analysis, Cameron. The problem reminds me a little of the process of historical research. I’m not a historian, but I distantly remember that much emphasis is placed on rigorously describing the relationship between the ‘unstructured’ primary source and the more structured interpretation (evidence of stoneworking = toolbuilding, for instance). However, in the case of primary records of experiments, the scientist/observer can build in tags to the structure at the time or after the fact (aided by the labbook tool) which unambiguously telegraph the precise nature and origin of key pieces of evidence.

    The transformation from primary record to structured data instance, however, is always going to require some work – even if it is simply manual curation to confirm that the transformed result is an accurate reflection of the primary (ideally, this could be round-tripped too, and the result compared to the ‘relevant’ parts of the original).

  • Fun to read and finger on the pulse. With a focus on chemistry articles (my passion for the moment) it is clear that authors need to write them the way they need to in order to communicate the nature of their work – the purpose, the process, the execution, the conclusions, the future etc. Publishers already provide us with templates to write into, and that brings some structure without being too constraining (but it is a LOT of work sometimes… I recently gave up on submitting an article to one journal since I simply couldn’t figure out how to navigate their Word template, and I am NOT naive around computers…).

    With the inherent freedom required for writing an article, some structure needs to be layered on to facilitate searching and “semanticising”. This can be automated or semi-automated, but a way for authors to assist the process and tweak the results of automated analysis is necessary. This is the vision of the markup system we are building: http://www.chemspider.com/blog/chemistry-document-markup-and-free-access-structure-based-searching-of-publications.html

    The same system could be applied to blog posts, wiki pages, and ONS pages. Don’t remove the freedom and layer on too much structure up front, but allow semantic touchpoints to be recovered later, primarily by the author in my opinion, but then by others also as they see value. This is the value of the WikiProfessionals direction too.

  • Gordon Rae

    What is the purpose of translating laboratory notes in this way? Who are the readers? You have to be a scientist to be able to read another scientist’s work books, and no amount of translation will alter that.

    Lotfi Zadeh once pointed out that the more precise you make a statement, the less meaningful it is. This is evident in complexity theory but it’s a problem with induction and theory building in all empirical disciplines.

  • Gordon, the purpose of translating the notes is to generate a machine readable form of the laboratory record. So for instance Peter Murray-Rust has converted a synthetic chemistry thesis to enable computational rendering of the overall synthetic schemes, details of the reactions, products, inputs, conditions etc.

    This ultimately will allow people to ask questions like ‘Do Ugi reactions done with amine X work in THF?’

    There is also the benefit of precision. It is one thing to say that ‘you need to be a scientist to read another scientist’s work books’ but in practice most scientists can’t read any other scientist’s work books. Abstracting a little to a controlled vocabulary (and then presenting it in a nicely formatted manner) can actually make things more comprehensible to humans as well as computers.
