The heavyweights roll in…distinguishing recording the experiment from reporting it
Frank Gibson of peanutbutter has left a long comment on my post about data models for lab notebooks which I wanted to respond to in detail. We have also had some email exchanges. This is essentially an incarnation of the heavyweight vs lightweight debate when it comes to tools and systems for description of experiments. I think this is a very important issue and that it is also subject to some misunderstandings about what we and others are trying to do. In particular I think we need to draw a distinction between recording what we are doing in the lab and reporting what we have done after the fact.
You can see the full comment here; I will reply to a couple of the specifics.
“…My argument is that we need to be able to use all of these vocabs, or none, to describe experiments to a practical level of detail….”
I agree, FuGE will allow you to do this.
We may need to agree to disagree on this. Looking at what we are doing and what I can understand of the FuGE model, I am not sure we can capture the experiments that we do in practice (but see below as to why this may be more because we are at cross purposes than anything else).
With all these things, especially modelling, there are lots of ways to reach the end goal. With respect to vocabularies, often these are just recommended strings (which are valuable in themselves). However ontologies, specifically OBI, are an effort to assign meaning and machine interpretability to terms as a community wide consensus, to avoid the issues you mention.
One problem is then when you try to re-use these things slightly outside the setting where they were defined. I have found it difficult to see how to apply, for example, OBO to the work we do in protein engineering. Bio-ontologies tend to be focussed on the biology whereas my work in using biological systems as tools is not well described. For instance, while I love the look of GoPubMed it doesn’t actually perform any better than the old PubMed when I search for the use of something (usually in itself a heavily modified protein) as a tool. Nonetheless all these things are very valuable in helping us to agree what it is we are talking about.
Tools: tools are a big issue and often a stumbling block. Ideally, as you suggest, the information or datafiles should be automatically exported from instrumentation. This is currently not the case, and therefore there is a high cost involved for the “humble experimenter”; for some this is an unacceptable cost. Often the benefits of annotation and sharing are not obvious to the person that generated the data, but I am preaching to the converted :)
Absolutely. Proper export of data from instruments is critical. Making my proposal useful relies on good metadata being associated with the datafiles. Here FuGE or other agreed descriptions will be immensely valuable. We need to make the case that metadata is worth the effort, loud and strong. It does take effort; there is no way all of it can be automated; but it can be an enormous help. This is clearer for people who work in collaborations or who are involved in large scale projects but our experience is that it also makes a huge difference in a small scale operation.
Another point about the necessity of tools is that the descriptions of frameworks and ontologies are basically incomprehensible. I tried to read the paper you reference below, I really did. The text even made sense but I couldn’t for the life of me figure out what the diagrams meant. As a user the paper doesn’t help me figure out how to apply this to what I do. We need nice comprehensible tools that help us work through this.
With regard to the data model: FuGE is the data model [PMID: 17921998]. It is a heavyweight solution, although it is designed to represent the common aspects of a scientific experiment. The GelML case is an extension of the FuGE data model to be more specific in the representation of a gel electrophoresis experiment.
Having thought about this, I think we are at cross purposes. What we are doing is trying to implement systems that will capture activity in the laboratory, a smart, electronic version of the traditional laboratory notebook. FuGE is a system for reporting experiments. This is a crucial distinction and I don’t think it is one that is thought about very often.
What we do in the lab is very hard to record. This is what I meant when I said ‘No data model survives contact with reality’. If I write down a protocol of what I am going to do in the lab then 50% of the time that will change. Not the parameters, not the details, but a fundamental change to what I planned to do. This is even more the case in synthetic chemistry which is often ‘mix and see what happens; then make a decision about what to do next’. Now this could be captured in a FuGE framework but only in a very artificial or almost meaningless way. This is why I lean towards lightweight approaches and very free form capture.
But this is very different when it comes to reporting the experiment after the fact. Most of these initiatives have been driven by the need to describe experiments in the published literature. Here we can place the results and what we did into a sensible framework because we know what the result was. Here we should work towards structuring things as much as possible. What we should be asking is how to structure our data and procedure capture processes so that we already have as much of the relevant metadata as possible, and then how we convert what we have generated into a more structured data model. Fundamentally, capture is separate from reporting. The two can only really be the same for a completely automated process (such as running an instrument).
It’s feasible that text from wikis or blogs could be extracted, by text mining or RDF queries, and then placed in a model like this. However this is close to the data integration paradigm that currently exists in bioinformatics. It’s a lot of wasted effort. I would argue it is better to structure and annotate the data at the source on day one than to try to extract information later. I have this problem with my current project, in that semantic extraction involves picking up the phone and asking the experimenter what they did and what it all means. I can see parallels here; the only difference is that the lab book is electronic and not paper. It’s still unstructured information.
Our lab book is not quite unstructured; it does have metadata and it does define different aspects of the process. We are also very much in favour of the use of templates for repeated processes, which can be used to structure specific aspects of the capture process. Don’t get too focussed on the notion that a wiki or blog is free text. It is just like FuGE. It can be completely free or it can be a highly structured XML document. It depends on how you are interacting with it and what layers are between you and the underlying database.
The key to our blog system is that the structure is arbitrary; the system itself is semantically unaware. How you choose to process the information is up to you. New keys and key values can be added at any time. We have found this crucial and it is this which makes me sceptical about any semantically structured data model in this context.
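To make that concrete, here is a minimal sketch of what I mean by arbitrary keys on a post. It is not our actual implementation; the `Post` class, the `select` helper and the example keys are all made up for illustration. The point is only that the store attaches no meaning to the keys, and any meaning comes from whoever queries them.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Post:
    """One notebook entry: free text plus arbitrary key/value metadata."""
    text: str
    metadata: Dict[str, str] = field(default_factory=dict)


def select(posts: List[Post], **wanted: str) -> List[Post]:
    """Return the posts whose metadata matches every requested key/value pair."""
    return [
        p for p in posts
        if all(p.metadata.get(key) == value for key, value in wanted.items())
    ]


# Hypothetical entries; the keys and values are illustrative only.
posts = [
    Post("Ran PCR on construct A.", {"type": "procedure", "material": "construct A"}),
    Post("Gel image attached.", {"type": "data"}),
]

# A new key can be introduced at any time; nothing has to be declared first.
posts[1].metadata["instrument"] = "gel imager"

procedures = select(posts, type="procedure")
```

Nothing here knows what "type" or "instrument" means, which is exactly why nothing breaks when the experiment changes shape halfway through.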
Like I say, there are many ways to achieve the same end. A common data model for scientific experiments exists, which has undergone a long, cross community and public consultation process. It does need tool support. However, just hypothesising, even an XSLT representation may be able to be “plugged” into a blog or a wiki to provide the structure.
I think this is similar to what I mean when I think of a template as a widget. I want to run a gel, and someone has helpfully written a little widget which pops up, asks me the relevant questions, and then goes away and generates the appropriate GelML file, probably mostly based on default values. But I’m not going to wait around until someone writes me one of these for everything we do. Ideally there would be a widget that could go to the appropriate data standard and parse that to generate the interface that asks the right questions. If it can’t find the right thing then you revert to free text and tables.
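A rough sketch of that fallback behaviour, under heavy assumptions: the `TEMPLATES` table stands in for whatever would be parsed from a data standard, the field names and default values are invented and are not taken from GelML, and the output is a plain dictionary, not a real GelML file.

```python
from typing import Dict, Optional

# Hypothetical templates: field name -> default value. In the scenario above
# these would be derived from a data standard; here they are hard-coded and
# the names and defaults are illustrative only.
TEMPLATES: Dict[str, Dict[str, str]] = {
    "gel": {"gel_type": "agarose", "percentage": "1", "voltage_V": "100"},
}


def capture(
    kind: str,
    answers: Optional[Dict[str, str]] = None,
    free_text: str = "",
) -> Dict[str, object]:
    """Record a procedure, structured if a template exists, free text if not."""
    template = TEMPLATES.get(kind)
    if template is None:
        # No template found: revert to free text.
        return {"kind": kind, "structured": False, "text": free_text}
    fields = dict(template)       # copy, so the stored defaults are not modified
    fields.update(answers or {})  # answers override defaults; extra keys are kept
    return {"kind": kind, "structured": True, "fields": fields, "text": free_text}


# The experimenter only answers the one question that differs from the default.
gel = capture("gel", {"percentage": "2"})

# Nobody has written a template for this yet, so it is captured as free text.
other = capture("crystallisation", free_text="Mixed and waited overnight.")
```

The design choice that matters is that a missing template never blocks capture; the structure is a bonus when it is available, not a precondition.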
I am afraid I am in the data model, structured and semantic data camp. Electronic notebooks and open data are a step in the right direction – a massive step. However I still believe that unless the data comes structured with semantics, whether through data models, RDF, or whatever the Zeitgeist technology, we are still not maximising the potential from this information.
I can’t disagree with this. However I don’t believe that we can capture what actually happens in the lab within any sophisticated data model except for extremely specific cases. Reporting it is very different and is subject to different requirements including community and publishing standards. What we need are systems that help us capture what happened (which is much more than the data) and then help us to report it in a coherent way.