Data models for capturing and describing experiments – the discussion continues
Frank Gibson has continued the discussion that kicked off here and carried on here [1, 2, 3, 4] and in other places [1, 2] along the way. Frank’s exposition on using FuGE as a data model is very clear in what it says and does not say, and some of his questions have revealed sloppiness in the way I originally described what I was trying to do. Here I will respond to his responses and try to clarify what it is that I want, and what I want it to achieve. I still feel that we are trying to describe and achieve different things, but this discussion is a great way of getting to the bottom of that and achieving some clarity in our description and language.
CN: What I’m suggesting is a standard format to describe experiments;…
A “standard” in the true sense of the word (established by consensus and approved by a recognized body) already exists to describe life-science experiments. It is a data model represented in UML called FuGE.
I got off to a very bad start here. I should have used the word ‘capture’. This to me is about capturing the data streams that come out of lab work. FuGE is indeed a model to describe life science experiments. I have concerns about whether it is extendable in a useful way to other disciplines, but it is well structured for what it was designed to do, which is to describe life science experiments, particularly those involving high throughput analysis. What I am less sure about is its practical application to capturing what is going on in the researcher’s head in the laboratory during an experiment.
[CN] …..a default format for online notebooks. The object is to do a number of things. Firstly identify the page(s) as being an online laboratory notebook so that they can be aggregated or auto-processed as appropriate.
[FG] I see this as two different and separate things, the data model which represents experiments, and the presentation of the model to the user, in this case described as an online notebook. Page numbers are an arbitrary visual aid, they are not integral to modelling experiments
Here we start to see how the different motivations are driving our views. What I want here is a marker on a web document that says ‘I am a scientific experiment’ (page was a poor term to use; I simply mean any web document, generally accessed through discrete web pages). This will allow aggregation and distribution of the notebook a la PostGenomic or Chemical Blogspace. To me this is more important than the format of the underlying data. If I can find interesting data I will probably put the work into extracting it in a form useful to me. To Frank I suspect the aggregation and indexing is a peripheral issue. If the data isn’t in an agreed format it isn’t useful for him.
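To make the idea of a marker concrete, here is a minimal sketch of how an aggregator might pick out notebook pages. The marker itself (a `meta` tag named `notebook-type` with the value `lab-experiment`) is purely an assumption made for illustration; no such convention has been agreed anywhere in this discussion, and the function names are hypothetical.

```python
from html.parser import HTMLParser

# Hypothetical marker: this name/value pair is an assumption for illustration,
# not an existing convention.
MARKER_NAME = "notebook-type"
MARKER_VALUE = "lab-experiment"


class MarkerFinder(HTMLParser):
    """Sets is_experiment to True if the page carries the marker meta tag."""

    def __init__(self):
        super().__init__()
        self.is_experiment = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if attr.get("name") == MARKER_NAME and attr.get("content") == MARKER_VALUE:
            self.is_experiment = True


def is_experiment_page(html):
    """Return True if the HTML document declares itself a lab notebook page."""
    finder = MarkerFinder()
    finder.feed(html)
    return finder.is_experiment


def aggregate(pages):
    """Given a {url: html} mapping, return the sorted URLs that carry the marker."""
    return sorted(url for url, html in pages.items() if is_experiment_page(html))
```

Note that the aggregator never looks at how the data inside the page is structured. That is the point being made above: discovery can work before any agreement on the underlying data format.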
[CN] Another issue is the tables. My original thinking was that if we had a data model for tables then most of our problems would go away.
[FG] I am not sure I agree here. What is a table? I see it as a particular visual display mechanism that you have chosen to represent your results. The results can be modelled more accurately within the data model such as chemical-has_measurement, measurement has_numerical value and has_unit. I believe this statement is confusing the visual presentation of data with structuring the data.
Again this is a central user interface issue for us. Capturing an experiment in the wet lab, whether noting it as it happens or planning what you are going to do in advance, is often most easily done with a table. Tables are not well implemented in the wiki and blog frameworks we are using for these systems. Therefore providing a table to capture the experiment is critical if you actually want anyone to use your system. Our users consistently identify this as the single biggest barrier to them using our system.
Now the heavyweight approach to this is to say: ‘That’s why you need a data model. Once you have that you can generate a nice web form to capture the necessary data’. The problem with this comes when you do something slightly different. As an example I had a template set up in our system for capturing the setup of SDS-PAGE gels. This would go and look for anything that was tagged as ‘protein’ as a potential sample and present these in a drop down menu. This was fine until the day I wanted to run a DNA-protein conjugate on the gel. Essentially I had broken my own data model. This could be fixed, and I did fix it, by changing the way my template looked for potential samples. But in the cut and thrust of real lab work (as opposed to an academic pottering under sufferance of his students) this isn’t feasible. We can’t extend the data model every time we do something new, because we are always doing something new.
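The gel template problem can be sketched in a few lines. This is not the code from our system; the sample records, tag names and the `candidate_samples` function are all hypothetical, and the sketch only assumes that samples carry a set of tags and that the template filters on them.

```python
# Hypothetical sample records; names and tags are invented for illustration.
samples = [
    {"name": "lysozyme prep", "tags": {"protein"}},
    {"name": "plasmid prep", "tags": {"dna"}},
    {"name": "DNA-protein conjugate", "tags": {"conjugate"}},
]


def candidate_samples(samples, accepted_tags):
    """Return names of samples that carry at least one of the accepted tags."""
    return [s["name"] for s in samples if s["tags"] & accepted_tags]


# The original template only accepted things tagged 'protein',
# so the conjugate never appears in the drop down menu.
original_menu = candidate_samples(samples, {"protein"})

# The fix widens the filter, which amounts to editing the data model by hand.
fixed_menu = candidate_samples(samples, {"protein", "conjugate"})
```

Each new kind of sample needs another edit to `accepted_tags`, which is exactly the maintenance burden described above.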
Bottom line is that tables are a familiar and easy way for users to structure information visually. If you can’t provide them people will walk away. Now how we implement those in a general way I think is an open question. I would argue against anything more complex than: ‘This is an input. This is an output.’ But then again this is because we are at the sample production end rather than the sample analysis end.
[FG]….FuGE is designed so that it provides a generic structure which can then be described or further specialised by the user/application by extending the model itself or by using cv’s/ontologies or free text. This provides the flexibility and in theory future proof.
But does this require that the user does the extension every time they move on to something new? As a matter of interest, how much time and effort went into agreeing the GelML? Is it practical to do this extension over and over again? And who will fund it? My concern is that achieving added value requires the controlled vocabulary. If we are going to just end up using free text because a cv doesn’t exist for the experiment we are doing then why use a complex data structure?
[CN] Overall an experiment has inputs and outputs. These may be data or material objects. Procedures take inputs and generate outputs.[..] Broadly speaking there seem to be three types of item; material objects , data, and procedures (possibly also comments). For each of these we require a provenance (author), and a date
[FG] I would agree with your assessment of what classes are needed. This corresponds to what FuGE contains as illustrated in the diagram below (click [to see] image […])
Yep, we can agree on something! FuGE certainly has the categories of items and a bunch more. I don’t think this is terribly controversial except that we will start having arguments over what is ‘data’ and what is a ‘material property’ as we move into the more chemical sciences.
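For concreteness, the minimal set of items we agree on (material objects, data, and procedures that take inputs and generate outputs, each with an author and a date) could be written down as below. This is a sketch of that minimal description only, not of FuGE; the class and field names are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Item:
    """A material object or a piece of data, with provenance and a date."""
    name: str
    kind: str      # "material" or "data"; where the line falls is the contentious part
    author: str
    created: date


@dataclass
class Procedure:
    """A procedure takes inputs and generates outputs."""
    name: str
    author: str
    created: date
    inputs: list = field(default_factory=list)
    outputs: list = field(default_factory=list)


# Hypothetical example: running a gel on a sample and recording the image.
today = date(2008, 1, 1)  # placeholder date
sample = Item("DNA-protein conjugate", "material", "CN", today)
image = Item("gel image", "data", "CN", today)
run_gel = Procedure("SDS-PAGE", "CN", today, inputs=[sample], outputs=[image])
```

The `kind` field is a plain string precisely because the boundary between ‘data’ and ‘material property’ is not settled; a richer model would need to take a position on it.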
[FG] In summary, the position I want to present is that FuGE is a data model to represent scientific experiments. Several domains are using it to represent their experiments from traditional biology/molecular biology to neurophysiology. I believe FuGE could form the underlying model for a “notebook” via an abstraction/presentation layer to the user. In how should it be implemented, blog, wiki, database, latex, XML, RDF, OWL, I am not going to hypothesis [CN – my emphasis]. However, a database implementation of the FuGE schema is already in development called SyMBA which abstracts away from the user presenting simple web forms to fill out the XML which is then stored as a relation database.
I think this defines our different perspectives quite well. I am all about hypothesising how this should be implemented and how this will affect the user interface. Simple web-based forms will not work for the experimental work that is done in my lab, and I think Jean-Claude, and possibly Peter Murray-Rust, will agree with me. Unless you are doing very structured experiments (of the type that FuGE was designed to describe) then my experience is that a form-based approach is inadequate to capture what is actually happening and the decision making process that is driving that. You can however apply form-based approaches to specific well defined protocols, but only if such well defined protocols exist for what you are doing.
Finally I want to return to the beginning of Frank’s post where he notes:
I think a large part of this discussion confuses and conflates 3 issues which I believe to be separate;
1. the representation of experiments – the data model
2. the presentation or level of abstraction to the user (probably some what dependent on 3.)
3. the implementation of the data model
Yep. Couldn’t agree more. And I don’t think my use of language has been helpful here. I am still groping for the right terminology to use. But there is one thing that is missing here which is the mapping of the scientist’s data model onto all of these. If we don’t consider the way the user thinks about their experiments then we will run into trouble very rapidly.
If I can crystallise my concerns about FuGE, I think they fundamentally revolve around the user experience in the lab, recording the experiment as it happens. These concerns are not based, as yet, on a real attempt to implement it; hopefully we can work together to put together a few examples of how it could be used to represent our experiments and nail some of them down properly.
- Are the data models there for what I am doing? If not, who is going to develop them and then develop the UI implementations?
- If we simply use a framework and the rest is free text, is FuGE going to add more than a simpler model?
- Does the model extend effectively into other domains, particularly on the hinterlands between chemistry and biology?
- Does the data model effectively capture the process of an experiment as it happens? Particularly, how does it deal with capturing the basis of decisions to change (or abandon) a protocol midstream, to entirely re-purpose an experiment, or to reassess the identity of a specific sample or material? How is an exciting new observation incorporated?
None of which should be taken as a rejection. I am strongly in favour of incorporating structured descriptions of experiments where they exist and can be implemented in a way that is good for the user. I am also definitely against re-inventing the wheel. Structured descriptions of processes can be used effectively to make it easier for the researcher to capture what is happening during an experiment as long as they are accessible and implemented well. UML/XML/RDF/OWL systems have great appeal because our ‘lab book system’ can decide to look these up and say ‘Ah hah! Running a gel are you? Well we’ll grab GelML and set up some forms for you to fill in then.’
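The ‘Ah hah! Running a gel are you?’ behaviour amounts to a lookup with a fallback. A minimal sketch, assuming a registry that maps procedure names to structured descriptions: the registry contents and the form field names here are placeholders invented for illustration and are not taken from GelML or any other real schema.

```python
# Hypothetical registry. The field names are placeholders, NOT real GelML elements.
FORM_REGISTRY = {
    "gel": {"schema": "GelML", "fields": ["samples", "gel type", "running conditions"]},
}

# When no structured description exists, fall back to the minimal model:
# inputs, outputs and free text.
FALLBACK = {"schema": None, "fields": ["inputs", "outputs", "free text"]}


def form_for(procedure):
    """Return the form description for a procedure, or the minimal fallback."""
    key = procedure.strip().lower()
    return FORM_REGISTRY.get(key, FALLBACK)
```

With this shape, a structured description is used wherever one exists, and anything new still gets captured rather than blocked.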
Perhaps the right question to be asking is, what is the minimal implementation of FuGE, and is it actually very different to what I in fact proposed to start with? Perhaps we should look for the similarities first?