Proposing a data model for Open Notebooks
‘No data model survives contact with reality’ – Me, Cosener’s House Workshop 29 February 2008
This flippant comment was in response to (I think) Paolo Missier asking me ‘what the data model is’ for our experiments. We were talking about how we might automate various parts of the blog system, but the point I was making was that we can’t have a data model with any degree of specificity because we very quickly find situations that don’t fit it. However, having spent some time thinking about machine readability and the possibility of converting a set of LaBLog posts to RDF, as well as the issues raised by the problems we have with tables, I think we do need some sort of data model. These are my initial thoughts on what that might look like.
What do I mean by a data model?
What I’m suggesting is a standard format to describe experiments; a default format for online notebooks. The object is to do a number of things. Firstly identify the page(s) as being an online laboratory notebook so that they can be aggregated or auto-processed as appropriate. Secondly to make rich metadata available in a human readable and machine processable form making mashups and other things possible using tools such as Yahoo! Pipes, Dapper, and the growing range of other interesting tools, but not to impose any unnecessary limitations on what that metadata might look like. Finally it should ideally make the input and connection of data easier.
This comes with a health warning. The technical side of this is well beyond my expertise, so I will probably be unclear and use the wrong terms. Such an object will be an XML wrapper with a minimal number of required attributes. There will need to be an agreed vocabulary for a few relationships but it will need to be kept to an absolute minimum.
The need for a data model
Several recent things have convinced me that we have a need for some sort of data model. The first is the general issue of getting richer metadata into our feeds. This can be achieved in our system from the existing metadata but there are some things that might benefit from a complete re-think. I have thought for a while that the metadata should be exposed through either microformats (easy enough) or RDF (not so easy: where is the vocabulary going to go, and what would it include?). By incorporating some of this directly into the data model we may circumvent some of these problems.
Another issue is the tables. My original thinking was that if we had a data model for tables then most of our problems would go away. If there was a bit of code that could be used to describe the columns and rows then this would enable both the presentation of a table to input data as well as the rendering of the table once complete. This would reduce the ‘time suck’ associated with data entry. However there are problems with this data model being associated with the rendering of the table. Some people like tables going one way, some another. There are many different possible formats for representing the same information and some are better for some purposes than others. Therefore the data model for the table should encode the meaning behind the table, usually input materials (or data), amounts, and products (materials or data).
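As a rough illustration of keeping the meaning of a table separate from its rendering, here is a minimal Python sketch. Everything in it is hypothetical (the `Entry` record, the `render` function, and the sample and material labels are invented for illustration and are not part of the LaBLog). Each entry records which material went into which sample and in what amount, and the same entries can then be laid out with samples as rows or as columns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """One unit of meaning: how much of a material went into a sample."""
    sample: str    # hypothetical sample label
    material: str  # in practice this would be a link to the material's post
    amount: str    # free text, since units vary between experiments


def render(entries, samples_as_rows=True):
    """Lay out the same entries as a grid in either orientation."""
    # dict.fromkeys keeps first-seen order while removing duplicates.
    samples = list(dict.fromkeys(e.sample for e in entries))
    materials = list(dict.fromkeys(e.material for e in entries))
    lookup = {(e.sample, e.material): e.amount for e in entries}
    if samples_as_rows:
        header = [""] + materials
        body = [[s] + [lookup.get((s, m), "") for m in materials]
                for s in samples]
    else:
        header = [""] + samples
        body = [[m] + [lookup.get((s, m), "") for s in samples]
                for m in materials]
    return [header] + body


entries = [
    Entry("Reaction 1", "Buffer", "10 uL"),
    Entry("Reaction 1", "Enzyme", "1 uL"),
    Entry("Reaction 2", "Buffer", "10 uL"),
    Entry("Reaction 2", "Enzyme", "2 uL"),
]
by_sample = render(entries, samples_as_rows=True)
by_material = render(entries, samples_as_rows=False)
```

The orientation becomes a display preference; the stored entries do not change.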
Thinking about the table issue also led me to realise something about our LaBLog system, and in particular the ‘one item-one post’ approach. We encode the relationship between inputs, outputs, and procedures through links. When I think about this I have in mind a set of inputs coming from the left and a set of outputs going to the right. However our current ‘data model’ cannot distinguish between inputs and outputs. To achieve this we absolutely need to encode information into the link (this is obviously a more general problem than just for us).
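One way to picture ‘information in the link’ is a link that carries a role as well as a target. The Python sketch below is only an illustration under assumptions: the `TypedLink` record, the two role names, and the example.org URIs are all made up, and this is not how the LaBLog currently stores links.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TypedLink:
    """A link from a procedure post to an item, carrying its role."""
    procedure: str  # URI of the post describing the procedure
    item: str       # URI of the material object or data file
    role: str       # "input" or "output" (hypothetical two-term vocabulary)


def split_by_role(links, procedure_uri):
    """Return (inputs, outputs) as lists of item URIs for one procedure."""
    mine = [link for link in links if link.procedure == procedure_uri]
    inputs = [link.item for link in mine if link.role == "input"]
    outputs = [link.item for link in mine if link.role == "output"]
    return inputs, outputs


# Placeholder URIs, invented for the example.
PROC = "http://example.org/notebook/pcr-run"
links = [
    TypedLink(PROC, "http://example.org/notebook/template-dna", "input"),
    TypedLink(PROC, "http://example.org/notebook/primers", "input"),
    TypedLink(PROC, "http://example.org/notebook/pcr-product", "output"),
]
inputs, outputs = split_by_role(links, PROC)
```

With the role attached to the link, the ‘inputs from the left, outputs to the right’ picture can be recovered from the links alone.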
However the argument against still stands. Anything that requires a fixed vocabulary is going to break. So how can we structure something simple enough, yet not so simple that it is useless? The following are some ideas, not based on any experience of building such things, but coming from the coal face as it were. It’s probably structurally wrong in detail, but my question is a) whether the overall structure is sound and b) whether it could be built.
Proposed data model
Overall an experiment has inputs and outputs. These may be data or material objects. Procedures take inputs and generate outputs. Some procedures can be encapsulated in simple agreed descriptions, some of which have existing vocabularies such as CML. Some are much more free form and may be contingent on what happens earlier in the procedure.
Data should be on the cloud somewhere, in a form that can be pointed at. Material objects need to have a post or item somewhere that describes what they are. In our LaBLog system we create a blog post for each object to give it a presence in the system. More generally these could be described elsewhere in someone else’s system or lab book. The material objects require a URI. Ideally we would like our data model to work even if the associated data or placeholder does not conform.
Broadly speaking there seem to be three types of item: material objects, data, and procedures (possibly also comments). For each of these we require a provenance (author) and a date (last modified; I am assuming a wiki-like history system in the backend somewhere). For material objects and for data there is a wide range of other metadata which may or may not be relevant or even appropriate. This would be associated with the items themselves in any case, so it does not necessarily matter to the present discussion.
Our data model centres on the procedure. This might look something like this;
<Open Notebook Experiment>
<title> {Text} </title>
<author> {Link} </author>
<date> {Text} </date>
[<location> </location>]
<metadata> {XML block} </metadata>
<procedure>
<input> {Link} </input> [<amount> </amount>]
<input> {Link} </input> [<amount> </amount>]....
<description> [reference to controlled vocabulary if appropriate]
[Free text]
</description>
<output> {Link} </output>
<output> {Link} </output>....
</procedure>
<procedure>...
</procedure>...
</Open Notebook Experiment>
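To check that an outline like this can be written out as well-formed XML, here is a minimal Python sketch using the standard library’s `xml.etree.ElementTree`. Several things in it are assumptions and not part of the proposal: XML element names cannot contain spaces, so the root is renamed `open_notebook_experiment`; the optional amount is attached to its input as an attribute so the pairing is unambiguous; and the function name, dictionary keys, URIs, and date value are invented for the example.

```python
import xml.etree.ElementTree as ET


def build_experiment(title, author_uri, date, procedures):
    """Serialise an experiment to an XML string following the outline.

    `procedures` is a list of dicts with keys "inputs" (a list of
    (uri, amount_or_None) pairs), "description" (free text) and
    "outputs" (a list of URIs). These key names are hypothetical.
    """
    # Assumption: root renamed, since XML names cannot contain spaces.
    root = ET.Element("open_notebook_experiment")
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "author").text = author_uri
    ET.SubElement(root, "date").text = date
    for proc in procedures:
        p = ET.SubElement(root, "procedure")
        for uri, amount in proc["inputs"]:
            inp = ET.SubElement(p, "input")
            inp.text = uri
            if amount is not None:
                # Assumption: amount stored as an attribute of its input.
                inp.set("amount", amount)
        ET.SubElement(p, "description").text = proc["description"]
        for uri in proc["outputs"]:
            ET.SubElement(p, "output").text = uri
    return ET.tostring(root, encoding="unicode")


xml_text = build_experiment(
    "Test digest",
    "http://example.org/people/someone",
    "2008-03-01",
    [{
        "inputs": [("http://example.org/items/plasmid", "5 uL"),
                   ("http://example.org/items/enzyme", None)],
        "description": "Incubated at 37 C for 1 h.",
        "outputs": ["http://example.org/items/digest"],
    }],
)
```

Parsing `xml_text` back with `ET.fromstring` gives a tree whose `procedure` elements can be walked by any aggregator, whether or not it understands the free-text description.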
Comments
This overall structure can accommodate both many identical reactions in parallel (all procedure blocks are the same) and sequential processing types of experiments (output from one procedure is an input for the next). The procedure descriptions can call out to external vocabularies if these are available, making it possible to structure this as much as is desired (or possible). Metadata is completely arbitrary and can refer to other vocabularies as appropriate. Again as I understand it, as long as this is wrapped properly it shouldn’t cause a problem. Metadata can also be included in procedure blocks if appropriate.
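The difference between the parallel and sequential cases can be read straight off the links. In the Python sketch below, which is illustrative only (the dictionary layout, function name, and URIs are assumptions), two procedures are treated as sequential when an output URI of one appears among the inputs of the other.

```python
def downstream_links(procedures):
    """Return (i, j) pairs where an output of procedure i is an input of j."""
    pairs = []
    for i, first in enumerate(procedures):
        for j, second in enumerate(procedures):
            if i != j and set(first["outputs"]) & set(second["inputs"]):
                pairs.append((i, j))
    return pairs


# Two identical reactions in parallel: same inputs, separate products.
parallel = [
    {"inputs": ["http://example.org/items/a"],
     "outputs": ["http://example.org/items/p1"]},
    {"inputs": ["http://example.org/items/a"],
     "outputs": ["http://example.org/items/p2"]},
]

# A reaction followed by workup: the crude product feeds the next step.
sequential = [
    {"inputs": ["http://example.org/items/a"],
     "outputs": ["http://example.org/items/crude"]},
    {"inputs": ["http://example.org/items/crude"],
     "outputs": ["http://example.org/items/pure"]},
]

parallel_pairs = downstream_links(parallel)
sequential_pairs = downstream_links(sequential)
```

Here `parallel_pairs` is empty and `sequential_pairs` contains the single pair `(0, 1)`, so nothing beyond the URIs is needed to tell the two shapes of experiment apart.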
This all assumes that the datafiles and material descriptions carry sufficient metadata on their own to allow reasonable processing. Whether or not this is true the bottom line to me is that if you are pointing at external data you can’t rely on it having certain metadata in general. It will always be a case by case point. We can argue about a format for descriptions of material objects or samples some other time :)
How does this address user interface issues?
Arguably this has the potential to complicate the user interface issues even further. But I think that can be handled. It is important to remember that this is the ‘file’ that is generated and presented to the world. Not necessarily the file that the user generates. But in general terms you can see how this might work with a user interface.
The first thing we start with is: what do you want to do? How many reactions/samples do you want to run in parallel? How many things go into each reaction/sample?
The system then generates a table for input (defined as per preferences in a style sheet somewhere) where the user can then select the appropriate links (or paste them in) and amounts where appropriate. There may also be the option of putting in metadata, either arbitrary tags or (via a widget) suggestions from a controlled vocabulary.
The procedure again can be a free text description (which could be grabbed from a blog or wiki feed or a word processor document as appropriate). Alternatively a widget may be available that helps the user describe the experiment via a controlled vocab. The user may also indicate where the analysis results might be found so the system can keep an eye out for them on the appropriate feed, or it might require the user to come back afterwards and make sure the connections are ok. If a product is a material object then the system should automatically generate a post on the user’s request, possibly when a label is printed for the object.
Finally the user is asked whether they want to continue the same experiment (moving on to workup, doing the next step) or finish this one up and move on.
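The first two steps of that interaction could be as simple as the following Python sketch, which turns the answers to ‘how many samples?’ and ‘how many inputs each?’ into blank records for the user to fill in. The function name and the dictionary layout are assumptions made for illustration; a real interface would render these records as a table according to the user’s preferences.

```python
def blank_procedures(n_samples, n_inputs):
    """Create one empty procedure record per sample, ready to be filled in."""
    return [
        {
            # Each input slot has a link and an optional amount.
            "inputs": [{"link": "", "amount": ""} for _ in range(n_inputs)],
            "description": "",
            "outputs": [],
        }
        for _ in range(n_samples)
    ]


form = blank_procedures(n_samples=3, n_inputs=2)
# Filling in one slot leaves the other records untouched.
form[0]["inputs"][0]["link"] = "http://example.org/items/buffer"
```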
Much of this can and would be made even quicker with templates for commonly used procedures or types of analysis, but even without them the interaction process can be reasonably simple. This still doesn’t capture the analysis process of ‘fiddle until it looks right’ but I don’t think we ever will manage that. What is key is ‘this was the input data, this was the output data, this is what converted one to the other’.
Conclusions
None of the above is meant to be an argument against controlled vocabularies, RDF, or agreeing data formats. These are all good things and should be used where possible and practicable. However in many cases it is impossible or even inappropriate to use or agree vocabularies or formats. My object is to try and describe the simplest framework which could work, which would allow automated systems to understand as much as they can about our experiments while dealing with the practical requirements of getting it done. I hope this makes some sense. I will try to put together some pictures tomorrow to make it clearer.
Are you including interpretation as an output or are you just worrying about reaction conditions and raw data here? How would you handle error correction?
I think it depends on how you want to handle it. I would put it in the procedure or alternatively handle it at a higher level in a separate note/blog/wiki post. I am not trying to cover all parts of the process here, just the recording of facts. So you could treat interpretation either as a running note in the procedures (I think this is working), or a direct characteristic of a product (you can isolate a precipitate), or perhaps more formally as a product of a consideration of your analysis data and other products (e.g. input = product [weighs X], NMR, MS etc; procedure is I decided whether I’d made the right thing; product is ‘yes’) but that seems a little ridiculous. A piece of metadata that says ‘this worked’ would probably do the job in many cases. More sophisticated interpretation will probably require more words and explanation.
And I am assuming this would sit on a versioning system like a wiki. One thing that this system doesn’t capture well is the way you interact with students on the UsefulChem Wiki by pointing out gaps or mistakes. But maybe this is captured by the version system which logs who made what additions (which perhaps means you don’t need an author as a required tag?).
In any case I am sure there are serious problems with this as it stands, not least that if you do a time course experiment, it requires you to keep splitting up the procedure. But maybe we can iterate this a few times and see whether it can ever work.
The key aspect of my proposal, I think, is that things need a URI, and that the rest should be as flexible as possible.
There is a data model for scientific experiments that should set you on your way. It’s called the Functional Genomics Experiment model (FuGE: http://fuge.sourceforge.net/). It’s a UML model, so it can be realised in several concrete formats, for example XML or relational tables. I declare a bias, as I have extended it to create an XML transfer format for gel electrophoresis (GelML: http://psidev.info/index.php?q=node/254) and therefore highly recommend it. The specificity comes from the use of ontologies or controlled vocabularies to add semantics to the generic structure, for example the prospective use of the Ontology for Biomedical Investigations, or any number from the OBO Foundry. Even though the name suggests it’s for functional genomics, I am currently using it to represent neuroscience electrophysiology experiments within CARMEN (http://carmen.org.uk/), via Symba (http://symba.sourceforge.net/), which is an open source database implementation of the FuGE data model providing a versioning mechanism and a user/query interface, although it is still in early development.
Mm, managed to get a few plugs in there :) but I do think these technologies can provide a platform for a standard representation of scientific experiments. Is a wiki the best option…..? It mimics a traditional paper based labbook very well, however, it is not very computationally amenable. Do we need to think more outside the box? If we make our data open in our lab book do we not also want to perform analysis over it where it sits, rather than downloading it back to our desktop?
Hi Frank, I don’t think we fundamentally disagree. I think controlled vocabs are great where they can be applied but in my experience you either end up in a situation where your vocab no longer fits (should I use FUGE or CML?) or worse, where your vocab actually constrains your experiment.
Where they can be used, and where things can be agreed, then tools should be written to help people use them. But at the moment I don’t think those tools exist for the humble experimenter such as myself.
To give an example. Your gelML documentation runs to N pages. From my perspective I would just say ‘I ran a gel’, see e.g. http://chemtools.chem.soton.ac.uk/projects/blog/blogs.php/bit_id/2733
Now what would be brilliant would be if our system could automatically say, ‘well this is a gel, therefore I should try and expose the description in gelML with all the info that I have available’. But then someone has to implement that for gels, for columns, for NMR, for neutron scattering data, …
My argument is that we need to be able to use all of these vocabs, or none, to describe experiments to a practical level of detail. My proposal isn’t to ‘use a wiki’; in fact we use a blog. Most other people who are really using these systems in anger in a wetlab are using wikis because they prefer the functionality available. We need to have an extremely general framework that we can then slot all of these different systems into.
Just to add to that. Looking at the gelML documentation, I would see that as a data file. Ideally this would be generated by some electrophoresis system. The ‘lab book’ would then point at that file as the ‘product’ of the procedure ‘ran gel’ with the inputs ‘these samples’. A clever system would either a) pull the procedure and the inputs from the gelML file, or b) take these from the lab book and then add them to the gelML file.
My thinking primarily relates to the generation of samples. Once you get into instrumentation a lot of the issues are already taken care of by the systems available.
“…My argument is that we need to be able to use all of these vocabs, or none, to describe experiments to a practical level of detail….”
I agree, FuGE will allow you to do this.
With all these things, especially modelling, there are lots of ways to reach the end goal. With respect to vocabularies, often these are just recommended strings (which are valuable in themselves). However ontologies, specifically OBI, are an effort to assign meaning and machine interpretability to terms as a community-wide consensus, to avoid the issues you mention.
Tools: tools are a big issue and often a stumbling block. Ideally, as you suggest, the information or datafiles should be automatically exported from instrumentation. This is currently not the case and therefore there is a high cost involved; for some this is an unacceptable cost to the “humble experimenter”. Often the benefits of annotation and sharing are not obvious to the person that generated the data, but I am preaching to the converted :)
In regards to the data model. FuGE is the data model [PMID: 17921998]. It is a heavy weight solution, although it is designed to represent the common aspects of a scientific experiment. The GelML case is an extension of the FuGE data model to be more specific in the representation of a gel electrophoresis experiment.
However, if you are looking for a data model for the representation of scientific experiments, FuGE should be your first port of call. You do not have to use it out of the box as one entity but can re-use components. It allows the representation of protocols – the standard instructions/templates and the application of those protocols – capturing the day to day changeable information such as dates and lot numbers without having to fill in the full protocol again. Ontologies or controlled vocabularies are recommended but not required; you can use free text.
It’s feasible that text from wikis or blogs could be extracted, by text mining or RDF queries, and then placed in a model like this. However, this is close to the data integration paradigm that currently exists in bioinformatics. It’s a lot of wasted effort. I would argue it is better to structure and annotate the data at the source on day one than to try to extract information later. I have this problem with my current project, in that semantic extraction involves picking up the phone and asking the experimenter what they did and what it all means. I can see parallels here; the only difference is that the lab book is electronic and not paper. It’s still unstructured information.
Like I say, there are many ways to achieve the same end. A common data model for scientific experiments exists, which has undergone a long, cross-community and public consultation process. It does need tool support. However, just hypothesising, even an XSLT representation may be able to be “plugged” into a blog or a wiki to provide the structure.
I am afraid I am in the data model, structured and semantic data camp. Electronic notebooks and open data are a step in the right direction – a massive step. However I still believe that unless the data comes structured with semantics, whether through data models, RDF, or whatever the Zeitgeist technology is, we are still not maximising the potential from this information.
License
To the extent possible under law, Cameron Neylon has waived all copyright and related or neighboring rights to Science in the Open. Published from the United Kingdom.