Proposing a data model for Open Notebooks
‘No data model survives contact with reality’ – Me, Cosener’s House Workshop 29 February 2008
This flippant comment was in response to (I think) Paolo Missier asking me ‘what the data model is’ for our experiments. We were talking about how we might automate various parts of the blog system, but the point I was making was that we can’t have a data model with any real specificity because we very quickly hit situations where it doesn’t fit. However, having spent some time thinking about machine readability and the possibility of converting a set of LaBLog posts to RDF, as well as the issues raised by the problems we have with tables, I think we do need some sort of data model. These are my initial thoughts on what that might look like.
What do I mean by a data model?
What I’m suggesting is a standard format for describing experiments; a default format for online notebooks. The object is to do a number of things. Firstly, to identify the page(s) as being an online laboratory notebook so that they can be aggregated or auto-processed as appropriate. Secondly, to make rich metadata available in a human readable and machine processable form, making mashups and other things possible using tools such as Yahoo! Pipes, Dapper, and the growing range of other interesting tools, without imposing any unnecessary limitations on what that metadata might look like. Finally, it should ideally make the input and connection of data easier.
This comes with a health warning. The technical side of this is well beyond my expertise, so I will probably be unclear and use the wrong terms. Such an object will be an XML wrapper with a minimal number of required attributes. There will need to be an agreed vocabulary for a few relationships, but it will need to be kept to an absolute minimum.
The need for a data model
Several recent things have convinced me that we need some sort of data model. The first is the general issue of getting more rich metadata into our feeds. This can be achieved in our system from the existing metadata, but there are some things that might benefit from a complete re-think. I have thought for a while that the metadata should be exposed through either microformats (easy enough) or RDF (not so easy: where would the vocabulary live, and what would it include?). By incorporating some of this directly into the data model we may circumvent some of these problems.
Another issue is the tables. My original thinking was that if we had a data model for tables then most of our problems would go away. If there were a bit of code that could be used to describe the columns and rows, then this would enable both the presentation of a table for inputting data and the rendering of the table once complete. This would reduce the ‘time suck’ associated with data entry. However there are problems with tying this data model to the rendering of the table. Some people like tables going one way, some another. There are many different possible formats for representing the same information, and some are better for some purposes than others. Therefore the data model for the table should encode the meaning behind the table: usually input materials (or data), amounts, and products (materials or data).
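To make that concrete, a single row of a typical reaction table might be captured semantically along these lines (just a sketch anticipating the structure I propose below; the element names are illustrative, not a proposal):

  <row>
    <input> {Link to reagent post} </input> <amount> 5 mg </amount>
    <input> {Link to sample post} </input> <amount> 100 uL </amount>
    <output> {Link to product post} </output>
  </row>

Whatever layout a particular user prefers could then be rendered from the same underlying description.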
Thinking about the table issue also led me to realise something about our LaBLog system, and in particular the ‘one item-one post’ approach. We encode the relationship between inputs, outputs, and procedures through links. When I think about this I have in mind a set of inputs coming from the left and a set of outputs going to the right. However our current ‘data model’ cannot distinguish between inputs and outputs. To achieve this we absolutely need to encode information into the link (this is obviously a more general problem than just for us).
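One simple way to do this would be to type the link itself, something like:

  <link rel="input" href="{Link to starting material}" />
  <link rel="output" href="{Link to product}" />

The precise syntax is beside the point; what matters is that the direction of the relationship is encoded somewhere a machine can see it.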
However the argument against still stands. Anything that requires a fixed vocabulary is going to break. So how can we structure something simple enough to survive, yet not so simple that it is useless? The following are some ideas, not based on any experience of building such things, but coming from the coal face as it were. It’s probably structurally wrong in detail, but my questions are a) whether the overall structure is sound and b) whether it could be built.
Proposed data model
Overall, an experiment has inputs and outputs. These may be data or material objects. Procedures take inputs and generate outputs. Some procedures can be encapsulated in simple agreed descriptions, some of which have existing vocabularies such as CML. Some are much more free form and may be contingent on what happens earlier in the procedure.
Data should be on the cloud somewhere, in a form that can be pointed at. Material objects need to have a post or item somewhere that describes what they are. In our LaBLog system we create a blog post for each object to give it a presence in the system. More generally these could be described elsewhere in someone else’s system or lab book. The material objects require a URI. Ideally we would like our data model to work even if the associated data or placeholder does not conform.
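As a minimal example (again just a sketch, with invented element names), the placeholder for a material object might be no more than:

  <item type="material" uri="{URI for this sample}">
    <title> Product from step 2 </title>
  </item>

The URI is the part that really matters; any further metadata on the object is a bonus.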
Broadly speaking there seem to be three types of item: material objects, data, and procedures (possibly also comments). For each of these we require a provenance (author) and a date (last modified; I am assuming a wiki-like history system in the backend somewhere). For material objects and for data there is a wide range of other metadata which may or may not be relevant or even appropriate. This would be associated with the items themselves in any case, so it does not necessarily matter to the present discussion.
Our data model centres on the procedure. It might look something like this:
<OpenNotebookExperiment>
  <title> {Text} </title>
  <author> {Link} </author>
  <date> {Text} </date>
  [<location> </location>]
  <metadata> {XML block} </metadata>
  <procedure>
    <input> {Link} </input> [<amount> </amount>]
    <input> {Link} </input> [<amount> </amount>]....
    <description> [reference to controlled vocabulary if appropriate]
      [Free text]
    </description>
    <output> {Link} </output>
    <output> {Link} </output>....
  </procedure>
  <procedure>...
  </procedure>...
</OpenNotebookExperiment>
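To ground this, a hand-filled record for a simple one-step experiment might read something like this (all links and values are made up):

<OpenNotebookExperiment>
  <title> Reduction of sample A </title>
  <author> {Link to my profile post} </author>
  <date> 2008-02-29 </date>
  <metadata> {project tags etc.} </metadata>
  <procedure>
    <input> {Link to sample A post} </input> <amount> 100 uL </amount>
    <input> {Link to reductant post} </input> <amount> 5 mg </amount>
    <description> Added the reductant to the sample and left for one hour at room temperature </description>
    <output> {Link to product post} </output>
  </procedure>
</OpenNotebookExperiment>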
Comments
This overall structure can accommodate both many identical reactions in parallel (all procedure blocks are the same) and sequential processing types of experiment (the output from one procedure is an input for the next). The procedure descriptions can call out to external vocabularies if these are available, making it possible to structure this as much as is desired (or possible). Metadata is completely arbitrary and can refer to other vocabularies as appropriate. Again, as I understand it, as long as this is wrapped properly it shouldn’t cause a problem. Metadata can also be included in procedure blocks if appropriate.
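The sequential case, for example, is just a matter of reusing a link (sketch only):

  <procedure>
    <input> {Link to sample} </input>
    <description> First step </description>
    <output> {Link to crude product} </output>
  </procedure>
  <procedure>
    <input> {Link to crude product} </input>
    <description> Workup </description>
    <output> {Link to purified product} </output>
  </procedure>

The same link appearing as the output of one block and the input of the next is all the chaining required.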
This all assumes that the data files and material descriptions carry sufficient metadata on their own to allow reasonable processing. Whether or not this is true, the bottom line to me is that if you are pointing at external data you can’t rely on it having particular metadata in general. It will always be a case-by-case matter. We can argue about a format for descriptions of material objects or samples some other time :)
How does this address user interface issues?
Arguably this has the potential to complicate the user interface issues even further, but I think that can be handled. It is important to remember that this is the ‘file’ that is generated and presented to the world, not necessarily the file that the user generates. But in general terms you can see how this might work with a user interface.
The first thing we start with is: what do you want to do? How many reactions/samples do you want to run in parallel? How many things go into each reaction/sample?
The system then generates a table for input (laid out as per preferences in a style sheet somewhere) where the user can select the appropriate links (or paste them in) and add amounts where appropriate. There may also be the option of putting in metadata, either as arbitrary tags or (via a widget) as suggestions from a controlled vocabulary.
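Behind the scenes, all the interface would really be doing is filling in a skeleton like the one below, with one block per parallel reaction, leaving the user to drop links and amounts into the empty slots:

  <procedure>
    <input> </input> [<amount> </amount>]
    <input> </input> [<amount> </amount>]
    <description> </description>
    <output> </output>
  </procedure>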
The procedure can again be either a free text description (which could be grabbed from a blog or wiki feed or a word processor document as appropriate) or built up with a widget that helps the user describe the experiment via a controlled vocabulary. The user may also indicate where the analysis results might be found, so the system can keep an eye out for them on the appropriate feed, or it might require the user to come back afterwards and make sure the connections are ok. If a product is a material object then the system should automatically generate a post at the user’s request, possibly when a label is printed for the object.
Finally the user is asked whether they want to continue the same experiment (moving on to workup, doing the next step) or finish this one up and move on.
Much of this can and would be made even quicker with templates for commonly used procedures or types of analysis, but even without them the interaction process can be reasonably simple. This still doesn’t capture the ‘fiddle until it looks right’ part of the analysis process, but I don’t think we ever will manage that. What is key is ‘this was the input data, this was the output data, this is what converted one to the other’.
Conclusions
None of the above is meant to be an argument against controlled vocabularies, RDF, or agreed data formats. These are all good things and should be used where possible and practicable. However, in many cases it is impossible or even inappropriate to use or agree on vocabularies or formats. My object is to describe the simplest framework that could work: one that would allow automated systems to understand as much as they can about our experiments while dealing with the practical requirements of getting things done. I hope this makes some sense. I will try to put together some pictures tomorrow to make it clearer.