Recording the fiddly bits of experimental and data analysis work

We are in the slow process of gearing up within my group at RAL to adopt the Chemtools LaBLog system and in the process moving properly to an Open Notebook status. This has taken much longer than I had hoped but there have been some interesting lessons along the way. Here I want to think a bit about a problem that has been troubling me for a while.

I haven’t done a very good job of recording what I’ve been doing in the times that I have been in a lab over the past couple of months. Anyone who has been following along will have seen small bursts of apparently unrelated activity where nothing much ever seems to come to a conclusion. This has been divided up mainly into a) a SANS experiment we did in early November which has now moved into a data analysis phase, b) some preliminary, and thus far fairly unconvincing, experiments attempting to use a very new laser tweezers setup at the Central Laser Facility to measure protein-DNA interactions at the single molecule level and c) other random odds and sods that have come by. None of these have been very well recorded for a variety of reasons.

Data analysis, particularly when it uses a variety of specialist software tools, is something I find very challenging to record. A common approach is to take some relatively raw data, run it through some software, and repeat, while fiddling with parameters to get a feel for what is going on. Eventually the analysis is run “for real” and the finalised (at least for the moment) structure/number/graph is generated. The temptation is obviously just to formally record the last step but while this might be ok as a minimum standard if only one person is involved, when more people are working through data sets it makes sense to try and keep track of exactly what has been done and which data has been partially processed in which ways. This helps us both to quickly track where we are with the process and reduces the risk of replicating effort.
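One lightweight way to capture those intermediate runs, rather than only the final one, is to append a timestamped record of each run as it happens: which tool, which parameters, which input and output files, and a one-line note on why. The sketch below is only illustrative; the tool name, file paths, and parameter names are all made up, and nothing here depends on our actual analysis software.

```python
import json
import time
from pathlib import Path

LOG = Path("analysis_log.jsonl")  # hypothetical log: one JSON record per line

def record_run(tool, params, inputs, outputs, note=""):
    """Append a timestamped record of a single analysis run to the log."""
    entry = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "tool": tool,        # which piece of software was run
        "params": params,    # the parameter values actually used this time
        "inputs": inputs,    # raw or partially processed data files read
        "outputs": outputs,  # files this run produced
        "note": note,        # free-form comment: why this run was done
    }
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

# Example: one exploratory fitting run (names and values are invented)
record_run(
    tool="sans_fit",
    params={"background": 0.05, "model": "sphere"},
    inputs=["raw/run_042.dat"],
    outputs=["processed/run_042_fit.dat"],
    note="trying a lower background subtraction",
)
```

Because every run, exploratory or final, leaves a line in the log, anyone in the group can see which data sets have already been processed in which ways before starting on them.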

The laser tweezers experiment involves a lot of optimising of buffer conditions, bead loading levels, instrumental parameters and whatnot. Essentially a lot of fiddling, rapid shifts from one thing to another and not always being too sure exactly what is going on. We are still at the stage of getting a feel for things rather than stepping through a well ordered experiment. Again the recording tends to be haphazard as you try one thing and then another. We’re not even absolutely sure what we should be recording for each “run” or indeed really what a “run” is yet.

The common theme here is “fiddling” and the difficulty of recording it efficiently, accurately, and usefully. What I would prefer to be doing is somehow capturing the important aspects of what we’re doing as we do it. What is less clear is what the best way to do that is. In the case of data analysis we have a good model for how to do this well. Good use of repositories, and the use of versioned scripts for handling data conversions in the way that Michael Barton in particular has talked about, provide an example of good practice. Unfortunately it is good practice that is almost totally alien to experimental biochemists and is also not easily compatible with a lot of the software we use.
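The essence of the versioned-scripts pattern is that every derived data file can be traced back to the exact version of the script that produced it. One minimal way to get that traceability, even without a full version control workflow, is to stamp each deposited output with a hash of the script and write a small manifest alongside it. The function below is a sketch under that assumption; the names `deposit`, the store layout, and the manifest fields are all invented for illustration.

```python
import hashlib
import json
import shutil
from pathlib import Path

def deposit(script: Path, data_in: Path, data_out: Path, store: Path) -> Path:
    """Copy a derived data file into a store, stamped with a hash of the
    script that produced it, and write a manifest linking script, input,
    and output so the provenance of the file can always be recovered."""
    digest = hashlib.sha1(script.read_bytes()).hexdigest()[:8]
    store.mkdir(parents=True, exist_ok=True)

    # e.g. "run_042_fit.dat" becomes "run_042_fit.1a2b3c4d.dat"
    dest = store / f"{data_out.stem}.{digest}{data_out.suffix}"
    shutil.copy(data_out, dest)

    manifest = {
        "script": str(script),
        "script_sha1": digest,
        "input": str(data_in),
        "output": str(dest),
    }
    (store / f"{data_out.stem}.{digest}.json").write_text(
        json.dumps(manifest, indent=2)
    )
    return dest
```

If the script changes, its hash changes, so re-running the same analysis deposits a distinguishable output rather than silently overwriting the old one. A proper repository (git or similar) does this better, but this captures the core idea in a form that sits alongside existing tools.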

The ideal would be a work bench using a graphical representation of data analysis tools and data repositories that would automatically generate scripts and deposit these and versioned data files into an appropriate repository. This would enable the “docking” of arbitrary web services, software packages and whatever, as well as connection to shared data stores. The purpose of the workbench would be to record what is done, although it might also provide some automation tools. In many ways this is what I think of when I look at work flow engines like Taverna and platforms for sharing workflows like MyExperiment.

It’s harder in the real world. Here the workbench is, well, the workbench, but the idea of recording everything along with contextual metadata is pretty similar. The challenge lies in recording enough different aspects of what is going on to capture the important stuff without generating a huge quantity of data that can never be searched effectively. It is possible to record multiple video streams, audio, and screencasts of any control computers, but it will be almost impossible to find anything in these data streams.

A challenge that emerges over and over again in laboratory recording is that you always seem to not be recording the thing that you really now need to have. Yet if you record everything you still won’t have it because you won’t be able to find it. Video, image, and audio search will one day make a huge difference to this but in the meantime I think we’re just going to keep muddling on.

The distinction between recording and presenting – and what it means for an online lab notebook

Something that has been bothering me for quite some time fell into place for me in the last few weeks. I had always been slightly confused by my reaction to the fact that on UsefulChem Jean-Claude actively works to improve and polish the description of the experiments on the wiki. Indeed this is one of the reasons he uses a wiki, as the process of making modifications to posts on blogs is generally less convenient and in most cases there isn’t a robust record of the different versions. I have always felt uncomfortable about this because to me a lab book is about the record of what happened – including any mistakes in recording you make along the way. There is some more nebulous object (probably called a report) which aggregates and polishes the descriptions of the experiments.

Now this is fine, but the point is that the full history of a UsefulChem page is immediately available. So the full record is very clearly there – it is just not what is displayed. In our system we tend to capture a warts and all view of what was recorded at the time and only correct typos or append comments or observations to a post. This tends not to be very human readable in most cases – to understand the point of what is going on you have to step above this to a higher level – one which we are arguably not very good at describing at the moment.

I had thought for a long time that this was a difference between our respective fields. The synthetic chemistry of UsefulChem lends itself to a slightly higher level description where the process of a chemical reaction is described in a fairly well defined, community accepted, style. Our biochemistry is more a set of multistep processes where each of those steps is quite stereotyped. In fact for us it is difficult to define where the ‘experiment’ begins and ends. This is at least partly true, but actually if you delve a little deeper and also have a look at Jean-Claude’s recent efforts to use a controlled vocabulary to describe the synthetic procedures a different view arises. Each line of one of these ‘machine readable’ descriptions actually maps very well onto each of our posts in the LaBLog. Something that maps on even better is the log that appears near the bottom of each UsefulChem page. What we are actually recording is rather similar. It is simply that Jean-Claude is presenting it at a different level of abstraction.

And that I think is the key. It is true that synthetic chemistry lends itself to a slightly different level of abstraction than biochemistry and molecular biology, but the key difference actually comes in motivation. Jean-Claude’s motivation from the beginning has been to make the research record fully available to other scientists; to present that information to potential users. My focus has always been on recording the process that occurs in the lab and in particular on capturing the connections between objects and data files. Hence we have adopted a fine grained approach that provides a good record, but does not necessarily make it easy for someone to follow the process through. On UsefulChem the ideal final product contains a clear description of how to repeat the experiment. On the LaBLog this will require tracking through several posts to pick up the thread.

This also plays into the discussion I had some months ago with Frank Gibson about the use of data models. There is a lot to be said for using a data model to present the description of an experiment. It provides all sorts of added value to have an agreed model of what these descriptions look like. However it is less clear to me that it provides a useful way of recording or capturing the research process as it happens, at least in the general case. Stream of consciousness recording of what has happened, rather than stopping halfway through to figure out how what you are doing fits into the data model, is what is required at the recording stage. One of the reasons people feel uncomfortable with electronic lab notebooks is that they feel they will lose the ability to scribble such ‘free form’ notes – the lack of any presuppositions about what the page should look like is one of the strengths of pen and paper.

However, once the record, or records, have been made then it is appropriate to pull these together and make sense of them – to present the description of an experiment in a structured and sensible fashion. This can of course be linked back to the primary records and specific data files but it provides a comprehensible and fine grained description of the rationale for and conduct of the experiment as well as placing the results in context. This ‘presentation layer’ is something that is missing from our LaBLog but could relatively easily be pulled together by writing up the methodology section for a report. This would be good for us and good for people coming into the system looking for specific information.
