The problem with data…
Our laboratory blog system has done a reasonable job of handling protocols and simple pieces of analysis so far. More automation in posting would be a big benefit, but that is a mechanical issue rather than a fundamental problem. To recap, our system is that every “item” has its own post. Until now these items have been samples or materials. The items are linked by posts that describe procedures. This provides a crude kind of triple: Sample X was generated using Procedure A from Material Z. Where we have some analytical data, like a gel, it was generally enough to drop it in at the bottom of the procedure post. I blithely assumed that when we had more complicated data, which might for instance need re-processing, we could treat it the same way as a product or sample.
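To make the item-and-procedure model concrete, here is a minimal sketch of the kind of triple it implies. None of these class or field names come from our actual blog software; they are purely illustrative, assuming only what is described above (item posts for samples and materials, procedure posts linking them):

```python
from dataclasses import dataclass, field

@dataclass
class ItemPost:
    title: str  # e.g. "Sample X" or "Material Z"
    kind: str   # "sample" or "material"

@dataclass
class ProcedurePost:
    title: str                                   # e.g. "Procedure A"
    inputs: list = field(default_factory=list)   # material posts used
    outputs: list = field(default_factory=list)  # sample posts produced

    def triples(self):
        """Yield (output, procedure, input) triples of the crude kind
        described in the text: Sample X was generated using Procedure A
        from Material Z."""
        for out in self.outputs:
            for inp in self.inputs:
                yield (out.title, self.title, inp.title)

material_z = ItemPost("Material Z", "material")
sample_x = ItemPost("Sample X", "sample")
procedure_a = ProcedurePost("Procedure A",
                            inputs=[material_z], outputs=[sample_x])

print(list(procedure_a.triples()))
# → [('Sample X', 'Procedure A', 'Material Z')]
```

The point of the sketch is that the triple falls out of the linking structure for free: the procedure post already knows its inputs and outputs, so the relationships never need to be stated separately.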
By coincidence, both Jenny and I have generated quite a bit of data over the last few weeks. I did a Small Angle Neutron Scattering (SANS) experiment at the ILL on Sunday 10 December, and Jenny has been doing quite a lot of DNA sequencing for her project. To deal with the SANS data first: the raw data is in a non-standard image format. Each image needs a significant amount of processing, which uses at least three different background measurements. I did a contrast variation series, which means essentially repeating the experiment with different proportions of H2O and D2O, each of which requires its own set of backgrounds.
Problem one is simply that this creates a lot of files. Given that I am uploading these by hand, as you can see here, here and here (and bearing in mind that I still have these ones and five others to do), this is going to get tiring. Ok, so this is an argument for some scripting. What I need to do is create a separate post for each of the 50-odd data files. Then I need to describe the data reduction, involving all of these files, down to twelve independent reduced data files (each with its own post). All of this ‘data reduction’ is done with specially written software, and is generally done by the instrument scientist supporting the experiment, so describing it is quite difficult.
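The kind of script I have in mind might look something like the sketch below: walk a directory of raw files and build one post payload per file. The payload fields, the directory name, and the idea of then sending each payload to the blog's posting API are all assumptions for illustration, not a description of how our blog actually works:

```python
from pathlib import Path

def build_data_posts(data_dir, experiment_tag):
    """Build one draft post payload per raw data file in data_dir.

    Sketch only: field names are invented, and actually submitting
    the payloads to the blog (e.g. via an Atom or XML-RPC posting
    API) is left out.
    """
    posts = []
    for path in sorted(Path(data_dir).glob("*")):
        if path.is_file():
            posts.append({
                "title": f"Raw SANS data file: {path.name}",
                "tags": [experiment_tag, "raw-data"],
                "body": f"Raw data file {path.name} from the SANS run.",
                "attachment": str(path),
            })
    return posts

# Usage: posts = build_data_posts("sans_run_2007-12-10", "SANS")
# Each payload would then be submitted to the blog instead of
# creating 50-odd posts by hand.
```

Even this much would turn an afternoon of copy-and-paste into a single command, which is the real argument for scripting here.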
Then I need to actually start on the data analysis. Describing this is not straightforward, but it is a crucial part of the Open Notebook Science programme. Data is generally what it is – there is not much argument about it. It is the analysis where the disagreement comes in: is it valid, was it done properly, was the data appropriate? Recording the detail of the analysis is therefore crucial. The problem is that the data analysis here involves fiddling. Michael Barton put it rather well in a post a week or so ago:
It would be great, every week, to write “Hurrah! I’ve discovered this new thing to do with protein cost. Isn’t it wonderful?”. However, in the real world it’s “I spent three days arguing with R to get it to order the bars in my chart how I want”.
Data analysis is largely about fiddling until we get something right. In my case I will be writing some code (desperate times call for desperate measures) to deconvolute the contributions from various things in my data. I will be battling, not with R, but with a package called Igor Pro. How do I, or should I, record this process? SVN/Sourceforge/Google Code might be a good plan, but I’m no proper coder and wouldn’t really know what to do with these things. And actually this is a minor part of the problem; I can at least record the version of the code whenever I actually use it.
The bigger problem is capturing the data analysis itself. As I said, this is basically fiddling with parameters until they look right. Should I attempt to capture the process by which I refine the parameters, or just the final values? How important is it to capture the process? I think at the core here is the issue that divides experimental scientists from computational scientists. I’ve never met a primarily computer-based scientist who kept a notebook in a form that I recognised. Generally there is a list of files, perhaps some rough notes on what they are, but there is a sense that the record is already there in those files and that all that is really required is a proper index. I think this difference was at the core of the disagreement over whether the Open NMR project is ONS – we have very different views of what we mean by a notebook and what it records. All in all, I think I will try to output log files of everything I do and at least put those up.
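One way to capture the fiddling rather than just the end point is an append-only log: every time a set of parameters is tried, write a line recording the values and some goodness-of-fit figure. A minimal sketch, assuming invented field names and nothing about Igor Pro itself:

```python
import json
import time

def log_fit_attempt(logfile, params, goodness):
    """Append one line per fitting attempt: a timestamp, the parameter
    values tried, and a goodness-of-fit number.

    The idea is to record the refinement trail itself, not just the
    final values. All names here are illustrative assumptions.
    """
    entry = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,
        "chi_squared": goodness,
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(entry) + "\n")

# Usage: call once per attempt during fitting, e.g.
#   log_fit_attempt("fits.log", {"scale": 1.1, "background": 0.04}, 1.7)
# The whole log file can then be posted to the notebook afterwards.
```

A one-line-per-attempt log like this is cheap to write during the analysis and still answers the "process versus final values" question: the final values are simply the last line, and everything before them documents how they were reached.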
In the short term I think we just need to swallow hard and follow our system to its logical conclusion. The volume of data we are generating makes this a right pain to do manually, but I don’t think we have actually broken the system per se. We desperately need two things to make this easier. First, some sort of partly automated posting process, probably just a script, maybe even something I could figure out myself. Second, for the future, we need to be able to run programs that will grab data themselves and then post back to the blog. Essentially we need a web service framework that is easy for users to integrate into their own analysis systems. Workflow engines have a lot of potential here, but I am not convinced they are sufficiently usable yet. I haven’t managed to get Taverna onto my laptop yet – though before anyone jumps on me, I will admit I haven’t tried very hard. On the other hand, that’s the point: I shouldn’t have to.
If I have time I will get on to Jenny’s problem in another post. There the issue is what format to save the data in, and how finely we need to divide the process up.