
The problem with data…

17 December 2007

Our laboratory blog system has done a reasonable job of handling protocols and simple pieces of analysis so far. While more automation in the posting would be a big benefit, that is a mechanical issue rather than a fundamental problem. To recap, in our system every “item” has its own post. Until now these items have been samples or materials, and the items are linked by posts that describe procedures. This provides a crude kind of triple: Sample X was generated using Procedure A from Material Z. Where we had some analytical data, like a gel, it was generally enough to drop it in at the bottom of the procedure post. I blithely assumed that when we had more complicated data, which might for instance need re-processing, we could treat it the same way as a product or sample.

By coincidence, both Jenny and I have generated quite a bit of data over the last few weeks. I did a Small Angle Neutron Scattering (SANS) experiment at the ILL on Sunday 10 December, and Jenny has been doing quite a lot of DNA sequencing for her project. To deal with the SANS data first: the raw data is in a non-standard image format. Each image needs a significant amount of processing that uses at least three different background measurements. I did a contrast variation series, which essentially means repeating the experiment with different proportions of H2O and D2O, each of which requires its own set of backgrounds.

Problem one is simply that this creates a lot of files. Given that I am uploading these by hand, as you can see here, here and here (and bearing in mind that I still have these ones and five others to do), this is going to get tiring. OK, so this is an argument for some scripting. What I need to do is create a separate post for each of the 50-odd data files. Then I need to describe the data reduction, involving all of those files, down to the relatively small number of twelve independent data files (each with their own post). All of this data reduction is done with specially written software, generally by the instrument scientist supporting the experiment, so describing it is quite difficult.
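
As a sketch of what "some scripting" could look like: many blog engines of this vintage expose the MetaWeblog XML-RPC API, so a short script could build one stub post per raw data file and push them all in one go. The endpoint URL, credentials, and metadata fields below are all placeholders, not our actual blog setup.

```python
# Sketch: batch-create one blog post per raw data file via the
# MetaWeblog XML-RPC API (assumes the blog exposes it; the URL,
# credentials, and category names here are placeholders).
import os
import xmlrpc.client

def build_post(filename):
    """Build a minimal post payload describing one raw data file."""
    return {
        "title": "SANS raw data: %s" % filename,
        "description": "Raw SANS data file %s from the ILL run." % filename,
        "categories": ["SANS", "raw-data"],
    }

def post_all(data_dir, url, user, password, dry_run=True):
    """Create one post per file in data_dir; dry_run skips the upload."""
    posts = [build_post(f) for f in sorted(os.listdir(data_dir))]
    if not dry_run:
        server = xmlrpc.client.ServerProxy(url)
        for p in posts:
            server.metaWeblog.newPost("1", user, password, p, True)
    return posts
```

With `dry_run=True` you can check the generated payloads before letting the script touch the live blog.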

Then I need to actually start on the data analysis. Describing this is not straightforward, but it is a crucial part of the Open Notebook Science programme. Data is generally what it is – there is not much argument about it. It is the analysis where the disagreement comes in: is it valid, was it done properly, was the data appropriate? Recording the detail of the analysis is therefore crucial. The problem is that the data analysis here involves fiddling. Michael Barton put it rather well in a post a week or so ago:

It would be great, every week, to write “Hurrah! I’ve discovered this new thing to do with protein cost. Isn’t it wonderful?”. However, in the real world it’s “I spent three days arguing with R to get it to order the bars in my chart how I want”.

Data analysis is largely about fiddling until we get something right. In my case I will be writing some code (desperate times call for desperate measures) to deconvolute the contributions of various components in my data. I will be battling not with R but with a package called Igor Pro. How do I, or should I, record this process? SVN/Sourceforge/Google Code might be a good plan, but I’m no proper coder – I wouldn’t really know what to do with these things. And actually this is a minor part of the problem: I can at least record the version of the code whenever I actually use it.

The bigger problem is capturing the data analysis itself. As I said, this is basically fiddling with parameters until they look right. Should I attempt to capture the process by which I refine the parameters, or just the final values? How important is it to capture the process? I think at the core here is the issue that divides experimental scientists from computational scientists. I’ve never met a primarily computer-based scientist who kept a notebook in a form that I recognised. Generally there is a list of files, perhaps some rough notes on what they are, but there is a sense that the record is already there in those files and that all that is really required is a proper index. I think this difference was at the core of the disagreement over whether the Open NMR project is ONS – we have very different views of what we mean by a notebook and what it records. All in all I think I will try to output log files of everything I do and at least put those up.
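
One way to make "log files of everything I do" concrete: record every parameter set tried during fitting, not just the final one, so the explored parameter space goes up with the result. This is only a sketch – the parameter names and the residual measure are invented, and the real fitting happens in Igor Pro, not Python.

```python
# Sketch: log every fitting attempt (parameters + goodness of fit),
# so the whole explored parameter space can be published, not just
# the final values. Parameter names here are placeholders.
import json
import time

class FitLogger:
    def __init__(self):
        self.trials = []

    def record(self, params, residual):
        """Log one fitting attempt: the parameter set and its residual."""
        self.trials.append({
            "time": time.time(),
            "params": dict(params),
            "residual": residual,
        })

    def dump(self, path):
        """Write the whole trial history as a JSON log file."""
        with open(path, "w") as f:
            json.dump(self.trials, f, indent=2)

    def best(self):
        """Return the parameter set with the smallest residual."""
        return min(self.trials, key=lambda t: t["residual"])["params"]
```

The JSON log from `dump` is exactly the kind of file that could be posted alongside the final fitted values.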

In the short term I think we just need to swallow hard and follow our system to its logical conclusion. The volume of data we are generating makes this a right pain to do manually, but I don’t think we have actually broken the system per se. We desperately need two things to make this easier. The first is some sort of partly automated posting process, probably just a script, maybe even something I could figure out myself. The second, for the future, is the ability to run programs that will grab data themselves and then post back to the blog. Essentially we need a web service framework that is easy for users to integrate into their own analysis systems. Workflow engines have a lot of potential here, but I am not convinced they are sufficiently usable yet. I haven’t managed to get Taverna onto my laptop – though before anyone jumps on me, I will admit I haven’t tried very hard. On the other hand, that’s the point: I shouldn’t have to.
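
The "grab data, post back" direction could be as simple as the following sketch: an analysis step fetches its input over HTTP from the notebook and pushes its result to a posting endpoint. The endpoint URL and form fields here are invented placeholders, not a real blog API.

```python
# Sketch: an analysis step that fetches input over HTTP and POSTs
# its result back to a (hypothetical) notebook endpoint.
import urllib.parse
import urllib.request

def build_payload(title, body):
    """Encode a notebook entry as a form-style POST body."""
    return ("title=%s&body=%s" % (
        urllib.parse.quote(title),
        urllib.parse.quote(body),
    )).encode("utf-8")

def fetch_data(url):
    """Grab a raw data file straight from the notebook blog."""
    with urllib.request.urlopen(url) as resp:
        return resp.read()

def post_result(endpoint, title, body):
    """POST a processed result back as a new notebook entry."""
    req = urllib.request.Request(endpoint, data=build_payload(title, body))
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

The point is less the code than the shape: if the blog exposed endpoints like these, any analysis program could be wired into the notebook with a few lines.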

If I have time I will get on to Jenny’s problem in another post. Here the issue is what format to save the data in and how much do we need to divide this process up?


  • About describing your data analysis =>

    SVN/Googlecode/Sourceforge would certainly be excessive for this situation. What you need is a way to reconstruct your image from the original data. For example, if you generate a figure in MATLAB, you can always save the code that you used to generate the plot. The same is the case for R or gnuplot.

    I don’t know Igor Pro, but from their website it appears that it is a GUI based plotting program that has an underlying language:
    http://www.wavemetrics.com/products/igorpro/programming.htm

    If you could get the program to export the underlying code that it generates when you modify the image from the GUI, you’d essentially have all of the information you needed to allow anyone to understand how you modified your image.

    It would be easier to understand, however, if you used more common command-based plotting programs like gnuplot, R, and MATLAB. Really, I think a text description would be fine too (e.g. a 2nd-order polynomial was fit to our original data, which is plotted on a log scale).

    Describe the process or final values =>
    I would say providing the final parameter values is fine unless you did something completely unintuitive and crazy to arrive at them. This is generally the approach used by computer scientists and it works quite well. You just describe the final function, and you only describe how the function came to be if it is in some way “strange”.

    On writing your own script =>
    This might be quite difficult if you don’t have much programming experience. You’re planning on hacking into the blog software to automate the posting? Doing so would require integrating your code into a large base of existing code, which is one of the most difficult programming tasks there is. I don’t want to discourage you – in fact, let me know if you need help along the way – but make sure you do it on a copy of the blog first in case you break anything :)

  • also, here are my thoughts on computational work in ONS:
    http://blog-di-j.blogspot.com/2007/12/role-of-bioinformatics-in-open-notebook.html
    which is somewhat related to some of the ideas you bring up here. In particular, whether projects like Open NMR are ONS.

  • It’s not that capturing your workflow isn’t a good idea and a worthwhile goal, but I think perhaps the very first goal is just getting the data out there, with metadata all over it saying “here I am, come get me”.

    I don’t know Igor Pro from Igor Stravinsky, but I know something about DNA sequence data. If I were interested in your sequences I could work with abi files, text files, even scanned chromatograms if I had to — but first I’d have to know the data were there.

    (After that, I think the preferred format would probably be either raw abi files (since most folks already have the conversion apparatus in hand) or FASTA formatted text files. But talk to a bioinformatician — I seem to recall Neil of nodalpoint complaining about sequence formats recently…)

  • I just posted this comment on Jeremiah’s blog relating to these issues:

    The question of what constitutes Open Notebook Science in non-experimental fields is not simple to answer. You are right that software has been created openly for a long time now. In fact, when I started doing this work I was using the term “Open Source Science”, thinking that the analogy would be obvious, but it was not. Too many people assumed that meant Open Source Scientific Software.

    That’s why I started to use the term Open Notebook Science, bringing the focus to the laboratory notebook. My assumption was that anyone doing experimental science must keep a lab notebook. If that notebook is completely public you are doing ONS – if it isn’t then you’re doing something else.

    When we started collaborating with people doing non-experimental work, like docking, things got a lot more challenging. I’ve tried to maintain “experiment-like” pages on our wiki with links to the libraries, algorithms and result files. But because so much information gets generated I don’t think it is possible to capture all the mistakes like we can with a chemistry experiment.

    I think the key distinction to make is again that of “no insider information”. When a student and PI get together to write a paper, are they using only public data and files to construct that paper? Could someone else not in that group (human or otherwise) construct the equivalent of that same paper using only files made public by the research group? If so then I think they are doing ONS. If not then they are doing something else – it might be a form of Open Science but not ONS.

  • As a bioinformatics guy I’d almost always just want the FASTA file, and sometimes the accompanying quality scores (particularly if the FASTA file wasn’t quality trimmed). As an experimentalist I’d want the FASTA plus the abi or scf chromatogram (so I can check by eye if a base call seems off).

  • I second Jeremiah Faith’s comment. I don’t think you need to explain all the different data analyses you tried unless some of them give relevant results aside from the one you are looking for (e.g. negative results).

  • Hi Pedro, I disagree somewhat here. To me, ‘no insider information’ necessarily includes all the negative results. I think it is important to at least present the parameter space that was investigated. For instance, if it turns out that we fit some data wrongly, it is important to know whether we really searched the ‘correct’ bit of parameter space or whether for some reason our data fitting didn’t go there.

    But equally we have to be practical. We will put up all the raw stuff at least and then work from there.
