Following up on data storage issues

There were lots of helpful comments on my previous post as well as some commiseration from Peter Murray-Rust. Also Jean-Claude Bradley’s group is starting to face some similar issues with the combi-Ugi project ramping up. All in the week that the Science Commons Open Data protocol is launched. I just wanted to bring out a few quick points:

The ease with which new data types can be incorporated into UsefulChem, such as the recent incorporation of a crystal structure (see also JC’s Blog Post), shows the flexibility and ease provided by an open ended and free form system in the context of the Wiki. The theory is that our slightly more structured approach provides more implicit metadata, but I am conscious that we have yet to demonstrate the extraction of the metadata back out in a useful form.

Bill comments:

…I think perhaps the very first goal is just getting the data out there with metadata all over it saying “here I am, come get me”.

I agree that the first thing is simply to get the data up there, but the next question arising from this comment must be: how good is our metadata in practice? So for instance, can anyone make any sense out of this in isolation? Remember you will need to track back through links to the post where this was ‘made’. Nonetheless I think we need to see this process through to its end. The comparison with UsefulChem is helpful because we can decide whether the benefits of our system outweigh the extra fiddling involved, or conversely how much less challenging we have to make the fiddling for it to be worthwhile. At the end of the day, these are experiments in the best approaches to doing ONS.

One thing that does make our life easier is an automatic catalogue of input materials. This, and the ability to label things precisely for storage, is making a contribution to the way the lab is running. In principle something similar can be achieved for data files. The main distinction at the moment is that we generate a lot more data files than samples, so handling them is logistically more difficult.
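To make the idea of a catalogue for data files a little more concrete, here is a minimal sketch in Python of what an automatic entry might look like. It is not how our system works: the function names and the fields `produced_by` and `sample_ids` are invented for illustration, and the label is simply the start of a checksum of the file contents.

```python
import hashlib
import json
from pathlib import Path


def catalogue_entry(path, produced_by, sample_ids):
    """Build a metadata record for one data file.

    `produced_by` (e.g. the URL of the notebook post that made the file)
    and `sample_ids` are hypothetical fields chosen for this sketch.
    """
    data = Path(path).read_bytes()
    digest = hashlib.sha256(data).hexdigest()
    return {
        "label": digest[:12],           # short label derived from the contents
        "filename": Path(path).name,
        "size_bytes": len(data),
        "sha256": digest,               # lets anyone check they have the same file
        "produced_by": produced_by,
        "sample_ids": list(sample_ids),
    }


def write_sidecar(path, entry):
    """Write the record next to the data file as <name>.meta.json."""
    sidecar = Path(str(path) + ".meta.json")
    sidecar.write_text(json.dumps(entry, indent=2))
    return sidecar
```

The point of the sidecar file is that the record travels with the data, so someone who finds the file in isolation has at least a link back to the post and samples it came from.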

Jean-Claude and Jeremiah have commented further on Jeremiah’s Blog on some of the fault lines between computational and experimental scientists. I just wanted to bring up a comment made by Jeremiah:

It would be easier to understand however, if you used more common command-based plotting programs like gnuplot, R, and matlab.

This is quite a common perception: ‘If you just used a command line system you could simply export the text file’. The thing is, and I think I speak for a lot of wet biologists and indeed chemists, that we simply can’t be bothered. It is too much work to learn these packages, and fighting with command lines isn’t generally something we are interested in doing – we’d rather be in the lab.

One of the very nice things about the data analysis package I use, Igor Pro, is that it has a GUI built in but it also translates menu choices and mouse actions into a command line at the bottom of the screen. What is more, it has quite a powerful programming language which uses exactly the same commands. You start using it by playing with the mouse, you become more adept at repeating actions by cutting and pasting stuff in the command line, and then you can (almost) write a procedure by pasting a bunch of lines into a procedure file. It is, in my view, the outstanding example of a user interface that not only provides functionality for the novice and expert user in an easily accessible way, but also guides the novice into becoming a power user.

But for most applications we can’t be bothered (or more charitably don’t have the time) to learn MATLAB or Perl or R or gnuplot (and certainly not TeX!). Perhaps the fault line lies on the division between those who prefer to use Word and those who prefer TeX. One consequence of this is that we use programs that have an irritating tendency to have proprietary file formats. Usually we can export a text file or something a bit more open. But sometimes this is not possible. It is almost always an extra step, an extra file to upload, so even more work. Open document formats are definitely a great step forward and XML file types are even better. But we are a bit stuck in the middle of a slowly changing process.

None of this is to say that I think we shouldn’t put the effort in, but more to say that, from the perspective of those of us who really don’t like to code, and particularly those of us generating data from ‘beige box’ instruments, the challenge of ‘No insider information’ is even harder. As Peter M-R says, the glueware is both critical and the hardest bit to get right. The problem is, I can’t write glueware, at least not without sticking my fingers to each other.

2 Replies to “Following up on data storage issues”

  1. We’re still at a manageable situation with the CombiUgi project, with about 30 results in our table
    http://spreadsheets.google.com/pub?key=plwwufp30hfpUERhse9y5Kw

    There are many more experiments than that but they were flawed or failed in some way so didn’t make it to the master table. But all the attempts are available here:
    http://usefulchem.wikispaces.com/All+Reactions

    In terms of database storage, we are lucky in that ChemSpider is willing to help us store all the final spectra (including hopefully the X-ray crystal structures shortly). As for monitoring data, organic chemistry does not burden the researcher with unmanageable amounts (at least how we have been doing it).

    Cameron, as you and I have discussed, we are probably going to end up in a similar place – we are just taking different routes. We’re starting with making our experiments as human-understandable as possible and not much concern for format. But we’ll be (hopefully soon) inching towards translating these protocols into versions that are much more machine readable (and probably less human readable).

    If you saw the last proposal I discussed on my blog, this is the kind of thing we’d like to work on with ChemSpider (and anyone else interested in collaborating of course).

