Semantics in the real world? Part I – Why the triple needs to be a quint (or a sext, or…)

I’ve been mulling over this for a while, and seeing as I am home sick (can’t you tell from the rush of posts?) I’m going to give it a go. This definitely comes with a health warning, as it goes way beyond anything I know much about at a technical level. This is therefore handwaving of the highest order. But I haven’t come across anyone else floating the same ideas, so I will have a shot at explaining my thoughts.

The Semantic Web, RDF, and XML are all the product of computer scientists thinking about computers and information. You can tell this because they deal with straightforward declarations that are absolute: X has property Y. Putting aside the issues with the availability of tools and applications, and the fact that triple stores don’t scale well, a central problem with applying these kinds of strategies to the real world is that absolutes don’t exist. I may assert that X has property Y, but what happens when I change my mind, or when I realise I made a mistake, or when I find out that the underlying data wasn’t taken properly? How do we get this to work in the real world?
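To make that concrete, here is a minimal sketch (in Python, with field names entirely of my own invention) of the idea behind the title: extend the bare triple with provenance and a timestamp to make a quint, so that a change of mind becomes a new assertion rather than a contradiction:

```python
from dataclasses import dataclass
from datetime import date

# A bare triple: an absolute statement with no room for doubt.
triple = ("sampleX", "hasMeltingPoint", "54.5 C")

# The same statement as a "quint": who asserted it, and when.
@dataclass
class Quint:
    subject: str
    predicate: str
    obj: str
    asserted_by: str   # provenance: whose claim is this?
    asserted_on: date  # when was it claimed?

# Changing my mind doesn't destroy the record; it just adds a later assertion.
history = [
    Quint("sampleX", "hasMeltingPoint", "54.5 C", "cameron", date(2007, 12, 1)),
    Quint("sampleX", "hasMeltingPoint", "56.0 C", "cameron", date(2007, 12, 14)),
]

# The current best claim is simply the most recent assertion.
latest = max(history, key=lambda q: q.asserted_on)
print(latest.obj)  # "56.0 C"
```

The ‘sext’ would just be one more field on the same record – a confidence, say, or a flag recording that the underlying data was withdrawn.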

Following up on data storage issues

There were lots of helpful comments on my previous post, as well as some commiseration from Peter Murray-Rust. Jean-Claude Bradley’s group is also starting to face similar issues as the combi-Ugi project ramps up. All in the week that the Science Commons Open Data protocol was launched. I just wanted to bring out a few quick points:

The ease with which new data types can be incorporated into UsefulChem, such as the recent incorporation of a crystal structure (see also JC’s Blog Post), shows the flexibility and ease provided by an open-ended, free-form system in the context of the Wiki. The theory is that our slightly more structured approach provides more implicit metadata, but I am conscious that we have yet to demonstrate extracting that metadata back out in a useful form.

Bill comments:

…I think perhaps the very first goal is just getting the data out there with metadata all over it saying “here I am, come get me”.

I agree that the first thing is simply to get the data up there, but the next question arising from this comment must be: how good is our metadata in practice? For instance, can anyone make any sense out of this in isolation? Remember you will need to track back through links to the post where this was ‘made’. Nonetheless I think we need to see this process through to its end. The comparison with UsefulChem is helpful because we can decide whether the benefits of our system outweigh the extra fiddling involved, or conversely how much less challenging we have to make the fiddling for it to be worthwhile. At the end of the day, these are experiments in the best approaches to doing ONS.
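As a sketch of what ‘metadata all over it’ might mean in practice, a sidecar record alongside each uploaded file could look something like the following (the field names are hypothetical, not those of our actual system):

```python
import json

# A hypothetical "sidecar" metadata record for an uploaded data file.
# None of these field names come from our actual system; they illustrate
# the kind of context a file needs to make sense in isolation.
metadata = {
    "file": "gel_image_021.png",
    "created_by": "cn",
    "created_on": "2007-12-18",
    "instrument": "flatbed scanner",               # where the data came from
    "produced_in": "/blog/post/1234",              # link back to the experiment
    "materials_used": ["/materials/dna-prep-07"],  # inputs, by catalogue ID
    "licence": "open data",
}

# Written next to the data file so a harvester can find it and say
# "here I am, come get me".
with open("gel_image_021.png.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```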

One thing that does make our life easier is the automatic catalogue of input materials. This, and the ability to label things precisely for storage, is making a real contribution to the way the lab is running. In principle something similar could be achieved for data files. The main difference at the moment is that we generate a lot more data files than samples, so handling them is logistically more difficult.
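For what it’s worth, here is a rough sketch, on my assumption that the catalogue is essentially a table of uniquely labelled items, of how the same mechanism extends from samples to data files; the concept is trivial, the volume is the problem:

```python
# A hypothetical catalogue: unique labels pointing at descriptions,
# usable for physical samples and, in principle, for data files too.
catalogue = {}

def register(kind, description):
    """Assign the next unique label of a given kind and record the item."""
    label = f"{kind}-{sum(1 for k in catalogue if k.startswith(kind)) + 1:04d}"
    catalogue[label] = {"kind": kind, "description": description}
    return label

tube = register("sample", "DNA prep, column eluate, -20C freezer box 3")
data = register("datafile", "UV-vis scan of " + tube)

print(tube, data)  # e.g. sample-0001 datafile-0001
```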

Jean-Claude and Jeremiah have commented further on Jeremiah’s Blog about some of the fault lines between computational and experimental scientists. I just wanted to pick up on a comment made by Jeremiah:

It would be easier to understand however, if you used more common command-based plotting programs like gnuplot, R, and matlab.

This is quite a common perception: ‘If you just used a command line system you could simply export the text file.’ The thing is, and I think I speak for a lot of wet biologists and indeed chemists here, that we simply can’t be bothered. It is too much work to learn these packages, and fighting with command lines isn’t generally something we are interested in doing – we’d rather be in the lab.
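For context, the workflow being suggested really is only a few lines; here is a minimal sketch in Python with matplotlib (standing in for gnuplot, R, or Matlab, with a made-up data file name):

```python
import numpy as np
import matplotlib.pyplot as plt

# The canonical command-line workflow: load an exported text file, plot it.
# "absorbance.txt" is a made-up example file with two columns:
# wavelength and signal.
wavelength, signal = np.loadtxt("absorbance.txt", unpack=True)

plt.plot(wavelength, signal)
plt.xlabel("Wavelength (nm)")
plt.ylabel("Absorbance")
plt.savefig("absorbance.png")
```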

One of the very nice things about the data analysis package I use, Igor Pro, is that it has a GUI built in, but it also translates menu choices and mouse actions into a command line at the bottom of the screen. What is more, it has a quite powerful programming language that uses exactly the same commands. You start by playing with the mouse, you become more adept at repeating actions by cutting and pasting in the command line, and then you can (almost) write a procedure by pasting a bunch of lines into a procedure file. It is, in my view, the outstanding example of a user interface that not only provides functionality for novice and expert users in an easily accessible way, but also guides the novice into becoming a power user.
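The design idea deserves spelling out. A toy sketch, mine and not Igor’s actual mechanism, of the pattern: every GUI action is routed through a command layer and echoed, so the mouse user gradually absorbs the script vocabulary:

```python
# A toy version of the Igor Pro pattern: menu items don't do the work
# themselves, they issue a command, and every command is echoed so the
# mouse user can see, copy, and eventually write the script equivalent.
def do_command(command, **kwargs):
    args = ", ".join(f"{k}={v!r}" for k, v in kwargs.items())
    print(f"> {command}({args})")   # the echoed command line
    # ... the actual work would be dispatched here ...

def on_menu_smooth_clicked():
    # The GUI layer is a thin wrapper around the command layer.
    do_command("Smooth", trace="scan1", window=5)

on_menu_smooth_clicked()   # prints: > Smooth(trace='scan1', window=5)
```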

But for most applications we can’t be bothered (or, more charitably, don’t have the time) to learn MatLab or Perl or R or GnuPlot (and certainly not TeX!). Perhaps the fault line lies on the division between those who prefer Word and those who prefer TeX. One consequence of this is that we use programs that have an irritating tendency to rely on proprietary file formats. Usually we can export a text file or something a bit more open, but sometimes this is not possible, and it is almost always an extra step, an extra file to upload, so even more work. Open document formats are definitely a great step forward and XML file types are even better, but we are stuck in the middle of a slowly changing process.

None of this is to say that I think we shouldn’t put the effort in, but rather that, from the perspective of those of us who really don’t like to code, and particularly those of us generating data from ‘beige box’ instruments, the challenge of ‘no insider information’ is even harder. As Peter M-R says, the glueware is both critical and the hardest bit to get right. The problem is, I can’t write glueware, at least not without sticking my fingers to each other.