This started out as a comment on Peter Murray-Rust’s response to my post and grew to the point where it seemed to warrant its own post. We need a better medium (or perhaps a semantic markup framework for Blogs?) in which to capture discussions like this, but that’s a problem for another day…
[…quoting my post in part…] PMR: This is very important and I shall draw heavily on this and add my interpretation. Simply put, the whole idea of “putting data in repositories” is misguided. It is not addressing the needs of the scientific community (and I’m not going to expand ideas here because they are only half formed).
Cameron – I’d be grateful for any more thoughts on this issue – public or private. They will be attributed, of course. Your ideas will probably form the “front end” for the work that the Soton group has been doing so attribution will be important there.
Very happy to offer thoughts in public (given I didn’t respond to your original request :) I think the current conversation on my blog will continue to run for a while yet. But it is very much thoughts which are still forming and which I will try to capture as they come out. At the core of this I think is the issue of matching the processes by which we capture the ‘experiment’ to the scientist’s practice. Data in repositories is important; metadata is important; and I am convinced (and will argue where I can and where people will listen) that putting extra effort into recording metadata will make our life easier as experimentalists in the future. But it is only half the story.
I think there is a critical difference between capturing what happens in a lab and reporting the information that arises from an experiment. I suspect part of this comes from my perspective as a chemist/protein engineer as opposed to a biologist or foo-informatician. Jean-Claude has reflected in the past about how we may be generating more useful data for the sociologists of science trying to understand how scientific disciplines differ in how they work than for ourselves.
Making stuff is a messy business. This is part of what drives my advocacy of our ‘one item – one post’ system. Objects and samples need a URI. It can be very difficult to describe what an object is. It could be a tube of liquid, a crystalline solid, a technological artifact, a piece of paper, the state of a building/organism/culture. What is important is not that we know what something is (because in molecular biology at least we almost never know precisely what our samples contain) but that we can point at it. This fits badly into any repository system with any (useful) semantic framework. But we can talk about its properties, or where it came from, or where it is going to without any idea of what it is. Indeed once we know what it is (to some, hopefully well defined, precision) we can then fit it into a nice pre-existing semantic framework, or if necessary build one to fit.
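To make the ‘point at it first, describe it later’ idea a little more concrete, here is a minimal sketch in Python. It is not our actual notebook system: the `Notebook` and `Item` names, the `example.org` base URI and the sample properties are all made up for illustration. The only point is that an item gets a URI as soon as it is posted, can carry properties and a record of where it came from, and can have its identity (`kind`) filled in later or never.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Optional

# Hypothetical base URI, purely for illustration.
BASE_URI = "http://example.org/lab/item/"


@dataclass
class Item:
    """One item, one post: something we can point at."""
    uri: str
    label: str = ""
    kind: Optional[str] = None  # what it *is*; often unknown at first
    properties: dict = field(default_factory=dict)
    derived_from: list = field(default_factory=list)  # URIs of parent items


class Notebook:
    """A toy notebook that mints a URI for every item posted to it."""

    def __init__(self):
        self._ids = count(1)
        self.items = {}

    def post(self, label="", derived_from=(), **properties):
        # Mint the URI first; no description of the contents is required.
        uri = f"{BASE_URI}{next(self._ids)}"
        item = Item(
            uri=uri,
            label=label,
            properties=dict(properties),
            derived_from=[parent.uri for parent in derived_from],
        )
        self.items[uri] = item
        return item

    def lineage(self, item):
        """Return the URIs of every ancestor of item, nearest first."""
        seen = []
        queue = list(item.derived_from)
        while queue:
            uri = queue.pop(0)
            if uri not in seen:
                seen.append(uri)
                queue.extend(self.items[uri].derived_from)
        return seen


nb = Notebook()
culture = nb.post("overnight culture", volume_ml=50)
lysate = nb.post("cleared lysate", derived_from=[culture])
tube = nb.post("tube 7", derived_from=[lysate], colour="pale yellow")

# We still have no idea what is in tube 7, but we can say where it came from.
print(tube.uri, tube.kind, nb.lineage(tube))
```

Once the contents of tube 7 are known to some reasonable precision, `tube.kind` can be set and the item mapped onto whatever pre-existing semantic framework fits, without the URI or its history changing.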
I think I am heading towards the notion that we capture as much detail as we possibly can, in as flexible a format as possible. This is the idea of the ‘blogging lab’, which includes the ‘blogging experimenter’. The lab notebook system (or its equivalent, whatever it looks like) is the first step in structuring the raw morass which we generate. As we move forward to interpretation we start to structure more, to actually add meaning to what we have generated. The final stage is to sculpt this into a statement, or a finding, and then to formally publish that result (in whatever form is appropriate).
The last stage involves a repository that we already have, if in a slightly imperfect form: the peer reviewed literature. The stage before that is where deposition into formal repositories is appropriate. Hopefully in the stage before that (the lab notebook) we have generated and collated our metadata in a useful form that makes this deposition easy, and ideally automatic.
We’ve been talking recently about lifestreaming [1], [2], [3], [4], [5] as well as doing it. I haven’t heard anyone complaining about the lack of semantic mark up people have been putting up on Friend Feed :) This is like the ‘LabStream’, just raw data. Meaning requires processing. And processing involves markup.
Perhaps a lot of this really does revolve around definitions. Data is not information. Data, by definition, is raw. This is why we need data models. But the scientific approach is to look at the data and try to derive a plausible model. You can bet that if you come up with a model and then ask the world to fit into it, it isn’t going to go. That’s not the way science works (and it’s way too late for me to get into arguments about the philosophy vs practice of science). Partly the resistance to the application of data models is just arrogance: ‘No computer scientist can fit my science into their neat model’. But it is also part and parcel of the way we think. Doesn’t mean it’s the best way to do it, mind…
Enough stream of consciousness writing; time to go to bed. Hope this is useful. Just to finish, a note on attribution. If my ideas are the ‘front end’ to the work of others (mainly Jeremy Frey’s group at Southampton) then the tools they have developed and the discussions we have had are definitely more than just the ‘back end’ to my ideas. In fact, I’ve taken to acknowledging the ‘Open Science Collective’ (yes, that’s all of you) in talks. I’ve lost track of where many of the ideas originally came from and how they developed; how many of these concepts are actually mine, and how many came from others. The process of developing them has been community driven. The data is all in these blogs somewhere. Someone want to come up with a data model and mark them all up to trace the passage and development of the ideas? After all, we’re doing science here, aren’t we?
many thanks