Sourceforge for science

I got to meet Jeremiah Faith this morning and we had an excellent, wide-ranging discussion which I will try to capture in more detail later. However, I wanted to get down some thoughts we had at the end of the discussion. We were talking about how to publicise and generate more interest and activity for Open Notebook Science. Jeremiah suggested the idea of a Sourceforge for science: a central clearing house somewhere on the web where projects could be described and people could opt in to contribute. There have been some ideas in this direction, such as Totally retrosynthetic, but I don’t think there has been a lot of uptake there.

This was all tied into the idea of making lab books findable and indexed in places where people might look for them. I have been taken with the way PostGenomic and ChemicalBlogSpace aggregate blogs, particularly blog posts on the peer-reviewed literature, and, in the case of ChemicalBlogSpace, aggregate comments on molecules, based on trawling for InChI keys (I think). So can we propose that one (or both?) of these sites start aggregating online notebook posts? If we could make these posts point at peer-reviewed papers online, it would also be possible to use a modified version of the Blue Obelisk Greasemonkey script that would pop up whenever you were looking at a paper for which there was raw data online.

It wouldn’t be necessary, or perhaps even advisable, to limit these to people strictly practising Open Notebook Science. People could put up data once a paper was published or after a delay. Perhaps we would not even need to require that all the raw data be put up. If the barriers are lowered, more people may do it. A range of appropriate tags (‘Partial raw data is available for this paper’, ‘Full raw data is available for this paper’, ‘Full raw data and associated data is available as an open notebook’) would distinguish between what people are making available. Data could be dropped anywhere online, and through aggregation it would gain more visibility, encouraging people to move from making specific data available towards making all their data available.
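
To make the tagging idea a little more concrete, here is a minimal sketch in Python of how an aggregator might group notebook posts by the paper they point at and report the most open level of data available. Everything in it is an assumption for illustration: the short tag codes (`partial`, `full`, `open-notebook`) stand in for the three tags suggested above, and the `NotebookPost` fields, the paper identifiers and the function names are hypothetical. It does not describe how PostGenomic or ChemicalBlogSpace actually work.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical short codes for the three tags, ordered from least to most open.
LEVELS = ["partial", "full", "open-notebook"]


@dataclass
class NotebookPost:
    url: str    # where the data or notebook entry lives (made-up example URLs)
    paper: str  # identifier of the paper the post points at (hypothetical)
    level: str  # one of LEVELS


def aggregate(posts):
    """Group posts by the paper they point at."""
    index = defaultdict(list)
    for post in posts:
        if post.level not in LEVELS:
            raise ValueError(f"unknown availability tag: {post.level!r}")
        index[post.paper].append(post)
    return dict(index)


def best_level(index, paper):
    """Return the most open tag recorded for a paper, or None if there is no data."""
    posts = index.get(paper, [])
    if not posts:
        return None
    return max((p.level for p in posts), key=LEVELS.index)


posts = [
    NotebookPost("http://example.org/a", "paper-1", "partial"),
    NotebookPost("http://example.org/b", "paper-1", "open-notebook"),
    NotebookPost("http://example.org/c", "paper-2", "full"),
]
index = aggregate(posts)
print(best_level(index, "paper-1"))
```

A browser script of the kind mentioned above would then only need to ask whether `best_level` returns anything for the paper currently being viewed.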

Any thoughts?

Talks on Open Notebook Science – some initial thoughts

So I have given three talks in ten days or so: one at the CanSAS meeting at NIST, one at Drexel University and one at MIT last night. Jean-Claude Bradley was kind enough to help me record the talk at Drexel as a screencast and you can see this in various formats here. He has also made some comments on the talk on the UsefulChem Blog and the Scientific Blogging site.

The talks at Drexel and MIT were interesting. I was expecting the questions to focus more on the issues of being open: the risks, the benefits and the problems. In fact they focused on the technicalities, and in particular on people wanting to get under the hood and play with the underlying data. Several of the questions I was asked could be translated as ‘do you have an API?’. At the moment the answer is no, but we know it is a direction we need to go in.

We have two crucial things we need to address at the moment. The first is automating some of the posting. We believe this needs to be achieved through an application or script that sits outside the blog itself and that can be linked to the process of actually labelling the stuff we make. The second is an API or web service that allows people to get at the underlying data in an automated fashion. This will be useful for us as well, as we move towards doing analysis of our data. Jean-Claude said he was also looking at how to automate processes, so clearly this is the next big step forward.

Another question raised at MIT was how you could retro-fit our approach into an existing blog or wiki engine. One key issue here is templates (which are next on my list to describe here in detail); these would probably require some sort of plugin. The other issue is the metadata. Our blog engine goes one step beyond tagging by providing keys with values. Presumably this could be coded into a conventional engine using RDF or microformats. Perhaps we should be doing this in our blog in any case?
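
As a rough illustration of the difference between plain tags and keys with values, here is a small Python sketch. It is only a guess at how this could look in a conventional engine: the post structure, the example keys (`material`, `procedure`) and the `post-metadata` class name are all hypothetical, and the HTML it produces is merely microformat-style markup, not any registered microformat and not what our blog engine actually emits.

```python
from html import escape


def matches(metadata, **wanted):
    """True if every requested key is present with exactly the requested value."""
    return all(metadata.get(key) == value for key, value in wanted.items())


def to_html(metadata):
    """Render key-value pairs as a definition list that could sit inside a post."""
    rows = "".join(
        f"<dt>{escape(key)}</dt><dd>{escape(value)}</dd>"
        for key, value in sorted(metadata.items())
    )
    return f'<dl class="post-metadata">{rows}</dl>'


# Hypothetical posts: 'tags' is what a conventional engine offers,
# 'metadata' is the extra keys-with-values layer.
posts = [
    {"title": "Entry 1", "tags": ["pcr"], "metadata": {"material": "DNA", "procedure": "PCR"}},
    {"title": "Entry 2", "tags": ["gel"], "metadata": {"material": "DNA", "procedure": "Gel"}},
]

pcr_posts = [p["title"] for p in posts if matches(p["metadata"], procedure="PCR")]
print(pcr_posts)
print(to_html(posts[0]["metadata"]))
```

The point of the key is that a query can ask for `procedure="PCR"` specifically, rather than for any post that happens to carry a `pcr` tag for whatever reason.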

Incidentally a point I made in both talks, partly in response to the question ‘does anyone really look at it’, is that in many cases it is your own access you are enabling. Making it open means you can always get at your own data, which is a surprisingly helpful thing.

The CanSAS meeting was also interesting. This is traditionally a meeting where Small Angle Scattering instrument scientists, the people who maintain and support these instruments at large-scale neutron and X-ray facilities, fail to agree on a standard data format. I wanted to make two points: first, the general point that making data available is a good thing, and second, that making the instrument data available without a detailed description of the sample is pretty useless. However, against all precedent, they not only agreed a data format but chose a flexible XML format that allows different tags for different ‘dialects’. So I can insert a tag into the data file that will point to our lab book, which is what I wanted.
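
To show the sort of thing I mean by inserting a tag that points to the lab book, here is a short Python sketch using the standard library XML parser. The element names (`scattering_data`, `sample`, `labbook_link`) and the URL are invented for the example; they are not the element names of the format agreed at the meeting, which I have not reproduced here.

```python
import xml.etree.ElementTree as ET

# A made-up data file; the real format will have its own element names.
DATA_FILE = "<scattering_data><sample>example sample</sample></scattering_data>"


def add_notebook_link(xml_text, url):
    """Return the XML with an extra (hypothetical) element pointing at a lab book entry."""
    root = ET.fromstring(xml_text)
    link = ET.SubElement(root, "labbook_link")
    link.text = url
    return ET.tostring(root, encoding="unicode")


def notebook_links(xml_text):
    """List every lab book URL recorded in the file."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter("labbook_link")]


linked = add_notebook_link(DATA_FILE, "http://example.org/notebook/entry-1")
print(notebook_links(linked))
```

Anyone reading the data file can then follow the link back to the description of the sample, which is the part that makes the instrument data useful.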

Today I head off to talk to the OpenWetWare developers and the Simile group so that will be very interesting. More details as I have time to post.