Following up on data storage issues

There were lots of helpful comments on my previous post as well as some commiseration from Peter Murray-Rust. Also Jean-Claude Bradley’s group is starting to face some similar issues with the combi-Ugi project ramping up. All in the week that the Science Commons Open Data protocol is launched. I just wanted to bring out a few quick points:

The ease with which new data types can be incorporated into UsefulChem, such as the recent incorporation of a crystal structure (see also JC’s Blog Post), shows the flexibility and ease provided by an open ended and free form system in the context of the Wiki. The theory is that our slightly more structured approach provides more implicit metadata, but I am conscious that we have yet to demonstrate the extraction of the metadata back out in a useful form.

Bill comments:

…I think perhaps the very first goal is just getting the data out there with metadata all over it saying “here I am, come get me”.

I agree that the first thing is simply to get the data up there, but the next question arising from this comment must be: how good is our metadata in practice? So for instance, can anyone make any sense out of this in isolation? Remember you will need to track back through links to the post where this was ‘made’. Nonetheless I think we need to see this process through to its end. The comparison with UsefulChem is helpful because we can decide whether the benefits of our system outweigh the extra fiddling involved, or conversely how much less challenging we have to make the fiddling for it to be worthwhile. At the end of the day, these are experiments in the best approaches to doing ONS.

One thing that does make our life easier is an automatic catalogue of input materials. This, and the ability to label things precisely for storage, is making a contribution to the way the lab is running. In principle something similar can be achieved for data files. The main distinction at the moment is that we generate a lot more data files than samples, so handling them is logistically more difficult.

Jean-Claude and Jeremiah have commented further on Jeremiah’s Blog on some of the fault lines between computational and experimental scientists. I just wanted to bring up a comment made by Jeremiah:

It would be easier to understand however, if you used more common command-based plotting programs like gnuplot, R, and matlab.

This is quite a common perception: ‘If you just used a command line system you could simply export the text file’. The thing is, and I think I speak for a lot of wet biologists and indeed chemists, that we simply can’t be bothered. It is too much work to learn these packages, and fighting with command lines isn’t generally something we are interested in doing – we’d rather be in the lab.

One of the very nice things about the data analysis package I use, Igor Pro, is that it has a GUI built in but it also translates menu choices and mouse actions into a command line at the bottom of the screen. What is more, it has quite a powerful programming language which uses exactly the same commands. You start using it by playing with the mouse, you become more adept at repeating actions by cutting and pasting stuff in the command line and then you can (almost) write a procedure by pasting a bunch of lines into a procedure file. It is, in my view, the outstanding example of a user interface that not only provides functionality for the novice and expert user in an easily accessible way, but also guides the novice into becoming a power user.

But for most applications we can’t be bothered (or more charitably don’t have the time) to learn MatLab or Perl or R or GnuPlot (and certainly not TeX!). Perhaps the fault line lies on the division between those who prefer to use Word and those who prefer TeX. One consequence of this is that we use programs that have an irritating tendency to use proprietary file formats. Usually we can export a text file or something a bit more open. But sometimes this is not possible. It is almost always an extra step, an extra file to upload, so even more work. Open document formats are definitely a great step forward and XML file types are even better. But we are a bit stuck in the middle of a slowly changing process.

None of this is to say that I think we shouldn’t put the effort in, but more to say that, from the perspective of those of us who really don’t like to code, and particularly those of us generating data from ‘beige box’ instruments, the challenge of ‘No insider information’ is even harder. As Peter M-R says, the glueware is both critical and the hardest bit to get right. The problem is, I can’t write glueware, at least not without sticking my fingers to each other.

The problem with data…

Our laboratory blog system has been doing a reasonable job of handling protocols and simple pieces of analysis thus far. While more automation in the posting would be a big benefit, this is more a mechanical issue than a fundamental problem. To re-cap our system is that every “item” has its own post. Until now these items have been samples, or materials. The items are linked by posts that describe procedures. This system provides a crude kind of triple; Sample X was generated using Procedure A from Material Z. Where we have some analytical data, like a gel, it was generally enough to drop that in at the bottom of the procedure post. I blithely assumed that when we had more complicated data, that might for instance need re-processing, we could treat it the same way as a product or sample.
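As a minimal sketch of the ‘crude triple’ idea, the links between posts could be held as simple records and followed backwards from any item to its starting material. This is an illustration only, not how the blog actually stores anything: the `Triple` record, the `trace_back` function and the ‘Data file 1’ and ‘Procedure B’ entries are all made up for the example, and it assumes each item is produced by exactly one procedure from one input.

```python
from collections import namedtuple

# Hypothetical record for one procedure post: it links an input item to an output item.
Triple = namedtuple("Triple", ["product", "procedure", "source"])


def trace_back(item, triples):
    """Follow the links from an item back towards its starting material.

    Assumes each product comes from a single procedure and a single source.
    """
    by_product = {t.product: t for t in triples}
    chain = []
    seen = set()  # guards against looping forever if the links are circular
    while item in by_product and item not in seen:
        seen.add(item)
        link = by_product[item]
        chain.append(link)
        item = link.source
    return chain


triples = [
    Triple("Sample X", "Procedure A", "Material Z"),
    Triple("Data file 1", "Procedure B", "Sample X"),  # made-up data item
]

history = trace_back("Data file 1", triples)
for link in history:
    print(f"{link.product} was generated using {link.procedure} from {link.source}")
```

Treating a data file as just another product means the same lookup works for samples and data alike, which is the assumption being tested in this post.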

By co-incidence both Jenny and I have generated quite a bit of data over the last few weeks. I did a Small Angle Neutron Scattering (SANS) experiment at the ILL on Sunday 10 December, and Jenny has been doing quite a lot of DNA sequencing for her project. To deal with the SANS data first: the raw data is in a non-standard image format. This image needs a significant quantity of processing which uses at least three different background measurements. I did a contrast variation series, which means essentially repeating the experiment with different proportions of H2O and D2O, each of which requires its own set of backgrounds.

Problem one is just that this creates a lot of files. Given that I am uploading these by hand you can see here, here and here (and bearing in mind that I still have these ones and five others to do), that this is going to get a bit tiring. Ok, so this is an argument for some scripting. However what I need to do is create a separate post for all 50-odd data files. Then I need to describe the data reduction, involving all of these files, down to the relatively small number of twelve independent data files (each with their own post). All of this ‘data reduction’ is done on specially written software, and is generally done by the instrument scientist supporting the experiment, so describing it is quite difficult.
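A rough sketch of what the scripting might look like: one draft post per raw data file, plus a single procedure post that links the raw files to the reduced ones. Everything here is an assumption made for illustration. The file names, the dictionary fields and the function names are invented, and the step that actually sends a post to the blog is left out because it depends on whatever interface the blog offers.

```python
def draft_posts(filenames, experiment):
    """Build one draft post (a plain dict) per raw data file."""
    posts = []
    for name in sorted(filenames):
        posts.append({
            "title": f"{experiment}: raw data file {name}",
            "type": "data",
            "attachment": name,
        })
    return posts


def draft_reduction_post(raw_posts, outputs, description):
    """Build the procedure post that links the raw files to the reduced files."""
    return {
        "title": "Data reduction",
        "type": "procedure",
        "inputs": [post["title"] for post in raw_posts],
        "outputs": list(outputs),
        "body": description,
    }


# Made-up file names standing in for the 50-odd raw files.
raw = draft_posts(["run003.dat", "run001.dat", "run002.dat"], "SANS contrast series")
reduction = draft_reduction_post(
    raw,
    ["reduced_01.txt"],
    "Reduced by the instrument scientist; details to follow.",
)
print(len(raw), "data posts drafted, reduction post links", len(reduction["inputs"]), "inputs")
```

Building the drafts separately from posting them means the tedious part (one post per file, all linked from the reduction post) can be checked by eye before anything goes up.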

Then I need to actually start on the data analysis. Describing this is not straightforward. But it is a crucial part of the Open Notebook Science programme. Data is generally what it is – there is not much argument about it. It is the analysis where the disagreement comes in – is it valid, was it done properly, was the data appropriate? Recording the detail of the analysis is therefore crucial. The problem is that the data analysis for this involves fiddling. Michael Barton put it rather well in a post a week or so ago;

It would be great, every week, to write “Hurrah! I’ve discovered to this new thing to do with protein cost. Isn’t it wonderful?”. However, in the real world it’s “I spent three days arguing with R to get it to order the bars in my chart how I want”.

Data analysis is largely about fiddling until we get something right. In my case I will be writing some code (desperate times call for desperate measures) to deconvolute the contributions from various things in my data. I will be battling, not with R but with a package called Igor Pro. How do I, or should I, record this process? SVN/Sourceforge/Google Code might be a good plan but I’m no proper coder – I wouldn’t really know what to do with these things. And actually this is a minor part of the problem; I can at least record the version of the code whenever I actually use it.

The bigger problem is actually capturing the data analysis itself. As I said, this is basically fiddling with parameters until they look right. Should I attempt to capture the process by which I refine the parameters? Or just the final values? How important is it to capture the process? I think there is at core here the issue that divides the experimental scientist from the computational scientist. I’ve never met a primarily computer-based scientist who kept a notebook in a form that I recognised. Generally there is a list of files, perhaps some rough notes on what they are, but there is a sense that the record is already there in those files and that all that is really required is a proper index. I think this difference was at the core of the disagreement over whether the Open NMR project is ONS – we have very different views of what we mean by notebook and what it records. All in all I think I will try to output log files of everything I do and at least put those up.
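One possible shape for such a log, sketched in Python rather than Igor Pro: every time the parameters are changed, append a timestamped entry with the values and a short note, so that the whole path to the final values is kept and not just the end point. The parameter names (‘radius’, ‘background’), the notes and the function names are made up for the example; this is a sketch of the idea, not a record of any real analysis.

```python
import json
from datetime import datetime, timezone


def log_attempt(log, parameters, note=""):
    """Append one timestamped record of the parameter values that were tried."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "parameters": dict(parameters),
        "note": note,
    }
    log.append(entry)
    return entry


def to_json_lines(log):
    """Render the log as one JSON object per line, ready to upload as a text file."""
    return "\n".join(json.dumps(entry, sort_keys=True) for entry in log)


log = []
# Hypothetical parameter names and values, for illustration only.
log_attempt(log, {"radius": 20.0, "background": 0.05}, "first guess")
log_attempt(log, {"radius": 23.5, "background": 0.04}, "looks better at low q")
print(to_json_lines(log))
```

A plain text file with one entry per line is about as open a format as it gets, and the last line always holds the final values for anyone who only wants those.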

In the short term I think we just need to swallow hard and follow our system to its logical conclusion. The amount of data we are generating makes it a right pain to do this manually but I don’t think we have actually broken the system per se. We desperately need two things to make this easier. The first is some sort of partly automated posting process, probably just a script, maybe even something I could figure out myself. But for the future we need to be able to run programs that will grab data themselves and then post back to the blog. Essentially we need a web service framework that is easy for users to integrate into their own analysis system. Workflow engines have a lot of potential here but I am not convinced they are sufficiently useable yet. I haven’t managed to get Taverna onto my laptop yet – but before anyone jumps on me I will admit I haven’t tried very hard. On the other hand that’s the point. I shouldn’t have to.

If I have time I will get on to Jenny’s problem in another post. Here the issue is what format to save the data in and how much do we need to divide this process up?

Seeking advice and resources on Open Notebook Science

The following comment was posted to the ‘About‘ page by Sharon Sonenblum from Georgia Tech. Rather than leave it there where people might not see it I thought I would bring it to the front for everyone’s attention.

‘I’m looking for some resources or direction for diving into open notebook science. I have been interested in the concept for quite some time and recently began following this blog and a few others. I am excited to see that ONS is real and growing, but I’m not sure the best places to start. I want to find out what other folks are doing, what software they are using and what has and has not worked. I also would love to chat with anyone doing research with human subjects to figure out how IRB restrictions play out in ONS.’

Hi Sharon, great to see people interested in ONS! I am sure others will offer comments and suggestions but I will put my tuppence in first. My main suggestion would be to dive in and see what works for you, within the limitations of what you can do. Depending on the kind of work you are doing and how you are already recording it there are a range of options. As I mentioned in yesterday’s post there are as many different approaches to ONS as there are people doing it. We are definitely at the stage of exploring what is possible, what works, and there is plenty of discussion and indeed disagreement over what the best approach is.

There are really two places you could start. The easiest, and possibly the safest way to dip your toes into the water, is to start up a blog that discusses your lab work in general. There are good examples of this kind of approach with Rosie Redfield’s lab being one of the main proponents (see also Michael Barton’s blog). This can be, but is not necessarily, Open Notebook Science as defined by Jean-Claude Bradley. From what you say there may be real issues with you making your primary data available. If it involves human subjects then I would imagine it will be very difficult, if not impossible, to make the raw data available due to ethical considerations. Certainly I would expect that any review board would require that any data that was released was anonymised and that subjects understood exactly what the release conditions would be. I am no expert in ethics and we don’t (as far as I know) have anyone in the ONS community who is dealing with either human or animal subjects. This is an area that I think is important and that we have yet to explore in detail; if we believe that some science (say chemistry) should be fully open but that some (e.g. small scale drug trials) cannot be, then can we draw clear boundaries? I don’t know the answer but clearly some care is required with this.

If you can get clearance to go fully to ONS then there are a range of options. I would say it depends a lot on what sort of data you are dealing with. Take a look at your existing lab book and see what it looks like. Is it an electronic document already? Could you simply put that online? Is it an index to a set of data files, spreadsheets, graphs, analysis? If so a Wiki may be the best approach and using a free hosted service, either Wikispaces as used by UsefulChem, or OpenWetWare, could be a good option. Here you can add data files and then add pages that describe and index them, as well as pages for analysing and discussing the results. Is your lab book more of a journal? Then a Blog may be the best approach, although you need to be careful here about date stamps as many blog engines allow you to change the date stamp. We use an in-house developed blog at Southampton that gets around some of these problems but this is definitely an alpha to beta stage product.

Finally, make sure you discuss it with the people around you. Many scientists are deeply uncomfortable with the whole idea of making the lab notebook available. Be sure that you understand and take into account any concerns. In some cases they may not be valid concerns but as with anything there are real risks with the open notebook approach. Take the opportunity to understand any concerns and be prepared to argue where you think they are unjustified, but in a constructive way. Hopefully you can find good discussion points on this blog, at UsefulChem, Open Reading Frames (see also Bill’s excellent three part series at 3 Quarks Daily), petermr’s blog, Jeremiah Faith’s blog, Michael Barton’s blog, What you’re doing is rather desperate, Public Ramblings, BBGM…who have I missed?

Good luck and keep us updated! The best thing about ONS is the conversations that can get started.

A big few weeks for open (notebook) science

So while I have been buried in the paper- and lab-work there has been quite a lot of interesting stuff going on. Pedro Beltrao has started an Open Notebook style project at Google Code which he describes in a post on Public Ramblings. This is interesting, because once again someone is using a different system as an Open Notebook. We have Wikis, Blogs, TeX based documents, and now, software version repositories being used. As Jean-Claude Bradley has said and we have discussed, we have a lot to learn from exploring different systems, both in terms of understanding the benefits and limitations of specific systems on the way to designing and implementing better ones, and also from the perspective of what this tells us about how we do our science, and how this differs from discipline to discipline. Indeed, there already seems to be a place where this discussion has started in Pedro’s system. It is great to see this going forward and also great to see other members of the community, including Bill Hooker and Michael Barton, already getting in and getting their hands dirty. I only wish I could contribute a bit more on the science itself.

Also good is the publicity that Open Notebooks and Open Notebook Science are getting. An article in Chemistry World, the members’ journal of the Royal Society of Chemistry, features UsefulChem, and discussion from Peter Murray-Rust, Steve Bachrach and others. Our efforts at Southampton even get a mention! What is good about this is not so much the personal publicity but that the mainstream ‘industry’ journals are increasingly starting to pick up the story. Not so long ago there was the article in Wired; Chemistry World has also recently discussed the issues associated with openness in a reasonably balanced manner (see also Peter Suber and Peter Murray-Rust’s commentaries).

In addition there is good coverage on the web. Rosie Redfield’s lab pages got featured by David Ng on World’s Fair on Science Blogs which was also picked up at BoingBoing (thanks to Neil Saunders for bringing this to my attention). Momentum is building as Neil says. The issues are becoming mainstream and the benefits are starting to flow through in specific cases. This is how things start to change. The challenge is in maintaining this forward momentum as it builds.

The OPEN Research Network Proposal – update and reflections

Despite all evidence to the contrary, I have not in fact fallen off the end of the world. I have just been a little run off my feet over the last week or so. A quick weekend trip to the south of France (see here for probably rather too much detail) and a lot of other things, not least some wrangling over allowed costs for the grant, have been keeping me busy.

The research network proposal was successfully submitted on Tuesday 27th November, some six days after I proposed here the possibility of applying for this grant. To echo what Mat Todd said, I haven’t ever been involved with a grant proposal that came together so fast, and while it still involved several days with very little sleep on my part it could not have been put together without a great deal of assistance from a large number of people. The final version of the proposal is here and I will try to put up a page on OpenWetWare for further discussion. The text of the case for support is also available at Nature Precedings. Precedings were uncomfortable about hosting the financial details of the proposal and I think this is interesting in its own right and will write on it later. Here, however, I want to reflect on the process of preparing the grant and what worked well and what didn’t.

Finding the community

The use of this Blog and the subsequent diffusion of the request for help through a number of other blogs was very effective and quite rapid. Diffusion was important and the proposal was featured on a wide range of blogs (1, 2, 3, 4, 5…others?). Given the very short time scale the number of people that became involved was really very high. People are able to move much faster than organisations so on the timescale that we were working it wasn’t possible to get organisations such as PLoS, Nature Publishing Group, BioMedCentral etc. formally involved by the time of the grant submission. I am still very keen to get the involvement of organisations like these and others and it isn’t too late to send a letter of support as I can update these at any time.

It is interesting to contrast this with the response I received to my earlier request for collaborators on the protein-DNA ligation project. In the case of the network proposal I was very rapidly swamped with support whereas for the actual science based project I haven’t had a response as yet. I think this is a good demonstration that while the Open and Connected approach can be effective, it is currently working best for development and networking projects associated with open and connected practices. As a research community we work very well on our common interests, where we have critical mass. However beyond this, in the areas of our ‘real research’, we are not yet seeing the potential benefits to anywhere near the same extent. I believe this is because we don’t yet have either critical mass or a sufficiently connected network of researchers. In my view a central aim of the Research Network should be to break out of the ghetto and start to enable and demonstrate the benefits we know and have seen in the context of a wider range of scientific disciplines.

Writing the grant

In the process of writing and editing the grant it became clear that contributors have very different ‘contribution styles’ and that different types of contribution had higher or lower chances of making it into the final document. The proposal was written in GoogleDocs based on a first draft that I put together rather rapidly. The structure changed significantly over the course of the six days. Some contributors preferred to email specific comments whereas some got right in and hacked away at the text. At times there were six or ten people simultaneously editing the document. I am particularly grateful to those who spent the last night before submission going through and finding typos (although there are still quite a few, I am embarrassed to admit). This made the final stages of ‘cleaning up’ much easier.

Those who directly edited the document saw a much higher chance of their changes making it into the final document. Email comments were also valuable and were included or taken account of in many cases but because they were less immediate there was a greater tendency for them to be passed over or simply lost in the rush. At all times I took the arbitrary decision that I would delete, adapt, or add text as I saw fit. While a consensus approach may have worked if more time had been available, with the time constraints imposed it seemed to me that strong ‘editorial’ guidance was required to hit the final target.

Overall, this was a relatively pleasant way to write a proposal. The fact that many eyes went over the text was a great help and made me much more confident, even when I took a final decision to remove something, that a range of views had been explored, and that there was less chance of us missing important details. The full editing record is available in GoogleDocs if you have editing rights to the document. At the moment I don’t think I can expose the history. I considered doing the writing on OpenWetWare but that would have required people getting accounts, and the extra 24 hours involved there may have meant we didn’t make it. A wiki is nonetheless probably a better framework for this kind of writing.

Submitting the grant

The mechanics of the submission process meant that essentially no-one else had access to the financial details and there was little point discussing these. I wrote the justification of resources and Workplan on my own simply due to time constraints. The logistics meant that the text had to be closed off from the GoogleDoc at a specific time and then adapted to fit the available space. You can see what was done by comparing the final GoogleDoc version with the submitted version.

Would this work for a ‘real’ grant application?

As far as I am aware this is the first time a grant application has been written ‘in the open’ like this. However this is not a conventional research project. It is not clear at the moment whether the same benefits would be seen for a conventional project. Part of the reason people contributed was that they could be directly involved in the network. This would not be the case for a conventional project – would people who would see no personal benefit be prepared to contribute as much? Having said that, the benefits of having many eyes on the proposal were clear, and made it possible to turn around the submission much faster than would otherwise have been the case. Perhaps the question is not so much ‘would people contribute?’ as ‘what is the best way to encourage people to contribute?’

Thanks to everyone who helped and all those who offered support. It wouldn’t have happened without the contributions and support of a lot of people. You know, this Open Science thing actually works!

Proposal submitted…

Enough said. Thanks to everyone who helped. I will reflect on the process at a later stage and will put the complete proposal up as soon as I can. If anyone wants to send letters of support or get involved don’t feel that you’ve missed the boat. Whether the money comes up or not we ought to be doing something along these lines and I can always include more material when we reply to referees’ comments.

Cheers

Cameron

Research network proposal – Update III

The text of the proposal is now in a near complete form. I need to add references and a few other things but it is mostly in reasonable shape. If you would like to have your name included as a founder member of the network please drop me a comment on this post, email, or if I have given you editing rights then feel free to add yourself. If you do so please send or post some sort of document that I can take a version of and incorporate as a letter of support.

If you would like editing rights either comment on the original post or drop me an email. In principle all the other commenters should also be able to give you editing rights if I am unavailable (e.g. asleep). I will take a snapshot of the proposal text around 6am GMT tomorrow morning and will need to edit it and add some pictures offline before incorporating it into the whole proposal. The full proposal will be submitted tomorrow and I will put a complete PDF up as soon as I can get to it. If I can gain permission to do so I will also put up referees’ comments and any other correspondence in the fullness of time.

Thanks to everyone for their help.

The research network proposal – update II

Thanks to all those who have sent letters of support, paragraphs of text, and made comments or modifications to the proposal. Just a quick update on where we are. The text of the proposal is up at GoogleDocs. I believe I have given anyone who has commented on the original post access to edit but if not give me a yell or just send me any comments by email or pop them in as comments here.

There are a lot of other forms to be filled in for the proposal, which is difficult to do in the open, but once it is done I will pdf the whole thing and put it up for people to see. The proposal has to be at the research council by 4pm GMT on Tuesday which means in practice I need to hit the submit button early on Tuesday morning. So if you want to comment or send a letter of support then please do so by close of play on Monday, your time, to ensure that I get it in time.

Thanks to all those who have helped and we will see how we get on!

Oh and if anyone can come up with a catchier title? We thought maybe ETHOS (e-science to help open science) but that seemed a little lame…

Follow on to network proposal

Ok. In this morning’s post I proposed the idea of applying for some UK money to support meetings in the general area of open science. I’ve made a start with an outline on a GoogleDoc which can be viewed here. I have tried to set out some general headings and areas to be fleshed out and added a little text. This is early days but if anyone wishes to add anything then please feel free. I have given editing rights to all those people who had commented on the original post as of around 9:30 pm GMT on Thursday 22 November. I have set the document so that those people with invitations can cascade them to others (I hope). I will continue to issue invitations to anyone who comments on the original post. No need to feel obliged to add anything – I’m not asking you to write the grant for me – but if you feel so inclined then the assistance will be very welcome.

What I will request from those who are interested is a short letter stating your current post/position/ambitions, your interest in ‘Open Science’ and why you would like to be involved in this network. Either email it to me at C [dot] Neylon [at] rl.ac.uk or simply drop it in as a comment.

Thanks

Cameron

e-science for open science – an EPSRC research network proposal

The UK Engineering and Physical Sciences Research Council currently has a call out for proposals to fund ‘Network Activities’ in e-science. This seems like an opportunity to both publicise and support the ‘Open Science’ agenda so I am proposing to write a proposal to ask for ~£150-200k to fund workshops, meetings, and visits between different people and groups. The money could fund people to come to meetings (including from outside the UK and Europe) but could not be used to directly support research activities. The rationale for the proposal would be as follows.

  • ‘Open Science’ has the potential to radically increase the efficiency and effectiveness of research world wide.
  • The community is disparate and dispersed with many groups working on different approaches that do not currently interoperate – agreeing some interchange or tagging standards may enable significant progress.
  • Many of those driving the agenda are early career scientists including graduate students and postdocs who do not have independent travel funds and whose PI may not have resources to support attending meetings where this agenda is being developed.
  • There is significant interest from academics, some publishers, software and tool developers, and research funders in making more data freely available but limited consensus on how to take this forward and thus far an insufficient commitment of resources to make this possible in practice.

The proposal would be to support 2-3 meetings over three years, including travel costs, and provide funds for exchange visits. What I would like from the community is an expression of interest, specifically the commitment to write a letter of support saying you would like to be involved. It would be great to get these from tenured academics, early career academics, graduate students and PDRAs, publishers (NPG? PLoS?), library and repository people (UKOLN, Simile, others?) and anyone else who is relevant.

The timeline is tight (due Tuesday next week) but if there is enough interest I will push through to get this done. I propose to write the grant in the open and online so will post a Google Doc or OpenWetWare page as soon as I have something to put up. Any help people can offer on the writing would be appreciated. In the meantime please drop comments below. I will be pointing to this page in the grant proposal.