The problem with data…

Our laboratory blog system has been doing a reasonable job of handling protocols and simple pieces of analysis thus far. While more automation in the posting would be a big benefit, this is more a mechanical issue than a fundamental problem. To recap, in our system every “item” has its own post. Until now these items have been samples or materials. The items are linked by posts that describe procedures. This system provides a crude kind of triple: Sample X was generated using Procedure A from Material Z. Where we have some analytical data, like a gel, it was generally enough to drop that in at the bottom of the procedure post. I blithely assumed that when we had more complicated data, which might for instance need re-processing, we could treat it the same way as a product or sample.
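To make the “crude triple” concrete, here is a minimal sketch in Python. It is not how our blog is actually implemented; the class names Item and ProcedurePost and their fields are invented purely for illustration. It only shows how item posts linked by a procedure post give statements of the form output, procedure, input.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """A sample, material, or data file; each item gets its own post."""
    name: str
    kind: str  # e.g. "sample", "material", "data"


@dataclass
class ProcedurePost:
    """A post describing a procedure that links input items to output items."""
    title: str
    inputs: list
    outputs: list

    def triples(self):
        """Yield (output, procedure, input) for every output/input pair."""
        for out in self.outputs:
            for inp in self.inputs:
                yield (out.name, self.title, inp.name)


material_z = Item("Material Z", "material")
sample_x = Item("Sample X", "sample")
procedure_a = ProcedurePost("Procedure A", inputs=[material_z], outputs=[sample_x])

for out, proc, inp in procedure_a.triples():
    print(f"{out} was generated using {proc} from {inp}")
```

If data files are treated as just another kind of Item, a data reduction step becomes one more ProcedurePost with many inputs and a single output.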

By coincidence both Jenny and I have generated quite a bit of data over the last few weeks. I did a Small Angle Neutron Scattering (SANS) experiment at the ILL on Sunday 10 December, and Jenny has been doing quite a lot of DNA sequencing for her project. To deal with the SANS data first: the raw data is in a non-standard image format. This image needs a significant quantity of processing, which uses at least three different background measurements. I did a contrast variation series, which means essentially repeating the experiment with different proportions of H2O and D2O, each of which requires its own set of backgrounds.

Problem one is just that this creates a lot of files. Given that I am uploading these by hand, you can see here, here and here (and bearing in mind that I still have these ones and five others to do) that this is going to get a bit tiring. OK, so this is an argument for some scripting. However, what I need to do is create a separate post for each of the 50-odd data files. Then I need to describe the data reduction, involving all of these files, down to the relatively small number of twelve independent data files (each with its own post). All of this ‘data reduction’ is done with specially written software, and is generally done by the instrument scientist supporting the experiment, so describing it is quite difficult.
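The bookkeeping described above could be sketched roughly as follows. This is only an illustration in Python: the file names are made up, the post structure is a plain dictionary rather than anything our blog actually uses, and nothing is uploaded. It drafts one post per raw data file and then one procedure post that links a set of raw files to a reduced file.

```python
from pathlib import PurePath


def draft_data_posts(filenames):
    """Draft one post per raw data file (nothing is uploaded here)."""
    return [
        {"title": f"Data: {PurePath(name).name}", "kind": "data", "file": name}
        for name in sorted(filenames)
    ]


def draft_reduction_post(title, raw_posts, reduced_file):
    """Draft a procedure post linking many raw data posts to one reduced file."""
    return {
        "title": title,
        "kind": "procedure",
        "inputs": [post["title"] for post in raw_posts],
        "output": f"Data: {reduced_file}",
    }


# Hypothetical file names for one contrast point and its backgrounds.
raw_files = ["run_003_sample.dat", "run_001_empty_cell.dat", "run_002_buffer.dat"]
raw_posts = draft_data_posts(raw_files)
reduction = draft_reduction_post(
    "Data reduction, contrast 1", raw_posts, "contrast_1_reduced.dat"
)

for post in raw_posts + [reduction]:
    print(post["title"])
```

Repeating this for each point in the contrast series would give the twelve reduced files, each with its own post and a record of which raw files went into it.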

Then I need to actually start on the data analysis. Describing this is not straightforward. But it is a crucial part of the Open Notebook Science programme. Data is generally what it is – there is not much argument about it. It is the analysis where the disagreement comes in – is it valid, was it done properly, was the data appropriate? Recording the detail of the analysis is therefore crucial. The problem is that the data analysis for this involves fiddling. Michael Barton put it rather well in a post a week or so ago;

It would be great, every week, to write “Hurrah! I’ve discovered to this new thing to do with protein cost. Isn’t it wonderful?”. However, in the real world it’s “I spent three days arguing with R to get it to order the bars in my chart how I want”.

Data analysis is largely about fiddling until we get something right. In my case I will be writing some code (desperate times call for desperate measures) to deconvolute the contributions from various things in my data. I will be battling, not with R but with a package called Igor Pro. How do I, or should I, record this process? SVN/Sourceforge/Google Code might be a good plan but I’m no proper coder – I wouldn’t really know what to do with these things. And actually this is a minor part of the problem, I can at least record the version of the code whenever I actually use it.

The bigger problem is actually capturing the data analysis itself. As I said, this is basically fiddling with parameters until they look right. Should I attempt to capture the process by which I refine the parameters? Or just the final values? How important is it to capture the process? I think there is at core here the issue that divides the experimental scientist from the computational scientist. I’ve never met a primarily computer based scientist who kept a notebook in a form that I recognised. Generally there is a list of files, perhaps some rough notes on what they are, but there is a sense that the record is already there in those files and that all that is really required is a proper index. I think this difference was at the core of the disagreement over whether the Open NMR project is ONS – we have very different views of what we mean by notebook and what it records. All in all I think I will try to output log files of everything I do and at least put those up.
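As a rough idea of what such a log might look like, here is a small Python sketch. My actual analysis will be in Igor Pro, so this is a stand-in only, and the function names, parameter names and version label are all invented for illustration. Each attempt records the time, the code version, the parameter values tried and a free-text note, so the refinement process is captured as well as the final values.

```python
import json
from datetime import datetime, timezone


def log_fit_attempt(log, code_version, params, note=""):
    """Append one analysis attempt (parameters tried, plus a note) to the log."""
    entry = {
        "time": datetime.now(timezone.utc).isoformat(),
        "code_version": code_version,
        "params": dict(params),  # copy, so later edits do not alter the record
        "note": note,
    }
    log.append(entry)
    return entry


def to_json_lines(log):
    """Serialise the log as one JSON object per line, ready to paste into a post."""
    return "\n".join(json.dumps(entry, sort_keys=True) for entry in log)


fit_log = []
log_fit_attempt(fit_log, "v0.1", {"scale": 1.0, "background": 0.05}, "first guess")
log_fit_attempt(fit_log, "v0.1", {"scale": 1.2, "background": 0.04}, "looks better by eye")
print(to_json_lines(fit_log))
```

The last entry gives the final values; everything before it is the fiddling.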

In the short term I think we just need to swallow hard and follow our system to its logical conclusion. The data we are generating makes this a right pain to do manually but I don’t think we have actually broken the system per se. We desperately need two things to make this easier. The first is some sort of partly automated posting process, probably just a script, maybe even something I could figure out myself. But for the future we also need to be able to run programs that will grab data themselves and then post back to the blog. Essentially we need a web service framework that is easy for users to integrate into their own analysis system. Workflow engines have a lot of potential here but I am not convinced they are sufficiently useable yet. I haven’t managed to get Taverna onto my laptop yet – but before anyone jumps on me I will admit I haven’t tried very hard. On the other hand that’s the point. I shouldn’t have to.

If I have time I will get on to Jenny’s problem in another post. Here the issue is what format to save the data in and how much do we need to divide this process up?

Seeking advice and resources on Open Notebook Science

The following comment was posted to the ‘About‘ page by Sharon Sonenblum from Georgia Tech. Rather than leave it there where people might not see it I thought I would bring it to the front for everyone’s attention.

‘I’m looking for some resources or direction for diving into open notebook science. I have been interested in the concept for quite some time and recently began following this blog and a few others. I am excited to see that ONS is real and growing, but I’m not sure the best places to start. I want to find out what other folks are doing, what software they are using and what has and has not worked. I also would love to chat with anyone doing research with human subjects to figure out how IRB restrictions play out in ONS.’

Hi Sharon, great to see people interested in ONS! I am sure others will offer comments and suggestions but I will put my tuppence in first. My main suggestion would be to dive in and see what works for you, within the limitations of what you can do. Depending on the kind of work you are doing and how you are already recording it there are a range of options. As I mentioned in yesterday’s post there are as many different approaches to ONS as there are people doing it. We are definitely at the stage of exploring what is possible, what works, and there is plenty of discussion and indeed disagreement over what the best approach is.

There are really two places you could start. The easiest, and possibly the safest way to dip your toes into the water, is to start up a blog that discusses your lab work in general. There are good examples of this kind of approach with Rosie Redfield’s lab being one of the main proponents (see also Michael Barton’s blog). This can be, but is not necessarily, Open Notebook Science as defined by Jean-Claude Bradley. From what you say there may be real issues with you making your primary data available. If it involves human subjects then I would imagine it will be very difficult, if not impossible, to make the raw data available due to ethical considerations. Certainly I would expect that any review board would require that any data that was released was anonymised and that subjects understood exactly what the release conditions would be. I am no expert in ethics and we don’t (as far as I know) have anyone in the ONS community who is dealing with either human or animal subjects. This is an area that I think is important and that we have yet to explore in detail; if we believe that some science (say chemistry) should be fully open but that some (e.g. small-scale drug trials) cannot be, then can we draw clear boundaries? I don’t know the answer but clearly some care is required with this.

If you can get clearance to go fully to ONS then there are a range of options. I would say it depends a lot on what sort of data you are dealing with. Take a look at your existing lab book and see what it looks like. Is it an electronic document already? Could you simply put that online? Is it an index to a set of data files, spreadsheets, graphs, analysis? If so a Wiki may be the best approach and using a free hosted service, either Wikispaces as used by UsefulChem, or OpenWetWare, could be a good option. Here you can add data files and then add pages that describe, and index them, as well as pages for analysing and discussing the results. Is your lab book more of a journal? Then a Blog may be the best approach, although you need to be careful here about date stamps as many blog engines allow you to change the datestamp. We use an in house developed blog at Southampton that gets around some of these problems but this is definitely an alpha to beta stage product.

Finally, make sure you discuss it with the people around you. Many scientists are deeply uncomfortable with the whole idea of making the lab notebook available. Be sure that you understand and take into account any concerns. In some cases they may not be valid concerns but as with anything there are real risks with the open notebook approach. Take the opportunity to understand any concerns and be prepared to argue where you think they are unjustified, but in a constructive way. Hopefully you can find good discussion points on this blog, at UsefulChem, Open Reading Frames (see also Bill’s excellent three part series at 3 Quarks Daily), petermr’s blog, Jeremiah Faith’s blog, Michael Barton’s blog, What you’re doing is rather desperate, Public Ramblings, BBGM…who have I missed?

Good luck and keep us updated! The best thing about ONS is the conversations that can get started.

A big few weeks for open (notebook) science

So while I have been buried in the paper- and lab-work there has been quite a lot of interesting stuff going on. Pedro Beltrao has started an Open Notebook style project at Google Code which he describes in a post on Public Ramblings. This is interesting, because once again someone is using a different system as an Open Notebook. We have Wikis, Blogs, TeX based documents, and now software version repositories being used. As Jean-Claude Bradley has said, and as we have discussed, we have a lot to learn from exploring different systems, both in terms of understanding the benefits and limitations of specific systems on the way to designing and implementing better ones, and from the perspective of what this tells us about how we do our science, and how this differs from discipline to discipline. Indeed, there already seems to be a place where this discussion has started in Pedro’s system. It is great to see this going forward and also great to see other members of the community, including Bill Hooker and Michael Barton, already getting in and getting their hands dirty. I only wish I could contribute a bit more on the science itself.

Also good is the publicity that Open Notebooks and Open Notebook Science are getting. An article in Chemistry World, the members’ journal of the Royal Society of Chemistry, features UsefulChem, and discussion from Peter Murray-Rust, Steve Bachrach and others. Our efforts at Southampton even get a mention! What is good about this is not so much the personal publicity but that the mainstream ‘industry’ journals are increasingly starting to pick up the story. Not so long ago there was the article in Wired; Chemistry World has also recently discussed the issues associated with openness in a reasonably balanced manner (see also Peter Suber and Peter Murray-Rust’s commentaries).

In addition there is good coverage on the web. Rosie Redfield’s lab pages got featured by David Ng on World’s Fair on Science Blogs which was also picked up at BoingBoing (thanks to Neil Saunders for bringing this to my attention). Momentum is building as Neil says. The issues are becoming mainstream and the benefits are starting to flow through in specific cases. This is how things start to change. The challenge is in maintaining this forward momentum as it builds.

The OPEN Research Network Proposal – update and reflections

Despite all evidence to the contrary, I have not in fact fallen off the end of the world. I have just been a little run off my feet over the last week or so. A quick weekend trip to the south of France (see here for probably rather too much detail) and a lot of other things, not least some wrangling over allowed costs for the grant, have been keeping me busy.

The research network proposal was successfully submitted on Tuesday 27th November, some six days after I proposed here the possibility of applying for this grant. To echo what Mat Todd said, I haven’t ever been involved with a grant proposal that came together so fast, and while it still involved several days with very little sleep on my part it could not have been put together without a great deal of assistance from a large number of people. The final version of the proposal is here and I will try to put up a page on OpenWetWare for further discussion. The text of the case for support is also available at Nature Precedings. Precedings were uncomfortable about hosting the financial details of the proposal and I think this is interesting in its own right and will write on it later. Here, however, I want to reflect on the process of preparing the grant and what worked well and what didn’t.

Finding the community

The use of this Blog and the subsequent diffusion of the request for help through a number of other blogs was very effective and quite rapid. Diffusion was important and the proposal was featured on a wide range of blogs (1, 2, 3, 4, 5…others?). Given the very short time scale the number of people that became involved was really very high. People are able to move much faster than organisations so on the timescale that we were working it wasn’t possible to get organisations such as PLoS, Nature Publishing Group, BioMedCentral etc. formally involved by the time of the grant submission. I am still very keen to get the involvement of organisations like these and others and it isn’t too late to send a letter of support as I can update these at any time.

It is interesting to contrast this with the response I received to my earlier request for collaborators on the protein-DNA ligation project. In the case of the network proposal I was very rapidly swamped with support whereas for the actual science based project I haven’t had a response as yet. I think this is a good demonstration that while the Open and Connected approach can be effective, it is currently working best for development and networking projects associated with open and connected practices. As a research community we work very well on our common interests, where we have critical mass. However beyond this, in the areas of our ‘real research’, we are not yet seeing the potential benefits to anywhere near the same extent. I believe this is because we don’t yet have either critical mass or a sufficiently connected network of researchers. In my view a central aim of the Research Network should be to break out of the ghetto and start to enable and demonstrate the benefits we know and have seen in the context of a wider range of scientific disciplines.

Writing the grant

In the process of writing and editing the grant it became clear that contributors have very different ‘contribution styles’ and that different types of contribution had higher or lower chances of making it into the final document. The proposal was written in GoogleDocs based on a first draft that I put together rather rapidly. The structure changed significantly over the course of the six days. Some contributors preferred to email specific comments whereas some got right in and hacked away at the text. At times there were six or ten people simultaneously editing the document. I am particularly grateful to those who spent the last night before submission going through and finding typos (although there are still quite a few, I am embarrassed to admit). This made the final stages of ‘cleaning up’ much easier.

Those who directly edited the document saw a much higher chance of their changes making it into the final document. Email comments were also valuable and were included or taken account of in many cases but because they were less immediate there was a greater tendency for them to be passed over or simply lost in the rush. At all times I took the arbitrary decision that I would delete, adapt, or add text as I saw fit. While a consensus approach might have worked had more time been available, with the time constraints imposed it seemed to me that strong ‘editorial’ guidance was required to hit the final target.

Overall, this was a relatively pleasant way to write a proposal. The fact that many eyes went over the text was a great help and made me much more confident, even when I took a final decision to remove something, that a range of views had been explored, and that there was less chance of us missing important details. The full editing record is available in GoogleDocs if you have editing rights to the document. At the moment I don’t think I can expose the history. I considered doing the writing on OpenWetWare but that would have required people getting accounts, and the extra 24 hours involved there may have meant we didn’t make it. A wiki is nonetheless probably a better framework for this kind of writing.

Submitting the grant

The mechanics of the submission process meant that essentially no-one else had access to the financial details and there was little point discussing these. I wrote the justification of resources and Workplan on my own simply due to time constraints. The logistics meant that the text had to be closed off from the GoogleDoc at a specific time and then adapted to fit the available space. You can see what was done by comparing the final GoogleDoc version with the submitted version.

Would this work for a ‘real’ grant application?

As far as I am aware this is the first time a grant application has been written ‘in the open’ like this. However this is not a conventional research project. It is not clear at the moment whether the same benefits would be seen for a conventional project. Part of the reason people contributed was that they could be directly involved in the network. This would not be the case for a conventional project – would people who would see no personal benefit be prepared to contribute as much? Having said that, the benefits of having many eyes on the proposal were clear, and made it possible to turn around the submission much faster than would otherwise have been the case. Perhaps the question is not so much “would people contribute?” as “what is the best way to encourage people to contribute?”

Thanks to everyone who helped and all those who offered support. It wouldn’t have happened without the contributions and support of a lot of people. You know, this Open Science thing actually works!

Proposal submitted…

Enough said. Thanks to everyone who helped. I will reflect on the process at a later stage and will put the complete proposal up as soon as I can. If anyone wants to send letters of support or get involved don’t feel that you’ve missed the boat. Whether the money comes up or not we ought to be doing something along these lines and I can always include more material when we reply to referees’ comments.

Cheers

Cameron

Research network proposal – Update III

The text of the proposal is now in a near complete form. I need to add references and a few other things but it is mostly in reasonable shape. If you would like to have your name included as a founder member of the network please drop me a comment on this post, email, or if I have given you editing rights then feel free to add yourself. If you do so please send or post some sort of document that I can take a version of and incorporate as a letter of support.

If you would like editing rights either comment on the original post or drop me an email. In principle all the other commenters should also be able to give you editing rights if I am unavailable (e.g. asleep). I will take a snapshot of the proposal text around 6am GMT tomorrow morning and will need to edit it and add some pictures offline before incorporating it into the whole proposal. The full proposal will be submitted tomorrow and I will put a complete PDF up as soon as I can get to it. If I can gain permission to do so I will also put up referees’ comments and any other correspondence in the fullness of time.

Thanks to everyone for their help.

The research network proposal – update II

Thanks to all those who have sent letters of support, paragraphs of text, and made comments or modifications to the proposal. Just a quick update on where we are. The text of the proposal is up at GoogleDocs. I believe I have given anyone who has commented on the original post access to edit but if not give me a yell or just send me any comments by email or pop them in as comments here.

There are a lot of other forms to be filled in for the proposal, which is difficult to do in the open, but once it is done I will PDF the whole thing and put it up for people to see. The proposal has to be at the research council by 4pm GMT on Tuesday, which means in practice I need to hit the submit button early on Tuesday morning. So if you want to comment or send a letter of support then please do so by close of play on Monday, your time, to ensure that I get it in time.

Thanks to all those who have helped and we will see how we get on!

Oh and if anyone can come up with a catchier title? We thought maybe ETHOS (e-science to help open science) but that seemed a little lame…

Follow on to network proposal

Ok. In this morning’s post I proposed the idea of applying for some UK money to support meetings in the general area of open science. I’ve made a start with an outline on a GoogleDoc which can be viewed here. I have tried to set out some general headings and areas to be fleshed out and added a little text. This is early days but if anyone wishes to add anything then please feel free. I have given editing rights to all those people who have commented on the original post (as of around 9:30 pm GMT on Thursday 22 November). I have set the document so that those people with invitations can cascade them to others (I hope). I will continue to issue invitations to anyone who comments on the original post. No need to feel obliged to add anything – I’m not asking you to write the grant for me – but if you feel so inclined then the assistance will be very welcome.

What I will request from those who are interested is a short letter stating your current post/position/ambitions, your interest in ‘Open Science’ and why you would like to be involved in this network. Either email it to me at C [dot] Neylon [at] rl.ac.uk or simply drop it in as a comment.

Thanks

Cameron

e-science for open science – an EPSRC research network proposal

The UK Engineering and Physical Sciences Research Council currently has a call out for proposals to fund ‘Network Activities’ in e-science. This seems like an opportunity to both publicise and support the ‘Open Science’ agenda so I am proposing to write a proposal to ask for ~£150-200k to fund workshops, meetings, and visits between different people and groups. The money could fund people to come to meetings (including from outside the UK and Europe) but could not be used to directly support research activities. The rationale for the proposal would be as follows.

  • ‘Open Science’ has the potential to radically increase the efficiency and effectiveness of research world wide.
  • The community is disparate and dispersed with many groups working on different approaches that do not currently interoperate – agreeing some interchange or tagging standards may enable significant progress
  • Many of those driving the agenda are early career scientists including graduate students and postdocs who do not have independent travel funds and whose PI may not have resources to support attending meetings where this agenda is being developed
  • There is significant interest from academics, some publishers, software and tool developers, and research funders in making more data freely available but limited consensus on how to take this forward and thus far an insufficient commitment of resources to make this possible in practice

The proposal would be to support 2-3 meetings over three years, including travel costs, and provide funds for exchange visits. What I would like from the community is an expression of interest, specifically the commitment to write a letter of support saying you would like to be involved. It would be great to get these from tenured academics, early career academics, graduate students and PDRAs, publishers (NPG? PLoS?), library and repository people (UKOLN, Simile, others?) and anyone else who is relevant.

The timeline is tight (due Tuesday next week) but if there is enough interest I will push through to get this done. I propose to write the grant in the open and online so will post a Google Doc or OpenWetWare page as soon as I have something to put up. Any help people can offer on the writing would be appreciated. In the meantime please drop comments below. I will be pointing to this page in the grant proposal.

An experiment in open notebook science – Sortase mediated protein-DNA ligation

In a recent post I extolled the possible virtues of Open Notebook Science in avoiding or ameliorating the risk of being scooped. I also made a virtue of the fact that being open encourages you to take a more open approach; that there is a virtuous circle or positive feedback. However much of this is very theoretical. We don’t have good case studies to point at that show that Open Notebook Science generates positive outcomes in practice. To take a more cynical perspective where is the evidence that I am willing to take risks with valuable data? My aim with this post is to do exactly that, put something out there that is (as far as I know) new and exciting, and kick off a process that may help us to generate a positive example.

I mentioned in the previous post that we have been scooped not once, but twice, on this project. I will come back to the second scooping later but my object here is to try and avoid getting scooped a third time. As I mentioned in the previous post we are using the S. aureus Sortase enzyme to attach a range of molecules to proteins. We have found that this provides a clean, easy, and most importantly general method for attaching things to proteins. Labelling of proteins, attaching proteins to solid supports, and generating various hybrid-protein molecules has a very wide range of applications and new and easy to use methods are desperately needed. We have recently published[1] the use of this to attach proteins to solid supports and others have described the attachment of small molecules[2], peptides[3], PNA[4], PEG[5] and a range of other things.

One type of protein-conjugate that is challenging to generate is one in which a protein is linked to a DNA molecule. Such conjugates have a wide range of potential applications particularly as analytical tools where the very strong and selective binding that can often be found in a protein is linked to the wide range of extremely sensitive techniques available for DNA detection and identification[6]. Such techniques have been limited because it is difficult to find a general and straightforward technique for making such conjugates.

We have used our Sortase mediated ligation to successfully attach oligonucleotides to proteins and I have put up the data we have that supports this in my lab book (see here for an overview of what we have and here for some more specific examples with conditions). I should note that some of this is not strictly open notebook science because this is data from a student which I have put up after the event.

We are confident that it is possible to get reasonable yields of these conjugates and that the method is robust and easy to apply. This is an exciting result with some potentially exciting applications. However to publish we need to generate some data on applications of these conjugates. One obvious target here is to use a DNA array and differently coloured fluorescent proteins attached to different oligonucleotides to form an image on the array. The problem is that we are not well set up to do this in my lab and don’t have the expertise or resources to do this experiment efficiently. We could do it but it seems to me that it would be quicker and more efficient for someone else with the expertise and experience to do this. In return they obviously get an authorship on the paper.

Other experiments we are interested in doing:

  • Analytical experiment using the binding of a protein-DNA conjugate that utilises the DNA part for detection.
  • Pull down of peptide-DNA conjugates onto an array after exposure of the peptides to a protease
  • Attachment of proteins to a full length PCR product containing the gene for the protein. Select one of the proteins and then re-amplify the desired gene. (I had a quick go at this but it didn’t work)

So what I am asking is this:

  • If any reader of this blog is interested in doing these (or any other) experiments to aid us in getting the published paper then get in touch
  • If you feel so inclined then publicise this call wider on your own blog and let’s see whether using the blogosphere to make contacts can really aid the science

We will send the reagents to anyone who would like to do the experiments along with any further information required. In principle people ought to be able to figure out everything they need from the lab book but this will probably not be the case in practice. The idea here is to see whether this notion of a loose collaboration of groups with different resources and expertise that is driven by the science can work and whether it is a competitive way of doing science.

My criteria in accepting collaborators will be as follows:

  1. Willingness to adopt an Open Notebook Science approach for this experiment (ideally using our lab book system but not necessarily)
  2. Interest in and willingness to engage in the development of the published paper (including proposing and/or carrying out any new experiments that would be cool to include)
  3. Ability to actually carry out the experiment in reasonable time (ideally looking for a couple of months here)

So this is notionally a win-win situation for me. We will be getting on and doing our own thing as well but by working with other groups we may be able to get this paper out more efficiently and effectively. Maybe others will come up with clever experiments that would add to the value of the paper. The worst case scenario is that someone comes along and sees this, copies the results, and publishes ahead of us. The best case scenario is that someone else already working in a similar direction may come across this and propose working together on this.

In any case, the results promise to be interesting…

References:

[1] Chan et al, 2007, Covalent attachment of proteins to solid supports via Sortase-mediated ligation, PLoS ONE, e1164

[2] Popp et al, 2007, Sortagging: a versatile method for protein labelling, Nat Chem Biol, 3:707

[3] Mao et al, 2004, Sortase-mediated protein ligation: a new method for protein engineering, J Am Chem Soc, 126:2670

[4] Pritz et al, 2007, Synthesis of biologically active peptide nucleic acid-peptide conjugates by sortase-mediated ligation, J Org Chem, 72:3909

[5] Parthasarathy et al, 2007, Sortase A as a novel molecular “stapler” for sequence specific protein conjugation, Bioconj Chem, 18:469

[6] Burbulis et al, 2005, Using protein-DNA chimeras to detect and count small numbers of molecules, Nature Methods, 2:31