Workshop on Blog Based Notebooks

DUE TO SEVERE COMMENT SPAM ON THIS POST I HAVE CLOSED IT TO COMMENTS

On February 28/29 we held a workshop on our Blog Based notebook system at the Cosener’s House in Abingdon, Oxfordshire. This was a small workshop with 13 people, including biochemists (from Southampton, Cardiff, and RAL), social scientists (from the Oxford Internet Institute and Computing Laboratory), developers from the MyGrid and MyExperiment family, and members of the Blog development team. The purpose of the workshop was to try to identify the key problems that need to be addressed in the next version of the Blog Based notebook system, and also to ask the question ‘Does this system deliver functionality that people want?’. Another aim was to identify specific ways in which we could interact with MyExperiment to deliver enhanced functionality for us as well as new members for MyExperiment.


A quick update

I have got very behind. I’ve only just realised quite how far behind, but my excuse is that I have been rather busy. Quite how far behind I was was brought home by the fact that I hadn’t yet noted that the proposal for an Open Science session at PSB, driven primarily by Shirley Wu, has gone in, and the proposal is now up at Nature Precedings. The posting there has already generated some new contacts.

On Tuesday I gave a talk at UKOLN at the University of Bath. Brian Kelly kindly videoed the first 10 minutes of the presentation when my attempts to record a screencast failed miserably, and has blogged about the talk and on recording talks more generally for public consumption. Jean-Claude does this very effectively, but this is something we should perhaps all be putting a lot more effort into (and can someone tell me what the best software for recording screencasts is?!?). I got most of the talk on audio recording and will attempt to record a screencast when I can find time.
The talk was interesting; this was to a group of library/repository/curation/technical experts rather than the usual attempt to convince a group of sceptical scientists. Many of them are already ‘open’ advocates but are focused on technical issues. Lots of smart questions: how do you really manage secure identities across multiple systems? How do we make data on the cloud stable for the long term? How do you choose between competing standards for describing and collating data? Fundamentally, how do you actually make all this work? An interesting discussion all in all, and great to meet the people at UKOLN and finally meet Liz Lyon in person.

The other thing happening this week is that tomorrow and Friday we are running a small workshop introducing potential users to our Blog based notebook. Our aim is to see how other people’s working processes do or don’t fit into our system. This is still focused on biochemistry/molecular biology but it will be very interesting to see what comes out of this. I will try to report as soon as possible.

Finally, I think there is something in the air. This week has seen a rush of emails from people who have seen blog posts, proposals, and other things, writing to offer support and, perhaps more crucially, access to more contacts.

And further on the PLoS front the biggest story in the UK news on Tuesday morning was about the paper in PLoS Medicine reporting on the results of a meta-study of the effectiveness of SSRIs in treating depression. I woke up to this story on BBC radio and by the time I gave my talk at 10:30 I’d had a chance at least to read the paper abstract. If I’d been on SSRIs this could be really important to me. Perhaps more to the point, if I were a doctor realising I’d be fielding phone calls from concerned patients all day, I could have read the paper. This story tells us a lot about why Open Access and Open Data are crucial. But more on that in another post sometime…I promise.

Reflections from a parallel universe

On Wednesday and Thursday this week I was lucky to be able to attend a conference on Electronic Laboratory Notebooks run by an organization called SMI. Lucky because the registration fee was £1500 and I got a free ticket. Clearly this was not a conference aimed at academics. This was a discussion of the capabilities and implications for Electronic Laboratory Notebooks used in industry, and primarily in big pharma.

For me it was very interesting to see these commercial packages. I am often asked how what we do compares to these packages, and I have always had to answer that I simply don’t know; I’ve never had the chance to look at one because they are way too expensive. Having now seen them I can say that they have very impressive user interfaces with lots of integrated tools and widgets. They are fundamentally built around specific disciplines, and this allows them to be reasonably structured in their presentation and organisation. I think we would break them in our academic research setting, but it might take a while. More importantly, we wouldn’t be able to afford the customisation that it looks as though you need to get a product that does just what you want it to. Deployment costs of around £10,000 per person were being bandied around, with total contract costs clearly in the millions of dollars.

Coming out of various recent discussions I would say that I think the overall software design of these products is flawed going forward. The vendors are being paid a lot by companies who want things integrated into their systems, so there is no motivation for them to develop open platforms with data portability and easy integration of web services etc. All of these systems run as thick clients against a central database. Going forward these have to move into web portals as a first step, before working towards a fully customisable interface with easily collectable widgets to enable end-user configured integration.

But these were far from the most interesting things at the meeting. We commonly assume that keeping, preserving, and indexing data is a clear good. And indeed many of the attendees were assuming the same thing. Then we got a talk on ‘Compliance and ELNs’ by Simon Coles of Amphora Research Systems. The talk can be found here. In it was an example of just how bizarre the legal process for patent protection can make industrial practice. In preparing for a patent suit you will need to pay your lawyers to go through all the relevant data and paperwork. Indeed, if you lose you will probably pay for the opposition’s lawyers to go through it all as well. These are not just lawyers; they are expensive lawyers. If you have a whole pile of raw data floating around, this is not just going to give the lawyers a field day finding something to pin you to the wall on; it is going to burn through money like nobody’s business. The simple conclusion: it is far cheaper to re-do the experiment than to risk the need for lawyers to go through raw data. Throw the raw data away as soon as you can afford to! Like I said, a parallel universe, where you think things are normal until they suddenly go sideways on you.

On a more positive note there were some interesting talks on big companies deploying ELNs. Now we can look at this at some level as a model of a community adopting open notebooks. At least within the company (in most cases) everyone can see everyone else’s notebook. A number of speakers mentioned that this had caused problems, and a couple said that it had been necessary to develop and promulgate standards of behaviour. This is interesting in the light of the recent controversy over the naming of a new dinosaur (see commentary at Blog around the Clock) and Shirley Wu’s post on One Big Lab. It reinforces the need for generally accepted standards of behaviour and the growing importance of these as data becomes more open.

The rules? The first two came from the talk; the rest are my suggestions. Basically they boil down to ‘Be Polite’.

  1. Always ask before using someone else’s data or results
  2. User beware: if you rely on someone else’s results, it’s your problem if it blows up in your face (especially if you didn’t ask them about it)
  3. If someone asks if they can use your data or results you say yes. If you don’t want them to, give them a clear timeline on which they can or specific reasons why you can’t release the data. Give clear warnings about any caveats or concerns
  4. If someone asks you not to use their results (whether or not they are helpful or reasonable about it) think very carefully about whether you should ignore their request. If having done this you still feel you are being reasonable in using them, then think again.
  5. Any data that has not been submitted for peer review after 18 months is fair game
  6. If you incorporate someone else’s data within a paper discuss your results with them. Then include them as an author.
  7. Always, without fail and under any circumstances, acknowledge any source of information and do so generously and without conditions.

Following up on data storage issues

There were lots of helpful comments on my previous post as well as some commiseration from Peter Murray-Rust. Also Jean-Claude Bradley’s group is starting to face some similar issues with the combi-Ugi project ramping up. All in the week that the Science Commons Open Data protocol is launched. I just wanted to bring out a few quick points:

The ease with which new data types can be incorporated into UsefulChem, such as the recent incorporation of a crystal structure (see also JC’s Blog Post), shows the flexibility and ease provided by an open ended and free form system in the context of the Wiki. The theory is that our slightly more structured approach provides more implicit metadata, but I am conscious that we have yet to demonstrate the extraction of the metadata back out in a useful form.

Bill comments:

…I think perhaps the very first goal is just getting the data out there with metadata all over it saying “here I am, come get me”.

I agree that the first thing is to simply get the data up there, but the next question out of this comment must be: how good is our metadata in practice? So for instance, can anyone make any sense out of this in isolation? Remember you will need to track back through links to the post where this was ‘made’. Nonetheless I think we need to see this process through to its end. The comparison with UsefulChem is helpful because we can decide whether the benefits of our system outweigh the extra fiddling involved, or conversely how much we have to make the fiddling less challenging to make it worthwhile. At the end of the day, these are experiments in the best approaches to doing ONS.

One thing that does make our life easier is the automatic catalogue of input materials. This, and the ability to label things precisely for storage, is making a contribution to the way the lab is running. In principle something similar can be achieved for data files. The main distinction at the moment is that we generate a lot more data files than samples, so handling them is logistically more difficult.

Jean-Claude and Jeremiah have commented further on Jeremiah’s Blog on some of the fault lines between computational and experimental scientists. I just wanted to bring up a comment made by Jeremiah;

It would be easier to understand however, if you used more common command-based plotting programs like gnuplot, R, and matlab.

This is quite a common perception. ‘If you just used a command line system you could simply export the text file’. The thing is, and I think I speak for a lot of wet biologists and indeed chemists, that we simply can’t be bothered. It is too much work to learn these packages, and fighting with command lines isn’t generally something we are interested in doing – we’d rather be in the lab.

One of the very nice things about the data analysis package I use, Igor Pro, is that it has a GUI built in, but it also translates menu choices and mouse actions into a command line at the bottom of the screen. What is more, it has a quite powerful programming language which uses exactly the same commands. You start using it by playing with the mouse, you become more adept at repeating actions by cutting and pasting stuff in the command line, and then you can (almost) write a procedure by pasting a bunch of lines into a procedure file. It is, in my view, the outstanding example of a user interface that not only provides functionality for the novice and the expert user in an easily accessible way, but also guides the novice into becoming a power user.

But for most applications we can’t be bothered (or more charitably don’t have the time) to learn MatLab or Perl or R or GnuPlot (and certainly not TeX!). Perhaps the fault line lies on the division between those who prefer to use Word rather than TeX. One consequence of this is that we use programs that have an irritating tendency to have proprietary file formats. Usually we can export a text file or something a bit more open. But sometimes this is not possible. It is almost always an extra step, an extra file to upload, so even more work. Open document formats are definitely a great step forward, and XML file types are even better. But we are a bit stuck in the middle of a slowly changing process.

None of this is to say that I think we shouldn’t put the effort in, but more to say, that from the perspective of those of us who really don’t like to code, and particularly those of us generating data from ‘beige box’ instruments the challenge of ‘No insider information’ is even harder. As Peter M-R says, the glueware is both critical, and the hardest bit to get right. The problem is, I can’t write glueware, at least not without sticking my fingers to each other.

The problem with data…

Our laboratory blog system has been doing a reasonable job of handling protocols and simple pieces of analysis thus far. While more automation in the posting would be a big benefit, this is more a mechanical issue than a fundamental problem. To recap: in our system every “item” has its own post. Until now these items have been samples, or materials. The items are linked by posts that describe procedures. This system provides a crude kind of triple: Sample X was generated using Procedure A from Material Z. Where we have some analytical data, like a gel, it was generally enough to drop that in at the bottom of the procedure post. I blithely assumed that when we had more complicated data, that might for instance need re-processing, we could treat it the same way as a product or sample.
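The item-and-procedure scheme amounts to a small data model. A minimal sketch in Python (the names `Item` and `Procedure` are my own illustration, not anything in the blog system itself) might look like this:

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Item:
    """A sample or material; each one gets its own post."""
    name: str


@dataclass
class Procedure:
    """A procedure post linking input materials to output samples."""
    name: str
    inputs: List[Item]
    outputs: List[Item]

    def triples(self) -> List[Tuple[str, str, str]]:
        # One crude (output, procedure, input) triple per pairing,
        # e.g. ("Sample X", "Procedure A", "Material Z")
        return [(out.name, self.name, inp.name)
                for out in self.outputs for inp in self.inputs]
```

For a one-input, one-output procedure this yields the single statement “Sample X was generated using Procedure A from Material Z”; procedures with several inputs or products simply yield more triples.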

By coincidence both Jenny and I have generated quite a bit of data over the last few weeks. I did a Small Angle Neutron Scattering (SANS) experiment at the ILL on Sunday 10 December, and Jenny has been doing quite a lot of DNA sequencing for her project. To deal with the SANS data first: the raw data is a non-standard image format. This image needs a significant quantity of processing which uses at least three different background measurements. I did a contrast variation series, which means essentially repeating the experiment with different proportions of H2O and D2O, each of which requires its own set of backgrounds.

Problem one is just that this creates a lot of files. Given that I am uploading these by hand, as you can see here, here and here (and bearing in mind that I still have these ones and five others to do), this is going to get a bit tiring. OK, so this is an argument for some scripting. However, what I need to do is create a separate post for all 50-odd data files. Then I need to describe the data reduction, involving all of these files, down to twelve independent data files (each with their own post). All of this ‘data reduction’ is done on specially written software, and is generally done by the instrument scientist supporting the experiment, so describing it is quite difficult.
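The scripting could be as simple as generating one post per raw data file automatically. Here is a sketch of the dry-run half only; the upload itself would depend on whatever posting interface the notebook blog exposes, which I am leaving as an assumed step, and the filenames and payload fields below are invented for illustration:

```python
from pathlib import Path
from typing import Dict, List


def make_posts(data_files: List[str], experiment: str) -> List[Dict[str, str]]:
    """Build one post payload per raw data file.

    The title/body/category fields are illustrative; a real script
    would match whatever fields the blog's posting API expects.
    """
    posts = []
    for name in data_files:
        stem = Path(name).stem  # filename without extension, used as title
        posts.append({
            "title": stem,
            "body": f"Raw data file {name} from the {experiment} experiment.",
            "category": "raw-data",
        })
    return posts


# Dry run over a few of the 50-odd SANS files (filenames invented):
payloads = make_posts(["run001.dat", "run002.dat", "run003.dat"],
                      "SANS contrast variation")
```

Each payload would then be handed to the blog’s posting call in a loop, turning an afternoon of manual uploads into one script run.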

Then I need to actually start on the data analysis. Describing this is not straightforward. But it is a crucial part of the Open Notebook Science programme. Data is generally what it is – there is not much argument about it. It is the analysis where the disagreement comes in – is it valid, was it done properly, was the data appropriate? Recording the detail of the analysis is therefore crucial. The problem is that the data analysis for this involves fiddling. Michael Barton put it rather well in a post a week or so ago;

It would be great, every week, to write “Hurrah! I’ve discovered this new thing to do with protein cost. Isn’t it wonderful?”. However, in the real world it’s “I spent three days arguing with R to get it to order the bars in my chart how I want”.

Data analysis is largely about fiddling until we get something right. In my case I will be writing some code (desperate times call for desperate measures) to deconvolute the contributions from various things in my data. I will be battling, not with R but with a package called Igor Pro. How do I, or should I, record this process? SVN/Sourceforge/Google Code might be a good plan but I’m no proper coder – I wouldn’t really know what to do with these things. And actually this is a minor part of the problem, I can at least record the version of the code whenever I actually use it.

The bigger problem is actually capturing the data analysis itself. As I said, this is basically fiddling with parameters until they look right. Should I attempt to capture the process by which I refine the parameters? Or just the final values? How important is it to capture the process? I think there is at the core here an issue that divides the experimental scientists from the computational scientists. I’ve never met a primarily computer-based scientist that kept a notebook in a form that I recognised. Generally there is a list of files, perhaps some rough notes on what they are, but there is a sense that the record is already there in those files and that all that is really required is a proper index. I think this difference was at the core of the disagreement over whether the Open NMR project is ONS – we have very different views of what we mean by notebook and what it records. All in all I think I will try to output log files of everything I do and at least put those up.
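Outputting a log of everything tried could be as lightweight as appending one line per parameter set to a file. A sketch of what I have in mind (the JSON-lines format, field names, and the example fit parameters are all my own choices, not anything Igor Pro produces):

```python
import json
import time


def log_fit_step(logfile, params, note=""):
    """Append one fitting attempt (parameters plus a free-text note)
    as a JSON line, so the whole 'fiddling' history is preserved."""
    entry = {
        "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "params": params,
        "note": note,
    }
    with open(logfile, "a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


# Two steps of an imagined deconvolution fit:
log_fit_step("fit_log.jsonl", {"radius": 20.0, "polydispersity": 0.10},
             "first guess")
log_fit_step("fit_log.jsonl", {"radius": 23.5, "polydispersity": 0.15},
             "better fit to high-q tail")
```

The resulting file is both human-readable and trivially parseable, so the refinement history can be posted alongside the final parameter values.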

In the short term I think we just need to swallow hard and follow our system to its logical conclusion. The data we are generating makes this a right pain to do manually, but I don’t think we have actually broken the system per se. We desperately need two things to make this easier. Some sort of partly automated posting process, probably just a script, maybe even something I could figure out myself. But for the future we need to be able to run programs that will grab data themselves and then post back to the blog. Essentially we need a web service framework that is easy for users to integrate into their own analysis system. Workflow engines have a lot of potential here, but I am not convinced they are sufficiently usable yet. I haven’t managed to get Taverna onto my laptop yet – but before anyone jumps on me, I will admit I haven’t tried very hard. On the other hand, that’s the point. I shouldn’t have to.

If I have time I will get on to Jenny’s problem in another post. Here the issue is what format to save the data in and how much do we need to divide this process up?

An experiment in open notebook science – Sortase mediated protein-DNA ligation

In a recent post I extolled the possible virtues of Open Notebook Science in avoiding or ameliorating the risk of being scooped. I also made a virtue of the fact that being open encourages you to take a more open approach; that there is a virtuous circle or positive feedback. However much of this is very theoretical. We don’t have good case studies to point at that show that Open Notebook Science generates positive outcomes in practice. To take a more cynical perspective where is the evidence that I am willing to take risks with valuable data? My aim with this post is to do exactly that, put something out there that is (as far as I know) new and exciting, and kick off a process that may help us to generate a positive example.

I mentioned in the previous post that we have been scooped not once, but twice, on this project. I will come back to the second scooping later but my object here is to try and avoid getting scooped a third time. As I mentioned in the previous post we are using the S. aureus Sortase enzyme to attach a range of molecules to proteins. We have found that this provides a clean, easy, and most importantly general method for attaching things to proteins. Labelling of proteins, attaching proteins to solid supports, and generating various hybrid-protein molecules has a very wide range of applications and new and easy to use methods are desperately needed. We have recently published[1] the use of this to attach proteins to solid supports and others have described the attachment of small molecules[2], peptides[3], PNA[4], PEG[5] and a range of other things.

One type of protein-conjugate that is challenging to generate is one in which a protein is linked to a DNA molecule. Such conjugates have a wide range of potential applications particularly as analytical tools where the very strong and selective binding that can often be found in a protein is linked to the wide range of extremely sensitive techniques available for DNA detection and identification[6]. Such techniques have been limited because it is difficult to find a general and straightforward technique for making such conjugates.

We have used our Sortase mediated ligation to successfully attach oligonucleotides to proteins and I have put up the data we have that supports this in my lab book (see here for an overview of what we have and here for some more specific examples with conditions). I should note that some of this is not strictly open notebook science because this is data from a student which I have put up after the event.

We are confident that it is possible to get reasonable yields of these conjugates and that the method is robust and easy to apply. This is an exciting result with some potentially exciting applications. However to publish we need to generate some data on applications of these conjugates. One obvious target here is to use a DNA array and differently coloured fluorescent proteins attached to different oligonucleotides to form an image on the array. The problem is that we are not well set up to do this in my lab and don’t have the expertise or resources to do this experiment efficiently. We could do it but it seems to me that it would be quicker and more efficient for someone else with the expertise and experience to do this. In return they obviously get an authorship on the paper.

Other experiments we are interested in doing:

  • Analytical experiment using the binding of a protein-DNA conjugate that utilises the DNA part for detection.
  • Pull down of peptide-DNA conjugates onto an array after exposure of the peptides to a protease
  • Attachment of proteins to a full length PCR product containing the gene for the protein. Select one of the proteins and then re-amplify the desired gene. (I had a quick go at this but it didn’t work)

So what I am asking is this:

  • If any reader of this blog is interested in doing these (or any other) experiments to aid us in getting the published paper then get in touch
  • If you feel so inclined then publicise this call wider on your own blog and let’s see whether using the blogosphere to make contacts can really aid the science

We will send the reagents to anyone who would like to do the experiments, along with any further information required. In principle people ought to be able to figure out everything they need from the lab book, but this will probably not be the case in practice. The idea here is to see whether this notion of a loose collaboration of groups with different resources and expertise that is driven by the science can work, and whether it is a competitive way of doing science.

My criteria in accepting collaborators will be as follows:

  1. Willingness to adopt an Open Notebook Science approach for this experiment (ideally using our lab book system but not necessarily)
  2. Interest in and willingness to engage in the development of the published paper (including proposing and/or carrying out any new experiments that would be cool to include)
  3. Ability to actually carry out the experiment in reasonable time (ideally looking for a couple of months here)

So this is notionally a win-win situation for me. We will be getting on and doing our own thing as well but by working with other groups we may be able to get this paper out more efficiently and effectively. Maybe others will come up with clever experiments that would add to the value of the paper. The worst case scenario is that someone comes along and sees this, copies the results, and publishes ahead of us. The best case scenario is that someone else already working in a similar direction may come across this and propose working together on this.

In any case, the results promise to be interesting…

References:

[1] Chan et al, 2007, Covalent attachment of proteins to solid supports via Sortase-mediated ligation, PLoS ONE, e1164

[2] Popp et al, 2007, Sortagging: a versatile method for protein labelling, Nat Chem Biol, 3:707

[3] Mao et al, 2004, Sortase-mediated protein ligation: a new method for protein engineering, J Am Chem Soc, 126:2670

[4] Pritz et al, 2007, Synthesis of biologically active peptide nucleic acid-peptide conjugates by sortase-mediated ligation, J Org Chem, 72:3909

[5] Parthasarathy et al, 2007, Sortase A as a novel molecular “stapler” for sequence specific protein conjugation, Bioconj Chem, 18:469

[6] Burbulis et al, 2005, Using protein-DNA chimeras to detect and count small numbers of molecules, Nature Methods, 2:31

How best to do the open notebook thing…a nice specific example

Peter Murray-Rust is going to take an Open Notebook Science approach to a project on checking whether NMR spectra match up with the molecules they are asserted to represent. The question he poses is how best to organise this. The form of an open notebook seems to be a theme at the moment with both discussions between myself and Jean-Claude Bradley (see also the ONS session at SFLO and associated comments) as well as an initiative on OpenWetWare to develop their Wiki notebook platform with more features. There are many ideas around at the moment so Peter’s question is a good specific example to think about.

As I understand Peter’s project the plan is as follows;

  1. Obtain NMR spectra from a public database and carry out a high level QM calculation to see whether this appears consistent with the molecule that the spectrum is supposed to represent.
  2. Expose the results of this analysis in a useful form.
  3. Identify and prioritise examples where the spectrum appears to be ‘wrong’. The spectrum could be misassigned, the actual molecule could be wrong, or the calculation could be wrong.
  4. Obtain feedback on the ‘wrong’ cases and attempt to correct them through a process of discussion and refinement
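Step 3 presumably reduces, at its crudest, to comparing observed against calculated shifts and flagging large disagreements for human attention. A minimal sketch of that triage step (the RMS criterion and the 0.5 ppm threshold are arbitrary placeholders of mine, not anything from Peter’s draft):

```python
from typing import Sequence


def flag_spectrum(observed: Sequence[float],
                  calculated: Sequence[float],
                  threshold: float = 0.5) -> bool:
    """Return True if the spectrum should be queued for human review.

    observed/calculated are chemical shifts in ppm for the same assigned
    atoms; a mismatched peak count is itself grounds for flagging.
    """
    if len(observed) != len(calculated):
        return True
    rms = (sum((o - c) ** 2 for o, c in zip(observed, calculated))
           / len(observed)) ** 0.5
    return rms > threshold
```

With this, a close match (deviations of 0.1 ppm or so) passes quietly, while shifts that are whole ppm off get pushed out to the discussion-and-refinement steps 4 and beyond.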

So there are several requirements. The raw data needs to be presented in a coherent and organised fashion. Specific examples need to be ‘pushed out’ or ‘alerted’ so that knowledgeable and interested people are made aware and can comment, and (separately from commenting) so that further detailed discussion is enabled and recorded for the record. In addition there are the usual requirements for a notebook or a scientific record. The raw data must remain inviolate, and any modifications must be recorded along with the process that generated the data. There will also presumably be a requirement to record thought processes and realisations as the process goes forward.

My suggestion is as follows:

  • The raw data is generated by a computational and repetitive process so I imagine it is highly structured. I would use a template web page, possibly sitting within a Wiki but not editable, to expose these. This would include details of what was run and how and when. This would be machine generated as part of the analysis. Obviously appropriate tagging will play an important role in allowing people to parse this data.
  • A blog to provide two things. An informal running commentary of what is going on, what the current thought processes are, and what is being run and ‘alerts’ of specific examples which are interesting (or ‘wrong’). This is largely human generated, although the ‘alerts’ could be automated.
  • A wiki to enable discussion of specific examples and detailed comparisons by outside and inside observers. As Peter suggests in his draft paper, specific groups, both functional and academic, may show up as problems but predicting these in advance is challenging. A wiki provides a free form way of letting people identify and collate these. It may be appropriate to (automatically or manually) post comments from the blog into the wiki (which would also provide reliable time stamps and histories, not available in most standard blog engines).

So my answer to Peter’s question which might have been paraphrased as ‘Which engine is the best to use?’ is all of them. They all provide functionality that is important for the project as I understand it but none of them provide enough functionality on their own. An interesting question which would arise from this combination of approaches is ‘where is the notebook?’ to which I will admit I don’t have an answer. But I’m not sure that it matters.

This doubling up mirrors current practice in Jean-Claude’s group, where the UsefulChem wiki is the core notebook but the blog is used for high-level discussion. Similarly, I am moving towards using this blog for higher-level discussion of results and the chemtools blog as more of a data repository. At Southampton we are thinking about the notion of ‘publishing’ from the blog to a wiki once a protocol or set of results is sufficiently established, as Step 1 on the way to the paper.

Finally a throw away suggestion. Peter, if you want to get a lot of spectra with a lot of associated molecules, without any concerns about publisher copyrights, then consider opening this up as a service for graduate students to check their NMR assignments. I bet you get inundated…