Increasing the persistence of online Open Notebooks

Weird. I came across WebCite this morning while having a quick scan through the Eysenbach paper in PLoS Biology on Open Access increasing the number of citations. At the bottom was the comment that all the web pages have been archived on WebCite. Going across to WebCite I find the following:

What is WebCite®?
WebCite® is an archiving system for webreferences (cited webpages and websites), which can be used by authors, editors, and publishers of scholarly papers and books, to ensure that cited webmaterial will remain available to readers in the future. If cited webreferences in journal articles, books etc. are not archived, future readers may encounter a “404 File Not Found” error when clicking on a cited URL.

A WebCite® reference is an archived webcitation, and rather than linking to the live website (which can and probably will disappear in the future), authors of scholarly works will link to the archived WebCite® copy on webcitation.org.

Now this is interesting in its own right but my thoughts went back to a question I was asked several times during the talks I’ve given recently, ‘How do you know it will still be there in X years’ time?’ So might this be a good secondary archive for pages of electronic lab books, particularly when we wish to refer to specific pages in a peer reviewed paper?

Part of the answer to this may be to have many independent copies in different forms. However another answer is explicit archiving, whether that be in institutional repositories, general online repositories, embedded in supplementary information, or just trusting in Google. WebCite or something similar could have a useful role to play here. It could also provide a method of getting third party datestamps on the notebook.

That’s not the weird bit:

The weird thing is that Deepak Singh just blogged about WebCite a few hours ago, which he picked up on from Jon Udell’s blog. Perhaps it is just morphic resonance :)

Some of the comments on Jon’s blog mirror my own concerns. The business model for this doesn’t seem to be well thought out. I can’t see many publishers buying into a system which is essentially trying to bridge the gap between their collapsing business model and a future of online publishing. In this context it is interesting that the list of members does not, for instance, include Nature Publishing Group or any PLoS journals, who you would think would be at the forefront of this kind of initiative. I would support the comment by Jon Galloway that this would be a good case for contextual advertising.

So I can see that this could well be useful as an archive for cited web pages, and I will use it when citing material from the web in future in my own papers. In terms of using it more generally as an archive for the notebooks there are two major issues. Firstly, will there be objections to us dumping large quantities of material into the system? The FAQ explicitly excludes charging authors but currently only asks people to be considerate in terms of what they archive. Secondly I would want to see a more clearly sustainable business model.

Getting scooped…

I have been waiting to write this post for a while. The biggest concern expressed when people consider taking on an Open Notebook Science approach is that of being ‘scooped’. I wanted to talk about this potential risk using a personal example where my group was scooped but I didn’t want to talk about someone else’s published paper until the paper on our work was available for people to compare. Our paper has just gone live at PLoS ONE so you will be able to compare the two sets of results.

Attaching proteins in a site selective manner to solid supports is a challenging problem. A general approach to attaching proteins to resin beads or planar surfaces while retaining function would have applications in chemical catalysis, analytical devices, and the generation of protein microarrays.

We established in my laboratory that the Sortase enzyme of S. aureus was an effective way of attaching functional proteins to solid supports in about March 2006. This was before I started taking up ONS and, as the student is finishing up, the project has not been moved onto an ONS basis, so the data was not made available when we had it. We delayed publishing this as we attempted to generate a ‘pretty picture’ in which we would create the Southampton University logo in fluorescent protein on a glass surface. The idea of this was to make it more likely that we would get the paper into a higher ranked journal but ultimately we were unsuccessful.

In March 2007 we were scooped by a paper in Bioconjugate Chemistry (1). This paper, amongst other things, included an experiment that was very similar to the core experiment in our data (2). I should emphasise that there is absolutely no suggestion that this group ‘stole’ our data. They were working independently and were probably doing their experiments at about the same time as we did ours.

The first point here is that in the vast majority of cases being scooped is not about theft but about the fact that a good idea is an idea that is likely to occur to more than one person. It is essentially about not being first to get to publication. I can argue that I had the idea some years ago – but we didn’t get on to the work until 2006 and we’ve only just managed to get it published.

The second point is that our work is clearly different enough from Parthasarathy et al to be published. This is often the case. Indeed we have recently been scooped again on a different aspect of this project (3) but I expect we will still be able to publish as our data is again complementary to that reported.

So, from the perspective of traditional publication we were scooped because we didn’t publish fast enough. We can’t claim any precedence because we weren’t taking an ONS approach that would support this claim. But let us consider what would have happened if we had taken an ONS approach. I think there are a series of possible outcomes:

  1. It is possible, or even likely, that the other group may not have noticed our results at all. Under these circumstances we would at least be able to claim precedence.
  2. The other group may have seen our results and been spurred into more rapid publication. Again we would have been able to claim precedence but also there would be a record of the visit. I suspect this is the most common route to being scooped. In most cases results are not ‘copied’ from e.g. conference presentations but much more often the fact that someone is close to publication spurs another group to get their work published first.
  3. The most positive outcome is that, having seen we had some similar results, the other group may have got in contact and we could have put the results together to make a better paper.

Outcome 3) may seem unlikely but it really is the best outcome for everyone. Parthasarathy et al published in Bioconjugate Chemistry and we will publish in PLoS ONE after chasing around a number of other journals. If we had combined the results and, possibly more importantly, the resources to hand we probably could have put together a much better paper. This could possibly have gone to a significantly higher ranked journal. Apart from possible arguments over first and corresponding authorship everyone would have been better off.

This is the promise of being open as well as practising Open Notebook Science. By cooperating we can do a lot better. Being open has its risks but equally there are significant potential benefits including doing better science, better publications, and better career prospects as a result.

But let us now put the shoe on the other foot. What if the other group had made their data available? Would I have rushed out our paper to prevent them getting in first? It is one thing to advocate openness but would I really have gotten in touch with them myself? The answer is that 12 months ago I probably wouldn’t have got in contact. I would have pushed the student to work 24 hours a day and got our own paper out as fast as possible with whatever data we had to hand. And we may have cut corners to get the data together, missed out controls that we know would work but didn’t have time to do, and glossed over any possible issues.

But today, faced with the same dilemma I would get in touch with them and propose combining our data. Why the change? Partly because I have spent the past 12 months considering the issues around being open. But a strong contributor is that if I didn’t I would be exposing myself to criticism as a hypocrite. I have come to think that one of the real benefits of ‘being open’ is that being exposed means you hold yourself to higher standards precisely because being out in the open means that people have the evidence to judge you on.

I find that as I do my experiments and record them I take more care, I describe them more clearly, and I take more care to preserve and index the data properly. More generally I feel more inclined to share my ideas and preliminary results with others. And part of this is because I am aware that double standards will be obvious to anyone who is looking. Standards and discipline in maintaining them make for better science and for better people. Anyone who is honest with themselves knows that sometimes, somewhere, there is a temptation to cut corners. We all need help in maintaining discipline and being open is a very effective way of doing it.

It may sound a bit over the top but I actually feel like a better person for taking this approach. So for all the sceptics out there, and particularly for those academics with blood pressure issues, I recommend you try throwing the doors open. The fresh air is a bit bracing but it will do you the world of good.

  1. Parthasarathy R, Subramanian S, Boder ET (2007) Sortase A as a novel molecular “stapler” for sequence-specific protein conjugation. Bioconjug Chem 18:469-76
  2. Chan L, Cross HF, She JK, Cavalli G, Martins HFP, Neylon C (2007) Covalent attachment of proteins to solid supports and surfaces via Sortase-mediated ligation, PLoS ONE 2(11): e1164 doi:10.1371/journal.pone.0001164
  3. Popp et al., (2007) Sortagging: a versatile method for protein labelling. Nat Chem Biol Sep 23 (Epub ahead of print)

Sourceforge for science

I got to meet Jeremiah Faith this morning and we had an excellent wide ranging discussion which I will try to capture in more detail later. However I wanted to get down some thoughts we had at the end of the discussion. We were talking about how to publicise and generate more interest and activity for Open Notebook Science. Jeremiah suggested the idea of a Sourceforge for science; a central clearing house somewhere on the web where projects could be described and people could opt in to contribute. There have been some ideas in this direction such as Totally retrosynthetic but I don’t think there has been a lot of uptake there.

This was all tied into the idea of making lab books findable and indexed in places where people might look for them. I have been taken with the way PostGenomic and ChemicalBlogSpace aggregate blogs, particularly blog posts on the peer reviewed literature and, in the case of ChemicalBlogSpace, aggregate comments on molecules, based on trawling for InChI keys (I think). So can we propose that one of (both of?) these sites start aggregating online notebook posts? If we could make these point at peer reviewed papers online it would also be possible to use a modified version of the Blue Obelisk Greasemonkey script that would pop up whenever you were looking at a paper for which there was raw data online.

It wouldn’t be necessary, or perhaps even advisable, to limit these to people strictly practising Open Notebook Science. People could put up data once a paper was published or after a delay. Perhaps we could not even require that all the raw data be put up. If the barriers are lowered more people may do it. A range of appropriate tags (‘Partial raw data is available for this paper’, ‘Full raw data is available for this paper’, ‘Full raw data and associated data is available as an open notebook’) would distinguish between what people are making available. Data could be dropped anywhere online and by aggregation it gains more visibility, encouraging people to move from making specific data available towards making all their data available.

Any thoughts?

Talks on Open Notebook Science – some initial thoughts

So I have given three talks in ten days or so, one at the CanSAS meeting at NIST, one at Drexel University and one at MIT last night. Jean-Claude Bradley was kind enough to help me record the talk at Drexel as a screencast and you can see this in various formats here. He has also made some comments on the talk on the UsefulChem Blog and Scientific Blogging site.

The talks at Drexel and MIT were interesting. I was expecting the focus of questions to be more on the issues of being open, the risks and benefits, and problems. Actually the focus of questions was on the technicalities and in particular people wanting to get under the hood and play with the underlying data. Several of the questions I was asked could be translated as ‘do you have an API?’. The answer to this is at the moment no, but we know it is a direction we need to go in.

We have two crucial things we need to address at the moment: the first is the issue of automating some of the posting. We believe this needs to be achieved through an application or script that sits outside the blog itself and that it can be linked to the process of actually labelling the stuff we make. The second issue is that of an API or web service that allows people to get at the underlying data in an automated fashion. This will be useful for us as we move towards doing analysis of our data as well. Jean-Claude said he was also looking at how to automate processes so clearly this is the next big step forward.

Another question raised at MIT was how you could retro-fit our approach into an existing blog or wiki engine. The key issue here is templates (which is next on my list to describe here in detail), which would probably require some sort of plugin. The other issue is the metadata. Our blog engine goes one step beyond tagging by providing keys with values. Presumably this could be coded into a conventional engine using RDF or microformats – perhaps we should be doing this on our Blog in any case?
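To make the keys-with-values idea a little more concrete, here is a minimal sketch of how a post's metadata could be turned into RDF-style triples. This is hand waving in code form rather than a description of how our blog engine actually works: the function names, the example.org namespace and the assumption that keys are simple single-word tokens are all made up for illustration.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical namespace for metadata keys; not a real vocabulary.
DEFAULT_NAMESPACE = "http://example.org/labblog/terms#"


def metadata_to_triples(post_url: str, metadata: Dict[str, str],
                        namespace: str = DEFAULT_NAMESPACE) -> List[Triple]:
    """Turn key/value metadata into (subject, predicate, object) triples.

    Assumes each key is a simple token that is safe to append to the namespace.
    Triples are returned sorted by key so the output is reproducible.
    """
    return [(post_url, namespace + key, value)
            for key, value in sorted(metadata.items())]


def to_ntriples(triples: List[Triple]) -> str:
    """Serialise triples as N-Triples lines, treating every object as a plain literal."""
    lines = []
    for subject, predicate, obj in triples:
        escaped = (obj.replace("\\", "\\\\").replace('"', '\\"')
                      .replace("\n", "\\n").replace("\r", "\\r"))
        lines.append(f'<{subject}> <{predicate}> "{escaped}" .')
    return "\n".join(lines)
```

A conventional tag is then just the special case of a key with no interesting value, which is one way of seeing why keys with values go "one step beyond tagging".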

Incidentally a point I made in both talks, partly in response to the question ‘does anyone really look at it’, is that in many cases it is your own access you are enabling. Making it open means you can always get at your own data, which is a surprisingly helpful thing.

The CanSAS meeting was also interesting. This is traditionally a meeting where Small Angle Scattering instrument scientists, the people who maintain and support these instruments at large scale neutron and X-ray facilities, fail to agree on a standard data format. I wanted to make two points, one was the general point that making data available was a good thing, and secondly that making the instrument data available without a detailed description of the sample was pretty useless. However against all precedent they not only agreed a data format but it is also a flexible XML format allowing different tags for different ‘dialects’. So I can insert a tag into the data file that will point to our lab book, which is what I wanted.
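As a rough illustration of what inserting such a tag might look like, the sketch below adds a pointer to a lab book page into an XML data file using Python's standard library. The element name labbook_link and the toy document structure are my own placeholders and are not taken from the agreed format.

```python
import xml.etree.ElementTree as ET


def add_labbook_link(xml_text: str, labbook_url: str,
                     tag: str = "labbook_link") -> str:
    """Return a copy of the XML document with a lab book pointer appended.

    The new element is added as the last child of the root element.
    The tag name is a hypothetical placeholder, not part of any real schema.
    """
    root = ET.fromstring(xml_text)
    link = ET.SubElement(root, tag)
    link.text = labbook_url
    return ET.tostring(root, encoding="unicode")
```

The point is only that, in a flexible XML format, the link back to the sample description travels with the instrument data rather than living in someone's head.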

Today I head off to talk to the OpenWetWare developers and the Simile group so that will be very interesting. More details as I have time to post.

Growing a community – Open Notebook Science directories

As has been flagged up by Jean-Claude Bradley there are a couple of places now where people can sign up to say that they have Open Notebook Science in their laboratory, practise Open Notebook Science, or even would like to find a place where they can keep an Open Notebook. Jean-Claude has put a list on the Nodalpoint Wiki and I have set up a database at DabbleDB. DabbleDB is a rather cool web based database system that provides free access as long as you make the database contents freely available. Because the data is completely open I am not asking for people’s email addresses.

If you want to be included in the database you can put your details in on the form here. This will allow anyone to re-use the data (which you can find here) to generate lists on appropriate web-pages, or maps or any number of other nice re-uses of the data. If you are interested in the working of the database give me a yell and I can give you admin access.

PMR’s Open Notebook Project continued

This is a reply continuing the conversation with Peter Murray-Rust on his plans for an Open Notebook Science based project. I have cut a lot of the context to keep the post size to a manageable level so if you want to track back see the original two posts from Peter, my response, and Peter’s response to that in full.

I should add that I am not a coder in any form so where this gets technical I am proposing things in principle (or hand waving as some might put it :).

PMR: We’ll create plots for ALL molecules and spectra. However it may not always be possible to identify what is “wrong”. Thus a bad TMS value (e.g. if the solvent is wrong) will shift all the values. So we may give a revised line (y = x –> y = x + c).

…and…

PMR: Yes. We’ll probably do this by RMS deviation and we could colour the table of contents or something similar. It may not be easy to make generic corrections over several thousand files. (Hang on – the files are in CML so it’s trivial).

…and…

PMR: Yes. It may not be trivial to correct them – we shan’t have a chemical editor in the Wiki, so it may be an idea to have a molecule upload. However the details often bite hard.
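For what it is worth, the revised line (y = x + c) and the RMS deviation mentioned in the quotes above are simple to compute once the shifts are paired up. The sketch below is my own illustration and not Peter's code: I am assuming x is the observed shift and y the calculated one, and the function names are invented.

```python
from math import sqrt
from typing import Sequence


def _check(observed: Sequence[float], calculated: Sequence[float]) -> None:
    """Both lists must be non-empty and pair up one to one."""
    if not observed or len(observed) != len(calculated):
        raise ValueError("need two non-empty lists of equal length")


def fit_offset(observed: Sequence[float], calculated: Sequence[float]) -> float:
    """Least-squares constant c for calculated = observed + c (slope fixed at 1).

    With the slope fixed, the best c is just the mean difference.
    """
    _check(observed, calculated)
    return sum(c - o for o, c in zip(observed, calculated)) / len(observed)


def rms_deviation(observed: Sequence[float], calculated: Sequence[float],
                  offset: float = 0.0) -> float:
    """RMS deviation of calculated shifts from the line y = x + offset."""
    _check(observed, calculated)
    total = sum((c - o - offset) ** 2 for o, c in zip(observed, calculated))
    return sqrt(total / len(observed))
```

A spectrum with a large RMS deviation from y = x but a small one after applying the fitted offset would look like a referencing or solvent problem, whereas one that stays large after the correction is more likely to be a misassignment or a wrong structure.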

The most practical approach may simply be to let people flag things or suggest other solutions, either through comments on a blog or as additions on the wiki (which could all be aggregated into an RSS feed), and then to re-run the spectra with the new molecule/assignment/solvent by hand (or rather put it in the queue by hand). I imagine having a ‘Wrong Spectra Blog’ which has both a conventional comment button and a ‘propose correction’ button, which still posts a comment but flags it for easy aggregation and possibly prompts people to drop in an InChI/CML/SMILES code. Aggregation from multiple RSS feeds and filtering is do-able, e.g. via Yahoo Pipes: grab the RSS feed(s) from comments and then filter for a specific tag code; the comment is then in effect an RDF triple.
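The grab-and-filter step could equally be done with a few lines of script instead of Yahoo Pipes. The following is only a sketch under my own assumptions: that the comment feeds are plain RSS 2.0, that the ‘propose correction’ button marks the comment with a category element, and that the flag text is "propose-correction" (a made-up value).

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, List


def flagged_items(feeds: Iterable[str],
                  flag: str = "propose-correction") -> List[Dict[str, str]]:
    """Collect items carrying the given category from several RSS 2.0 feeds.

    Each feed is passed in as a string of XML that has already been fetched.
    Items are returned in feed order, then document order within each feed.
    """
    matches: List[Dict[str, str]] = []
    for feed_xml in feeds:
        root = ET.fromstring(feed_xml)
        for item in root.iter("item"):
            categories = [c.text.strip() for c in item.findall("category") if c.text]
            if flag in categories:
                matches.append({
                    "title": item.findtext("title", default=""),
                    "link": item.findtext("link", default=""),
                    "description": item.findtext("description", default=""),
                })
    return matches
```

The description of each flagged comment is where a proposed InChI, CML or SMILES string would sit, ready to be put in the queue by hand.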

Although if there is an entry page then presumably someone could run an existing spectrum against a new assignment/molecule. Mainly a question of providing a link to make this easy for people. This could be in the template. Then how do you associate an attempted correction with the original ‘wrong’ spectrum? Well that is where the details bite I suspect.

There is also the issue of implementation of all of this:

PMR: Yes. I am not yet sure how to insert machine-generated pages into a Wiki and we’ll value help here. The pages will certainly NOT be editable. Any refinement of the protocol or correction will generate a NEW job, not overwrite the last one.

…and…

PMR: I think we are clearly going to have a new blog. What I’m not clear is how we post comments from the blog to the Wiki and alert the Wiki from the blog.

This is not a world away from the blogging instruments developed in Jeremy Frey’s group at Southampton Uni. In that case the instruments themselves post a blog entry each time a sample is analysed. Here we are talking about a computational analysis but the principle is the same. Both WordPress and MediaWiki have (I believe – this is where I get out of my depth) quite sophisticated APIs that could enable automated posting.

There is some information at OpenWetWare (see particularly Julius Luck’s comments) about interfacing with MediaWiki that came up during the recent discussion on lab books. I believe there was an intention at some point to attempt to integrate the OpenWetWare blogs (like this one) with the OpenWetWare wiki at some level but I don’t think this is currently a priority for them.

I imagine it ought to be possible to write plugins for both MediaWiki and WordPress that would provide a button for each post/comment to ‘Publish to Wiki/Blog’. I don’t think this needs to be automated as I would see this as precisely the point at which human intervention is helping things along. The researcher may wish to publish a set of Blog comments to the Wiki to encourage more detailed discussion or conversely may wish to post a good solution from the Wiki to the Blog to alert people that something interesting has popped up.

CN: An interesting question which would arise from this combination of approaches is ‘where is the notebook?’ to which I will admit I don’t have an answer. But I’m not sure that it matters.

PMR: I am not worried about where the notebook is (though it could be difficult to “lift it up” by a single root).

I think this shows that by expanding the functionality of the ‘lab notebook’ we are starting to break our underlying idea of what that notebook is. Experimental scientists think very much in terms of a monolithic object bound in nice hard covers (even though this bears very little resemblance to reality) and the idea that it can become a diffuse object distributed through a series of different repositories and journals is a bit discomforting. What is becoming clear to me is that we are starting to capture much more than just the raw data or procedures that go into the lab book itself. Keeping an electronic lab notebook is more work than a paper one, primarily because most of us don’t use paper notebooks very effectively.

Postscript: I’ve edited this slightly as of 10:36 GMT 14-October because some things were unclear. I’ve used the tag pmr-ons to collect these posts.

How best to do the open notebook thing…a nice specific example

Peter Murray-Rust is going to take an Open Notebook Science approach to a project on checking whether NMR spectra match up with the molecules they are asserted to represent. The question he poses is how best to organise this. The form of an open notebook seems to be a theme at the moment with both discussions between myself and Jean-Claude Bradley (see also the ONS session at SFLO and associated comments) as well as an initiative on OpenWetWare to develop their Wiki notebook platform with more features. There are many ideas around at the moment so Peter’s question is a good specific example to think about.

As I understand Peter’s project the plan is as follows:

  1. Obtain NMR spectra from a public database and carry out a high level QM calculation to see whether this appears consistent with the molecule that the spectrum is supposed to represent.
  2. Expose the results of this analysis in a useful form.
  3. Identify and prioritise examples where the spectrum appears to be ‘wrong’. The spectrum could be misassigned, the actual molecule could be wrong, or the calculation could be wrong.
  4. Obtain feedback on the ‘wrong’ cases and attempt to correct them through a process of discussion and refinement.

So there are several requirements. The raw data needs to be presented in a coherent and organised fashion. Specific examples need to be ‘pushed out’ or ‘alerted’ so that knowledgeable and interested people are made aware and can comment and (and this is separate from commenting) further detailed discussion is enabled and recorded for the record. In addition there are the usual requirements for a notebook or a scientific record. The raw data must remain inviolate and any modifications must be recorded along with the process that generated the data. There will also presumably be a requirement to record thought processes and realisations as the process goes forward.

My suggestion is as follows:

  • The raw data is generated by a computational and repetitive process so I imagine it is highly structured. I would use a template web page, possibly sitting within a Wiki but not editable, to expose these. This would include details of what was run and how and when. This would be machine generated as part of the analysis. Obviously appropriate tagging will play an important role in allowing people to parse this data.
  • A blog to provide two things: an informal running commentary of what is going on, what the current thought processes are, and what is being run; and ‘alerts’ of specific examples which are interesting (or ‘wrong’). This is largely human generated, although the ‘alerts’ could be automated.
  • A wiki to enable discussion of specific examples and detailed comparisons by outside and inside observers. As Peter suggests in his draft paper, specific groups, both functional and academic, may show up as problems but predicting these in advance is challenging. A wiki provides a free form way of letting people identify and collate these. It may be appropriate to (automatically or manually) post comments from the blog into the wiki (which would also provide reliable time stamps and histories, not available in most standard blog engines).
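To illustrate the first bullet, here is a minimal sketch of machine-generated pages built from a template, where every run produces a new page and nothing is ever overwritten. It is a toy in-memory version under my own assumptions: the field names, the job numbering and the JobPages class are invented for illustration, and a real version would write to the Wiki or to disk.

```python
from string import Template
from typing import Dict

# Hypothetical page layout; the fields are placeholders for whatever the analysis records.
PAGE_TEMPLATE = Template(
    "Job: $job_id\n"
    "Molecule: $molecule\n"
    "Method: $method\n"
    "Run at: $run_at\n"
    "RMS deviation: $rms\n"
)


class JobPages:
    """Append-only store: each analysis run gets a new page; none is overwritten."""

    def __init__(self) -> None:
        self._pages: Dict[str, str] = {}

    def publish(self, molecule: str, method: str, run_at: str, rms: float) -> str:
        """Render a page for one run and return its newly assigned job id."""
        job_id = f"job-{len(self._pages) + 1:05d}"
        self._pages[job_id] = PAGE_TEMPLATE.substitute(
            job_id=job_id, molecule=molecule, method=method,
            run_at=run_at, rms=f"{rms:.2f}",
        )
        return job_id

    def page(self, job_id: str) -> str:
        """Return the rendered page for a job id (raises KeyError if unknown)."""
        return self._pages[job_id]
```

Re-running the same molecule with a refined protocol simply calls publish again, so the earlier page stays exactly as it was and the history of what was run, how and when is preserved.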

So my answer to Peter’s question which might have been paraphrased as ‘Which engine is the best to use?’ is all of them. They all provide functionality that is important for the project as I understand it but none of them provide enough functionality on their own. An interesting question which would arise from this combination of approaches is ‘where is the notebook?’ to which I will admit I don’t have an answer. But I’m not sure that it matters.

This doubling up mirrors current practice in Jean-Claude’s group, where the UsefulChem wiki is the core notebook but the Blog is used for high level discussion. Similarly I am moving towards using this Blog for higher level discussion of results but the chemtools blog as more of a data repository. At Southampton we are thinking about the notion of ‘publishing’ from the Blog to a Wiki once a protocol or set of results is sufficiently established as Step 1 on the way to the paper.

Finally a throw away suggestion. Peter, if you want to get a lot of spectra with a lot of associated molecules, without any concerns about publisher copyrights, then consider opening this up as a service for graduate students to check their NMR assignments. I bet you get inundated…

Why and where we search…

This quote is grabbed from a comment by Jean-Claude Bradley at bbgm in reply to my comment on Deepak’s post on my post on…. anyway my original comment was that our Wiki review would not be indexed on Google Scholar, which is where people might go for literature searches.

Jean-Claude:

Getting on Google Scholar is something on my list to look into – if anyone knows how to do it please let us know. But from our Sitemeter tracking on UsefulChem it is clear that scientists are using Google to search for (and find) actionable scientific information.

Now this is an interesting point and it mirrors what I do. Jean-Claude has established that a lot of the ‘new’ traffic coming to UsefulChem comes from Google searches for specific information. Specific molecules in many cases but also spectra and other experimental data. If I’m looking for information, or a resource, scientific or otherwise, I will do a generic Google search for the most part (the most successful recent one was for ‘sticky apple pudding’ – the result was very good indeed – see the Waitrose.com link).

But if I’m looking for scientific literature I will go to PubMed or sometimes to Google Scholar if I’m getting frustrated. I only ever use WOK for citation based searching (i.e. who cited a paper) or on the rare occasions when I’m looking for material that is not indexed in PubMed. Partly this is because I really like the ‘related items’ tab in PubMed. But what strikes me is that in my mind I have obviously divided these classes of searches up into two different things: ‘information/resources’ and ‘literature’. I bet this correlates quite strongly with both age and with scientific field. Do others out there think of these things as different or as all part of a continuum of information?

I recently saw a talk on a ‘Research Information Centre’ being developed by Microsoft, a sort of portal for handling research projects and all the associated information. This is at an early stage of development but one of the features they were working on was an integrated search where you could add and subtract various items (PubMed, WOK, Google, GoogleScholar, and various toll access info sources as well). GoogleScholar could do this well. So, as Jean-Claude says above: anybody got any contacts with the developers? We could just talk to them…

Joint NSF-EPSRC programme in Chemistry – an opportunity for ONS?

Looking at the EPSRC website I came across the following call for proposals involving collaboration between a US and UK programme:

http://www.epsrc.ac.uk/CallsForProposals/NSF-EPSRCChemistryProposals07.htm

Now, being an academic I’m up for any method of trying to get money out of the system, especially special programmes. But is there an opportunity here to do something quite exciting in the area of Open Notebooks for chemistry, where we take Jean-Claude’s experience and our Lab Blogbook system and try to build something that combines the best of both or, possibly better, something which is a superset of both? The biggest issue I can see is that it might not be seen as chemistry, but if we focus on the idea of getting the chemical data both in and back out again it might fly.

Deadline for outline applications is November 6th…