PMR’s Open Notebook Project continued

This is a reply continuing the conversation with Peter Murray-Rust on his plans for an Open Notebook Science based project. I have cut a lot of the context to keep the post to a manageable size, so if you want to track back see the original two posts from Peter, my response, and Peter’s response to that in full.

I should add that I am not a coder in any form so where this gets technical I am proposing things in principle (or hand waving as some might put it :).

PMR: We’ll create plots for ALL molecules and spectra. However it may not be always to identify what is “wrong”. Thus a bad TMS value (e.g. if the solvent is wrong) will shift all the values. So we may give a revised line (y = x –> y = x + c).

…and…

PMR: Yes. We’ll probably do this by RMS deviation and we could colour the table of contents or something similar. It may not be easy to make generic corrections over several thousand files. (Hang on – the files are in CML so it’s trivial).

…and…

PMR: Yes. It may not be trivial to correct them – we shan’t have a chemical editor in the Wiki, so it may be an idea to have a molecule upload. However the details often bite hard.

The most practical approach may simply be to let people flag things or suggest other solutions, either through comments on a blog or as additions on the wiki (which could all be aggregated into an RSS feed), and then to re-run the spectra with the new molecule/assignment/solvent by hand (or rather put it in the queue by hand). I imagine having a ‘Wrong Spectra Blog’ with both a conventional comment button and a ‘propose correction’ button. The latter still posts a comment but flags it for easy aggregation and possibly prompts people to drop in an InChI/CML/SMILES code. Aggregation from multiple RSS feeds and filtering is doable, e.g. via Yahoo Pipes: grab the RSS feed(s) from the comments and then filter for a specific tag code; the comment is then in effect an RDF triple.
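As a rough illustration of the "grab the feeds, then filter for a tag" idea, here is a minimal Python sketch using only the standard library. The marker `#propose-correction`, the sample feed and the example.org links are all made up for illustration; a real comment feed might carry the flag in a category element or elsewhere, so treat this as one possible shape rather than how any particular blog engine works.

```python
import xml.etree.ElementTree as ET

# Hypothetical marker that a 'propose correction' button might add to a comment.
CORRECTION_TAG = "#propose-correction"


def proposed_corrections(rss_xml, tag=CORRECTION_TAG):
    """Return (title, link, description) for each RSS item whose text carries the tag."""
    root = ET.fromstring(rss_xml)
    hits = []
    for item in root.iter("item"):
        description = item.findtext("description", default="")
        if tag in description:
            hits.append((
                item.findtext("title", default=""),
                item.findtext("link", default=""),
                description,
            ))
    return hits


def aggregate(feeds, tag=CORRECTION_TAG):
    """Combine the flagged comments from several feeds into one list."""
    hits = []
    for rss_xml in feeds:
        hits.extend(proposed_corrections(rss_xml, tag))
    return hits


# Invented sample feed with one ordinary comment and one flagged correction.
SAMPLE_FEED = """<rss version="2.0"><channel><title>Wrong Spectra Blog comments</title>
<item><title>Comment on entry 12</title><link>http://example.org/12#c1</link>
<description>Looks like the wrong solvent to me.</description></item>
<item><title>Comment on entry 12</title><link>http://example.org/12#c2</link>
<description>#propose-correction SMILES: CCO</description></item>
</channel></rss>"""

if __name__ == "__main__":
    for title, link, text in aggregate([SAMPLE_FEED]):
        print(title, link, text)
```

Only the flagged comment survives the filter, so the output of `aggregate` could feed the queue of spectra to be re-run by hand.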

Although if there is an entry page then presumably someone could run an existing spectrum against a new assignment/molecule. It is mainly a question of providing a link to make this easy for people, which could be in the template. Then how do you associate an attempted correction with the original ‘wrong’ spectrum? Well, that is where the details bite, I suspect.

There is also the issue of implementation of all of this:

PMR: Yes. I am not yet sure how to insert machine-generated pages into a Wiki and we’ll value help here. The pages will certainly NOT be editable. Any refinement of the protocol or correction will generate a NEW job, not overwrite the last one.

…and…

PMR: I think we are clearly going to have a new blog. What I’m not clear is how we post comments from the blog to the Wiki and alert the Wiki from the blog.

This is not a world away from the blogging instruments developed in Jeremy Frey’s group at Southampton Uni. In that case the instruments themselves post a blog entry each time a sample is analysed. Here we are talking about a computational analysis but the principle is the same. Both WordPress and MediaWiki have (I believe – this is where I get out of my depth) quite sophisticated APIs that could enable automated posting.
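To make the automated posting idea slightly more concrete, here is a hedged Python sketch of how an analysis job might prepare a request for the MediaWiki action API. It assumes a wiki that exposes the API's `edit` module; the page naming scheme (`NMR/<molecule>/job-NNNN`) and the function names are invented for illustration. The sketch only builds the form-encoded request body and sends nothing; obtaining a login session and an edit token is left out. The `createonly` flag asks the wiki to refuse the edit if the page already exists, which fits Peter's rule that a new job never overwrites an old one.

```python
from urllib.parse import urlencode


def job_page_title(molecule_id, job_number):
    """Hypothetical naming scheme: every job gets its own, never reused, page."""
    return f"NMR/{molecule_id}/job-{job_number:04d}"


def build_edit_request(page_title, wikitext, csrf_token,
                       summary="Automated job report"):
    """Return a form-encoded body for a MediaWiki action=edit request."""
    params = {
        "action": "edit",
        "format": "json",
        "title": page_title,
        "text": wikitext,
        "summary": summary,
        "createonly": "1",    # fail rather than overwrite an existing page
        "token": csrf_token,  # edit token obtained from the wiki beforehand
    }
    return urlencode(params)


if __name__ == "__main__":
    title = job_page_title("mol42", 3)
    print(build_edit_request(title, "Results of job 3", "PLACEHOLDER-TOKEN"))
```

The body would be sent as a POST to the wiki's `api.php` by whatever process runs the calculations; whether the resulting pages are then protected from editing is a matter for the wiki's configuration.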

There is some information at OpenWetWare (see particularly Julius Luck’s comments) about interfacing with MediaWiki that came up during the recent discussion on lab books. I believe there was an intention at some point to attempt to integrate the OpenWetWare blogs (like this one) with the OpenWetWare wiki at some level but I don’t think this is currently a priority for them.

I imagine it ought to be possible to write plugins for both MediaWiki and WordPress that would provide a button for each post/comment to ‘Publish to Wiki/Blog’. I don’t think this needs to be automated as I would see this as precisely the point at which human intervention is helping things along. The researcher may wish to publish a set of Blog comments to the Wiki to encourage more detailed discussion or conversely may wish to post a good solution from the Wiki to the Blog to alert people that something interesting has popped up.

CN: An interesting question which would arise from this combination of approaches is ‘where is the notebook?’ to which I will admit I don’t have an answer. But I’m not sure that it matters.

PMR: I am not worried about where the notebook is (though it could be difficult to “lift it up” by a single root).

I think this shows that by expanding the functionality of the ‘lab notebook’ we are starting to break our underlying idea of what that notebook is. Experimental scientists think very much in terms of a monolithic object bound in nice hard covers (even though this bears very little resemblance to reality), and the idea that it can become a diffuse object distributed through a series of different repositories and journals is a bit discomforting. What is becoming clear to me is that we are starting to capture much more than just the raw data or procedures that go into the lab book itself. Keeping an electronic lab notebook is more work than a paper one, primarily because most of us don’t use paper notebooks very effectively.

Postscript: I’ve edited this slightly as of 10:36 GMT 14-October because some things were unclear. I’ve used the tag pmr-ons to collect these posts.

How best to do the open notebook thing…a nice specific example

Peter Murray-Rust is going to take an Open Notebook Science approach to a project on checking whether NMR spectra match up with the molecules they are asserted to represent. The question he poses is how best to organise this. The form of an open notebook seems to be a theme at the moment with both discussions between myself and Jean-Claude Bradley (see also the ONS session at SFLO and associated comments) as well as an initiative on OpenWetWare to develop their Wiki notebook platform with more features. There are many ideas around at the moment so Peter’s question is a good specific example to think about.

As I understand Peter’s project the plan is as follows:

  1. Obtain NMR spectra from a public database and carry out a high level QM calculation to see whether each spectrum appears consistent with the molecule that it is supposed to represent.
  2. Expose the results of this analysis in a useful form.
  3. Identify and prioritise examples where the spectrum appears to be ‘wrong’. The spectrum could be misassigned, the actual molecule could be wrong, or the calculation could be wrong.
  4. Obtain feedback on the ‘wrong’ cases and attempt to correct them through a process of discussion and refinement.
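Step 3 could be approached in many ways. One simple possibility, sketched below in Python, is to rank entries by the RMS deviation between calculated and observed shifts after allowing for a constant offset (the kind of y = x + c shift a bad reference value would produce). The function names and the numbers are invented for illustration and this is not a description of Peter's actual protocol.

```python
import math


def fit_offset(calculated, observed):
    """Best constant c (least squares) for observed ~ calculated + c."""
    diffs = [obs - calc for calc, obs in zip(calculated, observed)]
    return sum(diffs) / len(diffs)


def rms_deviation(calculated, observed, offset=0.0):
    """Root mean square deviation of observed from (calculated + offset)."""
    residuals = [obs - (calc + offset) for calc, obs in zip(calculated, observed)]
    return math.sqrt(sum(r * r for r in residuals) / len(residuals))


def rank_entries(entries):
    """entries maps a name to (calculated, observed) shift lists.

    Returns (name, offset, corrected_rms) tuples, worst agreement first.
    """
    ranked = []
    for name, (calculated, observed) in entries.items():
        offset = fit_offset(calculated, observed)
        ranked.append((name, offset, rms_deviation(calculated, observed, offset)))
    return sorted(ranked, key=lambda row: row[2], reverse=True)


if __name__ == "__main__":
    # Made-up shifts: A is uniformly shifted, B has one peak badly off.
    entries = {
        "A": ([10.0, 20.0, 30.0], [12.0, 22.0, 32.0]),
        "B": ([10.0, 20.0, 30.0], [10.0, 25.0, 30.0]),
    }
    for name, offset, rms in rank_entries(entries):
        print(f"{name}: offset {offset:+.2f}, RMS after offset {rms:.2f}")
```

An entry whose deviation disappears once the offset is applied probably has a referencing or solvent problem, whereas one that stays high after correction is a better candidate for a misassignment or a wrong structure.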

So there are several requirements. The raw data needs to be presented in a coherent and organised fashion. Specific examples need to be ‘pushed out’ or ‘alerted’ so that knowledgeable and interested people are made aware and can comment, and (separately from commenting) further detailed discussion needs to be enabled and recorded. In addition there are the usual requirements for a notebook or a scientific record. The raw data must remain inviolate and any modifications must be recorded along with the process that generated the data. There will also presumably be a requirement to record thought processes and realisations as the process goes forward.

My suggestion is as follows:

  • The raw data is generated by a computational and repetitive process so I imagine it is highly structured. I would use a template web page, possibly sitting within a Wiki but not editable, to expose these. This would include details of what was run and how and when. This would be machine generated as part of the analysis. Obviously appropriate tagging will play an important role in allowing people to parse this data.
  • A blog to provide two things: an informal running commentary on what is going on, what the current thought processes are, and what is being run; and ‘alerts’ of specific examples which are interesting (or ‘wrong’). This is largely human generated, although the ‘alerts’ could be automated.
  • A wiki to enable discussion of specific examples and detailed comparisons by outside and inside observers. As Peter suggests in his draft paper, specific groups, both functional and academic, may show up as problems but predicting these in advance is challenging. A wiki provides a free form way of letting people identify and collate these. It may be appropriate to (automatically or manually) post comments from the blog into the wiki (which would also provide reliable time stamps and histories, not available in most standard blog engines).
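The machine-generated template page in the first point might look something like the following Python sketch, which fills a fixed text template from a record of one analysis job. Every field name and value here is a placeholder chosen for illustration; the real pages would presumably carry CML and far more detail.

```python
from string import Template

# Fixed layout for every result page; only the $fields change from job to job.
PAGE_TEMPLATE = Template(
    "Molecule: $molecule_id\n"
    "Job: $job_id\n"
    "Method: $method\n"
    "Run at: $run_at\n"
    "RMS deviation (ppm): $rmsd\n"
    "Tags: $tags\n"
)


def render_page(record):
    """Fill the template from a dict describing one analysis job."""
    return PAGE_TEMPLATE.substitute(
        molecule_id=record["molecule_id"],
        job_id=record["job_id"],
        method=record["method"],
        run_at=record["run_at"],
        rmsd=f"{record['rmsd']:.2f}",
        tags=", ".join(record["tags"]),
    )


if __name__ == "__main__":
    # All values below are invented placeholders.
    print(render_page({
        "molecule_id": "mol42",
        "job_id": "job-0001",
        "method": "placeholder-QM-method",
        "run_at": "2007-10-14T10:36:00Z",
        "rmsd": 2.357,
        "tags": ["nmr", "pmr-ons"],
    }))
```

Because the layout never changes, anything that can read one page can read them all, which is what makes the tagging useful for later parsing.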

So my answer to Peter’s question, which might have been paraphrased as ‘Which engine is the best to use?’, is all of them. They all provide functionality that is important for the project as I understand it but none of them provides enough functionality on its own. An interesting question which would arise from this combination of approaches is ‘where is the notebook?’ to which I will admit I don’t have an answer. But I’m not sure that it matters.

This doubling up mirrors current practice in Jean-Claude’s group, where the UsefulChem wiki is the core notebook but the Blog is used for high level discussion. Similarly I am moving towards using this Blog for higher level discussion of results and the chemtools blog as more of a data repository. At Southampton we are thinking about the notion of ‘publishing’ from the Blog to a Wiki once a protocol or set of results is sufficiently established as Step 1 on the way to the paper.

Finally, a throwaway suggestion. Peter, if you want to get a lot of spectra with a lot of associated molecules, without any concerns about publisher copyrights, then consider opening this up as a service for graduate students to check their NMR assignments. I bet you get inundated…

Why and where we search…

This quote is grabbed from a comment by Jean-Claude Bradley at bbgm in reply to my comment on Deepak’s post on my post on…. anyway my original comment was that our Wiki review would not be indexed on Google Scholar, which is where people might go for literature searches.

Jean-Claude:

Getting on Google Scholar is something on my list to look into – if anyone knows how to do it please let us know. But from our Sitemeter tracking on UsefulChem it is clear that scientists are using Google to search for (and find) actionable scientific information.

Now this is an interesting point and it mirrors what I do. Jean-Claude has established that a lot of the ‘new’ traffic coming to UsefulChem comes from Google searches for specific information. Specific molecules in many cases but also spectra and other experimental data. If I’m looking for information, or a resource, scientific or otherwise, I will do a generic Google search for the most part (the most successful recent one was for ‘sticky apple pudding’ – the result was very good indeed – see the Waitrose.com link).

But if I’m looking for scientific literature I will go to PubMed or sometimes to Google Scholar if I’m getting frustrated. I only ever use WOK for citation based searching (i.e. who cited a paper) or on the rare occasions when I’m looking for material that is not indexed in PubMed. Partly this is because I really like the ‘related items’ tab in PubMed. But what strikes me is that in my mind I have obviously divided these classes of searches up into two different things: ‘information/resources’ and ‘literature’. I bet this correlates quite strongly with both age and with scientific field. Do others out there think of these things as different or as all part of a continuum of information?

I recently saw a talk on a ‘Research Information Centre’ being developed by Microsoft, a sort of portal for handling research projects and all the associated information. This is at an early stage of development but one of the features they were working on was an integrated search where you could add and subtract various items (PubMed, WOK, Google, Google Scholar, and various toll access info sources as well). Google Scholar could do this well. So, as Jean-Claude asks above, has anybody got any contacts with the developers? We could just talk to them…

The Directed evolution library construction Wiki Review

I and a few others have been working for some time on converting a review I wrote some time ago for publication on OpenWetWare. Jason Kelly put in the initial work in developing an area for reviews and it was from him that I had the idea of taking this forward.

The traditional review is a cornerstone of the research literature. The first port of call for virtually any researcher on moving into an unfamiliar area is to find a good review. These vary vastly in length, specificity, and usefulness. The one thing that all traditional reviews have in common is that they are out of date as soon as they are printed.

On the other hand collaborative tools on the web make it straightforward to update and maintain, or even correct, the text of a review. This is an area where community support can provide real added value in generating a useful and useable resource that can be maintained. By placing a review within the framework of a Wiki it would be possible for regular updates and links to the current literature to be made available and for an open discussion of differences of opinion to take place. Unlike e-notebooks there are fewer technical and user interface issues to be worked out. This really ought to be an easy win.

There are many details that remain to be worked out. Should these be moderated? Who can be trusted to maintain them? And, more practically, how can the regular maintenance of such reviews be guaranteed? On OpenWetWare anyone with an account can edit and anyone can view. The best practice for maintaining these reviews remains to be worked out. The first step is to get a community of people involved in doing this.

So the review is now in a reasonable state, and certainly more up to date than it was. Any comments are welcome, either here, or on the review itself. We are also starting to write a paper on the review which we hope to get published. I think this is necessary so as to provide a pointer from the traditional literature to the review. As it happens the review is one of the top hits on Google but this is not where people go to find literature. It doesn’t appear on a Google Scholar search (that top hit is the original review) and is obviously not on PubMed or WOK. Whether we can get the paper published remains to be seen. Time to start emailing journal editors…

Variations in reproduction charges – the results are in!

I blogged the other day about the way charges for reproducing figures in review articles appear to have gone ballistic. At that stage I only had the cost for J Biol Chem, which was a whopping USD73 to reproduce a figure. I had three other images I was using and was expecting around $20 for each of them, based on the website that Science Direct points you at for permissions. Well, I was wrong: they have now come back at USD3 each. So J Biol Chem, $73; Structure and J Mol Biol, $3. The question is, what am I paying for from J Biol Chem (apart from my own lack of organisation in getting similar but copyright free figures)? The other question is why it is $3 through one website whereas the one at Science Direct said it would be $20, especially seeing as they both seem to be run by the Copyright Clearance Center!

Joint NSF-EPSRC programme in Chemistry – an opportunity for ONS?

Looking at the EPSRC website I came across the following call for proposals involving collaboration between a US and UK programme:

http://www.epsrc.ac.uk/CallsForProposals/NSF-EPSRCChemistryProposals07.htm

Now, being an academic I’m up for any method of trying to get money out of the system, especially special programmes. But is there an opportunity here to do something quite exciting in the area of Open Notebooks for chemistry, where we take Jean-Claude’s experience and our Lab Blogbook system and try to build something that combines the best of both or, possibly better, something which is a superset of both? The biggest issue I can see is that it might not be seen as chemistry, but if we focus on the idea of getting the chemical data both in and back out again it might fly.

Deadline for outline applications is November 6th…

UK research council policies on open data

I was looking through the website of the Biotechnology and Biological Sciences Research Council the other day looking for policies on data sharing and open access. You can find the whole policy here but here are the edited highlights:

…BBSRC is committed to getting the best value for the funds we invest and believes that helping to make research data more readily available will reinforce open scientific enquiry and stimulate new investigations and analyses…

…BBSRC expects research data generated as a result of BBSRC support to be made available with as few restrictions as possible in a timely and responsible manner to the scientific community for subsequent research…In line with the BBSRC Statement on Safeguarding Good Scientific Practice, data should also be retained for a period of ten years after completion of a research project…

Ten years? I wonder how many UK scientists really work to that standard in practice (and I don’t include ‘I know it’s around here somewhere…’)?

…BBSRC supports the view that those enabling sharing should receive full and appropriate recognition by funders, their academic institutions and new users for promoting secondary research…

…BBSRC reserves the right to implement a more prescriptive approach to data sharing for research initiatives…

There are also detailed guidance notes and a FAQ for those who want to follow up.

The focus here is really on large coherent data sets rather than on aggregating or indexing the diffuse sets of online notebooks that I am more interested in. However I am in the process of writing some proposals that I want to embed an Open Notebook Science approach into, so it will be interesting to see what the referees’ comments come back looking like on those. All proposals to BBSRC since April this year have had to have a section on ‘Data Sharing’ that explicitly includes published and unpublished data.

The need for open data is getting more mainstream

Via Jean-Claude Bradley on UsefulChem, an article in Wired on making more of the ‘Dark Data’ out there available. As Jean-Claude notes this is focussed mainly on the notion of ‘failed experiments’ and ‘positive bias’ but there is much more background data out there: experiments that don’t quite make the grade for inclusion in the paper or are just one of many that may be useful from a statistical perspective. How many synthetic chemistry papers give the range of yields achieved for a reaction? Or for a PCR?

But it’s good to see more of this happening in the mainstream media and especially that Jean-Claude is getting the kudos for pushing the Open Notebook Science agenda. As this gets more mainstream it will filter through to the funders and other bodies.

Postscript: The article was originally commented on by Attilla at Pimm where there are more thoughtful comments on this.

Limits to openness – where is the boundary?

I’ve been fiddling with this post for a while and I’m not sure where it’s going but I think other people’s views might make the whole thing clearer. This is after all why we believe in being open. So here it is in its unfinished and certainly unclarified form. All comments gratefully received.

One issue that got a lot of people talking at the Scifoo lives on session on Monday (transcript here) was the question of where the boundaries between what should and should not be open lie. At one level it seems obvious: the structure of a molecule can’t really have privacy issues whereas it is clear that a patient’s medical data should remain private. The issue came up a lot at the recent All Hands UK E-science meeting where the issues were often about census data or geographical data that could pinpoint specific people. It seems obvious that people’s personal data should be private but where do we draw the line? I am uncomfortable with a position where it is ‘obvious’ that my data should be open but ‘obvious’ that personal medical or geographical data should not be. Ideally I would like to find a clear logical distinction.


Variations in reproduction charges for figures in a review article

Interesting morning negotiating the problems of getting permission to reproduce figures in a review I am currently writing. I wanted to reproduce figures from four papers, two from J Mol Biol (Elsevier), one from Structure (Cell Press), one from J Biol Chem (ASBMB). I could request permission for all three journals from copyright.com and as I was in a rush I did this. Looking back at the journal websites now it appears that I should have gone to copyright clearance centre for the J Mol Biol ones and for the article in Structure.

At copyright.com the JBC article was charged at USD73 to reproduce a figure and at the copyright clearance centre it seems that the others are around USD23. Putting aside the issues of the price differential here I am wondering when these charges came in. The last time I wrote a review like this there was an administrative charge of $1.50 (I think) on one of the images I borrowed. Admittedly that was a couple of years ago. These are mostly representations of the structure of biological macromolecules so there is a clear creative aspect in generating the original figures.

Anyway, I’m tied in now, so will report back what the final charges are. I did consider asking the authors of the original papers for different, non-copyright, versions of the pictures but for most of these I feel I am reviewing the paper and should therefore reproduce the image. Anybody got an alternative view on this? I already took the decision that I wouldn’t refer to any papers that I couldn’t view from my desktop (if I restricted myself to open access I wouldn’t have had anything to write about) but would I be better off leaving the pictures out?