PMR’s Open Notebook Project continued

This is a reply continuing the conversation with Peter Murray-Rust on his plans for an Open Notebook Science based project. I have cut a lot of the context to keep the post to a manageable size, so if you want to track back, see the original two posts from Peter, my response, and Peter’s response to that in full.

I should add that I am not a coder in any form so where this gets technical I am proposing things in principle (or hand waving as some might put it :).

PMR: We’ll create plots for ALL molecules and spectra. However it may not always be possible to identify what is “wrong”. Thus a bad TMS value (e.g. if the solvent is wrong) will shift all the values. So we may give a revised line (y = x –> y = x + c).

…and…

PMR: Yes. We’ll probably do this by RMS deviation and we could colour the table of contents or something similar. It may not be easy to make generic corrections over several thousand files. (Hang on – the files are in CML so it’s trivial).
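To make the y = x + c idea and the RMS deviation concrete, here is a minimal sketch. It assumes that x is the calculated shift and y the observed one, and that the best constant c is simply the mean difference between the two; the function names and the numbers are invented for illustration and are not taken from Peter’s protocol.

```python
import math

def fit_offset(calculated, observed):
    """Least-squares constant c for observed ~ calculated + c (the mean residual)."""
    residuals = [o - c for c, o in zip(calculated, observed)]
    return sum(residuals) / len(residuals)

def rms_deviation(calculated, observed, offset=0.0):
    """Root-mean-square deviation of observed from (calculated + offset)."""
    squared = [(o - (c + offset)) ** 2 for c, o in zip(calculated, observed)]
    return math.sqrt(sum(squared) / len(squared))

# Made-up shifts in ppm, for illustration only: every value is off by the same amount.
calculated = [10.0, 20.0, 30.0, 40.0]
observed = [12.0, 22.0, 32.0, 42.0]

c = fit_offset(calculated, observed)
print("offset c:", c)
print("RMS deviation against y = x:", rms_deviation(calculated, observed))
print("RMS deviation against y = x + c:", rms_deviation(calculated, observed, c))
```

A large RMS deviation that collapses once c is applied would point to a systematic shift (such as a bad reference value) rather than a wrong molecule, and the uncorrected value could drive the colouring of a table of contents.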

…and…

PMR: Yes. It may not be trivial to correct them – we shan’t have a chemical editor in the Wiki, so it may be an idea to have a molecule upload. However the details often bite hard.

The most practical approach may simply be to let people flag things or suggest other solutions, either through comments on a blog or as additions on the wiki (which could all be aggregated into an RSS feed), and then to re-run the spectra with the new molecule/assignment/solvent by hand (or rather put them in the queue by hand). I imagine having a ‘Wrong Spectra Blog’ which has both a conventional comment button and a ‘propose correction’ button. The latter still posts a comment but flags it for easy aggregation and possibly prompts people to drop in an InChI/CML/SMILES code. Aggregation from multiple RSS feeds and filtering is doable, e.g. via Yahoo Pipes: grab the RSS feed(s) from the comments and then filter for a specific tag code. The comment is then in effect an RDF triple.
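For what it is worth, the grab-and-filter step does not need much machinery. The sketch below uses only the Python standard library and assumes plain RSS 2.0 comment feeds in which a ‘propose correction’ comment carries a tag code in its text. The tag "#propose-correction", the function name and the sample feed are all hypothetical.

```python
import xml.etree.ElementTree as ET

CORRECTION_TAG = "#propose-correction"  # hypothetical tag code added by the button

def proposed_corrections(rss_documents, tag=CORRECTION_TAG):
    """Yield (title, link, description) for each feed item whose text carries the tag."""
    for xml_text in rss_documents:
        root = ET.fromstring(xml_text)
        for item in root.iter("item"):
            description = item.findtext("description", default="")
            if tag in description:
                yield (
                    item.findtext("title", default=""),
                    item.findtext("link", default=""),
                    description,
                )

# Invented sample feed standing in for a blog's comment feed.
SAMPLE_FEED = """<rss version="2.0"><channel>
<title>Wrong Spectra Blog comments</title>
<item><title>Comment on spectrum 17</title>
<link>http://example.org/spectra/17#c1</link>
<description>#propose-correction solvent should be DMSO</description></item>
<item><title>Comment on spectrum 17</title>
<link>http://example.org/spectra/17#c2</link>
<description>Interesting case.</description></item>
</channel></rss>"""

flagged = list(proposed_corrections([SAMPLE_FEED]))
for title, link, description in flagged:
    print(title, link, description)
```

Passing several feed documents in the list gives the aggregation across blog and wiki feeds; the flagged items could then be queued by hand for re-running.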

Although if there is an entry page then presumably someone could run an existing spectrum against a new assignment/molecule. This is mainly a question of providing a link to make this easy for people, which could be in the template. Then how do you associate an attempted correction with the original ‘wrong’ spectrum? Well, that is where the details bite, I suspect.
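One way to think about the association problem, offered very much as hand waving in code form: if every run is a record that is never overwritten, a correction can simply carry a pointer to the job it is trying to improve on. The class and field names below are hypothetical, and the molecule and solvent strings are placeholders.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)  # frozen: a job record cannot be modified once created
class Job:
    job_id: int
    spectrum_id: str
    molecule: str                     # e.g. an InChI or SMILES string
    solvent: str
    parent_id: Optional[int] = None   # the job this one attempts to correct, if any

class JobLog:
    """Append-only list of jobs; corrections add new jobs rather than editing old ones."""

    def __init__(self):
        self._jobs = []

    def submit(self, spectrum_id, molecule, solvent, parent_id=None):
        job = Job(len(self._jobs) + 1, spectrum_id, molecule, solvent, parent_id)
        self._jobs.append(job)
        return job

    def history(self, job):
        """Walk back from a job to the original run it descends from."""
        chain = [job]
        while chain[-1].parent_id is not None:
            chain.append(self._jobs[chain[-1].parent_id - 1])
        return chain

log = JobLog()
original = log.submit("spectrum-17", "CCO", "CDCl3")
retry = log.submit("spectrum-17", "CCO", "DMSO", parent_id=original.job_id)
print([j.job_id for j in log.history(retry)])
```

The link on the entry page would then only need to pass the original job’s identifier along with the new molecule or solvent.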

There is also the issue of implementation of all of this:

PMR: Yes. I am not yet sure how to insert machine-generated pages into a Wiki and we’ll value help here. The pages will certainly NOT be editable. Any refinement of the protocol or correction will generate a NEW job, not overwrite the last one.

…and…

PMR: I think we are clearly going to have a new blog. What I’m not clear is how we post comments from the blog to the Wiki and alert the Wiki from the blog.

This is not a world away from the blogging instruments developed in Jeremy Frey’s group at Southampton Uni. In that case the instruments themselves post a blog entry each time a sample is analysed. Here we are talking about a computational analysis but the principle is the same. Both WordPress and MediaWiki have (I believe – this is where I get out of my depth) quite sophisticated APIs that could enable automated posting.
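As a rough idea of what automated posting to a wiki might involve, the sketch below builds the form body for an edit request to the MediaWiki action API (action=edit). It only constructs the request and sends nothing. A real script would first have to log in and obtain a CSRF token from the wiki, and the page title, text, token value and helper name here are placeholders I have assumed for illustration.

```python
from urllib.parse import urlencode, parse_qs

def build_edit_request(title, text, summary, csrf_token):
    """Return a URL-encoded POST body for a MediaWiki action=edit request."""
    params = {
        "action": "edit",
        "format": "json",
        "title": title,
        "text": text,
        "summary": summary,
        "createonly": "1",    # refuse to overwrite a page that already exists
        "token": csrf_token,  # must come from the wiki; placeholder in this sketch
    }
    return urlencode(params)

body = build_edit_request(
    title="Spectra/job-42",               # hypothetical page name
    text="RMS deviation: 2.1 ppm",        # made-up result
    summary="Automated result page",
    csrf_token="PLACEHOLDER-TOKEN",
)
print(parse_qs(body)["title"])
```

Creating pages with createonly set fits the rule that a correction produces a new page for a new job rather than an edit to an old one.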

There is some information at OpenWetWare (see particularly Julius Luck’s comments) about interfacing with MediaWiki that came up during the recent discussion on lab books. I believe there was an intention at some point to attempt to integrate the OpenWetWare blogs (like this one) with the OpenWetWare wiki at some level but I don’t think this is currently a priority for them.

I imagine it ought to be possible to write plugins for both MediaWiki and WordPress that would provide a button for each post/comment to ‘Publish to Wiki/Blog’. I don’t think this needs to be automated as I would see this as precisely the point at which human intervention is helping things along. The researcher may wish to publish a set of Blog comments to the Wiki to encourage more detailed discussion or conversely may wish to post a good solution from the Wiki to the Blog to alert people that something interesting has popped up.

CN: An interesting question which would arise from this combination of approaches is ‘where is the notebook?’ to which I will admit I don’t have an answer. But I’m not sure that it matters.

PMR: I am not worried about where the notebook is (though it could be difficult to “lift it up” by a single root).

I think this shows that by expanding the functionality of the ‘lab notebook’ we are starting to break our underlying idea of what that notebook is. Experimental scientists think very much in terms of a monolithic object bound in nice hard covers (even though this bears very little resemblance to reality), and the idea that it can become a diffuse object distributed through a series of different repositories and journals is a bit discomforting. What is becoming clear to me is that we are starting to capture much more than just the raw data or procedures that go into the lab book itself. Keeping an electronic lab notebook is more work than a paper one, primarily because most of us don’t use paper notebooks very effectively.

Postscript: I’ve edited this slightly as of 10:36 GMT 14-October because some things were unclear. I’ve used the tag pmr-ons to collect these posts.

How best to do the open notebook thing…a nice specific example

Peter Murray-Rust is going to take an Open Notebook Science approach to a project on checking whether NMR spectra match up with the molecules they are asserted to represent. The question he poses is how best to organise this. The form of an open notebook seems to be a theme at the moment with both discussions between myself and Jean-Claude Bradley (see also the ONS session at SFLO and associated comments) as well as an initiative on OpenWetWare to develop their Wiki notebook platform with more features. There are many ideas around at the moment so Peter’s question is a good specific example to think about.

As I understand Peter’s project the plan is as follows:

  1. Obtain NMR spectra from a public database and carry out a high level QM calculation to see whether each spectrum appears consistent with the molecule that it is supposed to represent.
  2. Expose the results of this analysis in a useful form.
  3. Identify and prioritise examples where the spectrum appears to be ‘wrong’. The spectrum could be misassigned, the actual molecule could be wrong, or the calculation could be wrong.
  4. Obtain feedback on the ‘wrong’ cases and attempt to correct them through a process of discussion and refinement

So there are several requirements. The raw data needs to be presented in a coherent and organised fashion. Specific examples need to be ‘pushed out’ or ‘alerted’ so that knowledgeable and interested people are made aware and can comment. Separately from commenting, further detailed discussion needs to be enabled and recorded. In addition there are the usual requirements for a notebook or a scientific record. The raw data must remain inviolate and any modifications must be recorded along with the process that generated the data. There will also presumably be a requirement to record thought processes and realisations as the process goes forward.

My suggestion is as follows:

  • The raw data is generated by a computational and repetitive process so I imagine it is highly structured. I would use a template web page, possibly sitting within a Wiki but not editable, to expose these. This would include details of what was run and how and when. This would be machine generated as part of the analysis. Obviously appropriate tagging will play an important role in allowing people to parse this data.
  • A blog to provide two things: an informal running commentary of what is going on, what the current thought processes are, and what is being run; and ‘alerts’ of specific examples which are interesting (or ‘wrong’). This is largely human generated, although the ‘alerts’ could be automated.
  • A wiki to enable discussion of specific examples and detailed comparisons by outside and inside observers. As Peter suggests in his draft paper, specific groups, both functional and academic, may show up as problems but predicting these in advance is challenging. A wiki provides a free form way of letting people identify and collate these. It may be appropriate to (automatically or manually) post comments from the blog into the wiki (which would also provide reliable time stamps and histories, not available in most standard blog engines).
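To illustrate the first suggestion, a machine-generated page could be little more than a fixed template filled in from each job’s record. This is a minimal sketch using Python’s standard string.Template; the field names, the method string and the values are all invented, and a real page would presumably be HTML or wiki markup rather than plain text.

```python
from string import Template

# Hypothetical page layout: what was run, how, when, and the headline result.
PAGE = Template(
    "Spectrum: $spectrum_id\n"
    "Molecule: $molecule\n"
    "Method: $method\n"
    "Run at: $run_at\n"
    "RMS deviation: $rmsd ppm\n"
    "Tags: $tags\n"
)

def render_page(record):
    """Fill the template from one job record; tags are joined into a single line."""
    fields = dict(record, tags=", ".join(record["tags"]))
    return PAGE.substitute(fields)

# Made-up record for illustration only.
record = {
    "spectrum_id": "spectrum-17",
    "molecule": "CCO",
    "method": "placeholder QM method",
    "run_at": "2007-10-14T10:36Z",
    "rmsd": 2.1,
    "tags": ["pmr-ons", "possibly-wrong"],
}

page = render_page(record)
print(page)
```

Because substitute raises an error when a field is missing, an incomplete record fails loudly instead of producing a page with gaps.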

So my answer to Peter’s question which might have been paraphrased as ‘Which engine is the best to use?’ is all of them. They all provide functionality that is important for the project as I understand it but none of them provide enough functionality on their own. An interesting question which would arise from this combination of approaches is ‘where is the notebook?’ to which I will admit I don’t have an answer. But I’m not sure that it matters.

This doubling up mirrors current practice, both in Jean-Claude’s group, where the UsefulChem wiki is the core notebook but the Blog is used for high level discussion, and in my own work, where I am moving towards using this Blog for higher level discussion of results but the chemtools blog as more of a data repository. At Southampton we are thinking about the notion of ‘publishing’ from the Blog to a Wiki once a protocol or set of results is sufficiently established as Step 1 on the way to the paper.

Finally a throwaway suggestion. Peter, if you want to get a lot of spectra with a lot of associated molecules, without any concerns about publisher copyrights, then consider opening this up as a service for graduate students to check their NMR assignments. I bet you get inundated…