UK E-Science all hands meeting – initial thoughts

If it hasn’t been obvious from what has gone previously, I am fairly new to the whole E-science world. I am definitely not in any form a computer scientist. I’m not a computer-phobe either, but my skills are pretty limited. It’s therefore a little daunting to be going to an e-science meeting for the first time. This is the usual story of not really knowing the people in this community and not necessarily having a clear idea of what people within the field think the priorities are.

The programme is available online and my first response on looking at it in detail was that I don’t even understand what most of the session titles mean. “OMII-UK” is a fairly impenetrable workshop title, for which the first talk is “Portalization Process for the Access Grid”. Now to be fair, these are the more specialised workshops, and many of the plenary session names make more sense. This is normal when you go to a conference outside your field, but it will be interesting to see how much of the programme makes sense.

One of the issues with e-science programmes is the process of bringing the ‘outside’ scientist into the fold. Systems such as our lab e-notebook require extra effort to use, certainly at the beginning, and during the development process there are often very few tangible benefits. Researchers are always time-poor, so they want to see benefits. In theory we are here to demonstrate and promote our e-notebook system, but I suspect this may be a case of preaching to the converted. It will be interesting to see a) whether we get much interest and b) whether the comments we get are more about the technical implementation or the practical side of actually using it to record experiments.

One of the great things about starting this blog has been the way it has facilitated discussion with others interested in open notebook science and open science in general. I am less sure it has brought in scientists who are interested in the details of the work in our notebook. My feeling is that this meeting may be a bit similar. On the other hand it may get us some good ideas for solving some of the problems of visualising the notebook, which I want to discuss in a future post.

So if you are at the meeting and want to see the notebook please drop by to the BBSRC booth on Wednesday afternoon and do say hello if you see a shortish balding bearded guy who is looking lost or confused.

p.s. Thanks to whoever was running a meeting upstairs today. I didn’t realise I was stealing your lunch!

When is open notebook science not?

Well when it’s not open obviously.

There are many ways to provide all the information imaginable while still keeping things hidden, or at least difficult to figure out or to find. The slogan ‘No insider information’ is useful because it provides a good benchmark to work towards. It is perhaps an ideal to aspire to rather than a practical target, but thinking about what we know that is not clear from the blog notebook has a number of useful results. Clearly it helps us to see how open we are being, but it is also helpful in identifying what it is that the notebook is not successfully capturing.

I have put up a series of posts recently in the ‘Sortase Cloning’ blog notebook. The experiments I did on 29th August worked reasonably well. However this is not clear from the blog. Indeed I suspect our hypothetical ‘outsider’ would have a hard time figuring out what the point of the experiment is. Certainly the what is reasonably obvious, although it may be hidden in the detail, but the why is not. So the question is how to capture this effectively. We need a way of noting that an experiment works and that the results are interesting. In this case we have used Sortase to do two things that I don’t believe have yet been reported: fluorescently label a protein, and ligate a protein to a piece of DNA. This therefore represents the first report of this type of ligation using Sortase.

Perhaps more importantly, how do we then provide the keys that let interested people find the notebook? UsefulChem does this by providing InChI and SMILES codes that identify specific molecules. Searching on the code with Google will usually bring UsefulChem up in the top few results if the compound has been used. Searching on ‘Sortase’, the enzyme we are doing our conjugation with, brings up our blog at number 14 or so. Not bad, but not near the top, and on the second page rather than the first. For other proteins with a wider community actively interested the blog would probably be much further down. Good tags and visibility on appropriate search engines (whatever they may turn out to be) are fairly critical to making this work.

How to get critical mass? Scifoo Lives on Session on Medicine and Web 2.0

I attended the session held on Nature Island as part of the Scifoo Lives On series being organised by Jean-Claude Bradley and Bertalan Mesko and wanted to record some of my impressions. The mechanics of the meeting itself were interesting. My initial reaction to the idea of meetings in Second Life was pretty sceptical. My natural inclination would have been to set up some sort of video cast or conference call. However there are advantages to the sense of actually having people milling around (and I apologise to all the people I bumped into or whose slides I inadvertently changed). It was good to ‘meet’ Jean-Claude, if not in the flesh then in the fur, and the sense of actually seeing a person, or at least a representation, somehow makes this seem more natural.

The disadvantage was that, as the dialogue is typed and (at least on my connection) was coming through quite slowly, it is difficult to have a real conversation. Several times I would start typing a question or comment and by the time I’d got to the end the conversation seemed to have moved on. Audio would be better, and this can be enabled, but it would probably have its own problems with people talking over each other. Maybe we would need to learn to put our hands up? The advantage of SL is that it is a single package which, once you’ve got it working, gives you the slides, the talk, and the questions all in one place. I use video conferencing quite regularly for meetings but there are real issues with ensuring that participants are compatible with the package you are using – in practice it is almost entirely used either for internal (multi-site) meetings or one-to-one meetings via Skype or something similar.

The Scifoo Lives On session itself has been covered by Jean-Claude, who also provides a transcript. An issue that came up with several of the posters is how big the community that supports them is, or needs to be, and how you go about growing that community. Sites that provide a ‘Wikipedia for Medical Information’ or a ‘digg for the bioscience literature’ are laudable efforts. Their success, like that of the other sites featured, depends on a large enough community actively contributing to the site and providing added value. Wikipedia ‘works’ (not wishing to get into the argument about accuracy here) because an enormous number of people give their time freely to add value. Arguably a number of other wiki sites aimed at smaller communities have not achieved as much as hoped because the community support isn’t enough to provide critical mass. It’s a tough world out there and competing sites will rise and fall, but some critical mass is required before a site attracts a big enough audience that it builds itself.

This is also true of open research more generally. We are a long way from the critical mass that makes it worthwhile for people to put in a little bit of effort on a regular basis because they get a lot back. The key question in my view is what are the best steps to take that will put us on the right path as fast as is sensible. And where can we find some sociologists to help us with this? An argument for using the term ‘Open Research’ may be that we need the help of the social sciences community to figure out how best to proceed.

Through a PRISM darkly

I don’t really want to add anything more to what has been said in many places (it has been rounded up well by Bora Zivkovic on Blog Around the Clock; see also Peter Suber for the definitive critique, with updates here and here). However there is a public relations issue here for the open science movement in general that I think hasn’t come up yet.

PRISM is an organisation with a specific message designed by PR people, which is essentially that ‘Mandating Open Access for government funded science undermines the traditional model of peer review’. We know this is demonstrably false, both in respect of Open Access scientific journals and, more generally, of making papers from other journals available after a certain delay. It is however conceivable, for someone with a particularly twisted mindset, to construe the actions of some members of the ‘Open Science Community’ as being intended to undermine peer review. We think of providing raw data online, or using blogs, wikis, pre-print archives or whatever other means to discuss science, as an exciting way to supplement the peer reviewed literature. PRISM, and other like-minded groups, will attempt to link Open Access and Open Science together so as to represent them as an attempt by ‘those people’ to undermine peer review.

What is important is control of the language. PRISM has focussed on the term ‘Open Access’. We must draw a sharp distinction between Open Access and ‘Open Science’ (or ‘Open Research’, which may be a better term). The key point is that while those of us who believe in Open Research are largely in favour of Open Access literature, publishing in the Open Access literature does not imply any commitment to Open Research. Indeed it doesn’t even imply a commitment to providing the raw data that supports a publication. It is purely and simply a commitment to provide specific peer reviewed research literature in a freely accessible form that can be freely re-used and re-mixed.

We need some simple messages of our own. Here are some suggested ideas:

‘Open Access literature provides public access to publicly funded research’

‘Publicly supported research should be reported in publicly accessible literature’

‘How many times should a citizen have to pay to see a report on research supported by their tax dollars?’

‘Open Access literature improves the quality of peer review’

The emphasis here is on ‘public’ and ‘literature’ rather than ‘government’ and ‘results’ or ‘science’.

I think there is also a need for some definitions that the ‘Open Research Community’ feels able to sign up to. Jean-Claude Bradley and Bertalan Mesko are running a session in Second Life on Nature Island next Tuesday (1600 UTC) which will include a discussion of definitions (see here for details and again the link to Bill Hooker’s good discussion of terminology). I probably won’t be able to attend but would encourage people to participate in whatever form possible so as to take this forward.

Blogs vs Wikis and third party timestamps

I wanted to pull out some of the comments Jean-Claude Bradley has made on the e-notebook posts and think them through in more detail.

Jean-Claude‘s comment on this post:

There may be differences between fields but, in organic chemistry, we could not make a blog by itself work as an electronic notebook. The key problem was the assumption that an experiment could be recorded without further modification. But a lab notebook doesn’t work like that – the idea is to record work as it is being done and make observations and conclusions over time. For experiments that can take weeks, a blog post can be updated but there is no version tracking. It is thus difficult to prove who-knew-what-when. Using a wiki page as a lab notebook page gives you results in real time and a detailed trail of additions and corrections, with each version being addressable with a different url.

Thinking about this, and looking at some examples on the UsefulChem Wiki, I wondered whether this is largely down to a different way of thinking about the notebook rather than to differences in field. I will use UsefulChem Exp098 as an example.

This experiment has been modified over time, and this can be tracked through the wiki system. Now my initial reaction to this was ‘but you should never modify the notebook’. In our system we originally had no means of making any changes (for precisely this reason) but eventually introduced one because typographical errors were otherwise very annoying (and because at the moment incorporating all the links requires double editing). However our common practice is not to modify the core of the post. Arguably this is a hangover from paper notebooks (never use whiteout, never remove pages).

In the case of the UsefulChem Wiki the rational objections to this kind of modification go away because it is properly tracked in a transparent fashion. However I still find myself somewhat uncomfortable with the notion that it has been changed. I wonder whether this could create an unfavourable impression in some minds? There is a good presentation with audio here in which Jean-Claude describes the benefits of, and the rationale for, this flexibility, as well as the history that brought them to the system they use.

Differences in use

The key difference, I think, between modifications in the respective systems is that in the UsefulChem case changes can be made over a period of weeks as corrections and details are sorted out. In the case of Exp098 this includes the analysis of a set of samples over the course of a week. There is then a series of further corrections spread over more than a month, although the main changes occur within a few weeks. Partly this is the nature of the experiment, which takes place over several days. We would probably handle this through multiple posts. I will try to set up a sandpit where I will see how we might have represented this experiment. The other element is the process of corrections and comments that are incorporated. I think we would implement this through comments in the blog rather than by correcting the post.

So a key difference here is the standard to which the experiment is presented. The aim for UsefulChem seems to be to provide a ‘finished’ or ‘polished’ representation of the experiment, whereas our approach is the more traditional ‘well, that’s what I wrote down, so that’s what is there’. The benefit of the former, as Jean-Claude points out in his talk, is that it is a great opportunity to improve the students’ standards of recording as they go. In principle, if things really are brought up to standard, they can be incorporated directly into the methods section of a thesis, perhaps even a paper. In my group, however, I would do this as the methodology is transferred from the lab book (in whatever form) to the regular reports required in our department.

Third party timestamps and hosted systems

Jean-Claude’s comment again:

I think the main other issue is the third party time stamp. That’s one reason I like using a service, like Wikispaces, hosted by a large stable company. It also makes it easier for people to replicate overnight at zero cost if they are interested in trying it.

These are two good reasons for using a standard engine. Independent time stamps are very useful in demonstrating history, whether for precedence or even for patent issues. If one of the key arguments in favour of open notebook science (or at least one of the main arguments against the idea that you risk being scooped) is that it provides a means of establishing precedence, then it is important to have a reliable time stamp. I don’t know what the legal status of a computer-based time stamp is, but I do wonder whether, from a legal perspective at least, an in-house time stamp in a well regulated and transparent system might be as good as (or no worse than) a signed and dated paper notebook. Again however, impressions are important, and while it may be impossible for me to fake a date in our system, that doesn’t mean people will necessarily believe me when I say so.
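As a purely illustrative sketch, and not a description of how our system actually works, one way an in-house system could make its own time stamps harder to dispute is to chain a cryptographic hash through successive entries, so that altering any earlier entry or its date breaks every hash recorded after it:

```python
import hashlib
import json
from datetime import datetime, timezone

def add_entry(chain, content):
    """Append a notebook entry to a hash chain (illustrative sketch only).

    Each record stores the entry text, a UTC time stamp, and a hash that
    covers the previous record's hash, so changing an earlier entry or its
    date invalidates everything recorded after it.
    """
    previous_hash = chain[-1]["hash"] if chain else "0" * 64
    record = {
        "content": content,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "previous_hash": previous_hash,
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode("utf-8")
    ).hexdigest()
    chain.append(record)
    return record

notebook = []
add_entry(notebook, "PCR of construct A, 30 cycles, annealing at 55 C")
add_entry(notebook, "Gel of PCR products: single band at expected size")
# Occasionally publishing the latest hash somewhere outside our own control
# (even in a public blog post) would anchor the whole chain to an external date.
```

This still isn’t a third-party time stamp of course, but it would at least make the in-house record tamper-evident rather than simply trusted.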

The second point, that using a generic hosted system makes it much easier for other people to replicate, is also a good one. A case could be made that if my group are doing open notebook science with a closed system, this at least partly defeats the purpose of the exercise. My answer is that we are trying to develop something that will take us further than a generic hosted system can – perhaps it may be possible to retro-fit what we develop onto a generic blog or wiki at a later date, but currently this isn’t possible. People are of course welcome to look at and use the system, this is part of what we have the grant for after all, but I recognise that this creates a barrier which we will need to overcome. I think we will just have to see how this plays out over time.

Finally…

Final comment from Jean-Claude:

But I think there is also a lot more to learn about the differences between how scientific fields (and researchers) operate. We may gain a better appreciation of this if a few of us do Open Notebook Science.

Couldn’t agree more. I have learnt a lot from doing this about how we think about experiments and how things are ordered (or not). There is a lot to learn from looking at how systems do and don’t work, and how this differs between fields.

The Southampton E-lab Blog Notebook – Part 3 Implementation

In Part 1 and Part 2 I discussed the criteria we set for our system to be successful and the broad outlines of a strategy for organisation. In this part I will outline how we apply this strategy in practice. This will not deal with the technical implementation and software design of the blog engine, which will be discussed later. I will note, however, that we are not using a standard blog engine but one which has been custom built for handling lab-based blogging.


Followup on ‘open methods’

I wanted to follow up on the post I wrote a few days ago in which I quoted a post from Black Knight on the concept of making methodology open. The point I wanted to make was that scientists in general might be even more protective of their methodology than they are of their data. However, I realised afterwards that I may have given the impression that I thought BK was being less open than he ‘should’ be, which was not my intention. Anyway, yesterday I spent several hours reading through his old posts (thoroughly enjoyable and definitely worth the effort) and discovered quite a few posts where he makes detailed and helpful methodological suggestions.

For example here is a post on good methods for recovery of DNA from gels, as well as a rapid response to a methodological query. Here is another valuable hint on getting the best from PCR (though I would add this is more true for analytical PCR than if you just want to make as much DNA as possible). Nor is the helpful information limited to lab methodology. Here is some excellent advice on how to give a good seminar. These are good examples of providing just the sort of information I was writing about, and indeed of open notebook science in action. I’d be interested to know how many people in the School now use the recipes suggested here.

p.s. his post Word of the week – 38

deuteragonist , n.

In a structural laboratory, one who labels his samples with 2H.

e.g. “Jill says that to be successful at small angle neutron scattering you have to be a good deuteragonist.”

c.f. protagonist

nearly got my computer completely covered in tea. And there is much more where that came from. They are probably funnier if you are an ex-pat Australian working in the UK but hey, that’s life.

The Southampton E-lab blog notebook – Part 2 ELN strategy

In Part 1 I outlined our aims in building an ELN system and the criteria we hope to satisfy. In this post I will discuss the outline of the system that has been developed.

The WebLog as an ELN system

A blog is a natural system on which to build an ELN. It provides free text entry, automatic date recording, the ability to include images and other files, and a system for publishing and collecting comments and advice. In this configuration a blog can function essentially as ‘electronic paper’ that can be used as a substitute for a paper notebook.

The blog can be used in a more sophisticated fashion through internal links. Procedure B uses the material produced in Procedure A, so a link is inserted. Such links can go some way towards providing metadata: ‘Material from Procedure A is used in Procedure B’. Following the links backwards and forwards can provide some measure of sample or lot tracking, an understanding of where samples have come from and where they are going.

This is a step forward but it is not ideal. It is one thing to say that Gel X has samples from PCR Y and PCR Y was carried out with samples from Culture Z. But which samples are which? The connections between items can probably be inferred by a human reader but not by a machine. If we wish to have a system where a machine can tell that Lane 2 on Gel X is from a PCR of material from Culture Z three hours after induction we need to be able to track samples through the system. For this to work the system needs to assign a unique ID to each ‘sample’.

The ‘One item-one post’ model

A blog does provide a natural way of assigning IDs: each post carries its own ID. This post has ID #8 in this blog (notwithstanding any pretty human-readable ID at the top of the page). Therefore if each ‘sample’ has its own post it automatically has an ID (not strictly a UID, as Andrew pointed out to me this morning). See here and here for examples of ‘product’ or sample posts (a PCR product and a primer respectively). Procedures take samples and process them to generate new samples. Thus if samples each have their own post, any procedures will also need their own post. Product posts link to the procedure post in which they are generated (and vice versa) and procedures link back to the input materials. See here for an example of a PCR reaction. By following the links it is possible to trace a sample through the system.
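To make the idea concrete, here is a minimal sketch of the ‘one item-one post’ model (the names and data structures are invented for illustration and are not our actual blog engine), showing how following the links allows a sample to be traced back through the procedures that produced it:

```python
# Every sample and procedure is a 'post' with its own ID. Procedures link to
# their input samples; product samples link back to the procedure that made them.
posts = {
    1: {"type": "sample",    "title": "Culture Z",     "made_by": None},
    2: {"type": "procedure", "title": "PCR Y",         "inputs": [1]},
    3: {"type": "sample",    "title": "PCR product",   "made_by": 2},
    4: {"type": "procedure", "title": "Gel X",         "inputs": [3]},
    5: {"type": "sample",    "title": "Gel X, lane 2", "made_by": 4},
}

def trace_back(post_id, depth=0):
    """Follow the links backwards to show where a sample ultimately came from."""
    post = posts[post_id]
    print("  " * depth + post["title"])
    if post["type"] == "sample" and post["made_by"] is not None:
        parents = [post["made_by"]]
    elif post["type"] == "procedure":
        parents = post["inputs"]
    else:
        parents = []
    for parent_id in parents:
        trace_back(parent_id, depth + 1)

trace_back(5)
# Gel X, lane 2
#   Gel X
#     PCR product
#       PCR Y
#         Culture Z
```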

The concept can be taken further. By categorising samples and procedures into classes it is possible to automatically capture a great deal of metadata about the experiment. Several pieces of data are already available: ‘Procedure X made Product Y using materials A and B’. By adding that Procedure X is a PCR reaction and Product Y is a piece of double-stranded DNA, significantly more can be inferred, e.g. ‘PCR reactions (can) make double-stranded DNA’ or, in a more sophisticated fashion, ‘All PCR reactions contain Mg but Vent polymerase does not work in PCR reactions with MgCl2’. In the one item-one post model the blog becomes a repository of information about relationships between items. Adding categories, or tags, to these items adds much more value to the repository. Such a data store has some of the characteristics of a triple store including, in principle, the potential for automated reasoning.
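Again purely as a sketch, with invented names, here is how categorised posts could be read as simple subject-predicate-object statements and queried in a triple-store-like way:

```python
# Each categorised post and link contributes one or more simple statements.
triples = [
    ("Procedure_X", "is_a",     "PCR_reaction"),
    ("Procedure_X", "used",     "Vent_polymerase"),
    ("Procedure_X", "produced", "Product_Y"),
    ("Product_Y",   "is_a",     "double_stranded_DNA"),
]

def query(subject=None, predicate=None, obj=None):
    """Return every triple matching the pattern; None acts as a wildcard."""
    return [
        (s, p, o) for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# A very simple inference: what kinds of thing do PCR reactions make?
pcr_reactions = {s for s, p, o in query(predicate="is_a", obj="PCR_reaction")}
products = {o for s, p, o in query(predicate="produced") if s in pcr_reactions}
print({o for s, p, o in triples if s in products and p == "is_a"})
# {'double_stranded_DNA'}
```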

Such an approach does, however, have distinct disadvantages. It requires the creation of a large number of posts, currently by hand. This creates two problems: firstly it is a lot of work, and secondly it fills the blog with posts which, for a human reader, do not contain any useful information, making it quite difficult to read. Neither of these is an insurmountable problem, but they make the process of recording data more complex and less appealing to the user. The challenge therefore is to provide a system that makes this easy for the user and encourages them to provide the extra information.

In Part 3 I will start to cover the implementation of the system.

Open methods vs open data – might the former be even harder?

Continuing the discussion set off by Black Knight, and continued here and by Peter Murray-Rust, I was interested in the following comment in Black Knight’s followup post (my emphasis, and I have quoted slightly out of context to make my point).

But all that is not really what I wanted to write about now. The OpenWetWare (have you any idea how difficult it is to type that?) project is a laudable effort to promote collaboration within the life sciences. And this is cool, but then I realize that the devil is in the details.

Share my methods? Yeah! Put in some technical detail? Yea–hang on.

A lot of the debate has been about posting results and the risk of someone stealing them or otherwise using them. But in bioscience the competitive advantage that a laboratory has can lie in the methods. Little tricks that don’t necessarily make it into the methods sections of papers, that sometimes researchers aren’t even entirely aware of, but which form part of the culture of the lab.

The case for sharing methods is, at least on the surface, easier to make than the case for sharing data. A community can really benefit from having all those tips and tricks available: you put yours up and I’ll put mine up, and everyone benefits. But if there is something that gives you a critical competitive advantage, how easy is that going to be to give up? An old example is the ‘liquid gold’ transformation buffer developed by Doug Hanahan (read the story in Sambrook and Russell, third edition, p1.105 or online here – I think; it’s not open access). Hanahan ‘freely and generously distributed the buffer to anyone whose experiments needed high efficiencies…’ (Sambrook and Russell) but he was apparently less keen to make the recipe available. And when it was published (Hanahan, 1983) many labs couldn’t achieve the same efficiencies, again because of details like a critical requirement for absolutely clean glassware (how clean is clean?). How many papers these days even include or reference the protocol used for transformation of E. coli? Yet this could, and did, give a real competitive advantage to particular labs in the early 1980s.

So, if we are to make a case for making methodology open we need to tackle this. I think it is clear that making this knowledge available is good for the science community. But it could be a definite negative for specific groups and people. The challenge lies in making sure that altruistic behaviour that benefits the community is rewarded. And this won’t happen unless metrics of success and community stature are widened to include more than just publications.