The truth…well most of the truth anyway

Notebook collection
Image by Dvortygirl via Flickr

I had a pretty ropey day today. Failing for 45 minutes to do some simple algebra. Transformation overnight that didn’t work…again…and just generally being a bit down what with the whole imploding British science funding situation. But in the midst of this we did one very cool and simple experiment, one that worked, and one that actually has some potentially really interesting implications.

The only thing is…I can’t tell you about it. It’s not my project and the result is potentially patentable. At one level this is great…”Neylon in idea that might have practical application shock!”…but as a person who believes in the value of sharing results and information as fast as is practical it is frustrating. Jean-Claude, when talking about Open Notebook Science, has used a description which I think captures the value here: “If it’s not in the notebook it hasn’t been done”. I can’t really live up to that statement.

I don’t have any simple answers to this. In practice, the world is not a simple place to live in. My job is fundamentally about collaborating with and supporting other scientists. I enjoy this but it does mean that a lot of what I do is really other people’s projects – and mostly ones where I don’t control the publication schedule. The current arrangement of the international IP system effectively mandates secrecy for as long as can be managed, exactly the opposite of what it was intended to do. The ONS badges developed by Jean-Claude, Andrew, and Shirley are great but they solve the communication problem, the one of explaining what I have done, not the philosophical one of being able to do what I feel I should. And for reasons too tedious to go into it’s not straightforward to put them on my notebook.

Like I say, life in the real world is complicated. As much as anything this is simply a marker to say that not everything I do is made public. I do what I can when I can, with what I can. But it’s a long way from perfect.

But hey, at least I had an idea that worked!


A little bit of federated Open Notebook Science

Girl Reading a Letter at an Open Window
Image via Wikipedia

Jean-Claude Bradley is the master when it comes to organising collaborations around diverse sets of online tools. The UsefulChem and Open Notebook Science Challenge projects both revolved around the use of wikis, blogs, GoogleDocs, video, ChemSpider and whatever tools are appropriate for the job at hand. This is something that has grown up over time but is at least partially formally organised. At some level the tools that get used are the ones Jean-Claude decides will be used and it is in part his uncompromising attitude to how the project works (if you want to be involved you interact on the project’s terms) that makes this work effectively.

At the other end of the spectrum is the small scale, perhaps random collaboration that springs up online, generates some data and continues (or not) towards something a little more organised. By definition such “projectlets” will be distributed across multiple services, perhaps uncoordinated, and certainly opportunistic. Just such a project has popped up over the past week or so and I wanted to document it here.

I have for some time been very interested in the potential of visualising my online lab notebook as a graph. The way I organise the notebook is such that it, at least in a sense, automatically generates linked data, and for me this is an important part of its potential power as an approach. I often use a very old graph visualisation in talks I give about the notebook as a way of indicating this potential, which I wrote about previously, but we’ve not really taken it any further than that.

A week or so ago, Tony Hirst (@psychemedia) left a comment on a blog post which sparked a conversation about feeds and their use for generating useful information. I pointed Tony at the feeds from my lab notebook but didn’t take it any further than that. Following this he posted a series of graph visualisations of the connections between people tweeting at a set of conferences and then the penny dropped for me…sparking this conversation on twitter.

@psychemedia You asked about data to visualise. I should have thought about our lab notebook internal links! What formats are useful? [link]

@CameronNeylon if the links are easily scrapeable, it’s easy enough to plot the graph eg http://blog.ouseful.info/2010/08/30/the-structure-of-ouseful-info/ [link]

@psychemedia Wouldn’t be too hard to scrape (http://biolab.isis.rl.ac.uk/camerons_labblog) but could possibly get as rdf or xml if it helps? [link]

@CameronNeylon structured format would be helpful… [link]

At this point the only part of the whole process that isn’t publicly available takes place as I send an email to find out how to get an XML download of my blog and then report back via Twitter.

@psychemedia Ok. XML dump at http://biolab.isis.rl.ac.uk/camerons_labblog/index.xml but I will try to hack some Python together to pull the right links out [link]

Tony suggests I pull out the date and I respond that I will try to get the relevant information into some sort of JSON format, and that I’ll try to do that over the weekend. Friday afternoons being what they are, and Python being what it is, I actually manage to do this much quicker than I expect, and so I tweet that I’ve made the formatted data, raw data, and script publicly available via DropBox. Of course this is only possible because Tony tweeted the link above to his own blog describing how to pull out and format data for Gephi, and it was easy for me to adapt his code to my own needs; an open source win if there ever was one.
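For the curious, the core of such a script is small. This is a minimal sketch rather than the actual code: it assumes the XML dump uses RSS-style `item`/`link`/`description` elements, which the real dump may not, and the JSON layout is just one of several that graph tools will accept.

```python
import json
import re
import xml.etree.ElementTree as ET

BASE = "http://biolab.isis.rl.ac.uk/camerons_labblog"

def internal_links(xml_text, base=BASE):
    """Pull (source_post, linked_post) pairs out of a blog XML dump.

    Assumes each post is an <item> with a <link> and a <description>
    holding the HTML body -- the real dump's element names may differ.
    """
    edges = []
    root = ET.fromstring(xml_text)
    for item in root.iter("item"):
        source = item.findtext("link", default="")
        body = item.findtext("description", default="")
        # Any href pointing back into the notebook counts as an internal edge
        for target in re.findall(r'href="([^"]+)"', body):
            if target.startswith(base):
                edges.append((source, target))
    return edges

def to_json(edges):
    """Format the edge list as JSON nodes/links, roughly the shape
    graph visualisation tools like Gephi can import."""
    nodes = sorted({n for e in edges for n in e})
    index = {n: i for i, n in enumerate(nodes)}
    return json.dumps({
        "nodes": [{"id": i, "url": n} for i, n in enumerate(nodes)],
        "links": [{"source": index[s], "target": index[t]} for s, t in edges],
    }, indent=2)
```

The point is less the code than the pattern: the scrape, the reformat, and the output are each a few lines, so sharing them costs almost nothing.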

Despite the fact that Tony took time out to put the kettle on and have dinner, and I went to a rehearsal, by the time I went to bed on Friday night Tony had improved the script and made it available (with revisions) via a Gist, identified some problems with the data, and posted an initial visualisation. On Saturday morning I transfer Tony’s alterations into my own code, set up a local Git repository, push to a new Github repository, and run the script over the XML dump as is (results pushed to Github). I then “fix” the raw data by manually removing the result of a SQL injection attack – note that because I commit and push to the remote repository I get data versioning for free – this “fixing” is transparent and recorded. Then I re-run the script, pushing again to Github. I’ve just now updated the script and committed once more following further suggestions from Tony.

So over a couple of days we used Twitter for communication, DropBox, GitHub, Gists, and Flickr for sharing data and code, and the whole process was carried out publicly. I wouldn’t have even thought to ask Tony about this if he hadn’t been publicly posting his visualisations (indeed I remember, but can’t find, an ironic tweet from Tony a few weeks back about how it would clearly be much better to publish in a journal in 18 months’ time, when no-one could even remember what the conference he was analysing was about…).

So another win for open approaches. Again, something small, something relatively simple, but something that came together because people were easily connected in a public space and were routinely sharing research outputs, something that by default spread into the way we conducted the project. It never occurred to me at the time, I was just reaching for the easiest tool at each stage, but at every stage every aspect of this was carried out in the open. It was just the easiest and most effective way to do it.


What would scholarly communications look like if we invented it today?


I’ve largely stolen the title of this post from Daniel Mietchen because it helped me to frame the issues. I’m giving an informal talk this afternoon and will, as I frequently do, use this post to think through what I want to say. Needless to say this whole post is built to a very large extent on the contributions and ideas of others that are not adequately credited in the text here.

If we imagine what the specification for building a scholarly communications system would look like there are some fairly obvious things we would want it to enable. Registration of ideas, data or other outputs for the purpose of assigning credit and priority to the right people is high on everyone’s list. While researchers tend not to think too much about it, those concerned with the long term availability of research outputs would also place archival and safekeeping high on the list as well. I don’t imagine it will come as any surprise that I would rate the ability to re-use, replicate, and re-purpose outputs very highly as well. And, although I won’t cover it in this post, an effective scholarly communications system for the 21st century would need to enable and support public and stakeholder engagement. Finally this specification document would need to emphasise that the system will support discovery and filtering tools so that users can find the content they are looking for in a huge and diverse volume of available material.

So, filtering, archival, re-usability, and registration. Our current communications system, based almost purely on journals with pre-publication peer review, doesn’t do too badly at archival, although the question of who is actually responsible for doing the archival, and hence paying for it, doesn’t always seem to have a clear answer. Nonetheless the standards and processes for archiving paper copies are better established, and probably better followed in many cases, than those for digital materials in general, and certainly for material on the open web.

The current system also does reasonably well on registration, providing a firm date of submission, author lists, and increasingly descriptions of the contributions of those authors. Indeed the system defines the registration of contributions for the purpose of professional career advancement and funding decisions within the research community. It is a clear and well understood system with a range of community expectations and standards around it. Of course this is circular as the career progression process feeds the system and the system feeds career progression. It is also to some extent breaking down as wider measures of “impact” become important. However for the moment it is an area where the incumbent has clear advantages over any new system, around which we would need to grow new community standards, expectations, and norms.

It is on re-usability and replication where our current system really falls down. Access and rights are big issues here, but ones that we are gradually pushing back on. The real issues are much more fundamental. It is essentially assumed, in my experience, by most researchers that a paper will not contain sufficient information to replicate an experiment or analysis. Just consider that. Our primary means of communication, in a philosophical system that rests almost entirely on reproducibility, does not enable even simple replication of results. A lot of this is down to the boundaries created by the mindset of a printed multi-page article. Mechanisms to publish methods, detailed laboratory records, or software are limited, often leading to a lack of care in keeping and annotating such records. After all, if it isn’t going in the paper why bother looking after it?

A key advantage of the web here is that we can publish a lot more with limited costs and we can publish a much greater diversity of objects. In principle we can solve the “missing information” problem by simply making more of the record available. However those important pieces of information need to be captured in the first place. Because they aren’t currently valued, because they don’t go in the paper, they often aren’t recorded in a systematic way that makes it easy to ultimately publish them. Open Notebook Science, with its focus on just publishing everything immediately, is one approach to solving this problem, but it’s not for everyone, and causes its own overhead. The key problem is that recording more, and communicating it effectively, requires work over and above what most of us are doing today. That work is not rewarded in the current system. This may change over time, if, as I have argued, we move to metrics based on re-use, but in the meantime we also need much better, easier, and ideally near-zero burden tools that make it easier to capture all of this information and publish it when we choose, in a useful form.

Of course, even with the perfect tools, if we start to publish a much greater portion of the research record then we will swamp researchers already struggling to keep up. We will need effective ways to filter this material down to reduce the volume we have to deal with. Arguably the current system is an effective filter. It almost certainly reduces the volume and rate at which material is published. Of all the research that is done, some proportion is deemed “publishable” by those who have done it, a small portion of that research is then incorporated into a draft paper, and some proportion of those papers are ultimately accepted for publication. Up until 20 years ago, when the resource pinch point was the decision of whether or not to publish something, this is exactly what you would want. The question of whether it is an effective filter, whether it is actually filtering the right stuff out, is somewhat more controversial. I would say the evidence for that is weak.

When publication and distribution were the expensive part, that was the logical place to make the decision. Now that these steps are cheap, the expensive part of the process is either peer review, the traditional process of making a decision prior to publication, or conversely the curation and filtering after publication that is seen more widely on the open web. As I have argued, I believe that using the patterns of the web will ultimately be a more effective means of enabling users to discover the right information for their needs. We should publish more, much more and much more diversely, but we also need to build effective tools for filtering and discovering the right pieces of information. Clearly this also requires work, perhaps more than we are used to doing.

An imaginary solution

So what might this imaginary system that we would design look like? I’ve written before about both key aspects of this. Firstly I believe we need recording systems that as far as possible record and publish the creation of objects, be they samples, data, or tools. As far as possible these should make a reliable, time stamped, attributable record of the creation of these objects as a byproduct of what the researcher needs to do anyway. A simple concept for instance is a label printer that, as a byproduct of printing off a label, makes a record of who, what, and when, publishing this simultaneously to a public or private feed.
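To make the label printer idea concrete, here is a minimal sketch; the function name, the feed file, and the one-record-per-line JSON format are all invented for illustration, and a real system would talk to actual printer hardware and a real feed service.

```python
import getpass
import json
import time

FEED_PATH = "lab_feed.jsonl"  # hypothetical feed file: one JSON record per line

def print_label(description, feed_path=FEED_PATH):
    """Format a label and, as a byproduct, append a time-stamped,
    attributable record of the object's creation to a feed."""
    record = {
        "who": getpass.getuser(),
        "what": description,
        "when": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    label = "{what}\n{who} {when}".format(**record)
    # A real system would send `label` to the printer here; the record
    # gets published regardless, so provenance capture costs the
    # researcher nothing beyond what they were doing anyway.
    with open(feed_path, "a") as feed:
        feed.write(json.dumps(record) + "\n")
    return label
```

The design point is that the researcher never performs a separate "record provenance" step; the record falls out of an action they needed to do anyway.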

Publishing rapidly is a good approach, not just for ideological reasons of openness but also for some very pragmatic ones. It is easier to publish at the source than to remember to go back and do it later. Things that aren’t done immediately are almost invariably forgotten or lost. Secondly, rapid publication has the potential both to efficiently enable re-use and to reduce scooping risks by providing a time stamped, citable record. This of course would require people to cite these records, and for those citations to be valued as a contribution; requiring a move away from considering the paper as the only form of valid research output (see also Michael Nielsen‘s interview with me).

It isn’t enough though, just to publish the objects themselves. We also need to be able to understand the relationship between them. In a semantic web sense this means creating the links between objects, recording the context in which they were created, what were their inputs and outputs. I have alluded a couple of times in the past to the OREChem Experimental Ontology and I think this is potentially a very powerful way of handling these kind of connections in a general way. In many cases, particularly in computational research, recording workflows or generating batch and log files could serve the same purpose, as long as a general vocabulary could be agreed to make this exchangeable.
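At its simplest, this kind of typed link is just a subject-predicate-object triple. The sketch below uses plain tuples and invented predicate and object names purely for illustration; a real system would draw its vocabulary from an agreed model such as the OREChem Experimental Ontology rather than the ad hoc terms here.

```python
# Relationships between research objects as (subject, predicate, object)
# triples. All identifiers and predicates below are made up for the example.
triples = set()

def link(subject, predicate, obj):
    """Record one typed relationship between two research objects."""
    triples.add((subject, predicate, obj))

# A small experiment: a sample and a protocol go in, a dataset comes out.
link("expt/42", "hasInput", "sample/7")
link("expt/42", "usedProcedure", "protocol/pcr-v2")
link("expt/42", "hasOutput", "data/gel-image-3")

def inputs_of(experiment):
    """Query the graph: what went into this experiment?"""
    return sorted(o for s, p, o in triples
                  if s == experiment and p == "hasInput")
```

Once relationships are captured this way they can be queried, merged across projects, and exported to whatever serialisation (RDF or otherwise) the wider community settles on.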

As these objects get linked together they will form a network, both within and across projects and research groups, providing the kind of information that makes Google work, a network of citations and links that make it possible to directly measure the impact of a single dataset, idea, piece of software, or experimental method through its influence over other work. This has real potential to help solve both the discovery problem and the filtering problem. Bottom line, Google is pretty good at finding relevant text and they’re working hard on other forms of media. Research will have some special edges but can be expected in many ways to fit patterns that mean tools for the consumer web will work, particularly as more social components get pulled into the mix.
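The "what makes Google work" point can be made concrete with the iteration Google popularised. This is a toy PageRank over a made-up object graph, not a proposal for a real metric: a dataset that several papers and a script all link to accumulates rank, giving a direct, link-based measure of its influence.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Toy PageRank over a directed link graph.

    edges: iterable of (source, target) pairs, 'source links to target'.
    Returns a dict of scores; heavily linked-to objects score higher.
    """
    nodes = {n for e in edges for n in e}
    out = {n: [t for s, t in edges if s == n] for n in nodes}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:
                share = damping * rank[n] / len(out[n])
                for t in out[n]:
                    new[t] += share
            else:  # a dangling object spreads its rank evenly
                for t in nodes:
                    new[t] += damping * rank[n] / len(nodes)
        rank = new
    return rank

# Two papers and a script all build on one dataset (names invented),
# so the dataset ends up with the highest score.
edges = [("paper/A", "data/X"), ("paper/B", "data/X"),
         ("script/S", "data/X"), ("paper/B", "script/S")]
```

The same calculation works whether the nodes are papers, datasets, software, or methods, which is exactly why a linked research record would let impact be measured per object rather than per paper.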

On the rare occasions when it is worth pulling together a whole story, for a thesis, or a paper, authors would then aggregate objects together, along with text and good visuals to present the data. The telling of a story then becomes a special event, perhaps one worthy of peer review in its traditional form. The forming of a “paper” is actually no more than providing new links, adding grist to the search and discovery mill, but it can retain its place as a high status object, merely losing its role as the only object worth considering.

So in short, publish fragments, comprehensively and rapidly. Weave those into a wider web of research communication, and from time to time put in the larger effort required to tell a more comprehensive story. This requires tools that are hard to build, standards that are hard to agree, and cultural change that at times seems like spitting into a hurricane. Progress is being made, in many places and in many ways, but how can we take this forward today?

Practical steps for today

I want to write more about these ideas in the future but here I’ll just sketch out a simple scenario that I hope can be usefully implemented locally while providing a generic framework to build out, without necessarily requiring massive agreement on standards.

The first step is simple: make a record, ideally an address on the web, for everything we create in the research process. For data and software, just the files themselves on a hard disk are a good start. Pushing them to some sort of web storage, be it a blog, github, an institutional repository, or some dedicated data storage service, is even better because it makes step two easy.

Step two is to create feeds that list all of these objects, their addresses and as much standard metadata as possible, who and when would be a good start. I would make these open by choice, mainly because dealing with feed security is a pain, but this would still work behind a firewall.

Step three gets slightly harder. Where possible configure your systems so that inputs can always be selected from a user-configurable feed. Where possible automate the pushing of outputs to your chosen storage systems so that new objects are automatically registered and new feeds created.

This is extraordinarily simple conceptually. Create feeds, use them as inputs for processes. It’s not so straightforward to build such a thing into an existing tool or framework, but it doesn’t need to be too terribly difficult either. Nor does it need to bother the user. Feeds should be automatically created, and presented to the user as drop down menus.

The step beyond this, creating a standard framework for describing the relationships between all of these objects, is much harder. Not because it’s technically difficult, but because it requires agreement on standards for how to describe those relationships. This is do-able, and I’m very excited by the work at Southampton on the OREChem Experimental Ontology, but the social problems are harder. Others prefer the Open Provenance Model or argue that workflows are the way to manage this information. Getting agreement on standards is hard, particularly if we’re trying to maximise their effective coverage, but if we’re going to build a computable record of science we’re going to have to tackle that problem. If we can crack it, and get coverage of the records via a compatible set of models that tell us how things are related, then I think we will be well placed to solve the cultural problem of actually getting people to use them.


Replication, reproduction, confirmation. What is the optimal mix?

Issues surrounding the relationship between Open Research and replication seem to be the meme of the week. Abhishek Tiwari provided notes on a debate describing concerns about how open research could damage replication, and Sabine Hossenfelder explored the same issue in a blog post. The concern fundamentally is that by providing more of the details of our research we may actually be damaging the research effort by reducing the motivation to reproduce published findings or, worse, as Sabine suggests, encouraging group think and a lack of creative questioning.

I have to admit that even at a naive level I find this argument peculiar. There is no question that in aiming to reproduce or confirm experimental findings it may be helpful to carry out that process in isolation, or with some portion of the available information withheld. This can obviously increase the quality and power of the confirmation, making it more general. Indeed the question of how and when to do this most effectively is very interesting and bears some thought. The optimization of these decisions in specific cases will be an important part of improving research quality. What I find peculiar is the apparent belief in many quarters (but not necessarily Sabine, who I think has a much more sophisticated view) that this optimization is best encouraged by not bothering to make information available. We can always choose not to access information if it is available, but if it is not available we cannot choose to look at it. Indeed, to allow the optimization of the confirmation process it is crucial that we could have access to the information if we so decided.

But I think there is a deeper problem than the optimization issue. I think that the argument also involves two category errors. Firstly we need to distinguish between different types of confirmation. There is pure mechanical replication, perhaps just to improve statistical power or to re-use a technique for a different experiment. In this case you want as much detail as possible about how the original process was carried out because there is no point in changing the details. The whole point of the exercise is to keep things as similar as possible. I would suggest the use of the term “reproduction” to mean a slightly different case. Here the process or experiment “looks the same” and is intended to produce the same result but the details of the process are not tightly controlled. The purpose of the exercise is to determine how robust the result or output is to modified conditions. Here withholding, or not using, some information could be very useful. Finally there is the process of actually doing something quite different with the intention of testing an idea or a claim from a different direction with an entirely different experiment or process. I would refer to this as “confirmation”. The concerns of those arguing against providing detailed information lie primarily with confirmation, but the data and process sharing we are talking about relates more to replication and reproduction. The main efficiency gains lie in simply re-using shared process to get down a scientific path more rapidly rather than situations where the process itself is the subject of the scientific investigation.

The second category error is somewhat related, in as much as the concerns around “group-think” refer to claims and ideas, whereas the objects whose sharing we are trying to encourage when we talk about open research are more likely to be tools and data. Again, it seems peculiar to argue that the process of thinking independently about research claims is aided by reducing the amount of data available. There is a more subtle argument that Sabine is making, and possibly Harry Collins would make a similar one, that the expression of tools and data may be inseparable from the ideas that drove their creation and collection. I would still argue however that it is better to actively choose to omit information from creative or critical thinking than to be forced to work in the dark. I agree that we may need to think carefully about how we can effectively do this and I think that would be an interesting discussion to have with people like Harry.

But the argument that we shouldn’t share because it makes life “too easy” seems dangerous to me. Taking that argument to its extreme we should remove the methods section from papers altogether. In many cases it feels like we already have and I have to say that in day to day research that certainly doesn’t feel helpful.

Sabine also makes a good point, one that Michael Nielsen has also made from time to time, that these discussions are very focussed on experimental, and specifically hypothesis driven, research. It bears some thinking about, but I don’t really know enough about theoretical research to have anything useful to add. It is also the reason that some of the language in this post may seem a bit tortured.

Use Cases for Provenance – eScience Institute – 20 April

On Monday I am speaking as part of a meeting on Use Cases for Provenance (Programme), which has a lot of interesting talks scheduled. I appear to be last. I am not sure whether that means I am the comedy closer or the pre-dinner entertainment. This may, however, be as a result of the title I chose:

In your worst nightmares: How experimental scientists are doing provenance for themselves

On the whole experimental scientists, particularly those working in traditional, small research groups, have little knowledge of, or interest in, the issues surrounding provenance and data curation. There is however an emerging and evolving community of practice developing the use of the tools and social conventions related to the broad set of web based resources that can be characterised as “Web 2.0”. This approach emphasises social, rather than technical, means of enforcing citation and attribution practice, as well as maintaining provenance. I will give examples of how this approach has been applied, and discuss the emerging social conventions of this community from the perspective of an insider.

The meeting will be webcast (link should be available from here) and my slides will with any luck be up at least a few minutes before my talk in the usual place.

Very final countdown to Science Online 09

I should be putting something together for the actual sessions I am notionally involved in helping to run, but this being a very interactive meeting perhaps it is better to leave things to the very last minute. Currently I am at a hotel at LAX awaiting an early flight tomorrow morning. Daily temperatures in the LA area have been running around 25-30 C for the past few days but we’ve been threatened with the potential for well below zero in Chapel Hill. Nonetheless the programme and the people will more than make up for it, I have no doubt. I got to participate in a bit of the meeting last year via streaming video and that was pretty good, but a little limited – not least because I couldn’t really afford to stay up all night, unlike some people who were far more dedicated.

This year I am involved in three sessions (one on Blog Networks, one on Open Notebook Science, and one on Social Networks for Scientists – yes those three are back to back…) and we will be aiming to be video casting, live blogging, posting slides, images, and comments; the whole deal. If you’ve got opinions then leave them at the various wiki pages (via the programme) or bring them along to the sessions. We are definitely looking for lively discussion. Two of these are being organised with the inimitable Deepak Singh who I am very much looking forward to finally meeting in person – along with many others I feel I know quite well but have never met – and others I have met and look forward to catching up with including Jean-Claude who has instigated the Open Notebook session.

With luck I will get to the dinner tomorrow night so hope to see some people there. Otherwise I hope to see many in person or online over the weekend. Thanks to Bora, Anton, and David for superb organisation (and not a little pestering to make sure I decided to come!)

The Southampton Open Science Workshop – a brief report

On Monday 1 September we had a one day workshop in Southampton discussing the issues that surround ‘Open Science’. This was very free form and informal and I had the explicit aim of getting a range of people with different perspectives into the room to discuss a wide range of issues, including tool development, the social and career structure issues, as well as ideas about standards and finally, what concrete actions could actually be taken. You can find live blogging and other commentary in the associated Friendfeed room and information on who attended as well as links to many of the presentations on the conference wiki.

Broadly speaking the day was divided into three chunks, the first was focussed on tools and services and included presentations on MyExperiment, Mendeley, Chemtools, and Inkspot Science. Branwen Hide of Research Information Network has written more on this part. Given that the room contained more than the usual suspects the conversation focussed on usability and interfaces rather than technical aspects although there was a fair bit of that as well.

The second portion of the day revolved more around social challenges and issues. Richard Grant presented his experience of blogging on an official university sanctioned site and the value of that for both outreach and education. One point he made was that the ‘lack of adoption problem’ seen in science just doesn’t seem to exist in the humanities. Perhaps this is because scientists don’t generally see ‘writing’ as a valuable thing in its own right. Certainly there is a preponderance of scientists who happen also to see themselves as writers on Nature Network.

Jennifer Rohn followed on from Richard, and objected to my characterising her presentation as “the skeptic’s view”. A more accurate characterisation would have been “I’d love to be open but at the moment I can’t: this is what has to change to make it work”. She presented a great summary of the problem, particularly from the biological scientist’s point of view, as well as potential solutions. Essentially the problem is that of the ‘Minimum Publishable Unit’ or research quantum, as well as what ‘counts’ as publication. Her main point was that for people to be prepared to publish material that falls short of a full paper they need to get some proportional credit for it. This folds closely into the discussion of what can be cited, and what should be cited in particular contexts. I have used the phrase ‘data sized peg into a paper shaped hole’ to describe this in the past.

After lunch Liz Lyon from UKOLN talked about curation and long term archival storage, which led into an interesting discussion about the archiving of blogs and other material. Is it worth keeping? One answer to this was to look at the real interest today in diaries from the second world war and earlier from ‘normal people’. You don’t necessarily need to be a great scientist, or even a great blogger, for the material to be of potential interest to historians in 50-100 years’ time. But doing this properly is hard – in the same way that maintaining and indexing data is hard. Disparate sites, file formats, places of storage, and in the end whose blog is it actually? Particularly if you are blogging for, or recording work done at, a research institution.

The final session was about standards or ‘brands’. Yaroslav Nikolaev talked about semantic representations of experiments. While important, it was probably a shame that we scheduled this at the end of the day, because it would have been helpful to get more of the non-techie people into that discussion, both to iron out the communication issues around the semantic web and to describe the real potential benefits. This remains a serious gap – the experimental scientists who could really use semantic tools don’t really get the point, and the people developing the tools don’t communicate the benefits well, or in some cases (not all, I hasten to add!) don’t actually build the tools the experimentalists want.

I talked about the possibility of a ‘certificate’ or standard for Open Science, and the idea of an organisation to police this. It would be safe to say that, while people agreed that clear definitions would be helpful, the enthusiasm level for a standards organisation was pretty much zero. There are more fundamental issues – actually building up enough examples of good practice, and working towards identifying best practice in open science – that need to be dealt with before we can really talk about standards.

On the other hand the idea of the ‘fully supported paper’ got immediate and enthusiastic support. The idea, which has been discussed elsewhere, is deceptively simple: all the relevant supporting information for a paper (data, detailed methodology, software tools, parameters, database versions, etc., as well as access to required materials at reasonable cost) should be available for any published paper. The challenge lies in actually recording experiments in such a way that this information can be provided. But if the whole record is available in this form, then it can be made available whenever the researcher chooses. Thus by providing the tools that enable the fully supported paper, you are also providing tools that enable open science.

Finally we discussed what we could actually do. Jean-Claude Bradley discussed the idea of an Open Notebook Science challenge to raise the profile of ONS (this is now set up – more on this to follow). It is essentially a competition-style approach in which individuals or groups contribute to a larger scientific problem by collecting data, with the teams judged on how well they describe what they have done and how quickly they make it available.

The most specific action proposed was to draft a ‘Letter to Nature’ proposing the idea of the fully supported paper as a submission standard. The idea would be to get a large number of high profile signatories on a document which describes a concrete step-by-step plan to work towards the final goal, and to send that as correspondence to a high profile journal. I have been having some discussions about how to frame such a document and hope to be getting a draft up for discussion reasonably soon.

Overall there was much enthusiasm for things Open and a sense that many elements of the puzzle are falling into place. What is missing is effective coordinated action, communication across the whole community of interested and sympathetic scientists, and critically the high profile success stories that will start to shift opinion. These ought, in my opinion, to be the targets for the next 6–12 months.

Q&A in this week’s Nature – one or two (minor) clarifications

So a bit of a first for me: I can vaguely claim to have contributed two things to the print version of Nature this week. Strictly speaking my involvement in the first, the ‘From the Blogosphere’ piece on the Science Blogging Challenge, was restricted to discussing the idea (originally from Richard Grant, I believe), a bit of cheerleading, and ultimately some judging. The second item, though, I can claim some credit for, in as much as it is a Q&A with myself and Jean-Claude Bradley that was done when we visited Nature Publishing Group in London a few weeks back.

It is great that a journal like Nature views the ideas of data publication, open notebook science, and open science in general as worthy of featuring. This is not an isolated instance either: we can point to the good work of the Web Publishing Group in developing useful resources such as Nature Precedings, as well as previous features in the print edition such as the Horizons article (there is also another version on Nature Precedings) written by Peter Murray-Rust. One thing I have heard said many times in recent months is that while people who advocate open science may not agree with everything NPG does with respect to copyright and access, they are impressed and encouraged by the degree of engagement that NPG maintains with the community.

I did however just want to clarify one or two of the things I apparently said. I am not claiming that I didn’t say those things – the interview was recorded after all – but just that on paper they don’t really quite match what I think I meant to say. Quoting from the article:

CN-Most publishers regard what we do as the equivalent of presenting at a conference, or a preprint. That hasn’t been tested across a wide range of publishers, and there’s at least one — the American Chemical Society — that doesn’t allow prepublication in any form whatsoever.

That sounds a little more extreme than what I meant to say: a number of publishers don’t allow submission of material that has appeared online as a preprint, and the ACS has said that it regards online publication as equivalent to a preprint. I don’t have any particular sympathy for the ACS, but I think they probably do allow publication of material that was presented at ACS conferences.

CN-Open notebooks are practical but tough at the moment. My feeling is that the tools are not yet easy enough to use. But I would say that a larger proportion of people will be publishing electronically and openly in ten years.

Here I think what I said is too conservative on one point and possibly not conservative enough on the other. I did put my neck out and say that I think the majority of scientists will be using electronic lab notebooks of one sort or another in ten years. Funder data sharing policies will drive a much greater volume of material online post publication (hopefully with higher quality description) and this may become the majority of all research data. I think that more people will be making more material available openly as it is produced as well but I doubt that this will be a majority of people in ten years – I hope for a sizeable and significant minority and that’s what we will continue to work towards.

How I got into open science – a tale of opportunism and serendipity

So Michael Nielsen, one morning at breakfast at Scifoo, asked one of those questions which never has a short answer: ‘So how did you get into this open science thing?’ I realised that although I have told the story to many people, I have never written it down. Perhaps this is a meme worth exploring more generally, but I thought others might be interested in my story, partly because it illustrates how funding drives scientists, and partly because it shows how opportunism and serendipity can be successful bedfellows.

In late 2004 I was spending a lot of my time on the management of a large collaborative research project and had had a run of my own grant proposals rejected. I had a student interested in doing a PhD but no direct access to funds to support the consumables cost of the proposed project. Jeremy Frey had been on at me for a while to look at implementing the electronic lab notebook system that he had led the development of, and at the critical moment he pointed out to me a special call from the BBSRC for small projects to prototype, develop, or implement e-science technologies in the biological sciences. It was a light touch review process and a relatively short application. More to the point, it was a way of funding some consumables.

So the grant was written. I wrote the majority of it, which makes for somewhat interesting reading in retrospect; I didn’t really know what I was talking about at the time (which seems to be a theme with my successful grants). The original plan was to use the existing, fully semantic, rdf-backed electronic lab notebook and develop models for its use in a standard biochemistry lab. We would then develop systems to extract a relational database from the rdf representation and present this on the web.
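For what it’s worth, the kind of extraction we had in mind can be sketched in a few lines. This is purely illustrative – the property names, table layout, and example triples below are invented for the sketch, not the real notebook schema – but it shows the basic move from an rdf-style triple store to a relational table that could then be presented on the web.

```python
import sqlite3

# Invented example triples (subject, predicate, object) of the kind an
# rdf-backed notebook might hold; names are illustrative only.
triples = [
    ("exp1", "title", "PCR amplification"),
    ("exp1", "date", "2005-03-01"),
    ("exp1", "sample", "plasmid-A"),
    ("exp2", "title", "Gel electrophoresis"),
    ("exp2", "date", "2005-03-02"),
    ("exp2", "sample", "plasmid-A"),
]

# Pivot the triples into one row per subject so they fit a relational table.
rows = {}
for subj, pred, obj in triples:
    rows.setdefault(subj, {})[pred] = obj

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiments (id TEXT PRIMARY KEY, title TEXT, date TEXT, sample TEXT)"
)
for subj, props in rows.items():
    conn.execute(
        "INSERT INTO experiments VALUES (?, ?, ?, ?)",
        (subj, props.get("title"), props.get("date"), props.get("sample")),
    )

# The relational view makes notebook-wide queries trivial, e.g. find all
# experiments run on a given sample.
result = [r[0] for r in conn.execute(
    "SELECT id FROM experiments WHERE sample = ? ORDER BY id", ("plasmid-A",)
)]
print(result)  # ['exp1', 'exp2']
```

The point of the exercise was exactly this last step: once the flexible triple representation is flattened into tables, the standard web toolchain for querying and displaying tabular data comes for free.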

The grant was successful, but the start was delayed, due both to shenanigans over the studentship that was to support the work and to the movement of part of the large project to another institution with one of the investigators. Partly as a result of the ensuing mess I applied for the job I ultimately accepted at RAL, and after some negotiation organised an 80:20 split between RAL and Southampton.

By the time we had a student in place and the grant had started, it was clear that the existing semantic ELN was not in a state that would enable us to implement new models for our experiments. However, a blog system had by then been developed in Jeremy’s group, and it was thought it would be an interesting experiment to use this as a notebook – almost the precise opposite of the rdf-backed ELN. Looking back now, I would describe it as taking the opportunity to compare a Web 2.0 approach to the notebook with a Web 3.0 approach, though bear in mind that at the time I had little or no idea what these terms meant, let alone the care with which they need to be used.

The blog-based system was great for me as it meant I could follow the student’s work online, and in doing so I gradually became aware of blogs in general and the use of feed readers. The RSS feed of the LaBLog was a great help, as it made following the details of experiments remotely straightforward. This was important because by now I was spending three or four days a week at RAL while the student was based in Southampton. As we started to use the blog, at first in a very naïve way, we found problems and issues which ultimately led us to think about and design the organisational approach I have written about elsewhere [1, 2]. By this stage I had started to look at other services online and was playing around with OpenWetWare and a few other services, becoming vaguely aware of Creative Commons licences and getting a grip on the Web 2.0 versus Web 3.0 debate.

To implement our newly designed approach to organising the LaBLog, we decided the student would start afresh with a clean slate in a new blog. By this stage I was using the blog for other things and had discovered that the ID authentication we were using didn’t always work through the RAL firewall. I ended up with complicated VPN setups, particularly when working from home, where I couldn’t be logged on to the blog and have my email online at the same time. This, obviously, was a pain, and as we were moving to a new blog which could have new security settings, I said, ‘stuff it, let’s just make it completely visible and be done with it’.

So there you go. The critical decision to move to Open Notebook status was taken as the result of a firewall. So serendipity, or at least the effect of outside pressures, was what made it happen. I would like to say it was a carefully thought-out philosophical decision, but it was essentially the result of frustration – although my awareness of the open access movement, Creative Commons, OpenWetWare, and the rest no doubt prepared the ground that led me to think down that route.

So, so far, opportunism and serendipity – which brings us back to opportunism again, or at least to seizing an opportunity. Having made the decision to ‘go open’, two things clicked in my mind: first, that this was rather radical; second, that all of these Web 2.0 tools, combined with an open approach, could lead to a marked improvement in the efficiency of collaborative science – a kind of ‘Science 2.0’ [yes, I know, don’t laugh; this would have been around March 2007]. Here was an opportunity to get my name on a really novel and revolutionary concept! A quick Google search revealed that, funnily enough, I wasn’t the first person to think of this (yes! I’d been scooped!), but more importantly it led me to what I think ought to be three of the Standard Works of Open Science: Bill Hooker’s three-part series on Open Science at 3 Quarks Daily [1, 2, 3], Jean-Claude Bradley’s presentation on Open Notebook Science at Nature Precedings (and the associated original blog post coining the term), and Deepak Singh’s talk on Open Science at Ignite Seattle. From there I was inspired to seize the opportunity, get a blog of my own, and get involved. The rest of my story, so far, is more or less available online here and via the usual sources.

Which leads me to ask: what got you involved in the ‘open’ movement? What, for you, were the ‘primary texts’ of open science and open research? There is value in recording this, or at least our memories of it, for ourselves, to learn from our mistakes and perhaps discern the direction forward. Perhaps it isn’t even too self-serving to think of it as history in the making. Or perhaps, more in line with our own aims as ‘open scientists’, we would simply be doing a poor job if we didn’t record what brought us to where we are and what is influencing our thinking going forward. I think the blogosphere does a pretty good job of the latter, but a little more recording of the former would be helpful.

Southampton Open Science Workshop 31 August and 1 September

An update on the Workshop that I announced previously. We have a number of people confirmed to come down and I need to start firming up numbers. I will be emailing a few people over the weekend so sorry if you get this via more than one route. The plan of attack remains as follows:

Meet on evening of Sunday 31 August in Southampton, most likely at a bar/restaurant near the University to coordinate/organise the details of sessions.

Commence on Monday at ~9:30am and finish around 4:30pm (with the option of discussion going into the evening), with three or four sessions over the course of the day, broadly divided into the areas of tools, social issues, and policy. We have people interested and expert in all of these areas coming, so we should be able to have a good discussion. The aim is to keep it very informal but to keep the discussion productive. Numbers are likely to be around 15–20 people. For those not lucky enough to be in the area we will aim to record and stream the sessions, probably using a combination of dimdim, mogulus, and slideshare. Some of these may require you to be signed into our session, so if you are interested drop me a line at the account below.

To register for the meeting please send me an email at my gmail account (cameronneylon). To avoid any potential confusion, even if you have emailed me in the past week or so about this, please email again so that I have a comprehensive list in one place. I will get back to you with a request via PayPal for £15 to cover coffees and lunch for the day (so if you have a PayPal account you want to use, please send the email from that address). If there is a problem with the cost, please say so in your email and we will see what we can do. We can suggest options for accommodation but will ask you to sort it out for yourself.

I have set up a wiki to discuss the workshop which is currently completely open access. If I see spam or hacking problems I will close it down to members only (so it would be helpful if you could create an account) but hopefully it might last a few weeks in the open form. Please add your name and any relevant details you are happy to give out to the Attendees page and add any presentations or demos you would be interested in giving, or would be interested in hearing about, on the Programme suggestion page.