Reflections on the Open Science workshop at PSB09

In a few hours I will be giving a short presentation to the whole of the PSB conference on the workshop that we ran on Monday. We are still thinking through the details of what came out of it, and the discussion will hopefully continue in any case, so this is a personal view. The slides for the presentation are available at Slideshare.

To me there were a few key points that came out. Many of these are not surprising, but they bear repeating:

  • Citation, and improving and expanding the way it is used, lies at the core of making sure that people get credit both for the work they do and for the widest range of useful contributions to the research community
  • Persistence of identity, and persistence of objects (in general, the persistence of resources), is absolutely critical to making a wider citation culture work. We must know who generated something, and be able to point to it in the long term, if citation is to deliver on its potential as credit.
  • “If you build it they won’t come” – building a service, whether a technical or a social one, depends on a community that uses and adds value to it. Build the service for the community and build the community for the service. Don’t solve the problems that you think people have – solve the ones that they tell you they have.

The main point for me grew out of the panel session and was perhaps articulated best by Drew Endy: identify specific problems (not ideological issues) and make the process more efficient. Ideology may help to guide us, but it can also blind us to specific issues and hide the underlying reasons for specific successes and failures from view. We have a desperate need for both qualitative data (stories about successes and failures) and quantitative data (hard numbers on the uptake of specific practices and its consequences).

Taking inspiration from Drew’s keynote: we have an evolved system for doing research that is not designed to be easily understood or modified. We need to take an experimental approach to identifying and solving specific problems that would let us increase the efficiency of the research process. Drew’s point was that this should be a proper research discipline in its own right, with the funding and respect that goes with it. For the presentation I summarised this as follows:

Improving the research process is an area for (experimental) research that requires the same rigour, standards (and funding) as anything else that we do

Brief running report on the Open Science Workshop at PSB09

Just a very brief rundown of what happened at the workshop this morning and some central themes that came out of it. The slides from the talks are available on Slideshare, and recorded video of most of the talks (unfortunately not Dave de Roure’s or Phil Bourne’s at the moment) is available on my Mogulus channel (http://www.mogulus.com/cameron_neylon – click on Video on Demand and select the PSB folder). The commentary from the conference is available in the PSB 2009 Friendfeed room.

For me there were three main themes that came through from the talks and the panel session. The first has come up in many contexts, but most recently in Phil Bourne and Lynn Fink’s perspectives article in PLoS Computational Biology: the need for persistent identity tokens to track people’s contributions, and the need to re-think how citation works and what citations are used for.

The second theme was a need for more focus on specific issues, including domain-specific problems or barriers, where “greasing the wheels” could make a direct difference to people’s ability to do their research – solving specific problems that are not necessarily directly associated with “openness” as an ideological movement. Similar ideas were raised in the discussion of tool and service development: the need to build the user into the process of service design and to solve the problems users actually have, rather than those the developer thinks they ought to be worrying about.

But probably the main theme that came through for me was the need to identify and measure real outcomes from adopting more open practice. This was the central theme of Heather’s talk, but it also came up strongly in the panel session. We have little if any quantitative information on the benefits of open practice, and there are still relatively few complete examples of open research projects. More research and more aggregation of examples will help here, but there is a desperate need for numbers and details to help funders, policy makers, and researchers themselves make informed choices about which approaches are worth adopting, and indeed which are not.

The session was a good conversation, with some great talks and lots of people involved throughout. Even with a three-hour slot we ran 30 minutes over and could have kept talking for quite a bit longer. We will keep posting material over the next few days, so please continue the discussion over at Friendfeed and on the workshop website.

Final countdown to Open Science@PSB

As I noted in the last post, we are rapidly counting down towards the final few days before the Open Science Workshop at the Pacific Symposium on Biocomputing. I am flying out from Sydney to Hawaii this afternoon and may or may not have network connectivity in the days leading up to the meeting. So here are some quick notes on where to find any final information, whether you are coming in person or want to follow online.

The workshop website is available at psb09openscience.wordpress.com; this is where information will be posted in the lead-up to the workshop, and where links to presentations and any other material will be posted afterwards.

If you want to follow in closer to real time, there is a Friendfeed room available at friendfeed.com/rooms/psb-2009 which will carry breaking information and live blogging during the workshop and throughout the conference. I will be aiming to broadcast video of the workshop at www.mogulus.com/cameron_neylon, but this will depend on how well the wireless is working on the day, and it will not be the highest priority. Updates on whether the stream is functioning will be posted in the Friendfeed room; I will not be monitoring the chat room on the Mogulus feed. If there are technical issues, please leave a message in the Friendfeed room and I will try to fix the problem, or at least say if I can’t.

Otherwise I hope to see many of you at the workshop either in person or online!

New Year’s Resolutions 2009

[Image: Sydney Harbour Bridge NYE fireworks]

All good traditions require someone to make an arbitrary decision to do something again. Last year I threw up a few New Year’s resolutions in the hours before NYE in the UK. Last night I was out on the shore of Sydney Harbour. I had the laptop and I thought about writing something – and then I thought, nah, I can just lie here and look at the pretty lights. However, I did want to follow up the successes and failures of last year’s resolutions and maybe make a few more for this year.

So, last year’s resolutions were, roughly speaking: 1) to adopt the principles of the NIH Open Access mandate when choosing journals for publications, 2) to get more of the existing data within my group online and available, 3) to take the whole research group fully open notebook, 4) to mention Open Notebooks in every talk I gave, and 5) to attempt to get explicit funding for developing open notebook approaches.

So, successes: the research group at RAL is now (technically) working on an Open Notebook basis. This has taken a lot longer than we expected, and the guys are still getting a feel for what that means, both in terms of how they record things and how they feel about it. I think it will improve over time, and it reinforces the message that none of this is easy. I also made a point of talking about the Open Notebook approach in every talk I gave – mostly this was well received; often there was some scepticism, but the message is getting out there.

However, we didn’t do so well on picking journals – most of the papers I was on this year were driven by other people, or were directed requests for special issues, or both. The papers that I had in mind still haven’t been written; some drafts exist, but they’re definitely not finished. I also haven’t done any real work on getting older data online – it has been enough work just trying to manage the stuff we already have.

Funding is a mixed bag – the network proposal that went in last New Year’s was rejected. A few proposals have gone in; more haven’t, but exist in draft form; and a group of us came close to winning a tender to do some research into the uptake of Web 2.0 tools in science (more on that later, but Gavin Baker has written about it and our tender document itself is available). The success of the year was the funding that Jean-Claude Bradley obtained from Submeta (along with support from Aldrich Chemicals and Nature Publishing Group) to support the Open Notebook Science Challenge. I can’t take any credit for this, but I think it is a good sign that we may have more luck this coming year.

So for this year – there are some follow ons – and some new ones:

  1. I will re-write the network application (and will be asking for help) and re-submit it to a UK funder
  2. I will clean up the “Personal View of Open Science” series of blog posts and see if I can get it published as a perspectives article in a high-ranking journal
  3. I will get some of those damn papers finished – and decide which ones are never going to be written and give up on them. Papers I have full control over will go by first preference to Gold OA journals.
  4. I will pull together the pieces needed to take action on the ideas that came out of the Southampton Open Science workshop – specifically, the idea of a letter to a high-ranking journal, signed by a wide range of scientists and interested people, stating the importance of working towards published papers being fully supported by data and methodological detail that is fully available
  5. I will focus on doing fewer things and doing them better – or at least on making sure the resources are available to do more of the things I take on…

I think five is enough things to be going on with. Hope you all have a happy new year, whenever it may start, and that it takes you further in the direction you want to go (whether you know what that is now or not) than you thought was possible.

p.s. I noticed in the comments on last year’s post a suggestion from one Shirley Wu that we run a session at the 2009 Pacific Symposium on Biocomputing – a proposal that resulted in the session we are holding in a few days (again, more later on – we hope – streaming video, microblogging etc). Just thinking about how much has changed in the way such an idea would be raised and explored over the last twelve months is food for thought.

The people you meet on the train…

Yesterday on the train I had a most remarkable experience of synchronicity. I had been at the RIN workshop on the costs of scholarly publishing (more on that later) in London and was heading off to Oxford for a group dinner. On the train I was looking for a seat with a desk and took one up opposite a guy with a slightly battered-looking Mac laptop. As I pulled out my new MacBook (13”, 2.4 GHz, 4 GB memory, since you ask) he leaned across to have a good look, as you do, and we struck up a conversation. He asked what I did, and I talked a little about being a scientist and my role at work. He was a consultant who worked on systems integration.
At some stage he made a throwaway comment about the fact that he had been going back to learn, or re-learn, some fairly advanced statistics, and that he had had a lot of trouble getting access to some academic papers; he certainly didn’t want to pay for them, but had managed to find free versions of what he wanted online. I managed to keep my mouth somewhat shut at this point, except to say that I had been at a workshop looking at these very issues.

However, it gets better, much better. He was looking into quantitative risk issues, and this led into a discussion about how science, and particularly medicine, reporting in the media doesn’t provide links back to the original research (which is generally not accessible anyway) and that, what is worse, the original data is usually not available (and this was all unprompted by me, honestly!). To paraphrase his comment: “the trouble with science is that I can’t get at the numbers behind the headlines; what is the sample size, how was the trial run…” At this point all thought of getting any work done went out the window, and we had a great discussion about data availability and the challenges of recording it in the right form (his systems integration work includes efforts to deal with mining of large, badly organised data sets), drifting into identity management and trust networks. It was a great deal of fun.
What do I take from this? That there is a demand for this kind of information and data from an educated and knowledgeable public. One of the questions he asked was whether, as a scientist, I ever see much in the way of demand from the public. My response was that, aside from pushing taxpayer access to taxpayer-funded research myself, I hadn’t seen much evidence of real demand. His argument was that there is a huge nascent demand from people who haven’t yet thought about their need to get into the detail of news stories that affect them. People want the detail; they just have no idea of how to go about getting it. Spread the idea that access to that detail is a right and we will see the demand for access to the outputs of research grow rapidly. The idea that “no-one out there is interested or competent to understand the details” is simply not true. The more respect we have for the people who fund our research, the better, frankly.

It’s a little embarrassing…

…but being straightforward is always the best approach. Since we published our paper in PLoS ONE a few months back I haven’t been as happy as I was about the activity of our Sortase. What this means is that we are now using a higher concentration of the enzyme to do our ligation reactions. They seem to be working well and with high yields, but we need to put in more enzyme. If you don’t understand that, don’t worry – just imagine you posted a carefully thought-out recipe and then discovered you couldn’t get the same taste again unless you added ten times as much saffron.

None of this prevents the method being useful, and it doesn’t change the fundamental point of our paper, but if people are following our methods, particularly if they only go to the paper and don’t get in contact, they may run into trouble. Traditionally this would be a problem, and would probably lead to our results being regarded as unreliable. In our case, however, there is a simple fix. Because the paper is in PLoS ONE, which has some good commenting features, I can add a note to the paper itself, right where we give the concentration of enzyme that we used (scroll down to note 3 in the results). I can also add a note directing people to where we have put more of our methodology online, at OpenWetWare. As we get more of this work into our online lab notebooks we will also be able to point directly back to example experiments to show how the reaction rate varies, and hopefully in the longer term sort it out. All easily done on the web, but impossible on paper, and in an awful lot (but not all!) of the other journals around.

Or we could just let people find out for themselves…

Note to the PLoS team: even better would be if I could have a link that went to a page where the comment was displayed in the context of the paper (i.e. what you get when you click on the marker when reading the paper) :-)

What Russell Brand and Jonathan Ross can teach us about the value of community norms

For anyone in the UK who lives under a stone, or those elsewhere in the world who don’t follow British news: this week there has been at least some news beyond the ongoing economic crisis and the U.S. election. Two media ‘personalities’ have been excoriated for leaving what can only be described as crass and offensive messages on an elderly actor’s answerphone, while on air. What made the affair worse was that the radio programme was in fact recorded, and someone, somewhere, made a decision to broadcast it in full. Even worse, the broadcaster was that bastion of British values, the BBC.

If you want more of the details of what exactly happened then do a search on their names, but what I wanted to focus on here is some of the public and institutional reactions, and their relation to the presumed need within the science community for ‘rules’, ‘licences’, and ‘copyright’ over works and data. Consistently we try to explain why this is not a good approach and why developing strong community norms is better [1, 2]. I think this affair gives an example of why.

Much of the media and public outcry has been of the type ‘there must be some law, or if not some BBC rule, that must have been broken – bang them up!’ There is a sense that there can only be recourse if someone has broken a rule. This is quite similar to the sense amongst many researchers that they will only be able to ‘protect’ the results they make public by making them available under an explicit licence; that the only way they can have any recourse against someone ‘misusing’ ‘their’ results is if they can show that the terms of a licence have been broken.
The problem with this, as we know, is two-fold. First, if someone does break the terms of the licence then frankly your chance of actually doing anything about it is pretty minimal. Second, and more importantly from the perspective of those of us interested in re-use and re-purposing, pretty much any licensing system will create incompatibilities that prevent combining datasets, or using them in new ways, even when that wasn’t the intention of the original licensor.

There is an interesting parallel here with the Brand/Ross affair. It is entirely possible that no laws, or even BBC rules, have been broken. Does this mean they get off scot free? No: Brand has resigned, and Ross has been suspended, with his popular Friday night TV show apparently not to be recorded this week. The most interesting thing about the whole affair is that the central failure at the BBC was an editorial one. Some senior editor, so far unnamed, signed off on the programme and allowed it to be broadcast. What should have happened was that this editor blocked the programme or removed the offending passages. Not because a rule was broken, but because it did not meet the BBC’s editorial standards – because it violated the community norms of what is acceptable for the BBC to broadcast. Whether or not any rules were broken, what was done was crass and offensive. Whether or not someone is technically in violation of a data re-use licence, failing to provide adequate attribution to the generators of a dataset is equally crass and unacceptable behaviour.

What the BBC discovered is that when it doesn’t live up to the standards that the wider community expects of it, it receives withering censure. Indeed, much of the most serious criticism came from some of its own programmes. It was the voice of the wider community (as mediated through the mass media, admittedly) which led to the resignation and suspension. If it were just a question of ‘rules’ it is entirely possible that nothing could have been done. And if rules had been put in place that would have prevented it, the unintended consequence would almost certainly have been to block programmes that had valid dramatic or narrative reasons for carrying such a passage. Again, community censure was much more powerful than any tribunal arbitrating some set of rules.

Yes this is nuanced, yes it is difficult to get right, and yes there is the potential for mob rule. That is why there is a team of senior professional editors at the BBC charged with policing and protecting the ‘community norms’ of what is acceptable for the BBC brand. That is why the damage done to the BBC’s brand will be severe. Standards, where it is explicit that the spirit is applied rather than the letter, where there are grey areas, can be much more effective than legalistic rules. When someone or some group clearly steps outside of the bounds then widespread censure is appropriate. It is then for individuals and organisations to decide how to apply that censure. And in turn to expect to be held to the same standards.

The cheats will always break the rules. If you use legalistic rules, then you invite legalistic approaches to getting around them. Those that try to apply the rules properly will then be hamstrung in their attempts to do anything useful while staying within the letter of the law. Community norms and standards of behaviour, appropriate citation, respect for people’s work and views, can be much more effective.

  1. Wilbanks, John. The Control Fallacy: Why OA Out-Innovates the Alternative. Available from Nature Precedings <http://hdl.handle.net/10101/npre.2008.1808.1> (2008)
  2. Wilbanks, John. Chemspider: Good intentions and the fog of licensing. http://network.nature.com/people/wilbanks/blog/2008/05/10/chemspider-good-intentions-and-the-fog-of-licensing (2008)

A personal view of Open Science – Part IV – Policies and standards

This is the fourth and final part of the serialisation of a draft paper on Open Science. The other parts are here – Part I, Part II, Part III.

A question that needs to be asked when contemplating any major change in practice is the balance and timing of ‘bottom-up’ versus ‘top-down’ approaches to achieving that change. Scientists are notoriously unresponsive to decrees and policy initiatives, but as has been discussed they are also inherently conservative and generally resistant to change led from within the community as well. For those advocating the widespread, and ideally rapid, adoption of more open practice in science, it will be important to strike the right balance between calling for mandates and conditions on funding or journal submission, and simply adopting these practices in their own work. While the motivation behind the adoption of data sharing policies by funders such as the UK research councils is to be applauded, it is possible for such initiatives to be counterproductive if the policies are not supported by infrastructure development, appropriate funding, and appropriate enforcement. Equally, standards and policy statements can send a powerful message about the aspirations of funders to make the research they fund more widely available and, for the most part, when funders speak, scientists listen.

One Approach for Mainstream Adoption – The fully supported paper

There are two broad approaches to standards currently being discussed. The first is aimed at mainstream acceptance and uptake and can be described as ‘the fully supported paper’. This is a concept that is simple on the surface but very complex to implement in practice. In essence it is the idea that the claims made in a peer reviewed paper in the conventional literature should be fully supported by a publicly accessible record of all the background data, methodology, and data analysis procedures that contribute to those claims. On one level this is only a slight increase in requirements over the Brussels Declaration made by the International Association of Scientific, Technical, and Medical Publishers in 2007, which states:

Raw research data should be made freely available to all researchers. Publishers encourage the public posting of the raw data outputs of research. Sets or sub-sets of data that are submitted with a paper to a journal should wherever possible be made freely accessible to other scholars

http://www.stm-assoc.org/brussels-declaration/

The degree to which this declaration is supported by publishers, and the level to which different journals require their authors to adhere to it, is a matter for debate, but the principle of availability of background data has been accepted by a broad range of publishers. It is therefore reasonable to consider the possibility of making the public posting of data a requirement for submission. At a simple level this is already possible. For specific types of data, repositories already exist, and in many cases journals require submission of these data types to recognised repositories. More generally, it is possible to host data sets in some institutional repositories, and with the expected announcement of a large-scale data hosting service from Google the argument that this is not practicable is becoming unsustainable. While such datasets may have limited discoverability and limited metadata, they will at least be discoverable from the papers that reference them. It is reasonable to expect sufficient context to be provided in the published paper to make the data usable.

However, the data itself, except in specific cases, is not enough to be useful to other researchers. The details of how that data was collected and how it was processed are critical if the claims made in a paper are to be properly judged. Once again we come to the problem of recording the process of research and then presenting it in a form that is detailed enough to be widely useful but not so dense as to be impenetrable. The technical challenges of delivering a fully supported paper are substantial. However, it is difficult to argue that this shouldn’t be available. If claims made in the scientific literature cannot be fully verified, can they be regarded as scientific? Once again – while the target is challenging – it is simply a proposal to do good science, properly communicated.

Aspirational Standards – celebrating best practice in open science

While the fully supported paper would be a massive social and technical step forward, it is in many ways no more open than the current system. It does not deal with the problem of unpublished or unsuccessful studies that may never find a home in a traditional peer reviewed paper. As discussed above, the ‘fully supported paper’ is not really ‘open science’; it is just good science. What, then, are the requirements, or standards, for ‘open science’? Does there need to be a certificate, or a set of requirements that must be met, before a project, individual, or institution can claim to be doing Open Science? Or is Open Science simply too generic a term, and too prone to misinterpretation?

I would argue that while ‘Open Science’ is a very generic term it has real value as a rallying point or banner. It is a term that generates a significant positive reaction amongst the general public, the mainstream media, and large sections of the research community. Its very vagueness also allows some flexibility, making it possible to welcome contributions from publishers, scientists, and funders which, while not 100% open, are nonetheless positive and helpful. Within this broad umbrella it is then possible to look at defining or recommending practices and standards, and giving these specific labels for identification.

The main work in the area of defining relevant practices and standards has been carried out by Science Commons and the Open Knowledge Foundation. Science Commons have published four ‘Principles for Open Science’ which focus on the availability and accessibility of published literature, research tools, and data, and the development of cyberinfrastructure to make this possible. These four principles do not currently include, explicitly, the availability of process, which has been covered in detail above, but they provide a clear set of criteria which could form the basis of standards. Broadly speaking, research projects, individuals, or institutions that deliver on these principles could be said to be doing Open Science. The Open Knowledge Definition, developed by the Open Knowledge Foundation, is another useful touchstone here. Another possible defining criterion for Open Science is that all the relevant material is made available under licences that adhere to the definition.

The devil, naturally, lies in the details. Are embargoes on data and methodology appropriate, and if so, in what fields and how should they be constructed? For data that cannot be released, should specific exceptions be made, or special arrangements made to hold data in secure repositories? Where the same group is doing open and commercial research, how should the divisions between these projects be defined and declared? These details are important, and will take time to work out. In the short term it is therefore probably more effective to identify and celebrate examples of open science, define best practice, and observe how it works (and does not work) in the real world. This will raise the profile of Open Science without making it immediately the exclusive preserve of those with the luxury of radically changing their practice. It enables examples of best practice to be held up as aspirational standards, providing goals for others to work towards and the impetus for the tool and infrastructure development that will support them. Many government funders are starting to introduce data sharing mandates, generally with very weak wording, but in most cases these refer to the expectation that funded research will adhere to the standard of ‘best practice’ in the relevant field. At this stage of development it may be more productive to drive adoption through the strategic support of improving best practice in a wide range of fields than to attempt to define strict standards.

Summary

The community advocating more open practice in scientific research is growing in size and influence. The major progress made in the past 12-18 months by the Open Access movement, and the development of deposition and data sharing mandates by a range of research funders, show that real progress is being made in increasing access both to the finished products of research and to the materials that support them. While there have been significant successes, this remains a delicate moment. There is a risk of over-enthusiasm driving expectations which cannot be delivered, and of alienating the mainstream community that we wish to draw in. The fears and concerns of researchers about widening access to their work need to be addressed sensitively and seriously, pointing out the benefits but also acknowledging the risks involved in adopting these practices.

It will not be enough to develop tools and infrastructure that, if adopted, would revolutionise science communication. Those tools must be built with an understanding of how scientists work today, and with the explicit aim of embedding themselves in existing workflows. The need for, and the benefits of, adopting controlled vocabularies need to be sold much more effectively to the mainstream scientific community. The ontologies community also needs to recognise that there are cases and areas where the use of strict controlled vocabularies is not appropriate. Web 2.0 and Semantic Web technologies are not competitors but complementary approaches that are appropriate in different contexts. Again, the right questions to ask are ‘what do scientists do, and what can we do to make that work better?’; not ‘how can we make scientists see that they need to do things the right way?’

Finally, it is my belief that now is not the time to set out specific and strict standards for what qualifies as Open Science. It is the right time to discuss the details of what these standards might look like. It is the right time to look at examples of best practice, to celebrate these, and to see what can be learnt from them; but with our current lack of experience, and our lack of knowledge of what the unintended consequences of specific standards might be, it is too early to pin down the details of those standards. It is a good time to be clearly articulating the specific aspirations of the movement, and to provide goals that communities can aggregate around; the fully supported paper, the Science Commons principles, and the Open Knowledge Definition are all useful starting points. Open Science is gathering momentum, and that is a good thing. But equally it is a good time to take stock, identify the best course forward, and make sure that we are carrying as many people forward with us as we can.

A personal view of open science – Part III – Social issues

The third instalment of the paper (first part, second part), in which I discuss social issues around practising more Open Science.

Scientists are inherently rather conservative in their adoption of new approaches and tools. A conservative approach has served the community well in the process of sifting ideas and claims; it is well summarised by the aphorism ‘extraordinary claims require extraordinary evidence’. New methodologies and tools often struggle to be accepted until the evidence of their superiority is overwhelming. It is therefore unreasonable to expect the rapid adoption of new web based tools, and even more unreasonable to expect scientists to change their overall approach to their research en masse. The experience of the adoption of new Open Access journals is a good example of this.

Recent studies have shown that scientists are, in principle, in favour of publishing in Open Access journals, yet show a marked reluctance to publish in such journals in practice [1]. The most obvious reason for this is the perceived cost. Because most Open Access publishers charge a publication fee, and until recently such charges were not allowable costs for many research funders, it can be challenging for researchers to obtain the necessary funds. Although most OA publishers will waive these charges, there is, anecdotally, a marked reluctance to ask for such a waiver. Other reasons for not submitting papers to OA journals include the perception that most OA journals are low impact, and a lack of OA journals in specific fields. Finally, simple inertia can be a factor where the traditional publication outlets for a specific field are well defined and publishing outside the set of ‘standard’ journals runs the risk of the work simply not being seen by peers. As there is no perceived reward for publishing in open access journals, and a perception of significant risk, uptake remains relatively low.

Making data available faces similar challenges, but here they are more profound. At least when publishing in an open access journal the output can be counted as a paper. Because there is no culture of citing primary data, but rather of citing the papers in which it is reported, there is no reward for making data available. If careers are measured in papers published, then making data available does not contribute to career development. Data availability to date has generally been driven by strong community norms, usually backed up by journal submission requirements. Again this links data publication to paper publication, without necessarily encouraging the release of data that is not explicitly linked to a peer reviewed paper. The large-scale DNA sequencing and astronomy facilities stand out as cases where data is automatically made available as it is taken. In both cases this policy is driven largely by the funders, or facility providers, who are in a position to make release a condition of funding the data collection. This is not, however, a policy that has been adopted by other facilities such as synchrotrons, neutron sources, or high power photon sources.

In other fields, where data is more heterogeneous and particularly where competition to publish is fierce, the idea of data availability raises many fears. The primary one is of being ‘scooped’, or data theft, where others publish a paper before the data collector has had the chance to fully analyse the data. This is partly answered by robust data citation standards, but such standards do not prevent another group publishing an analysis more quickly, potentially damaging the career or graduation prospects of the data collector. A principle of ‘first right to publish’ is often suggested. Other approaches include timed embargoes on re-use or release. All of these have advantages and disadvantages, which depend to a large extent on how well behaved the members of a specific field are. Another significant concern is that the release of substandard, non peer-reviewed, or simply inaccurate data into the public domain will lead to further problems of media hype and public misunderstanding. This must be balanced against the potential public good of having relevant research data available.

The community, or more accurately communities, are in general waiting for evidence of benefits before adopting either open access publication or open data policies. This actually provides the opportunity for individuals and groups to take a first-mover advantage. While it remains controversial [4, 5], there is some evidence that publication in open access journals leads to higher citation counts for papers [6, 7], and that papers for which the supporting data is available receive more citations [3]. This advantage is likely to be at its greatest early in the adoption curve and will clearly disappear if these approaches become widespread. There are therefore clear advantages to be had in rapidly adopting more open approaches to research, which can be balanced against the risks described above.

Measuring success in the application of open approaches, and particularly quantifying success relative to traditional approaches, is a challenge, as is demonstrated by the continuing controversy over the citation advantage of open access articles. However, pointing to examples of success is relatively straightforward. In fact Open Science has a clear public relations advantage, as the examples are out in the open for anyone to see. This exposure can be both good and bad, but it makes publicising best practice easy. In many ways the biggest successes of open practice are the ones that we miss because they are right in front of us: the freely accessible biological databases such as the Protein Data Bank, NCBI, and many others, which have driven the massive advances in the biological sciences over the past 20 years. The ability to analyse and consider the implications of genome-scale DNA sequence data, as it is being generated, is now taken for granted.

In the physical sciences, the arXiv has long stood as an example to other disciplines of how the research literature can be made available in an effective and rapid manner, and the availability of astronomical data from efforts such as the Sloan Digital Sky Survey makes projects that combine public outreach with the crowdsourcing of data analysis, such as Galaxy Zoo, possible. There is likely to be a massive expansion in the availability of environmental and ecological data globally as the potential of combining millions of data gatherers holding mobile phones with sophisticated data aggregation and manipulation tools is realised.

Closer to the bleeding edge of radical sharing there have been fewer high-profile successes, a reflection both of the limited time these approaches have been pursued and of the limited financial and personnel resources that have been available. Nonetheless there are examples. Garrett Lisi’s high-profile preprint on the arXiv, An exceptionally simple theory of everything [8], is supported by a comprehensive online notebook at http://deferentialgeometry.org that contains all the arguments as well as the background detail and definitions that support the paper. The announcement by Jean-Claude Bradley of the successful identification of several compounds with activity against malaria [9] is an example where the whole research process was carried out in the open, from the decision on what the research target should be, through the design and in silico testing of a library of chemicals, to the synthesis and testing of those compounds. For every step of this process the data is available online, and several of the collaborators who made the study possible made contact after finding that material online. The potential for a coordinated global synthesis and screening effort is currently being investigated.

There are both benefits and risks associated with open practice in research, and often the discussion with researchers focusses on the disadvantages and risks. In an inherently conservative pursuit it is perfectly valid to ask whether changes of this type and magnitude offer benefits commensurate with the risks they pose. These are not concerns that should be dismissed or ridiculed, but ones that should be taken seriously and considered. Radical change never comes without casualties, and while some concerns may be misplaced or overblown, there are many that have real potential consequences. In a competitive field, people will necessarily make diverse decisions on the best way forward for them. What is important is providing them with the best information possible to help them balance the risks and benefits of whatever approach they choose to take.

The fourth and final part of this paper can be found here.

  1. Warlick S E, Vaughan K T. Factors influencing publication choice: why faculty choose open access. Biomedical Digital Libraries. 2007;4:1-12.
  2. Bentley D R. Genomic Sequence Information Should Be Released Immediately and Freely. Science. 1996;274:533-534.
  3. Piwowar H A, Day R S, Fridsma D B. Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE. 2007;2(3):e308.
  4. Davis P M, Lewenstein B V, Simon D H, Booth J G, Connolly M J. Open access publishing, article downloads, and citations: randomised controlled trial. BMJ. 2008;337:a568.
  5. Rapid responses to Davis et al. BMJ. http://www.bmj.com/cgi/eletters/337/jul31_1/a568
  6. Eysenbach G. Citation Advantage of Open Access Articles. PLoS Biology. 2006;4(5):e157.
  7. Hajjem C, Harnad S, Gingras Y. Ten-Year Cross-Disciplinary Comparison of the Growth of Open Access and How it Increases Research Citation Impact. IEEE Data Engineering Bulletin. 2005;28(4):39-47. http://eprints.ecs.soton.ac.uk/12906/
  8. Lisi G. An Exceptionally Simple Theory of Everything. arXiv:0711.0770 [hep-th], November 2007.
  9. Bradley J C. We have antimalarial activity! UsefulChem Blog, 25 January 2008. http://usefulchem.blogspot.com/2008/01/we-have-anti-malarial-activity.html