My Bad…or how far should the open mindset go?

So while on the train yesterday, in a somewhat pre-caffeinated state, I stuck my foot in it. Several others have written (Nils Reinton, Bill Hooker, Jon Eisen, Hsien-Hsien Lei, Shirley Wu) on the unattributed use of an image that was put together by Ricardo Vidal for the DNA Network of blogs. The company that did this is selling hokum. No question of that. Now the logo is in fact clearly marked as copyright on Flickr, but even if it were marked as CC-BY the company would be in violation of the licence for not attributing. But, despite the fact that what was done is clearly technically wrong, I felt that the outrage being expressed was inconsistent with the general attitude that materials should be shared, re-useable, and available for re-purposing.

So in the related FriendFeed thread I romped in, offended several people (particularly by using the word hypocritical, which I should not have done; like I said, pre-caffeine) and had to back up and re-think what it was I was trying to say. Actually this is a good thing about FriendFeed: the rapid-fire discussion can encourage semi-baked comments and ideas, which are then leapt on and need to be more carefully thought through and refined. In science, criticism is always valuable; agreement is often a waste of time.

So at core my concern is largely about the apparent message that can be sent by a group of “open” activists objecting to the violation of the copyright of a member of their community. As I wrote further down in the comments:

“…There is a danger that this kind of thing comes across as ‘everything should be pd [public domain] but when my mate copyrights something and you violate it I will jump down your throat’. The subtext being it is ok to violate copyright for ‘good’ reasons but not for ‘bad’ reasons…”

It is crucially important to me that when you argue that an area of law is poorly constructed, ineffective, or has unexpected consequences, you scrupulously operate within that law yourself, while not criticising those who cut corners. At the same time, if I argue that the risks of having people ‘steal’ my work are outweighed by the benefits of sharing, then I should roll with the punches when bad stuff does happen. There is the specific issue that what was done is a breach of copyright, and then the general issue that if people were more able to do this kind of thing it would be good. The fact that it was used for a nasty service preying on people’s fears is at one level neither here nor there (or rather, the moral rights issue is, I think, a separate and rather complicated one that will not fit in this particular margin: does the use of the logo misrepresent Ricardo? Does it misrepresent the DNA Network – who, remember, don’t own it?).

More broadly I think there is a mindset that goes with the way the web works, and the way that sharing works, that means we need to get away from the idea of the object or the work as property. The value of objects as property lies only in their scarcity. With the advent of the world’s greatest copying machine, no digital object need be scarce. It is not the object that has value, because it can be infinitely copied for near zero cost; it is the skill and expertise in putting the object together that has value. The argument of the “commonists” is that you will spend more on using licences and secrecy to protect objects than you could be making by finding the people who need your skills to make just the thing that they need, right now. If this is true it presumably holds for data, for scientific papers, for photos, for video, for software, for books, and for logos.

The argument that I try to promote (and many others make much better) is that we need to get away from the concepts and language of ownership of these digital objects. Even thinking in terms of something being “mine” is counterproductive and actually reduces value. It may be the case that there are limits to where these arguments hold, and if there are, they probably have something to do with the intrinsic timeframe of the production cycle for a given class of objects, but that is a thought for another time. What worried me was that people seemed to be using language driven by thinking about property and scarcity: “theft”, “stealing”. In my view we should be talking about “service quality”, “delivery time”, and “availability”. This is where value lies on the net, not in control, and not in ownership of objects.

None of which is to say that people should not be completely free to license work they produce in any way that they choose, and I will defend their right to do this. But at the same time I will work to persuade these same people that some types of licence are counterproductive, particularly those that attempt to control content. If you believe that science is better for having its component parts shared and re-used, and that the value of a person’s work is increased by others re-using it, why shouldn’t that apply to other types of work? The key thing is a consistent and clear message.

I try to be consistent, and I am by no means always successful, but it’s a work in progress. Anyone is free to re-use and re-purpose anything I generate in whatever way they choose. If I disagree with the use I will say so. If it is unattributed I might comment, and I might name names, but I won’t call in the lawyers. If I am inconsistent I invite, and indeed expect, people to say so. I would hope that criticism would come from the friendly faces before it comes from people with another agenda. That, at the end of the day, is the main benefit of being open. It’s all just error checking in the end.

It’s a little embarrassing…

…but being straightforward is always the best approach. Since we published our paper in PLoS ONE a few months back I haven’t been as happy as I was with the activity of our Sortase. What this means is that we are now using a higher concentration of the enzyme to do our ligation reactions. They seem to be working well and with high yields, but we need to put in more enzyme. If you don’t understand that, don’t worry – just imagine you posted a carefully thought out recipe and then discovered you couldn’t get that same taste again unless you added ten times as much saffron.

None of this prevents the method being useful and doesn’t change the fundamental point of our paper, but if people are following our methods, particularly if they only go to the paper and don’t get in contact, they may run into trouble. Traditionally this would be a problem, and would probably lead to our results being regarded as unreliable. However in our case there is a simple fix. Because the paper is in PLoS ONE, which has some good commenting features, I can add a note to the paper itself, right where we give the concentration of enzyme that we used (scroll down to note 3 in results). I can also add a note to direct people to where we have put more of our methodology online, at OpenWetWare. As we get more of this work into our online lab notebooks we will also be able to point directly back to example experiments to show how the reaction rate varies, and hopefully in the longer term sort it out. All easily done on the web, but impossible on paper, and in an awful lot (but not all!) of the other journals around.

Or we could just let people find out for themselves…

Note to the PLoS team: Even better would be if I could have a link that went to a page where the comment was displayed in the context of the paper (i.e. what you get when you click on the marker when reading the paper) :-)

Connecting the dots – the well posed question and code as a liability

Just a brief thought prompted by two, partly related, things streaming past my nose. Firstly, Michael Nielsen discussed the views of Aristotle and Sunstein on collective intelligence. The thing that caught my attention was the idea that deliberation can make group functioning worse, leading to a collective decision that is muddled rather than actually identifying the best answer presented by members of the community. The exception to this is well posed questions, where deliberation can help. In science we are familiar with the idea that getting the question right (correct design of experiment, well organised theory) can be more important than the answer.

The second item was a blog post entitled “Data is good, code is a liability” from Greg Linden, shared by Deepak Singh. Greg discussed a talk given by Peter Norvig which focusses on the idea that it is better to get a good-sized dataset and use very sparing code to get at an answer, rather than attempt to get at the answer de novo via complex code. Quoting from the post:

In one of several examples, Peter put up a slide showing an excerpt for a rule-based spelling corrector. The snippet of code, that was just part of a much larger program, contained a nearly impossible to understand let alone verify set of case and if statements that represented rules for spelling correction in English. He then put up a slide containing a few line Python program for a statistical spelling correction program that, given a large data file of documents, learns the likelihood of seeing words and corrects misspellings to their most likely alternative. This version, he said, not only has the benefit of being simple, but also easily can be used in different languages.

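For those who haven’t seen Norvig’s example, the flavour of the approach is easy to capture in a few lines of Python. The sketch below is my own illustration of the idea, not the code from the talk; the corpus file name is an assumption, and any large plain-text collection of English would do:

    import re
    from collections import Counter

    # Learn word frequencies from a large corpus (file name is illustrative)
    WORDS = Counter(re.findall(r'[a-z]+', open('big.txt').read().lower()))

    def edits1(word):
        """All strings one edit (delete, transpose, replace, insert) away."""
        letters = 'abcdefghijklmnopqrstuvwxyz'
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def correct(word):
        """Return the most frequent known word within one edit of `word`."""
        candidates = ({word} & WORDS.keys()) or (edits1(word) & WORDS.keys()) or {word}
        return max(candidates, key=lambda w: WORDS[w])

    print(correct('speling'))  # prints the most likely intended word, 'spelling'

The rules live in the data (the word counts), not in the code; the program itself is little more than a well posed question asked of the corpus.
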
What struck me was the connection between being able to write a short, readable snippet of code, and the “well posed question”. The dataset provides the collective intelligence. So is it possible to propose the following?

“A well posed question is one which, given an appropriate dataset, can be answered by easily prepared and comprehensible code”

This could also possibly be turned on its head: “a good programming environment is one in which well posed questions can be readily converted to programs”. But it also raises an important point about how the structure of datasets relates to the questions you want to ask. The challenge in recording data is to structure it in such a way that the widest possible set of questions can be asked of it. Data models all presuppose the kinds of questions that will be asked. And any sufficiently general data model will be inefficient for most specific types of query.

Rajarshi Guha and Pierre Lindenbaum have been busy preparing different datastores for the solubility data being generated as part of the Open Notebook Science Challenge announced by Jean-Claude Bradley (more on this later). Rajarshi’s form-based input has an SQL backend while Pierre has been working to extract the information as RDF. The point is not that one approach is better than the other, but that we need both, and possibly many more formats – and ideally we need to interconvert between them on the fly, as sketched below. A well posed question can easily founder on an inappropriately structured dataset (this is actually just a rephrasing of the Saunders Principle). It will be by enabling easy conversion between different formats that we might approach a situation where the aphorism I have suggested could become true.
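
As a toy illustration of that kind of on-the-fly interconversion, the sketch below walks a tabular record of the sort a form-based SQL store might hold and emits RDF-style triples. The column names, URIs, and file name are all assumptions made for the purpose of the example, not the actual schema of either datastore:

    import csv

    BASE = "http://example.org/solubility/"  # hypothetical namespace

    def row_to_triples(row):
        """Map one tabular solubility measurement to subject-predicate-object triples."""
        subject = f"<{BASE}measurement/{row['id']}>"
        return [
            (subject, f"<{BASE}solute>", f'"{row["solute"]}"'),
            (subject, f"<{BASE}solvent>", f'"{row["solvent"]}"'),
            (subject, f"<{BASE}concentration>", f'"{row["conc_M"]}"'),
        ]

    with open("solubility.csv") as f:          # assumed export from the SQL store
        for row in csv.DictReader(f):
            for s, p, o in row_to_triples(row):
                print(f"{s} {p} {o} .")        # N-Triples-style output

The point of the toy is only that the mapping is mechanical once the schema is known; the hard part, as the Saunders Principle suggests, is structuring the data so that such mappings remain possible for questions you have not yet thought of.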

What Russell Brand and Jonathan Ross can teach us about the value of community norms

For anyone in the UK who lives under a stone, or those people elsewhere in the world who don’t follow British news: this week there has been at least some news beyond the ongoing economic crisis and the U.S. election. Two media ‘personalities’ have been excoriated for leaving what can only be described as crass and offensive messages on an elderly actor’s answerphone, while on air. What made the affair worse was that the radio programme was in fact recorded, and someone, somewhere, made a decision to broadcast it in full. Even worse was the fact that the broadcaster was that bastion of British values, the BBC.

If you want to get more of the details of what exactly happened then do a search on their names, but what I wanted to focus on here is some of the public and institutional reactions, and their relation to the presumed need within the science community for ‘rules’, ‘licences’, and ‘copyright’ over works and data. Consistently we try to explain why this is not a good approach and why developing strong community norms is better [1, 2]. I think this affair gives an example of why.

Much of the media and public outcry has been of the type ‘there must be some law, or if not some BBC rule, that must have been broken – bang them up!’ There is a sense that there can only be recourse if someone has broken a rule. This is quite similar to the sense amongst many researchers that they will only be able to ‘protect’ the results they make public by making them available under an explicit licence; that the only way they can have any recourse against someone ‘misusing’ ‘their’ results is if they can show that the terms of a licence have been broken.

The problem with this, as we know, is two-fold. First, if someone does break the terms of the licence then frankly your chance of actually doing anything about it is pretty minimal. Secondly, and more importantly from the perspective of those of us interested in re-use and re-purposing, we know that pretty much any licensing system will create incompatibilities that prevent combining datasets, or using them in new ways, even when that wasn’t the intention of the original licensor.

There is an interesting parallel here with the Brand/Ross affair. It is entirely possible that no laws, or even BBC rules, have been broken. Does this mean they get off scot-free? No: Brand has resigned and Ross has been suspended, with his popular Friday night TV show apparently not to be recorded this week. The most interesting thing about the whole affair is that the central failure at the BBC was an editorial one. Some, so far unnamed, senior editor signed off and allowed the programme to be broadcast. What should have happened was that this editor blocked the programme or removed the offending passages. Not because a rule was broken, but because it was not appropriate for the BBC’s editorial standards; because it violated the community norms of what is acceptable for the BBC to broadcast. Whether or not they broke any rules, what was done was crass and offensive. Whether or not someone is technically in violation of a data re-use licence, failing to provide adequate attribution to the generators of that dataset is equally crass and unacceptable behaviour.

What the BBC discovered was that when it doesn’t live up to the standards that the wider community expects of it, it receives withering censure. Indeed much of the most serious criticism came from some of its own programmes. It was the voice of the wider community (as mediated through the mass media, admittedly) which led to the resignation and suspension. If it were just a question of ‘rules’ it is entirely possible that nothing could have been done. And if rules had been put in place that would have prevented it, then the unintended consequence would almost certainly have been to block programmes that had valid dramatic or narrative reasons for carrying such a passage. Again, community censure was much more powerful than any tribunal arbitrating some set of rules.

Yes this is nuanced, yes it is difficult to get right, and yes there is the potential for mob rule. That is why there is a team of senior professional editors at the BBC charged with policing and protecting the ‘community norms’ of what is acceptable for the BBC brand. That is why the damage done to the BBC’s brand will be severe. Standards where it is explicit that the spirit rather than the letter is applied, and where there are grey areas, can be much more effective than legalistic rules. When someone or some group clearly steps outside of the bounds then widespread censure is appropriate. It is then for individuals and organisations to decide how to apply that censure, and in turn to expect to be held to the same standards.

The cheats will always break the rules. If you use legalistic rules, then you invite legalistic approaches to getting around them. Those that try to apply the rules properly will then be hamstrung in their attempts to do anything useful while staying within the letter of the law. Community norms and standards of behaviour, appropriate citation, respect for people’s work and views, can be much more effective.

  1. Wilbanks, John. The Control Fallacy: Why OA Out-Innovates the Alternative. Available from Nature Precedings <http://hdl.handle.net/10101/npre.2008.1808.1> (2008)
  2. Wilbanks, John. Chemspider: Good intentions and the fog of licensing. http://network.nature.com/people/wilbanks/blog/2008/05/10/chemspider-good-intentions-and-the-fog-of-licensing (2008)

Call for submissions for a project on The Use and Relevance of Web 2.0 Tools for Researchers

The Research Information Network has put out a call for expressions of interest in running a research project on how Web 2.0 tools are changing scientific practice. The project will be funded up to £90,000. Expressions of interest are due on Monday 3 November (yes, next week) and the projects are due to start in January. You can see the call in full here, but in outline RIN is seeking evidence on whether Web 2.0 tools are:

• making data easier to share, verify and re-use, or otherwise facilitating more open scientific practices;

• changing discovery techniques or enhancing the accessibility of research information;

• changing researchers’ publication and dissemination behaviour (for example, due to the ease of publishing work-in-progress and grey literature);

• changing practices around communicating research findings (for example through opportunities for iterative processes of feedback, pre-publishing, or post-publication peer review).

Now we as a community know that there are cases where all of these are occurring, and we have fairly extensively documented examples. The question is obviously one of the degree of penetration. Again, we know this is small – though I’m not exactly sure how you would quantify it.

My challenge to you: would it be possible to use the tools and community we already have in place to carry out the project? In the past we’ve talked a lot about aggregating project teams and distributed work, but the problem has always been that people don’t have the time to spare. We would need to get some help from social scientists on the process and design of the investigation, but with £90,000 there is easily enough money to pay people properly for their time. Indeed I know there are people out there freelancing who are in many ways working on these issues already. So my question is: Are people interested in pursuing this? And if so, what do you think your hourly rate is?

A personal view of Open Science – Part IV – Policies and standards

This is the fourth and final part of the serialisation of a draft paper on Open Science. The other parts are here – Part I, Part II, Part III.

A question that needs to be asked when contemplating any major change in practice is the balance and timing of ‘bottom-up’ versus ‘top-down’ approaches for achieving that change. Scientists are notoriously unresponsive to decrees and policy initiatives but, as has been discussed, they are also inherently conservative and generally resistant to change led from within the community. For those advocating the widespread, and ideally rapid, adoption of more open practice in science it will be important to strike the right balance between calling for mandates and conditions for funding or journal submission, and simply adopting these practices in their own work. While the motivation behind the adoption of data sharing policies by funders such as the UK research councils is to be applauded, it is possible for such initiatives to be counterproductive if the policies are not supported by infrastructure development, appropriate funding, and appropriate enforcement. Equally, standards and policy statements can send a powerful message about the aspirations of funders to make the research they fund more widely available and, for the most part, when funders speak, scientists listen.

One Approach for Mainstream Adoption – The fully supported paper

There are two broad approaches to standards that are currently being discussed. The first of these is aimed at mainstream acceptance and uptake and can be described as ‘The fully supported paper’. This is a concept that is simple on the surface but very complex to implement in practice. In essence it is the idea that the claims made in a peer reviewed paper in the conventional literature should be fully supported by a publicly accessible record of all the background data, methodology, and data analysis procedures that contribute to those claims. On one level this is only a slight increase in requirements over the Brussels Declaration made by the International Association of Scientific, Technical and Medical Publishers in 2007, which states:

Raw research data should be made freely available to all researchers. Publishers encourage the public posting of the raw data outputs of research. Sets or sub-sets of data that are submitted with a paper to a journal should wherever possible be made freely accessible to other scholars

http://www.stm-assoc.org/brussels-declaration/

The degree to which this declaration is supported by publishers, and the level to which different journals require their authors to adhere to it, is a matter for debate, but the principle of availability of background data has been accepted by a broad range of publishers. It is therefore reasonable to consider the possibility of making the public posting of data a requirement for submission. At a simple level this is already possible. For specific types of data, repositories already exist, and in many cases journals require submission of these data types to recognised repositories. More generally it is possible to host datasets in some institutional repositories, and with the expected announcement of a large-scale data hosting service from Google the argument that this is not practicable is becoming unsustainable. While such datasets may have limited discoverability and limited metadata, they will at least be discoverable from the papers that reference them. It is reasonable to expect sufficient context to be provided in the published paper to make the data useable.

However the data itself, except in specific cases, is not enough to be useful to other researchers. The details of how that data was collected and how it was processed are critical if the claims made in a paper are to be properly judged. Once again we come to the problem of recording the process of research and then presenting that in a form which is both detailed enough to be widely useful and not so dense as to be impenetrable. The technical challenges of delivering a fully supported paper are substantial. However it is difficult to argue that this shouldn’t be available. If claims made in the scientific literature cannot be fully verified, can they be regarded as scientific? Once again – while the target is challenging – it is simply a proposal to do good science, properly communicated.

Aspirational Standards – celebrating best practice in open science

While the fully supported paper would be a massive social and technical step forward, it is in many ways no more open than the current system. It does not deal with the problem of unpublished or unsuccessful studies that may never find a home in a traditional peer reviewed paper. As discussed above, the ‘fully supported paper’ is not really ‘open science’; it is just good science. What then are the requirements, or standards, for ‘open science’? Does there need to be a certificate, or a set of requirements that must be met, before a project, individual, or institution can claim they are doing Open Science? Or is Open Science simply too generic a term and prone to misinterpretation?

I would argue that while ‘Open Science’ is a very generic term it has real value as a rallying point or banner. It is a term which generates a significant positive reaction amongst the general public, the mainstream media, and large sections of the research community. Its very vagueness also allows some flexibility, making it possible to welcome contributions from publishers, scientists, and funders which, while not 100% open, are nonetheless positive and helpful. Within this broad umbrella it is then possible to look at defining or recommending practices and standards, and giving these specific labels for identification.

The main work in the area of defining relevant practices and standards has been carried out by Science Commons and the Open Knowledge Foundation. Science Commons have published four ‘Principles for Open Science’ which focus on the availability and accessibility of published literature, research tools, and data, and the development of cyberinfrastructure to make this possible. These four principles do not currently include the availability of process explicitly, which has been covered in detail above, but they provide a clear set of criteria which could form the basis of standards. Broadly speaking, research projects, individuals, or institutions that deliver on these principles could be said to be doing Open Science. The Open Knowledge Definition, developed by the Open Knowledge Foundation, is another useful touchstone here. Another possible defining criterion for Open Science is that all the relevant material is made available under licences that adhere to the definition.

The devil, naturally, lies in the details. Are embargoes on data and methodology appropriate, and if so, in what fields and how should they be constructed? For data that cannot be released, should specific exceptions be made, or special arrangements made to hold data in secure repositories? Where the same group is doing open and commercial research, how should the divisions between these projects be defined and declared? These details are important, and will take time to work out. In the short term it is therefore probably more effective to identify and celebrate examples of open science, define best practice, and observe how it works (and does not work) in the real world. This will raise the profile of Open Science without immediately making it the exclusive preserve of those with the luxury of radically changing practice. It enables examples of best practice to be held up as aspirational standards, providing the goals for others to work towards, and the impetus for the tool and infrastructure development that will support them. Many government funders are starting to introduce data sharing mandates, generally with very weak wording, but in most cases these refer to the expectation that funded research will adhere to the standard of ‘best practice’ in the relevant field. At this stage of development it may be more productive to drive adoption through the strategic support of improving best practice in a wide range of fields than to attempt to define strict standards.

Summary

The community advocating more open practice in scientific research is growing in size and influence. The major progress made in the past 12-18 months by the Open Access movement, and the development of deposition and data sharing mandates by a range of research funders, show that real progress is being made in increasing access to both the finished products of research and the materials that support them. While there have been significant successes this remains a delicate moment. There is a risk of over-enthusiasm driving expectations which cannot be delivered, and of alienating the mainstream community that we wish to draw in. The fears and concerns of researchers about widening access to their work need to be addressed sensitively and seriously, pointing out the benefits but also acknowledging the risks involved in adopting these practices.

It will not be enough to develop tools and infrastructure that, if adopted, would revolutionise science communication. Those tools must be built with an understanding of how scientists work today, and with the explicit aim of embedding themselves in existing workflows. The need for, and the benefits of, adopting controlled vocabularies need to be sold much more effectively to the mainstream scientific community. The ontologies community also needs to recognise that there are cases and areas where the use of strict controlled vocabularies is not appropriate. Web 2.0 and Semantic Web technologies are not competitors but complementary approaches that are appropriate in different contexts. Again, the right questions to ask are ‘what do scientists do, and what can we do to make that work better?’, not ‘how can we make scientists see that they need to do things the right way?’

Finally, it is my belief that now is not the time to set out specific and strict standards of what qualifies as Open Science. It is the right time to discuss the details of what these standards might look like. It is the right time to look at examples of best practice, to celebrate these, and to see what can be learnt from them; but with our current lack of experience, and our lack of knowledge of what the unintended consequences of specific standards might be, it is too early to pin down the details of those standards. It is a good time to be clearly articulating the specific aspirations of the movement, and to provide goals that communities can aggregate around; the fully supported paper, the Science Commons principles, and the Open Knowledge Definition are all useful starting points. Open Science is gathering momentum, and that is a good thing. But equally it is a good time to take stock, identify the best course forward, and make sure that we are carrying as many people forward with us as we can.

A personal view of open science – Part III – Social issues

The third instalment of the paper (first part, second part), where I discuss social issues around practising more Open Science.

Scientists are inherently rather conservative in their adoption of new approaches and tools. A conservative approach has served the community well in the process of sifting ideas and claims; this approach is well summarised by the aphorism ‘extraordinary claims require extraordinary evidence’. New methodologies and tools often struggle to be accepted until the evidence of their superiority is overwhelming. It is therefore unreasonable to expect the rapid adoption of new web based tools, and even more unreasonable to expect scientists to change their overall approach to their research en masse. The experience of adoption of new Open Access journals is a good example of this.

Recent studies have shown that scientists are, in principle, in favour of publishing in Open Access journals, yet show a marked reluctance to publish in such journals in practice [1]. The most obvious reason for this is the perceived cost. Because most Open Access publishers charge a publication fee, and until recently such charges were not allowable costs for many research funders, it can be challenging for researchers to obtain the necessary funds. Although most OA publishers will waive these charges, there is anecdotally a marked reluctance to ask for such a waiver. Other reasons for not submitting papers to OA journals include the perception that most OA journals are low impact and a lack of OA journals in specific fields. Finally, simple inertia can be a factor where the traditional publication outlets for a specific field are well defined and publishing outside the set of ‘standard’ journals runs the risk of the work simply not being seen by peers. As there is no perceived reward for publishing in open access journals, and a perception of significant risk, uptake remains relatively small.

Making data available faces similar challenges, but here they are more profound. At least when publishing in an open access journal it can be counted as a paper. Because there is no culture of citing primary data, but rather of citing the papers it is reported in, there is no reward for making data available. If careers are measured in papers published then making data available does not contribute to career development. Data availability to date has generally been driven by strong community norms, usually backed up by journal submission requirements. Again this links data publication to paper publication without necessarily encouraging the release of data that is not explicitly linked to a peer reviewed paper. The large scale DNA sequencing and astronomy facilities stand out as cases where data is automatically made available as it is taken [2]. In both cases this policy is driven largely by the funders, or facility providers, who are in a position to make release a condition of funding the data collection. This is not, however, a policy that has been adopted by other facilities such as synchrotrons, neutron sources, or high power photon sources.

In other fields, where data is more heterogeneous and particularly where competition to publish is fierce, the idea of data availability raises many fears. The primary one is of being ‘scooped’, or data theft, where others publish a paper before the data collector has had the chance to fully analyse the data. This is partly answered by robust data citation standards, but these do not prevent another group publishing an analysis more quickly, potentially damaging the career or graduation prospects of the data collector. A principle of ‘first right to publish’ is often suggested. Other approaches include timed embargoes for re-use or release. All of these have advantages and disadvantages which depend to a large extent on how well behaved members of a specific field are. Another significant concern is that the release of substandard, non peer-reviewed, or simply inaccurate data into the public domain will lead to further problems of media hype and public misunderstanding. This must be balanced against the potential public good of having relevant research data available.

The community, or more accurately communities, are in general waiting for evidence of benefits before adopting either open access publication or open data policies. This actually provides the opportunity for individuals and groups to take first-mover advantages. While it remains controversial [3, 4], there is some evidence that publication in open access journals leads to higher citation counts for papers [5, 6] and that papers for which the supporting data is available receive more citations [7]. This advantage is likely to be at its greatest early in the adoption curve and will clearly disappear if these approaches become widespread. There are therefore clear advantages to be had in rapidly adopting more open approaches to research, which can be balanced against the risks described above.

Measuring success in the application of open approaches, and particularly quantifying success relative to traditional approaches, is a challenge, as is demonstrated by the continuing controversy over the citation advantage of open access articles. However pointing to examples of success is relatively straightforward. In fact Open Science has a clear public relations advantage, as the examples are out in the open for anyone to see. This exposure can be both good and bad, but it makes publicising best practice easy. In many ways the biggest successes of open practice are the ones that we miss because they are right in front of us: the freely accessible biological databases, such as the Protein Data Bank, NCBI, and many others, that have driven the massive advances in the biological sciences over the past 20 years. The ability to analyse and consider the implications of genome scale DNA sequence data, as it is being generated, is now simply taken for granted.

In the physical sciences, the arXiv has long stood as an example to other disciplines of how the research literature can be made available in an effective and rapid manner, and the availability of astronomical data from efforts such as the Sloan Digital Sky Survey makes possible efforts such as Galaxy Zoo, which combine public outreach with the crowdsourcing of data analysis. There is likely to be a massive expansion in the availability of environmental and ecological data globally as the potential to combine millions of data gatherers holding mobile phones with sophisticated data aggregation and manipulation tools is realised.

Closer to the bleeding edge of radical sharing there have been fewer high profile successes, a reflection both of the limited amount of time these approaches have been pursued and the limited financial and personnel resources that have been available. Nonetheless there are examples. Garrett Lisi’s high profile preprint on the arXiv, An exceptionally simple theory of everything [8], is supported by a comprehensive online notebook at http://deferentialgeometry.org that contains all the arguments as well as the background detail and definitions that support the paper. The announcement by Jean-Claude Bradley of the successful identification of several compounds with activity against malaria [9] is an example where the whole research process was carried out in the open, from the decision on what the research target should be, through the design and in silico testing of a library of chemicals, to the synthesis and testing of those compounds. For every step of this process the data is available online, and several of the collaborators that made the study possible made contact after finding that material online. The potential for a coordinated global synthesis and screening effort is currently being investigated.

There are both benefits and risks associated with open practice in research, and often the discussion with researchers focusses on the disadvantages and risks. In an inherently conservative pursuit it is perfectly valid to ask whether changes of this type and magnitude offer any benefits given the potential risks they pose. These are not concerns that should be dismissed or ridiculed, but ones that should be taken seriously and considered. Radical change never comes without casualties, and while some concerns may be misplaced or overblown, there are many that have real potential consequences. In a competitive field people will necessarily make diverse decisions on the best way forward for them. What is important is providing them with the best information possible to help them balance the risks and benefits of any approach they choose to take.

The fourth and final part of this paper can be found here.

  1. Warlick S E, Vaughan K T. Factors influencing publication choice: why faculty choose open access. Biomedical Digital Libraries. 2007;4:1-12.
  2. Bentley D R. Genomic Sequence Information Should Be Released Immediately and Freely. Science. 1996;274(October):533-534.
  3. Piwowar H A, Day R S, Fridsma D B. Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE. 2007;2(3):e308.
  4. Davis P M, Lewenstein B V, Simon D H, Booth J G, Connolly M J. Open access publishing, article downloads, and citations: randomised controlled trial. BMJ. 2008;337(October):a568.
  5. Rapid responses to Davis et al., http://www.bmj.com/cgi/eletters/337/jul31_1/a568
  6. Eysenbach G. Citation Advantage of Open Access Articles. PLoS Biology. 2006;4(5):e157.
  7. Hajjem C, Harnad S, Gingras Y. Ten-Year Cross-Disciplinary Comparison of the Growth of Open Access and How it Increases Research Citation Impact. IEEE Data Engineering Bulletin. 2005;28(4):39-47. http://eprints.ecs.soton.ac.uk/12906/
  8. Lisi G, An exceptionally simple theory of everything, arXiv:0711.0770v1 [hep-th], November 2007.
  9. Bradley J C, We have antimalarial activity!, UsefulChem Blog, http://usefulchem.blogspot.com/2008/01/we-have-anti-malarial-activity.html, January 25 2008.

Where does Open Access stop and ‘just doing good science’ begin?

I had been getting puzzled for a while as to why I was being characterised as an ‘Open Access’ advocate. I mean, I do advocate Open Access publication, and I have opinions on the Green versus Gold debate. I am trying to get more of my publications into Open Access journals. But I’m no expert, and I’ve certainly been around this community for a much shorter time, and know a lot less about the detail, than many other people. The giants of the Open Access movement have been fighting the good fight for many years. Really I’m just a latecomer cheering from the sidelines.

This came to a head recently when I was being interviewed for a piece on Open Access. We kept coming round to the question of what it was that motivated me to be ‘such a strong’ advocate of open access publication. Surely I must have a very strong motivation to hold such strong views? And I found myself thinking that I didn’t. I wasn’t that motivated about open access per se. It took some thinking, and some retracing of my own path, to realise why: it was a question of where I was coming from.

I guess most people come to the Open Science movement first through an interest in Open Access: the frustration of not being able to access papers, followed by the realisation that for many other scientists it must be much worse. Often this is followed by the sense that even when you’ve got the papers they don’t have the information you want or need; that it would be better if they were more complete, with the data or software tools available and the methodology online. There is a logical progression from ‘better access to the literature helps’ to ‘access to all the information would be so much better’.

I came at the whole thing from a different angle. My Damascus moment came when I realised the potential power of making everything available; the lab book, the data, the tools, the materials, and the ideas. Once you connect the idea of the read-write web to science communication, it is clear that the underlying platform has to be open, accessible, and re-useable to get the benefits. Science is perhaps the ultimate open platform available to build on. From this perspective it is immediately self evident that the current publishing paradigm and subscription access publication in particular is broken. But it is just one part of the puzzle, one of the barriers to communication that need to be attacked, broken down, and re-built. It is difficult, for these reasons, for me to separate out a bit of my motivation that relates just to Open Access.

Indeed in some respects Open Access, at least in the form in which it is funded by author charges, can be a hindrance to effective science communication. Many of the people I would like to see more involved in the general scientific community, who would be empowered by more effective communication, cannot afford author charges. Indeed many of my colleagues in what appear to be well funded western institutions can’t afford them either. Sure, you can ask for a fee waiver, but no-one likes to ask for charity.

But I think papers are important. Some people believe that the scientific paper as it exists today is inevitably doomed. I disagree. I think it has an important place as a static document, a marker of what a particular group thought at a particular time, based on the evidence they had assembled. If we accept that the paper has a place then we need to ask how it is funded, particularly the costs of peer and editorial review, and the costs of maintaining that record into the future. If you believe, as I do, that in an ideal world this communication would be immediately available to all, then there are relatively few viable business models available. What has been exciting about the past few months, and indeed the past week, has been the evidence that these business models are starting to work through and make sense. The purchase of BioMed Central by Springer may raise concerns for the future, but it also demonstrates that a publishing behemoth has faith in the future of OA as a publishing business model.

For me, this means that in many ways the discussion has moved on. Open Access, and Open Access publication in particular, has proved its viability. The challenges now lie in widening the argument to include data, materials, and process; in developing the tools that will allow us to capture all of this in a meaningful way and to make sense of other people’s records. None of which should in any way belittle the achievement of those who have brought the Open Access movement to its current point. Immense amounts of blood, sweat, and tears from thousands of people have brought what was once a fringe movement to the centre of the debate on science communication. The establishment of viable publishers and repositories for pre-prints, the bringing of funders and governments to the table with mandates, and the placing of the option of OA publication at the fore of people’s minds are huge achievements, especially given the relatively short time it has taken. The debate on value for money, on quality of communication, and on business models and the best practical approaches will continue, but the debate about the value of, indeed the need for, Open Access has essentially been won.

And this is at the core of what Open Access means for me. The debate has placed, or perhaps re-placed, right at the centre of the discussion of how we should do science, the importance of the quality of communication. It has re-stated the principle of placing the claims that you make, and the evidence that supports them, in the open for criticism by anyone with the expertise to judge, regardless of where they are based or who is funding them. And it has made crystal clear where the deficiencies in that communication process lie, and exposed the creeping tendency of publication over the past few decades to become more an exercise in point scoring than communication. There remains much work to be done across a wide range of areas, but the fact that we can now look at taking those challenges on is due in no small part to the work of those who have advocated Open Access from its difficult beginnings to today’s success. Open Access Day is a great achievement in its own right, and it should be a celebration of the efforts of all those people who have contributed to making it possible, as well as an opportunity to build for the future.

High quality communication, as I and others have said, and will continue to say, is Just Good Science. The success of Open Access has shown how one aspect of that communication process can be radically improved. The message to me is a simple one. Without open communication you simply can’t do the best science. Open Access to the published literature is simply one necessary condition of doing the best possible science.