A personal view of Open Science – Part I
For the Open Science workshop at the Pacific Symposium on Biocomputing I wrote a very long essay as an introductory paper. It turned out that this was far too long for the space available so an extremely shortened version was submitted for the symposium proceedings. I thought I would post the full length essay in installments here as a prelude to cleaning it up and submitting to an appropriate journal.
Openness is arguably the great strength of the scientific method. At its core is the principle that claims and the data that support them are placed before the community for examination and critique. Through open examination and critical analysis models can be refined, improved, or rejected. Conflicting data can be compared and the underlying experiments and methodology investigated to identify which, if any, is more reliable. While individuals may not always adhere to the highest standards, the community mechanisms of review, criticism, and integration have proved effective in developing coherent and useful models of the physical world around us. As Lee Smolin of the Perimeter Institute for Theoretical Physics recently put it, “we argue in good faith from shared evidence to shared conclusions“. It is an open approach that drives science towards an understanding which, while never perfect, nevertheless enables the development of sophisticated technologies with practical applications.
The Internet and the World Wide Web provide the technical ability to share a much wider range of both the evidence and the argument and conclusions that drive modern research. Data, methodology, and interpretation can also be made available online at lower costs and with lower barriers to access than has traditionally been the case. Along with the ability to share and distribute traditional scientific literature, these new technologies also offer the potential for new approaches. Wikis and blogs enable geographically and temporally widespread collaborations, the traditional journal club can now span continents with online book marking tools such as Connotea and CiteULike, and the smallest details of what is happening in a laboratory (or on Mars ) can be shared via instant messaging applications such as Twitter.
The potential of online tools to revolutionise scientific communication and their ability to open up the details of the scientific enterprise so that a wider range of people can participate is clear. In practice, however, the reality has fallen far behind the potential. This is in part due to a need for tools that are specifically designed with scientific workflows in mind, partly due to the inertia of infrastructure providers with pre-Internet business models such as the traditional “subscriber pays” print literature and, to some extent, research funders. However it is predominantly due to cultural and social barriers within the scientific community. The prevailing culture of academic scientific research is one of possession – where control over data, methodological secrets, and exploitation of results are paramount. The tradition of Mertonian Science has receded, in some cases, so far that principled attempts to reframe an ethical view of modern science can seem charmingly naive.
It is in the context of these challenges that the movement advocating more openness in science must be seen. There will always be places where complete openness is not appropriate, such as where personal patient records may be identifiable, where research is likely to lead to patentable (and patent-worthy) results, or where the safety or privacy of environments, study subjects, or researchers might be compromised. These, however are special instances for which exceptional cases can be made, and not the general case across the whole of global research effort. Significant steps forward such as funder and institutional pre-print deposition mandates and the adoption of data sharing policies by UK Research Councils must be balanced against the legal and legislative attempts to overturn the NIH mandate and widespread confusion over what standards of data sharing are actually required and how they will be judged and enforced. Nonetheless there is a growing community interested in adopting more open practices in their research, and increasingly this community is developing as a strong voice in discussions of science policy, funding, and publication. The aim of this workshop is to strengthen this voice by focusing the attention of the community on areas requiring technical development, the development and implementation of standards, both technical and social, and identification and celebration of success.
Why we need open science – Open Access publication, Open Data, and Open Process
The case for taxpayer access to the taxpayer funded peer reviewed literature was made personally and directly in Jonathon Eisen’s first editorial for PLoS Biology .
[…describing the submission of a paper to PLoS Biology as an ‘experiment’…] But then, while finalizing the paper, a two-month-long medical nightmare ensued that eventually ended in the stillbirth of my first child. While my wife and I struggled with medical mistakes and negligence, we felt the need to take charge and figure out for ourselves what the right medical care should be. And this is when I experienced the horror of closed-access publishing. For unlike my colleagues at major research universities that have subscriptions to all journals, I worked at a 300-person nonprofit research institute with a small library. So there I was—a scientist and a taxpayer—desperate to read the results of work that I helped pay for and work that might give me more knowledge than possessed by our doctors. And yet either I could not get the papers or I had to pay to read them without knowing if they would be helpful. After we lost our son, I vowed to never publish in non-OA journals if I was in control. […]
Eisen JA (2008) PLoS Biology 2.0. PLoS Biol 6(2): e48 doi:10.1371/journal.pbio.0060048
As a scientist in a small institution he was unable to access the general medical literature. More generally, as a US taxpayer he was unable to access the outputs of US government funded research or indeed of research funded by the governments of other countries. The general case for enabling access of both the general public, scientists in less well funded institutions, and in the developing world has been accepted by most in principle. While there are continuing actions being taken to limit the action of the NIH mandate by US publishers a wide range of research institutions have adopted deposition mandates. There remains much discussion about routes to open access with the debate over ‘Green’ and ‘Gold’ routes continuing as well as an energetic ongoing debate about the stability and viability of the business models of various open access journals. However it seems unlikely that the gradual increase in number and impact of open access journals is likely to slow or stop soon. The principle that the scientific literature should be available to all has been won. The question of how best to achieve that remains a matter of debate.
A similar case to that for access to the published literature can also be made for research data. At the extremes, withholding data could lead to preventable deaths or severely reduced quality of life for patients. Andrew Vickers, in a hard hitting New York Times essay  dissected the reasons that medical scientists give for not making data from clinical cancer trials available; data that could, in aggregate, provide valuable insights into enhancing patient survival time and quality of life. He quotes work by John Kirwan (Bristol University) showing that three quarters of researchers in one survey opposed sharing data from clinical trials. While there may be specific reasons for retaining specific types of data from clinical trials, particularly in small specialised cases where maintaining the privacy of participants is difficult or impossible, it seems unarguable that the interests of patients and the public demand that such data be available for re-use and analysis. This is particularly the case where the taxpayer has funded these trials, but for other funders, including industrial funders, there is a public interest argument for making clinical trial data public in particular.
In other fields the case for data sharing may seem less clear cut. There is little obvious damage done to the general public by not making the details of research available. However, while the argument is more subtle, it is similar to that for clinical data. There the argument is that reanalysis and aggregation can lead to new insights with an impact on patient care. In non-clinical sciences this aggregation and re-analysis leads to new insights, more effective analysis, and indeed new types of analysis. The massive expansion in the scale and ambition of biological sciences over the past twenty years is largely due to the availability of biological sequence, structural, and functional data in international and freely available archives. Indeed the entire field of bioinformatics is predicated on the availability of this data. There is a strong argument to be made that the failure of the chemical sciences to achieve a similar revolution is due to the lack of such publicly available data. Bioinformatics is a highly active and widely practiced field of science. By comparison, chemoinformatics is marginalised, and, what is most galling to those who care for the future of chemistry, primarily driven by the needs and desires of biological scientists. Chemists for the most part haven’t grasped the need because the availability of data is not part of their culture.
High energy particle physics by contrast is necessarily based on a community effort; without strong collaboration, communication, and formalised sharing of the details of what work is going on the research simply would not happen. Astronomy, genome sequencing, and protein crystallography are other fields where there is a strong history, and in some cases formalized standards of data sharing. While there are anecdotal cases of ‘cheating’ or bending the rules, usually to prevent or restrict the re-use of data, the overall impact of data sharing in these areas is generally seen as positive, leading to better science, higher data quality standards, and higher standards of data description. Again, to paraphrase Smolin, where the discussion proceeds from a shared set of evidence we are more likely to reach a valid conclusion. This is simply about doing better science by improving the evidence base.
The final piece of the puzzle, and in many ways the most socially and technically challenging is the sharing of research procedures. Data has no value in and of itself unless the process used to generate it is appropriate and reliable. Disputes over the validity of claims are rarely based on the data themselves but on the procedures used either to collect them or those used to process and analyse them. A widely reported recent case turned on the details of how a protein was purified; whether with a step or gradual gradient elution. This detail of procedure led laboratories to differing results, a year of wasted time for one researcher, and ultimately the retraction of several high profile papers [refs – nature feature, retractions, original paper etc]. Experimental scientists generally imagine that in the computational sciences where a much higher level of reproducibility and the ready availability of code and subversion repositories makes sharing and documenting material relatively straightforward, would have much higher standards. However, a recent paper  by Ted Pedersen (University of Minnesota, Duluth) – with the wonderful title ‘Empiricism is not a matter of faith’ – criticized the standards of both code documentation and availability. He makes the case that working with the assumption that you will make the tools available to others not only allows you to develop better tools, and makes you popular in the community, but also improves the quality of your own work.
And this really is the crux of the matter. If the central principle of the scientific method is open analysis and criticism of claims then making the data and process and conclusions avalable and accessible is just doing good science. While we may argue about the timing of release or the details of ‘how raw’ available data needs to be or the file formats or ontologies used to describe it there can be no argument that if the scientific record is to have value it must rest on an accessible body of relevant evidence. Scientists were doing mashups long before the term was invented; mixing data from more than one source; reprocessing it to provide a different view. The potential of online tools to help to do this better is massive, but the utility of these tools depends on the sharing of data, workflows, ideas, and opinions.
There are broadly three areas for development that are required to enable the more widespread adoption of open practice by research scientists. The first is the development of tools that are designed for scientists. While many of the general purpose tools and services have been adopted by researchers there are many cases where specialised design or adaptation is required for the specific needs of a research environment. In some cases the needs of research willpush development in specific areas, such as controlled vocabularies, beyond what is being done in the mainstream. The second, and most important area involves the social and cultural barriers within various research communities.These vary widely in type and importance across different fields and understanding and overcoming the fears as well as challenging entrenched interests will be an important part of the open science programme. Finally, there is a value and a need to provide top-down guidance in the form of policies and standards. The vagueness of the term ‘Open Science’ means that while it is a good banner there is a potential for confusion. Standards, policies, and brands can provide clarity for researchers, a clear articulation of aspirations (and a guide to the technical steps required to achieve them), and the support required to help people actually make this happen in their own research.
Part II will cover the issues around tools for Open Science
- Smolin L (2008), Science as an ethical community, PIRSA ID#08090035, http://pirsa.org/08090035/
- Mars Phoenix on Twitter, http://twitter.com/MarsPhoenix
- Eisen JA (2008) PLoS Biology 2.0. PLoS Biol 6(2): e48 doi:10.1371/journal.pbio.0060048
- Vickers A (2008), http://www.nytimes.com/2008/01/22/health/views/22essa.html?_r=1
- Pedersen T (2008), Computational Linguistics, Volume 34, Number 3, pp. 465-470, Self archived.