Where does Open Access stop and ‘just doing good science’ begin?

I had been puzzled for a while as to why I was being characterised as an ‘Open Access’ advocate. I mean, I do advocate Open Access publication and I have opinions on the Green versus Gold debate. I am trying to get more of my publications into Open Access journals. But I’m no expert, and I’ve certainly been around this community for a much shorter time, and know a lot less about the detail, than many other people. The giants of the Open Access movement have been fighting the good fight for many years. Really I’m just a latecomer cheering from the sidelines.

This came to a head recently when I was being interviewed for a piece on Open Access. We kept coming round to the question of what it was that motivated me to be ‘such a strong’ advocate of open access publication. Surely I must have a very strong motivation to hold such strong views? And I found myself thinking that I didn’t. I wasn’t that motivated about open access per se. It took some thinking, and going back over where I had come from, to realise that this was because of where I was coming from.

I guess most people come to the Open Science movement firstly through an interest in Open Access. The frustration of not being able to access papers, followed by the realisation that for many other scientists it must be much worse. Often this is followed by the sense that even when you’ve got the papers they don’t have the information you want or need, that it would be better if they were more complete, the data or software tools available, the methodology online. There is a logical progression from ‘better access to the literature helps’ to ‘access to all the information would be so much better’.

I came at the whole thing from a different angle. My Damascus moment came when I realised the potential power of making everything available; the lab book, the data, the tools, the materials, and the ideas. Once you connect the idea of the read-write web to science communication, it is clear that the underlying platform has to be open, accessible, and re-usable to get the benefits. Science is perhaps the ultimate open platform available to build on. From this perspective it is immediately self-evident that the current publishing paradigm, and subscription access publication in particular, is broken. But it is just one part of the puzzle, one of the barriers to communication that need to be attacked, broken down, and re-built. It is difficult, for these reasons, for me to separate out a bit of my motivation that relates just to Open Access.

Indeed in some respects Open Access, at least in the form in which it is funded by author charges, can be a hindrance to effective science communication. Many of the people I would like to see more involved in the general scientific community, who would be empowered by more effective communication, cannot afford author charges. Indeed many of my colleagues in what appear to be well funded western institutions can’t afford them either. Sure, you can ask for a fee waiver, but no-one likes to ask for charity.

But I think papers are important. Some people believe that the scientific paper as it exists today is inevitably doomed. I disagree. I think it has an important place as a static document, a marker of what a particular group thought at a particular time, based on the evidence they had assembled. If we accept that the paper has a place then we need to ask how it is funded, particularly the costs of peer and editorial review, and the costs of maintaining that record into the future. If you believe, as I do, that in an ideal world this communication would be immediately available to all then there are relatively few viable business models available. What has been exciting about the past few months, and indeed the past week, has been the evidence that these business models are starting to work through and make sense. The purchase of BioMed Central by Springer may raise concerns for the future but it also demonstrates that a publishing behemoth has faith in the future of OA as a publishing business model.

For me, this means that in many ways the discussion has moved on. Open Access, and Open Access publication in particular, has proved its viability. The challenges now lie in widening the argument to include data, to include materials, to include process. To develop the tools that will allow us to capture all of this in a meaningful way and to make sense of other people’s records. None of which should in any way belittle the achievement of those who have brought the Open Access movement to its current point. Immense amounts of blood, sweat, and tears, from thousands of people, have brought what was once a fringe movement to the centre of the debate on science communication. The establishment of viable publishers and repositories for pre-prints, the bringing of funders and governments to the table with mandates, and the placing of the option of OA publication at the fore of people’s minds are huge achievements, especially given the relatively short time it has taken. The debate on value for money, on quality of communication, and on business models and the best practical approaches will continue, but the debate about the value of, indeed the need for, Open Access has essentially been won.

And this is at the core of what Open Access means for me. The debate has placed, or perhaps re-placed, right at the centre of the discussion of how we should do science, the importance of the quality of communication. It has re-stated the principle of placing the claims that you make, and the evidence that supports them, in the open for criticism by anyone with the expertise to judge, regardless of where they are based or who is funding them. And it has made crystal clear where the deficiencies in that communication process lie and exposed the creeping tendency of publication over the past few decades to become more an exercise in point scoring than communication. There remains much work to be done across a wide range of areas but the fact that we can now look at taking those challenges on is due in no small part to the work of those who have advocated Open Access from its difficult beginnings to today’s success. Open Access Day is a great achievement in its own right and it should be a celebration of the efforts of all those people who have contributed to making it possible as well as an opportunity to build for the future.

High quality communication, as I and others have said, and will continue to say, is Just Good Science. The success of Open Access has shown how one aspect of that communication process can be radically improved. The message to me is a simple one. Without open communication you simply can’t do the best science. Open Access to the published literature is simply one necessary condition of doing the best possible science.

A personal view of Open Science – Part II – Tools

The second installment of the paper (first part here) where I discuss building tools for Open (or indeed any) Science.

Tools for open science – building around the needs of scientists

It is the rapid expansion and development of tools that are loosely categorised under the banner of ‘Web2.0’ or the ‘read-write web’ that makes the sharing of research material possible. Many of the generic tools, particularly those that provide general document authoring capabilities, have been adopted and used by a wide range of researchers. Online office tools can enable collaborative development of papers and proposals without the need for emailing documents to multiple recipients and the resultant headaches over which version is which. Storing spreadsheets, databases, or data online means that collaborators have easy access to the most recent versions and can see how these are changing. More generally the use of RSS feed readers and bookmarking sites to share papers of interest and, to some extent, to distribute the task of triaging the literature is catching on in some communities. Microblogging platforms such as Twitter and aggregation and conversational tools such as FriendFeed have recently been used very effectively to provide coverage of conferences in progress, including collaborative note-taking. In combination with streamed or recorded video, as well as screencasts and sharing of presentations online, the idea of a distributed conference, while not an everyday reality, is becoming feasible.

However it is often the case that, while useful, generic web-based services do not provide the desired functionality or do not fit well into the existing workflows of researchers. Here there is the opportunity, and sometimes the necessity, to build specialised or adapted tools. Collaborative preparation of papers is a good example of this. Conventional web bookmarking services, such as del.icio.us, provide a great way of sharing the literature or resources that a paper builds on with other authors, but they do not automatically capture and recognise the metadata associated with published papers (journal, date, author, volume, page numbers). Specialised services such as CiteULike and Connotea have been developed to enable one-click bookmarking of published literature, and these have been used effectively by, for example, using a specific tag for references associated with a specific paper in progress. The problem with these services as they exist at the moment is that they don’t provide the crucial step for which scientists aggregate the references in the first place: the formatting of the references in the finalised paper. Indeed the lack of formatting functionality in GoogleDocs, the most widely used collaborative writing tool, means that in practice the finalised document is usually cut and pasted into Word and the references formatted using proprietary software such as Endnote. The available tools do not provide the required functionality.
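To make the missing step concrete, here is a minimal sketch of what ‘reference formatting from captured metadata’ amounts to. The field names and the citation style are illustrative assumptions, not the API or output of any particular bookmarking service:

```python
# Sketch: turn the structured metadata a bookmarking service captures
# (author, year, journal, volume, pages) into formatted references for
# a paper in progress, selected by a shared tag. Field names and the
# citation style are hypothetical.

def format_reference(entry):
    """Format one bookmarked paper as a simple journal-style citation."""
    authors = ", ".join(entry["authors"])
    return (f'{authors} ({entry["year"]}) {entry["title"]}. '
            f'{entry["journal"]} {entry["volume"]}:{entry["pages"]}.')

def bibliography(entries, tag):
    """Collect and format every entry carrying the tag for this paper."""
    tagged = [e for e in entries if tag in e["tags"]]
    return "\n".join(format_reference(e)
                     for e in sorted(tagged, key=lambda e: e["year"]))

entries = [
    {"authors": ["Eisen JA"], "year": 2008, "title": "PLoS Biology 2.0",
     "journal": "PLoS Biol", "volume": "6", "pages": "e48",
     "tags": ["oa-paper-2008"]},
]
print(bibliography(entries, "oa-paper-2008"))
```

The point is not that this logic is hard, but that it needs to live where the writing happens; until the formatting step is wired into the collaborative authoring tool, the workflow falls back to Word and Endnote.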

A number of groups and organisations have investigated the use of blogs and wikis as collaborative and shareable laboratory notebooks. However few of these systems offer good functionality ‘out of the box’. While there are many electronic laboratory notebook systems sold commercially, most are actually designed around securing data rather than sharing it, so are not of interest here. While the group of Jean-Claude Bradley has used the freely hosted Wikispaces as a laboratory notebook without further modification, much of the data and analysis is hosted on other services, including YouTube, Flickr, and GoogleDocs. The OpenWetWare group has made extensive modifications to the MediaWiki system to provide laboratory notebook functionality, whereas Garrett Lisi has adapted the TiddlyWiki framework as a way of presenting his notebook. The Chemtools collaboration at the University of Southampton has developed a specialised blog platform. Commercial offerings in the area of web-based lab notebooks are also starting to appear. All of these different systems have developed because of the specialised needs of recording the laboratory work of the scientists they were designed for. The different systems make different assumptions about where they fit in the workflow of the research scientist, and what that workflow looks like. They are all, however, built around the idea that they need to satisfy the needs of the user.

This creates a tension in tool building. General tools, which can be used across a range of disciplines, are extremely challenging to design, because workflows, and the perception of how they work, differ between disciplines. Specialist tools can be built for specific fields but often struggle to translate into new areas. Because the market in any one field is small, the natural desire of designers is to make tools as general as possible. However in the process of trying to build for a sufficiently general workflow it is often the case that applicability to specific workflows is lost. There is a strong argument based on this for building interoperable modules, rather than complete systems, which will allow domain specialists to stitch together specific solutions for specific fields or even specific experiments. Interoperability of systems, and the standards that enable it, is a criterion that is sometimes lost in the development process, but it is absolutely essential to making tools and processes shareable. Workflow management tools such as Taverna, Kepler, and VisTrails have an important role to play here.

While not yet at a stage where they are widely configurable by end users, the vision behind these tools has the potential both to make data analysis much more straightforward for experimental scientists and to solve many of the problems involved in sharing process, as opposed to data. The idea of visually wiring up online or local analysis tools to build data processing pipelines is compelling. The reason most experimental scientists use spreadsheets for data analysis is that they do not wish to learn programming languages. Providing visual programming tools, along with services that have clearly defined inputs and outputs, will make it possible for a much wider range of scientists to use more sophisticated and powerful analysis tools. What is more, the ability to share, version, and attribute workflows will go some significant distance towards solving the problem of sharing process. Services like MyExperiment, which provide an environment for sharing and versioning Taverna workflows, offer a natural way of sharing the details of exactly how a specific analysis is carried out. Along with an electronic notebook to record each specific use of a given workflow or analysis procedure (which can be achieved automatically through an API), the full details of the raw data, analysis procedure, and any specific parameters used can be recorded. This combination offers a potential route out of the serious problem of sharing research processes, if the appropriate support infrastructure can be built up.
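The core idea is simple enough to sketch: analysis steps with declared inputs and outputs, chained into a pipeline whose every run is logged alongside its parameters. Real systems such as Taverna or Kepler are far richer; the toy functions below are my own invented placeholders, shown only to illustrate why a recorded pipeline is inherently shareable:

```python
# Sketch of the workflow principle: each step is a function plus its
# parameters, and running the pipeline produces both the result and a
# machine-readable provenance record of exactly what was done.
import json

def normalise(values, scale=1.0):
    """Scale values so they sum to `scale` (a stand-in analysis step)."""
    total = sum(values)
    return [scale * v / total for v in values]

def threshold(values, cutoff=0.2):
    """Keep only values at or above the cutoff (another stand-in step)."""
    return [v for v in values if v >= cutoff]

def run_pipeline(data, steps):
    """Apply each (function, parameters) step in turn, logging provenance."""
    provenance = []
    for func, params in steps:
        data = func(data, **params)
        provenance.append({"step": func.__name__, "params": params})
    return data, provenance

result, log = run_pipeline([2, 3, 5], [(normalise, {"scale": 1.0}),
                                       (threshold, {"cutoff": 0.25})])
print(result)           # the processed data
print(json.dumps(log))  # the record of how it was produced
```

Because the provenance record is data, not prose, it can be versioned, attributed, and attached automatically to a notebook entry, which is exactly the combination argued for above.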

Also critical to successful sharing is a shared language or vocabulary. The development of ontologies, controlled vocabularies, and design standards is important in sharing knowledge and crucial to achieving the ultimate goal of making this knowledge machine readable. While there are divisions in the technical development and user communities over the development and use of controlled vocabularies, there is little disagreement over the fact that good vocabularies combined with good tools are useful. The disagreements tend to lie in how they are best developed, when they should be applied, and whether they are superior or complementary to other approaches such as text mining and social tagging. An integrated and mixed approach to the use of controlled vocabularies and standards is the most likely to be successful. In particular it is important to match the degree of structure in the description to the natural degree of structure in the object or objects being described. Highly structured and consistent data types, such as crystal structures and DNA sequences, can benefit greatly from highly structured descriptions, which are relatively straightforward to create and in many cases are the standard outputs of an analysis process. For large-scale experimental efforts the scale of the data and sample management problem makes an investment in detailed and structured descriptions worthwhile. In a small laboratory doing unique work, however, there may be a strong case for using local descriptions and vocabularies that are less rigorous but easier to apply and able to grow to fit the changing situation on the ground, ideally designed in such a way that mapping onto an external vocabulary is feasible if it is required or useful in the future.
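The ‘map later’ approach can be made concrete with a small sketch: a laboratory keeps its own informal tags, together with an explicit, growable mapping onto an external controlled vocabulary. The external term identifiers below are invented placeholders, not identifiers from any real ontology:

```python
# Sketch: local lightweight tags with a deferred mapping onto an external
# controlled vocabulary. Unmapped tags are flagged rather than lost, so a
# curator can extend the mapping as the vocabulary grows.

LOCAL_TO_EXTERNAL = {
    "pcr": "EXT:0001234",   # hypothetical external term identifiers
    "gel": "EXT:0005678",
}

def map_tags(local_tags, mapping=LOCAL_TO_EXTERNAL):
    """Translate local tags, keeping track of any with no mapping yet."""
    mapped, unmapped = [], []
    for tag in local_tags:
        if tag in mapping:
            mapped.append(mapping[tag])
        else:
            unmapped.append(tag)
    return mapped, unmapped

mapped, unmapped = map_tags(["pcr", "western-blot"])
# 'pcr' maps cleanly; 'western-blot' is flagged for later curation.
```

The design point is that the local vocabulary costs almost nothing to apply day to day, while the mapping table preserves the option of machine-readable interoperability when it becomes worth the effort.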

Making all of this work requires that researchers adopt these tools and that a community develops that is big enough to provide the added value that these tools might deliver. For a broad enough community to adopt these approaches the tools must fit well in their existing workflow and help to deliver the things that researchers are already motivated to produce. For most researchers, published papers are the measure of their career success and the basis of their reward structures. Therefore tools that make it easier to write papers, or that help researchers to write better papers, are likely to get traction. As the expectations of the quality and completeness of supporting data increase for published papers, tools that make it easier for the researcher to collate and curate the record of their research will become important. It is the process of linking the record of what happened in the laboratory, or study, to the first-pass interpretation and analysis of data, through further rounds of analysis until a completed version is submitted for review, that is currently poorly supported by available tools, and it is this need that will drive the development of improved tools. These tools will enable the disparate elements of the record of research, currently scattered between paper notebooks, various data files on multiple hard drives, and unconnected electronic documents, to be chained together. Once this record is primarily electronic, and probably stored online in a web-based system, the choice to make the record public at any stage, from the moment the record is made to the point of publication, will be available. The reason to link this to publication is to tie it into an existing workflow in the first instance. Once the idea is embedded the steps involved in making the record even more open are easily taken.

Part III covers social issues around Open Science.

The Southampton Open Science Workshop – a brief report

On Monday 1 September we had a one-day workshop in Southampton discussing the issues that surround ‘Open Science’. This was very free-form and informal and I had the explicit aim of getting a range of people with different perspectives into the room to discuss a wide range of issues, including tool development, the social and career structure issues, as well as ideas about standards and finally, what concrete actions could actually be taken. You can find live blogging and other commentary in the associated Friendfeed room and information on who attended as well as links to many of the presentations on the conference wiki.

Broadly speaking the day was divided into three chunks, the first was focussed on tools and services and included presentations on MyExperiment, Mendeley, Chemtools, and Inkspot Science. Branwen Hide of Research Information Network has written more on this part. Given that the room contained more than the usual suspects the conversation focussed on usability and interfaces rather than technical aspects although there was a fair bit of that as well.

The second portion of the day revolved more around social challenges and issues. Richard Grant presented his experience of blogging on an official university sanctioned site and the value of that for both outreach and education. One point he made was that the ‘lack of adoption problem’ seen in science just doesn’t seem to exist in the humanities. Perhaps this is because scientists don’t generally see ‘writing’ as a valuable thing in its own right. Certainly there is a preponderance of scientists who happen also to see themselves as writers on Nature Network.

Jennifer Rohn followed on from Richard, and objected to my characterising her presentation as “the skeptic’s view”. A more accurate characterisation would have been “I’d love to be open but at the moment I can’t: this is what has to change to make it work”. She presented a great summary of the problem, particularly from the biological scientist’s point of view, as well as potential solutions. Essentially the problem is that of the ‘Minimum Publishable Unit’ or research quantum, as well as what ‘counts’ as publication. Her main point was that for people to be prepared to publish material that falls short of a full paper they need to get some proportional credit for that. This folds closely into the discussion of what can be cited and what should be cited in particular contexts. I have used the phrase ‘data sized peg into a paper shaped hole’ to describe this in the past.

After lunch Liz Lyon from UKOLN talked about curation and long-term archival storage, which led into an interesting discussion about the archiving of blogs and other material. Is it worth keeping? One answer to this was to look at the real interest today in diaries from the second world war and earlier from ‘normal people’. You don’t necessarily need to be a great scientist, or even a great blogger, for the material to be of potential interest to historians in 50–100 years’ time. But doing this properly is hard – in the same way that maintaining and indexing data is hard. Disparate sites, file formats, places of storage, and in the end whose blog is it actually? Particularly if you are blogging for, or recording work done at, a research institution.

The final session was about standards or ‘brands’. Yaroslav Nikolaev talked about semantic representations of experiments. While important, it was probably a shame that this came at the end of the day, because it would have been helpful to get more of the non-techie people into that discussion, both to iron out the communication issues around the semantic web and to describe the real potential benefits. This remains a serious gap – the experimental scientists who could really use semantic tools don’t really get the point, and the people developing the tools don’t communicate well what the benefits are, or in some cases (not all, I hasten to add!) don’t actually build the tools the experimentalists want.

I talked about the possibility of a ‘certificate’ or standard for Open Science, and the idea of an organisation to police this. It would be safe to say that, while people agreed that clear definitions would be helpful, the enthusiasm level for a standards organisation was pretty much zero. There are more fundamental issues of actually building up enough examples of good practice, and working towards identifying best practice in open science, that need to be dealt with before we can really talk about standards.

On the other hand the idea of ‘the fully supported’ paper got immediate and enthusiastic support. The idea here is deceptively simple, and has been discussed elsewhere; simply that all the relevant supporting information for a paper (data, detailed methodology, software tools, parameters, database versions etc. as well as access to required materials at reasonable cost) should be available for any published paper. The challenge here lies in actually recording experiments in such a way that this information can be provided. But if all of the record is available in this form then it can be made available whenever the researcher chooses. Thus by providing the tools that enable the fully supported paper you are also providing tools that enable open science.

Finally we discussed what we could actually do: Jean-Claude Bradley discussed the idea of an Open Notebook Science challenge to raise the profile of ONS (this is now set up – more on this to follow). Essentially it is a competition-type approach where individuals or groups can contribute to a larger scientific problem by collecting data – where the teams get judged on how well they describe what they have done and how quickly they make it available.

The most specific action proposed was to draft a ‘Letter to Nature’ proposing the idea of the fully supported paper as a submission standard. The idea would be to get a large number of high profile signatories on a document which describes a concrete step-by-step plan to work towards the final goal, and to send that as correspondence to a high profile journal. I have been having some discussions about how to frame such a document and hope to be getting a draft up for discussion reasonably soon.

Overall there was much enthusiasm for things Open and a sense that many elements of the puzzle are falling into place. What is missing is effective coordinated action, communication across the whole community of interested and sympathetic scientists, and critically the high profile success stories that will start to shift opinion. These ought to, in my opinion, be the targets for the next 6–12 months.

A personal view of Open Science – Part I

For the Open Science workshop at the Pacific Symposium on Biocomputing I wrote a very long essay as an introductory paper. It turned out that this was far too long for the space available so an extremely shortened version was submitted for the symposium proceedings. I thought I would post the full length essay in installments here as a prelude to cleaning it up and submitting to an appropriate journal.

Introduction

Openness is arguably the great strength of the scientific method. At its core is the principle that claims and the data that support them are placed before the community for examination and critique. Through open examination and critical analysis models can be refined, improved, or rejected. Conflicting data can be compared and the underlying experiments and methodology investigated to identify which, if any, is more reliable. While individuals may not always adhere to the highest standards, the community mechanisms of review, criticism, and integration have proved effective in developing coherent and useful models of the physical world around us. As Lee Smolin of the Perimeter Institute for Theoretical Physics recently put it, “we argue in good faith from shared evidence to shared conclusions”[1]. It is an open approach that drives science towards an understanding which, while never perfect, nevertheless enables the development of sophisticated technologies with practical applications.

The Internet and the World Wide Web provide the technical ability to share a much wider range of both the evidence and the argument and conclusions that drive modern research. Data, methodology, and interpretation can also be made available online at lower costs and with lower barriers to access than has traditionally been the case. Along with the ability to share and distribute traditional scientific literature, these new technologies also offer the potential for new approaches. Wikis and blogs enable geographically and temporally widespread collaborations, the traditional journal club can now span continents with online bookmarking tools such as Connotea and CiteULike, and the smallest details of what is happening in a laboratory (or on Mars [2]) can be shared via instant messaging applications such as Twitter.

The potential of online tools to revolutionise scientific communication and their ability to open up the details of the scientific enterprise so that a wider range of people can participate is clear. In practice, however, the reality has fallen far behind the potential. This is in part due to a need for tools that are specifically designed with scientific workflows in mind, partly due to the inertia of infrastructure providers with pre-Internet business models such as the traditional “subscriber pays” print literature and, to some extent, research funders. However it is predominantly due to cultural and social barriers within the scientific community. The prevailing culture of academic scientific research is one of possession – where control over data, methodological secrets, and exploitation of results are paramount. The tradition of Mertonian Science has receded, in some cases, so far that principled attempts to reframe an ethical view of modern science can seem charmingly naive.

It is in the context of these challenges that the movement advocating more openness in science must be seen. There will always be places where complete openness is not appropriate, such as where personal patient records may be identifiable, where research is likely to lead to patentable (and patent-worthy) results, or where the safety or privacy of environments, study subjects, or researchers might be compromised. These, however, are special instances for which exceptional cases can be made, and not the general case across the whole of the global research effort. Significant steps forward such as funder and institutional pre-print deposition mandates and the adoption of data sharing policies by UK Research Councils must be balanced against the legal and legislative attempts to overturn the NIH mandate and widespread confusion over what standards of data sharing are actually required and how they will be judged and enforced. Nonetheless there is a growing community interested in adopting more open practices in their research, and increasingly this community is developing as a strong voice in discussions of science policy, funding, and publication. The aim of this workshop is to strengthen this voice by focusing the attention of the community on areas requiring technical development, the development and implementation of standards, both technical and social, and identification and celebration of success.

Why we need open science – Open Access publication, Open Data, and Open Process

The case for taxpayer access to the taxpayer funded peer reviewed literature was made personally and directly in Jonathan Eisen’s first editorial for PLoS Biology [3].

[…describing the submission of a paper to PLoS Biology as an ‘experiment’…] But then, while finalizing the paper, a two-month-long medical nightmare ensued that eventually ended in the stillbirth of my first child. While my wife and I struggled with medical mistakes and negligence, we felt the need to take charge and figure out for ourselves what the right medical care should be. And this is when I experienced the horror of closed-access publishing. For unlike my colleagues at major research universities that have subscriptions to all journals, I worked at a 300-person nonprofit research institute with a small library. So there I was—a scientist and a taxpayer—desperate to read the results of work that I helped pay for and work that might give me more knowledge than possessed by our doctors. And yet either I could not get the papers or I had to pay to read them without knowing if they would be helpful. After we lost our son, I vowed to never publish in non-OA journals if I was in control. […]

Eisen JA (2008) PLoS Biology 2.0. PLoS Biol 6(2): e48 doi:10.1371/journal.pbio.0060048

As a scientist in a small institution he was unable to access the general medical literature. More generally, as a US taxpayer he was unable to access the outputs of US government funded research or indeed of research funded by the governments of other countries. The general case for enabling access for the general public, for scientists in less well-funded institutions, and for those in the developing world has been accepted by most in principle. While US publishers continue to take action to limit the effect of the NIH mandate, a wide range of research institutions have adopted deposition mandates. There remains much discussion about routes to open access, with the debate over ‘Green’ and ‘Gold’ routes continuing, as well as an energetic ongoing debate about the stability and viability of the business models of various open access journals. However it seems unlikely that the gradual increase in the number and impact of open access journals will slow or stop soon. The principle that the scientific literature should be available to all has been won. The question of how best to achieve that remains a matter of debate.

A similar case to that for access to the published literature can also be made for research data. At the extremes, withholding data could lead to preventable deaths or severely reduced quality of life for patients. Andrew Vickers, in a hard-hitting New York Times essay [4], dissected the reasons that medical scientists give for not making data from clinical cancer trials available; data that could, in aggregate, provide valuable insights into enhancing patient survival time and quality of life. He quotes work by John Kirwan (Bristol University) showing that three quarters of researchers in one survey opposed sharing data from clinical trials. While there may be specific reasons for retaining specific types of data from clinical trials, particularly in small specialised cases where maintaining the privacy of participants is difficult or impossible, it seems unarguable that the interests of patients and the public demand that such data be available for re-use and analysis. This is particularly the case where the taxpayer has funded these trials, but even for other funders, including industrial funders, there is a public-interest argument for making clinical trial data available.

In other fields the case for data sharing may seem less clear cut. There is little obvious harm done to the general public by not making the details of research available. The argument, while more subtle, is nonetheless similar to that for clinical data. There, the argument is that reanalysis and aggregation can lead to new insights with an impact on patient care. In non-clinical sciences the same aggregation and re-analysis leads to new insights, more effective analysis, and indeed new types of analysis. The massive expansion in the scale and ambition of the biological sciences over the past twenty years is largely due to the availability of biological sequence, structural, and functional data in international, freely available archives. Indeed, the entire field of bioinformatics is predicated on the availability of this data. There is a strong argument to be made that the failure of the chemical sciences to achieve a similar revolution is due to the lack of such publicly available data. Bioinformatics is a highly active and widely practised field of science. By comparison, chemoinformatics is marginalised and, what is most galling to those who care about the future of chemistry, primarily driven by the needs and desires of biological scientists. Chemists for the most part haven’t grasped the need, because the availability of data is not part of their culture.

High energy particle physics, by contrast, is necessarily based on a community effort; without strong collaboration, communication, and formalised sharing of the details of ongoing work, the research simply would not happen. Astronomy, genome sequencing, and protein crystallography are other fields with a strong history, and in some cases formalised standards, of data sharing. While there are anecdotal cases of ‘cheating’ or bending the rules, usually to prevent or restrict the re-use of data, the overall impact of data sharing in these areas is generally seen as positive, leading to better science, higher data quality standards, and higher standards of data description. Again, to paraphrase Smolin [1], where the discussion proceeds from a shared set of evidence we are more likely to reach a valid conclusion. This is simply about doing better science by improving the evidence base.

The final piece of the puzzle, and in many ways the most socially and technically challenging, is the sharing of research procedures. Data has no value in and of itself unless the process used to generate it is appropriate and reliable. Disputes over the validity of claims are rarely based on the data themselves but on the procedures used either to collect them or to process and analyse them. A widely reported recent case turned on the details of how a protein was purified; whether a step or a gradual gradient elution was used. This detail of procedure led different laboratories to differing results, a year of wasted time for one researcher, and ultimately the retraction of several high profile papers [refs – nature feature, retractions, original paper etc]. Experimental scientists generally imagine that the computational sciences, where a much higher level of reproducibility and the ready availability of code and Subversion repositories make sharing and documenting material relatively straightforward, would have much higher standards. However, a recent paper [5] by Ted Pedersen (University of Minnesota, Duluth) – with the wonderful title ‘Empiricism is not a matter of faith’ – criticised the standards of both code documentation and availability. He makes the case that working on the assumption that you will make your tools available to others not only leads you to develop better tools, and makes you popular in the community, but also improves the quality of your own work.

And this really is the crux of the matter. If the central principle of the scientific method is open analysis and criticism of claims, then making the data, processes, and conclusions available and accessible is just doing good science. While we may argue about the timing of release, about ‘how raw’ available data needs to be, or about the file formats and ontologies used to describe it, there can be no argument that if the scientific record is to have value it must rest on an accessible body of relevant evidence. Scientists were doing mashups long before the term was invented: mixing data from more than one source and reprocessing it to provide a different view. The potential of online tools to help do this better is massive, but the utility of these tools depends on the sharing of data, workflows, ideas, and opinions.

There are broadly three areas of development required to enable the more widespread adoption of open practice by research scientists. The first is the development of tools that are designed for scientists. While many general purpose tools and services have been adopted by researchers, there are many cases where specialised design or adaptation is required for the specific needs of a research environment. In some cases the needs of research will push development in specific areas, such as controlled vocabularies, beyond what is being done in the mainstream. The second, and most important, area involves the social and cultural barriers within various research communities. These vary widely in type and importance across different fields, and understanding and overcoming the fears, as well as challenging entrenched interests, will be an important part of the open science programme. Finally, there is value in, and a need for, top-down guidance in the form of policies and standards. The vagueness of the term ‘Open Science’ means that while it is a good banner there is potential for confusion. Standards, policies, and brands can provide clarity for researchers, a clear articulation of aspirations (and a guide to the technical steps required to achieve them), and the support required to help people actually make this happen in their own research.

Part II will cover the issues around tools for Open Science

References

  1. Smolin L (2008), Science as an ethical community, PIRSA ID#08090035, http://pirsa.org/08090035/
  2. Mars Phoenix on Twitter, http://twitter.com/MarsPhoenix
  3. Eisen JA (2008) PLoS Biology 2.0. PLoS Biol 6(2): e48 doi:10.1371/journal.pbio.0060048
  4. Vickers A (2008), http://www.nytimes.com/2008/01/22/health/views/22essa.html?_r=1
  5. Pedersen T (2008), ‘Empiricism is not a matter of faith’, Computational Linguistics, 34(3), pp. 465–470. Self-archived.

Q&A in this week’s Nature – one or two (minor) clarifications

So a bit of a first for me. I can vaguely claim to have contributed two things to the print version of Nature this week. Strictly speaking my involvement in the first, the ‘From the Blogosphere’ piece on the Science Blogging Challenge, was really restricted to discussing the idea (originally from Richard Grant I believe) and now a bit of cheerleading and ultimately some judging. The second item, though, I can claim some credit for, in as much as it is a Q&A with myself and Jean-Claude Bradley that was done when we visited Nature Publishing Group in London a few weeks back.

It is great that a journal like Nature views the ideas of data publication, open notebook science, and open science in general as worthy of featuring. This is not an isolated instance either: we can point to the good work of the Web Publishing Group in developing useful resources such as Nature Precedings, as well as previous features in the print edition such as the Horizons article (there is also another version on Nature Precedings) written by Peter Murray-Rust. One thing I have heard said many times in recent months is that while people who advocate open science may not agree with everything NPG does with respect to copyright and access, they are impressed and encouraged by the degree of engagement that NPG maintains with the community.

I did, however, just want to clarify one or two of the things I apparently said. I am not claiming that I didn’t say them – the interview was recorded after all – just that on paper they don’t quite match what I think I meant to say. Quoting from the article:

CN-Most publishers regard what we do as the equivalent of presenting at a conference, or a preprint. That hasn’t been tested across a wide range of publishers, and there’s at least one — the American Chemical Society — that doesn’t allow prepublication in any form whatsoever.

That sounds a little more extreme than what I meant to say – there are a number of publishers that don’t allow submission of material that has appeared online as a preprint, and the ACS has said that it regards online publication as equivalent to a preprint. I don’t have any particular sympathy for the ACS, but I think they probably do allow publication of material that was presented at ACS conferences.

CN-Open notebooks are practical but tough at the moment. My feeling is that the tools are not yet easy enough to use. But I would say that a larger proportion of people will be publishing electronically and openly in ten years.

Here I think what I said is too conservative on one point and possibly not conservative enough on the other. I did stick my neck out and say that I think the majority of scientists will be using electronic lab notebooks of one sort or another in ten years. Funder data sharing policies will drive a much greater volume of material online post-publication (hopefully with higher quality description), and this may come to account for the majority of all research data. I also think that more people will be making more material available openly as it is produced, but I doubt that this will be a majority of people in ten years – I hope for a sizeable and significant minority, and that is what we will continue to work towards.

Linking up open science online

I am currently sitting at the dining table of Peter Murray-Rust, with Egon Willighagen opposite me talking to Jean-Claude Bradley. We are pulling together sets of data from Jean-Claude’s UsefulChem project into CML to make it more semantically rich and do a bunch of cool stuff. Jean-Claude has a recently published preprint on Nature Precedings of a paper that has been submitted to JoVE. Egon was able to grab the InChIKeys from the relevant UsefulChem pages and, passing those to the CDK via a script that he wrote on the spot (which he has also just blogged), generated CML React for those molecules.

Peter, at the same time, has cut and pasted an existing example of a CML React XML document into a GoogleDoc, which we will then modify to represent one of the Ugi reactions that Jean-Claude reported in the Precedings paper. You will be able to see all of these documents. The only way we could do this is with four laptops all online – and with all the relevant documents and services available from where we are sitting. I’ve never been involved in a hackfest like this before, but the ability to handle different aspects of the same document via GoogleDocs is a very powerful way of handling multiple processes at the same time.

How I got into open science – a tale of opportunism and serendipity

So Michael Nielsen, one morning at breakfast at Scifoo, asked one of those questions that never has a short answer: ‘So how did you get into this open science thing?’ And I realised that although I have told the story to many people I have never written it down. Perhaps this is a meme worth exploring more generally, but I thought others might be interested in my story, partly because it illustrates how funding drives scientists, and partly because it shows how opportunism and serendipity can make successful bedfellows.

In late 2004 I was spending a lot of my time on the management of a large collaborative research project and had had a run of my own grant proposals rejected. I had a student interested in doing a PhD but no direct access to funds to support the consumables cost of the proposed project. Jeremy Frey had been on at me for a while to look at implementing the electronic lab notebook system whose development he had led, and at the critical moment he pointed out to me a special call from the BBSRC for small projects to prototype, develop, or implement e-science technologies in the biological sciences. It was a light touch review process and a relatively short application. More to the point, it was a way of funding some consumables.

So the grant was written. I wrote the majority of it, which makes for somewhat interesting reading in retrospect. I didn’t really know what I was talking about at the time (which seems to be a theme with my successful grants). The original plan was to use the existing, fully semantic, RDF-backed electronic lab notebook and develop models for its use in a standard biochemistry lab. We would then develop systems to extract a relational database from the RDF representation and present this on the web.

The grant was successful, but the start was delayed due to shenanigans over the studentship that was going to support the project, and the movement of part of the large collaborative project to another institution with one of the investigators. Partly as a result of the ensuing mess I applied for the job I ultimately accepted at RAL, and after some negotiation organised an 80:20 split between RAL and Southampton.

By the time we had a student in place and had got the grant started, it was clear that the existing semantic ELN was not in a state that would enable us to implement new models for our experiments. However, by this stage there was a blog system that had been developed in Jeremy’s group, and we thought it would be an interesting experiment to use this as a notebook. This would be almost the precise opposite of the RDF-backed ELN. Looking back now, I would describe it as taking the opportunity to explore a Web 2.0 approach to the notebook, as compared to a Web 3.0 approach, but bear in mind that at the time I had little or no idea of what these terms meant, let alone the care with which they need to be used.

The blog based system was great for me as it meant I could follow the student’s work online, and in doing so I gradually became aware of blogs in general and of feed readers. The RSS feed of the LaBLog was a great help, as it made following the details of experiments remotely straightforward. This was important, as by now I was spending three or four days a week at RAL while the student was based in Southampton. As we started to use the blog, at first in a very naïve way, we found problems and issues which ultimately led us to think about and design the organisational approach I have written about elsewhere [1, 2]. By this stage I had started to look at other services online and was playing around with OpenWetWare and a few other services, becoming vaguely aware of Creative Commons licences and getting a grip on the Web 2.0 versus Web 3.0 debate.

To implement our newly designed approach to organising the LaBLog, we decided the student would start afresh with a clean slate in a new blog. By this stage I was using the blog for other things too, and had discovered that the ID authentication we were using didn’t always work through the RAL firewall. I ended up with complicated VPN setups, particularly when working from home, where I couldn’t log on to the blog and have my email online at the same time. This, obviously, was a pain, and as we were moving to a new blog, which could have new security settings, I said, ‘stuff it, let’s just make it completely visible and be done with it’.

So there you go. The critical decision to move to Open Notebook status was taken as the result of a firewall. Serendipity, or at least the effect of outside pressures, was what made it happen. I would like to say it was a carefully thought out philosophical decision but, although my awareness of the open access movement, Creative Commons, OpenWetWare, and others no doubt prepared the ground that led me to think down that route, it was essentially the result of frustration.

So, so far, opportunism and serendipity, which brings us back to opportunism again, or at least to seizing an opportunity. Having made the decision to ‘go open’, two things clicked in my mind. Firstly, that this was rather radical. Secondly, that all of these Web 2.0 tools, combined with an open approach, could lead to a marked improvement in the efficiency of collaborative science, a kind of ‘Science 2.0’ [yes, I know, don’t laugh, this would have been around March 2007]. Here was an opportunity to get my name on a really novel and revolutionary concept! A quick Google search revealed that, funnily enough, I wasn’t the first person to think of this (yes! I’d been scooped!), but more importantly it led me to what I think ought to be three of the Standard Works of Open Science: Bill Hooker’s three part series on Open Science at 3 Quarks Daily [1, 2, 3], Jean-Claude Bradley’s presentation on Open Notebook Science at Nature Precedings (and the associated original blog post coining the term), and Deepak Singh’s talk on Open Science at Ignite Seattle. From there I was inspired to seize the opportunity, get a blog of my own, and get involved. The rest of my story, so far, is more or less available online here and via the usual sources.

Which leads me to ask: what got you involved in the ‘open’ movement? What, for you, were the ‘primary texts’ of open science and open research? There is value in recording this, or at least our memories of it, for ourselves, to learn from our mistakes and perhaps discern the direction going forward. Perhaps it isn’t even too self-serving to think of it as history in the making. Or perhaps, more in line with our own aims as ‘open scientists’, we would be doing a poor job if we didn’t record what brought us to where we are and what is influencing our thinking going forward. I think the blogosphere does a pretty good job of the latter, but perhaps a little more recording of the former would be helpful.

Re-inventing the wheel (again) – what the open science movement can learn from the history of the PDB

One of the many great pleasures of Scifoo was meeting people who had a different, and in many cases much more comprehensive, view of managing data and making it available. One of the long term champions of data availability is Professor Helen Berman, the head of the Protein Data Bank (the international repository for biomacromolecular structures), and I had the opportunity to speak with her for some time on the Friday afternoon before Scifoo kicked off in earnest. (In fact this was one of many somewhat embarrassing situations where I would carefully explain my background in my very best ‘speaking to non-experts’ voice, only to find that they knew far more about it than I did – though Jim Hardy of Gahaga Biosciences takes the gold medal for this event, for turning to the guy called Larry next to him at dinner at Google headquarters and asking what line of work he was in.)

I have written before about how the world might look if the PDB and other biological databases had never existed, but as I said then, I didn’t know much of the real history. One of the things I hadn’t realised was how long after the PDB was founded it took before deposition became expected for all atomic resolution biomacromolecular structures. The road from a repository of seven structures, with a handful of new submissions a year, to today’s standards, under which any structure published in a reputable journal must be deposited, was a long and rocky one. The requirement to deposit structures on publication only became general in the early 1990s, nearly twenty years after the PDB was founded, and there was a very long and extended process during which the case for making the data available was only gradually accepted by the community.

Helen made the point strongly that it had taken 37 years to get the PDB to where it is today: a gold standard, international, and publicly available repository of a specific form of research data, supported by a strong set of community accepted, and enforced, rules and conventions. We don’t want to take another 37 years to achieve the widespread adoption of high standards in data availability and open practice in research more generally. So it is imperative that we learn the lessons and benefit from the experience of those who built the existing repositories. We need to understand where things went wrong and attempt to avoid repeating mistakes. We need to understand what worked well and use this to our advantage. We also need to recognise where the technological, social, and political environment we find ourselves in today means that things have changed, and perhaps to recognise that in many ways, particularly in the way people behave, things haven’t changed at all.

I’ve written this in a hurry and therefore haven’t searched as thoroughly as I might, but I was unable to find any obvious ‘history of the PDB’ online. I imagine there must be some out there, but they are not immediately accessible. The Open Science movement could benefit from such documents being made available – indeed we could benefit from making them required reading. While at Scifoo, Michael Nielsen suggested the idea of a panel of the great and the good: those who would back the principles of data availability, open access publication, and the free transfer of materials. Such a panel would be great from the perspective of publicity, but as an advisory group it could have an even greater benefit by providing the opportunity to draw on the experience many of these people have in actually doing what we talk about.

Southampton Open Science Workshop 31 August and 1 September

An update on the workshop that I announced previously. We have a number of people confirmed to come down and I need to start firming up numbers. I will be emailing a few people over the weekend, so apologies if you get this via more than one route. The plan of attack remains as follows:

Meet on the evening of Sunday 31 August in Southampton, most likely at a bar/restaurant near the University, to coordinate and organise the details of sessions.

Commence on Monday at ~9:30 and finish around 4:30pm (with the option of discussion going into the evening), with three or four sessions over the course of the day, broadly divided into the areas of tools, social issues, and policy. We have people interested and expert in all of these areas coming, so we should be able to have a good discussion. The object is to keep it very informal but to keep the discussion productive. Numbers are likely to be around 15–20 people. For those not lucky enough to be in the area we will aim to record and stream the sessions, probably using a combination of dimdim, mogulus, and slideshare. Some of these may require you to be signed into our session, so if you are interested drop me a line at the account below.

To register for the meeting please send an email to my gmail account (cameronneylon). To avoid any potential confusion, even if you have emailed me in the past week or so about this, please email again so that I have a comprehensive list in one place. I will get back to you with a request via PayPal for £15 to cover coffees and lunch for the day (so if there is a PayPal account you want to use, please send the email from that address). If the cost is a problem please say so in your email and we will see what we can do. We can suggest options for accommodation but will ask you to sort it out for yourself.

I have set up a wiki to discuss the workshop, which is currently completely open access. If I see spam or hacking problems I will close it down to members only (so it would be helpful if you could create an account), but hopefully it can last a few weeks in the open form. Please add your name and any relevant details you are happy to give out to the Attendees page, and add any presentations or demos you would be interested in giving, or in hearing about, on the Programme suggestion page.

Notes from Scifoo

I am too tired to write anything even vaguely coherent. As will have been obvious, there was little opportunity for microblogging; I managed to take no video at all, and not even any pictures. It was non-stop, at a level of intensity that I have very rarely encountered before. The combination of breadth and sharpness that many of the participants brought was, to be frank, pretty intimidating, but their willingness to engage and discuss, and my realisation that, at least in very specific areas, I can hold my own, made the whole process very exciting. I have many new ideas, have been challenged to my core about what I do and how, and in many ways I am emboldened about what we can achieve in the area of open data and open notebooks. Here are just some thoughts that I will try to collect into posts over the next few days.

  • We need to stop fretting about what should be counted as ‘academic credit’. In another two years there will be another medium, another means of communication, and by then I will probably be conservative enough to dismiss it. Instead of simply assuming that diversifying the sources of credit is a good thing, we should ask what we want to achieve. If we believe that we need a more diverse group of people in academia, then that is what we should articulate – courtesy of a discussion with Michael Eisen and Sean Eddy.
  • ‘Open Science’ is a term so vague as to be actively dangerous (we already knew that). We need a clear articulation of principles, or a charter: a set of standards that are clear and practical in the current climate. As these will be lowest common denominator standards at the beginning, we need a mechanism that enables or encourages a process of incrementally raising them. The Electronic Geophysical Year declaration is a good working model for this – courtesy of a session led by Peter Fox.
  • The social and personal barriers to sharing data can be codified and made sense of (and this has been done). We can use this understanding to frame structures that will make more data available – session led by Christine Borgman.
  • The Open Science movement needs to harness the experience gained in developing the open data repositories that we now take for granted. The PDB took decades of continuous work to bring to its current state, and much of it was a hard slog. We don’t want to take that much time this time round – courtesy of a discussion led by Helen Berman.
  • Data integration is tough, but it is not helped by the fact that bench biologists don’t get ontologies, and that ontologists and their proponents don’t really get what the biologists are asking for. I know I have an agenda on this, but social tagging can be mapped after the fact onto structured data (as demonstrated to me by Ben Good). If we get the keys right then much else will follow.
  • Don’t schedule a session at the same time as Martin Rees does one of his (apart from anything else, you miss what was apparently a fabulous presentation).
  • Prosthetic limbs haven’t changed in 100 years and they suck. Might an open source approach to building a platform be the answer? – discussion with Jon Kuniholm, founder of the Open Prosthetics Project.
  • The platform for Open Science is very close, and some of the key elements are falling into place. In many ways this is no longer a technical problem.
  • The financial system backing academic research is broken when the cost of reproducing or refuting specific claims rises to 10–20 times that of the original work. Open Notebook Science is a route to reducing this cost – discussion with Jamie Heywood.
  • Chris Anderson isn’t entirely wrong – but he likes being provocative in his articles.
  • Google runs a fantastically slick operation, down to the fact that the chocolate coated oatmeal biscuit ice cream sandwiches are specially ordered in, made with proper sugar instead of high-fructose corn syrup.

Enough. Time to sleep.