A breakthrough on data licensing for public science?

I spent two days this week visiting Peter Murray-Rust and others at the Unilever Centre for Molecular Informatics at Cambridge. There was a lot of useful discussion and I learned an awful lot that requires more thinking and will no doubt result in further posts. In this one I want to relay a conversation we had over lunch with Peter, Jim Downing, Nico Adams, Nick Day and Rufus Pollock that seemed extremely productive. It should be noted that what follows is my recollection, so it may not be entirely accurate and shouldn’t necessarily be taken to represent other people’s views.

The appropriate way to license published scientific data is an argument that has now been rolling on for some time. Broadly speaking the argument has devolved into two camps. Firstly those who have a belief in the value of share-alike or copyleft provisions of GPL and similar licenses. Many of these people come from an Open Source Software or Open Content background. The primary concern of this group is spreading the message and use of Open Content and to prevent “freeloaders” from being able to use Open material and not contribute back to the open community. A presumption in this view is that a license is a good, or at least acceptable, way of achieving both these goals. Also included here are those who think that it is important to allow people the freedom to address their concerns through copyleft approaches. I think it is fair to characterize Rufus as falling into this latter group.

On the other side are those, including myself, who are concerned more centrally with enabling re-use and re-purposing of data as far as is possible. Most of us are scientists of one sort or another and not programmers per se. We don’t tend to be concerned about freeloading (or in some cases welcome it as effective re-use). Another common characteristic is that we have been prevented from being able to make our own content as free as we would like due to copyleft provisions. I prefer to make all my content CC-BY (or cc0 where possible). I am frequently limited in my ability to do this by the wish to incorporate CC-BY-SA or GFDL material. We are deeply worried by the potential for licensing to make it harder to re-use and re-mix disparate sets of data and content into new digital objects. There is a sense amongst this group that “data is different” to other types of content, particularly in its diversity of types and re-uses. More generally there is the concern that anything that “smells of lawyers”, like something called a “license”, will have scientists running screaming in the opposite direction as they try to avoid any contact with their local administration and legal teams.

What I think was productive about the discussion on Tuesday is that we focused on what we could agree on with the aim of seeing whether it was possible to find a common position statement on the limited area of best practice for the publication of data that arises from public science. I believe such a statement is important because there is a window of opportunity to influence funder positions. Many funders are adopting data sharing policies but most refer to “following best practice” and that best practice is thin on the ground in most areas. With funders wielding the ultimate potential stick there is great potential to bootstrap good practice by providing clear guidance and tools to make it easy for researchers to deliver on their obligations. Funders in turn will likely adopt this best practice as policy if it is widely accepted by their research communities.

So we agreed on the following (I think – anyone should feel free to correct me of course!):

  1. A simple statement is required along the lines of “best practice in data publishing is to apply protocol X”. Not a broad selection of licenses with different effects, not a complex statement about what the options are, but “best practice is X”.
  2. The purpose of publishing public scientific data and collections of data, whether in the form of a paper, a patent, data publication, or deposition to a database, is to enable re-use and re-purposing of that data. Non-commercial terms prevent this in an unpredictable and unhelpful way. Share-alike and copyleft provisions have the potential to do the same under some circumstances.
  3. The scientific research community is governed by strong community norms, particularly with respect to attribution. If we could successfully expand these to include share-alike approaches as a community expectation that would obviate many concerns that people attempt to address via licensing.
  4. Explicit statements of the status of data are required and we need effective technical and legal infrastructure to make this easy for researchers.

So in aggregate I think we agreed a statement similar to the following:

“Where a decision has been taken to publish data deriving from public science research, best practice to enable the re-use and re-purposing of that data is to place it explicitly in the public domain via {one of a small set of protocols e.g. cc0 or PDDL}.”

The advantage of this statement is that it focuses purely on what should be done once a decision to publish has been made, leaving the issue of what should be published to a separate policy statement. This also sidesteps issues of which data should not be made public. It focuses on data generated by public science, narrowing the field to the space in which there is a moral obligation to make such data available to the public that fund it. By describing this as best practice it also allows deviations that may, for whatever reason, be justified by specific people in specific circumstances. Ultimately the community, referees, and funders will be the judge of those justifications. The BBSRC data sharing policy states for instance:

BBSRC expects research data generated as a result of BBSRC support to be made available…no later than the release through publication…in-line with established best practice in the field [CN – my emphasis]…

The key point for me that came out of the discussion is perhaps that we can’t and won’t agree on a general solution for data but that we can articulate best practice in specific domains. I think we have agreed that for the specific domain of published data from public science there is a way forward. If this is the case then it is a very useful step forward.

Best practice for data availability – the debate starts…well over there really

The issue of licensing arrangements and best practice for making data available has been brewing for some time but has just recently come to a head. John Wilbanks and Science Commons have a reasonably well established line that they have been developing for some time. Michael Nielsen has a recent blog post and Rufus Pollock, of the Open Knowledge Foundation, has also just synthesised his thoughts in response into a blog essay. I highly recommend reading John’s article on licensing at Nature Precedings, Michael’s blog post, and Rufus’ essay before proceeding. Another important document is the discussion of the license that Victoria Stodden is working to develop. Actually if you’ve read them go and read them again anyway – it will refresh the argument.

To crudely summarize, Rufus makes a cogent argument for the use of explicit licenses applied to collections of data, and feels that share-alike provisions in licenses or otherwise do not cause major problems and that the benefit that arises from enforcing re-use outweighs the problem. John’s position is that it is far better for standards to be applied through social pressure (“community norms”) rather than licensing arrangements. He also believes that share-alike provisions are bad because they break interoperability between different types of objects and domains. One point that I think is very important and (I think) is a point of agreement is that some form of license or dedication to the public domain will be crucial to developing best practice. Even if the final outcome of debate is that everything will go in the public domain it should be part of best practice to make that explicit.

Broadly speaking I belong to John’s camp but I don’t want to argue that case with this post. What is important in my view is that the debate takes place and that we are clear about what the aims of that debate are. What is it we are trying to achieve in the process of coming to (hopefully) some consensus of what best practice should look like?

It is important to remember that anyone can assert a license (or lack thereof) on any object that they (assert they) own or have rights over. We will never be able to impose a specific protocol on all researchers, all funders. Therefore what we are looking for is not the perfect arrangement but a balance between what is desired, what can be practically achieved, and what is politically feasible. We do need a coherent consensus view that can be presented to research communities and research funders. That is why the debate is important. We also need something that works, and is extensible into the future, where it will stand up to the development of new types of research, new types of data, new ways of making that data available, and perhaps new types of researchers altogether.

I think we agree that the minimal aim is to enable, encourage, and protect into the future the ability to re-use and re-purpose the publicly published products of publicly funded research. Arguments about personal or commercial work are much harder and much more subtle. Restricting the argument to publicly funded researchers makes it possible to open a discussion with a defined number of funders who have a public service and public engagement agenda. It also makes the moral arguments much clearer.

In focussing on research that is being made public we short circuit the contentious issue of timing. The right, or the responsibility, to commercially exploit research outputs and the limitations this can place on data availability is a complex and difficult area and one in which agreement is unlikely any time soon. I would also avoid the word “Open”. This is becoming a badly overloaded term with both political and emotional overtones, positive and negative. Focussing on what should happen after the decision has been made to go public reduces the argument to “what is best practice for making research outputs available”. The question of when to make them available can then be kept separate. The key question for the current debate is not when but how.

So what I believe the debate should be about is the establishment, if possible, of a consensus protocol or standard or license for enabling and ensuring the availability of the research outputs associated with publicly published, publicly funded research. Alongside this is the question of establishing mechanisms for researchers to implement and be supported to observe these standards, as well as for “enforcement”. These might be trademarks, community standards, or legal or contractual approaches as well as systems and software to make all of this work, including trackbacks, citation aggregators, and effective data repositories. In addition we need to consider the public relations issue of selling such standards to disparate research funders and research communities.

Perhaps a good starting point would be to pinpoint the issues where there is general agreement and map around those. If we agree on some central principles then we can take an empirical approach to the mechanisms. We’re scientists after all, aren’t we?