Licenses and protocols for Open Science – the debate continues

This is an important discussion that has been going on in disparate places, but at the moment it is happening primarily on the Open Science mailing list maintained by the OKF (see here for an archive of the relevant thread). To try to keep things together, and because Yishay Mor asked, I thought I would try to summarize the current state of the debate.

The key aim here is to find a form of practice that will enhance data availability, and protect it into the future.

There is general agreement that there is a need for some sort of declaration associated with making data available. Clarity is important and the minimum here would be a clear statement of intention. Where there is disagreement is over what form this should take. Rufus Pollock started by giving the reasons why this should be a formal license. Rufus believes that a license provides certainty and clarity in a way that a protocol, statement of principles, or expression of community standards cannot. I, along with Bill Hooker and John Wilbanks [links are to posts on mailing list], expressed a concern that actually the use of legal language, and the notion of “ownership” of this by lawyers rather than scientists, would have profound negative results. Andy Powell points out that this did not seem to occur either in the Open Source movement or with much of the open content community. But I believe he also hits the nail on the head with the possible reason:

I suppose the difference is that software space was already burdened with heavily protective licences and that the introduction of open licences was perceived as a step in the right direction, at least by those who like that kind of thing.         

Scientific data has a history of being assumed to be in the public domain (see the lack of any license at PDB or Genbank or most other databases), so there isn’t the same sense of pushing back from an existing strong IP or licensing regime. However, I think there is broad agreement that this protocol or statement would look a lot like a license and would aim to have the legal effect of at least providing clarity over the rights of users to copy, re-purpose, and fork the objects in question.

Michael Nielsen and John Wilbanks have expressed a concern about the potential for license proliferation and incompatibility. Michael cites the example of Apache, Mozilla, and GPL2 licenses. This feeds into the issue of the acceptability, or desirability, of share-alike provisions, which is an area of significant division. Heather Morrison raises the issue of dealing with commercial entities who may take data and use technical means to effectively take it out of the public domain, citing the takeover of OAIster by OCLC as a potential example.

This is a real area of contention, I think, because some of us (including me) would see this in quite a positive light (data being used effectively in a commercial setting is better than it not being used at all) as long as the data is still both legally and technically in the public domain. Indeed this is at the core of the power of a public domain declaration. The issue of finding the resources that support the preservation of research objects in the (accessible) public domain is a separate one. But in my view, if we don’t embrace the idea that money can and should be made off data placed in the public domain, then we are going to be in big trouble sooner or later because the money will simply run out.

On the flip side of the argument is a strong tradition of arguing that viral licensing and share-alike provisions protect the rights and personal investment of individuals and small players against larger commercial entities. Many of the people who support open data belong to this tradition, often for very good historical reasons. I personally don’t disagree with the argument on a logical level, but I think for scientific data we need to provide clear paths for commercial exploitation because using science to do useful things costs a lot of money. If you want people to invest in using the outputs of publicly funded research, you need to provide them with the certainty that they can legitimately use that data within their current business practice. I think it is also clear that those of us who take this line need to come up with a clear and convincing way of expressing this argument because it is at the centre of the objection to “protection” via licenses and share-alike provisions.

Finally Yishay brings us back to the main point. Something to keep focussed on:

I may be off the mark, but I would argue that there’s a general principle to consider here. I hold that any data collected by public money should be made freely available to the public, for any use that contributes to the public good. Strikes me as a no-brainer, but of course – we have a long way to go. If we accept this principle, the licensing follows.         

Obviously I don’t agree with the last sentence – I would say that dedication to the public domain follows – but the principle I think is something we can agree that we are aiming for.

Best practice for data availability – the debate starts…well over there really

The issue of licensing arrangements and best practice for making data available has been brewing for some time but has just recently come to a head. John Wilbanks and Science Commons have a reasonably well established line that they have been developing for some time. Michael Nielsen has a recent blog post and Rufus Pollock, of the Open Knowledge Foundation, has also just synthesised his thoughts in response into a blog essay. I highly recommend reading John’s article on licensing at Nature Precedings, Michael’s blog post, and Rufus’ essay before proceeding. Another important document is the discussion of the license that Victoria Stodden is working to develop. Actually if you’ve read them go and read them again anyway – it will refresh the argument.

To crudely summarize, Rufus makes a cogent argument for the use of explicit licenses applied to collections of data, and feels that share-alike provisions in licenses or otherwise do not cause major problems and that the benefit that arises from enforcing re-use outweighs the problem. John’s position is that it is far better for standards to be applied through social pressure (“community norms”) rather than licensing arrangements. He also believes that share-alike provisions are bad because they break interoperability between different types of objects and domains. One point that I think is very important, and (I think) is a point of agreement, is that some form of license or dedication to the public domain will be crucial to developing best practice. Even if the final outcome of the debate is that everything will go in the public domain, it should be part of best practice to make that explicit.

Broadly speaking I belong to John’s camp but I don’t want to argue that case with this post. What is important in my view is that the debate takes place and that we are clear about what the aims of that debate are. What is it we are trying to achieve in the process of coming to (hopefully) some consensus on what best practice should look like?

It is important to remember that anyone can assert a license (or lack thereof) on any object that they (assert they) own or have rights over. We will never be able to impose a specific protocol on all researchers and all funders. Therefore what we are looking for is not the perfect arrangement but a balance between what is desired, what can be practically achieved, and what is politically feasible. We do need a coherent consensus view that can be presented to research communities and research funders. That is why the debate is important. We also need something that works, and is extensible into the future, where it will stand up to the development of new types of research, new types of data, new ways of making that data available, and perhaps new types of researchers altogether.

I think we agree that the minimal aim is to enable, encourage, and protect into the future the ability to re-use and re-purpose the publicly published products of publicly funded research. Arguments about personal or commercial work are much harder and much more subtle. Restricting the argument to publicly funded researchers makes it possible to open a discussion with a defined number of funders who have a public service and public engagement agenda. It also makes the moral arguments much clearer.

In focussing on research that is being made public we short circuit the contentious issue of timing. The right, or the responsibility, to commercially exploit research outputs, and the limitations this can place on data availability, is a complex and difficult area and one in which agreement is unlikely any time soon. I would also avoid the word “Open”. This is becoming a badly overloaded term with both political and emotional overtones, positive and negative. Focussing on what should happen after the decision has been made to go public reduces the argument to “what is best practice for making research outputs available”. The question of when to make them available can then be kept separate. The key question for the current debate is not when but how.

So what I believe the debate should be about is the establishment, if possible, of a consensus protocol or standard or license for enabling and ensuring the availability of the research outputs associated with publicly published, publicly funded research. Alongside this is the question of establishing mechanisms for researchers to implement, and be supported to observe, these standards, as well as for “enforcement”. These might be trademarks, community standards, or legal or contractual approaches, as well as systems and software to make all of this work, including trackbacks, citation aggregators, and effective data repositories. In addition we need to consider the public relations issue of selling such standards to disparate research funders and research communities.

Perhaps a good starting point would be to pinpoint the issues where there is general agreement and map around those. If we agree on some central principles then we can take an empirical approach to the mechanisms. We’re scientists after all, aren’t we?