Data is free or hidden – there is no middle ground

Science Commons and others are organising a workshop on Open Science issues as a satellite meeting of the EuroScience Open Forum meeting in July. This is pitched as an opportunity to discuss policy, funding, and social issues with an impact on the ‘Open Research Agenda’. In preparation for that meeting I wanted to continue to explore some of the conflicts that arise between wanting to make data freely available as soon as possible and the need to protect the interests of the researchers who have generated data and (perhaps) have a right to the benefits of exploiting that data.

John Cumbers proposed the idea of a ‘Protocol’ for open science that included the idea of a ‘use embargo’: the idea that when data is initially made available, no-one else should work on it for a specified period of time. I proposed more generally that people could ask others to leave data alone for any particular period of time, but that there ought to be an absolute limit on this type of embargo to prevent data being tied up. These kinds of ideas revolve around the need to forge community norms – standards of behaviour that are expected, and to some extent enforced, by a community. The problem is that these need to evolve naturally, rather than be imposed by committee. If there isn’t community buy-in then proposed standards have no teeth.

An alternative approach to solving the problem is to adopt some sort of ‘license’: a legal or contractual framework that creates obligations about how data can be used and re-used. This could impose embargoes of the type that John suggested, perhaps as flexible clauses in the license. One could imagine an ‘Open data – six month analysis embargo’ license. This is attractive because it apparently gives you control over what is done with your data while also allowing you to make it freely available. This is why people who first come to the table with an interest in sharing content always start with CC-BY-NC. They want everyone to have their content, but not to make money out of it. It is only later that people realise what other effects this restriction can have.

I had rejected the licensing approach because I thought it could only work in a walled garden, something which goes against my view of what open data is about. More recently John Wilbanks has written some wonderfully clear posts on the nature of the public domain, and the place of data in it, that make clear that it can’t even work in a walled garden. Because data is in the public domain, no contractual arrangement can protect your ability to exploit that data; it can only give you a legal right to punish someone who does something you haven’t agreed to. This has important consequences for the idea of Open Science licences and standards.

If we argue as an ‘Open Science Movement’ that data is in and must remain in the public domain then, if we believe this is in the common good, we should also argue for the widest possible interpretation of what is data. The results of an experiment, regardless of how clever its design might be, are a ‘fact of nature’, and therefore in the public domain (although not necessarily publicly available). Therefore if any person has access to that data they can do whatever they like with it as long as they are not bound by a contractual arrangement. If someone breaks a contractual arrangement and makes the data freely available there is no way you can get that data back. You can punish the person who made it available if they broke a contract with you. But you can’t recover the data. The only way you can protect the right to exploit data is by keeping it secret. This is entirely different to creative content where if someone ignores or breaks licence terms then you can legally recover the content from anyone that has obtained it.

Why does this matter to the Open Science movement? Aren’t we all about making the data available for people to do whatever anyway? It matters because you can’t place any legal limitations on what people do with data you make available. You can’t put something up and say ‘you can only use this for X’ or ‘you can only use it after six months’ or even ‘you must attribute this data’. Even in a walled garden, once there is one hole, the entire edifice is gone. The only way we can protect the rights of those who generate data to benefit from exploiting it is through the hard work of developing and enforcing community norms that provide clear guidelines on what can be done. It’s that or simply keep the data secret.

What is important is that we are clear about this distinction between legal and ethical protections. We must not tell people that their data can be protected, because essentially it can’t. And this is a real challenge to the ethos of open data because it means that our only absolutely reliable method for protecting people is by hiding data. Strong community norms will, and do, help but there is a need to be careful about how we encourage people to put data out there. And we need to be very strong in condemning people who do the ‘wrong’ thing. Which is why a discussion on what we believe is ‘right’ and ‘wrong’ behaviour is incredibly important. I hope that discussion kicks off in Barcelona and continues globally over the next few months. I know that not everyone can make the various meetings that are going on – but between them and the blogosphere and the ‘streamosphere’ we have the tools, the expertise, and hopefully the will, to figure these things out.

The Open Data licensing issue – Deepak Singh
On the erosion of the public domain – John Wilbanks
Chemspider: Good intentions and the fog of licensing – John Wilbanks
Going Legal on CC-0

6 Replies to “Data is free or hidden – there is no middle ground”

Mostly for my own clarification, since I think you address this in a couple ways, but what would happen if the license said something like “here is my data, if you want to use it please tell me how you plan to use it. If your planned use directly conflicts with what I’m already doing or planning to do in the immediate future (say 6 months), you should not work on the data unless it is through a collaboration with me. Otherwise, have at it (and keep me updated if you’re so inclined)!” It sounds like the problem is that this isn’t enforceable, so people could just take the data and run – but don’t they have to attribute the data? And then you and everyone else would know that they didn’t follow the community guidelines, which could be grounds for rejection from journals. Maybe they could take the data and claim it as their own – but then they’d need to say how they produced the data, and it would probably be difficult to come up with something that produces exactly the same data if they don’t know your protocol. Maybe keeping the protocol hidden OR the data hidden until publication, but not both?

My point is precisely that you can’t protect the data with a contract or licence. You can beat someone around the head or demand monetary damages even but you can’t get the data back because they were never yours in the first place. In the example you give if someone passes your data to someone else without your permission you have no recourse to getting the data back from the second person. You can, if your contract/licence says so, punish the first person, but the second person has done nothing legally wrong.

Once someone has the data they can do anything with it without attribution. That is what they _can_ do, which is different to what they _should_ do. In practice if you could show that someone used your data without attribution they would be likely to face a misconduct hearing – but that is different to having legal recourse.

The idea of hiding part is interesting – I guess this is similar to placing a hashed code or watermark in the data in some ways. We may need tools like this to enforce community norms – but first we need some agreement on those norms!
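As a rough illustration of the kind of tool that might help here, the sketch below publishes a salted SHA-256 digest of a dataset while the data itself stays hidden, so that the digest can later be checked against the released data. This is a simple commitment scheme rather than a true watermark, and it proves nothing in law; it only gives a community something to check. The function names, the sample dataset, and the salt are all hypothetical, chosen for the example.

```python
import hashlib


def commitment(data: bytes, salt: bytes) -> str:
    """Return the SHA-256 hex digest of salt + data.

    The digest can be posted publicly (for example in a dated notebook
    entry) without revealing the data or the salt.
    """
    return hashlib.sha256(salt + data).hexdigest()


def verify(data: bytes, salt: bytes, published: str) -> bool:
    """Check that released data and salt match a previously posted digest."""
    return commitment(data, salt) == published


# Hypothetical dataset and salt, for illustration only.
dataset = b"sample_id,absorbance\n1,0.42\n2,0.57\n"
salt = b"a-secret-random-value"

# Posted at the time the data is generated.
published = commitment(dataset, salt)

# Later, once data and salt are released, anyone can repeat the check.
print(verify(dataset, salt, published))
```

The salt is there so that nobody can guess a small dataset by hashing candidate values. Whether a check like this counts for anything still depends on the community agreeing that it should.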

When researchers publish a paper, their ideas are protected as they can easily claim authorship. The journals are publicly and legally recognized by our society.

A medium would need such recognition for researchers to feel secure to share their data.

Now think about OWW. Imagine how the scientific community would respond to an authorship claim based on an OWW log.
While it might sound crazy at first, all the entries made in OWW are dated and preserved and OWW is backed by a large and strong community.
