I spent two days this week visiting Peter Murray-Rust and others at the Unilever Centre for Molecular Informatics at Cambridge. There was a lot of useful discussion and I learned an awful lot that requires more thinking and will no doubt result in further posts. In this one I want to relay a conversation we had over lunch with Peter, Jim Downing, Nico Adams, Nick Day and Rufus Pollock that seemed extremely productive. It should be noted that what follows is my recollection, so it may not be entirely accurate and shouldn’t necessarily be taken to represent other people’s views.
The appropriate way to license published scientific data is an argument that has now been rolling on for some time. Broadly speaking the argument has devolved into two camps. Firstly those who have a belief in the value of share-alike or copyleft provisions of GPL and similar licenses. Many of these people come from an Open Source Software or Open Content background. The primary concern of this group is to spread the message and use of Open Content and to prevent “freeloaders” from being able to use Open material and not contribute back to the open community. A presumption in this view is that a license is a good, or at least acceptable, way of achieving both these goals. Also included here are those who think that it is important to allow people the freedom to address their concerns through copyleft approaches. I think it is fair to characterize Rufus as falling into this latter group.
On the other side are those, including myself, who are concerned more centrally with enabling re-use and re-purposing of data as far as is possible. Most of us are scientists of one sort or another and not programmers per se. We don’t tend to be concerned about freeloading (or in some cases welcome it as effective re-use). Another common characteristic is that we have been prevented from being able to make our own content as free as we would like due to copyleft provisions. I prefer to make all my content CC-BY (or cc0 where possible). I am frequently limited in my ability to do this by the wish to incorporate CC-BY-SA or GFDL material. We are deeply worried by the potential for licensing to make it harder to re-use and re-mix disparate sets of data and content into new digital objects. There is a sense amongst this group that “data is different” to other types of content, particularly in its diversity of types and re-uses. More generally there is the concern that anything that “smells of lawyers”, like something called a “license”, will have scientists running screaming in the opposite direction as they try to avoid any contact with their local administration and legal teams.
What I think was productive about the discussion on Tuesday is that we focused on what we could agree on with the aim of seeing whether it was possible to find a common position statement on the limited area of best practice for the publication of data that arises from public science. I believe such a statement is important because there is a window of opportunity to influence funder positions. Many funders are adopting data sharing policies but most refer to “following best practice” and that best practice is thin on the ground in most areas. With funders wielding the ultimate potential stick there is great potential to bootstrap good practice by providing clear guidance and tools to make it easy for researchers to deliver on their obligations. Funders in turn will likely adopt this best practice as policy if it is widely accepted by their research communities.
So we agreed on the following (I think – anyone should feel free to correct me of course!):
- A simple statement is required along the lines of “best practice in data publishing is to apply protocol X”. Not a broad selection of licenses with different effects, not a complex statement about what the options are, but “best practice is X”.
- The purpose of publishing public scientific data and collections of data, whether in the form of a paper, a patent, data publication, or deposition to a database, is to enable re-use and re-purposing of that data. Non-commercial terms prevent this in an unpredictable and unhelpful way. Share-alike and copyleft provisions have the potential to do the same under some circumstances.
- The scientific research community is governed by strong community norms, particularly with respect to attribution. If we could successfully expand these to include share-alike approaches as a community expectation that would obviate many concerns that people attempt to address via licensing.
- Explicit statements of the status of data are required and we need effective technical and legal infrastructure to make this easy for researchers.
So in aggregate I think we agreed a statement similar to the following:
“Where a decision has been taken to publish data deriving from public science research, best practice to enable the re-use and re-purposing of that data, is to place it explicitly in the public domain via {one of a small set of protocols e.g. cc0 or PDDL}.”
The advantage of this statement is that it focuses purely on what should be done once a decision to publish has been made, leaving the issue of what should be published to a separate policy statement. This also sidesteps issues of which data should not be made public. It focuses on data generated by public science, narrowing the field to the space in which there is a moral obligation to make such data available to the public that fund it. By describing this as best practice it also allows deviations that may, for whatever reason, be justified by specific people in specific circumstances. Ultimately the community, referees, and funders will be the judge of those justifications. The BBSRC data sharing policy states for instance:
BBSRC expects research data generated as a result of BBSRC support to be made available…no later than the release through publication…in-line with established best practice in the field [CN – my emphasis]…
The key point for me that came out of the discussion is perhaps that we can’t and won’t agree on a general solution for data but that we can articulate best practice in specific domains. I think we have agreed that for the specific domain of published data from public science there is a way forward. If this is the case then it is a very useful step forward.
Cameron, it seems to me there is a 3rd camp, that believes that under some circumstances licences are not just desirable but necessary. The defining cases for me would be of data which could be used to identify individuals, the subjects of research (though there are many other possible cases). Here there are conflicting requirements that must be carefully balanced: to make the data re-usable, even interoperable, and to honour the implied contract of informed consent, which means placing restrictions on some aspects of re-use. These data may still be “published”, but they cannot be under any of the current CC licences.
But for the data you are interested in, the approach does seem reasonable!
Chris, absolutely, and that is why we ended up drawing the scope so narrowly. The way around e.g. the privately identifiable data within this definition is that there is clear policy that says such data should not be published. In other cases a risk assessment may come to the same conclusions (certain types of animal experiment might be a case in point – where the safety of the experimenter could be compromised for instance). But in this case you make the choice not to publish the data. Once the decision is taken to publish, and crucially precisely what to publish, then it seems reasonable that for that specific data the public domain is appropriate.
In one sense what we are saying is circular. If you choose to make data public (i.e. published) then it should be done in a way that the public can use it and re-use it. If, for whatever reason, it can’t be made public then that is a separate issue, but then it has not been published. So if you e.g. publish an analysis of patient data then the aggregate data supporting the claims in the publication need to be PD but the background detailed data need not be published. That’s an issue for reproducibility and understanding and auditing the process but it’s not the data that was chosen to be published. But there are going to be lots of challenges, particularly around DNA sequence data: things that aren’t currently identifiable but where it is clear that they may be in the future. The release (or not) of the 1000 genomes data is proving a serious headache.
I agree with Cameron that cc0 or similar is best for most types of scientific data but Chris is correctly pointing out that doing this with data that may allow identification of individuals protected by “informed consent” forms is not an option. In fields with many such cases (like mine, neuroimaging), following Cameron in making “the choice not to publish the data” would represent an embargo on large-scale collaboration or data pooling, as is increasingly needed for applications like Alzheimer diagnostics or even the “simple” screening of brain development and aging in the healthy population. Because of this, neuroimaging data is increasingly shared (e.g. via initiatives like http://www.loni.ucla.edu/ADNI/ , http://www.loni.ucla.edu/BIRN/ , http://www.na-mic.org/ ) but never with CC0. The individual licenses are sometimes very elaborate (typically not applying to the database as a whole but to certain subsets of data deposited therein, with additional opt-in and opt-out options) and present considerable barriers to reuse (which can nonetheless normally be overcome by researchers in the field) and especially aggregation (as per http://neurogateway.org/ , for instance) – an issue that certainly requires further thought. Some relevant papers: http://www.springerlink.com/content/x516109723011t47/ (2003, subscription required) and http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2039786 (2007, Open Access) and http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2661130 (2008, Open Access).
Daniel, I don’t disagree with anything there, but I would say that that is data sharing and not data publication. Publication is to make public. As I say, this is drawn very specifically and very narrowly. It is precisely the legal and regulatory requirements here that make it impossible to publish the data in its raw form.
What we need to figure out on top of a lot of this is how one audits claims based on data which cannot be made public. But that, I think, is a separate issue?
Friendfeed conversation at: http://friendfeed.com/cameronneylon/e5111080/breakthrough-on-data-licensing-for-public
Agreed on the distinction between data sharing and data publication in this context.