The BMC 10th Anniversary Celebrations and Open Data Prize

[Image: Anopheles gambiae mosquito, via Wikipedia]

Last Thursday night I was privileged to be invited to the 10th anniversary celebrations for BioMed Central and to help announce and award the first BMC Open Data Prize. Peter Murray-Rust has written about the night and the contribution of Vitek Tracz to the Open Access movement. Here I want to focus on the prize we gave, the rationale behind it, and the (difficult!) process we went through to select a winner.

Prizes motivate researchers. There is no question that being able to put a prize on your CV is a useful thing. I have long felt, originally following a suggestion from Jeremiah Faith, that a prize for Open Research would be a valuable motivator and publicity aid to support those who are making an effort. I was therefore very happy to be asked to help judge the prize, supported by Microsoft, to be awarded at the BMC celebration to the paper in a BMC journal that was an outstanding example of Open Data. Iain Hrynaszkiewicz and Matt Cockerill from BMC and Lee Dirks from Microsoft Research, along with myself, Rufus Pollock, John Wilbanks, and Peter Murray-Rust, worked to narrow a very strong list of contenders down to a shortlist and a prize winner.

Early on we decided to focus on papers that made data available, rather than on software frameworks or approaches that supported data availability. We really wanted to focus attention on conventional scientists in traditional disciplines who were going beyond the basic requirements. This meant in turn that a whole range of very important contributions from developers, policy experts, and others were left out. Particularly notable examples were “Taxonomic information exchange and copyright: the Plazi approach” and “The FANTOM web resource: from mammalian transcriptional landscape to its dynamic regulation”.

This still left a wide field of papers making significant amounts of data available. To cut the list down at this point we looked at the licences (or lack thereof) under which resources were being made available. Anything that wasn’t, broadly speaking, “open” was rejected: code that wasn’t open source, data that was only available via a login, or data carrying non-commercial terms. None of the data was explicitly placed in the public domain, as recommended by Science Commons and the Panton Principles, but a reasonable amount was made available in an accessible form with no restrictions beyond a request for citation. This is an area where we expect best practice to improve, and we see the prize as a way to achieve that. In future, to be considered, a resource should ideally comply with all of the Science Commons Protocols, the Open Knowledge Definition, and the Panton Principles; that means an explicit dedication of the data to the public domain via the PDDL or CC0.

Much of the data we looked at was provided in the form of Excel files. This is not ideal, but in terms of accessibility it’s actually not so bad. While many of us might prefer XML, RDF, or at any rate CSV files, the bottom line is that most Excel files can be opened with freely available open source software, which means the data is accessible to anyone. Note the “most”, though. It is very easy to create Excel files that make data very hard to extract. Column headings are crucial (and were missing or difficult to understand in many cases), and merging and formatting cells is an absolute disaster. I don’t want to point to examples, but a plea to those who are trying to make data available: if you must use Excel, just use column headings and row headings. No merging, no formatting, no graphs. And ideally export the data as CSV as well. It isn’t as pretty, but useful data isn’t about being pretty. The figures and tables in your paper are for the human readers; for supplementary data to be useful it needs to be in a form that computers can easily access.
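To illustrate how little is needed when a spreadsheet is well formed, here is a minimal sketch in Python of the kind of round-trip we would like every supplementary file to survive. It assumes pandas with an Excel engine installed, and the file name is hypothetical:

```python
# Minimal sketch: read a plain, well-formed Excel sheet and export it
# as CSV. Assumes pandas (with an Excel engine such as openpyxl) is
# installed; "supplementary_data.xlsx" is a hypothetical file name.
import pandas as pd

# This only works cleanly when the sheet has a single row of column
# headings and no merged cells, embedded charts, or layout formatting.
df = pd.read_excel("supplementary_data.xlsx", sheet_name=0)

# Write a CSV that any tool, open source or otherwise, can read.
df.to_csv("supplementary_data.csv", index=False)
```

The point is that a clean sheet round-trips in a couple of lines; a sheet full of merged and formatted cells does not.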

We finally reduced our shortlist to around ten papers where we felt people had gone above and beyond the average. “Large-scale insertional mutagenesis of a coleopteran stored grain pest, the red flour beetle Tribolium castaneum, identifies embryonic lethal mutations and enhancer traps” received particular plaudits for making not just the data but the actual beetles available. “Assessment of methods and analysis of outcomes for comprehensive optimization of nucleofection” and “An Open Access Database of Genome-wide Association Results” were both well received as efforts to make comprehensive data resources available.

In the end, though, we were required to pick just one winner. The winning paper got everyone’s attention right from the beginning, as it came from an area of science not necessarily known for widespread data publication. It simply provided all of the pieces of information, almost without comment, in the form of clearly set out tables. The tables are in Excel, and there are some issues with formatting and presentation: multiple sheets, inconsistent tabulation. It would have been nice to see more of the analysis code as well. But what appealed most was that the data were provided, above and beyond what appeared in the main figures, as a natural part of the presentation, and in a form that could be used beyond the specific study. So it was a great pleasure to present the prize to Yoosuk Lee on behalf of the authors of “Ecological and genetic relationships of the Forest-M form among chromosomal and molecular forms of the malaria vector Anopheles gambiae sensu stricto”.

Many challenges remain: making this data discoverable, and improving the licensing and accessibility all round. Given that it is early days, we were impressed by the range of scientists making an effort to make data available. Next year we hope to be much stricter on the requirements, and we also hope to see many more nominations. For me, the message of the evening was that the debate on Open Access publishing is over; it is only a question of where the balance ends up. Our challenge for the future is to move on and solve the problems of making data, process, and materials more available and accessible, so as to drive more science.


The Panton Principles: Finding agreement on the public domain for published scientific data

[Image: Drafters of the Panton Principles]

I had the great pleasure and privilege of announcing the launch of the Panton Principles at the Science Commons Symposium – Pacific Northwest on Saturday. The launch, coming many months after the Principles were first suggested, is largely down to the work of Jonathan Gray. This was one of several projects that I haven’t been able to follow through on properly, and I want to acknowledge the effort that Jonathan has put into making it happen. I thought it might be helpful to describe where the Principles came from, what they are intended to do, and, perhaps just as importantly, what they are not.

The Panton Principles aim to articulate a view of what best practice should be with respect to data publication in science. They arose out of an ongoing conversation between myself, Peter Murray-Rust, and Rufus Pollock. Rufus founded the Open Knowledge Foundation, an organisation that seeks to promote and support open culture, open source, and open science, with the emphasis on the open. The OKF position on licences has always been that share-alike provisions are an acceptable limitation on complete freedom to re-use content. I have always taken the Science Commons position that share-alike provisions, particularly on data, have the potential to make it difficult or impossible to get multiple datasets or systems to interoperate. In another post I will explore this disagreement, which really amounts to a different perspective on the balance between the risks and consequences of theft versus things not being used or useful. Peter, in turn, is particularly concerned about the practicalities, really wanting a straightforward set of rules to be baked right into publication mechanisms.

The Principles came out of a discussion in the Panton Arms, a pub near the Chemistry Department of Cambridge University, after I had given a talk in the Unilever Centre for Molecular Informatics. We were having our usual argument, each trying to win the others over, when we turned instead to what we could actually agree on. What sort of statement could we make that would capture the best parts of both positions, with a focus on science and data? We focussed further by trying to draw out one specific issue: not the question of when people should share results, or the details of how, but the mechanisms that should be used to enable re-use. The Principles are intended to address what happens once a decision has been made to publish data, on the assumption that the wish is for that data to be effectively re-used.

Where we found agreement was that for science, and for scientific data, and particularly for science funded by public investment, the public domain was the best approach and the one we would all recommend. We brought John Wilbanks in both to bring the views of Creative Commons and to help craft the words. It also made a good excuse to return to the pub. We couldn’t agree on everything – we will never agree on everything – but the form of words chosen – that placing data explicitly, irrevocably, and legally in the public domain satisfies both the Open Knowledge Definition and the Science Commons Principles for Open Data – was something that we could all personally sign up to.

The end result is something that I have no doubt is imperfect. We borrowed inspiration from the Budapest Declaration, but there are three B’s (Budapest, Bethesda, and Berlin). Perhaps it will take three P’s to capture all the aspects that we need. I’m certainly up for some meetings in Pisa or Portland, Pittsburgh or Prague (I’m less convinced about Perth, but if it works for anyone else it would make my mother happy). For me it captures something that we agree on: a way forward towards making the best possible practice a common and practical reality. It is something I can sign up to, and I hope you will consider doing so as well.

Above all, it is a start.
