The BMC 10th Anniversary Celebrations and Open Data Prize

[Image: Anopheles gambiae mosquito, via Wikipedia]

Last Thursday night I was privileged to be invited to the 10th anniversary celebrations for BioMed Central and to help announce and present the first BMC Open Data Prize. Peter Murray-Rust has written about the night and the contribution of Vitek Tracz to the Open Access movement. Here I want to focus on the prize we gave, the rationale behind it, and the (difficult!) process we went through to select a winner.

Prizes motivate behaviour in researchers. There is no question that being able to put a prize down on your CV is a useful thing. I have long felt, originally following a suggestion from Jeremiah Faith, that a prize for Open Research would be a valuable motivator and publicity aid to support those who are making an effort. I was therefore very happy to be asked to help judge the prize, supported by Microsoft and awarded at the BMC celebration to the paper in a BMC journal that was an outstanding example of Open Data. Iain Hrynaszkiewicz and Matt Cockerill from BMC and Lee Dirks from Microsoft Research, along with myself, Rufus Pollock, John Wilbanks, and Peter Murray-Rust, worked to whittle a very strong list of contenders down to a shortlist and a prize winner.

Early on we decided to focus on papers that made data available, rather than on software frameworks or approaches that supported data availability. We really wanted to focus attention on conventional scientists in traditional disciplines who were going beyond the basic requirements. This meant in turn that a whole range of very important contributions from developers, policy experts, and others were left out. Particularly notable examples were “Taxonomic information exchange and copyright: the Plazi approach” and “The FANTOM web resource: from mammalian transcriptional landscape to its dynamic regulation”.

This still left a wide field of papers making significant amounts of data available. To cut the field down at this point we looked at the licences (or lack thereof) under which resources were being made available. Anything that wasn’t, broadly speaking, “open” was rejected: code that wasn’t open source, data that was only available via a login, or data carrying non-commercial terms. None of the data provided was explicitly placed in the public domain, as recommended by Science Commons and the Panton Principles, but a reasonable amount was made available in an accessible form with no restrictions beyond a request for citation. This is an area where we expect best practice to improve, and we see the prize as a way to achieve that. In future, to be considered, any external resource will ideally have to comply with all of the Science Commons protocols, the Open Knowledge Definition, and the Panton Principles. This means an explicit dedication of data to the public domain via the PDDL or CC0.

Much of the data that we looked at was provided in the form of Excel files. This is not ideal, but in terms of accessibility it’s actually not so bad. While many of us might prefer XML, RDF, or at any rate CSV files, the bottom line is that it is possible to open most Excel files with freely available open source software, which means the data is accessible to anyone. Note the “most” there, though. It is very easy to create Excel files that make data very hard to extract. Column headings are crucial (and were missing or difficult to understand in many cases), and merging and formatting cells is an absolute disaster. I don’t want to point to examples, but a plea to those who are trying to make data available: if you must use Excel, just use plain column and row headings. No merging, no formatting, no graphs. And ideally export it as CSV as well. It isn’t as pretty, but useful data isn’t about being pretty. The figures and tables in your paper are there for the human readers; for supplementary data to be useful it needs to be in a form that computers can easily access.
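To make that concrete, here is a minimal sketch of what a clean, rectangular layout buys you. The filenames are hypothetical, and it assumes the sheet really is a plain grid with a single header row; pandas (with openpyxl behind it) is just one of many free tools that can then read the file and export CSV:

```python
# A minimal sketch, assuming a hypothetical supplementary file
# "supplementary_data.xlsx" laid out as a plain rectangular table:
# one header row, one record per row, no merged or formatted cells.
import pandas as pd

# Read the first sheet; this only works cleanly because the sheet
# is a simple grid with unambiguous column headings.
df = pd.read_excel("supplementary_data.xlsx", sheet_name=0)

# Export a CSV alongside the Excel file so the data can be read
# by any tool, with no spreadsheet software required.
df.to_csv("supplementary_data.csv", index=False)

print(df.columns.tolist())  # sanity check: are the headings sensible?
```

The moment cells are merged or headings are replaced by formatting, that one-line read stops working and every downstream user has to reverse-engineer the layout by hand.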

We finally arrived at a shortlist of about ten papers where we felt people had gone above and beyond the average. “Large-scale insertional mutagenesis of a coleopteran stored grain pest, the red flour beetle Tribolium castaneum, identifies embryonic lethal mutations and enhancer traps” received particular plaudits for making not just the data but the actual beetles available. “Assessment of methods and analysis of outcomes for comprehensive optimization of nucleofection” and “An Open Access Database of Genome-wide Association Results” were both well received as efforts to make a comprehensive data resource available.

In the end, though, we were required to pick just one winner. The winning paper got everyone’s attention right from the beginning, as it came from an area of science not necessarily known for widespread data publication. It simply provided all of the pieces of information, almost without comment, in the form of clearly set out tables. They are in Excel, and there are some issues with formatting and presentation: multiple sheets, inconsistent tabulation. It would also have been nice to see more of the analysis code. But what appealed most was that the data were simply provided, above and beyond what appeared in the main figures, as a natural part of the presentation, and that the data were in a form that could be used beyond the specific study. So it was a great pleasure to present the prize to Yoosuk Lee on behalf of the authors of “Ecological and genetic relationships of the Forest-M form among chromosomal and molecular forms of the malaria vector Anopheles gambiae sensu stricto”.

Many challenges remain: making this data discoverable, and improving the licensing and accessibility all round. Given that it is early days, we were impressed by the range of scientists making an effort to make data available. Next year we hope to be much stricter on the requirements, and we also hope to see many more nominations. In a sense, for me, the message of the evening was that the debate on Open Access publishing is over; it’s only a question of where the balance ends up. Our challenge for the future is to move on and solve the problems of making data, process, and materials more available and accessible, so as to drive more science.


Pub-sub/syndication patterns and post publication peer review

I think it is fair to say that even those of us most enamored of post-publication peer review would agree that its effectiveness remains to be demonstrated in a convincing fashion. Broadly speaking there are two reasons for this. The first is the problem of social norms for commenting; as in, there aren’t any. I think it was Michael Nielsen who referred to the “Kabuki dance of scientific discourse”. It is entirely allowed to stab another member of the research community in the back, or indeed the front, but there are specific ways and forums in which it is acceptable to do so. No-one quite knows what the appropriate rules are for commenting in online fora, as described most recently by Steve Koch.

My feeling is that this is a problem that will gradually go away as we evolve norms of behaviour in specific research communities. The current “rules” took decades to build up; it should not be surprising if it takes a few years or more to sort out an adapted set for online interactions. The bigger problem is the one that is usually surfaced as “I don’t have time for this kind of thing”, which in turn can be translated as “I don’t get any reward for this”. Whether that reward is a token to put on your CV, actual cash, useful information coming back to you, or just the warm feeling that someone else found your comments useful, rewards are important for motivating people (and researchers).

One of the things that links these two together is a sense of loss of control over the comment. Commenting on journal websites is just that: commenting on the journal’s website. The comment author has not only “given up” their piece of value, which is often not even citeable, but has also lost control over what happens to that piece of content. If you change your mind, even if the site allows you to delete the comment, you have no way of checking whether it is still in the system somewhere.

In a sense, the Web 2.0 world was built almost precisely wrong for personal content. For me, Jon Udell has written most clearly about this when he talks about the publish-subscribe pattern behind successful frameworks. In essence, I publish my content and you choose to subscribe to it. This works well for me, the blogger, at this site, but it is not so great for the commenter, who has to leave their comment to my tender mercies on my site. It would be better if the commenter could publish their comment and I could syndicate it back to my blog. The current arrangement creates all sorts of problems: it is challenging for you to aggregate your own comments together, and you have to rely on the functionality of specific sites to help you follow responses to your comments. Jon wrote about this better than I can in his blog post.
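As a sketch of how that syndication might work in practice (the URLs are hypothetical, and the convention that a comment declares its target post among its links is an assumption for illustration): the commenter publishes their comments as a feed they control, and my blog polls that feed and pulls in only the entries that point back at one of my posts.

```python
# A minimal sketch of the publish-subscribe pattern for comments.
# The comment lives on the commenter's own server, so they keep
# control of it and can delete or amend it at the source; the blog
# merely subscribes. Feed and post URLs are hypothetical.
import feedparser  # pip install feedparser

COMMENTER_FEED = "https://example.org/alice/comments.atom"  # hypothetical
MY_POST_URL = "https://example.com/blog/pub-sub-patterns"   # hypothetical

def syndicated_comments(feed_url, post_url):
    """Yield (title, body, permalink) for feed entries aimed at post_url."""
    feed = feedparser.parse(feed_url)
    for entry in feed.entries:
        # Assumed convention: the entry lists the target post among its links.
        targets = {link.get("href") for link in entry.get("links", [])}
        if post_url in targets:
            yield (entry.get("title", ""),
                   entry.get("summary", ""),
                   entry.get("link", ""))

for title, body, permalink in syndicated_comments(COMMENTER_FEED, MY_POST_URL):
    print(f"{title} ({permalink})\n{body}\n")
```

If the commenter deletes the entry from their feed, it simply stops being syndicated, which is exactly the control the journal-website model takes away.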

So a big part of the problem could be solved if people streamed their own content. This isn’t going to happen quickly in the general sense of everyone having a web server of their own; it remains too difficult for even moderately skilled people to be bothered doing this. Services will no doubt appear in the future, but current broadcast services like Twitter offer a partial solution (it’s “my” Twitter account; I can at least pretend to myself that I can delete all of it). The idea of using something like the Twitter service at microrevie.ws, as suggested by Daniel Mietchen this week, can go a long way towards solving the problem. This takes a structured tweet of the form @hreview {Object};{your review}, followed optionally by a number of asterisks for a star rating. This doesn’t work brilliantly for papers because of the length of references for a paper, even with shortened DOIs, the need for sometimes lengthy reviews, and the shortness of tweets. Additionally, the Twitter account is not automatically associated with a unique research contributor ID. However, the principle of the author of the review controlling their own content, while at the same time making links between themselves and that content in a linked open data kind of way, is extremely powerful.
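For illustration, a tiny helper that composes a tweet in that format shows how quickly the 140-character budget runs out. The helper is hypothetical, not part of the microrevie.ws service, and the DOI and review text are placeholders:

```python
# A sketch of the @hreview structured-tweet format described above:
# "@hreview {Object};{your review}" plus optional asterisks as a star
# rating. The DOI and review text below are illustrative placeholders.
def hreview_tweet(obj, review, stars=0):
    """Compose an @hreview tweet; refuse anything over 140 characters."""
    tweet = f"@hreview {obj};{review} {'*' * stars}".rstrip()
    if len(tweet) > 140:  # the length ceiling that makes papers awkward
        raise ValueError(f"tweet is {len(tweet)} chars; shorten the review")
    return tweet

# Even with a shortened DOI there is little room for a substantive review.
print(hreview_tweet("doi:10.1000/xyz123",
                    "clear data tables, though the Excel formatting is patchy",
                    stars=4))
```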

Imagine a world in which your email outbox or local document store is also a web server (via any one of an emerging set of tools like Wave, Dropbox, or Opera Unite). You can choose whom to share your review with, and change that over time. If you choose to make it public, the journal or the authors can give you some form of credit. It is interesting to think that author-side charges could perhaps be reduced for valuable reviews. This wouldn’t work in a naive way, with $10 per review, because people would churn out large amounts of rubbish reviews; but if those reviews are out on the linked data web, then their impact can be measured by their PageRank and the authors rewarded accordingly.
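A toy sketch of that last idea, with made-up documents and numbers: treat reviews as nodes in the link graph, score them by PageRank, and split a hypothetical reward pool in proportion to score, so a review that nothing links to earns the smallest share.

```python
# Toy illustration of rewarding reviews by their position in a link
# graph. All documents and reviews are made up; each edge points from
# a document to a review it links to or builds on.
import networkx as nx  # pip install networkx

g = nx.DiGraph()
g.add_edges_from([
    ("paper_A", "review_1"), ("paper_B", "review_1"),
    ("blog_post", "review_1"),   # review_1 is widely linked
    ("paper_A", "review_2"),     # review_2 modestly linked
])
g.add_node("review_3")           # review_3: churned-out filler, no links

scores = nx.pagerank(g)
reviews = {n: s for n, s in scores.items() if n.startswith("review_")}

POOL = 100.0  # hypothetical reward pool, in whatever unit of credit
total = sum(reviews.values())
for name, score in sorted(reviews.items()):
    print(f"{name}: {POOL * score / total:.2f}")
```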

Rewards and control, linked together, might provide a way of solving the problem, or at least of solving it faster than we currently are.