Fork, merge and crowd-sourcing data curation


Over the past few weeks there has been a sudden increase in the amount of financial data on scholarly communications in the public domain. This was triggered in large part by the Wellcome Trust releasing data on the prices paid for Article Processing Charges by the institutions it funds. The release of this pretty messy dataset was followed by a substantial effort to clean that data up. This crowd-sourced data curation process has been described by Michelle Brook. Here I want to reflect on the tools that were available to us and how they made some aspects of this collective data curation easy, but also made some other aspects quite hard.

The data started its life as a CSV file on Figshare, a very frequent starting point. I pulled that dataset and did some cleanup using OpenRefine, a tool I highly recommend as a starting point for any moderate to large dataset, particularly one that has been put together manually. I could use OpenRefine to quickly identify and correct variant publisher and journal name spellings, clean up some of the entries, and also find issues that looked like mistakes. It's a great tool for that initial cleanup, but it's a tool for a single user, so once I'd done that work I pushed my cleaned-up CSV file to GitHub so that others could work with it.
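The kind of normalisation OpenRefine helps with can be sketched in a few lines of Python. This is a rough illustration of the idea, not OpenRefine's actual implementation: a key-collision "fingerprint" (lowercase, strip punctuation, sort the tokens) groups variant spellings of the same name. The publisher names here are invented for the example.

```python
import re

def fingerprint(name):
    """Key-collision fingerprint, similar in spirit to OpenRefine's
    default clustering method: lowercase, replace punctuation with
    spaces, then sort the unique tokens."""
    tokens = re.sub(r"[^\w\s]", " ", name.lower()).split()
    return " ".join(sorted(set(tokens)))

def cluster_variants(names):
    """Group names that share a fingerprint; keep only groups with
    more than one distinct spelling, i.e. the likely variants."""
    clusters = {}
    for name in names:
        clusters.setdefault(fingerprint(name), []).append(name)
    return {k: v for k, v in clusters.items() if len(set(v)) > 1}

publishers = ["Wiley-Blackwell", "Wiley Blackwell", "wiley blackwell.",
              "Elsevier", "Springer"]
print(cluster_variants(publishers))
# The three Wiley variants collapse onto one fingerprint and are
# flagged as a cluster for a human to review and merge.
```

In OpenRefine itself this is interactive: you review each proposed cluster and choose a canonical spelling, which is exactly the human-judgement step that makes the tool so useful for manually assembled data.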

After I pushed the data to GitHub, a number of people did exactly what I'd intended and forked the dataset. That is, they took a copy and added it to their own repository. In the case of code, people will fork a repository, add to or improve the code, and then make a pull request that notifies the original repository owner that there is new code they might want to merge into their version of the codebase. The success of GitHub has been built on making this process easy, even fun. For data the merge process can get a bit messy, but the potential was there for others to do some work and for us to be able to combine it back together.

But GitHub is really only used by people comfortable with command-line tools – my thinking was that people would use computational tools to enhance the data. Theo Andrews, though, had the idea of bringing in many more people to manually look at and add to the data. Here an online spreadsheet that many people can work on at once, such as those provided by Google Docs, is a powerful tool, and it was through the adoption of a GDoc that somewhere over 50 people were able to add to the spreadsheet and annotate it, creating a high-value dataset that allowed the Wellcome Trust to do a much deeper analysis than had previously been possible. The dataset had been forked again, now to a new platform, and this tool enabled what you might call a "social merge", collecting the individual efforts of many people through an easy-to-use tool.

The interesting thing was that exactly the facilities that made the GDoc attractive for manual crowdsourcing made it very difficult for those of us working with automated tools to contribute effectively. We could take the data and manipulate it, forking again, but if we then pushed that re-worked data back we ran the risk of overwriting whatever anyone else had done in the meantime. The live online multi-person interaction that works so well for people was actually a problem for computational processing. The interface that makes working with the data easy for people created a barrier to automation, and a barrier to merging back what others of us were trying to do. [As an aside, yes, we could in principle work through the GDocs API, but that's just not the way most of us work when doing this kind of data processing.]
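To make the overwriting problem concrete, here is a minimal sketch of the kind of field-level three-way merge that would let automated edits and manual edits be recombined safely instead of clobbering one another. Everything here is invented for illustration: the row keys, column names, and values are hypothetical, and real merging of the APC data would first need a reliable per-row key, which the spreadsheet didn't have.

```python
def three_way_merge(base, ours, theirs):
    """Field-level three-way merge of {row_key: {column: value}} tables.
    A field changed on only one side wins over the unchanged side;
    fields changed differently on both sides are flagged as conflicts
    rather than silently overwritten."""
    merged, conflicts = {}, []
    for key in set(base) | set(ours) | set(theirs):
        b = base.get(key, {})
        o = ours.get(key, b)      # a side lacking the row inherits base
        t = theirs.get(key, b)    # (a simplification: deletions ignored)
        row = {}
        for field in set(b) | set(o) | set(t):
            bv, ov, tv = b.get(field), o.get(field), t.get(field)
            if ov == tv:              # both sides agree (or both unchanged)
                row[field] = ov
            elif ov == bv:            # only 'theirs' changed it
                row[field] = tv
            elif tv == bv:            # only 'ours' changed it
                row[field] = ov
            else:                     # genuine conflict: keep ours, flag it
                row[field] = ov
                conflicts.append((key, field, ov, tv))
        merged[key] = row
    return merged, conflicts

# 'ours' (a script) corrected the cost; 'theirs' (a person in the GDoc)
# corrected the publisher name. Both edits survive the merge.
base   = {"r1": {"publisher": "Wiley",           "cost": "1000"}}
ours   = {"r1": {"publisher": "Wiley",           "cost": "1200"}}
theirs = {"r1": {"publisher": "Wiley-Blackwell", "cost": "1000"}}
merged, conflicts = three_way_merge(base, ours, theirs)
print(merged, conflicts)
```

This is exactly the pattern git applies to source code; the missing piece for the spreadsheet was any record of the common ancestor, so every push back was effectively a blind overwrite.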

Crowdsourcing of data collection and curation tends to follow one of two paths. Collection of data is usually done into some form of structured data store, supported by a form that helps the contributor provide the right kind of structure. Tools like EpiCollect provide a means of rapidly building these kinds of projects. At the other end, large-scale data curation efforts such as GalaxyZoo tend to create purpose-built interfaces to guide users through the curation process, again creating structured data. Where there has been less tool building, and fewer big successes, is the space in the middle: where messy or incomplete data has been collected and a community wants to enhance it and clean it up. OpenRefine is a great tool, but isn't collaborative. GDocs is a great collaborative platform, but creates barriers to using automated cleanup tools. GitHub and code repositories are great for supporting the fork, work, and merge-back pattern, but don't support direct human interaction with the data.

These issues are part of a broader pattern with Open Access, Open Data, and Open Educational Resources more generally. With the right formats, licensing, and distribution mechanisms we've become very good at supporting the fork part of the cycle. People can easily take content and re-purpose it for their own local needs. What we're not so good at is providing the mechanisms, both social and technical, that make it easy to contribute those variations, enhancements, and new ideas back to the original resources. This is both a harder technical problem and challenging from a social perspective. Giving stuff away and letting people use it is easy, because it requires little additional work. Working with people to accept their contributions back takes time and effort, both often in short supply.

The challenge may be even greater because the means for making one type of contribution easier may make others harder. That certainly felt like the case here. But if we are to reap the benefits of open approaches then we need to do more than just throw things over the fence. We need to find the ways to gather back and integrate all the value that downstream users can add.


A new sustainability model: Major funders to support OA journal


“The Howard Hughes Medical Institute, the Max Planck Society and the Wellcome Trust announced today that they are to support a new, top-tier, open access journal for biomedical and life sciences research. The three organisations aim to establish a new journal that will attract and define the very best research publications from across these fields. All research published in the journal will make highly significant contributions that will extend the boundaries of scientific knowledge.” [Press Release]

It has been clear for some time that the slowness of the adoption of open access publication models by researchers is in large part down to the terror we have of stepping out of line and publishing in the ‘wrong’ journals. More radical approaches to publication will clearly lag even further behind while this inherent conservatism is dominant. Publishers like PLoS and BMC have tackled this head on by aiming to create prestigious journals, but the top of the pile has remained the traditional clutch of Nature, Science, and Cell.

The incumbent publishers have simultaneously been able to sit back due to a lack of apparent demand from researchers. As the demand from funders has increased they have held back, complaining that the business models to support Open Access publication are not clear. I’ve always found the ‘business model’ argument slightly specious. Sustainability is important, but scholarly publishing has never really had a viable business model; it has had a subsidy from funders. Part of the problem has been the multiple layers and channels that subsidy has gone through, but essentially funders, through indirect funding of academic libraries, have been footing the bill.

Some funders, and the Wellcome Trust has led on this, have demanded that their researchers make their outputs accessible while simultaneously requiring that publishers comply with their requirements on access and re-use rights. But progress has been slow, particularly in opening up what is perceived as the top of the market. Despite major inroads made by PLoS Biology and PLoS Medicine, those journals perceived as the most prestigious have remained resolutely closed.

Government funders are mostly constrained in their freedom to act, but the Wellcome Trust, HHMI, and Max Planck Society have the independence to take the logical step. They are already paying for publication, so why not actively support the formation of a new, properly open access journal, and at the same time lend it the prestige that their names can bring?

This will send a very strong message, both to researchers and publishers, about what these funders value, and where they see value for money. It is difficult to imagine this will not lead to a seismic shift in the publishing landscape, at least from a political and financial perspective. I don’t believe this journal will be as technically radical as I would like, but it is unlikely it could be while achieving the aims that it has. I do hope the platform it is built on enables innovation both in terms of what is published and the process by which it is selected.

But in a sense that doesn’t matter. This venture can remain incredibly conservative and still have a huge impact on taking the research communication space forward. What it means is that three of the world’s key funders have made an unequivocal statement that they want to see Open Access, full open access on publication without restrictions on commercial use, or text-mining, or re-use in any form, across the whole of the publication spectrum. And if they don’t get it from the incumbent publishers they’re prepared to make it happen themselves.

Full Disclosure: I was present at a meeting at Janelia Farm in 2010 where the proposal to form a journal was discussed by members of the Wellcome, HHMI, and MPG communities.


Metrics of use: How to align researcher incentives with outcomes


It has become reflexive in the Open communities to talk about a need for “cultural change”. The obvious next step becomes to find strong and widely respected advocates of change, to evangelise to young researchers, and to hope for change to follow. Inevitably this process is slow, perhaps so slow as to be ineffective. So beyond the grassroots evangelism we move towards policy change as a top-down mechanism for driving improved behaviour. If funders demand that data be open, and that papers be accessible to the wider community, as a condition of funding, then this will happen. The NIH mandate and the work of the Wellcome Trust on Open Access show that this can work, and indeed that mandates in some form are necessary to raise compliance to acceptable levels.

But policy is a blunt instrument, and researchers, being who they are, don’t like to be pushed around. Passive-aggressive responses from researchers are relatively ineffectual in the peer-reviewed article space. A paper is a paper. If it’s under the right licence then things will probably be ok, and a specific licence is easy to mandate. Data, though, is a different kettle of fish. It is very easy to comply with a data availability mandate but provide that data in a form which is totally useless. Indeed it is rather hard work to provide it in a form that is useful. Data, software, reagents, and materials are incredibly diverse, and it is difficult to make good policy that is both effective and specific enough, as well as general enough to be useful. So beyond the policy mandate stick, which will only ever provide a minimum level of compliance, how do we motivate researchers to put the effort into making their outputs available in a useful form? How do we encourage them to want to do the right thing? After all, what we want to enable is re-use.

We need more sophisticated motivators than blunt policy instruments, so we arrive at metrics: measuring the outputs of researchers. There has been a wonderful animation illustrating a Daniel Pink talk doing the rounds in the past week. Well worth a look and important stuff, but I think a naive application of it to researchers’ motivations would miss two important aspects. Firstly, money is never “off the table” in research. We are always to some extent limited by resources. Secondly, the intrinsic motivators, the internal metrics that matter to researchers, are tightly tied to the metrics that are valued by their communities. In turn those metrics are tightly tied to resource allocation. Most researchers value their papers, the places they are published, and the citations received as measures of their value, because that’s what their community values. The system is highly leveraged towards rapid change, if and only if a research community starts to value a different set of metrics.

What might the metrics we would like to see look like? I would suggest that they should focus on what we want to see happen. We want return on the public investment, we want value for money, but above all we want to maximise the opportunity for research outputs to be used and to be useful. We want to optimise the usability and re-usability of research outputs and we want to encourage researchers to do that optimisation. Thus if our metrics are metrics of use we can drive behaviour in the right direction.

If we optimise for re-use then we automatically value access, and we automatically value the right licensing arrangements (or lack thereof). If we value and measure use then we optimise for the release of data in useful forms and for the release of open source research software. If we optimise for re-use, for discoverability, and for value add, then we can automatically weigh the loss of access inherent in publishing in Nature or Science against the enhanced discoverability and editorial contribution, and put a real value on these aspects. We would stop arguing about whether tenure committees should value blogging and start asking how much those blogs were used by others to provide outreach, education, and research outcomes.

For this to work there would need to be mechanisms that automatically credit the use of a much wider range of outputs. We would need to cite software and data, would need to acknowledge the providers of metadata that enabled our search terms to find the right thing, and we would need to aggregate this information in a credible and transparent way. This is technically challenging, and technically interesting, but do-able. Many of the pieces are in place, and many of the community norms around giving credit and appropriate citation are in place, we’re just not too sure how to do it in many cases.

Equally this is a step back towards what the mother of all metrics, the Impact Factor, was originally about. The IF was intended as a way of measuring the use of journals through counting citations, as a means of helping librarians choose which journals to subscribe to. Article Level Metrics are in many ways the obvious return to this, now that we want to measure the outputs of specific researchers. The h-index, for all its weaknesses, is a measure of re-use of outputs through formal citations. Influence and impact are already an important motivator at the policy level. Measuring use is actually a quite natural way to proceed. If we can get it right it might also provide the motivation we want to align researcher interests with the wider community and optimise access to research for both researchers and the public.
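As a reminder of just how simple a re-use metric the h-index is, here is a minimal sketch. The citation counts are invented for the example; the definition itself (the largest h such that h of a researcher's papers each have at least h citations) is the standard one.

```python
def h_index(citations):
    """Largest h such that at least h papers have h or more citations each."""
    counts = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# A researcher with papers cited 10, 8, 5, 4 and 3 times has h = 4:
# four papers each have at least four citations, but there are not
# five papers with at least five.
print(h_index([10, 8, 5, 4, 3]))  # 4
```

The point is that the whole apparatus reduces to counting one narrow kind of re-use, the formal citation; the metrics of use argued for above would apply the same counting spirit to a much wider range of outputs.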
