Beyond the Impact Factor: Building a community for more diverse measurement of research

[Image: an old measuring tape, via Wikipedia]

I know I’ve been a bit quiet for a few weeks. Mainly I’ve been away for work and having a brief holiday so it is good to be plunging back into things with some good news. I am very happy to report that the Open Society Institute has agreed to fund the proposal that was built up in response to my initial suggestion a month or so ago.

OSI, which many will know as one of the major players in bringing the Open Access movement to its current position, will fund a workshop to identify both areas where the measurement and aggregation of research outputs can be improved and the barriers to achieving those improvements. This will be immediately followed by a concentrated development workshop (or hackfest) aimed at delivering prototypes that show what is possible. The funding also covers further development effort to take one or two of these prototypes to the proof-of-principle stage, ideally with the aim of deploying them into real working environments where they might be useful.

The workshop structure will be developed by the participants over the six weeks leading up to the event itself. I aim to set the date in the next week or so, but it will most likely be early to mid-March. The workshop will be held in southern England, with the venue also to be settled over the next week or so.

There is a lot to pull together here and I will be aiming to contact everyone who has expressed an interest over the next few weeks to start talking about the details. In the meantime I’d like to thank everyone who has contributed to the effort thus far. In particular I’d like to thank Melissa Hagemann and Janet Haven at OSI and Gunner from Aspiration, who have been a great help in focusing and sharpening the proposal. Too many people contributed to the proposal itself to name them all (and you can check out the Google Doc history if you want to pull apart their precise contributions) but I do want to thank Heather Piwowar and David Shotton in particular for their contributions.

Finally, the success of the proposal, and in particular the community response around it, has made me much more confident that some of the dreams we have for using the web to support research are becoming a reality. I will leave the details for another post, but what I found fascinating is how far the network of potential contributors spread from what was essentially a single blog post. I contacted a few people directly, but most have become involved through the network of contacts that spread from the original post. The network, and the tools, are effective enough that a community can be built up rapidly around an idea from a much larger and more diffuse collection of people. The challenge of this workshop, and of the wider project, is to see how we can make that aggregated community into a self-sustaining conversation that produces useful outputs over the longer term.

It’s a complete coincidence that Michael Nielsen posted a piece in the past few hours that forms a great document for framing the discussion. I’ll aim to write something in response soon, but in the meantime follow the top link below.


A breakthrough on data licensing for public science?

I spent two days this week visiting Peter Murray-Rust and others at the Unilever Centre for Molecular Informatics at Cambridge. There was a lot of useful discussion and I learned an awful lot that requires more thinking and will no doubt result in further posts. In this one I want to relay a conversation we had over lunch with Peter, Jim Downing, Nico Adams, Nick Day and Rufus Pollock that seemed extremely productive. It should be noted that what follows is my recollection, so it may not be entirely accurate and shouldn’t necessarily be taken as an accurate representation of other people’s views.

The appropriate way to license published scientific data is an argument that has now been rolling on for some time. Broadly speaking, the argument has settled into two camps. First, there are those who believe in the value of the share-alike or copyleft provisions of the GPL and similar licenses. Many of these people come from an Open Source software or Open Content background. The primary concern of this group is to spread the message and use of Open Content and to prevent “freeloaders” from using open material without contributing back to the open community. A presumption in this view is that a license is a good, or at least acceptable, way of achieving both of these goals. Also included here are those who think it is important to allow people the freedom to address their concerns through copyleft approaches. I think it is fair to characterize Rufus as falling into this latter group.

On the other side are those, including myself, who are concerned more centrally with enabling re-use and re-purposing of data as far as is possible. Most of us are scientists of one sort or another and not programmers per se. We don’t tend to be concerned about freeloading (and in some cases welcome it as effective re-use). Another common characteristic is that we have been prevented from making our own content as free as we would like by copyleft provisions. I prefer to make all my content CC-BY (or CC0 where possible), but I am frequently limited in my ability to do this by the wish to incorporate CC-BY-SA or GFDL material. We are deeply worried by the potential for licensing to make it harder to re-use and re-mix disparate sets of data and content into new digital objects. There is a sense amongst this group that “data is different” from other types of content, particularly in its diversity of types and re-uses. More generally there is the concern that anything that “smells of lawyers”, like something called a “license”, will have scientists running screaming in the opposite direction as they try to avoid any contact with their local administration and legal teams.

What I think was productive about the discussion on Tuesday is that we focused on what we could agree on, with the aim of seeing whether it was possible to find a common position statement on the limited area of best practice for the publication of data that arises from public science. I believe such a statement is important because there is a window of opportunity to influence funder positions. Many funders are adopting data sharing policies, but most refer to “following best practice”, and that best practice is thin on the ground in most areas. With funders wielding the ultimate stick, there is a real opportunity to bootstrap good practice by providing clear guidance and tools that make it easy for researchers to deliver on their obligations. Funders in turn will likely adopt this best practice as policy if it is widely accepted by their research communities.

So we agreed on the following (I think – anyone should feel free to correct me of course!):

  1. A simple statement is required along the lines of “best practice in data publishing is to apply protocol X”. Not a broad selection of licenses with different effects, not a complex statement about what the options are, but “best practice is X”.
  2. The purpose of publishing public scientific data and collections of data, whether in the form of a paper, a patent, data publication, or deposition to a database, is to enable re-use and re-purposing of that data. Non-commercial terms prevent this in an unpredictable and unhelpful way. Share-alike and copyleft provisions have the potential to do the same under some circumstances.
  3. The scientific research community is governed by strong community norms, particularly with respect to attribution. If we could successfully expand these to include share-alike approaches as a community expectation that would obviate many concerns that people attempt to address via licensing.
  4. Explicit statements of the status of data are required and we need effective technical and legal infrastructure to make this easy for researchers.

So in aggregate I think we agreed a statement similar to the following:

“Where a decision has been taken to publish data deriving from public science research, best practice, to enable the re-use and re-purposing of that data, is to place it explicitly in the public domain via {one of a small set of protocols, e.g. CC0 or PDDL}.”

The advantage of this statement is that it focuses purely on what should be done once a decision to publish has been made, leaving the issue of what should be published to a separate policy statement. This also sidesteps the question of which data should not be made public. It focuses on data generated by public science, narrowing the field to the space in which there is a moral obligation to make such data available to the public that funds it. By describing this as best practice it also allows deviations that may, for whatever reason, be justified by specific people in specific circumstances. Ultimately the community, referees, and funders will be the judge of those justifications. The BBSRC data sharing policy states, for instance:

BBSRC expects research data generated as a result of BBSRC support to be made available…no later than the release through publication…in-line with established best practice in the field [CN – my emphasis]…

The key point that came out of the discussion for me is perhaps that we can’t and won’t agree on a general solution for data, but that we can articulate best practice in specific domains. I think we have agreed that for the specific domain of published data from public science there is a way forward. If so, that is a very useful step.

Creating a research community monoculture – just when we need diversity

This post follows on from a random tweet I sent a few weeks back in response to a query on Twitter from Lord Drayson, the UK’s Minister of State for Science and Innovation. I thought it might be worth expanding on the 140 characters I had to play with at the time, but it’s taken me a while to get to it. It builds on the ideas of a post from last year, but is given a degree of urgency by the changes in policy currently proposed by EPSRC.

Government money for research is limited, and comes from the pockets of taxpayers. It is incumbent on those of us who spend it to ensure that this investment generates maximum impact. Impact, for me, comes in two forms. First there is straightforward (although not straightforward to measure) economic impact: increases in competitiveness, standard of living, development of business opportunities, social mobility, and reductions in the burden of ill health and, hopefully, in environmental burden at some point in the future. The problem with economic impact is that it is almost impossible to measure in any meaningful way. The second area of impact is, at least on the surface, a little easier to track: the research outputs delivered. How efficiently do we turn money into science? Scratch beneath the surface and you rapidly realise that measurement is a nightmare, but we can at least look at where there are inefficiencies, where money is being wasted and lost from the pipelines before it can be spent on research effort.

The approach that is being explicitly adopted in the UK is to concentrate research in “centres of excellence” and to “focus research on areas where the UK leads” and where “they are relevant to the UK’s needs”. At one level this sounds like motherhood and apple pie. It makes sense in terms of infrastructure investment to focus research funding both geographically and in specific subject areas. But at another level it has the potential to completely undermine the UK’s history of research excellence.

There is a fundamental problem with trying to maximise the economic impact of research, and it is one that any commercial expert, or indeed politician, should find obvious. Markets are good at picking winners; committees are very bad at it. Using committees of scientists, with little or no experience of commercialising research outputs, is likely to be an unmitigated disaster. There is no question that some research leads to commercial outcomes, but to the best of my knowledge there is no evidence that anyone has ever had any success in picking the right projects in advance. The simple fact is that the biggest form of economic impact from research is in providing and supporting the diverse and skilled workforce that underpins a commercially responsive, high-technology economy. To a very large extent it doesn’t actually matter what specific research you support as long as it is diverse. And you will probably generate just about the same number of commercial outcomes by picking at random as you will by trying to pick winners.

The world, and the UK in particular, is facing severe challenges, both economic and environmental, for which there may be technological solutions. Indeed there is a real opportunity in the current economic climate to reboot the economy with low-carbon technologies, and at the same time to rebuild the information economy in a way that takes advantage of the tools the web provides, using this in turn to improve outcomes in health and social welfare and to develop new environmentally friendly processes and materials. The UK has great potential to lead these developments precisely because it has a diverse research community and a diverse, highly trained research and technology workforce. We are well placed to solve today’s problems with tomorrow’s technology.

Now let us return to the current UK policy proposals. These are to concentrate research, to reduce diversity, and to focus on areas of UK strength. How will those strengths be identified? No doubt by committee. Will they be forward-looking strengths? No, they will be what a bunch of old men, already selected by their conformance to a particular stereotype, i.e. the ones doing fundable research in fundable places, identify in a closed room. It is easy to identify the big challenges. It is not easy, perhaps not even possible, to identify the technological solutions that will eventually solve them. Not the currently most promising solutions, but the ones that will solve the problem five or ten years down the track.

As a thought experiment think back to what the UK’s research strengths and challenges were 20 years ago and imagine a world in which they were exclusively funded. It would be easy to argue that many of the UK’s current strengths simply wouldn’t even exist (web technology? biotechnology? polymer materials?). And that disciplines that have subsequently reduced in size or entirely disappeared would have been maintained at the cost of new innovation. Concentrating research in a few places, on a few subjects, will reduce diversity, leading to the loss of skills, and probably the loss of skilled people as researchers realise there is no future career for them in the UK. It will not provide the diverse and skilled workforce required to solve the problems we face today. Concentrating on current strengths, no matter how worthy, will lead to ossification and conservatism making UK research ultimately irrelevant on a world stage.

What we need more than ever now is a diverse and vibrant research community working on a wide range of problems, and better communication tools so as to efficiently connect unexpected solutions to problems in different areas. This is not the usual argument for “blue skies research”, whatever that may be. It is an argument for using market forces to do what they are best at (pick the winners from a range of possible technologies) and for using the smart people currently employed in research positions at government expense to do what they are good at: do research and train new researchers. It is an argument for critically looking at the expenditure of government money in a holistic way, and for seriously considering radical change where money is being wasted. I have estimated in the past that the annual cost of failed grant proposals to the UK government is somewhere between £100M and £500M, a large sum of money in anybody’s books. More rigorous economic analysis of a Canadian government funding scheme has shown that the cost of preparing and refereeing the proposals ($CAN40k) is more than the cost of giving every eligible applicant a baseline grant of $CAN30k. This is not just farcical, it is an offensive waste of taxpayers’ money.
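The arithmetic behind that Canadian comparison is worth spelling out. A minimal sketch, using the per-proposal cost ($CAN40k) and baseline grant ($CAN30k) quoted above; the size of the applicant pool is an illustrative assumption, not a figure from the study:

```python
# Illustrative comparison: competitive grant review vs. a flat baseline grant.
# The per-proposal cost (CAN$40k) and baseline grant (CAN$30k) are from the
# text; the applicant pool size is an assumed, purely illustrative number.

APPLICANTS = 1000            # hypothetical number of eligible applicants
COST_PER_PROPOSAL = 40_000   # CAN$: preparing and refereeing one proposal
BASELINE_GRANT = 30_000      # CAN$: flat grant to every eligible applicant

# Competitive scheme: every application incurs the preparation and
# refereeing cost, whether or not it is ultimately funded.
competitive_overhead = APPLICANTS * COST_PER_PROPOSAL

# Baseline scheme: skip the competition and fund every applicant directly.
baseline_total = APPLICANTS * BASELINE_GRANT

print(f"Review overhead:  CAN${competitive_overhead:,}")
print(f"Funding everyone: CAN${baseline_total:,}")
print(f"Difference:       CAN${competitive_overhead - baseline_total:,}")
```

Whatever applicant count you plug in, the overhead exceeds the cost of universal funding by $CAN10k per applicant, which is the point of the comparison.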

The funding and distribution of research money requires radical overhaul. I do not believe that simply providing more money is the solution. Frankly, we’ve had a lot more money; it makes life a little more comfortable if you are in the right places, but it has reduced the pressure to solve the underlying problems. We need responsive funding at a wide range of levels that enables both bursts of research, the kind of instant collaboration that we know can work, with little or no review, and large-scale data gathering projects of strategic importance that need extensive and careful critical review before being approved. And we need mechanisms to tension these against each other. We need baseline funding to just let people get on with research, and we need access to larger sums where appropriate.

We need less bureaucracy, less direction from the top, and more direction from the sides, from the community, and not necessarily just the community of researchers. What we have at the moment are strategic initiatives announced by research councils that are around five years behind the leading edge, which distort and constrain real innovation. Now we have ministers proposing to identify the UK’s research strengths. No doubt these will be five to ten years out of date, and they will almost certainly stifle those pockets of excellence that will grow in strength over the next decade. No-one will ever agree what tomorrow’s strengths will be. Much better would be to get on and find out.