Fork, merge and crowd-sourcing data curation

I like to call this one "Fork"

Over the past few weeks there has been a sudden increase in the amount of financial data on scholarly communications in the public domain. This was triggered in large part by the Wellcome Trust releasing data on the prices paid for Article Processing Charges by the institutions it funds. The release of this pretty messy dataset was followed by a substantial effort to clean that data up. This crowd-sourced data curation process has been described by Michelle Brook. Here I want to reflect on the tools that were available to us and how they made some aspects of this collective data curation easy, but also made some other aspects quite hard.

The data started its life as a CSV file on Figshare. This is a very frequent starting point. I pulled that dataset and did some cleanup using OpenRefine, a tool I highly recommend as a starting point for any moderate to large dataset, particularly one that has been put together manually. I could use OpenRefine to quickly identify and correct variant publisher and journal name spellings, clean up some of the entries, and also find issues that looked like mistakes. It’s a great tool for that initial cleanup, but it’s a tool for a single user, so once I’d done that work I pushed my cleaned-up CSV file to github so that others could work with it.
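
To give a feel for what that cleanup looks like in practice, here is a rough sketch of the same idea in script form (this is an illustration rather than the actual OpenRefine workflow, and the filename and "Publisher" column are hypothetical): it clusters variant spellings on a normalised key, much in the spirit of OpenRefine's key-collision clustering.

```python
# A rough sketch of the idea, not the actual OpenRefine steps: cluster
# variant publisher spellings by a normalised "fingerprint" and replace
# each variant with the most common spelling in its cluster. The filename
# and the "Publisher" column are hypothetical.
import csv
from collections import Counter, defaultdict

def fingerprint(name: str) -> str:
    # lower-case, strip punctuation, sort unique tokens so that
    # "Wiley-Blackwell" and "Blackwell, Wiley" collide on the same key
    cleaned = "".join(c if c.isalnum() or c.isspace() else " " for c in name.lower())
    return " ".join(sorted(set(cleaned.split())))

with open("apc_data.csv", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

clusters = defaultdict(list)
for row in rows:
    clusters[fingerprint(row["Publisher"])].append(row["Publisher"])

# choose the most frequent spelling in each cluster as canonical
canonical = {key: Counter(names).most_common(1)[0][0] for key, names in clusters.items()}
for row in rows:
    row["Publisher"] = canonical[fingerprint(row["Publisher"])]

with open("apc_data_cleaned.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
```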

After pushing to github a number of people did exactly what I’d intended and forked the dataset. That is, they took a copy and added it to their own repository. In the case of code people will fork a repository, add to or improve the code, and then make a pull request that notifies the original repository owner that there is new code they might want to merge into their version of the codebase. The success of github has been built on making this process easy, even fun. For data the merge process can get a bit messy, but the potential was there for others to do some work and for us to be able to combine it back together.

But github is really only used by people comfortable with command line tools – my thinking was that people would use computational tools to enhance the data. Theo Andrews, however, had the idea of bringing in many more people to manually look at and add to the data. Here an online spreadsheet that many people can work on at once, such as those provided by GoogleDocs, is a powerful tool. It was through the adoption of the GDoc that somewhere over 50 people were able to add to the spreadsheet and annotate it, creating a high value dataset that allowed the Wellcome Trust to do a much deeper analysis than had previously been the case. The dataset had been forked again, now to a new platform, and this enabled what you might call a “social merge”, collecting the individual efforts of many people through an easy to use interface.

The interesting thing was that exactly the facilities that made the GDoc attractive for manual crowdsourcing efforts made it very difficult for those of us working with automated tools to contribute effectively. We could take the data and manipulate it, forking again, but if we then pushed that re-worked data back we ran the risk of overwriting what anyone else had done in the meantime. The live online multi-person interaction that works well for people was actually a problem for computational processing. The interface that makes working with the data easy for people created a barrier to automation and a barrier to merging back what others of us were trying to do. [As an aside, yes we could in principle work through the GDocs API but that’s just not the way most of us work doing this kind of data processing].
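
For illustration, the kind of round trip we were attempting looks roughly like the sketch below (the sheet ID is a placeholder and this assumes a sheet shared for public viewing, rather than going through the GDocs API proper). The point is that the automated side only ever sees a snapshot, so pushing results back risks clobbering edits made in the meantime.

```python
# A minimal sketch, not what we actually did: pull a snapshot of a publicly
# viewable Google Sheet as CSV, run automated cleanup on the snapshot, and
# save the result locally. SHEET_ID and the "Publisher" column are
# hypothetical placeholders.
import pandas as pd

SHEET_ID = "YOUR-SHEET-ID"
url = f"https://docs.google.com/spreadsheets/d/{SHEET_ID}/export?format=csv"

df = pd.read_csv(url)                            # snapshot of the crowd-edited data
df["Publisher"] = df["Publisher"].str.strip()    # automated cleanup goes here
df.to_csv("apc_snapshot_cleaned.csv", index=False)

# Note there is no real "merge" step: re-uploading this file would overwrite
# whatever the manual curators have changed since the snapshot was taken.
```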

Crowdsourcing of data collection and curation tends to follow one of two paths. Collection of data is usually done into some form of structured data store, supported by a form that helps the contributor provide the right kind of structure. Tools like EpiCollect provide a means of rapidly building these kinds of projects. At the other end, large scale data curation efforts, such as GalaxyZoo, tend to create purpose built interfaces to guide the users through the curation process, again creating structured data. Where there has been less tool building, and fewer big successes, is the space in the middle, where messy or incomplete data has been collected and a community wants to enhance it and clean it up. OpenRefine is a great tool, but isn’t collaborative. GDocs is a great collaborative platform but creates barriers to using automated cleanup tools. Github and code repositories are great for supporting the fork, work, and merge-back pattern but don’t support direct human interaction with the data.

These issues are part of a broader pattern with Open Access, Open Data, and Open Educational Resources more generally. With the right formats, licensing and distribution mechanisms we’ve become very, very good at supporting the fork part of the cycle. People can easily take content and re-purpose it for their own local needs. What we’re not so good at is providing the mechanisms, both social and technical, to make it easy to contribute those variations, enhancements and new ideas back to the original resources. This is both a harder technical problem and challenging from a social perspective. Giving stuff away and letting people use it is easy because it requires little additional work. Working with people to accept their contributions back in takes time and effort, both often in short supply.

The challenge may be even greater because the means for making one type of contribution easier may make others harder. That certainly felt like the case here. But if we are to reap the benefits of open approaches then we need to do more than just throw things over the fence. We need to find the ways to gather back and integrate all the value that downstream users can add.


Parsing the Willetts Speech on Access to UK Research Outputs

David Willetts speaking at the Big Society policy launch, Coin St, London. (Photo credit: Wikipedia)

Yesterday David Willetts, the UK Science and Universities Minister, gave a speech to the Publishers Association that has got wide coverage. However it is worth pulling apart both the speech and the accompanying opinion piece from the Guardian, because there are some interesting elements in there, and also because some things have got a little confused.

The first really key point is that there is nothing new here. This is basically a re-announcement of the previous position from the December Innovation Strategy on moving towards a freely accessible literature and a more public announcement of the Gateway to Research project previously mentioned in the RCUK response to the Innovation Statement.

The Gateway to Research project is a joint venture of the Department for Business, Innovation and Skills and Research Councils UK to provide a one stop shop for information on UK research funding as well as pointers to outputs. It will essentially draw information directly from sources that already exist (the Research Outputs System and eVal) as well as some new ones, with the intention of helping the UK public and enterprise find research and researchers that are of interest to them, and see how they are funded.

The new announcement was that Jimmy Wales of Wikipedia fame will be advising on the GTR portal. This is a good thing and he is well placed to provide both technical and social expertise on the provision of public facing information portals, as well as a more radical perspective than might come out of BIS itself. While this might in part be cynically viewed as another example of bringing in celebrities to advise on policy, this is a celebrity with relevant expertise and real credibility based on making similar systems work.

The rest of the information that we can gather relates to government efforts in moving towards making the UK research literature accessible. Wales also gets a look-in here, and will be “advising us on [..] common standards to ensure information is presented in a readily reusable form”. My reading of this is that the Minister understands the importance of interoperability, and my hope is that this means government is getting good advice on appropriate licensing approaches to support it.

However, many have read this section of the speech as saying that GTR will act as some form of national repository for research articles. I do not believe this is the intention, and reading between the lines the comment that it will “provide direct links to actual research outputs such as data sets and publications” [my emphasis] is the key. The point of GTR is to make UK research more easily discoverable. Access is a somewhat orthogonal issue. This is better read as an expression of Willetts’ and the wider government’s agenda on transparency of public spending than as a mechanism for providing access.

What else can we tell from the speech? Well, the term “open access” is used several times, something that was absent from the innovation statement, but the emphasis is still on achieving “public access” in the near term, with “open access” cast as the future goal as I read it. It’s not clear to me whether this is a well-informed distinction. There is a somewhat muddled commentary on Green vs Gold OA, but not that much more muddled than what often comes from our own community. There are also some clear statements on the challenges for all involved.

As an aside I found it interesting that Willetts gave a parenthetical endorsement of usage metrics for the research literature when speaking of his own experience.

As well as reading some of the articles set by my tutors, I also remember browsing through the pages of the leading journals to see which articles were well-thumbed. It helped me to spot the key ones I ought to be familiar with – a primitive version of crowd-sourcing. The web should make that kind of search behaviour far easier.

This is the most sophisticated appreciation of the potential for the combination of measurement and usage data in discovery that I have seen from any politician. It needs to be set against his endorsement of rather cruder filters earlier in the speech but it nonetheless gives me a sense that there is a level of understanding within government that is greater than we often fear.

Much of the rest of the speech is hedging. Options are discussed but not selected and certainly not promoted. The key message: wait for the Finch Report, which will be the major guide for the route the government will take and the mechanisms that will be put in place to support it.

But there are some clearer statements. There is a strong sense that Hargreaves’ recommendations on enabling text mining should be implemented, and the logic for this is well laid out. The speech and the policy agenda are embedded in a framework of enabling innovation – making it clear what kinds of evidence and argument we will need to marshal in order to persuade. There is also a strong emphasis on data, as well as an appreciation that there is much to do in this space.

But the clearest statement made here is on the end goals. No-one can be left in any doubt of Willetts’ ultimate target: full access to the outputs of research, ideally at the time of publication, in a way that enables them to be fully exploited, manipulated and modified for any purpose by any party. Indeed the vision is strongly congruent with the Berlin, Bethesda, and Budapest declarations on Open Access. There is still much to be argued about the route and its length, but in the UK at least, the destination appears to be in little doubt.


Response to the OSTP Request for Information on Public Access to Research Data

Response to Request for Information – FR Doc. 2011-28621

Dr Cameron Neylon – U.K. based research scientist writing in a personal capacity

Introduction

Thank you for the opportunity to respond to this request for information and to the parallel RFI on access to scientific publications. Many of the higher level policy issues relating to data are covered in my response to the other RFI and I refer to that response where appropriate here. Specifically, I reiterate my point that a focus on IP in the publication is a non-productive approach. Rather, it is more productive to identify the outcomes that are desired as a result of the federal investment in generating data, and from those outcomes to identify the services that are required to convert the raw material of the research process into accessible outputs that can be used to support those outcomes.

Response

(1) What specific Federal policies would encourage public access to and the preservation of broadly valuable digital data resulting from federally funded scientific research, to grow the U.S. economy and improve the productivity of the American scientific enterprise?

Where the Federal government has funded the generation of digital data, either through generic research funding or through focussed programs that directly target data generation, the purpose of this investment is to generate outcomes. Some data has clearly defined applications, and much data is obtained to further very specific research goals. However, while it is possible to identify likely applications, it is not possible, and indeed foolhardy, to attempt to define and limit the full range of uses which data may find.

Thus to ensure that data created through federal investment is optimally exploited it is crucial that data be a) accessible, b) discoverable, c) interpretable and d) legally re-usable by any person for any purpose. To achieve this requires investment in infrastructure, markup, and curation. This investment is not currently seen as either a core activity for researchers themselves, or a desirable service for them to purchase. It is therefore rare for such services or resource needs to be thoughtfully costed in grant applications.

The policy challenge is therefore to create incentives, both symbolic and contractual, but also directly meaningful to researchers in their impact on career progression, that encourage researchers either to undertake these necessary activities directly themselves or to purchase, and appropriately cost, third party services to have them carried out.

Policy intervention in this area will be complex and will need to be thoughtful. Three simple policy moves however are highly tractable and productive, without requiring significant process adjustments in the short term:

a) Require researchers to provide a data management or data accessibility plan within grant requests. The focus of these plans should be showing how the project will enable third party groups to discover and re-use data outputs from the project.

b) As part of the project reporting, require measures of how data outputs have been used. These might include download counts, citations, comments, or new collaborations generated through the data. In the short term this assessment need not be directly used, but it sends a message that agencies consider this important.

c) Explicitly measure performance on data re-use. Require this as part of bio sketches and provide data on previous performance to grant panels. In the longer term it may be appropriate to provide guidance to panels on the assessment of previous performance on data re-use, but in the first instance simply providing the information will affect behaviour and raise general awareness of issues of data accessibility, discoverability, and usability.

(2) What specific steps can be taken to protect the intellectual property interests of publishers, scientists, Federal agencies, and other stakeholders, with respect to any existing or proposed policies for encouraging public access to and preservation of digital data resulting from federally funded scientific research?

As noted in my response to the other RFI, the focus on intellectual property is not helpful. Private contributors of data such as commercial collaborators should be free to exploit their own contribution of IP to projects as they see fit. Federally funded research should seek to maximise the exploitation and re-use of data generated through public investment.

It has been consistently and repeatedly demonstrated in a wide range of domains that the most effective way of exploiting the outputs of research innovation, be they physical samples or digital data, to support further research, to drive innovation, or to support economic activity globally, is to make those outputs freely available with no restrictive terms. That is, the most effective way to use research data to drive economic activity and innovation at a national level is to give the data away.

The current IP environment means that in specific cases, such as where there is very strong evidence of a patentable result with demonstrated potential, the optimisation of outcomes does require protection of the IP. There are also situations where privacy and other legal considerations mean that data cannot be released, or cannot be fully released. These should however be seen as the exception rather than the rule.

(3) How could Federal agencies take into account inherent differences between scientific disciplines and different types of digital data when developing policies on the management of data?

At the Federal level only very high-level policy decisions should be taken. These should provide direction and strategy but enable tactics and the details of implementation to be handled at agency or community levels. What both the Federal agencies and coordination bodies such as OSTP can provide is oversight and, where appropriate, funding support to maintain, develop, and expand interoperability between developing standards in different communities. Federal agencies can also effectively provide an oversight function that supports activities that enhance interoperability.

Local custom, dialects, and community practice will always differ and it is generally unproductive to enforce standardisation on implementation details. The policy objectives should be to set the expectations and the frameworks within which local implementations can be developed, and to develop criteria against which those local implementations can be assessed.

(4) How could agency policies consider differences in the relative costs and benefits of long-term stewardship and dissemination of different types of data resulting from federally funded research?

Prior to assessing differences in performance and return on investment it will be necessary to provide data gathering frameworks and to develop significant expertise in the detailed assessment of the data gathered. A general principle that should be considered is that the administrative and performance data related to accessibility and re-use of research data should provide an outstanding exemplar of best practice in terms of accessibility, curation, discoverability, and re-usability.

The first step in cost-benefit analysis must be to develop a base of information and data that supports that analysis. This will mean tracking and aggregating forms of data use that are available today (download counts, citations) as well as developing mechanisms for tracking the use and impact of data in ways that are either challenging or impossible today (data use in policy development, impact of data in clinical practice guidelines).

Only once this assessment data framework is in place can a detailed process of cost-benefit analysis be seriously considered. Differences will exist in the measurable and imponderable returns on investment in data availability, and also in the timeframes over which these returns are realised. We have only a very limited understanding of these issues today.

(5) How can stakeholders (e.g., research communities, universities, research institutions, libraries, scientific publishers) best contribute to the implementation of data management plans?

If stakeholders have serious incentives to optimise the use and re-use of data then all players will seek to gain competitive advantage through making the highest quality contributions. An appropriate incentives framework obviates the need to attempt to design in or pre-suppose how different stakeholders can, will, or should best contribute going forward.

(6) How could funding mechanisms be improved to better address the real costs of preserving and making digital data accessible?

As with all research outputs there should be a clear obligation on researchers to plan, on a best efforts basis, to publish these (as in make public) in a form that most effectively supports access and re-use, tensioned against the resources available. Funding agencies should make clear that they expect communication of research outputs to be a core activity for their funded research, and that researchers and their institutions will be judged on their performance in optimising the choices they make in selecting appropriate modes of communication.

Further, funding agencies should explicitly set guidance levels on the proportion of a research grant that is expected, under normal circumstances, to be used to support the communication of outputs. Based on calculations from the Wellcome Trust, where projected expenditure on the publication of traditional research papers was around 1-1.5% of total grant costs, it would be reasonable to project total communication costs of 2-4% of total costs once data and other research communications are considered. This guidance and the details of best practice should clearly be adjusted as data is collected on both costs and performance.
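
As a purely illustrative calculation (the grant value is hypothetical; the percentages are simply the guidance levels suggested above):

```python
# Illustrative only: what these guidance levels would mean for a
# hypothetical £1,000,000 grant.
total_grant = 1_000_000
papers = (0.01 * total_grant, 0.015 * total_grant)  # traditional papers, 1-1.5%
comms = (0.02 * total_grant, 0.04 * total_grant)    # all communication incl. data, 2-4%
print(f"Traditional papers:  £{papers[0]:,.0f} - £{papers[1]:,.0f}")
print(f"All communication:   £{comms[0]:,.0f} - £{comms[1]:,.0f}")
```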

(7) What approaches could agencies take to measure, verify, and improve compliance with Federal data stewardship and access policies for scientific research? How can the burden of compliance and verification be minimized?

Ideally compliance and performance will be trackable through automated systems that are triggered as a side effect of the activities required to enable data access. Thus references for new data should be registered with appropriate services to enable discovery by third parties – these services can also be used to support the tracking of these outputs automatically. Frameworks and infrastructure for sharing should be built with tracking mechanisms built in. Much of the aggregation of data at scale can build on the existing work in the STAR METRICS program and draw inspiration from that experience.

Overall it should be possible to reduce the burden of compliance from its current level while gathering vastly more data and information of much higher quality than is currently collected.

(8) What additional steps could agencies take to stimulate innovative use of publicly accessible research data in new and existing markets and industries to create jobs and grow the economy?

There are a variety of proven methods for stimulating innovative use of data at both large and small scale. The first is to make it available. If data is made available at scale then it is highly likely that some of it will be used somewhere. The more direct encouragement of specific uses can be achieved through directed “hack events” that bring together data handling and data production expertise from specific domains. There is significant US expertise in successfully managing these events and generating exciting outcomes. These in turn lead to new startups and new innovation.

There is also a significant growth in the number of data-focussed entrepreneurs who are now veterans of the early development of the consumer web. Many of these have a significant interest in research as well as significant resources and there is great potential for leveraging their experience to stimulate further growth. However this interface does need to be carefully managed as the cultures involved in research data curation and web-scale data mining and exploitation are very different.

(9) What mechanisms could be developed to assure that those who produced the data are given appropriate attribution and credit when secondary results are reported?

The existing norms of the research community that recognise and attribute contributions to further work should be strengthened and supported. While it is tempting to use legal instruments to enforce a need for attribution there is growing evidence that this can lead to inflexible systems that cannot adapt to changing needs. Thus it is better to utilise social enforcement than legal enforcement.

The current good work on data citation and mechanisms for tracking the re-use of data should be supported and expanded. Funders should explicitly require that service providers add capacity for tracking data citation to the products that are purchased for assessment purposes. Where possible the culture of citation should be expanded into the wider world in the form of clinical guidelines, government reports, and policy development papers.

(10) What digital data standards would enable interoperability, reuse, and repurposing of digital scientific data? For example, MIAME (minimum information about a microarray experiment; see Brazma et al., 2001, Nature Genetics 29, 371) is an example of a community-driven data standards effort.

At the highest level there is a growing range of interoperable information transfer formats that can provide machine-readable and integratable data transfer, including RDF, XML, OWL, JSON and others. My own experience is that attempting to impose global interchange standards is an enterprise doomed to failure, and it is more productive to support these standards within existing communities of practice.

Thus the appropriate policy action is to recommend that communities adopt and utilise the most widely used possible set of standards and to support the transitions of practice and infrastructure required for this adoption. Selecting standards at the highest level is likely to be counterproductive. Identifying and disseminating best practice in the development and adoption of standards is, however, something that is the appropriate remit of federal agencies.

(11) What are other examples of standards development processes that were successful in producing effective standards and what characteristics of the process made these efforts successful?

There is now a significant literature on community development and practice and this should be referred to. Many lessons can also be drawn from the development of effective and successful open source software projects.

(12) How could Federal agencies promote effective coordination on digital data standards with other nations and international communities?

There are a range of global initiatives that communities should engage with. The most effective means of practical engagement will be to identify communities that have a desire to standardise or integrate systems and to support the technical and practical transitions required to enable this. For instance, there is a widespread desire to support interoperable data formats from analytical instrumentation, but few examples of carrying this transition through. Funding could be directed to supporting a specific analytical community, and the vendors that support them, to apply an existing standard to their work.

(13) What policies, practices, and standards are needed to support linking between publications and associated data?

Development in this area is at an early stage. There is a need to reconsider the form of publication in its widest sense and this will have a significant impact on the forms and mechanisms of linking. This is a time for experimentation and exploration rather than standards development.

 


Response to the RFI on Public Access to Research Communications

Have you written your response to the OSTP RFIs yet? If not, why not? This is amongst the best opportunities in years to directly tell the U.S. government how important Open Access to scientific publications is and how to start moving to a much more data-centric research process. You’d better believe that the forces of stasis, inertia, and vested interests are getting their responses in. They need to be answered.

I’ve written mine on public access and you can read and comment on it here. I will submit it tomorrow just in front of the deadline but in the meantime any comments are welcome. It expands on and discusses many of the same issues, specifically on re-configuring the debate on access away from IP and towards services, that have been in my recent posts on the Research Works Act.


Reflections on research data management: RDM is on the up and up but data driven policy development seems a long way off.

Data Represented in an Interactive 3-D Form
Image by Idaho National Laboratory via Flickr

I wrote this post for the Digital Curation Centre blog following the Research Data Management Forum meeting run in Warwick a few weeks back. If you feel moved to comment I’d ask you to do it over there.

The Research Data Management movement is moving on apace. Tools are working and adoption is growing. Policy development is starting to back up the use of those tools and there are some big ambitious goals set out for the next few years. But has the RDM movement taken the vision of data intensive research to its heart? Does the collection, sharing, and analysis of data about research data management meet our own standards? And is policy development based on and assessed against that data? Can we be credible if it is not?

Watching the discussion on research data management over the past few years has been an exciting experience. The tools, which have been possible for some years, now show real promise as the somewhat raw and ready products of initial development are used and tested.

Practice is gradually changing, if unevenly across different disciplines, and there is a growing awareness of data and of the idea that it might be considered important. And all of this is being driven increasingly by the development of policies on data availability, data management, and data archiving that stress the importance of data as a core output of public research.

The vision of the potential of a data rich research environment is what is driving this change. It is not important whether individual researchers, or even whole communities, get how fundamental a change the capacity to share and re-use data really is. The change is driven by two forces fundamentally external to the community.

The first is political: the top-down view from government that publicly funded research needs to gain from the benefits it sees in data rich commerce. A handful of people really understand how data works at these scales, but these people have the ear of government.

The second force is one of competition. In the short term, adopting new practices and developing new ways of doing research is a risk. In the longer term, those who adopt more effective and efficient approaches will simply out-compete those who do not or cannot. This is already starting to happen in those disciplines rich in shared data, and the signs are there that other disciplines are approaching a tipping point.

Data intensive research enables new types of questions to be asked, and it allows us to answer questions that were previously difficult or impossible to get reliable answers on: questions about weak effects, small correlations, and complex interactions. The kind of questions that bedevil strategic decision-making and evidence-based policy.

So naturally you’d expect that the policy development in this area, being driven by people excited by the vision of data intensive research, would have deeply embedded data gathering, model building, and analysis of how research data is being collected, made available, and re-used.

I don’t mean opinion surveys, or dipstick tests, or case studies. These are important but they’re not the way data intensive research works. They don’t scale, they don’t integrate, and they can’t provide the insight into the weak effects in complex systems that are needed to support decision making about policy.

Data intensive research is about tracking everything, logging every interaction, going through download logs, finding every mention of a specific thing wherever on the web it might be.

It’s about capturing large amounts of weakly structured data and figuring out how to structure it in a way that supports answering the question of interest. And it’s about letting the data guide you to the answers it suggests, rather than looking within it for what we “know” should be in there.

What I don’t see when I look at RDM policy development is the detailed analysis of download logs, the usage data, the click-throughs on websites. Where are the analyses of IP ranges of users, the automated reporting systems, and above all, when new policy directions are set, where is the guidance on data collection and assessment of performance against those policies?
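
To make that concrete, the kind of analysis I mean is no more exotic than the sketch below, which tallies dataset downloads per month and per network range from an ordinary web server access log. The filename, path filter, and level of aggregation are all hypothetical choices, but this is the raw material for evidence about how shared data is actually being used.

```python
# A minimal sketch: tally dataset downloads per month and per /16 network
# range from a Common Log Format web server log. The filename and the
# "/datasets/" path filter are hypothetical, and a real analysis would
# need proper bot and crawler filtering.
import re
from collections import Counter

LOG_LINE = re.compile(r'(\S+) \S+ \S+ \[(\d{2})/(\w{3})/(\d{4})[^\]]*\] "GET (\S+)')

by_month = Counter()
by_range = Counter()

with open("access.log", encoding="utf-8") as f:
    for line in f:
        m = LOG_LINE.match(line)
        if not m:
            continue
        ip, _day, month, year, path = m.groups()
        if "/datasets/" not in path:      # only count dataset downloads
            continue
        by_month[f"{year}-{month}"] += 1
        by_range[".".join(ip.split(".")[:2]) + ".x.x"] += 1

print(by_month.most_common(12))
print(by_range.most_common(10))
```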

Without this, the RDM community is arguably doing exactly the same things that we complain about in researcher communities: not taking a data-driven view of what we are doing.

I know this is hard. I know it involves changing systems, testing things in new ways, collecting data in ways we are not used to. Even imposing disciplinary approaches that are well outside the comfort zone of those involved.

I also know there are pockets of excellent practice and significant efforts to gather and integrate information. But they are pockets. And these are exactly the things that funders and RDM professionals and institutions are asking of researchers. They are the right things to be asking for, and we’re making real progress towards realizing the vision of what is possible with data intensive research.

But just imagine if we could support policy development with that same level of information. At a pragmatic and political level it makes a strong statement when we “eat our own dogfood”. And there is no better way to understand which systems and approaches are working and not working than by using them ourselves.


Incentives: Definitely a case of rolling your own

Lakhovsky: The Conversation; oil on panel. Image via Wikipedia

Science Online London ran late last week and into the weekend and I was very pleased to be asked to run a panel, broadly speaking focused on evaluation and incentives. Now I had thought that the panel went pretty well but I’d be fibbing if I said I wasn’t a bit disappointed. Not disappointed with the panel members or what they said. Yes it was a bit more general than I had hoped and there were things that I wished we’d covered but the substance was good from my perspective. My disappointment was with the response from the audience, really on two slightly different points.

The first was the lack of response to what I thought were some of the most exciting things I’ve heard in a long time from major stakeholders. I’ll come back to that later. But a bigger disappointment was that people didn’t seem to connect the dots to their own needs and experiences.

Science Online, in both its London and North Carolina forms, has for me always been a meeting where the conversation proceeds at a more sophisticated level than usual. So I pitched the plan of the session at where I thought the level should be. Yes, we needed to talk about the challenges and surface the usual problems: non-traditional research outputs, and online outputs in particular, don’t get the kind of credit that papers do; institutions struggle to give credit for work that doesn’t fit in a pigeonhole; funders seem to reward only the conventional and traditional; and people outside the ivory tower struggle to get either recognition or funding. These are known challenges; the question is how to tackle them.

The step beyond this is the hard one. It is easy to say that incentives need to change. But incentives don’t drop from heaven. Incentives are created within communities and they become meaningful when they are linked to the interests of stakeholders with resources. So the discussion wasn’t really about impact, or funding, or to show that nothing can be done by amateurs. The discussion was about the needs of institutions and funders and how they can be served by what is being generated by the online community. It was also about the constraints they face in acting. But fundamentally you had major players on the stage saying “this is the kind of thing we need to get the ball rolling”.

Make no mistake, this is tough. Everyone is constrained and resources are tight but at the centre of the discussion were the key pointers to how to cut through the knot. The head of strategy at a major research university stated that universities want to play a more diverse role, want to create more diverse scholarly outputs, and want to engage with the wider community in new ways. That smart institutions will be looking to diversify. The head of evaluation at a major UK funder said that funders really want to know about non-traditional outputs and how they were having a wider impact. That these outputs are amongst the best things they can talk about to government. That they will be crucial to make the case to sustain science funding.

Those statements are amongst the most direct and exciting I have heard in some years of advocacy in this space. The opportunity is there, if you’re willing to put the effort in to communicate and to shape what you are doing to match their needs. As Michael Nielsen said in his morning keynote, this is a collective action problem. That means finding what unites the needs of those doing the work with the needs of those with resources. It means compromise, and it means focusing on the achievable, but the point of the discussion was to identify what might be achievable.

So mostly I was disappointed that the excitement I felt wasn’t mirrored in the audience. The discussion about incentives has to move on. Saying that “institutions should do X” or “funders should do Y” gets us nowhere. Understanding what we can do together with funders and institutions and other communities to take the online agenda forward and understanding what the constraints are is where we need to go. The discussion showed that both institutions and funders know that they need what the community of online scientists can do. They don’t know how to go about it, and they don’t even know very much what we are doing, but they want to know. And when they do know they can advise and help and they can bring resources to bear. Maybe not all the resources you would like, and maybe not for all the things you would like, but resources nonetheless.

With a lot of these issues it is easy to get too immersed in the detail and to forget that people are looking in from the outside without the same context. I guess the fact that I pulled out, as the main message, what might have seemed to the audience to be just asides is indicative of that. But I really want to get that message out because I think it is critical if the community of online scientists wants to be the mainstream. And I think it should be.

The bottom line is that smart funders and smart institutions value what is going on online. They want to support it, they want to be seen to support it, but they’re not always sure how to go about it and how to judge its quality. But they want to know more. That’s where you come in and that’s why the session was relevant. Lars Fischer had it absolutely right: “I think the biggest and most consequential incentive for scientists is (informal) recognition by peers.” You know, we know, who is doing the good stuff and what is valuable. Take that conversation to the funders and the institutions, explain to them what’s good and why, and tell the story of what the value is. Put it in your CV, demand that promotion panels take account of it, whichever side of the table you are on. Show that you make an impact in language that they understand. They want to know. They may not always be able to act – funding is an issue – but they want to and they need your help. In many ways they need our help more than we need theirs. And if that isn’t an incentive then I don’t know what is.


Submission to the Royal Society Enquiry

Title page of Philosophical Transactions of the Royal Society. Image via Wikipedia

The Royal Society is running a public consultation exercise on Science as a Public Enterprise. Submissions are requested to answer a set of questions. Here are my answers.

1. What ethical and legal principles should govern access to research results and data? How can ethics and law assist in simultaneously protecting and promoting both public and private interests?

There are broadly two principles that govern the ethics of access to research results and data. Firstly, there is the simple position that publicly funded research should by default be accessible to the public (with certain limited exceptions, see below). Secondly, claims based on research that impinge on public policy, health, safety, or the environment should be supported by public access to the data. See more detail in the answer to Q2.

2 a) How should principles apply to publicly-funded research conducted in the public interest?

By default, research outputs from publicly funded research should be made publicly accessible and re-usable in as timely a manner as possible. In an ideal world the default would be immediate release; however, this is not a practically accessible goal in the near future. Cultural barriers and community inertia prevent the exploitation of technological tools that demonstrably have the potential to enable research to move faster and more effectively. Research communication mechanisms are currently shackled to the requirements of the research community to monitor career progression, and are not optimised for effective communication.

In the near term it is practical to move towards an expectation that research outputs that support published research should be accessible and re-usable. Reasonable exceptions to this include data that is personally identifiable, that may place cultural or environmental heritage at risk, that places researchers at risk, or that might affect the integrity of ongoing data collection. The key point is that while there are reasonable exceptions to the principle of public access to public research outputs, these are exceptions and not the general rule.

What is not reasonable is to withhold or limit the re-use of data, materials, or other research outputs from public research for the purpose of personal advancement, including the “squeezing out of a few more papers”. If these outputs can be more effectively exploited elsewhere then this is a more efficient use of public resources to further our public research agenda. The community has placed the importance of our own career advancement ahead of the public interest in achieving outcomes from public research for far too long.

What is also politically naive is to believe or even to create the perception that it is acceptable to withhold data on the basis that “the public won’t understand” or “it might be misused”. The web has radically changed the economics of information transfer but it has perhaps more importantly changed the public perception on access to data. The wider community is rightly suspicious of any situation where public information is withheld. This applies equally to publicly funded research as it does to government data.

2 b) How should principles apply to privately-funded research involving data collected about or from individuals and/or organisations (e.g. clinical trials)?

Increasingly, public advocacy groups are becoming involved in contributing to a range of research activities: patient advocacy groups supporting clinical trials, environmental advocacy groups supporting data collection, as well as wider public involvement in, for instance, citizen science projects.

In the case where individuals or organisations are contributing to research they have a right for that contribution to be recognised and a right to participate on their own terms (or to choose not to participate where those terms are unacceptable).

Organised groups (particularly patient groups) are of growing importance to a range of research. Researchers should expect to negotiate with such groups as to the ultimate publication of data. Such groups should have the ability to demand greater public release and to waive rights to privacy. Equally contributors have a right to expect a default right to privacy where personally identifiable information is involved.

Privacy trumps the expectation of data release, and what constitutes personally identifiable information is a vexed question which, as a society, we are still working through. Researchers will need to explore these issues with participants and work to ensure that data generated can be anonymised in a way that enables the released data to effectively support the claims made from it. This is a challenging area which requires significant further technical, policy, and ethics work.

2 c) How should principles apply to research that is entirely privately-funded but with possible public implications?

It is clear that publicly funded research is a public good. By contrast privately funded research is properly a private good and the decision to release or not release research outputs lies with the funder.

It is worth noting that much of the privately funded research in UK universities is significantly subsidised through the provision of public infrastructure and this should be taken into consideration when defining publicly and privately funded research. Here I consider research that is 100% privately funded.

Where claims are made on the basis of privately funded research (e.g. of environmental impact or the efficacy of health treatments) then such claims SHOULD be fully supported by provision of the underlying evidence and data if they are to be credible. Where such claims are intended to influence public policy such evidence and data MUST be made available. That is, evidence based public policy must be supported by the publication of the full evidence regardless of the source of that evidence. Claims made to influence public policy that are not supported by provision of evidence must be discounted for the purposes of making public policy.

2 d) How should principles apply to research or communication of data that involves the promotion of the public interest but which might have implications for the privacy interests of citizens?

See above: the right to privacy trumps any requirement to release raw data. Nonetheless research should be structured and appropriate consent obtained to ensure that claims made on the basis of the research can be supported by an adequate, publicly accessible, evidence base.

3. What activities are currently under way that could improve the sharing and communication of scientific information?

A wide variety of technical initiatives are underway to enable the wider collection, capture, archival and distribution of research outputs including narrative, data, materials, and other elements of the research process. It is technically possible for us today to immediately publish the entire research record if we so choose. Such an extreme approach is resource intensive, challenging, and probably not ultimately a sensible use of resources. However it is clear that more complete and rapid sharing has the potential to increase the effectiveness and efficiency of research.

The challenges in exploiting these opportunities are fundamentally cultural. The research community is focussed almost entirely on assessment through the extremely narrow lens of publication of extended narratives in high profile peer reviewed journals. This cultural bias must be at least partially reversed before we can realise the opportunities that technology affords us. This involves advocacy work, policy development, addressing the incentives for researchers, and above all the slow and arduous process of returning the research culture to one which takes responsibility for the return on the public investment, including economic, health, social, education, and research returns, and one that takes responsibility for effective communication of research outputs.

4. How do/should new media, including the blogosphere, change how scientists conduct and communicate their research?

New media (not really new any more and increasingly part of the mainstream) democratise access to communications and increase the pace of communication. This is not entirely a good thing and en masse the quality of the discourse is not always high. High quality depends on the good will, expertise, and experience of those taking part. There is a vast quantity of high quality, rapid response discourse that occurs around research on the web today, even if it occurs in many places. The most effective means of determining whether a recent high profile communication stands up to criticism is to turn to discussion on blogs and news sites, not to wait months for a possible technical criticism to appear in a journal. In many ways this is nothing new; it is a return to the traditional approaches of communication seen at the birth of the Royal Society itself, of direct and immediate communication between researchers by the most efficient means possible: letters in the 17th century and the web today.

Alongside the potential for more effective communication of researchers with each other there is also an enormous potential for more effective engagement with the wider community, not merely through “news and views” pieces but through active conversation, and indeed active contributions from outside the academy. A group of computer consultants are working to contribute their expertise in software development to improving legacy climate science software. This is a real contribution to the research effort. Equally the right question at the right time may come from an unexpected source but lead to new insights. We need to be open to this.

At the same time there is a technical deficiency in the current web and that is the management of the sheer quantity of potential connections that can be made. Our most valuable resource in research is expert attention. This attention may come from inside or outside the academy but it is a resource that needs to be efficiently directed to where it can have the most impact. This will include the necessary development of mechanisms that assist in choosing which potential contacts and information to follow up. These are currently in their infancy. Their development is in any case a necessity to deal with the explosion of traditional information sources.

5. What additional challenges are there in making data usable by scientists in the same field, scientists in other fields, ‘citizen scientists’ and the general public?

Effective sharing of data, and indeed of most research outputs, remains a significant challenge. The problem is two-fold: first, ensuring sufficient contextual information that an expert can understand the potential uses of the research output; second, placing that contextual information in a narrative that is understandable to the widest possible range of users. These are both significant challenges that are being tackled by a large number of skilled people. Progress is being made but a great deal of work remains in developing the tools, techniques, and processes that will enable the cost effective sharing of research outputs.

A key point however is that in a world where publication is extremely cheap, simply releasing whatever outputs exist in their current form can still have a positive effect. Firstly, where the cost of release is effectively zero, even a small chance of those data being discovered and re-used will still lead to positive outcomes in aggregate. Secondly, the presence of this underexploited resource of released, but insufficiently marked up and contextualised, data will drive the development of real systems that will make them more useful.

6 a) What might be the benefits of more widespread sharing of data for the productivity and efficiency of scientific research?

Fundamentally more efficient, more effective, and more publicly engaging research. Less repetition and needless rediscovery of negative results and ideally more effective replication and critiquing of positive results are enabled by more widespread data sharing. As noted above another important outcome is that even suboptimal sharing will help to drive the development of tools that will help to optimise the effective release of data.

6 b) What might be the benefits of more widespread sharing of data for new sorts of science?

The widespread sharing of data has historically always led to entirely new forms of science. The modern science of crystallography is based largely on the availability of crystal structures, bioinformatics would simply not exist without GenBank, the PDB, and other biological databases, and the astronomy of today would be unrecognizable to someone whose career ended prior to the availability of the Sloan Digital Sky Survey. Citizen science projects of the type of Galaxy Zoo, Foldit and many others are inconceivable without the data to support them. Extrapolating from this evidence provides an exciting view of the possibilities. Indeed one which it would be negligent not to exploit.

6 c) What might be the benefits of more widespread sharing of data for public policy?

Policy making that is supported by more effective evidence is something that appeals to most scientists. Of course public policy making is never that simple. Nonetheless it is hard to see how a more effective and comprehensive evidence base could fail to support better evidence based policy making. Indeed it is to be hoped that a wide evidence base, and the contradictions it will necessarily contain, could lead to a more sophisticated understanding of the scope and critique of evidence sources.

6 d) What might be the benefits of more widespread sharing of data for other social benefits?

The potential for wider public involvement in science is a major potential benefit. As in c) above, a deeper understanding of how to treat and parse evidence and data throughout society can only be positive.

6 e) What might be the benefits of more widespread sharing of data for innovation and economic growth?

Every study of the release of government data has shown that it leads to a nett economic benefit. This is true even when such data has traditionally been charged for. The national economy benefits to a much greater extent than any potential loss of revenue. While this is not necessarily sufficient incentive for private investors to release data, in the case of public investment the object is to maximise national ROI. Therefore release in a fully open form is the rational economic approach.

The costs to SMEs of the lack of access to publicly funded research outputs are well established. Improved access will remove the barriers that currently stifle innovation and economic growth.

6 f) What might be the benefits of more widespread sharing of data for public trust in the processes of science?

There is both a negative and a positive side to this question. On the positive side, greater transparency, more potential for direct involvement, and a greater understanding of the process by which research proceeds will lead to greater public confidence. On the negative side, doing nothing is simply not an option. Recent events have shown not so much that the public has lost confidence in science and scientists, but that there is deep shock at the lack of transparency and the lack of availability of data.

If the research community does not wish to be perceived in the same way as MPs and other recent targets of public derision then we need to move rapidly to improve the degree of transparency and accessibility of the outputs of public research.

7. How should concerns about privacy, security and intellectual property be balanced against the proposed benefits of openness?

There is little evidence that the protection of IP supports a nett increase in the return on the public investment in research. While there may be cases where it is locally optimal to pursue IP protection to exploit research outputs and maximise ROI, this is not generally the case. The presumption that everything should be patented is both draining resources and stifling British research. There should always be an avenue for taking this route to exploitation, but there should be a presumption of open communication of research outputs, and the need for IP protection should be justified on a case by case basis. It should be unacceptable for the pursuit of IP protection to damage the communication and downstream exploitation of research.

Privacy issues and concerns around the personal security of researchers have been discussed above. National security issues will in many cases fall under a justifiable exception to the presumption of openness, although it is clear that this needs care, and probably oversight, to retain public confidence.

8. What should be expected and/or required of scientists (in companies, universities or elsewhere), research funders, regulators, scientific publishers, research institutions, international organisations and other bodies?

British research could benefit from a statement of values, something that has the cultural significance of the Haldane principle (although perhaps better understood) or the Hippocratic oath. A shared cultural statement that captures a commitment to efficiently discharging the public trust invested in us, to open processes as a default, and to specific approaches where appropriate would act as a strong centre around which policy and tools could be developed. Leadership is crucial here in setting values and embedding these within our culture. Organisations such as the Royal Society have an important role to play.

Researchers and the research community need to take these responsibilities upon ourselves in a serious and considered manner. Funders and regulators need to provide a policy framework and, where appropriate, community sanctions for transgressions of important principles. Research institutions are for the most part tied into current incentive systems that are tightly coupled to funding arrangements, and have limited freedom of movement. Nonetheless, serious consideration of the ROI of technology transfer arrangements, and of how non-traditional outputs, including data, contribute to the work of the institution and its standing, is required. In the current economic climate successful institutions will diversify in their approach. Those that do not are unlikely to survive in their current form.

Other comments

This is not the first time that the research community has faced this issue. Indeed it is not even the first time the Royal Society has played a central role. Several hundred years ago it was a challenge to persuade researchers to share information at all. Results were hidden. Sharing was partial, confined to tight circles, and usually limited in scope. The precursors of the Royal Society played a key role in persuading the community that effective sharing of research outputs would improve research. Many of the same concerns were raised: concerns about the misuse of those outputs, about others stealing ideas, and about personal prestige and the potential embarrassment of getting things wrong.

The development of journals, and of a values system that demanded that results be made public, took time and leadership, and with the technology of the day the best possible system was developed over an extended period. With a new technology now available we face the same issues and challenges. It is to be hoped that we tackle those challenges and opportunities with the same sense of purpose.


(S)low impact research and the importance of open in maximising re-use

[Image: Open, by tribalicious via Flickr]

This is an edited version of the text I spoke from at the Altmetrics Workshop in Koblenz in June. An audio recording of the talk is also available, as is the submitted abstract for the workshop.

I developed an interest in research evaluation as an advocate of open research processes. It is clear that researchers are not going to change themselves, so someone is going to have to change them, and it is funders who wield the biggest stick. The only question, I thought, was how to persuade them to use it.

Of course it’s not that simple. It turns out that funders are highly constrained as well. They can lead from the front, but not too far out in front, if they want to retain the confidence of their community. And the actual decision-making processes remain dominated by senior researchers: successful senior researchers with little interest in rocking the boat too much.

The thing you realize as you dig deeper into this is that the key lies in finding motivations that work across the interests of different stakeholders. The challenge lies in finding the shared objectives: what is it that unites researchers and funders, as well as government and the wider community? So what can we find that is shared?

I’d like to suggest that one answer to that is Impact. The research community as a whole has a stake in convincing government that research funding is well invested. Government also has a stake in understanding how to maximize the return on its investment. Researchers do want to make a difference, even if that difference is a long way off. You need a scattergun approach to get the big results, but that means supporting a diverse range of research in the knowledge that some of it will go nowhere but some of it will pay off.

Impact has a bad name, but if we step aside from the gut reactions and look at what we actually want out of research, then we start to see a need to raise some challenging questions. What is research for? What is its role in our society, really? What outcomes would we like to see from it, and over what timeframes? What would we want to evaluate those outcomes against? Economic impact, yes, as well as social, health, policy, and environmental impact; in Australia this is called the ‘triple bottom line’. But alongside these there is also research impact.

All these have something in common. Re-use. What we mean by impact is re-use. Re-use in industry, re-use in public health and education, re-use in policy development and enactment, and re-use in research.

And this frame brings some interesting possibilities. We can measure some types of re-use: citations, retweets, re-use of data, materials, methods, or software. We can think about gathering evidence of other types of re-use, and about improving the systems that acknowledge re-use. If we can expand the culture of citation and linking to new objects and new forms of re-use, particularly for objects on the web, where there is some good low-hanging fruit, then we can gather a much stronger and more comprehensive evidence base to support all sorts of decision making.

There are also problems and challenges, the same ones that any social metrics bring: concentration and community effects, and the Matthew effect of the rich getting richer. We need to understand these feedback effects much better, and I am very glad there are significant projects addressing this.

But there is also something more compelling for me in this view. It lets us reframe the debate around basic research. The argument goes that we need basic research to support future breakthroughs. We know neither what we will need nor where it will come from. But we know that it’s very hard to predict; that’s why we support curiosity-driven research as an important part of the portfolio of projects. Yet the dissemination of this investment in the future is amongst the weakest in our research portfolio. At best a few papers are released, then hidden in journals that most of the world has no access to, in many cases without the data or other products being indexed or even made available. And this lack of effective dissemination is often because the work is perceived as low (or, perhaps better, slow) impact.

We may not be able to demonstrate or to measure significant re-use of the outputs of this research for many years. But what we can do is focus on optimizing the capacity, the potential, for future exploitation. Where we can’t demonstrate re-use and impact we should demand that researchers demonstrate that they have optimized their outputs to enable future re-use and impact.

And this brings me full circle. My belief is that the way to ensure the best opportunities for downstream re-use, over all timeframes, is to make research outputs open, in the Budapest Declaration sense. But we don’t have to take my word for it; we can gather evidence. Making everything naively open will not always be the best answer, but we need to understand where that is the case and how best to deal with it. We need to gather evidence of re-use over time to understand how to optimize our outputs to maximize their impact.

But if we choose to value re-use, to value the downstream impact that our research has, or could have, then we can make this debate not about politics or ideology but about how best to take the public investment in research and invest it for the outcomes that we need as a society.


Wears the passion? Yes it does rather…

[Image: Leukemia cells, via Wikipedia]

Quite some months ago an article in Cancer Biology & Therapy by Scott Kern of Johns Hopkins kicked up an almighty online stink. The article, entitled “Where’s the passion?”, bemoaned the lack of hard-core dedication among the younger researchers the author saw around him, starting with:

It is Sunday afternoon on a sunny, spring day.

I’m walking the halls—all of them—in a modern $59 million building dedicated to cancer research. A half hour ago, I completed a stroll through another, identical building. You see, I’m doing a survey. And the two buildings are largely empty.

The point being that if they really cared, those young researchers would be there day in, day out, working their hearts out to get to the key finding. At one level this is risible: expecting everyone to work 24×7 is not a good or efficient way to get results. Furthermore, you have to wonder why these younger researchers have “lost their passion”. Why doesn’t the environment create that naturally? What messages are the tenured staff sending through their actions? But I’d be being dishonest if I didn’t admit to a twinge of sympathy as well. Anyone who’s run a group has had that thought at the back of their mind: “if only they’d work harder/smarter/longer we’d be that much further ahead…”.

But all of that has been covered by others. What jumped out of the piece for me at the time were some other passages, ones that really got me angry.

When the mothers of the Mothers March collected dimes, they KNEW that teams, at that minute, were performing difficult, even dangerous, research in the supported labs. Modern cancer advocates walk for a cure down the city streets on Saturday mornings across the land. They can comfortably know that, uh…let’s see here…, some of their donations might receive similar passion. Anyway, the effort should be up to full force by 10 a.m. or so the following Monday.

[…]

During the survey period, off-site laypersons offer comments on my observations. “Don’t the people with families have a right to a career in cancer research also?” I choose not to answer. How would I? Do the patients have a duty to provide this “right”, perhaps by entering suspended animation?

Now these are all worthy statements. We’d all like to see faster development of cures, and I’ve no doubt that the people out there pounding the streets are driven to do all they can to see those cures advance. But is the real problem here whether the postdocs are in on a Sunday afternoon, or are there other things we could do to advance this? Maybe there are other parts of the research enterprise that could be made more efficient… like, I don’t know, making the results of research widely available and ensuring that others are in the best position possible to build on them?

It would be easy to pick on Kern’s record on publishing open access papers. Has he made all the efforts that would enable patients and doctors to make the best decisions they can on the basis of his research? His lab generates cell lines that can support further research. Are those freely available for others to use and build on? But to pick on Kern personally is to completely miss the point.

No, the problem is that this is systemic. Researchers across the board seem to have no interest whatsoever in looking closely at how we might deliver outcomes faster. No-one is prepared to think about how the system could be improved so as to deliver more, because everyone is too focussed on climbing the greasy pole: writing the next big paper and landing the next big grant. What is worse is that it is precisely in those areas where there is the most public effort to raise money, where there is a desperate need, that attitudes towards making research outputs available are at their worst.

What made me absolutely incandescent about this piece was a small piece of data that some of us have known about for a while but which has only just been published. Heather Piwowar, who has done a lot of work on how and where people share, took a close look at the sharing of microarray data: what kinds of things are correlated with data sharing? The paper bears close reading (full disclosure: I was the academic editor for PLoS ONE on this paper), but one thing has stood out for me as shocking since the first time I heard Heather discuss it: microarray data linked to studies of cancer is systematically less likely to be shared.

This is not an isolated case. Across the board there are serious questions to be asked about why it seems so difficult to get the data from studies that relate to cancer. I don’t want to speculate on the reasons, because whatever they are, they are unacceptable. I know I’ve recommended this video of Josh Sommer speaking many times before, but watch it again. Then read Heather’s paper. And then decide what you think we need to do about it. Because this cannot go on.


Evidence to the European Commission Hearing on Access to Scientific Information

[Image: European Commission, by tiseb via Flickr]

On Monday 30 May I gave evidence at a European Commission hearing on Access to Scientific Information. This is the text that I spoke from. Just to reinforce my usual disclaimer: I was not speaking on behalf of my employer but as an independent researcher.

We live in a world where there is more information available at our fingertips than even existed 10 or 20 years ago. Much of what we use to evaluate research today was built in a world where the underlying data was difficult and expensive to collect. Companies were built, massive datasets were collected and curated, and our whole edifice of reputation building and assessment grew up based on what was available. As the systems became more sophisticated, new measures were incorporated, but the fundamental basis of those systems was never questioned. Somewhere along the line we forgot that we had never actually been measuring what mattered, just what we could.

Today we can track, measure, and aggregate much more information, and in much greater detail. It is not just that we can ask how often a dataset is being downloaded; we can ask who is downloading it, academics or school children, and, beyond that, who wrote the blog post or Facebook share that led to a spike in downloads.

This is technically feasible today, and make no mistake, it will happen. It offers enormous potential benefits. But in my view it should also give us pause. It gives us a real opportunity to ask why it is that we are measuring these things. The richness of the answers available to us means we should spend some time working out what the right questions are.

There are many reasons for evaluating research and researchers. I want to touch on just three. The first is researchers evaluating themselves against their peers. While this is informed by data, it will always be highly subjective and will vary discipline by discipline. It is worthy of study, but not, I think, something that is amenable to policy intervention.

The second area is in attempting to make objective decisions about the distribution of research resources. This is clearly a contentious issue. Formulaic approaches can be made more transparent and less open to legal attack, but they are relatively easy to game. A deeper challenge is that, by their nature, all metrics are backwards-looking: they can only report on things that have already happened. Indicators are generally lagging (true of most of the measures in wide current use), but what we need are leading indicators. It is likely that human opinion will continue to beat naive metrics in this area for some time.

Finally there is the question of using evidence to design the optimal architecture for the whole research enterprise. Evidence-based policy making in research policy has historically been sadly lacking. We have an opportunity to change that by building a strong, transparent, and useful evidence base, but only if we simultaneously work to understand the social context of that evidence. How does collecting information change researcher behavior? How are these measures gamed? What outcomes are important? How does all of this differ across national and disciplinary boundaries, or amongst age groups?

It is my belief, shared with many who will speak today, that open approaches will lead to faster, more efficient, and more cost-effective research. Other groups and organizations have concerns around business models, quality assurance, and the sustainability of these newer approaches. We don’t need to argue about this in a vacuum. We can collect evidence, debate what the most important measures are, and come to an informed and nuanced conclusion based on real data and real understanding.

To do this we need to take action in a number of areas:

1. We need data on evaluation, and we need to be able to share it.

Research organizations must be encouraged to maintain records of the downstream usage of their published artifacts. Where there is a mandate for data availability, this should include mandated public access to data on usage.

The commission and national funders should clearly articulate that the provision of usage data is a key service for publishers of articles, data, and software to provide, and that where a direct payment is made for publication, provision of such data should be included. Such data must be technically and legally reusable.

The commission and national funders should support work towards standardizing vocabularies and formats for this data, as well as critiquing its quality and usefulness. This work will necessarily be diverse, with disciplinary, national, and object-type differences, but there is value in coordinating actions. At a recent workshop where funders, service providers, developers, and researchers convened, we made significant progress towards agreeing on routes towards standardization of the vocabularies used to describe research outputs.
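To make the ambition concrete, here is a purely illustrative sketch of what a shareable usage record might contain. The field names are hypothetical and are not drawn from any agreed standard vocabulary; the point is only that, whatever vocabulary is eventually agreed, records like this need to be expressible in a common, machine-readable form so that usage data from different publishers and repositories can be aggregated.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical sketch of a shareable usage event; field names are
# illustrative only and do not follow any agreed standard vocabulary.
@dataclass
class UsageEvent:
    object_id: str         # persistent identifier of the artifact, e.g. a DOI
    object_type: str       # "article", "dataset", "software", ...
    event_type: str        # "download", "view", "citation", ...
    occurred_at: datetime  # when the event took place
    source: str            # repository or publisher reporting the event

# Example record for a single dataset download (the identifier is made up).
event = UsageEvent(
    object_id="10.1234/example-dataset",
    object_type="dataset",
    event_type="download",
    occurred_at=datetime(2011, 5, 30, 10, 0),
    source="institutional-repository",
)
print(event)
```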

2. We need to integrate our systems of recognition and attribution into the way the web works, by identifying research objects and linking them together in standard ways.

The effectiveness of the web lies in its framework of addressable items connected by links. Researchers have a strong culture of making links and recognizing contributions through attribution and citation of scholarly articles and books, but this has only recently been surfaced in a way that consumer web tools can view and use. And practice is patchy and inconsistent for new forms of scholarly output such as data, software, and online writing.
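As one concrete illustration of how scholarly links can be surfaced for web tools, the minimal sketch below uses standard HTTP content negotiation against doi.org to turn a DOI into machine-readable citation metadata (CSL JSON). The DOI shown is a placeholder, so the request only succeeds when a real, registered identifier is substituted.

```python
import requests

# Resolve a DOI to machine-readable citation metadata (CSL JSON) using
# content negotiation on doi.org. The DOI below is a placeholder.
doi = "10.1234/example-dataset"
response = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/vnd.citationstyles.csl+json"},
    timeout=30,
)
response.raise_for_status()
metadata = response.json()

# Title and author family names, as a consumer web tool might display them.
print(metadata.get("title"))
print([author.get("family") for author in metadata.get("author", [])])
```

The same mechanism works for articles, datasets, and other registered objects, which is what makes persistent identifiers and standard link formats so valuable for machine re-use.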

The commission should support efforts to open up scholarly bibliography to the mechanics of the web through policy and technical actions. The recent Hargreaves report explicitly notes limitations on text mining and information retrieval as an area where the EU should act to modernize copyright law.

The commission should act to support efforts to develop, and gain wide community support for, unique identifiers for research outputs and for researchers. Again these efforts are diverse, and it will be community adoption that determines their usefulness, but coordination and communication actions will be useful here. Where there is critical mass, as may be the case for ORCID and DataCite, this crucial cultural infrastructure should merit direct support.

Similarly, the commission should support actions to develop standardized expressions of links, through citation and linking standards for scholarly material. Again the work of DataCite, CODATA, Dryad, and other initiatives, as well as technical standards development, is crucial here.

3. Finally, we must closely study the context in which our data collection and indicator assessment develops. Social systems cannot be measured without perturbing them, and we can do no good with data or evidence if we do not understand and respect both the systems being measured and the effects of implementing any policy decision.

We need to understand the measures we might develop, what forms of evaluation they are useful for, and how change can be effected where appropriate. This will require significant work, as well as an appreciation of the close coupling of the whole system.

We have a generational opportunity to make our research infrastructure better through effective evaluation and evidence-based policy making and architecture development. But we will squander this opportunity if we either take a utopian view of what might be technically feasible, or fail to act for fear of a dystopian future. The way forward is a careful, timely, transparent, and thoughtful approach to understanding ourselves and the system we work within.

The commission should act to ensure that current nascent efforts work efficiently towards delivering the technical, cultural, and legal infrastructure that will support an informed debate through a combination of communication, coordination, and policy actions.
