Chapter, Verse, and CHORUS: A first pass critique

And this is the chorus
This is the chorus
It goes round and around and gets into your brain
This is the chorus
A fabulous chorus
And thirty seconds from now you’re gonna hear it again

This is the Chorus – Morris Minor and the Majors

The Association of American Publishers have launched a response to the OSTP White House Executive Order on public access to publicly funded research. In it they offer to set up a registry or system called CHORUS, which they suggest can provide the same levels of access to research funded by Federal Agencies as would the widespread adoption of existing infrastructure like PubMedCentral. It is worth bearing in mind that this is substantially the same group that put together the Research Works Act, a group with a long-standing, and in some cases personal, antipathy to the success of PubMedCentral. There are therefore some grounds for scepticism about the motivations behind the proposal.

However, here I want to dig a bit more into the details of whether the proposal can deliver. I will admit to being sceptical from the beginning, but the more I think about this, the more it seems that either there is nothing there at all – just a restatement of already announced initiatives – or alternatively the publishers involved are setting themselves up for a potentially hugely expensive failure. Let’s dig a little deeper to see where the problems lie.

First the good bits. The proposal is to leverage FundRef to identify federally funded research papers that will be subject to the Executive Order. FundRef is a newly announced CrossRef initiative that will include funder and grant information within the core metadata that CrossRef collects and provides to users, and it will start to address the issues of data quality and completeness. To the extent that this is a commitment from a large group of publishers to support FundRef, it is a very useful step forward. Based on the available funding information, the publishers would then signal that these papers are accessible, and this information would be used to populate a registry. Papers in the registry would be made available via the publisher websites in some manner.
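To make the mechanics concrete, here is a minimal sketch of pulling funder information out of a CrossRef-style works record. The field names (`message.funder` with `name`, `DOI`, `award`) follow the shape of the public CrossRef REST API, but the record itself – the article DOI and the award number – is invented for illustration:

```python
# Sketch: extracting funder metadata from a CrossRef-style works record.
# The record below is hand-made in the shape returned by
# https://api.crossref.org/works/{doi}; the article DOI and award
# number are invented.

sample_record = {
    "message": {
        "DOI": "10.9999/example.12345",  # hypothetical article DOI
        "funder": [
            {"name": "National Science Foundation",
             "DOI": "10.13039/100000001",   # FundRef funder identifier
             "award": ["ABC-1234567"]},     # invented award number
            {"name": "Example Trust"},      # no award listed: common in practice
        ],
    }
}

def extract_funders(record):
    """Return (funder_name, award_list) pairs from a CrossRef works record."""
    funders = record.get("message", {}).get("funder", [])
    return [(f.get("name"), f.get("award", [])) for f in funders]

print(extract_funders(sample_record))
```

In a real pipeline the record would come from an HTTP request to `https://api.crossref.org/works/{doi}`; the point here is how sparse the `funder` field can be – the second funder carries no award number at all, which is exactly the data-quality problem that matters for any compliance registry.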

Now the difficulties. You will note two sets of weasel words in the previous paragraph: “…the available funding information…” and “…made available via the publisher websites in some manner”. The second is really a problem for the publishers, but I think a much bigger one than they realise. Simply making the version of record available without restrictions is “easy”, but ensuring that access works properly in the context of a largely paywalled corpus is not as easy as people tend to think. Nature Publishing Group have spent years sorting out the fact that every time they do a system update they remove access to the genome papers that are supposed to be freely accessible. If publishers decide they just want to make the final author manuscripts available, then they will have to build a whole parallel infrastructure to provide them – an infrastructure that will look quite a lot like PubMedCentral, in fact, leading to duplication of effort and additional costs. This is probably less of an issue for the big publishers, but for small publishers it could become a real problem.

Bad for the agencies

But it’s the first set of weasel words that is the most problematic. The whole of CHORUS seems to be based on the assumption that the FundRef information will be both accurate and complete. Anyone who has dealt with funding information inside publication workflows knows this is far from true. Comparison of funder information pulled from different sources can give nearly disjoint sets. And we know that authors are terrible at giving the correct grant codes, when they can be bothered to include them at all. The Executive Order and FASTR put the agencies on the hook to report on success, compliance, and the re-use of published content. It is the agencies who get good information in the long term on the outputs of projects they fund – information that is often at odds with what is reported in the acknowledgement sections of papers.

Put this issue of data quality alongside the fact that the agencies will be relying on precisely those organisations that have worked to prevent, limit, and – where that failed – slow down the widening of public access, and we have a serious problem of mismatched incentives. For the publishers there is a direct incentive not to solve the data quality issue at the front end – it lets them make fewer papers available. The agencies are not in a position to force the issue at paper submission because their own data isn’t complete until the grant finally reports. The NIH already has high compliance and an operating system, precisely because it couples grant reports to deposition. Other agencies will struggle to catch up using CHORUS and will deliver very poor compliance measured against their own data. This is not a criticism of FundRef, incidentally. FundRef is a necessary and well designed part of the effort to solve this problem in the longer term – but it is going to take years for the necessary systems changes to work their way through, and there are big changes required to submission and editorial management systems to make this work well. And this brings us to the problems for publishers.

Bad for the publishers

If the agencies agree to adopt CHORUS they will do so with these issues very clear in their minds. Office of Management and Budget oversight means that agencies have to report very closely on cost-benefit analyses for new projects. This, alongside the incentive misalignment and just plain lack of trust, means that the agencies will do two things: they will insist that the costs are firewalled onto the publisher side, and they will put strong requirements on compliance levels and completeness. If I were an agency negotiator I would place a compliance requirement of 60% on CHORUS in year one, rising to 75% and 90% in years two and three, and stipulate that compliance be measured against final grant reports on an ongoing basis. Where compliance didn’t meet the requirements, the penalty would be for all the relevant papers from that publisher to be placed in PubMedCentral at the publisher’s expense. Even if the agencies are not this tough, they are certainly going to demand that the registry be updated, at the publisher’s expense, to include all the papers that got missed – an ongoing manual grind of metadata updates, paper corrections, and index notifications. Bear in mind that if we generously assume that 50% of submitted papers have good grant metadata, and that US agencies contribute to around 25% of all global publications, then around 10% of the entire corpus will need to be updated year on year, probably through a process of semi-automated and manual reconciliation. If you’ve worked with agency data then you know it’s generally messy and difficult to manage – and that this is being addressed by building shared repositories and data systems that leverage much of the tooling provided by PubMed and PubMedCentral.
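The back-of-envelope numbers in that paragraph can be made explicit. The 50% and 25% figures are the generous assumptions stated above, not measured values:

```python
# Back-of-envelope estimate using the assumptions stated in the text.
good_metadata_rate = 0.50   # fraction of submissions with usable grant metadata (assumed)
us_agency_share = 0.25      # US federal agencies' share of global publications (assumed)

# Papers that are agency funded but lack good metadata at submission
# must be reconciled against grant reports after the fact.
missing_fraction = us_agency_share * (1 - good_metadata_rate)
print(f"{missing_fraction:.1%} of the corpus needs year-on-year reconciliation")
```

That comes out at 12.5%, i.e. the “around 10%” figure above – and note that it scales with the *failure* rate of metadata collection, so every percentage point of front-end sloppiness is a percentage point of back-end reconciliation work.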

Alternatively this could be a “triggering event”, meaning that content would become available from archives like CLOCKSS and PORTICO because access wasn’t properly provided. Putting aside the potential damage to the publisher brand if this happens, and the fact that it destroys the central aim of CHORUS – to control the dissemination path – this will also cost money. These archives are not well set up to provide differential access to triggered content; they release whole journals when a publisher goes bust. It’s likely that a partial trigger would require specialist repository sites to be set up to serve the content – again, sites that would look an awful lot like PubMedCentral. The process is likely to lead to significantly more trigger events, requiring these dark repositories to function more actively as publishers, raising costs, and requiring them to build up repositories to serve content that would look an awful lot like…well, you get the idea.

Finally there is the big issue – this puts the costs of improving funding data collection firmly in the hands of CHORUS publishers, and means it needs to be done extremely rapidly. This work needs to be done, but it would be much better done through effective global collaboration between all funders, institutions, and publishers. What CHORUS has effectively done is offer to absorb the full cost of this transition. As noted above, the agencies will firewall their contributions. You can bet that institutions – whom CHORUS will not assist, and whose efforts to ensure the collection of research outputs it might even hamper – will not pay for it through increased subscriptions. And publishers who don’t want to engage with CHORUS will be unlikely to contribute. It’s also almost certain that this development process will be rushed and ham-fisted, and will irritate authors even more than current submission systems already do.

Finally, of course, a very large proportion of federal money moves through the NIH. The NIH has a system in place, it works, and they’re not about to adopt something new and unproven, especially given the popularity of PubMedCentral as demonstrated by the public response to the Research Works Act. So publishers will have to maintain dual systems anyway – indeed the most likely outcome of CHORUS will be to make it easier for authors to deposit works into PubMedCentral, and easier for the NIH to prod them into doing so, raising the compliance rates for the NIH policy and making it look even better in the annual reports to the White House, leading ultimately to some sharp questions about why agencies didn’t adopt PMC in the first place.

Bad for the user

From the perspective of an Open Access advocate, putting access into the hands of publishers who have actively worked to limit access, and who have invested vast sums of money in systems to limit and control it, seems a bad idea. But that’s a personal perspective – the publishers in question will say they are guiding these audiences to the “right” version of papers in the best place for them to consume it. But let’s look at the incentives for the different players. The agencies are on the hook to report on usage and impact of the work they fund. They have the incentive to ensure that whatever systems are in place work well and provide access well. Subscription publishers? They have a vested interest in trying to show there is a lack of public interest, in tweaking embargoes so as to make things available only after interest has waned, in providing systems that are poorly resourced so page loads are slow, and in general in making the experience as poor as possible. After all, if you need to show you’re adding value with your full-cost version, then it’s really helpful to be in complete control of the free version so as to cripple it. On the plus side, it would mean that these publishers would almost certainly be forced to provide detailed usage information, which would be immensely valuable.

…which is bad for the publishers…

The more I think about this, the less it seems to have been thought through in detail. Is it just a commitment to use FundRef? That would be a great step, but it goes nowhere near even beginning to satisfy the White House requirements. If it’s more than that, what is it? A registry? But that requires a crucial piece of metadata, which appears as “Licence Reference” in the diagram, needed to assert that things are available. This hasn’t been agreed yet (I should know, I’ve been involved in drafting the description). And even when it is, no piece of metadata can make sure access actually happens. Is it a repository that would guarantee access? No – that’s what the CHORUS members hate above all other things. Is it a firm contractual commitment to making those articles with agency grant numbers attached available? Not that I’ve seen, but even if it were, it wouldn’t address the requirements of either the Executive Order or FASTR. As noted above, the mandate applies to all agency funded research, not just papers where the authors remembered to put in all the correct grant numbers.
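A licence-reference check of the kind a registry would perform is easy to sketch – and the sketch makes the limitation obvious: the metadata can only *assert* availability, never verify it. The field names below are modelled loosely on the NISO Access and License Indicators work (a `free_to_read` element with a start date, plus a `license_ref` URL); the entry itself, including the DOI and licence URL, is invented:

```python
from datetime import date

# Sketch: deciding whether a registry entry asserts accessibility,
# modelled loosely on the NISO "Access and License Indicators" fields.
# The DOI, date, and licence URL below are all invented placeholders.

entry = {
    "doi": "10.9999/example.12345",                  # hypothetical DOI
    "free_to_read": {"start_date": "2014-06-01"},    # assumed embargo end
    "license_ref": "https://example.org/licenses/free-access",  # placeholder
}

def is_accessible(entry, on=None):
    """True if the metadata *asserts* the article is free to read on date `on`.

    Note this says nothing about whether the publisher's site actually
    serves the article without a paywall on that date.
    """
    on = on or date.today()
    ftr = entry.get("free_to_read")
    if not ftr:
        return False
    start = date.fromisoformat(ftr.get("start_date", "9999-12-31"))
    return on >= start

print(is_accessible(entry, on=date(2015, 1, 1)))
```

The check is trivial precisely because it is only a claim about the record, not about the reader’s experience – which is the gap between a registry and a repository.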

Is it a commitment to ensuring the global collection of comprehensive grant information at manuscript submission? With the funding to make it happen – and the funding to ensure the papers become available – and real penalties if it doesn’t? With provision of comprehensive usage data for both subscription and freely available content? This is the only level at which the agencies will bite. And this is a horrendous and expensive can of worms.

In the UK we have a Victorian infrastructure for delivering water. It just about works, but a huge proportion of the total just leaks out of the pipes – it’s not that we have a shortage of rain, but when we have a “drought” we quickly run into serious problems. The cost of fixing the pipes? Vastly more than we can afford. What I think happened with CHORUS is what happens with a lot of industry-wide tech projects. Someone had a bright idea and went to each player asking whether they could deliver their part of the pipeline. Each player has slightly overplayed the ease of delivery and slightly underplayed the leakage and problems. A few percent here and a few percent there isn’t a problem for each step in isolation – but along the whole pipeline it adds up to the point where the whole system simply can’t deliver. And delivering means replacing the whole set of pipes.

 


Draft White Paper – Researcher identifiers


On April 26 I am attending a joint meeting of the NSF and EuroHORCS (European Heads of Research Councils) on “Changing the Conduct of Science in the Information Age”. I have been asked to submit a one-page white paper in advance of the meeting and have been struggling a bit with it. This is stage one, a draft document on researcher identifiers. I’m not happy with it, but I reckon other people out there may well be able to help where I am struggling. I may write a second one on metrics, or at least a brief literature collection. Any and all comments welcome.

Summary

Citation lies at the core of research practice, both recognizing the contributions that others have made in the development of a specific piece of work and linking related knowledge together. The technology of the web makes it technically feasible to radically improve the precision, accuracy, and completeness of these links. Such improvements are crucial to the successful implementation of any system that purports to measure research outputs or quality.

Additionally, the web offers the promise of extended and distributed projects involving diverse parties, many of whom may not be professional researchers. Nonetheless such contributions deserve the same level of recognition as comparable contributions from professional researchers. Providing an open system in which people can contribute to research efforts and receive credit raises significant social issues of control and validation of identities.

Such open and federated systems are exposed to potential failure through lack of technical expertise on the part of users, particularly where a person loses an identity or creates a new one. In many ways this is already the case where we use institutional email addresses as proxies for researcher identity. Is D.C.Neylon@####.ac.uk (a no longer functioning email address) the same person as Cameron.Neylon@####.ac.uk? It is technically feasible to consider systems in which the actual identity token used is widely available and compatible with the wider consumer web, while centralised and trusted authorities provide validation services to confirm specific claims around identity. Such a broker or clearing house would provide a similar role for identities as CrossRef provides for scholarly articles via the DOI.

General points

  • By adding the concept of a semantic-web-ready researcher identifier – i.e. an identifier that provides a URL endpoint uniquely representing a specific researcher – the existing technical capacity of the semantic web stack can be used to provide a linked data representation of contributions to existing published research objects. Such a representation could readily be expanded beyond authorship to funder contributions.
  • Crediting a researcher as a contributor to a specific object published on the web is, in this view, a specific form of citation or linking.
  • The authoring tools to support the linking and publishing of research objects in this form do not currently exist in a widely useable form.
  • Semantic web technology provides an extensible means of adding and recognising diverse forms of contribution.
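As a toy illustration of the first bullet, a contribution can be written as a single linked-data triple and serialized as Turtle. The person and paper URLs below are invented ORCID- and DOI-style placeholders; the predicate is Dublin Core’s `dcterms:creator`, and a real system would use an RDF library rather than hand-rolled serialization:

```python
# Sketch: "Paper Y was created by Person X" as a linked-data triple.
# Both URIs are invented placeholders; only the dcterms:creator
# predicate is a real, widely used vocabulary term.

triples = [
    ("https://doi.example.org/10.9999/paper.42",        # hypothetical paper URI
     "http://purl.org/dc/terms/creator",                # Dublin Core predicate
     "https://orcid.example.org/0000-0000-0000-0001"),  # hypothetical person URI
]

def to_turtle(triples):
    """Serialize (subject, predicate, object) URI triples as minimal Turtle."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)

print(to_turtle(triples))
```

Because the person is just another URL-addressable object, extending this to other contribution types is a matter of choosing (or minting) further predicates, not of changing the infrastructure.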

Authorisation, validation, and control

  • Access to such identifiers must be under the control of the researcher and not limited to those with institutional affiliations. Any person must be able to obtain and control a unique researcher identifier that refers to them.
  • Authorisation and validation of claimed rights of access or connections to specific institutions can be technically handled separately from the provision of identifiers.

Technical and legal issues

  • OpenID and OAuth are developing internet standards that provide the technical means to achieve distributed availability of identifiers and to separate issues of authentication from those of identification. They are current leaders for federated identity and authorisation solutions on the consumer web.
  • OpenID and OAuth do not currently provide the levels of security required in several jurisdictions for personal or sensitive information (e.g. the UK Data Protection Act). Such federated systems may fall foul of jurisdictions with strong generic privacy requirements, e.g. Canada.
  • To interoperate with the wider web and enable a wider notion of citation as a declaration of a piece of knowledge “Person X authored Paper Y”, identities must resolve on the web, in the sense of being a clickable hyperlink that takes a human or machine reader to a page containing information representing that person.

Social issues

  • There are profound social issues of trust in the maintenance of such identifiers, especially for non-professional researchers in the longer term.
  • A centralised trusted authority (or authorities) that validates specific claims about identity (a “CrossRef for people”) might provide a trusted broker for identity transactions in the research space that solves many of these trust problems.
  • Issues around trust and jurisdiction as well as scope and control are likely to limit and fragment any effort to coordinate, federate, or integrate differing identity solutions in the research space. Therefore interoperability of any developed system with the wider web must be a prime consideration.

Conclusions

Identity, unique identifiers, authorisation of access and validation of claims are issues that need to be solved before any transparent and believable metric systems can be reliably implemented. In the current world ill-considered, non-transparent, and irreproducible metric systems will almost inevitably lead to legal claims. At the same time there is a massive opportunity for wider involvement in research for which a much more diverse range of people’s contributions will need recognition.

A system in which recognition and citation take the form of a link to a specified address on the web that represents a person has the potential both to make it much easier to unambiguously confirm who is being credited and to leverage an existing stack of tools and services to aggregate and organize information relating to identity. This is in fact a specific example of a wider view of addressable research objects on the web that can be part of a web of linked data. In this view a person is simply another object that can have specified relationships (links) to other objects.

Partial technical solutions in the form of OAuth and OpenID exist that solve some subset of these problems. These systems are currently not technically secure to a level compatible with handling the transfer of sensitive data, but they can interoperate with more secure transfer systems. They provide a federated and open system that enables any person to obtain and assert an identity and to control the appearance of that identity. Severe social issues around trust and persistence exist for this kind of system; these may be addressed through trusted centralized repositories that can act as a reliable broker.

Given expected issues with uptake of any system, systems that are interoperable with competitive or complementary offerings are crucial.
