PubMed Central – Science in the Open

And this is the chorus
This is the chorus
It goes round and around and gets into your brain
This is the chorus
A fabulous chorus
And thirty seconds from now you’re gonna hear it again

This is the Chorus - Morris Minor and the Majors

The Association of American Publishers have launched a response to the OSTP White House Executive Order on public access to publicly funded research. In this they offer to set up a registry or system called CHORUS which they suggest can provide the same levels of access to research funded by Federal Agencies as would the widespread adoption of existing infrastructure like PubMedCentral. It is necessary to bear in mind that this is substantially the same group that put together the Research Works Act, a group with a long-standing, and in some cases personal, antipathy to the success of PubMedCentral. There are therefore some grounds for scepticism about the motivations of the proposal.

However, here I want to dig a bit more into the details of whether the proposal can deliver. I will admit to being sceptical from the beginning but the more I think about this, the more it seems that either there is nothing there at all – just a restatement of already announced initiatives – or alternatively the publishers involved are setting themselves up for a potentially hugely expensive failure. Let’s dig a little deeper into this to see where the problems lie.

First the good bits. The proposal is to leverage FundRef to identify federally funded research papers that will be subject to the Executive Order. FundRef is a newly announced initiative from CrossRef which will include Funder grant information within the core metadata that CrossRef collects and can provide to users and will start to address the issues of data quality and completeness. To the extent that this is a commitment from a large group of publishers to support FundRef it is a very useful step forward. Based on the available funding information the publishers would then signal that these papers are accessible and this information would be used to populate a registry. Papers that are in the registry would be made available via the publisher websites in some manner.

Now the difficulties. You will note two sets of weasel words in the previous paragraph: “…the available funding information…” and “…made available via the publisher websites in some manner”. The second is really a problem for the publishers but I think a much bigger one than they realise. Simply making the version of record available without restrictions is “easy” but ensuring that access works properly in the context of a largely paywalled corpus is not as easy as people tend to think. Nature Publishing Group have spent years sorting out the fact that every time they do a system update they remove access to the genome papers that are supposed to be freely accessible. If publishers decide they just want to make the final author manuscripts available then they will have to build up a whole parallel infrastructure to provide these – an infrastructure that will look quite a lot like PubMedCentral in fact, leading to potential duplication of effort and potential costs. This is probably less of an issue for the big publishers but for small publishers it could become a real issue.

Bad for the agencies

But it’s the first set of weasel words that are the most problematic. The whole of CHORUS seems to be based on the assumption that the FundRef information will be both accurate and complete. Anyone who has dealt with funding information inside publication workflows knows this is far from true. Comparison of funder information pulled from different sources can give nearly disjoint sets. And we know that authors are terrible at giving the correct grant codes when they can be bothered including them at all. The Executive Order and FASTR put the agencies on the hook to report on success, compliance, and the re-use of published content. It is the agencies who get good information in the long term on the outputs of projects they fund – information that is often at odds with what is reported in the acknowledgement sections of papers.
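
To make “nearly disjoint” concrete, here is a minimal sketch of how the overlap between two funding sources could be measured. The paper identifiers, the two source names and the `overlap` helper are all made up for illustration; nothing here comes from FundRef or any agency system.

```python
def overlap(source_a, source_b):
    """Jaccard overlap: papers found in both sources / papers found in either."""
    a, b = set(source_a), set(source_b)
    union = a | b
    # Two empty sources agree trivially.
    return len(a & b) / len(union) if union else 1.0


# Hypothetical identifiers: papers attributed to one grant by two sources.
from_acknowledgements = {"paper-1", "paper-2", "paper-3", "paper-4"}
from_grant_reports = {"paper-4", "paper-5", "paper-6", "paper-7"}

# Only one paper in seven appears in both lists.
print(overlap(from_acknowledgements, from_grant_reports))
```

A score near 1 would mean the two sources agree; a score near 0 is the nearly disjoint case described above.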

Put this issue of data quality alongside the fact that the agencies will be relying on precisely those organisations that have worked to prevent, limit, and where that failed slow down the widening of public access and we have a serious problem of mismatched incentives. For the publishers there is a direct incentive to fail to solve the data quality issue at the front end – it lets them make fewer papers available. The agencies are not in a position to force this issue at paper submission because their data isn’t complete until the grant finally reports. The NIH already has high compliance and a functioning system, precisely because they couple grant reports to deposition. Other agencies will struggle to catch up using CHORUS and will deliver very poor compliance based on their own data. This is not a criticism of FundRef incidentally. FundRef is a necessary and well designed part of the effort to solve this problem in the longer term – but it is going to take years for the necessary systems changes to work their way through and there are big changes required to submission and editorial management systems to make this work well. And this brings us to the problems for publishers.

Bad for the publishers

If the agencies agree to adopt CHORUS they will do so with these issues very clear in their minds. The Office of Management and Budget oversight means that agencies have to report very closely on cost-benefit analyses for new projects. This alongside the issues with incentive misalignment, and just plain lack of trust, means that the agencies will do two things: they will insist that the costs are firewalled onto the publisher side, and they will put strong requirements on compliance levels and completeness. If I were an agency negotiator I would place a compliance requirement of 60% on CHORUS in year one rising to 75% and 90% in years two and three and stipulate that that compliance will be measured against final grant reports on an ongoing basis. Where compliance didn’t meet the requirements the penalty would be for all the relevant papers from that publisher to be placed in PubMedCentral at the publisher’s expense. Even if they’re not this tough they are certainly going to demand that the registry be updated to include all the papers that got missed, at the publisher’s expense, necessitating an on-going manual grind of metadata updates, paper corrections and index notifications. Bear in mind that if we generously assume that 50% of submitted papers have good grant metadata and the US agencies contribute to around 25% of all global publications, this means around 10% of the entire corpus will need to be updated year on year, probably through a process of semi-automated and manual reconciliation. If you’ve worked with agency data then you know it’s generally messy and difficult to manage – this is being worked on by building shared repositories and data systems that leverage a lot of the tooling provided by PubMed and PubMedCentral.
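
The back-of-envelope numbers in this paragraph can be written out as a small sketch. The 25% and 50% figures are the round assumptions used above, the targets are the hypothetical negotiating position, and the function names are mine. Strictly, 25% times 50% is 12.5%, the same order as the “around 10%” quoted.

```python
# Round assumptions taken from the paragraph above, not measured values.
US_AGENCY_SHARE = 0.25      # share of global publications with US agency funding
GOOD_METADATA_SHARE = 0.50  # share of those papers with usable grant metadata

# Hypothetical targets from the text: year -> minimum compliance.
TARGETS = {1: 0.60, 2: 0.75, 3: 0.90}


def share_needing_update(agency_share, good_share):
    """Fraction of the whole corpus that is agency funded but badly tagged."""
    return agency_share * (1 - good_share)


def compliance(registry_papers, grant_report_papers):
    """Share of papers in the final grant reports that the registry also lists."""
    reported = set(grant_report_papers)
    if not reported:
        return 1.0
    return len(reported & set(registry_papers)) / len(reported)


def meets_target(year, registry_papers, grant_report_papers):
    """True if measured compliance reaches that year's target."""
    return compliance(registry_papers, grant_report_papers) >= TARGETS[year]


print(share_needing_update(US_AGENCY_SHARE, GOOD_METADATA_SHARE))  # 0.125
```

The point of measuring against `grant_report_papers` rather than against the registry itself is that the denominator comes from the agency’s own data, which is exactly where the publisher-supplied metadata is weakest.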

Alternatively this could be a “triggering event” meaning that content would become available in archives like CLOCKSS and PORTICO because access wasn’t properly provided. Putting aside the potential damage to the publisher brand if this happens, and the fact that it destroys the central aim of CHORUS – to control the dissemination path – this will also cost money. These archives are not well set up to provide differential access to triggered content; they release whole journals when a publisher goes bust. It’s likely that a partial trigger would require specialist repository sites to be set up to serve the content – again sites that would look an awful lot like PubMedCentral. The process is likely to lead to significantly more trigger events, requiring these dark repositories to function more actively as publishers, raising costs, and requiring them to build up repositories to serve content that would look an awful lot like…well you get the idea.

Finally there is the big issue – this puts the costs of improving funding data collection firmly in the hands of CHORUS publishers and means it needs to be done extremely rapidly. This work needs to be done, but it would be much better done through effective global collaboration between all funders, institutions and publishers. What CHORUS has effectively done is offer to absorb the full cost of this transition. As noted above the agencies will firewall their contributions. You can bet that institutions – which CHORUS will not assist, and whose efforts to ensure the collection of research outputs it might hamper – will not pay for it through increased subscriptions. And publishers who don’t want to engage with CHORUS will be unlikely to contribute. It’s also almost certain that this development process will be rushed and ham-fisted and irritate authors even more than they already are by current submission systems.

Finally of course a very large proportion of federal money moves through the NIH. The NIH has a system in place, it works, and they’re not about to adopt something new and unproven, especially given the popularity of PubMedCentral as demonstrated by the public response to the Research Works Act. So publishers will have to maintain dual systems anyway – indeed the most likely outcome of CHORUS will be to make it easier for authors to deposit works into PubMedCentral, and easier for the NIH to prod them into doing so, raising the compliance rates for the NIH policy and making them look even better on the annual reports to the White House, leading ultimately to some sharp questions about why agencies didn’t adopt PMC in the first place.

Bad for the user

From the perspective of an Open Access advocate putting access into the hands of publishers who have actively worked to limit access and invested vast sums of money in systems to limit and control access seems a bad idea. But that’s a personal perspective – the publishers in question will say they are guiding these audiences to the “right” version of papers in the best place for them to consume it. But let’s look at the incentives for the different players. The agencies are on the hook to report on usage and impact of their work. They have the incentives to ensure that whatever systems are in place work well and provide access well. Subscription publishers? They have a vested interest in trying to show there is a lack of public interest, in tweaking embargoes so as to only make things available after interest has waned, in providing systems that are poorly resourced so page loads are slow, and in general making the experience as poor as possible. After all if you need to show you’re adding value with your full cost version, then it’s really helpful to be in complete control of the free version so as to cripple it. On the plus side it would mean that these publishers would almost certainly be forced to provide detailed usage information which would be immensely valuable.

…which is bad for the publishers…

The more I think about this, the less it seems to have been thought through in detail. Is it just a commitment to use FundRef? This would be a great step but it goes nowhere near even beginning to satisfy the White House requirements. If it’s more than that, what is it? A registry? But that requires a crucial piece of metadata, which appears as “Licence Reference” in the diagram, that is needed to assert things are available. This hasn’t been agreed yet (I should know, I’ve been involved in drafting the description). And even when it is, no piece of metadata can make sure access actually happens. Is it a repository that would guarantee access? No – that’s what the CHORUS members hate above all other things. Is it a firm contractual commitment to making those articles with agency grant numbers attached available? Not that I’ve seen, but even if it were, it wouldn’t address the requirements of either the Executive Order or FASTR. As noted above, the mandate applies to all agency funded research, not just those papers where the authors remembered to put in all the correct grant numbers.

Is it a commitment to ensuring the global collection of comprehensive grant information at manuscript submission? With the funding to make it happen – and the funding to ensure the papers become available – and real penalties if it doesn’t happen? With provision of comprehensive usage data for both subscription and freely available content? This is the only level at which the agencies will bite. And this is a horrendous and expensive can of worms.

In the UK we have a Victorian infrastructure for delivering water. It just about works but a huge proportion of the total just leaks out of the pipes – it’s not like we have a shortage of rain but when we have a “drought” we quickly run into serious problems. The cost of fixing the pipes? Vastly more than we can afford. What I think happened with CHORUS is what happens with a lot of industry-wide tech projects. Someone had a bright idea, and went to each player asking them whether they could deliver their part of the pipeline. Each player has slightly overplayed the ease of delivery, and slightly underplayed the leakage and problems. A few percent here and a few percent there isn’t a problem for each step in isolation – but along the whole pipeline it adds up to the point where the whole system simply can’t deliver. And delivering means replacing the whole set of pipes.
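
The way small leaks compound can be shown with a few lines of arithmetic. This is only a sketch: the five stages and the 95% figure per stage are invented for illustration, not estimates of any real step in CHORUS.

```python
import math


def delivered(stage_success_rates):
    """Fraction that survives every stage: the product of per-stage rates."""
    return math.prod(stage_success_rates)


# Hypothetical pipeline: five steps, each losing "only" 5%.
stages = [0.95] * 5

# Roughly 0.77, so nearly a quarter is lost end to end.
print(delivered(stages))
```

Each stage on its own looks fine at 95%, which is why nobody in the chain sees a problem until the end-to-end figure is measured.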