In the first post in this series I identified a number of challenges in scholarly publishing while stepping through some of the processes that publishers undertake in the management of articles. A particular theme was the challenge of managing a heterogeneous stream of articles, with their associated heterogeneous formats and problems, particularly at large scale. An immediate reaction many people have is that there must be technical solutions to many of these problems. In this post I will briefly outline some of the characteristics of possible solutions and why they are difficult to implement. Specifically we will focus on how the pipeline of the scholarly publishing process that has evolved over time makes large-scale implementation of new systems difficult.
The shape of solutions
Many of the problems raised in the previous post, as well as many broader technical challenges in scholarly communications, can be characterised as issues of either standardisation (too many different formats) or many-to-many relationships (many different image-production tools feeding a range of different publisher systems, but equally many institutions trying to pay APCs to many publishers). Broadly speaking, the solutions to this class of problem involve adopting standards.
Without focussing too much on technical details, because there are legitimate differences of opinion on how best to implement this, solutions would likely involve re-thinking the form of the documents flowing through the system. In an ideal world authors would be working in a tool that interfaces directly and cleanly with publisher systems. At the point of submission they would be able to see what the final published article would look like, not because the process of conversion to its final format would be automated but because it would be unnecessary. The document would be stored throughout the whole process in a consistent form from which it could always be rendered to web or PDF or whatever is required on the fly. Metadata would be maintained throughout in a single standard form, ideally one shared across publisher systems.
The details of these solutions don’t really matter for our purposes here. What matters is that they are built on a coherent standard representation of documents, objects and bibliographic metadata that flow from the point of submission (or ideally even before) through to the point of publication and syndication (or ideally long after that).
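To make that concrete, here is a minimal sketch, in Python and purely hypothetical rather than any real publisher's schema, of what a single canonical article representation might look like: one structured object carrying both content and bibliographic metadata, from which HTML (or PDF, or JATS XML) is rendered on demand rather than converted and stored separately.

```python
# A minimal, purely illustrative sketch: one canonical article object holding
# both content and bibliographic metadata, with rendering done on demand.
# None of the field names here correspond to any real publisher schema.

from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Article:
    title: str
    authors: List[str]
    abstract: str = ""
    doi: Optional[str] = None                        # assigned later in the process
    body: List[dict] = field(default_factory=list)   # ordered content blocks


def render_html(article: Article) -> str:
    """Render the canonical representation to simple HTML on the fly."""
    author_line = ", ".join(article.authors)
    paragraphs = "\n".join(
        f"<p>{block['text']}</p>"
        for block in article.body
        if block.get("type") == "paragraph"
    )
    return f"<h1>{article.title}</h1>\n<p>{author_line}</p>\n{paragraphs}"


# The same object could equally feed a PDF or JATS XML renderer; the point is
# that nothing downstream ever converts the document, it only renders it.
article = Article(
    title="An example article",
    authors=["A. Author", "B. Author"],
    body=[{"type": "paragraph", "text": "Results would go here."}],
)
print(render_html(article))
```

The details would differ in practice (a real system might well use something like JATS XML as the canonical form), but the design point is the same: one authoritative representation, many disposable renderings.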
The challenges of the solutions
Currently the scholarly publishing pipeline is really two pipelines. The first is the pipeline where format conversions of the document are carried out, and the second a pipeline where data about the article is collected and managed. There is usually redundancy, and often inconsistency, between these two pipelines, and they are frequently managed by Heath Robinson-esque processes which have been developed to patch systems from different suppliers together. A lot of the inefficiencies in the publication process are the result of this jury-rigged assembly, but at the same time its structure is one of the main things preventing change and innovation.
If the architectural solution to these problems is one of adopting a coherent set of standards throughout the submission and publication process, then the technical issue is one of systems re-design to adopt those standards. From the outside this looks like a substantial job, but not an impossible one. From the inside of a publisher it is not merely Sisyphean, but more like pushing an ever-growing boulder up a slope which someone is using as a (tilted) bowling alley.
There is a critical truth which few outside the publishing industry have grasped. Most publishers do not control their own submission and publication platforms. Some small publishers successfully build their entire systems (PeerJ and PenSoft are two good examples), but these systems are built for (relatively) small scale and struggle as they grow beyond a thousand papers a year. Some medium-sized publishers can successfully maintain their own web server platforms (PLOS is one example of this), but very few large publishers retain technical control over their whole pipelines.
Most medium to large publishers outsource both their submission systems (to organisations like Aries and ScholarOne) and their web server platforms (to companies like Atypon and HighWire). There were good tactical reasons for this in the past, advantages of scale and expertise, but strategically it is a disaster. It leaves essentially the entire customer-facing part of a publishing business, whether facing the author or the reader, in the hands of third parties. And the scale gained by that centralisation, as well as the lack of flexibility that scale tends to create, makes the kind of re-tooling envisaged above next to impossible.
Indeed the whole business structure is tipped against change. These third-party players hold quasi-monopoly positions. Publishers have nowhere else to go. The service providers, with publishers as their core customers, need to cover the cost of development and do so by charging their customers. This is a perfectly reasonable thing to do, but with the choice of providers limited (and a shift from one to the other costly and painful), and with no plausible DIY option for most publishers, the reality is that it is in the interests of the providers to hold off on implementing new functionality until they can charge the maximum for it. Publishers are essentially held to ransom for relatively small changes; radical re-organisation of the underlying platform is simply impossible.
Means of escape
So what are the means of escaping this bind, particularly given that it's a natural consequence of the market structure and scale? No-one designed the publishing industry to work this way. The biggest publishers have the scale to go it alone. Elsevier took this route when they purchased the source code to Editorial Manager, the submission platform that is run by Aries as a service for many other publishers. Alongside their web platforms (and probably the best back-end data system in the business) this gives Elsevier end-to-end control over their technical systems. It didn't stop Elsevier spending a rumoured $25M on a new submission system that has never really surfaced, illustrating just how hard these systems are to build.
But Elsevier has a scale and technical expertise (not to mention cash) that most other publishers can only envy. Running such systems in house is a massive undertaking, and doing further large-scale development is a bigger job again. Most publishers cannot do this alone.
Where there is a shared need for a scalable platform, Community Open Source projects can provide a solution. Publishing has not traditionally embraced Open Source or shared tools, preferring to re-invent the wheel internally. The Public Knowledge Project's Open Journal Systems is the most successful play in this space, running more journals (albeit small ones) than any other software platform. There are technical criticisms to be made of OJS and it is in need of an overhaul, but a big problem is that, while it is a highly successful community, it has never become a true Open Source Community Project. The Open Source code is taken and used in many places, but it is locally modified and there has not been the development of a culture of contributing code back to the core.
The same could be said of many other Open Source projects in the publishing space. Frequently the code is available and re-usable, but there is little or no effort put into creating a viable community of contribution. It is Open Source Code, not an Open Source Project. Publishers are often so focussed on maintaining their edge against competition that the hard work of creating a viable community around a shared resource gets lost or forgotten.
It is also possible that the scale of publishing is insufficient to support true Open Source Community projects. The big successful Open Source infrastructure projects are critical across the web, toolsets so widely used that it's a no-brainer for big corporations more used to patent and copyright fights to invest in them. It could be that there are simply too few developers and too little resource in publishing to make pooling them viable. My view is that the challenge is more political than a matter of resourcing, more in fact a result of there being maybe one or two CTOs or CEOs in the entire publishing industry with deep experience of the Open Source world, but there are differences from those other web-scale Open Source projects and it is important to bear that in mind.
Barriers to change
When I was at PLOS I got about one complaint a month along the lines of “why can’t PLOS just do X”, where X was a different image format, a different process for review, a step outside the standard pipeline. The truth of the matter is that it can almost always be done, if a person can be found to hand-hold that specific manuscript through the entire process. You can do it once, and then someone else wants it, and then you are in a downward spiral of monkey patching to try and keep up. It simply doesn’t scale.
A small publisher, one that has complete control over its pipeline and is operating at hundreds or low thousands of articles a year, can do this. That’s why you will continue to see most of the technical innovation within the articles themselves come from small players. The very biggest players can play both ends: they have control over their systems, the technical expertise to make something happen, and the resources to throw money at solving a problem manually to boot. But then those solutions don’t spread, because no-one else can follow them. Elsevier is alone in this category, with Springer-Nature possibly following if they can successfully integrate the best of both companies.
But the majority of content passes through the hands of medium-sized players, and players with no real interest in technical developments. These publishers are blocked. Without a significant structural change in the industry it is unlikely we will see significant change. To my mind that structural change can only happen if a platform is developed that achieves scale by supporting multiple publishers. Again, to my mind, that can only be provided by an Open Source platform, one with a real community program behind it. No medium-sized publisher wants to shift to a new proprietary platform which will just lock them in again. But publishers across the board have collectively demonstrated a lack of willingness to engage with, as well as a lack of understanding of, what an Open Source Project really is.
Change might come from without, from a new player providing a fresh look at how to manage the publishing pipeline. It might come from projects within research institutions or a collaboration of scholarly societies. It could even come from within: publishers can collaborate when they realise it is in their collective interests. But until we see new platforms that provide flexible, standards-based mechanisms for managing documents and their associated metadata throughout their life cycle, that operate successfully at a scale of tens to hundreds of thousands of articles a year, and that above all are built and governed by systems that publishers trust, we will see at most incremental change.
You say that OJS’s “biggest problem is that it has never really embraced the idea of being a community project”. While the software itself can’t embrace anything, I think the PKP has very much wanted this to become a community project, and in fact wants it to be a community project so much that it has deliberately avoided offering a for-fee hosted version. I think the reason people haven’t contributed back code is that the pre-3.0 codebase turned out to be poorly architected and therefore difficult to modify. I hope that version 3.0 (currently in beta) brings not only an improved interface but also an improved architecture for modification by the community.
Good point. That’s badly worded and I’ll make a correction. PKP is actually a very successful community project and a very successful community. The OJS software is also very successful in terms of reach. What hasn’t worked well is OJS as a community open source project, with contributions and community governance.
The 2.0/3.0 split is a good example of a classic anti-pattern. The divide happens because architectural change is necessary, but because of the way code modifications have flowed there are thousands of incompatible 2.0 forks out there in the wild that will be difficult or impossible to reconcile and update.
What is needed IMO is some space for the OJS community to breathe and be able to change some pieces one at a time. I’m excited by the work that Ubiquity and CKF are doing in this sense because they could create an Open Source ecosystem with modular components that give some space for adoption of new pieces while the architectural shifts needed to update OJS can be pursued.
I’m interested to know what you think of Frontiers in this context. Along with PeerJ I would regard Frontiers as the poster child for “large scale implementation of new systems”, and judging by this blog post http://blog.frontiersin.org/2015/10/13/frontiers-financial-commitment-to-open-access-publishing/ a significant chunk of their APCs goes into the challenges of scaling beyond 1000 papers/year (Frontiers published >10000 articles in 2014, I think).
Suppose we try to push widely the idea that OJS should become “true” open source software. How many developers would be needed for it to work? Are there other barriers?
It’s really not a question of the number of developers but of commitment to community management, and resourcing of community managers. There are more than enough people doing work on OJS worldwide, and more than enough organisations with some actual resource to contribute dev resource either in kind or as money. The challenge is building the community structures that make it work and resourcing those. That’s where many OS projects go wrong in my view – failing to see that it’s about much more than dev resource.
If you look at something like the Apache foundation then I think there is very little actual dev resource at the centre but lots of infrastructures to support communities (and lots of argument as to whether they are the right shape/form). The dev resource flows from the community to the centre precisely because those infrastructures mean a lot of value is added to any contribution.
Add to that, as Kevin says, the technical problem. OJS 2 is not well set up for contributions, and as a result of its structure almost every instance of OJS in the wild is a fork that’s not compatible with 3.0. Working out how to shift is very hard. This is why I think the collaboration with the Collaborative Knowledge Foundation is so exciting. It provides some technical space to help bring those forks back to a central plan.
Cameron Neylon
Professor of Research Communications
Centre for Culture & Technology, Curtin University
cn@cameronneylon.net – http://cameronneylon.net
@cameronneylon – http://orcid.org/0000-0002-0068-716X