PolEcon of OA Publishing: What are the assets of a journal?

Victory Press of Type used by SFPP (Photo credit: Wikipedia)

This post wasn’t on the original slate for the Political Economics of Publishing series but it seems apposite as the arguments and consequences of the Editorial Board of Lingua resigning en masse to form a new journal published by Ubiquity Press continue to rumble on.

The resignation of the editorial board of Lingua from the (Elsevier-owned) journal to form a new journal, one intended to really be “the same journal”, raises interesting issues of ownership and messaging. Perhaps even more deeply it raises questions of what the real assets of a journal are. The mutual incomprehension on both sides arises from very different views of what a journal is, and therefore of what the assets are, who controls them, and who owns them. The views of the publisher and the editorial board are so incommensurate as to be almost comical. The views, and more importantly the actions, of the group that really matters, the community that underlies the journal, remain to be seen. I will argue that it is that community that is the most important asset of the strange composite object that is “the journal”, and that it is control of that asset that determines how these kinds of processes (for, as many have noted, this is hardly the first time this has happened) play out.

The publisher view is a fairly simple one and clearly expressed by Tom Reller in a piece on Elsevier Connect. Elsevier no doubt holds paperwork stating that they own the trademark of the journal name and masthead and other subsidiary rights to represent the journal as continuing the work of the journal first founded in 1949. That journal was published by North Holland, an entity purchased by Elsevier in the 90s. The work of North Holland in building up the name was part of the package purchased by Elsevier, and you can see this clearly in Reller’s language in his update to the post where he says Elsevier sees the work of North Holland as the work of Elsevier. The commercial investment and continuity is precisely what Elsevier purchased and this investment is represented in the holding of the trademarks and trading rights. The investment, first of North Holland, and then Elsevier in building up the value of these holdings was for the purpose of gaining future returns. Whether the returns are from subscription payments or APCs no longer matters very much, what matters is realising them and retaining control of the assets.

As a side note the ownership of these journals founded in the first half of the twentieth century is often much less clear than these claims would suggest. Often the work of an original publisher would have been seen as a collaboration, contracts may not exist and registering of trademarks and copyright may have come much later. I know nothing about the specifics of Lingua but it is not uncommon for later instantiations of an editorial board to have signed over trademarks to the publisher in a way that is legally dubious. The reality of course is that legal action to demonstrate this would be expensive, impractical and pretty risky. A theoretical claim of legal fragility is not much use against the practical fact that big publishers can hire expensive lawyers. The publisher view is that they own those core assets and have invested in them to gain future returns.  They will therefore act to protect those assets.

The view of the editorial board is almost diametrically opposed. They see themselves as representing a community of governance and creating and providing the intellectual prestige of the journal. For the editorial board that community prestige is the core asset. With the political shifts of Open Access and digital scholarship questions of governance have started to play into those issues of prestige. Communities and journals that want to position themselves as forward looking, or supporting their community, are becoming increasingly concerned with access and costs. This is painted as a question of principle, but the core underlying concern is the way that political damage caused by lack of access, or by high (perceived) prices will erode the asset value of the prestige that the editorial board has built up through their labour.

This comes to a head where the editorial board asks Elsevier to hand over the rights to the journal name. For Elsevier this is a demand to simply hand over an asset, the parcel of intellectual property rights. David Mainwaring’s description of the demands of the board as “a gun to the head” gives a sense of how many people in publishing would view that letter. From that perspective it is clearly unreasonable, even deliberately so, intended to create conflict with no expectation of resolution. This is a group demanding, with menaces, the handover of an asset which the publisher has invested in over the years. The paper holdings of trademarks and trading rights represent the core asset and the opportunities for future returns. A unilateral demand to hand them over is only one step from highway robbery. Look closely at the language Reller uses to see how the link between investment and journal name, and therefore those paper holdings, is made.

For the editorial board the situation is entirely different. I would guess many researchers would look at that letter and see nothing unreasonable in it at all. The core asset is the prestige and they see Elsevier as degrading that asset, and therefore the value of their contribution over time. For them, this is the end of a long road in which they’ve tried to ensure that their investment is realised through the development of prestige and stature, for them and for the journal. The message they receive from Elsevier is that it doesn’t value the core asset of the journal and that it doesn’t value their investment. To address this they attempt to gain more control, to assert community governance over issues that they hadn’t previously engaged with. These attempts to engage over new issues – price and access – are often seen as naive, or at best fail to connect with the publisher perspective. The approaches are then rebuffed and the editorial group feel they have only a single card left to play, and the tension therefore rises in the language that they use. What the publisher sees as a gun to the head, the editorial board see as their last opportunity to engage on the terms that they see as appropriate.

Of course, this kind of comprehension gap is common to collective action problems that reach across stakeholder groups, and as in most cases that lack of comprehension leads to recrimination and dismissal of the other party’s perspective, on one hand as motivated by self-interest and on the other as naive and arrogant. There is some justice in these characterisations, but regardless of which side of the political fence you sit on, it is useful to understand that these incompatible views are driven by differing narratives of value, by entirely different views of what the core assets are. On both sides the view is that the other party is dangerously and wilfully degrading the value of the journal.

Both views, that the value of the journal is the prestige and influence realised out of expert editorial work, and that the value is the brand of the masthead and the future income it represents, are limited. They fail to engage with the root of value creation in the authorship of the content. The real core asset is the community of authors. Of course both groups realise this. The editorial board believes that it can carry the community of authors with it. Elsevier believes that the masthead will keep that community loyal. The success or failure of the move depends on which of them is right. The answer is probably that both are right to some extent, which means the community gets split and the asset degraded: a real-life example of the double-defection outcome in a Prisoner’s Dilemma game.

Such editorial board resignations are not new. There have been a number in the past, some more successful than others. It is important to note that the editorial board is not the community, or representative of it. It is precisely those cases where the editorial board most directly represents the community of authors that defections are most successful. On the past evidence Elsevier are probably correct to gamble that the journal will at least survive. The factors in favour of the editorial board are that Linguistics is a relatively small, tight-knit community, and that they have a credible (and APC-free) offer on the table that will look and feel a lot like the service offering they had. I would guess that Lingua authors are focussed on the journal title and only think of the publisher as a distant second issue, if they are even aware of who the publisher is. In that sense the emergence of new lean publishers like Ubiquity Press and consortial sustainability schemes like the Open Library of Humanities is a game changer, offering a high quality experience that otherwise looks and feels like a traditional journal process (again, it is crucial to emphasise the lack of APCs to trouble the humanities scholar) while also satisfying the social, funder and institutional pressure for Open Access.

Obviously my sympathies lie with the editorial board. I think they have probably the best chance to make this work we have yet seen. The key is to bring the author community with them. The size and interconnections of this specific community make this possible.

But perhaps more interesting is to look at it from the Elsevier perspective. The internal assessment will be that there were no options here. They’ve weathered similar defections in the past, usually with success, and there would be no value in acceding to the demands of the editorial board. The choice was to hold a (possibly somewhat degraded) asset or to give it away. The internal perception will be that the new journal can’t possibly survive, probably that Ubiquity Press will be naively underfunded and can’t possibly offer the level of service that the community will expect. The best case scenario is steady as she goes, with a side order of schadenfreude as the new journal fails; the worst case, the loss of value to a single masthead. And on the overall profit and loss sheet a single journal doesn’t really matter, as it’s the aggregate value that sells subscription bundles. Logical analysis points to defection as the best move in the prisoner’s dilemma.

Except I think that’s wrong, for two reasons. One is that this is not a single-trial prisoner’s dilemma; it’s a repeated trial with changing conditions. Second, the asset analysis plays out differently for Elsevier than it does for the editorial board, making the repeated trials more important. The asset for the editorial board is the community of the journal. The asset for Elsevier is the author community of all their journals. Thus the editorial board is betting everything on one play, they are all in, hence the strength of the rhetoric being deployed. Elsevier need to consider how their choices may play into future conditions.
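The single-trial versus repeated-trial distinction can be made concrete with a toy payoff matrix. The numbers below are illustrative assumptions, not measurements of publishing value; they only capture the standard structure in which defection wins a single round but mutual defection loses out to sustained cooperation over repeated rounds:

```python
# A minimal sketch of the prisoner's dilemma framing used above.
# Payoff values are hypothetical and chosen only to satisfy the
# standard dilemma ordering: T > R > P > S.

# Payoffs (row player, column player) for Cooperate/Defect moves.
PAYOFFS = {
    ("C", "C"): (3, 3),  # both keep investing in the shared journal community
    ("C", "D"): (0, 5),  # one side concedes, the other captures the asset
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),  # mutual defection: the community splits, asset degraded
}

def total(history):
    """Sum payoffs for both players over a sequence of (row, col) rounds."""
    a = b = 0
    for moves in history:
        pa, pb = PAYOFFS[moves]
        a += pa
        b += pb
    return a, b

# In a single round, defecting protects you against being the lone cooperator:
print(total([("D", "D")]))        # (1, 1)
# But over repeated rounds, mutual cooperation dominates mutual defection:
print(total([("C", "C")] * 10))   # (30, 30)
print(total([("D", "D")] * 10))   # (10, 10)
```

The asymmetry in the post follows directly: a board playing a one-shot game rationally risks everything, while a publisher facing many future rounds across many journals should weight the repeated-game payoffs.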

Again, the standard analysis would be “protect the assets”. Send a strong message to the wider community (including shareholders) that the company will hold its assets. The problem for me is that this is both timid and, in the longer term, potentially toxic. It’s timid compared to what should surely be a robust internal view of the value that Elsevier offers: that the quality services it provides simply cannot be offered sustainably at a lower price. The confident response would be to call the board’s bluff, put the full costing and offer transparently on the table in front of the community, and force a compare and contrast. More than just saying “it can’t be done at the price you demand”, put out the real costs and the real services.

The more daring move would be to let the editorial board take the name on a “borrow and return” basis, giving Elsevier right of first refusal if (and in the Elsevier view, when) they find that they’re not getting what they need in their new low cost (and therefore, in the Elsevier view, low service) environment. After all, the editorial board already have money to support APCs according to their letter. It’s risky of course, but again it would signal strong confidence in the value of the services offered. Publishers rarely have to do this, but I find it depressing that they almost always shy away from opportunities to really place their value offering in a true market in front of their author communities. To my mind it shows a lack of robust internal confidence in the value they offer.

But beyond the choice of actions there’s a reason why this standard approach is potentially toxic, and potentially more toxic long term even if, perhaps especially if, Elsevier can continue to run the journal with a new board. If Elsevier are to protect the existing asset as they see it, they need to make the case that the journal can continue to run as normal with a new board. The problem is that this case can only be made if the labour of editors is interchangeable, devaluing the contribution of the existing board and by extension the contribution of all other Elsevier editorial boards. If Elsevier can always replace the board of a journal then why would an individual editor, one who believes that it is their special and specific contribution that is building journal prestige, stay engaged? And if it’s merely to get the line on their CV and they really don’t care, how can Elsevier rely on the quality of their work? Note it is not that Elsevier don’t see the value of that contributed labour, it is clear that editors are part of the value creation chain that adds to Elsevier income, but that the situation forces them to claim that this labour is interchangeable. Elsevier see the masthead as the asset that attracts that labour. The editorial board see their labour and prestige as the asset that attracts the publisher investment in the masthead.

You can see this challenge in Elsevier statements. David Clark, interviewed as part of a Chronicle piece is quoted as follows:

He sees the staff departures as a routine part of the publishing world. “Journals change and editors change,” Mr. Clark said. “That happens normally.”

And Tom Reller in the statement on the Elsevier website:

The editors of Lingua wanted for Elsevier to transfer ownership of the journal to the collective of editors at no cost. Elsevier cannot agree to this as we have invested considerable amount of time, money and other resources into making it a respected journal in its field. We founded Lingua 66 years ago.

You can see here the attempt to discount the specific value of the current editorial board, but in terms that are intended to come across as conciliatory. Elsevier’s comms team are clearly aware of the risk here. Too conciliatory a stance would look weak (and might play badly with institutional shareholders) and too strong a stance sends a message to the community that their contribution is not really valued.

This is the toxic heart of the issue. In the end if Elsevier win, then what they’ve shown is that the contribution of the current editorial board doesn’t matter, that the community only cares about the brand. That’s a fine short term win and may even strengthen their hand in subscription negotiations. But it’s utterly toxic to the core message that publishers want to send to the research communities that they serve, that they are merely the platform. It completely undermines the value creation by editorial boards that Elsevier relies on to sell journals (or APCs) and generate their return on investment.

Playing both sides worked in the world before the web, when researchers were increasingly divorced from any connection with the libraries negotiating access to content. Today, context collapse is confronting both groups. Editorial boards are suddenly becoming aware that they had acquiesced in giving up control, and frequently legal ownership, of “their” journal, at the same time as issues of pricing and cost are finally coming to their attention. Publishers in general, and Elsevier in particular, can’t win a trick in the public arena because their messaging to researchers, lobbying of government, and actions in negotiation are now visible to all the players. But more than that, all those players are starting to pay attention.

The core issue for Elsevier is that if they win this battle, they will show that it is their conception of the core assets of the journal that is dominant. But if that’s true then it means that editorial boards contribute little or no value. That doesn’t mean that a “brand only” strategy couldn’t be pursued, and we will return to the persistence of prestige and brand value in the face of increasing evidence that they don’t reflect underlying value later in the series. But that’s a medium-term strategy. In the longer term, if Elsevier and other publishers continue to focus on and hold the masthead and trademarks as the core asset of the journal, then they are forced into a messaging and communications stance that will be ultimately disastrous.

There’s no question that Elsevier understands the value that editorial board contributions bring. But continuing down the ownership path through continued rebellions will end up forcing them to keep signalling to senior members of research communities that their personal contribution has no value, that they can easily be replaced with someone else. In the long term that is not going to play out well.

The Political Economics of Open Access Publishing – A series

Victory Press of Type used by SFPP (Photo credit: Wikipedia)

One of the odd things about scholarly publishing is how little any particular group of stakeholders seems to understand the perspective of others. It is easy to start with researchers ourselves, who are for the most part embarrassingly ignorant of what publishing actually involves. But those who have spent a career in publishing are equally ignorant (and usually dismissive to boot) of researchers’ perspectives. Each in turn fails to understand what libraries are or how librarians think. Indeed the naive view that libraries and librarians are homogeneous is a big part of the problem. Librarians in turn often fail to understand the pressures researchers are under, and are often equally ignorant of what happens in a professional publishing operation. And of course everyone hates the intermediaries.

That this is a political problem in a world of decreasing research resources is obvious. What is less obvious is the way that these silos have prevented key information and insights from travelling to the places where they might be used. Divisions that emerged a decade ago now prevent the very collaborations that are needed, not even to build new systems, but to bring together the right people to realise that they could be built.

I’m increasingly feeling that the old debates (what’s a reasonable cost, green vs gold, hybrid vs pure) are sterile and misleading. That we are missing fundamental economic and political issues in funding and managing a global scholarly communications ecosystem by looking at the wrong things. And that there are deep and damaging misunderstandings about what has happened, is happening, and what could happen in the future.

Of course, I live in my own silo. I can, I think, legitimately claim to have seen more silos than the average; in jobs, organisations and also disciplines. So it seems worth setting down that perspective. What I’ve realised, particularly over the past few months is that these views have crept up on me, and that there are quite a few things to be worked through, so this is not a post, it is a series, maybe eventually something bigger. Here I want to set out some headings, as a form of commitment to writing these things down. And to continuing to work through these things in public.

I won’t claim that this is all thought through, nor that I’ve got (even the majority of) it right. What I do hope is that in getting things down there will be enough here to be provocative and useful, and to help us collectively solve, and not just continue to paper over, the real challenges we face.

So herewith a set of ideas that I think are important to work through. More than happy to take requests on priorities, although the order seems roughly to make sense in my head.

  1. What is it publishers do anyway?
  2. What’s the technical problem in reforming scholarly publishing?
  3. The marginal costs of article publishing: Critiquing the Standard Analytics Paper and follow up
  4. What are the assets of a journal?
  5. A journal is a club: New Working Paper
  6. Economies of scale
  7. The costs (and savings) of community (self) management
  8. Luxury brands, platform brands and emerging markets (or why Björn might be right about pricing)
  9. Constructing authority: Prestige, impact factors and why brand is not going away
  10. Submission shaping, not selection, is the key to a successful publishing operation
  11. Challenges to the APC model I: The myth of “the cost per article”
  12. Challenges to the APC model II: Fixed and variable costs in scholarly publishing
  13. Alternative funding models and the risks of a regulated market
  14. If this is a service industry why hasn’t it been unbundled already (or where is the Uber of scholarly publishing?)
  15. Shared infrastructure platforms supporting community validation: Quality at scale. How can it be delivered and what skills and services are needed?
  16. Breaking the deadlock: Where are the points where effective change can be started?

They. Just. Don’t. Get. It…

Traffic Jam in Delhi (Image via Wikipedia)

…although some are perhaps starting to see the problems that are going to arise.

Last week I spoke at a Question Time style event held at Oxford University and organised by Simon Benjamin and Victoria Watson called “The Scientific Evolution: Open Science and the Future of Publishing” featuring Tim Gowers (Cambridge), Victor Henning (Mendeley), Alison Mitchell (Nature Publishing Group), Alicia Wise (Elsevier), and Robert Winston (mainly in his role as TV talking head on science issues). You can get a feel for the proceedings from Lucy Pratt’s summary but I want to focus on one specific issue.

As is common for me recently I emphasised the fact that networked research communication needs to be different to what we are used to. I made a comparison to the fact that when the printing press was developed one of the first things that happened was that people created facsimiles of hand written manuscripts. It took hundreds of years for someone to come up with the idea of a newspaper and to some extent our current use of the network is exactly that – digital facsimiles of paper objects, not truly networked communication.

It’s difficult to predict exactly what form a real networked communication system will take, in much the same way that asking a 16th century printer how newspaper advertising would work would not provide a detailed and accurate answer, but there are some principles of successful network systems that we can see emerging. Effective network systems distribute control and avoid centralisation; they are loosely coupled and distributed. Very different from the centralised systems of access and control we have today.

This is a difficult concept and one that scholarly publishers simply don’t get for the most part. This is not particularly surprising because truly disruptive innovation rarely comes from incumbent players. Large and entrenched organisations don’t generally enable the kind of thinking that is required to see the new possibilities. This is seen in publishers’ statements that they are providing “more access than ever before” via “more routes”, but all routes that are under tight centralised control, with control systems that don’t scale. By insisting on centralised control over access publishers are setting themselves up to fail.

Nowhere is this going to play out more starkly than in the area of text mining. Bob Campbell from Wiley-Blackwell walked into this – but few noticed it – with the now familiar claim that “text mining is not a problem because people can ask permission”. Centralised control, failure to appreciate scale, and failure to understand the necessity of distribution and distributed systems. I have with me a device capable of holding the text of perhaps 100,000 papers. It also has the processor power to mine that text. It is my phone. In 2-3 years our phones, hell our watches, will have the capacity to not only hold the world’s literature but also to mine it, in context, for what I want right now. Is Bob Campbell ready for every researcher, indeed every interested person in the world, to come into his office and discuss an agreement for text mining? Because the mining I want to do and the mining that Peter Murray-Rust wants to do will be different, and what I will want to do tomorrow is different to what I want to do today. This kind of personalised mining is going to be the accepted norm of handling information online very soon and will be at the very centre of how we discover the information we need. Google will provide a high quality service for free, subscription based scholarly publishers will charge an arm and a leg for a deeply inferior one – because Google is built to exploit network scale.
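The storage claim is easy to sanity-check with back-of-envelope arithmetic. The average paper size here is an assumed figure (a typical research article as plain text, without figures or PDF overhead):

```python
# Rough capacity check for "100,000 papers on a phone".
# AVG_PAPER_TEXT_BYTES is an assumption, not a measured value.
AVG_PAPER_TEXT_BYTES = 50 * 1024   # ~50 KB of plain text per paper (assumed)
PAPERS = 100_000

corpus_bytes = AVG_PAPER_TEXT_BYTES * PAPERS
corpus_gb = corpus_bytes / 1024**3
print(f"{corpus_gb:.1f} GB")  # ~4.8 GB: comfortably within phone storage
```

Even allowing the assumed figure to be off by a factor of two or three, the whole corpus fits on an ordinary handset, which is the point of the argument above.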

The problem of scale has also just played out in fact. Heather Piwowar writing yesterday describes a call with six Elsevier staffers to discuss her project and needs for text mining. Heather of course now has to have this same conversation with Wiley, NPG, ACS, and all the other subscription based publishers, who will no doubt demand different conditions, creating a nightmare patchwork of different levels of access on different parts of the corpus. But the bit I want to draw out is at the bottom of the post where Heather describes the concerns of Alicia Wise:

At the end of the call, I stated that I’d like to blog the call… it was quickly agreed that was fine. Alicia mentioned her only hesitation was that she might be overwhelmed by requests from others who also want text mining access. Reasonable.

Except that it isn’t. It’s perfectly reasonable for every single person who wants to text mine to want a conversation about access. Elsevier, because they demand control, have set themselves up as the bottleneck. This is really the key point: because the subscription business model implies an imperative to extract income from all possible uses of the content, it sets up a need to control access for differential uses. This means in turn that each different use, and especially each new use, has to be individually negotiated, usually by humans, apparently about six of them. This will fail because it cannot scale in the same way that the demand will.

The technology exists today to make this kind of mass distributed text mining trivial. Publishers could push content to BitTorrent servers and then publish regular deltas to notify users of new content. The infrastructure for this already exists. There is no infrastructure investment required. The problem that publishers raise, of their servers not coping, is one that they have created for themselves. The catch is that distributed systems can’t be controlled from the centre, and giving up control requires a different business model. But this is also an opportunity. The publishers also save money if they give up control: no more need for six people to sit in on each of hundreds of thousands of meetings. I often wonder how much lower subscriptions would be if they didn’t need to cover the cost of access control, sales, and legal teams.
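The “publish regular deltas” idea is simple to sketch. The manifest format and article identifiers below are hypothetical; the point is only that computing what changed between two content snapshots is trivial and needs no human in the loop:

```python
# Sketch of delta publication between two content manifests, where a
# manifest maps an article id to a hash of its content. Ids and hashes
# here are made-up illustrations.

def delta(old, new):
    """Return (added, updated, removed) article ids between two manifests."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    updated = sorted(k for k in set(old) & set(new) if old[k] != new[k])
    return added, updated, removed

yesterday = {"art-001": "a1f3", "art-002": "9c2e"}
today = {"art-001": "a1f3", "art-002": "b777", "art-003": "04dd"}

# Consumers fetch only what the delta names, from wherever it is mirrored.
print(delta(yesterday, today))  # (['art-003'], ['art-002'], [])
```

A publisher posting a manifest like this alongside mirrored content would let any number of miners stay current without ever touching the publisher’s servers, which is exactly the property centralised access control gives up.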

We are increasingly going to see these kinds of failures. Legal and technical incompatibility of resources, contractual requirements at odds with local legal systems, and above all the claim “you can just ask for permission” without the backing of the hundreds or thousands of people that would be required to provide a timely answer. And that’s before we deal with the fact that the most common answer will be “mumble”. A centralised access control system is simply not fit for purpose in a networked world. As demand scales, people making legitimate requests for access will have the effect of a distributed denial of service attack. The clue is in the name; the demand is distributed. If the access control mechanisms are manual, human and centralised, they will fail. But if that’s what it takes to get subscription publishers to wake up to the fact that the networked world is different then so be it.


Network Enabled Research: Maximise scale and connectivity, minimise friction

BBN Technologies TCP/IP internet map early 1986
Image via Wikipedia

Prior to all the nonsense with the Research Works Act, I had been having a discussion with Heather Morrison about licenses and Open Access and peripherally the principle of requiring specific licenses of authors. I realized then that I needed to lay out the background thinking that leads me to where I am. The path that leads me here is one built on a technical understanding of how networks function and what their capacity can be. This builds heavily on the ideas I have taken from (in no particular order) Jon Udell, Jonathan Zittrain, Michael Nielsen, Clay Shirky, Tim O’Reilly, Danah Boyd, and John Wilbanks among many others. Nothing much here is new but it remains something that very few people really get. Ironically the debate over the Research Works Act is what helped this narrative crystallise. This should be read as a contribution to Heather’s suggested “Articulating the Commons” series.

A pragmatic perspective

I am at heart a pragmatist. I want to see outcomes, I want to see evidence to support the decisions we make about how to get outcomes. I am happy to compromise, even to take tactical steps in the wrong direction if they ultimately help us to get where we need to be. In the case of publicly funded research we need to ensure that the public investment in research is made in such a way that it maximizes those outcomes. We may not agree currently on how to prioritize those outcomes, or the timeframe they occur on. We may not even agree that we can know how best to invest. But we can agree on the principle that public money should be effectively invested.

Ultimately the wider global public is for the most part convinced that research is something worth investing in, but in turn they expect to see outcomes of that research: jobs, economic activity, excitement, prestige, better public health, improved standards of living. The wider public are remarkably sophisticated when it comes to understanding that research may take a long time to bear fruit. But they are not particularly interested in papers. And when they become aware of academia’s obsession with papers they tend to be deeply unimpressed. We ignore that at our peril.

So it is important that when we think about the way we do research, we understand the mechanisms and the processes that lead to outcomes. Even if we can’t predict exactly where outcomes will spring from (and I firmly believe that we cannot) that does not mean that we can avoid the responsibility of thoughtfully designing our systems so as to maximize the potential for innovation. The fact that we cannot, literally cannot under our current understanding of physics, follow the path of an electron through a circuit does not mean that we cannot build circuits with predictable overall behaviour. You simply design the system at a different level.

The assumptions underlying research communication have changed

So why are we having this conversation? And why now? What is it about today’s world that is so different? The answer, of course, is the internet. Our underlying communications and information infrastructure is arguably undergoing its biggest change since the development of Gutenberg’s press. Like all developments of new communication networks, SMS, fixed telephones, the telegraph, the railways, and writing itself, the internet doesn’t just change how well we can do things, it qualitatively changes what we can do. To give a seemingly trivial example, the expectations and possibilities of a society with mobile telephones are qualitatively different, and their introduction has changed the way we behave and expect others to behave. The internet is a network on a scale, and with connectivity, that we have never had before. The potential change in our capacity as individuals, communities, and societies is therefore immense.

Why do networks change things? Before a network technology spreads you can imagine people, largely separated from each other, unable to communicate in this new way. As you start to make connections nothing much really happens, a few small groups can start to communicate in this new way, but that just means that they can do a few things a bit better. But as more connections form suddenly something profound happens. There comes a point where there is a transition – where suddenly nearly everyone is connected. For the physical scientists this is in fact a phase transition and can display extreme cooperativity – a sudden break where the whole system crystallizes into a new state.
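
The “phase transition” described above can be seen in a toy simulation. The sketch below is illustrative only – the node count, the average degrees, and the `giant_component_fraction` helper are my own choices, not anything from the text. It wires up a random network and measures the largest connected cluster: below roughly one connection per node almost everyone sits in tiny islands, while just above it a single component suddenly spans most of the network.

```python
import random

def giant_component_fraction(n, avg_degree, seed=0):
    """Build a random network with the given average degree and return
    the fraction of nodes in its largest connected component."""
    rng = random.Random(seed)
    edges_needed = int(n * avg_degree / 2)  # each edge contributes degree 2
    adj = [[] for _ in range(n)]
    for _ in range(edges_needed):
        a, b = rng.randrange(n), rng.randrange(n)
        if a != b:
            adj[a].append(b)
            adj[b].append(a)
    seen = [False] * n
    best = 0
    for start in range(n):
        if seen[start]:
            continue
        # traverse one component with a simple stack-based search
        stack, size = [start], 0
        seen[start] = True
        while stack:
            node = stack.pop()
            size += 1
            for nxt in adj[node]:
                if not seen[nxt]:
                    seen[nxt] = True
                    stack.append(nxt)
        best = max(best, size)
    return best / n

# Below the transition almost everyone is isolated in small clusters;
# just above it a single component suddenly spans most of the network.
sparse = giant_component_fraction(10_000, 0.5)  # tiny fraction
dense = giant_component_fraction(10_000, 2.0)   # majority of the network
```

The sudden jump between the two regimes, rather than a gradual improvement, is the point: connectivity is cooperative, and the whole crystallizes into a new state.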

At this point the whole is suddenly greater than the sum of its parts. Suddenly there is the possibility of coordination, of distribution of tasks, that was simply not possible before. The internet simply does this better than any other network we have ever had. It is better for a range of reasons but the key ones are: its immense scale – connecting more people, and now machines, than any previous network; its connectivity – the internet is incredibly densely connected, essentially enabling any computer to speak to any other computer globally; its lack of friction – transfer of information is very low cost, essentially zero compared to previous technologies, and is very easy. Anyone with a web browser can point and click and be a part of that transfer.

What does this mean for research?

So if the internet and the web bring new capacity, where is the evidence that this is making a difference? If we have fundamentally new capacity where are the examples of that being exploited? I will give two examples, both very familiar to many people now, but ones that illustrate what can be achieved.

In late January 2009 Tim Gowers, a Fields medalist and arguably one of the world’s greatest living mathematicians, posed a question: could a group of mathematicians working together be better at solving a problem than one working alone? He suggested a problem, one that he had an idea how to solve but felt was too challenging to tackle on his own. He then started to hedge his bets, stating:

“It is not the case that the aim of the project is [to solve the problem but rather it is to see whether the proposed approach was viable] I think that the chances of success even for this more modest aim are substantially less than 100%.”

A loose collection of interested parties, some world leading mathematicians, others interested but less expert, started to work on the problem. Six weeks later Gowers announced that he believed the problem solved:

“I hereby state that I am basically sure that the problem is solved (though not in the way originally envisaged).”

In six weeks an unplanned assortment of contributors had solved a problem that a world-leading mathematician had thought both interesting, and too hard. And had solved it by a route other than the one he had originally proposed. Gowers commented:

“It feels as though this is to normal research as driving is to pushing a car.”

For one of the world’s great mathematicians, there was a qualitative difference in what was possible when a group of people with the appropriate expertise were connected via a network through which they could easily and effectively transmit ideas, comments, and proposals. Three key messages emerge: the scale of the network was sufficient to bring the required resources to bear; the connectivity of the network was sufficient that work could be divided effectively and rapidly; and there was little friction in transferring ideas.

The Galaxy Zoo project arose out of a different kind of problem at a different kind of scale. One means of testing theories of the history and structure of the universe is to look at the numbers and types of different categories of galaxy in the sky. Images of the sky are collected and made freely available to the community. Researchers will then categorise galaxies by hand to build up data sets to allow them to test theories. An experienced researcher could perhaps classify a hundred galaxies in a day. A paper might require a statistical sample of around 10,000 galaxy classifications to get past peer review. One truly heroic student classified 50,000 galaxies within their PhD, declaring at the end that they would never classify another again.

However, problems were emerging. It was becoming clear that the statistical power offered by even 10,000 galaxies was not enough. One group would get different results to another. More classifications were required. Data wasn’t the problem. The Sloan Digital Sky Survey had a million galaxy images. But computer-based image categorization wasn’t up to the job. The solution? Build a network. In this case a network of human participants willing to contribute by categorizing the galaxies. Several hundred thousand people classified the millions of images several times over in a matter of months. Again the key messages: scale of the network – both the number of images and the number of participants; the connectivity of the network – the internet made it easy for people to connect and participate; a lack of friction – sending images one way, and a simple classification back the other, was easy. Making the website easy, even fun, for people to use was a critical part of the success.
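
The numbers quoted above make the change of scale easy to check with back-of-envelope arithmetic. In the sketch below the image count, repeat count, and expert rate come from the text; the volunteer head-count and the deliberately modest per-volunteer rate are illustrative assumptions of mine, not Galaxy Zoo statistics.

```python
# Back-of-envelope comparison of one expert versus a volunteer network.
images = 1_000_000       # Sloan Digital Sky Survey galaxy images
repeats = 5              # each image classified "several times over"
expert_rate = 100        # classifications per expert per day (from the text)
volunteers = 200_000     # assumed: "several hundred thousand people"
volunteer_rate = 10      # assumed classifications per volunteer per day

total = images * repeats
solo_years = total / expert_rate / 365        # over a century for one person
crowd_days = total / (volunteers * volunteer_rate)  # a few days for the crowd
```

Even with conservative assumptions the network turns a multi-lifetime task into one measured in days; in practice the limiting factor becomes recruitment and interface design, not classification labour.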

Galaxy Zoo changed the scale of this kind of research. It provided a statistical power that was unheard of and made it possible to ask fundamentally new types of questions. It also enabled fundamentally new types of people to play an effective role in the research: school children, teachers, full-time parents. It enabled qualitatively different research to take place.

So why hasn’t the future arrived then?

These are exciting stories, but they remain just that. Sure, I can multiply examples, but they are still limited. We haven’t yet taken real advantage of the possibilities. There are lots of reasons for this but the fundamental one is inertia. People within the system are, for the most part, pretty happy with how it works. They don’t want to rock the boat too much.

But there is a group of people who are starting to be interested in rocking the boat: the funders, the patient groups, that global public who want to see outcomes. The thought process hasn’t worked through yet, but when it does they will all be asking one question: “How are you building networks to enable research?” The question may come in many forms – “How are you maximizing your research impact?” – “What are you doing to ensure the commercialization of your research?” – “Where is your research being used?” – but they all really mean the same thing: how are you working to make sure that the outputs of your research are going into the biggest, most connected, lowest friction network that they possibly can?

As service providers, all of those who work in this industry – and I mean all, from the researchers to the administrators, to the publishers to the librarians – will need to have an answer. The surprising thing is that it’s actually very easy. The web makes building and exploiting networks easier than it has ever been because it is a network infrastructure. It has scale: billions of people, billions of computers, exabytes of information resources, exaflops of computational resources. It has connectivity on a scale that is literally unimaginable – more connections than the human mind can conceive of. It is incredibly low in friction – the cost of information transfer is in most cases so close to zero as to make no difference.

Service requirements

To exploit the potential of the network all we need to do is get as much material online as fast as we can. We need to connect it up, to make it discoverable, to make sure that people can find and understand and use it. And we need to ensure that once found those resources can be easily transferred, shared, and used. And used in any way – at network scale the system should be designed to ensure that resources get used in unexpected ways. At scale you can have serendipity by design, not by blind luck.

The problem arises with the systems we have in place to get material online. The raw material of science is not often in a state where putting it online is immediately useful. It needs checking, formatting, testing, indexing. All of this does require real work, and real money. So we need services to do this, and we need to be prepared to pay for those services. The trouble is our current system has this backwards. We don’t pay directly for those services so those costs have to be recouped somehow. And the current set of service providers do that by producing the product that we really need and want and then crippling it.

Currently we take raw science and through a collaborative process between researchers and publishers we generate a communication product, generally a research paper, that is what most of the community holds as the standard means by which they wish to receive information. Because the publishers receive no direct recompense for their contribution they need to recover those costs by other means. They do this by artificially introducing friction and then charging to remove it.

This is a bad idea on several levels. Firstly because it means the product we get doesn’t have the maximum impact it could, because it’s not embedded in the largest possible network. From a business perspective it creates risks: publishers have to invest up front and then recoup money later, rather than being confident that expenditure and cash flow are coupled. This means, for instance, that if there is a sudden rise (or fall) in the number of submissions there is no guarantee that cash flows or costs will scale with that change. But the real problem is that it distorts the market. Because on the researcher side we don’t pay for the product of effective communication, we don’t pay much attention to what we’re getting. On the publisher side it drives a focus on surface and presentation, because that enhances the product in the current purchaser’s eyes, rather than a ruthless focus on production costs and shareability.

Network Ready Research Communication

If we care about taking advantage of the web and internet for research then we must tackle the building of scholarly communication networks. These networks will have the critical characteristics described above: scale and a lack of friction. The question is how we go about building them. In practice we actually already have a network at huge scale – the web and the internet do that job for us, connecting essentially all professional researchers and a large proportion of the interested public. There is work to be done on expanding the reach of the network but this is a global development goal, not something specific to research.

So if we already have the network then what is the problem? The issue lies in the second characteristic – friction. Our current systems are actually designed to create friction. Before the internet was in place our network was formed of a distribution system involving trucks and paper – reducing costs to reasonable levels meant charging for that distribution process. Today those distribution costs have fallen to as near zero as makes no difference, yet we retain the systems that add friction unnecessarily. Slow review processes, charging for access, formats and discovery tools that are no longer fit for purpose.

What we need to do is focus on the process of taking the research that we do and converting it into a Network Ready form. That is, we need access to the services that take our research and make it ready to exploit our network infrastructure – or we need to do it ourselves. What does “Network Ready” mean? A piece of Network Ready Research will be modular and easily discoverable; it will present different facets that allow people and systems to use it in a wide variety of ways; it will be compatible with the widest range of systems; and above all it will be easily shareable. Not just copyable or pasteable, but easily shared through multiple systems while carrying with it all the context required to make use of it, all the connections that will allow a user to dive deeper into its component parts.
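
What might a “Network Ready” object look like in practice? The sketch below is purely illustrative – the field names and the DOI are hypothetical, not any real standard. The point is only that the record carries its licence, its formats, links to its component parts, and its connections to earlier work in a machine-readable form that any system on the network can consume.

```python
import json

# Illustrative sketch of "Network Ready" packaging (not a real schema):
# the object carries its own context and connections in machine-readable form.
network_ready_record = {
    "title": "Example dataset accompanying a research paper",
    "identifier": "doi:10.xxxx/example",   # hypothetical identifier
    "licence": "CC0",                      # legally interoperable by default
    "formats": ["text/csv", "application/json"],   # technically interoperable
    "parts": [                             # modular: each facet addressable
        {"role": "raw-data", "href": "https://example.org/data.csv"},
        {"role": "analysis-code", "href": "https://example.org/analysis.py"},
    ],
    "derived_from": ["doi:10.xxxx/earlier-work"],  # connections to dive deeper
}

# Serialization is what lets the record move freely across systems.
serialized = json.dumps(network_ready_record)
```

A record like this can act as both node and edge: a system that understands nothing about galaxies or mathematics can still follow its links, honour its licence, and pass it on intact.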

Network Ready Research will be interoperable, socially, technically, and legally with the rest of the network. The network is more than just technical infrastructure. It is also built up from the social connections, a shared understanding of the parameters of re-use, and a compatible system of checks and balances. The network is the shared set of technical and social connections that together enable new connections to be made. Network Ready Research will move freely across that, building new connections as it goes, able to act as both connecting edge and connected node in different contexts.

Building and strengthening the network

If you believe the above, as I do, then you see a potential for us to qualitatively change our capacity as a society to innovate, understand our world, and help to make it a better place. That potential will be best realized by building the largest possible, most effective, and lowest friction network possible. A networked commons in which ideas and data, concepts and expertise can be most easily shared, and can most easily find the place where they can do the most good.

Therefore the highest priority is building this network, making its parts and components interoperable, and making it as easy as possible to connect up networks that already exist. For an agency that funds research and seeks to ensure that research makes a difference the only course of action is to place the outputs of that research where they are most accessible on the network. In blunt terms that means three things: free at the point of access, technically interoperable with as many systems as possible, and free to use for any purpose. The key point is that at network scale the most important uses are statistically likely to be unexpected uses. We know we can’t predict the uses, or even success, of much research. That means we must position it so it can be used in unexpected ways.

Ultimately, the bigger the commons, the bigger the network, the better. And the more interoperable it is, and the wider the range of uses, the better. That ultimately is why I argue for liberal licences, for the exclusion of non-commercial terms. It is why I use ccZero on this blog and for software that I write where I can. For me, the risk of commercial enclosure is so much smaller than the risk of not building the right networks, or of creating fragmented incompatible networks, of ultimately not being able to solve the crises we face today in time to do any good, that the course of action is clear. At the same time we need to build up the social interoperability of the network, to call out bad behaviour and perhaps in some cases to isolate its perpetrators, but we need to find ways of doing this that don’t damage the connectivity and freedom of movement on the network. Legal tools are useful where they assure users of interoperability and their rights; otherwise they just become a source of friction. Social tools are a more viable route for encouraging desirable behaviour.

The priority has to be achieving scale and lowering friction. If we can do this then we have the potential to create a qualitative jump in our research capacity on a scale not seen since the 18th century and perhaps never. And it certainly feels like we need it.

Enhanced by Zemanta

IP Contributions to Scientific Papers by Publishers: An open letter to Reps Maloney and Issa

Dear Representatives Maloney and Issa,

I am writing to commend your strong commitment to the recognition of intellectual property contributions to research communication. As we move to a modern knowledge economy, supported by the technical capacity of the internet, it is crucial that we have clarity on the ownership of intellectual property arising from the federal investment in research. For the knowledge economy to work effectively it is crucial that all players receive fair recompense for the contribution of intellectual property that they make and the services that they provide.

As a researcher I like to base my work on solid data, so I thought it might interest you to have some quantitation of the level of contribution of IP that publishers make to the substance of scientific papers. In this, I have focussed on the final submitted version of papers after peer review as this is the version around which the discussion of mandates for deposition in repositories revolves. This also has the advantage of separating the typesetting and copyright in layout, clearly the property of the publishers, from the intellectual substance of the research.

Contribution of IP to the final (post peer review) submitted versions of papers

Methodology: I examined the final submitted version (i.e. the version accepted for publication) of the ten most recent research papers on which I was an author, along with the referee and editorial comments received from the publisher. For each paper I examined the text of the final submitted version and the diagrams and figures. As the only IP of significance in this case is copyright, the specific contributions that were searched for were text or elements of figures contributed by the publisher that satisfied the requirements for obtaining copyright. Figures that were re-used from other publications (where the copyright had been transferred to the other publisher and permission had been obtained to republish) were not included, as these were considered “old IP” that did not relate to new IP embodied in the specific paper under consideration. The text and figures were searched for specific creative contributions from the publisher and these were quantified for each paper.

Results: The contribution of IP by publishers to the final submitted versions of these ten papers, after peer review had been completed, was zero. Zip. Nada. Zilch. Not one single word, line, or graphical element was contributed by the publisher or the editor acting as their agent. A small number of single words, or forms of expression, were found that were contributed by external peer reviewers. However as these peer reviewers do not sign over copyright to the publisher and are not paid this contribution cannot be considered work for hire and any copyright resides with the original reviewers.

Limitations: This is a small and arguably biased study based on the publications I have to hand. I recommend that other researchers examine their own oeuvre and publish similar analyses so that effects of discipline, age, and venue of publication can be examined. Following such analysis I ask that researchers provide the data via Twitter using the hashtag #publisheripcontrib where I will aggregate it and republish.

Data availability: I regret that the original submissions cannot be provided as the copyright in these articles was transferred after acceptance for publication to the publishers. I cannot provide the editorial reports as these contain material from the publishers for which I do not have re-distribution rights.

The IP argument is sterile and unproductive. We need to discuss services.

The analysis above at its core shows how unhelpful framing this argument around IP is. The fact that publishers do not contribute IP is really not relevant. Publishers do contribute services, the provision of infrastructure, the management of the peer review process, dissemination and indexing, that are crucial for the current system of research dissemination via peer reviewed papers. Without these services papers would not be published and it is therefore clear that these services have to be paid for. What we should be discussing is how best to pay for those services, how to create a sustainable market place in which they can be offered, and what level of service the federal government expects in exchange for the services it is buying.

There is a problem with this. We currently pay for these services in a convoluted fashion which is the result of historical developments. Rather than pay up front for publication services, we currently give away the intellectual property in our papers in exchange for publication. The U.S. federal and state governments then pay for these publication services indirectly by funding libraries to hire access back to our own work. This model made sense when the papers were physically on paper; distribution, aggregation, and printing were major components of the cost. In that world a demand side business model worked well and was appropriate.

In the current world the costs of dissemination and provision of access are as near to zero as makes no difference. The major costs are in the peer review process and preparing the paper in a version that can be made accessible online. That is, we have moved from a world where the incremental cost of dissemination of each copy was dominant, to a world where the first copy costs are dominant and the incremental costs of dissemination after those first copy costs are negligible. Thus we must be clear that we are paying for the important costs of the services required to generate that first web accessible copy, and not that we are supporting unnecessary incremental costs. A functioning market requires, as discussed above, that we have clarity on what is being paid for.
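
The shift from incremental-cost-dominant to first-copy-cost-dominant economics can be made concrete with a toy cost model. All the numbers below are illustrative assumptions of mine, not real publishing costs; the point is only the shape of the arithmetic – once the per-copy cost is effectively zero, the average cost per reader falls with every additional reader, so restricting access buys nothing.

```python
# Toy model: average cost per reader splits into a shared first-copy cost
# and an incremental per-copy dissemination cost.
def avg_cost_per_reader(first_copy_cost, per_copy_cost, readers):
    return first_copy_cost / readers + per_copy_cost

# Print era (assumed figures): real incremental cost for every copy shipped.
print_era = avg_cost_per_reader(2000.0, 5.0, readers=500)    # 4.0 + 5.0 = 9.0

# Web era (assumed figures): incremental cost effectively zero,
# so wider access only dilutes the first-copy cost further.
web_small = avg_cost_per_reader(2000.0, 0.0, readers=500)    # 4.0
web_large = avg_cost_per_reader(2000.0, 0.0, readers=50_000) # 0.04
```

Under the old cost structure a demand-side model rationally charges each reader; under the new one the only cost worth recovering is the first copy, which is precisely what a service-based payment covers.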

In a service based model the whole issue of IP simply goes away. It is clear that the service we would wish to pay for is one in which we generate a research communication product which provides appropriate levels of quality assurance and is as widely accessible and available for any form of use as possible. This ensures that the outputs of the most recent research are available to other researchers, to members of the public, to patients, to doctors, to entrepreneurs and technical innovators, and not least to elected representatives to support informed policy making and legislation. In a service based world there is no logic in artificially reducing access because we pay for the service of publication and the full first copy costs are covered by the purchase of that service.

Thus when we abandon the limited and sterile argument about intellectual property and move to a discussion around service provision we can move from an argument where no-one can win to a framework in which all players are suitably recompensed for their efforts and contributions, whether or not those contributions generate IP in the legal sense, and at the same time we can optimise the potential for the public investment in research to be fully exploited.

HR3699 prohibits federal agencies from supporting publishers to move to a transparent service based model

The most effective means of moving to a service based business model would be for U.S. federal agencies, as the major funders of global research, to work with publishers to assure them that money will be available for the support of publication services for federally funded researchers. This will require some money to be put aside. The UK’s Wellcome Trust estimates that they expect to spend approximately 1.5% of total research funding on publication services. This is a significant sum, but not an overly large proportion of the whole. It should also be remembered that governments, federal and state, are already paying these costs indirectly through overhead charges and direct support to research institutions via educational and regional grants. While there will be additional centralised expenditure over the transitional period, in the longer term this is at worst a zero-sum game. Publishers are currently viable, indeed highly profitable. In the first instance service prices can be set so that the same total sum of money flows to them.
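
As a worked example of the Wellcome Trust figure quoted above: applying the roughly 1.5% estimate to a hypothetical round-number research budget (an illustrative assumption of mine, not any agency’s actual budget) gives the scale of the sum that would need to be set aside.

```python
# Worked arithmetic for the ~1.5% Wellcome Trust estimate.
research_budget = 1_000_000_000   # hypothetical $1bn annual research spend
publication_share = 0.015         # ~1.5% per the Wellcome Trust estimate
publication_services = research_budget * publication_share  # $15m per year
```

Significant, but small against the total – and, as noted above, much of it is money already flowing to publishers indirectly through overheads and library budgets.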

The challenge is the transitional period. The best way to manage this would be for federal agencies to be able to guarantee to publishers that their funded researchers would be moving to the new system over a defined time frame. The most straightforward way to do this would be for the agencies to have a published program over a number of years through which the publication of research outputs via the purchase of appropriate services would be made mandatory. This could also provide confidence to the publishers by defining the service level agreements that the federal agencies would require, and guarantee a predictable income stream over the course of the transition.

This would require agencies working with publishers and their research communities to define the timeframes, guarantees, and service level agreements that would be put in place. It would require mandates from the federal agencies as the main guarantor of that process. The Research Works Act prohibits any such process. In doing so it actively prevents publishers from moving towards business models that are appropriate for today’s world. It will stifle innovation and new entrants to the market by creating uncertainty and continuing the current obfuscation of first copy costs with dissemination costs. In doing so it will damage the very publishers that support it by legislatively sustaining an out of date business model that is no longer fit for purpose.

Like General Motors, or perhaps more analogously, Lehman Brothers, the incumbent publishers are trapped in a business model that cannot be sustained in the long term. The problem for publishers is that their business model is predicated on charging for the dissemination and access costs that are disappearing and not explicitly charging for the costs that really matter. Hiding the cost of one thing in a charge for another is never a good long term business strategy. HR3699 will simply prop them up for a little longer, ultimately leading to a bigger crash when it comes. The alternative is a managed transition to a better set of business models which can simultaneously provide a better return on investment for the taxpayer.

We recognise the importance of the services that scholarly publishers provide. We want to pay publishers for the services they provide because we want those services to continue to be available and to improve over time. Help us to help them make that change. Drop the Research Works Act.

Yours sincerely

Cameron Neylon


An Open Letter to David Willetts: A bold step towards opening British research

Open Access logo (image via Wikipedia)

On the 8th December David Willetts, the Minister of State for Universities and Science, announced new UK government strategies to develop innovation and research to support growth. The whole document is available online and you can see more analysis at the links at the bottom of the post. A key aspect for Open Access advocates was the section that discussed a wholesale move by the UK to an author-pays system for freely accessible research literature, with SCOAP3 raised as a possible model. The report refers not to Open Access, but to freely accessible content. I think this is missing a massive opportunity for Britain to take a serious lead in defining the future direction of scholarly communication. That’s the case I attempt to lay out in this open letter. This post should be read in the context of my usual disclaimer.

Minister of State for Universities and Science

Department of Business Innovation and Skills

Dear Mr Willetts,

I am writing in the first instance to congratulate you on your stance on developing routes to freely accessible research outputs. I cannot say I am a great fan of many current government positions and I might have wished for greater protection of the UK science budget, but in times of resource constraint for research I believe your focus on ensuring the efficiency of access to and exploitation of research outputs in its widest sense is the right one.

The position you have articulated offers a real opportunity for the UK to take a lead in this area. But along with the opportunities there are risks, and those risks could entrench existing inefficiencies of our scholarly communication system. They could also reduce the value for money that the public purse, and it will be the public purse one way or another, gets for its investment. In our current circumstances this would be unfortunate. I would therefore ask you to consider the following as the implementation pathway for this policy is developed.

Firstly, the research community will be buying a service. This is a significant change from the current system where the community buys a product, the published journal. The purchasing exercise should be seen in this light and best practice in service procurement applied.

Secondly, the nature of this service must be made clear. The service that is being provided must provide for any and all downstream uses, including commercial use, text mining, indeed any use that might be developed at some point in the future. We are paying for this service and we must dictate its terms. Incumbent publishers will say in response that they need to retain commercial rights, or text mining rights, to ensure their viability, as indeed they have done in response to the Hargreaves Review.

This, not to put too fine a point on it, is hogwash. PLoS and BioMedCentral both operate financially viable operations in which no downstream rights beyond that of appropriate attribution are retained by the publishers, and where the author charges are lower in price than many of the notionally equivalent, but actually far more limited, offerings of more traditional publishers. High quality scholarly communication can be supported by reasonable author charges without any need for publishers to retain rights beyond those protected by their trademarks. An effective market place could therefore be expected to bring the average costs of this form of scholarly communications down.

The reason for supporting a system that demands that any downstream use of the communication be enabled is that we need innovation and development within the publishing system as well as innovation and development as a result of its content. Our scholarship is currently being held back by a morass of retained rights that prevent the development of research projects, of new technology startups and potentially new industries. The government consultation document of 14 December on the Hargreaves report explicitly notes that enabling downstream uses of content, and scholarly content in particular, can support new economic activity. It can also support new scholarly activity. The exploitation of our research outputs requires new approaches to indexing, mining, and parsing the literature. The shame of our current system is that much of this is possible today. The technology exists but is prevented from being exploited at scale by the logistical impossibility of clearing the required rights. These new approaches will require money and it is entirely appropriate, indeed desirable, that some of this work therefore occurs in the private sector. Experimentation will require both freedom to act as well as freedom to develop new business models. Our content, its accessibility, and its reusability must support this.

Finally I ask you to look beyond the traditional scholarly publishing industry to the range of experimentation that is occurring globally in academic spaces, non-profits, and commercial endeavours. The potential leaps in functionality as well as the potential cost reductions are enormous. We need to work to encourage this experimentation and develop a diverse and vibrant market which both provides the quality assurance and stability that we are used to while encouraging technical experimentation and the improvement of business models. What we don’t need is a five or ten year deal that cements in existing players, systems, and practices.

Your government’s philosophy is based around the effectiveness of markets. The recent history of major government procurement exercises is not a glorious one. This is one we should work to get right. We should take our time to do so and ensure a deal that delivers on its promise. The vision of a Britain that is led by innovation and development supported by a vibrant and globally leading research community is, I believe, the right one. Please ensure that this innovation isn’t cut off at the knees by agreeing terms that prevent our research communication tools being re-used to improve the effectiveness of that communication. And please ensure that the process of procuring these services is one that supports innovation and development in scholarly communications itself.

Yours truly,

Cameron Neylon

PLoS (and NPG) redefine the scholarly publishing landscape

Open Access logo, converted into svg, designed...
Image via Wikipedia

Nature Publishing Group yesterday announced a new venture, very closely modelled on the success of PLoS ONE, titled Scientific Reports. Others have started to cover the details and some implications so I won’t do that here. I think there are three big issues: What does this tell us about the state of Open Access? What are the risks and possibilities for NPG? And why oh why does NPG keep insisting on a non-commercial licence? Those merit separate posts, so here I’m just going to deal with the first issue. And I think this is really big.

[I know it bores people, hell it bores me, but the non-commercial licence is a big issue. It is an even bigger issue here because this launch may define the ground rules for future scholarly communication. Open Access with a non-commercial licence actually achieves very little either for the community, or indeed for NPG, except perhaps as a cynical gesture. The following discussion really assumes that we can win the argument with NPG to change those terms. If we can the future is very interesting indeed.]

The Open Access movement has really been defined by two strands of approach. The “Green Road” involves self archiving of pre-prints or published articles in subscription journals as a means of providing access. It has had its successes, perhaps more so in the humanities, with deposition mandates becoming increasingly common both at the institutional level and the level of funders. The other approach, the “Gold Road”, is for all intents and purposes defined by commercial and non-profit publishers based on a business model of article processing charges (APCs) to authors and making the published articles freely available at a publisher website. There is a thriving community of “shoe-string business model” journals publishing small numbers of articles without processing charges but, in terms of articles published, OA publishing is dominated by BioMedCentral, the pioneers in this area, now owned by Springer, Public Library of Science, and on a smaller scale Hindawi. This approach has gained more traction in the sciences, particularly the biological sciences.

From my perspective yesterday’s announcement means that for the sciences, the argument for Gold Open Access as the default publication mechanism has effectively been settled. Furthermore the future of most scholarly publishing will be in publication venues that place no value on a subjective assessment of “importance”. Those are big claims, but NPG have made a bold and possibly decisive move, in an environment where PLoS ONE was already starting to dominate some fields of science.

PLoS ONE was already becoming a default publication venue. A standard path for getting a paper published is, have a punt at Cell/Nature/Science, maybe a go at one of the “nearly top tier” journals, and then head straight for PLoS ONE, in some cases with the technical assessments already in hand. However in some fields, particularly chemistry, the PLoS brand wasn’t enough to be attractive against the strong traditional pull of American Chemical Society or Royal Society of Chemistry journals and Angewandte Chemie. Scientific Reports changes this because of the association with the Nature brand. If I were the ACS I’d be very worried this morning.

The announcement will also be scaring the hell out of those publishers who have a lot of separate, lower tier journals. The problem for publication business models has never been with the top tier, that can be made to work because people want to pay for prestige (whether we can afford it in the long term is a separate question). The problem has been the volume end of the market. I back Dorothea Salo’s prediction [and again] that 2011/12 would see the big publishers looking very closely at their catalogue of 100s or 1000s of low yield, low volume, low prestige journals and see the beginning of mass closures, simply to keep down subscription increases that academic libraries can no longer pay for. Aggregated large scale journals with streamlined operating and peer review procedures, simplified and more objective selection criteria, and APC supported business models make a lot of sense in this market. Elsevier, Wiley, Springer (and to a certain extent BMC) have just lost the start in the race to dominate what may become the only viable market in the medium term.

With two big players now in this market there will be real competition. Others have suggested [see Jason Priem‘s comment] this will be on the basis of services and information. This might be true in the longer term but in the short to medium term it will be on two issues: brand and price. The choice of name is a risk for NPG: the Nature brand is crucial to the success of the venture, but there’s a risk of diluting the brand, which is NPG’s major asset. That the APC for Scientific Reports has been set identically to PLoS ONE’s is instructive. I have previously argued that APC-driven business models will be the most effective way of forcing down publication costs and I would expect to see competition develop here. I hope we might soon see a third player in this space to drive effective competition.

At the end of the day what this means is that there are now seriously credible options for publishing in Open Access venues (assuming we win the licensing argument) across the sciences, that funders now support Article Processing Charges, and that there is really no longer any reason to publish in that obscure subscription journal that no-one actually read anyway. The dream of a universal database of freely accessible research outputs is that much closer to being within our reach.

Above all, this means that PLoS in particular has succeeded in its aim of making Gold Open Access publication a credible default option. The founders and team at PLoS set out with the aim of changing the publication landscape. PLoS ONE was a radical and daring step at the time, which they pulled off. The other people who experimented in this space also deserve credit, but it was PLoS ONE in particular that found the sweet spot between credibility and pushing the envelope. I hope that those in the office are cracking open some bubbly today. But not too much. For the first time there is now some serious competition and it’s going to be tough to keep up. There remains a lot more work to be done (assuming we can sort out the licence).

Full disclosure: I am an academic editor for PLoS ONE, editor in chief of the BioMedCentral journal Open Research Computation, and have advised PLoS, BMC, and NPG in a non-paid capacity on a variety of issues that relate closely to this post.


Open Research Computation: An ordinary journal with extraordinary aims.

I spend a lot of my time arguing that many of the problems in the research community are caused by journals. We have too many, they are an ineffective means of communicating the important bits of research, and as a filter they are inefficient and misleading. Today I am very happy to be publicly launching the call for papers for a new journal. How do I reconcile these two statements?

Computation lies at the heart of all modern research, whether it is the massive scale of LHC data analysis or the use of Excel to graph a small data set. From the hundreds of thousands of web users that contribute to Galaxy Zoo to the solitary chemist reprocessing an NMR spectrum, we rely absolutely on billions of lines of code that we never think to look at. Some of this code is in massive commercial applications used by hundreds of millions of people, well beyond the research community. Sometimes it is a few lines of shell script or Perl that will only ever be used by the one person who wrote it. At both extremes we rely on the code.

We also rely on the people who write, develop, design, test, and deploy this code. In the context of many research communities the rewards for focusing on software development, of becoming the domain expert, are limited. And the cost in terms of time and resource to build software of the highest quality, using the best of modern development techniques, is not repaid in ways that advance a researcher’s career. The bottom line is that researchers need papers to advance, and they need papers in journals that are highly regarded, and (say it softly) have respectable impact factors. I don’t like it. Many others don’t like it. But that is the reality on the ground today, and we do younger researchers in particular a disservice if we pretend it is not the case.

Open Research Computation is a journal that seeks to directly address the issues that computational researchers have. It is, at its heart, a conventional peer reviewed journal dedicated to papers that discuss specific pieces of software or services. A few journals now exist in this space that either publish software articles or have a focus on software. Where ORC will differ is in its intense focus on the standards to which software is developed, the reproducibility of the results it generates, and the accessibility of the software to analysis, critique and re-use.

The submission criteria for ORC Software Articles are stringent. The source code must be available, on an appropriate public repository under an OSI compliant license. Running code, in the form of executables, or an instance of a service must be made available. Documentation of the code will be expected to a very high standard, consistent with best practice in the language and research domain, and it must cover all public methods and classes. Similarly code testing must be in place covering, by default, 100% of the code. Finally all the claims, use cases, and figures in the paper must have associated with them test data, with examples of both input data and the outputs expected.
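To make the test-data requirement concrete, here is a minimal sketch of what backing a claim with example input and expected output might look like. The function, names, and data are hypothetical and chosen only for illustration; ORC does not prescribe a particular language or test framework.

```python
# Hypothetical example: a claim in a paper ("the tool reports a GC content
# of 0.5 for the sequence ACGT") shipped together with the test data that
# backs it up, so reviewers and readers can re-run the check.

def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence."""
    if not sequence:
        raise ValueError("empty sequence")
    gc = sum(1 for base in sequence.upper() if base in "GC")
    return gc / len(sequence)

# Example input and expected output associated with the claim in the paper.
EXAMPLE_INPUT = "ACGT"
EXPECTED_OUTPUT = 0.5

assert gc_content(EXAMPLE_INPUT) == EXPECTED_OUTPUT
```

The point is not the framework but the pairing: every figure and use case in the paper has machine-checkable input and output sitting next to the code.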

The primary consideration for publication in ORC is that your code must be capable of being used, re-purposed, understood, and efficiently built on. Your work must be reproducible. In short, we expect the computational work published in ORC to deliver at the level that is expected in experimental research.

In research we build on the work of those who have gone before. Computational research has always had the potential to deliver on these goals to a level that experimental work will always struggle to match, yet to date it has not reliably delivered on that promise. The aim of ORC is to make this promise a reality by providing a venue where computational development work of the highest quality can be shared, and can be celebrated. To provide a venue that will stand for the highest standards in research computation and where developers, whether they see themselves more as software engineers or as researchers who code, will be proud to publish descriptions of their work.

These are ambitious goals and getting the technical details right will be challenging. We have assembled an outstanding editorial board, but we are all human, and we don’t expect to get it all right first time. We will be doing our testing and development out in the open as we develop the journal and will welcome comments, ideas, and criticisms to editorial@openresearchcomputation.com. If you feel your work doesn’t quite fit the guidelines as I’ve described them above, get in touch and we will work with you to get it there. Our aim, at the end of the day, is to help the research developer to build better software and to apply better development practice. We can also learn from your experiences, and wider-ranging review and proposal papers are also welcome.

In the end I was persuaded to start yet another journal only because there was an opportunity to do something extraordinary within that framework. An opportunity to make a real difference to the recognition and quality of research computation. In the way it conducts peer review, manages papers, and makes them available Open Research Computation will be a very ordinary journal. We aim for its impact to be anything but.

Other related posts:

Jan Aerts: Open Research Computation: A new journal from BioMedCentral


What would scholarly communications look like if we invented it today?

Picture1
Image by cameronneylon via Flickr

I’ve largely stolen the title of this post from Daniel Mietchen because it helped me to frame the issues. I’m giving an informal talk this afternoon and will, as I frequently do, use this to think through what I want to say. Needless to say this whole post is built to a very large extent on the contributions and ideas of others that are not adequately credited in the text here.

If we imagine what the specification for building a scholarly communications system would look like there are some fairly obvious things we would want it to enable. Registration of ideas, data or other outputs for the purpose of assigning credit and priority to the right people is high on everyone’s list. While researchers tend not to think too much about it, those concerned with the long term availability of research outputs would also place archival and safekeeping high on the list as well. I don’t imagine it will come as any surprise that I would rate the ability to re-use, replicate, and re-purpose outputs very highly as well. And, although I won’t cover it in this post, an effective scholarly communications system for the 21st century would need to enable and support public and stakeholder engagement. Finally this specification document would need to emphasise that the system will support discovery and filtering tools so that users can find the content they are looking for in a huge and diverse volume of available material.

So: filtering, archival, re-usability, and registration. Our current communications system, based almost purely on journals with pre-publication peer review, doesn’t do too badly at archival, although the question of who is responsible for actually doing the archiving, and hence paying for it, doesn’t always seem to have a clear answer. Nonetheless the standards and processes for archiving paper copies are better established, and probably better followed in many cases, than those for digital materials in general, and certainly for material on the open web.

The current system also does reasonably well on registration, providing a firm date of submission, author lists, and increasingly descriptions of the contributions of those authors. Indeed the system defines the registration of contributions for the purpose of professional career advancement and funding decisions within the research community. It is a clear and well understood system with a range of community expectations and standards around it. Of course this is circular as the career progression process feeds the system and the system feeds career progression. It is also to some extent breaking down as wider measures of “impact” become important. However for the moment it is an area where the incumbent has clear advantages over any new system, around which we would need to grow new community standards, expectations, and norms.

It is on re-usability and replication where our current system really falls down. Access and rights are a big issue here, but ones that we are gradually pushing back. The real issues are much more fundamental. It is essentially assumed, in my experience, by most researchers that a paper will not contain sufficient information to replicate an experiment or analysis. Just consider that. Our primary means of communication, in a philosophical system that rests almost entirely on reproducibility, does not enable even simple replication of results. A lot of this is down to the boundaries created by the mindset of a printed multi-page article. Mechanisms to publish methods, detailed laboratory records, or software are limited, often leading to a lack of care in keeping and annotating such records. After all if it isn’t going in the paper why bother looking after it?

A key advantage of the web here is that we can publish a lot more at limited cost and we can publish a much greater diversity of objects. In principle we can solve the “missing information” problem by simply making more of the record available. However those important pieces of information need to be captured in the first place. Because they aren’t currently valued, because they don’t go in the paper, they often aren’t recorded in a systematic way that makes it easy to ultimately publish them. Open Notebook Science, with its focus on publishing everything immediately, is one approach to solving this problem but it’s not for everyone, and it carries its own overhead. The key problem is that recording more, and communicating it effectively, requires work over and above what most of us are doing today. That work is not rewarded in the current system. This may change over time if, as I have argued, we move to metrics based on re-use, but in the meantime we also need much better, easier, and ideally near-zero burden tools that make it easier to capture all of this information and publish it when we choose, in a useful form.

Of course, even with the perfect tools, if we start to publish a much greater portion of the research record then we will swamp researchers already struggling to keep up. We will need effective ways to filter this material down to reduce the volume we have to deal with. Arguably the current system is an effective filter. It almost certainly reduces the volume and rate at which material is published. Of all the research that is done, some proportion is deemed “publishable” by those who have done it, a small portion of that research is then incorporated into a draft paper, and some proportion of those papers are ultimately accepted for publication. Up until 20 years ago, when the resource pinch point was the decision of whether or not to publish something, this is exactly what you would want. The question of whether it is an effective filter, that is, whether it actually filters the right things out, is somewhat more controversial. I would say the evidence for that is weak.

When publication and distribution were the expensive part, that was the logical place to make the decision. Now that these steps are cheap, the expensive part of the process is either peer review, the traditional process of making a decision prior to publication, or conversely the curation and filtering after publication that is seen more widely on the open web. As I have argued, I believe that using the patterns of the web will ultimately be a more effective means of enabling users to discover the right information for their needs. We should publish more, much more and much more diversely, but we also need to build effective tools for filtering and discovering the right pieces of information. Clearly this also requires work, perhaps more than we are used to doing.

An imaginary solution

So what might this imaginary system that we would design look like? I’ve written before about both key aspects of this. Firstly, I believe we need recording systems that, as far as possible, record and publish the creation of objects, be they samples, data, or tools. As far as possible these should make a reliable, time-stamped, attributable record of the creation of each object as a byproduct of what the researcher needs to do anyway. A simple example is a label printer that, as a byproduct of printing off a label, makes a record of who, what, and when, publishing this simultaneously to a public or private feed.
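The label printer idea can be sketched in a few lines: an ordinary action (registering a new sample) produces, as a side effect, a timestamped, attributable entry in a feed. The function name, field names, and JSON-lines feed format here are all invented for illustration; a real system might emit an Atom feed or post to a web service instead.

```python
# A minimal sketch: as a byproduct of an ordinary lab action (printing a
# label for a new sample, say), emit a timestamped, attributable record
# to a feed. Here the "feed" is just a JSON-lines file.
import json
from datetime import datetime, timezone

def register_object(who: str, what: str, feed_path: str = "lab_feed.jsonl") -> dict:
    """Record who created what, and when, appending the entry to a feed."""
    entry = {
        "who": who,
        "what": what,
        "when": datetime.now(timezone.utc).isoformat(),
    }
    with open(feed_path, "a") as feed:
        feed.write(json.dumps(entry) + "\n")
    return entry

# Printing a label for a new sample also registers it:
register_object("cameron", "sample-0042")
```

The researcher never sees any of this: the record exists because the label does.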

Publishing rapidly is a good approach, not just for ideological reasons of openness but also for some very pragmatic ones. It is easier to publish at the source than to remember to go back and do it later; things that aren’t done immediately are almost invariably forgotten or lost. Secondly, rapid publication has the potential both to efficiently enable re-use and to reduce scooping risks by providing a time-stamped citable record. This of course would require people to cite these records, and for those citations to be valued as a contribution, requiring a move away from considering the paper as the only form of valid research output (see also Michael Nielsen‘s interview with me).

It isn’t enough though, just to publish the objects themselves. We also need to be able to understand the relationship between them. In a semantic web sense this means creating the links between objects, recording the context in which they were created, what were their inputs and outputs. I have alluded a couple of times in the past to the OREChem Experimental Ontology and I think this is potentially a very powerful way of handling these kind of connections in a general way. In many cases, particularly in computational research, recording workflows or generating batch and log files could serve the same purpose, as long as a general vocabulary could be agreed to make this exchangeable.
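The idea of recording inputs and outputs as links between objects can be illustrated with plain subject-predicate-object triples. The vocabulary and object names below are invented for the sketch; they are not the OREChem Experimental Ontology or any other real vocabulary.

```python
# A sketch of recording relationships between research objects as simple
# subject-predicate-object triples. The predicates are made up for
# illustration; a real system would use an agreed vocabulary.
links = [
    ("sample-0042", "wasInputTo", "nmr-run-7"),
    ("nmr-run-7", "produced", "spectrum-7.raw"),
    ("spectrum-7.raw", "wasInputTo", "analysis-script-v2"),
    ("analysis-script-v2", "produced", "figure-3.png"),
]

def outputs_influenced_by(obj, triples):
    """Follow links forward to find everything downstream of an object."""
    downstream = set()
    frontier = {obj}
    while frontier:
        reached = {o for (s, p, o) in triples if s in frontier}
        new = reached - downstream
        downstream |= new
        frontier = new
    return downstream
```

Even this toy graph shows the payoff: once the links exist, the influence of a single sample or script on later outputs can be computed rather than guessed.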

As these objects get linked together they will form a network, both within and across projects and research groups, providing the kind of information that makes Google work, a network of citations and links that make it possible to directly measure the impact of a single dataset, idea, piece of software, or experimental method through its influence over other work. This has real potential to help solve both the discovery problem and the filtering problem. Bottom line, Google is pretty good at finding relevant text and they’re working hard on other forms of media. Research will have some special edges but can be expected in many ways to fit patterns that mean tools for the consumer web will work, particularly as more social components get pulled into the mix.

On the rare occasions when it is worth pulling together a whole story, for a thesis, or a paper, authors would then aggregate objects together, along with text and good visuals to present the data. The telling of a story then becomes a special event, perhaps one worthy of peer review in its traditional form. The forming of a “paper” is actually no more than providing new links, adding grist to the search and discovery mill, but it can retain its place as a high status object, merely losing its role as the only object worth considering.

So in short, publish fragments, comprehensively and rapidly. Weave those into a wider web of research communication, and from time to time put in the larger effort required to tell a more comprehensive story. This requires tools that are hard to build, standards that are hard to agree, and cultural change that at times seems like spitting into a hurricane. Progress is being made, in many places and in many ways, but how can we take this forward today?

Practical steps for today

I want to write more about these ideas in the future but here I’ll just sketch out a simple scenario that I hope can be usefully implemented locally but provide a generic framework to build out without necessarily requiring a massive agreement on standards.

The first step is simple: make a record, ideally an address on the web, for everything we create in the research process. For data and software, just the files themselves on a hard disk are a good start. Pushing them to some sort of web storage, be it a blog, github, an institutional repository, or some dedicated data storage service, is even better because it makes step two easy.

Step two is to create feeds that list all of these objects, their addresses, and as much standard metadata as possible; who and when would be a good start. I would make these open by choice, mainly because dealing with feed security is a pain, but this would still work behind a firewall.

Step three gets slightly harder. Where possible configure your systems so that inputs can always be selected from a user-configurable feed. Where possible automate the pushing of outputs to your chosen storage systems so that new objects are automatically registered and new feeds created.

This is extraordinarily simple conceptually: create feeds, use them as inputs for processes. It’s not so straightforward to build such a thing into an existing tool or framework, but it doesn’t need to be terribly difficult either. Nor does it need to bother the user: feeds should be automatically created, and presented to the user as drop-down menus.
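The three steps above can be sketched end to end. The feed format here (one JSON object per line with "who", "what", and "when" fields) and the function names are assumptions made for illustration only; the point is just how little machinery "read a feed, offer its entries as inputs" actually requires.

```python
# A sketch of step three: consume a feed of registered objects and offer
# its entries as candidate inputs to a processing step.
import json

# A feed as it might have been written by an earlier registration step:
with open("example_feed.jsonl", "w") as feed:
    feed.write('{"who": "cameron", "what": "spectrum-7.raw", "when": "2011-01-10"}\n')
    feed.write('{"who": "jean", "what": "gel-image-2.png", "when": "2011-01-11"}\n')

def read_feed(feed_path):
    """Parse a JSON-lines feed into a list of object records."""
    with open(feed_path) as feed:
        return [json.loads(line) for line in feed if line.strip()]

def choose_inputs(entries, who=None):
    """Offer feed entries as candidate inputs, optionally filtered by creator."""
    return [entry["what"] for entry in entries if who is None or entry["who"] == who]

entries = read_feed("example_feed.jsonl")
```

A tool would populate its drop-down menu from `choose_inputs(entries)` and, when it finishes, append its own outputs to a feed of the same shape, closing the loop.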

The step beyond this, creating a standard framework for describing the relationships between all of these objects, is much harder. Not because it’s technically difficult, but because it requires agreement on standards for how to describe those relationships. This is do-able and I’m very excited by the work at Southampton on the OREChem Experimental Ontology, but the social problems are harder. Others prefer the Open Provenance Model or argue that workflows are the way to manage this information. Getting agreement on standards is hard, particularly if we’re trying to maximise their effective coverage, but if we’re going to build a computable record of science we’re going to have to tackle that problem. If we can crack it, and get coverage of the records via a compatible set of models that tell us how things are related, then I think we will be well placed to solve the cultural problem of actually getting people to use them.


P ≠ NP and the future of peer review

Decomposition method (constraint satisfaction)
Image via Wikipedia

“We demonstrate the separation of the complexity class NP from its subclass P. Throughout our proof, we observe that the ability to compute a property on structures in polynomial time is intimately related to the statistical notions of conditional independence and sufficient statistics. The presence of conditional independencies manifests in the form of economical parametrizations of the joint distribution of covariates. In order to apply this analysis to the space of solutions of random constraint satisfaction problems, we utilize and expand upon ideas from several fields spanning logic, statistics, graphical models, random ensembles, and statistical physics.”

Vinay Deolalikar [pdf]

No. I have no idea either, and the rest of the document just gets more confusing for a non-mathematician. Nonetheless the online maths community has lit up with excitement as this document, claiming to resolve one of the major outstanding problems in maths, circulated. And in the process we are seeing online collaborative post-publication peer review take off.

It has become easy to say that review of research after it has been published doesn’t work. Many examples have failed, or been only partially successful. Most journals with commenting systems still get relatively few comments on the average paper. Open peer review trials have generally been judged a failure. And so we stick with traditional pre-publication peer review despite the lack of any credible evidence that it does anything except cost a few billion pounds a year.

Yesterday, Bill Hooker, not exactly a nay-sayer when it comes to using the social web to make research more effective wrote:

“…when you get into “likes” etc, to me that’s post-publication review — in other words, a filter. I love the idea, but a glance at PLoS journals (and other experiments) will show that it hasn’t taken off: people just don’t interact with the research literature (yet?) in a way that makes social filtering effective.”

But actually the picture isn’t so negative. We are starting to see examples of post-publication peer review and to see it radically out-perform traditional pre-publication peer review. The rapid demolition [1, 2, 3] of the JACS hydride oxidation paper last year (not least pointing out that the result wasn’t even novel) demonstrated that the chemical blogosphere was more effective than the peer review of one of the premier chemistry journals. More recently 23andMe issued a detailed, and at least from an outside perspective devastating, peer review (with an attempt at replication!) of a widely reported Science paper describing the identification of genes associated with longevity. This followed detailed critiques from a number of online writers.

These, though, were reviews of published papers, demonstrating that a post-publication approach can work, but not showing it working for an “informally published” piece of research such as a blog post or other online posting. In the case of this new mathematical proof, the author, Vinay Deolalikar, apparently took the standard approach in maths of sending a pre-print to a number of experts in the field for comments and criticisms. The paper is not in the ArXiv and was in fact made public by one of the email correspondents. The rumours then spread like wildfire, with widespread media reporting and widespread online commentary.

Some of that commentary was expert and well informed. First, a series of posts appeared stating that the proof is “credible”; that is, that it is worth deeper consideration and the time of experts to look for holes. There appears to be widespread skepticism that the proof will stand, including a $200,000 bet from Scott Aaronson, but also a widespread view that it is nonetheless useful, that it will progress the field in a helpful way even if it is wrong.

After this first round, there have been summaries of the proof, and now the identification of potential issues is occurring (see RJLipton for a great summary). As far as I can tell these issues are potentially extremely subtle and will require the attention of the best domain experts to resolve. In a couple of cases these experts have already potentially “patched” the problem, adding their own expertise to contribute to the proof. And in the last couple of hours as Michael Nielsen pointed out to me there is the beginning of a more organised collaboration to check through the paper.

This is collaborative, and positive peer review, and it is happening at web scale. I suspect that there are relatively few experts in the area who aren’t spending some of their time on this problem this week. In the market for expert attention this proof is buying big, as it should be. An important problem is getting a good going over and being tested, possibly to destruction, in a much more efficient manner than could possibly be done by traditional peer review.

There are a number of objections to seeing this as a generalizable to other research problems and fields. Firstly, maths has a strong pre-publication communication and review structure which has been strengthened over the years by the success of the ArXiv. Moreover there is a culture of much higher standards of peer review in maths, review which can take years to complete. Both of these encourage circulation of drafts to a wider community than in most other disciplines, priming the community for distributed review to take place.

The other argument is that only high profile work will get this attention; only high profile work will get reviewed at this level, possibly at all. Actually I think this is a good thing. Most papers are never cited, so why should they suck up the resource required to review them? Whether a paper, published or not, is useful to someone, somewhere, is not something that can be determined by one or two reviewers. Whether it is useful to you is something that only you can decide. The only person competent to review which papers you should look at in detail is you. Sorry.

Many of us have argued for some time that post-publication peer review with little or no pre-publication review is the way forward. Many have argued against this on practical grounds that we simply can’t get it to happen, there is no motivation for people to review work that has already been published. What I think this proof, and the other stories of online review tell us is that these forms of review will grow of their own accord, particularly around work that is high profile. My hope is that this will start to create an ecosystem where this type of commenting and review is seen as valuable. That would be a more positive route than the other alternative, which seems to be a wholesale breakdown of the current system as the workloads rise too high and the willingness of people to contribute drops.

The argument always brought forward for peer review is that it improves papers. What interests me about the online activity around Deolalikar’s paper is that there is a positive attitude. By finding the problems, the proof can be improved, and new insights found, even if the overall claim is wrong. If we bring a positive attitude to making peer review work more effectively and efficiently then perhaps we can find a good route to improving the system for everyone.
