Added Value: I do not think those words mean what you think they mean

There are two major strands to the position that traditional publishers have taken in justifying the process by which they will make the, now inevitable, transition to a system supporting Open Access. The first of these is that the transition will cost “more money”. The exact costs are not clear but the, broadly reasonable, assumption is that there needs to be transitional funding available to support what will clearly be a mixed system over some transitional period. The argument of course is over how much money and where it will come from, as well as an issue that hasn’t yet been publicly broached: how long will it last? Expect lots of positioning on this over the coming months with statements about “average paper costs” and “reasonable time frames”, with incumbent subscription publishers targeting figures of around $2,500-5,000 and ten years respectively, and those on my side of the fence suggesting figures of around $1,500 and two years. This will be fun to watch but the key will be to see where this money comes from (and what subsequently gets cut), the mechanisms put in place to release this “extra” money, and the way in which they are set up so as to wind down and provide downward price pressure.

The second arm of the publisher argument has been that they provide “added value” over what the scholarly community puts into the publication process. It has become a common refrain of the incumbent subscription publishers that they are not doing enough to explain this added value. Most recently David Crotty has posted at Scholarly Kitchen saying that this was a core theme of the recent SSP meeting. This value exists, but clearly we disagree on its magnitude. The problem is that we never see any actual figures. But I think there are some recent numbers that can help us put some bounds on what this added value really is, and ironically they have been provided by the publisher associations in their efforts to head off six month embargo periods.

When we talk about added value we can posit some imaginary “real” value but this is not really a useful number – there is no way we can determine it. What we can do is talk about realisable value, i.e. the amount that the market is prepared to pay for the additional functionality that is being provided. I don’t think we are in a position to pin that number down precisely, and clearly it will differ between publishers, disciplines, and workflows, but what I want to do is attempt to pin down some points which I think help to bound it, both from the provider and the consumer side. In doing this I will use a few figures and reports as well as place an explicit interpretation on the actions of various parties. The key data points I want to use are as follows:

  1. All publisher associations and most incumbent publishers have actively campaigned against open access mandates that would make the final refereed version of a scholarly article, prior to typesetting, publication, indexing, and archival, available online in any form either immediately or within six months after publication. The Publishers Association (UK) and ALPSP are both on record as stating that such a mandate would be “unsustainable” and most recently that it would bankrupt publishers.
  2. In a survey of research libraries run by ALPSP (although there are a series of concerns that have to be raised about its methodology) a significant proportion of libraries stated that they would cut some subscriptions if the majority of research articles were available online six months after formal publication. The survey states that most respondents appeared to assume that the freely available version would be the original author version, i.e. not the version that had been peer reviewed.
  3. There are multiple examples of financially viable publishing houses running a pure Open Access programme with average author charges of around $1500. These are concentrated in the life and medical sciences where there is both significant funding and no existing culture of pre-print archives.
  4. The SCOAP3 project has created a formal journal publication framework which will provide open access to peer reviewed papers for a community that does have a strong pre-print culture utilising the ArXiv.

Let us start at the top. Publishers actively campaign against a reduction of embargo periods. This makes it clear that they do not believe that the product they provide, in transforming the refereed version of a paper into the published version, has sufficient value that their existing customers will pay for it at the existing price. That is remarkable and a frightening hole at the centre of our current model. The service providers can only provide sufficient added value to justify the current price if they additionally restrict access to the “non-added-value” version. A supplier that was confident about the value that they add would have no such issues, indeed they would be proud to compete with this prior version, confident that the additional price they were charging was clearly justified. That they do not should be a concern to all of us, not least the publishers.

Many publishers also seek to restrict access to any prior version, including the author’s original version prior to peer review. These publishers don’t even believe that their management of the peer review process adds sufficient value to justify the price they are charging. This is shocking. The ACS, for instance, has so little faith in the value that it adds that it seeks to control all prior versions of any paper it publishes.

But what of the customer? Well the ALPSP survey, if we take the summary as I have presented it above at face value, suggests that libraries also doubt the value added by publishers. This is more of a quantitative argument, but the fact that some libraries would cancel some subscriptions shows that the community doesn’t believe the current price is worth paying even allowing for a six month delay in access. So broadly speaking we can see that both the current service providers and the current customers do not believe that the costs of the pure service element of subscription based scholarly publication are justified by the value added through this service. Taken together, this means we can put some upper bounds on the value added by publishers.

If we take the approximately $10B currently paid in cash costs to recompense publishers for their work in facilitating scholarly communication, then neither the incumbent subscription publishers nor their current library customers believe that the value added by publishers justifies that cost, absent artificial restrictions on access to the non-value-added version.

This tells us not very much about what the realisable value of this work actually is, but it does provide an upper bound. But what about a lower bound? One approach would be to turn to the services provided to authors by Open Access publishers. These costs are willingly incurred by a paying customer so it is tempting to use them directly as a lower bound. This is probably reasonable in the life and medical sciences, but as we move into other disciplinary areas, such as mathematics, it is clear that this cost level is not seen as attractive enough. In addition the life and medical sciences have no tradition of wide availability of pre-publication versions of papers. That means that for these disciplines the willingness to pay the approximately $1500 average cost of APCs is in part bound up with the wish to make the paper effectively available through recognised outlets. We have not yet separated the value of the original copy from the added value provided by this publishing service. The $1000-1500 mark is however a touchstone that is worth bearing in mind for these disciplines.

To do a fair comparison we would need to find a space where there is a thriving pre-print culture and a demonstrated willingness to pay a defined price for added value, in the form of formal publication, over and above this existing availability. The Sponsoring Consortium for Open Access Publishing in Particle Physics (SCOAP3) is an example of precisely this. The particle physics community have essentially decided unilaterally to assume control of the journals for their area and have put their service requirements out to tender. Unfortunately this means we don’t have the final prices yet, but we will soon, and the executive summary of the working party report suggests a reasonable price range of €1000-2000. If we assume the successful tender comes in at or slightly below the lower end of this range, we see an accepted price for added value, over that already provided by the ArXiv for this disciplinary area, that is not a million miles away from that figure of $1500.

Of course this is before real price competition in this space is factored in. The realisable value is a function of the market and as prices inevitably drop there will be downward pressure on what people are willing to pay. There will also be increasing competition from archives, repositories, and other services that are currently free or near free to use, as they inevitably increase the quality and range of the services they offer. Some of these will mirror the services provided by incumbent publishers.

A reasonable current lower bound for realisable added value by publication service providers is ~$1000 per paper. This is likely to drop as market pressures come to bear and existing archives and repositories seek to provide a wider range of low cost services.

Where does this leave us? Not with a clear numerical value we can ascribe to this added value, but that’s always going to be a moving target. But we can get some sense of the bottom end of the range. It’s currently $1000 or greater, at least in some disciplines, but is likely to go down. It’s also likely to diversify as new providers offer subsets of the services currently offered as one indivisible lump. At the top end, the actions of both customers and service providers suggest they believe that the added value is less than what we currently pay, and that only artificial controls over access to the non-value-added versions justify the current price. What we need is a better articulation of the real value that publishers add and an honest conversation about what we are prepared to pay for it.


They. Just. Don’t. Get. It…

Image: Traffic jam in Delhi (via Wikipedia)

…although some are perhaps starting to see the problems that are going to arise.

Last week I spoke at a Question Time style event held at Oxford University and organised by Simon Benjamin and Victoria Watson called “The Scientific Evolution: Open Science and the Future of Publishing” featuring Tim Gowers (Cambridge), Victor Henning (Mendeley), Alison Mitchell (Nature Publishing Group), Alicia Wise (Elsevier), and Robert Winston (mainly in his role as TV talking head on science issues). You can get a feel for the proceedings from Lucy Pratt’s summary but I want to focus on one specific issue.

As is common for me recently I emphasised the fact that networked research communication needs to be different to what we are used to. I drew a comparison with the early days of the printing press, when one of the first things that happened was that people created facsimiles of hand-written manuscripts. It took hundreds of years for someone to come up with the idea of a newspaper, and to some extent our current use of the network is exactly that – digital facsimiles of paper objects, not truly networked communication.

It’s difficult to predict exactly what form a real networked communication system will take, in much the same way that asking a 16th century printer how newspaper advertising would work would not provide a detailed and accurate answer, but there are some principles of successful network systems that we can see emerging. Effective network systems distribute control and avoid centralisation; they are loosely coupled and distributed. This is very different from the centralised systems for controlling access that we have today.

This is a difficult concept and one that scholarly publishers, for the most part, simply don’t get. This is not particularly surprising because truly disruptive innovation rarely comes from incumbent players. Large and entrenched organisations don’t generally enable the kind of thinking that is required to see the new possibilities. This is seen in publishers’ statements that they are providing “more access than ever before” via “more routes” – but all routes that are under tight centralised control, with control systems that don’t scale. By insisting on centralised control over access publishers are setting themselves up to fail.

Nowhere is this going to play out more starkly than in the area of text mining. Bob Campbell from Wiley-Blackwell walked into this – though few noticed it – with the now familiar claim that “text mining is not a problem because people can ask permission”. Centralised control, failure to appreciate scale, and failure to understand the necessity of distribution and distributed systems. I have with me a device capable of holding the text of perhaps 100,000 papers. It also has the processor power to mine that text. It is my phone. In 2-3 years our phones, hell our watches, will have the capacity not only to hold the world’s literature but also to mine it, in context, for what I want right now. Is Bob Campbell ready for every researcher, indeed every interested person in the world, to come into his office and discuss an agreement for text mining? Because the mining I want to do and the mining that Peter Murray-Rust wants to do will be different, and what I will want to do tomorrow is different from what I want to do today. This kind of personalised mining is going to be the accepted norm of handling information online very soon and will be at the very centre of how we discover the information we need. Google will provide a high quality service for free; subscription based scholarly publishers will charge an arm and a leg for a deeply inferior one – because Google is built to exploit network scale.
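
As a rough sanity check on that capacity claim, here is a back-of-envelope estimate. The per-paper size is my own assumption (roughly 40 KB of plain text per article, no figures or typesetting), not a figure from any publisher:

```python
# Back-of-envelope estimate: how much storage would the plain text of
# 100,000 papers need? The 40 KB per paper figure is an assumption.
PAPERS = 100_000
KB_PER_PAPER = 40  # plain text only, no figures, PDFs, or typesetting

corpus_gb = PAPERS * KB_PER_PAPER / 1_000_000
print(f"~{corpus_gb:.0f} GB of plain text")  # ~4 GB, well within a phone's storage
```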

In fact, the problem of scale has also just played out. Heather Piwowar, writing yesterday, describes a call with six Elsevier staffers to discuss her project and her needs for text mining. Heather of course now has to have this same conversation with Wiley, NPG, ACS, and all the other subscription based publishers, who will no doubt demand different conditions, creating a nightmare patchwork of different levels of access to different parts of the corpus. But the bit I want to draw out is at the bottom of the post, where Heather describes the concerns of Alicia Wise:

At the end of the call, I stated that I’d like to blog the call… it was quickly agreed that was fine. Alicia mentioned her only hesitation was that she might be overwhelmed by requests from others who also want text mining access. Reasonable.

Except that it isn’t. It’s perfectly reasonable for every single person who wants to text mine to want a conversation about access. Elsevier, because they demand control, have set themselves up as the bottleneck. This is really the key point: because the subscription business model implies an imperative to extract income from all possible uses of the content, it sets up a need to control access for each different use. This means in turn that each different use, and especially each new use, has to be individually negotiated, usually by humans, apparently about six of them. This will fail because it cannot scale in the same way that the demand will.

The technology exists today to make this kind of mass distributed text mining trivial. Publishers could push content to BitTorrent servers and then publish regular deltas to notify users of new content. The infrastructure for this already exists; there is no infrastructure investment required. The problem that publishers raise, of their servers not coping, is one that they have created for themselves. The catch is that distributed systems can’t be controlled from the centre, and giving up control requires a different business model. But this is also an opportunity. The publishers also save money if they give up control – no more need for six people to sit in on each of hundreds of thousands of meetings. I often wonder how much lower subscriptions would be if they didn’t need to cover the cost of access control, sales, and legal teams.
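
For what it’s worth, the consumer side of such a scheme is almost trivially simple. The sketch below assumes a hypothetical JSON “delta” feed listing newly published items with their BitTorrent magnet links; the URL and field names are invented for illustration, not any real publisher service:

```python
import json
import urllib.request

# Hypothetical delta feed: a JSON document listing items published since the
# last delta, each with an identifier and a magnet link. Purely illustrative.
FEED_URL = "https://example.org/content/deltas/latest.json"

def fetch_new_items(feed_url=FEED_URL):
    """Download the latest delta and return its list of new items."""
    with urllib.request.urlopen(feed_url) as response:
        delta = json.load(response)
    return delta.get("items", [])

def queue_for_download(items):
    """Hand the magnet links to whatever torrent client runs locally."""
    for item in items:
        # A real client would pass these to its torrent software; printing
        # keeps the sketch self-contained.
        print(item["doi"], item["magnet"])

if __name__ == "__main__":
    queue_for_download(fetch_new_items())
```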

We are increasingly going to see these kinds of failures. Legal and technical incompatibility of resources, contractual requirements at odds with local legal systems, and above all the claim “you can just ask for permission” without the backing of the hundreds or thousands of people that would be required to provide a timely answer. And that’s before we deal with the fact that the most common answer will be “mumble”. A centralised access control system is simply not fit for purpose in a networked world. As demand scales, people making legitimate requests for access will have the effect of a distributed denial of service attack. The clue is in the name; the demand is distributed. If the access control mechanisms are manual, human and centralised, they will fail. But if that’s what it takes to get subscription publishers to wake up to the fact that the networked world is different then so be it.


Network Enabled Research: Maximise scale and connectivity, minimise friction

Image: BBN Technologies TCP/IP internet map, early 1986 (via Wikipedia)

Prior to all the nonsense with the Research Works Act, I had been having a discussion with Heather Morrison about licenses and Open Access, and peripherally about the principle of requiring specific licenses of authors. I realized then that I needed to lay out the background thinking that leads me to where I am. The path that leads me here is built on a technical understanding of how networks function and what their capacity can be. It builds heavily on ideas I have taken from (in no particular order) Jon Udell, Jonathan Zittrain, Michael Nielsen, Clay Shirky, Tim O’Reilly, Danah Boyd, and John Wilbanks, among many others. Nothing much here is new but it remains something that very few people really get. Ironically the debate over the Research Works Act is what helped this narrative crystallise. This should be read as a contribution to Heather’s suggested “Articulating the Commons” series.

A pragmatic perspective

I am at heart a pragmatist. I want to see outcomes, I want to see evidence to support the decisions we make about how to get outcomes. I am happy to compromise, even to take tactical steps in the wrong direction if they ultimately help us to get where we need to be. In the case of publicly funded research we need to ensure that the public investment in research is made in such a way that it maximizes those outcomes. We may not agree currently on how to prioritize those outcomes, or the timeframe they occur on. We may not even agree that we can know how best to invest. But we can agree on the principle that public money should be effectively invested.

Ultimately the wider global public is for the most part convinced that research is something worth investing in, but in turn they expect to see outcomes from that research: jobs, economic activity, excitement, prestige, better public health, improved standards of living. The wider public are remarkably sophisticated when it comes to understanding that research may take a long time to bear fruit. But they are not particularly interested in papers. And when they become aware of academia’s obsession with papers they tend to be deeply unimpressed. We ignore that at our peril.

So it is important that when we think about the way we do research, we understand the mechanisms and the processes that lead to outcomes. Even if we can’t predict exactly where outcomes will spring from (and I firmly believe that we cannot), that does not mean that we can avoid the responsibility of thoughtfully designing our systems so as to maximize the potential for innovation. The fact that we cannot, literally cannot under our current understanding of physics, follow the path of an electron through a circuit does not mean that we cannot build circuits with predictable overall behaviour. You simply design the system at a different level.

The assumptions underlying research communication have changed

So why are we having this conversation? And why now? What is it about today’s world that is so different? The answer, of course, is the internet. Our underlying communications and information infrastructure is arguably undergoing its biggest change since the development of Gutenberg’s press. Like all new communication networks – SMS, fixed telephones, the telegraph, the railways, and writing itself – the internet doesn’t just change how well we can do things, it qualitatively changes what we can do. To give a seemingly trivial example, the expectations and possibilities of a society with mobile telephones are qualitatively different, and their introduction has changed the way we behave and expect others to behave. The internet is a network on a scale, and with connectivity, that we have never had before. The potential change in our capacity as individuals, communities, and societies is therefore immense.

Why do networks change things? Before a network technology spreads you can imagine people, largely separated from each other, unable to communicate in this new way. As you start to make connections nothing much really happens, a few small groups can start to communicate in this new way, but that just means that they can do a few things a bit better. But as more connections form suddenly something profound happens. There comes a point where there is a transition – where suddenly nearly everyone is connected. For the physical scientists this is in fact a phase transition and can display extreme cooperativity – a sudden break where the whole system crystallizes into a new state.
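
For the non-physicists, this transition is the same one you see in random networks: once the average number of connections per node passes one, a single giant connected cluster suddenly emerges. A small simulation sketch (standard library only, with parameters chosen purely for illustration) shows how abrupt the change is:

```python
import random

def largest_component_fraction(n, avg_degree):
    """Build a random graph with the given average degree and return the
    fraction of nodes that sit in its largest connected component."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    p = avg_degree / (n - 1)  # edge probability giving this mean degree
    for i in range(n):
        for j in range(i + 1, n):
            if random.random() < p:
                union(i, j)

    sizes = {}
    for i in range(n):
        root = find(i)
        sizes[root] = sizes.get(root, 0) + 1
    return max(sizes.values()) / n

if __name__ == "__main__":
    # Below an average degree of 1 the largest cluster stays tiny;
    # just above 1 it suddenly spans a large fraction of the network.
    for degree in (0.5, 0.9, 1.1, 1.5, 3.0):
        print(degree, round(largest_component_fraction(2000, degree), 2))
```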

At this point the whole is suddenly greater than the sum of its parts. Suddenly there is the possibility of coordination, of distribution of tasks, that was simply not possible before. The internet simply does this better than any other network we have ever had. It is better for a range of reasons but the key ones are: its immense scale – connecting more people, and now machines, than any previous network; its connectivity – the internet is incredibly densely connected, essentially enabling any computer to speak to any other computer globally; and its lack of friction – transfer of information is very low cost, essentially zero compared to previous technologies, and is very, very easy. Anyone with a web browser can point and click and be a part of that transfer.

What does this mean for research?

So if the internet and the web bring new capacity, where is the evidence that this is making a difference? If we have fundamentally new capacity where are the examples of that being exploited? I will give two examples, both very familiar to many people now, but ones that illustrate what can be achieved.

In late January 2009 Tim Gowers, a Fields medalist and arguably one of the world’s greatest living mathematicians, posed a question: could a group of mathematicians working together be better at solving a problem than one working alone? He suggested a problem, one that he had an idea of how to solve but felt was too challenging to tackle on his own. He then started to hedge his bets, stating:

“It is not the case that the aim of the project is [to solve the problem but rather it is to see whether the proposed approach was viable] I think that the chances of success even for this more modest aim are substantially less than 100%.”

A loose collection of interested parties, some world leading mathematicians, others interested but less expert, started to work on the problem. Six weeks later Gowers announced that he believed the problem solved:

“I hereby state that I am basically sure that the problem is solved (though not in the way originally envisaged).”

In six weeks an unplanned assortment of contributors had solved a problem that a world-leading mathematician had thought both interesting and too hard. And they had solved it by a route other than the one he had originally proposed. Gowers commented:

“It feels as though this is to normal research as driving is to pushing a car.”

For one of the world’s great mathematicians, there was a qualitative difference in what was possible when a group of people with the appropriate expertise were connected via a network through which they could easily and effectively transmit ideas, comments, and proposals. Three key messages emerge: the scale of the network was sufficient to bring the required resources to bear, the connectivity of the network was sufficient that work could be divided effectively and rapidly, and there was little friction in transferring ideas.

The Galaxy Zoo project arose out of a different kind of problem at a different kind of scale. One means of testing theories of the history and structure of the universe is to look at the numbers and types of different categories of galaxy in the sky. Images of the sky are collected and made freely available to the community. Researchers then categorise galaxies by hand to build up data sets that allow them to test theories. An experienced researcher could perhaps classify a hundred galaxies in a day. A paper might require a statistical sample of around 10,000 galaxy classifications to get past peer review. One truly heroic student classified 50,000 galaxies within their PhD, declaring at the end that they would never classify another again.

However problems were emerging. It was becoming clear that the statistical power offered by even 10,000 galaxies was not enough. One group would get different results from another. More classifications were required. Data wasn’t the problem: the Sloan Digital Sky Survey had a million galaxy images. But computer based image categorization wasn’t up to the job. The solution? Build a network. In this case a network of human participants willing to contribute by categorizing the galaxies. Several hundred thousand people classified the million images several times over in a matter of months. Again the key messages apply: the scale of the network – both the number of images and the number of participants; the connectivity of the network – the internet made it easy for people to connect and participate; and a lack of friction – sending images one way, and a simple classification back the other, was easy. Making the website easy, even fun, for people to use was a critical part of the success.
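
The data handling behind this is not sophisticated, which is part of the point. A minimal sketch of the aggregation step, assuming each image is classified several times and we simply take the consensus label (the image identifiers and labels here are invented for illustration):

```python
from collections import Counter, defaultdict

# Each record is (image_id, volunteer_label); every image gets classified
# many times over by different volunteers. Example data is invented.
classifications = [
    ("img-0001", "spiral"), ("img-0001", "spiral"), ("img-0001", "elliptical"),
    ("img-0002", "elliptical"), ("img-0002", "elliptical"),
]

def consensus(records):
    """Return the majority label and agreement fraction for each image."""
    by_image = defaultdict(list)
    for image_id, label in records:
        by_image[image_id].append(label)
    results = {}
    for image_id, labels in by_image.items():
        label, votes = Counter(labels).most_common(1)[0]
        results[image_id] = (label, round(votes / len(labels), 2))
    return results

print(consensus(classifications))
# {'img-0001': ('spiral', 0.67), 'img-0002': ('elliptical', 1.0)}
```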

Galaxy Zoo changed the scale of this kind of research. It provided a statistical power that was unheard of and made it possible to ask fundamentally new types of questions. It also enabled fundamentally new types of people to play an effective role in the research, school children, teachers, full time parents. It enabled qualitatively different research to take place.

So why hasn’t the future arrived then?

These are exciting stories, but they remain just that. Sure I can multiply examples but they are still limited. We haven’t yet taken real advantage of the possibilities. There are lots of reasons for this but the fundamental one is inertia. People within the system are pretty happy for the most part with how it works. They don’t want to rock the boat too much.

But there are a group of people who are starting to be interested in rocking the boat: the funders, the patient groups, that global public who want to see outcomes. The thought process hasn’t worked through yet, but when it does they will all be asking one question: “How are you building networks to enable research?” The question may come in many forms – “How are you maximizing your research impact?”, “What are you doing to ensure the commercialization of your research?”, “Where is your research being used?” – but they all really mean the same thing. How are you working to make sure that the outputs of your research are going into the biggest, most connected, lowest friction network that they possibly can?

As service providers, all of those who work in this industry – and I mean all, from the researchers to the administrators, to the publishers to the librarians – will need to have an answer. The surprising thing is that it’s actually very easy. The web makes building and exploiting networks easier than it has ever been because it is a network infrastructure. It has scale: billions of people, billions of computers, exabytes of information resources, exaflops of computational resources. It has connectivity on a scale that is literally unimaginable – the human mind can’t conceive of that number of connections because the web has more. It is incredibly low in friction – the cost of information transfer is in most cases so close to zero as to make no difference.

Service requirements

To exploit the potential of the network all we need to do is get as much material online as fast as we can. We need to connect it up, to make it discoverable, to make sure that people can find and understand and use it. And we need to ensure that once found those resources can be easily transferred, shared, and used – used in any way, because at network scale resources will inevitably get used in unexpected ways. At scale you can have serendipity by design, not by blind luck.

The problem arises with the systems we have in place to get material online. The raw material of science is not often in a state where putting it online is immediately useful. It needs checking, formatting, testing, indexing. All of this does require real work, and real money. So we need services to do this, and we need to be prepared to pay for those services. The trouble is our current system has this backwards. We don’t pay directly for those services so those costs have to be recouped somehow. And the current set of service providers do that by producing the product that we really need and want and then crippling it.

Currently we take raw science and through a collaborative process between researchers and publishers we generate a communication product, generally a research paper, that is what most of the community holds as the standard means by which they wish to receive information. Because the publishers receive no direct recompense for their contribution they need to recover those costs by other means. They do this by artificially introducing friction and then charging to remove it.

This is a bad idea on several levels. Firstly, it means the product we get doesn’t have the maximum impact it could, because it’s not embedded in the largest possible network. From a business perspective it creates risks: publishers have to invest up front and then recoup money later, rather than being confident that expenditure and cash flow are coupled. This means, for instance, that if there is a sudden rise (or fall) in the number of submissions there is no guarantee that cash flows or costs will scale with that change. But the real problem is that it distorts the market. Because on the researcher side we don’t pay for the product of effective communication, we don’t pay much attention to what we’re getting. On the publisher side it drives a focus on surface and presentation, because that enhances the product in the current purchaser’s eyes, rather than a ruthless focus on production costs and shareability.

Network Ready Research Communication

If we care about taking advantage of the web and internet for research then we must tackle the building of scholarly communication networks. These networks will have those critical characteristics described above, scale and a lack of friction. The question is how do we go about building them. In practice we actually already have a network at huge scale – the web and the internet do that job for us, connecting essentially all professional researchers and a large proportion of the interested public. There is work to be done on expanding the reach of the network but this is a global development goal, not something specific to research.

So if we already have the network then what is the problem? The issue lies in the second characteristic – friction. Our current systems are actually designed to create friction. Before the internet was in place our network was formed of a distribution system involving trucks and paper – reducing costs to reasonable levels meant charging for that distribution process. Today those distribution costs have fallen to as near zero as makes no difference, yet we retain the systems that add friction unnecessarily. Slow review processes, charging for access, formats and discovery tools that are no longer fit for purpose.

What we need to do is focus on the process of taking the research that we do and converting it into a Network Ready form. That is, we need access to services that take our research and make it ready to exploit our network infrastructure – or we need to do it ourselves. What does “Network Ready” mean? A piece of Network Ready Research will be modular and easily discoverable, it will present different facets that allow people and systems to use it in a wide variety of ways, it will be compatible with the widest range of systems, and above all it will be easily shareable. Not just copyable or pasteable, but easily shared through multiple systems while carrying with it all the context required to make use of it, all the connections that will allow a user to dive deeper into its component parts.

Network Ready Research will be interoperable, socially, technically, and legally with the rest of the network. The network is more than just technical infrastructure. It is also built up from the social connections, a shared understanding of the parameters of re-use, and a compatible system of checks and balances. The network is the shared set of technical and social connections that together enable new connections to be made. Network Ready Research will move freely across that, building new connections as it goes, able to act as both connecting edge and connected node in different contexts.

Building and strengthening the network

If you believe the above, as I do, then you see a potential for us to qualitatively change our capacity as a society to innovate, understand our world, and help to make it a better place. That potential will be best realized by building the largest possible, most effective, and lowest friction network possible. A networked commons in which ideas and data, concepts and expertise can be most easily shared, and can most easily find the place where they can do the most good.

Therefore the highest priority is building this network, making its parts and components interoperable, and making it as easy as possible to connect up networks that already exist. For an agency that funds research and seeks to ensure that research makes a difference the only course of action is to place the outputs of that research where they are most accessible on the network. In blunt terms that means three things: free at the point of access, technically interoperable with as many systems as possible, and free to use for any purpose. The key point is that at network scale the most important uses are statistically likely to be unexpected uses. We know we can’t predict the uses, or even success, of much research. That means we must position it so it can be used in unexpected ways.

Ultimately, the bigger the commons, the bigger the network, the better. And the more interoperable it is, and the wider the range of uses it supports, the better. That ultimately is why I argue for liberal licences, for the exclusion of non-commercial terms. It is why I use ccZero on this blog and for software that I write where I can. For me, the risk of commercial enclosure is so much smaller than the risk of not building the right networks, of creating fragmented incompatible networks, of ultimately not being able to solve the crises we face today in time to do any good, that the course of action is clear. At the same time we need to build up the social interoperability of the network, to call out bad behavior and perhaps in some cases to isolate its perpetrators, but we need to find ways of doing this that don’t damage the connectivity and freedom of movement on the network. Legal tools are useful for assuring users of interoperability and of their rights; otherwise they just become a source of friction. Social tools are a more viable route for encouraging desirable behaviour.

The priority has to be achieving scale and lowering friction. If we can do this then we have the potential to create a qualitative jump in our research capacity on a scale not seen since the 18th century, and perhaps ever. And it certainly feels like we need it.


The Research Works Act and the breakdown of mutual incomprehension


When the history of the Research Works Act, and the reaction against it, is written, that history will point at the factors that allowed smart people with significant marketing experience to walk with their eyes wide open into the teeth of a storm that thousands of people could have predicted with complete confidence. That story will detail two utterly incompatible world views of scholarly communication. The interesting thing is that, with the benefit of hindsight, both will be totally incomprehensible to an observer from five or ten years in the future. It seems worthwhile therefore to try and detail those world views as I understand them.

The scholarly publisher

The publisher world view places the publisher as the owner and guardian of scholarly communications. While publishers recognise that researchers provide the majority of the intellectual property in scholarly communication, their view is that researchers willingly and knowingly gift that property to publishers in exchange for a set of services that they appreciate and value. In this view everyone is happy: a trade is carried out in which everyone gets what they want. The publisher is free to invest in the service they provide and has the necessary rights to look after and curate the content. The authors are happy because they can obtain the services they require without having to pay cash up front.

Crucial to this world view is a belief that research communication, the process of writing and publishing papers, is separate from the research itself. This is important because otherwise it would be clear, at least in an ethical sense, that the writing of papers would be work for hire for the funders – part and parcel of the contract of research. For the publishers, the fact that no funding contract specifies that “papers must be published” is the primary evidence of this.

The researcher

The researcher’s perspective is entirely different. Researchers view their outputs as their own property: the ideas, the physical outputs, and the communications. Within institutions you see this in the uneasy relationship between researchers and research translation and IP exploitation offices. Institutions try to avoid inflaming this issue by ensuring that economic returns on IP go largely to the researcher, at least until there is real money involved. But at that stage the issue is usually fudged, as extra investment is required which dilutes ownership. But scratch a researcher who has gone down the exploitation path and then been pushed gently aside and you’ll get a feel for the sense of personal ownership involved.

Researchers have a love-hate relationship with papers. Some people enjoy writing them, although I suspect this is rare. I’ve never met any researcher who did anything but hate the process of shepherding a paper through the review process. The service, as provided by the publisher, is viewed with deep suspicion. The resentment that is often expressed by researchers for professional editors is primarily a result of a loss of control over the process for the researcher and a sense of powerlessness at the hands of people they don’t trust. The truth is that researchers actually feel exactly the same resentment for academic editors and reviewers. They just don’t often admit it in public.

So from a researcher’s perspective, they have spent an inordinate amount of effort on a great paper. This is their work, their property. They are now obliged to hand over control of this to people they don’t trust to run a process they are unconvinced by. Somewhere along the line they sign something. Mostly they’re not too sure what that means, but they don’t give it much thought, let alone read it. But the idea that they are making a gift of that property to the publisher is absolute anathema to most researchers.

To be honest, researchers don’t care that much about a paper once it’s out. It caused enough pain and they don’t ever want to see it again. This may change over time if people start to cite it and refer to it in supportive terms, but most people won’t really look at a paper again. It’s a line on a CV, a notch on the bedpost. What they do notice is the cost of, or lack of access to, other people’s papers. Library budgets are shrinking, subscriptions are being chopped, personal subscriptions don’t seem to be affordable any more.

The first response to this when researchers meet is “why can’t we afford access to our work?” The second, given the general lack of respect for the work that publishers do, is to start down the path of claiming that they could do it better. Much of the rhetoric around eLife as a journal “led by scientists” is built around this view. And a lot of it is pure arrogance. Researchers neither understand, nor appreciate for the most part, the work of copyediting and curation, layout and presentation. While there are tools today that can do many of these things more cheaply there are very few researchers who could use them effectively.

The result…kaboom!

So the environment that set the scene for the Research Works Act revolt was a combination of simmering resentment amongst researchers at the cost of accessing the literature, combined with a lack of understanding of what it is publishers actually do. The spark that set it off was the publisher rhetoric about ownership of the work. This was always going to happen one day. The mutually incompatible world views could co-exist while there was still enough money to go around. While librarians felt trapped between researchers who demanded access to everything and publishers offering deals that just about meant they could scrape by, things could continue.

Fundamentally, once publishers started publicly using the term “appropriation of our property” the spark had flown. From the publisher perspective this makes perfect sense: the NIH mandate is a unilateral appropriation of their property. From the researcher perspective it is a system that essentially adds a bit of pressure to do something that they know is right, promote access, without causing them too much additional pain. Researchers feel they ought to be doing something to improve access to research outputs, but for the most part they’re not too sure what, because they sure as hell aren’t in a position to change the journals they publish in. That would be (perceived to be) career suicide.

The elephant in the room

But it is of course the funder perspective that we haven’t yet discussed, and looking forward, in my view it is the actions of funders that will render both the publisher and researcher perspectives incomprehensible in ten years’ time. The NIH view, similar to that of the Wellcome Trust, and indeed every funder I have spoken to, is that research communication is an intrinsic part of the research they fund. Funders take a close interest in the outputs that their research generates. One might say a proprietorial interest, because again, there is a strong sense of ownership. The NIH mandate language expresses this through the grant contract: researchers are required to grant to the NIH a license to hold a copy of their research work.

In my view it is through research communication that research has outcomes and impact. From the perspective of a funder, their main interest is that the research they fund generates those outcomes and impacts. For a mission driven funder the current situation signals one thing, and it signals it very strongly: neither publishers nor researchers can be trusted to do this properly. What funders will do is move to stronger mandates, more along the Wellcome Trust lines than the NIH lines, and this will expand. At the end of the day, the funders hold all the cards. Publishers never really did have a business model; they had a public subsidy. The holders of those subsidies can only really draw one conclusion from current events: that they are going to have to be much more active in where they spend it to successfully perform their mission.

The smart funders will work with the pre-existing prejudice of researchers, probably granting copyright and IP rights to the researchers, but placing tighter constraints on the terms of forward licensing. That funders don’t really need the publishers has been made clear by HHMI, Wellcome Trust, and the MPI. Publishing costs are a small proportion of their total expenditure. If necessary they have the resources and will to take that in house. The NIH has taken a similar route though technically implemented in a different way. Other funders will allow these experiments to run, but ultimately they will adopt the approaches that appear to work.

Bottom line: within ten years all major funders will mandate CC-BY Open Access for publications arising from work they fund, effective immediately on publication. Several major publishers will not survive the transition. A few will, and a whole set of new players will spring up to fill the spaces. The next ten years look to be very interesting.


IP Contributions to Scientific Papers by Publishers: An open letter to Reps Maloney and Issa

Dear Representatives Maloney and Issa,

I am writing to commend your strong commitment to the recognition of intellectual property contributions to research communication. As we move to a modern knowledge economy, supported by the technical capacity of the internet, it is crucial that we have clarity on the ownership of intellectual property arising from the federal investment in research. For the knowledge economy to work effectively it is crucial that all players receive fair recompense for the contribution of intellectual property that they make and the services that they provide.

As a researcher I like to base my work on solid data, so I thought it might interest you to have some quantitation of the level of IP contribution that publishers make to the substance of scientific papers. In this I have focussed on the final submitted version of papers after peer review, as this is the version around which the discussion of mandates for deposition in repositories revolves. This also has the advantage of separating the typesetting and copyright in layout, clearly the property of the publishers, from the intellectual substance of the research.

Contribution of IP to the final (post peer review) submitted versions of papers

Methodology: I examined the final submitted version (i.e. the version accepted for publication) of the ten most recent research papers on which I was an author, along with the referee and editorial comments received from the publisher. For each paper I examined the text of the final submitted version and the diagrams and figures. As the only IP of significance in this case is copyright, the specific contributions searched for were text or elements of figures contributed by the publisher that satisfied the requirements for obtaining copyright. Figures that were re-used from other publications (where the copyright had been transferred to the other publisher and permission had been obtained to republish) were not included, as these were considered “old IP” that did not relate to new IP embodied in the specific paper under consideration. The text and figures were searched for specific creative contributions from the publisher and these were quantified for each paper.

Results: The contribution of IP by publishers to the final submitted versions of these ten papers, after peer review had been completed, was zero. Zip. Nada. Zilch. Not one single word, line, or graphical element was contributed by the publisher or the editor acting as their agent. A small number of single words, or forms of expression, were found that were contributed by external peer reviewers. However as these peer reviewers do not sign over copyright to the publisher and are not paid this contribution cannot be considered work for hire and any copyright resides with the original reviewers.

Limitations: This is a small and arguably biased study based on the publications I have to hand. I recommend that other researchers examine their own oeuvre and publish similar analyses so that the effects of discipline, age, and venue of publication can be assessed; a sketch of one possible approach follows below. Following such analysis I ask that researchers provide the data via twitter using the hashtag #publisheripcontrib, where I will aggregate it and republish.

Data availability: I regret that the original submissions cannot be provided, as the copyright in these articles was transferred to the publishers after acceptance for publication. I cannot provide the editorial reports as these contain material from the publishers for which I do not have redistribution rights.
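
For those attempting the replication suggested under Limitations, a minimal sketch of one possible approach: diff the accepted manuscript against the published text and count the words that appear only in the latter. The file names are placeholders and a word-level diff is a crude proxy, not a legal test of copyrightability:

```python
import difflib

def words(path):
    """Read a plain-text version of a paper and split it into words."""
    with open(path, encoding="utf-8") as handle:
        return handle.read().split()

def words_added_after_acceptance(submitted_path, published_path):
    """Count words present in the published text but not in the accepted
    manuscript: a rough proxy for textual contributions made after acceptance."""
    matcher = difflib.SequenceMatcher(None, words(submitted_path), words(published_path))
    added = 0
    for op, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if op in ("insert", "replace"):
            added += j2 - j1
    return added

# Placeholder file names, for illustration only.
print(words_added_after_acceptance("accepted_manuscript.txt", "published_version.txt"))
```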

The IP argument is sterile and unproductive. We need to discuss services.

The analysis above at its core shows how unhelpful framing this argument around IP is. The fact that publishers do not contribute IP is really not relevant. Publishers do contribute services, the provision of infrastructure, the management of the peer review process, dissemination and indexing, that are crucial for the current system of research dissemination via peer reviewed papers. Without these services papers would not be published and it is therefore clear that these services have to be paid for. What we should be discussing is how best to pay for those services, how to create a sustainable market place in which they can be offered, and what level of service the federal government expects in exchange for the services it is buying.

There is a problem with this. We currently pay for these services in a convoluted fashion which is the result of historical developments. Rather than pay up front for publication services, we currently give away the intellectual property in our papers in exchange for publication. The U.S. federal and state governments then pay for these publication services indirectly, by funding libraries to buy access back to our own work. This model made sense when the papers were physically on paper; distribution, aggregation, and printing were major components of the cost. In that world a demand side business model worked well and was appropriate.

In the current world the costs of dissemination and provision of access are as near to zero as makes no difference. The major costs are in the peer review process and preparing the paper in a version that can be made accessible online. That is, we have moved from a world where the incremental cost of dissemination of each copy was dominant, to a world where the first copy costs are dominant and the incremental costs of dissemination after those first copy costs are negligible. Thus we must be clear that we are paying for the important costs of the services required to generate that first web accessible copy, and not that we are supporting unnecessary incremental costs. A functioning market requires, as discussed above, that we have clarity on what is being paid for.

In a service based model the whole issue of IP simply goes away. It is clear that the service we would wish to pay for is one in which we generate a research communication product which provides appropriate levels of quality assurance and is as widely accessible and available for any form of use as possible. This ensures that the outputs of the most recent research are available to other researchers, to members of the public, to patients, to doctors, to entrepreneurs and technical innovators, and not least to elected representatives to support informed policy making and legislation. In a service based world there is no logic in artificially reducing access because we pay for the service of publication and the full first copy costs are covered by the purchase of that service.

Thus when we abandon the limited and sterile argument about intellectual property and move to a discussion around service provision we can move from an argument where no-one can win to a framework in which all players are suitably recompensed for their efforts and contributions, whether or not those contributions generate IP in the legal sense, and at the same time we can optimise the potential for the public investment in research to be fully exploited.

HR3699 prohibits federal agencies from supporting publishers to move to a transparent service based model

The most effective means of moving to a service based business model would be for U.S. federal agencies, as the major funders of global research, to work with publishers to assure them that money will be available for the support of publication services for federally funded researchers. This will require some money to be put aside. The UK’s Wellcome Trust estimates that it expects to spend approximately 1.5% of total research funding on publication services. This is a significant sum, but not an overly large proportion of the whole. It should also be remembered that governments, federal and state, are already paying these costs indirectly through overheads charges and direct support to research institutions via educational and regional grants. While there will be additional centralised expenditure over the transitional period, in the longer term this is at worst a zero-sum game. Publishers are currently viable, indeed highly profitable. In the first instance service prices can be set so that the same total sum of money flows to them.
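
To give a feel for the scale implied by that 1.5% figure, a purely illustrative calculation (the budget number is invented; the 1.5% share and the roughly $1,500 article charge are the figures discussed earlier in these posts):

```python
# Illustrative arithmetic only: the research budget is a made-up example.
research_budget = 100_000_000   # hypothetical $100M of annual research funding
publication_share = 0.015       # ~1.5% of total funding, per the Wellcome Trust estimate
apc = 1_500                     # roughly the average article charge discussed earlier

publication_budget = research_budget * publication_share
print(f"Publication services budget: ${publication_budget:,.0f}")           # $1,500,000
print(f"Papers covered at ${apc:,} each: {publication_budget / apc:,.0f}")  # 1,000
```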

The challenge is the transitional period. The best way to manage this would be for federal agencies to be able to guarantee to publishers that their funded researchers would be moving to the new system over a defined time frame. The most straightforward way to do this would be for the agencies to publish a programme, running over a number of years, through which the publication of research outputs via the purchase of appropriate services would be made mandatory. This could also provide confidence to the publishers by defining the service level agreements that the federal agencies would require, and guarantee a predictable income stream over the course of the transition.

This would require agencies working with publishers and their research communities to define the timeframes, guarantees, and service level agreements that would be put in place. It would require mandates from the federal agencies as the main guarantor of that process. The Research Works Act prohibits any such process. In doing so it actively prevents publishers from moving towards business models that are appropriate for today’s world. It will stifle innovation and new entrants to the market by creating uncertainty and continuing the current obfuscation of first copy costs with dissemination costs. In doing so it will damage the very publishers that support it by legislatively sustaining an out of date business model that is no longer fit for purpose.

Like General Motors, or perhaps more analogously, Lehman Brothers, the incumbent publishers are trapped in a business model that can not be sustained in the long term. The problem for publishers is that their business model is predicated on charging for the dissemination and access costs that are disappearing and not explicitly charging for the costs that really matter. Hiding the cost of one thing in a charge for another is never a good long term business strategy. HR3699 will simply prop them up for a little longer, ultimately leading to a bigger crash when it comes. The alternative is a managed transition to a better set of business models which can simultaneously provide a better return on investment for the taxpayer.

We recognise the importance of the services that scholarly publishers provide. We want to pay publishers for the services they provide because we want those services to continue to be available and to improve over time. Help us to help them make that change. Drop the Research Works Act.

Yours sincerely

Cameron Neylon


An Open Letter to David Willetts: A bold step towards opening British research


On the 8th December David Willetts, the Minister of State for Universities and Science, announced new UK government strategies to develop innovation and research to support growth. The whole document is available online and you can see more analysis at the links at the bottom of the post. A key aspect for Open Access advocates was the section that discussed a wholesale move by the UK to an author-pays system for freely accessible research literature, with SCOAP3 raised as a possible model. The report refers not to Open Access, but to freely accessible content. I think this is missing a massive opportunity for Britain to take a serious lead in defining the future direction of scholarly communication. That’s the case I attempt to lay out in this open letter. This post should be read in the context of my usual disclaimer.

Minister of State for Universities and Science

Department of Business Innovation and Skills

Dear Mr Willetts,

I am writing in the first instance to congratulate you on your stance on developing routes to freely accessible research outputs. I cannot say I am a great fan of many current government positions, and I might have wished for greater protection of the UK science budget, but in times of resource constraint for research I believe your focus on ensuring the efficiency of access to and exploitation of research outputs, in its widest sense, is the right one.

The position you have articulated offers a real opportunity for the UK to take a lead in this area. But along with the opportunities there are risks, and those risks could entrench existing inefficiencies of our scholarly communication system. They could also reduce the value for money that the public purse, and it will be the public purse one way or another, gets for its investment. In our current circumstances this would be unfortunate. I would therefore ask you to consider the following as the implementation pathway for this policy is developed.

Firstly, the research community will be buying a service. This is a significant change from the current system where the community buys a product, the published journal. The purchasing exercise should be seen in this light and best practice in service procurement applied.

Secondly, the nature of this service must be made clear. The service being purchased must provide for any and all downstream uses, including commercial use, text mining, indeed any use that might be developed at some point in the future. We are paying for this service and we must dictate its terms. Incumbent publishers will say in response that they need to retain commercial rights, or text mining rights, to ensure their viability, as indeed they have done in response to the Hargreaves Review.

This, not to put too fine a point on it, is hogwash. PLoS and BioMedCentral both operate financially viable operations in which no downstream rights beyond that of appropriate attribution are retained by the publishers, and where the author charges are lower in price than many of the notionally equivalent, but actually far more limited, offerings of more traditional publishers. High quality scholarly communication can be supported by reasonable author charges without any need for publishers to retain rights beyond those protected by their trademarks. An effective market place could therefore be expected to bring the average costs of this form of scholarly communication down.

The reason for supporting a system that demands that any downstream use of the communication be enabled is that we need innovation and development within the publishing system as well as innovation and development as a result of its content. Our scholarship is currently being held back by a morass of retained rights that prevent the development of research projects, of new technology startups, and potentially of new industries. The government consultation document of 14 December on the Hargreaves report explicitly notes that enabling downstream uses of content, and scholarly content in particular, can support new economic activity. It can also support new scholarly activity. The exploitation of our research outputs requires new approaches to indexing, mining, and parsing the literature. The shame of our current system is that much of this is possible today. The technology exists but is prevented from being exploited at scale by the logistical impossibility of clearing the required rights. These new approaches will require money and it is entirely appropriate, indeed desirable, that some of this work therefore occurs in the private sector. Experimentation will require both freedom to act and freedom to develop new business models. Our content, its accessibility, and its reusability must support this.

Finally, I ask you to look beyond the traditional scholarly publishing industry to the range of experimentation that is occurring globally in academic spaces, non-profits, and commercial endeavours. The potential leaps in functionality, as well as the potential cost reductions, are enormous. We need to work to encourage this experimentation and develop a diverse and vibrant market, one which provides the quality assurance and stability that we are used to while encouraging technical experimentation and the improvement of business models. What we don’t need is a five or ten year deal that cements in existing players, systems, and practices.

Your government’s philosophy is based around the effectiveness of markets. The recent history of major government procurement exercises is not a glorious one. This is one we should work to get right. We should take our time to do so and ensure a deal that delivers on its promise. The vision of a Britain that is led by innovation and development, supported by a vibrant and globally leading research community, is, I believe, the right one. Please ensure that this innovation isn’t cut off at the knees by agreeing terms that prevent our research communication tools being re-used to improve the effectiveness of that communication. And please ensure that the process of procuring these services is one that supports innovation and development in scholarly communications itself.

Yours truly,

Cameron Neylon

PLoS (and NPG) redefine the scholarly publishing landscape


Nature Publishing Group yesterday announced a new venture, very closely modelled on the success of PLoS ONE, titled Scientific Reports. Others have started to cover the details and some implications so I won’t do that here. I think there are three big issues. What does this tell us about the state of Open Access? What are the risks and possibilities for NPG? And why oh why does NPG keep insisting on a non-commercial licence? I think those merit separate posts, so here I’m just going to deal with the big issue. And I think this is really big.

[I know it bores people, hell it bores me, but the non-commercial licence is a big issue. It is an even bigger issue here because this launch may define the ground rules for future scholarly communication. Open Access with a non-commercial licence actually achieves very little either for the community, or indeed for NPG, except perhaps as a cynical gesture. The following discussion really assumes that we can win the argument with NPG to change those terms. If we can, the future is very interesting indeed.]

The Open Access movement has really been defined by two strands of approach. The “Green Road” involves self-archiving of pre-prints or published articles in subscription journals as a means of providing access. It has had its successes, perhaps more so in the humanities, with deposition mandates becoming increasingly common both at the institutional level and the level of funders. The other approach, the “Gold Road”, is for most intents and purposes defined by commercial and non-profit publishers based on a business model of article processing charges (APCs) to authors and making the published articles freely available at a publisher website. There is a thriving community of “shoe-string business model” journals publishing small numbers of articles without processing charges, but in terms of articles published, OA publishing is dominated by BioMedCentral, the pioneers in this area, now owned by Springer, Public Library of Science, and on a smaller scale Hindawi. This approach has gained more traction in the sciences, particularly the biological sciences.

From my perspective yesterday’s announcement means that, for the sciences, the argument for Gold Open Access as the default publication mechanism has effectively been settled. Furthermore, the future of most scholarly publishing will be in publication venues that place no value on a subjective assessment of “importance”. Those are big claims, but NPG have played a bold and possibly decisive move, in an environment where PLoS ONE was already starting to dominate some fields of science.

PLoS ONE was already becoming a default publication venue. A standard path for getting a paper published is, have a punt at Cell/Nature/Science, maybe a go at one of the “nearly top tier” journals, and then head straight for PLoS ONE, in some cases with the technical assessments already in hand. However in some fields, particularly chemistry, the PLoS brand wasn’t enough to be attractive against the strong traditional pull of American Chemical Society or Royal Society of Chemistry journals and Angewandte Chemie. Scientific Reports changes this because of the association with the Nature brand. If I were the ACS I’d be very worried this morning.

The announcement will also be scaring the hell out of those publishers who have a lot of separate, lower tier journals. The problem for publication business models has never been with the top tier, that can be made to work because people want to pay for prestige (whether we can afford it in the long term is a separate question). The problem has been the volume end of the market. I back Dorothea Salo’s prediction [and again] that 2011/12 would see the big publishers looking very closely at their catalogue of 100s or 1000s of low yield, low volume, low prestige journals and see the beginning of mass closures, simply to keep down subscription increases that academic libraries can no longer pay for. Aggregated large scale journals with streamlined operating and peer review procedures, simplified and more objective selection criteria, and APC supported business models make a lot of sense in this market. Elsevier, Wiley, Springer (and to a certain extent BMC) have just lost the start in the race to dominate what may become the only viable market in the medium term.

With two big players now in this market there will be real competition. Others have suggested [see Jason Priem’s comment] this will be on the basis of services and information. This might be true in the longer term but in the short to medium term it will be on two issues: brand and price. The choice of name is a risk for NPG: the Nature brand is crucial to the success of the venture, but there is a risk of diluting the brand, which is NPG’s major asset. That the APC for Scientific Reports has been set identically to that of PLoS ONE is instructive. I have previously argued that APC-driven business models will be the most effective way of forcing down publication costs and I would expect to see competition develop here. I hope we might soon see a third player in this space to drive effective competition.

At the end of the day what this means is that there are now seriously credible options for publishing in Open Access venues (assuming we win the licensing argument) across the sciences, that funders now support Article Processing Charges, and that there is really no longer any reason to publish in that obscure subscription journal that no-one actually read anyway. The dream of a universal database of freely accessible research outputs is that much closer to being within our reach.

Above all, this means that PLoS in particular has succeeded in its aim of making Gold Open Access publication a credible default option. The founders and team at PLoS set out with the aim of changing the publication landscape. PLoS ONE was a radical and daring step at the time, which they pulled off. The other people who experimented in this space also deserve credit, but it was PLoS ONE in particular that found the sweet spot between credibility and pushing the envelope. I hope that those in the office are cracking open some bubbly today. But not too much. For the first time there is now some serious competition and it’s going to be tough to keep up. There remains a lot more work to be done (assuming we can sort out the licence).

Full disclosure: I am an academic editor for PLoS ONE, editor in chief of the BioMedCentral journal Open Research Computation, and have advised PLoS, BMC, and NPG in a non-paid capacity on a variety of issues that relate closely to this post.


Beyond the Impact Factor: Building a community for more diverse measurement of research


I know I’ve been a bit quiet for a few weeks. Mainly I’ve been away for work and having a brief holiday so it is good to be plunging back into things with some good news. I am very happy to report that the Open Society Institute has agreed to fund the proposal that was built up in response to my initial suggestion a month or so ago.

OSI, which many will know as one of the major players in bringing the Open Access movement to its current position, will fund a workshop that will identify both potential areas where the measurement and aggregation of research outputs can be improved and the barriers to achieving these improvements. This will be immediately followed by a concentrated development workshop (or hackfest) that will aim to deliver prototype examples that show what is possible. The funding also includes further development effort to take one or two of these prototypes and develop them to proof-of-principle stage, ideally with the aim of deploying them into real working environments where they might be useful.

The workshop structure will be developed by the participants over the six weeks leading up to the date itself. I aim to set that date in the next week or so, but the likelihood is early to mid-March. The workshop will be in southern England, with the venue again to be worked out over the coming weeks.

There is a lot to pull together here and I will be aiming to contact everyone who has expressed an interest over the next few weeks to start talking about the details. In the meantime I’d like to thank everyone who has contributed to the effort thus far. In particular I’d like to thank Melissa Hagemann and Janet Haven at OSI and Gunner from Aspiration who have been a great help in focusing and optimizing the proposal. Too many people contributed to the proposal itself to name them all (and you can check out the GoogleDoc history if you want to pull apart their precise contributions) but I do want to thank Heather Piwowar and David Shotton in particular for their contributions.

Finally, the success of the proposal, and in particular the community response around it, has made me much more confident that some of the dreams we have for using the web to support research are becoming a reality. The details I will leave for another post, but what I found fascinating is how far the network of people who could be contacted spread, essentially from a single blog post. I’ve contacted a few people directly but most have become involved through the network of contacts that spread from the original post. The network, and the tools, are effective enough that a community can be built up rapidly around an idea from a much larger and more diffuse collection of people. The challenge of this workshop and the wider project is to see how we can make that aggregated community into a self-sustaining conversation that produces useful outputs over the longer term.

It’s a complete coincidence that Michael Nielsen posted a piece in the past few hours that forms a great document for framing the discussion. I’ll be aiming to write something in response soon but in the meantime follow the top link below.


What would scholarly communications look like if we invented it today?


I’ve largely stolen the title of this post from Daniel Mietchen because it helped me to frame the issues. I’m giving an informal talk this afternoon and will, as I frequently do, use this to think through what I want to say. Needless to say this whole post is built to a very large extent on the contributions and ideas of others that are not adequately credited in the text here.

If we imagine what the specification for building a scholarly communications system would look like there are some fairly obvious things we would want it to enable. Registration of ideas, data or other outputs for the purpose of assigning credit and priority to the right people is high on everyone’s list. While researchers tend not to think too much about it, those concerned with the long term availability of research outputs would also place archival and safekeeping high on the list as well. I don’t imagine it will come as any surprise that I would rate the ability to re-use, replicate, and re-purpose outputs very highly as well. And, although I won’t cover it in this post, an effective scholarly communications system for the 21st century would need to enable and support public and stakeholder engagement. Finally this specification document would need to emphasise that the system will support discovery and filtering tools so that users can find the content they are looking for in a huge and diverse volume of available material.

So, filtering, archival, re-usability, and registration. Our current communications system, based almost purely on journals with pre-publication peer review, doesn’t do too badly at archival, although the question of who is actually responsible for doing the archiving, and hence paying for it, doesn’t always seem to have a clear answer. Nonetheless the standards and processes for archiving paper copies are better established, and probably better followed in many cases, than those for digital materials in general, and certainly for material on the open web.

The current system also does reasonably well on registration, providing a firm date of submission, author lists, and increasingly descriptions of the contributions of those authors. Indeed the system defines the registration of contributions for the purpose of professional career advancement and funding decisions within the research community. It is a clear and well understood system with a range of community expectations and standards around it. Of course this is circular as the career progression process feeds the system and the system feeds career progression. It is also to some extent breaking down as wider measures of “impact” become important. However for the moment it is an area where the incumbent has clear advantages over any new system, around which we would need to grow new community standards, expectations, and norms.

It is on re-usability and replication where our current system really falls down. Access and rights are a big issue here, but ones that we are gradually pushing back. The real issues are much more fundamental. It is essentially assumed, in my experience, by most researchers that a paper will not contain sufficient information to replicate an experiment or analysis. Just consider that. Our primary means of communication, in a philosophical system that rests almost entirely on reproducibility, does not enable even simple replication of results. A lot of this is down to the boundaries created by the mindset of a printed multi-page article. Mechanisms to publish methods, detailed laboratory records, or software are limited, often leading to a lack of care in keeping and annotating such records. After all if it isn’t going in the paper why bother looking after it?

A key advantage of the web here is that we can publish a lot more with limited costs and we can publish a much greater diversity of objects. In principle we can solve the “missing information” problem by simply making more of the record available. However those important pieces of information need to be captured in the first place. Because they aren’t currently valued, because they don’t go in the paper, they often aren’t recorded in a systematic way that makes it easy to ultimately publish them. Open Notebook Science, with its focus on just publishing everything immediately, is one approach to solving this problem but it’s not for everyone, and it causes its own overhead. The key problem is that recording more, and communicating it effectively, requires work over and above what most of us are doing today. That work is not rewarded in the current system. This may change over time, if, as I have argued, we move to metrics based on re-use, but in the meantime we also need much better, easier, and ideally near-zero burden tools that make it easier to capture all of this information and publish it when we choose, in a useful form.

Of course, even with the perfect tools, if we start to publish a much greater portion of the research record then we will swamp researchers already struggling to keep up. We will need effective ways to filter this material down to reduce the volume we have to deal with. Arguably the current system is an effective filter. It almost certainly reduces the volume and rate at which material is published. Of all the research that is done, some proportion is deemed “publishable” by those who have done it, a small portion of that research is then incorporated into a draft paper, and some proportion of those papers are ultimately accepted for publication. Up until 20 years ago, when the resource pinch point was the decision of whether or not to publish something, this is exactly what you would want. The question of whether it is an effective filter, that is whether it actually filters the right stuff out, is somewhat more controversial. I would say the evidence for that is weak.

When publication and distribution were the expensive part, that was the logical place to make the decision. Now that these steps are cheap, the expensive part of the process is either peer review, the traditional process of making a decision prior to publication, or conversely the curation and filtering after publication that is seen more widely on the open web. As I have argued, I believe that using the patterns of the web will ultimately be a more effective means of enabling users to discover the right information for their needs. We should publish more, much more and much more diversely, but we also need to build effective tools for filtering and discovering the right pieces of information. Clearly this also requires work, perhaps more than we are used to doing.

An imaginary solution

So what might this imaginary system that we would design look like? I’ve written before about both key aspects of this. Firstly, I believe we need recording systems that, as far as possible, record and publish the creation of objects, be they samples, data, or tools. As far as possible these should make a reliable, time-stamped, attributable record of the creation of these objects as a byproduct of what the researcher needs to do anyway. A simple concept for instance is a label printer that, as a byproduct of printing off a label, makes a record of who, what, and when, publishing this simultaneously to a public or private feed.
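To make that concrete, here is a minimal sketch of what such a byproduct record might look like, assuming a simple JSON-lines file standing in for the public or private feed; the function names and file path are purely illustrative, not an existing tool.

```python
# A sketch only: a label-printing helper that records who/what/when as a
# byproduct. The "feed" here is just a local JSON-lines file; in practice it
# could be a blog, a repository API, or any web-addressable store.
import json
from datetime import datetime, timezone
from pathlib import Path

FEED = Path("lab-objects.jsonl")  # illustrative stand-in for a published feed

def record_object(who, what, feed=FEED):
    entry = {"who": who, "what": what,
             "when": datetime.now(timezone.utc).isoformat()}
    with feed.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

def print_label(who, what):
    # ...send the label text to the printer here (hardware-specific)...
    return record_object(who, what)  # the record is made as a byproduct

print_label("cneylon", "Sample 42: cell lysate, batch 3")
```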

Publishing rapidly is a good approach, not just for ideological reasons of openness but also for some very pragmatic ones. It is easier to publish at the source than to remember to go back and do it later. Things that aren’t done immediately are almost invariably forgotten or lost. Secondly, rapid publication has the potential both to enable efficient re-use and to reduce the risk of scooping by providing a time-stamped, citable record. This of course would require people to cite these records and for those citations to be valued as a contribution, requiring a move away from considering the paper as the only form of valid research output (see also Michael Nielsen‘s interview with me).

It isn’t enough, though, just to publish the objects themselves. We also need to be able to understand the relationships between them. In a semantic web sense this means creating the links between objects, recording the context in which they were created and what their inputs and outputs were. I have alluded a couple of times in the past to the OREChem Experimental Ontology and I think this is potentially a very powerful way of handling these kinds of connections in a general way. In many cases, particularly in computational research, recording workflows or generating batch and log files could serve the same purpose, as long as a general vocabulary could be agreed to make this exchangeable.

As these objects get linked together they will form a network, both within and across projects and research groups, providing the kind of information that makes Google work, a network of citations and links that make it possible to directly measure the impact of a single dataset, idea, piece of software, or experimental method through its influence over other work. This has real potential to help solve both the discovery problem and the filtering problem. Bottom line, Google is pretty good at finding relevant text and they’re working hard on other forms of media. Research will have some special edges but can be expected in many ways to fit patterns that mean tools for the consumer web will work, particularly as more social components get pulled into the mix.

On the rare occasions when it is worth pulling together a whole story, for a thesis, or a paper, authors would then aggregate objects together, along with text and good visuals to present the data. The telling of a story then becomes a special event, perhaps one worthy of peer review in its traditional form. The forming of a “paper” is actually no more than providing new links, adding grist to the search and discovery mill, but it can retain its place as a high status object, merely losing its role as the only object worth considering.

So in short, publish fragments, comprehensively and rapidly. Weave those into a wider web of research communication, and from time to time put in the larger effort required to tell a more comprehensive story. This requires tools that are hard to build, standards that are hard to agree, and cultural change that at times seems like spitting into a hurricane. Progress is being made, in many places and in many ways, but how can we take this forward today?

Practical steps for today

I want to write more about these ideas in the future but here I’ll just sketch out a simple scenario that I hope can be usefully implemented locally but provide a generic framework to build out without necessarily requiring a massive agreement on standards.

The first step is simple: make a record, ideally an address on the web, for everything we create in the research process. For data and software, just having the files themselves on a hard disk is a good start. Pushing them to some sort of web storage, be it a blog, github, an institutional repository, or some dedicated data storage service, is even better because it makes step two easy.

Step two is to create feeds that list all of these objects, their addresses, and as much standard metadata as possible; who and when would be a good start. I would make these open by choice, mainly because dealing with feed security is a pain, but this would still work behind a firewall.
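As a sketch of what step two might look like in practice, the following uses the Python standard library to write a bare-bones Atom feed of objects with who and when metadata; the object list, URLs, and feed identifier are illustrative assumptions rather than any existing service.

```python
# Sketch: publish a minimal Atom feed listing research objects.
from datetime import datetime, timezone
from xml.etree import ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"

def build_feed(objects, feed_id, title, author):
    ET.register_namespace("", ATOM)
    feed = ET.Element(f"{{{ATOM}}}feed")
    ET.SubElement(feed, f"{{{ATOM}}}id").text = feed_id
    ET.SubElement(feed, f"{{{ATOM}}}title").text = title
    ET.SubElement(feed, f"{{{ATOM}}}updated").text = datetime.now(timezone.utc).isoformat()
    author_el = ET.SubElement(feed, f"{{{ATOM}}}author")
    ET.SubElement(author_el, f"{{{ATOM}}}name").text = author
    for obj in objects:  # each object: url, title, created (the who/when metadata)
        entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
        ET.SubElement(entry, f"{{{ATOM}}}id").text = obj["url"]
        ET.SubElement(entry, f"{{{ATOM}}}title").text = obj["title"]
        ET.SubElement(entry, f"{{{ATOM}}}updated").text = obj["created"]
        ET.SubElement(entry, f"{{{ATOM}}}link", href=obj["url"])
    return ET.tostring(feed, encoding="unicode")

objects = [{"url": "https://example.org/data/sample-42.csv",  # illustrative
            "title": "Absorbance data, sample 42",
            "created": "2011-02-01T10:30:00Z"}]
print(build_feed(objects, "tag:example.org,2011:lab-feed",
                 "Lab object feed", "A. Researcher"))
```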

Step three gets slightly harder. Where possible configure your systems so that inputs can always be selected from a user-configurable feed. Where possible automate the pushing of outputs to your chosen storage systems so that new objects are automatically registered and new feeds created.

This is extraordinarily simple conceptually: create feeds, use them as inputs for processes. It’s not so straightforward to build such a thing into an existing tool or framework, but it doesn’t need to be terribly difficult either. Nor does it need to bother the user: feeds should be created automatically and presented to the user as drop-down menus.
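Here is a sketch of the consuming side, under the same assumptions as the feed example above: fetch an Atom feed of objects and turn its entries into the list of choices that would populate such a drop-down menu. The feed URL is, again, an illustrative placeholder.

```python
# Sketch: read an Atom feed of research objects and list them as input choices.
from urllib.request import urlopen
from xml.etree import ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def list_inputs(feed_url):
    with urlopen(feed_url) as response:
        tree = ET.parse(response)
    choices = []
    for entry in tree.getroot().findall(f"{ATOM}entry"):
        title = entry.findtext(f"{ATOM}title")
        link = entry.find(f"{ATOM}link").get("href")
        choices.append((title, link))
    return choices

# In a tool these pairs would populate the drop-down menu described above.
for title, url in list_inputs("https://example.org/lab-feed.atom"):
    print(title, url)
```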

The step beyond this, creating a standard framework for describing the relationships between all of these objects, is much harder. Not because it’s difficult, but because it requires agreement on standards for how to describe those relationships. This is do-able and I’m very excited by the work at Southampton on the OREChem Experimental Ontology, but the social problems are harder. Others prefer the Open Provenance Model or argue that workflows are the way to manage this information. Getting agreement on standards is hard, particularly if we’re trying to maximise their effective coverage, but if we’re going to build a computable record of science we’re going to have to tackle that problem. If we can crack it and get coverage of the records via a compatible set of models that tell us how things are related then I think we will be well placed to solve the cultural problem of actually getting people to use them.
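For illustration only, here is a hedged sketch of what recording such relationships as linked data might look like, using rdflib with a made-up example vocabulary; it is not the OREChem Experimental Ontology or the Open Provenance Model, just an indication of the shape such links could take.

```python
# Sketch: describe how published objects relate to one another as RDF triples.
# The ex: vocabulary is invented for illustration; a real system would use an
# agreed ontology.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/vocab#")  # illustrative, not a real vocabulary
g = Graph()
g.bind("ex", EX)

sample   = URIRef("https://example.org/objects/sample-42")
dataset  = URIRef("https://example.org/data/sample-42.csv")
protocol = URIRef("https://example.org/methods/absorbance-assay")

g.add((dataset, EX.generatedFrom, sample))    # the data came from this sample
g.add((dataset, EX.producedBy, protocol))     # ...using this method
g.add((dataset, EX.createdBy, Literal("cneylon")))

print(g.serialize(format="turtle"))
```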


The triumph of document layout and the demise of Google Wave


I am frequently overly enamoured of the idea of where we might get to, forgetting that there are a lot of people still getting used to where we’ve been. I was forcibly reminded of this by Carole Goble on the weekend when I expressed a dislike of the Utopia PDF viewer that enables active figures and semantic markup of the PDFs of scientific papers. “Why can’t we just do this on the web?” I asked, and Carole pointed out the obvious, most people don’t read papers on the web. We know it’s a functionally better and simpler way to do it, but that improvement in functionality and simplicity is not immediately clear to, or in many cases even useable by, someone who is more comfortable with the printed page.

In my defence I never got to make the second part of the argument, which is that with the new generation of tablet devices, led by the iPad, there is a tremendous potential to build active, dynamic, and (under the hood, hidden from the user) semantically backed representations of papers that are both beautiful and functional. The technical means, and the design basis, to suck people into web-based representations of research are falling into place and this is tremendously exciting.

However while the triumph of the iPad in the medium term may seem assured, my record on predicting the impact of technical innovations is not so good, given the decision by Google to pull out of further development of Wave, primarily due to lack of uptake. Given that I was amongst the most bullish and positive of Wave advocates, and yet hadn’t managed to get onto the main site for perhaps a month or so, this cannot be terribly surprising, but it is disappointing.

The reasons for the lack of adoption have been well rehearsed in many places (see the Wikipedia page or Google News for criticisms): the interface was confusing, there was a lack of clarity as to what Wave was for, and simply too much user contribution was required to build something useful. Nonetheless Wave remains for me an extremely exciting view of the possibilities. Above all it was the ability for users or communities to build dynamic functionality into documents, and to make this part of the fabric of the web, that was important to me. Indeed one of the most important criticisms for me was PT Sefton’s complaint that Wave didn’t leverage HTML formatting, that it was in a sense not a proper part of the document web ecosystem.

The key for me about the promise of Wave was its ability to interact with web-based functionality, to be dynamic; fundamentally to treat a growing document as data and present that data in new and interesting ways. In the end this was probably just too abstruse a concept to take hold with users. While single demonstrations were easy to put together, building graphs, showing chemistry, marking up text, it was the bigger picture, that this was generally possible, that never made it through.

I think this is part of the bigger problem, similar to that we experience with trying to break people out of the PDF habit that we are conceptually stuck in a world of communicating through static documents. There is an almost obsessive need to control the layout and look of documents. This can become hilarious when you see TeX users complaining about having to use Word and Word users complaining about having to use TeX for fundamentally the same reason, that they feel a loss of control over the layout of their document. Documents that move, resize, or respond really seem to put people off. I notice this myself with badly laid out pages with dynamic sidebars that shift around, inducing a strange form of motion sickness.

There seems to be a higher aesthetic bar that needs to be reached for dynamic content, something that has been rarely achieved on the web until recently and virtually never in the presentation of scientific papers. While I philosophically disagree with Apple’s iron grip over their presentation ecosystem I have to admit that this has made it easier, if not quite yet automatic, to build beautiful, functional, and dynamic interfaces.

The rapid development of tablets that we can expect, as the rough and ready but more flexible and open platforms do battle with the closed but elegant and safe environment provided by the iPad, offers real possibilities that we can overcome this psychological hurdle. Does this mean that we might finally see the end of the hegemony of the static document, that we can finally consign the PDF to the dustbin of temporary fixes where it belongs? I’m not sure I want to stick my neck out quite so far again, quite so soon, and say that this will happen, or offer a timeline. But I hope it does, and I hope it does soon.
