The problem of expertise: The Brighton/English Touring Theatre production of Stoppard’s Arcadia

Cover of first edition (Photo credit: Wikipedia)

The play Arcadia by Tom Stoppard links entropy and chaos theory, the history of English landscape gardens, romantic literature and the idiocies of academia. I’ve always thought of it as Stoppard’s most successful “clever” play, the one that best combines the disparate material he is bringing together into a coherent whole. Rosencrantz and Guildenstern are Dead feels more straightforward, more accessible, although I don’t doubt that many will feel that’s because I’m missing its depths.

In the Theatre Royal Brighton/English Touring Theatre production that just closed at the Theatre Royal in Bath the part of Thomasina was played by Dakota Blue Richards, Septimus by Wilf Scolding, Bernard by Robert Cavanah, Hannah by Flora Montgomery and Valentine by Ed McArthur. To be up front, I found the production disappointing, probably not helped by our seats up in the gods. Richards and Scolding were excellent as tutor and tutee, each learning from the other, but I found Cavanah and Montgomery’s competitive academics both thin, even shrill. The caricature in the parts of Bernard and Hannah is easy; making their motivations human – while also ensuring that their verbal sparring is intelligible – is a greater challenge, and one where I felt they fell short.

And it was in that execution of dialogue that the production seemed to fall flat. Arcadia depends on dialogue and ideas that run through two time periods, on believing that we understand the jokes shared between characters, but above all on the audience connecting the threads so that all the pieces fall into place. Despite a constant stream of references that will only make sense to those familiar with thermodynamics, or landscape history, or romantic poetry, the play itself is self-contained. All the clues that are needed to knit the many threads together can be found in the script. But the delivery is critical. The timing crucial. Every word needs to be pointed and natural and timed perfectly or the connections are lost. The responsibility lies with the audience to hold those threads and weave them together, but they depend on a faultless ensemble to make this work.

But is it fair to expect an audience to hold in their heads a comment from Act 1 that is only answered in passing an hour or more later in Act 2? How much can the playwright expect the audience to hold? Certainly the audience in Bath seemed to struggle. An audience member who knows enough about thermodynamics and romantic poets and English landscape gardening has an advantage over one who does not: simply that there is less that needs to be held onto to follow the threads. Experts have a distinct advantage in parsing the play.

It was Michael Nielsen who most clearly articulated the idea that the most valuable resource we have in the research enterprise (or in many places) is the attention of experts. And those who have internalized an understanding of an area to the extent that problem solving is intuitive are the most valuable. But at the same time we value the serendipitous, the sense of being open to the surprising insight. Valentine, the PhD scholar working on understanding population dynamics of grouse in the modern half of the story, is unable to see that Thomasina had recognized the path to modern thermodynamics years earlier than thought. His character is painted as limited in that respect until he can see it. The modern academics are similarly painted as only open to the surprises they are seeking. Thomasina’s tutor, Septimus, is perhaps of a similar age to Valentine but by contrast is painted as open to the idea that his 16 year old charge is capable of such insight.

If fortune favours a prepared mind, then that mind is expensive and time consuming to prepare. So we protect them, and lock them up inside institutions, so as to ensure only the relevant demands are made on their time. But if the provocations, the serendipities, come from outside established grounds then, to invert the question I posed of the playwright, what can we reasonably expect of the expert in dedicating time to being open? Or is there a level beyond mastery in Dreyfus’ hierarchy of skill acquisition, where the expert can not just examine their own practice, but simultaneously be open to outside influence? Does it need to be “or” or can it be “and”? Can we not just ride a bike without thinking about it, but ride a bike while thinking about how we do it?

What we may reasonably expect of the playwright in guiding us, what they may reasonably expect of their audience; what we may reasonably expect of the expert and what they may expect of us as we approach them, are connected by a sense of the contract we implicitly form: the audience with the playwright, publics with the expert. Different pieces of theatre require differing levels of commitment from the audience, with expectations set by forms, venues, even ticket prices. It is plausible to argue that commodification of theatre into mass market entertainment necessarily lowers that level of commitment as it must work for a larger audience. It might conversely be argued that the commitment required to understand American Football or rugby (or cricket!) is as great as that required to engage with Stoppard or Mozart.

In the realm of the academic expert these questions cut to the heart of what the research establishment is for, and whether we are delivering on that. Stoppard answers this question with three self-obsessed roles; one hardened by the toxic combination of imposter syndrome and arrogance that many in academia will recognise; one partway through that process, perhaps redeemable, perhaps not liking where she sees herself heading, but nonetheless ready both to rise to the bait and to patronise the third: the academic in formation. Yet it is Valentine, the PhD student, who when asked “And then what” gives the answer that speaks most directly to academic sensibilities, “I publish”. There is no explicit stage direction but, given this is said to Hannah, the middle of our three academics, the line clearly needs to be delivered with the sense of “well d’uh…”.

Yet if Stoppard makes his academics figures of fun, with their obsessions with prestige rather than engagement, how is this to be reconciled with his own demands on the audience? In truth there is a deeper message. The puzzle for the modern protagonists can only be solved by bringing their differing expertises together. Bernard uncovers the trail and knows the problem is of interest, but it is Hannah’s expertise on the garden and its plants that reveals that Bernard is only partly right. It is Valentine who has the evidence that Byron visited Sidley Park, but Hannah who pushes him to understand the depth of Thomasina’s work. Hannah in turn needs Valentine to uncover the identity of the hermit and what he was doing.

On the earlier timeline it is Thomasina who has the insight, but even her genius is limited. She laments her inability to fully work out the implications, saying “I cannot. I do not know the mathematics”. It is made clear to us that her tutor does (“Let [the Germans] have the waltz they cannot have the calculus”), making his working through of Thomasina’s ideas plausible. Stoppard throughout presents us with exactly the serendipitous connections (and misconnections) that require an open mind to see, identify and integrate into an expert model. It is the arrogance and lack of confidence of the modern characters that prevents them working together to solve the puzzle much more quickly (albeit with less drama). By contrast the greater openness of the Cambridge-educated Septimus to the possibility that Thomasina’s insight is greater than his provides many of the most sympathetic moments in the play.

Of course Stoppard is cheating. He has a static construct in which to lay his threads, ready to catch the prepared mind of the audience. The answer is known and the final curtain set. At the same time he plots jokes and asides throughout. His real genius is in setting these up so that they land as long as you know something of gardens, or thermodynamics, or academic life, or Byron, or chaos theory, or shooting, or the English aristocracy. You don’t need to know all of these, but knowing one will enrich the experience. Of course the challenge for the players is to be aware of all of these threads as well as those that make up the surface narrative. In Bath, my implicit stage direction above was missing.

In the end Stoppard does make serious demands, and repays repeated engagement with the text. But those demands are not of expertise. If the answer he poses to my question is that experts will gain more by working together and being open to each other’s expertise, what he demands of the audience is simply to be open and attentive. In the end perhaps that is the answer to the question of the reasonable expectations of expertise in the academy. Not to be prescriptive about when and how an expert focuses, or opens up, or whether they need to achieve a level of mastery that allows them to do both, but simply to be attentive. To continuously absorb, remember, compare and contrast. To value mastery as a tool, but not as the answer in its own right.

That is the mistake that the modern characters make in the play. In the best productions you understand both how Bernard, Hannah and Valentine have arrived where they are, and that they don’t very much like what they see in the mirror. That they, like many academics, are constantly seeking an escape from the pursuit of prestige and publication that they have locked themselves into. That they have been trapped into a cycle where they are required to demonstrate mastery but not necessarily apply it, and certainly not build on it. The aim for all three is publication, not understanding.

Breaking out of these cycles is hard. But in the end if Stoppard is calling us to pay attention, to be engaged enough to not just see the threads, but to hold them and weave them together, that might be enough to make a start.

Loss, time and money

May – Oct 2006 Calendar (Photo credit: Wikipedia)

For my holiday project I’m reading through my old blog posts and trying to track the conversations that they were part of. What is shocking, but not surprising with a little thought, is how many of my current ideas seem to spring into being almost whole in single posts. And just how old some of those posts are. At the same time there is plenty of misunderstanding and rank naivety in there as well.

The period from 2007-10 was clearly productive and febrile. The links out from my posts point to a distributed conversation that is, to be honest, still a lot more sophisticated than much current online discussion on scholarly communications. Yet at the same time that fabric is wearing thin. Broken links abound, both internal from when I moved my own hosting and external. Neil Saunders’ posts are all still accessible, but Deepak Singh’s seem to require a trip to the Internet Archive. The biggest single loss, though, occurs through the adoption of Friendfeed in mid-2008 by our small community. Some links to discussions resolve, some discussions of discussions survive as posts, but whole chunks of the record of those conversations – about researcher IDs, peer review, and incentives and credit – appear to have disappeared.

As I dig deeper through those conversations it looks like much of it can be extracted from the Internet Archive, but it takes time. Time is a theme that runs through posts starting in 2009 as the “real time web” started becoming a mainstream thing, resurfaced in 2011 and continues to bother. Time also surfaces as a cycle. Comments on peer review from 2011 still seem apposite and themes of feeds, aggregations and social data continue to emerge over time. On the other hand, while much of my recounting of conversations about Researcher IDs in 2009 will look familiar to those who struggled with getting ORCID up and running, a lot of the technology ideas were…well, probably best left in the same place as my enthusiasm for Google Wave. And my concerns about the involvement of Crossref in Researcher IDs are ironic given that I now sit on their board as the second representative of PLOS.

The theme that travels throughout the whole seven-ish years is that of incentives. Technical incentives – the idea that recording research should be a byproduct of what the researcher is doing anyway – and ease of use (often as rants about institutional repositories) appear often. But the core is the question of incentives for researchers to adopt open practice: issues of “credit” and how it might be given, as well as the challenges that involves, but also of exchange systems that might turn “credit” into something real and meaningful. Whether that was to be real money wasn’t clear at the time. The concerns with real money come later, as this open letter to David Willetts suggests, a year before the Finch review. Posts from 2010 on frequently mention the UK’s research funding crisis and in retrospect that crisis is the crucible that formed my views on impact and re-use as well as how new metrics might support incentives that encourage re-use.

The themes are the same, the needs have not changed so much, and many of the possibilities remain unproven and unrealised. At the same time the technology has marched on, making much of what was hard easy, or even trivial. What remains true is that the real value was created in conversations, arguments and disagreements, reconciliations and consensus. The value remains where it has always been – in a well crafted network of constructive critics and in a commitment to engage in the construction and care of those networks.

Data Capture for the Real World

Many efforts at building data infrastructures for the “average researcher” have been funded, designed and in some cases even built. Most of them have had limited success. Part of the problem has always been building systems that solve problems that the “average researcher” doesn’t know that they have. Issues of curation and metadata are so far beyond the day to day issues that an experimental researcher is focussed on as to be incomprehensible. We clearly need better tools, but they need to be built to deal with the problems that researchers face. This post is my current thinking on a proposal to create a solution that directly faces the researcher, but offers the opportunity to address the broader needs of the community. What is more, it is designed to allow that average researcher to gradually realise the potential of better practice, and to create the interfaces that will allow better technical systems to be built on top.

Solve the immediate problem – better backups

The average experimental lab consists of lab benches where “wet work” is done and instruments that are run off computers. Sometimes the instruments are in different rooms, sometimes they are shared. Sometimes they are connected to networks and backed up, often they are not. There is a general pattern of work – samples are created through some form of physical manipulation and then placed into instruments which generate digital data. That data is generally stored on a local hard disk. This is by no means comprehensive but it captures a large proportion of the work.

The problem a data manager or curator sees here is one of cataloguing the data created, creating a schema that represents where it came from and what it is. We build ontologies and data models and repositories to support them to solve the problem of how all these digital objects relate to each other.

The problem a researcher sees is that the data isn’t backed up. More than that, it’s hard to back up because institutional systems and charges make it hard to use the central provision (“it doesn’t fit our unique workflows/datatypes”) and block what appears to be the easiest solution (“why won’t central IT just let me buy a bunch of hard drives and keep them in my office?”). An additional problem is data transfer – the researcher wants the data in the right place, a problem generally solved with a USB drive. Networks are often flaky, or not under the control of the researcher, so they use what is to hand to transfer data from instrument to their working computer.

The challenge therefore is to build systems under group/researcher control that meet the needs for backup and easy file transfer. At the same time they should at least start to solve the metadata capture problem and satisfy the requirements of institutional IT providers.

The Lab Box

The PI wants to know that data is being collected, backed up and is findable. They generally want this to be local. Just as researchers still like PDFs because they can keep them locally, researchers are happiest if data is on a drive they can physically locate and control. The PI does not trust their research group to manage these backups – but central provision doesn’t serve their needs. The ideal is a local system under their control that automates data backups from the relevant control computers.

The obvious solution is a substantial hard drive with some form of simple server that “magically” sucks data up from the relevant computers. In the best case scenario appropriate drives on the instrument computers are accessible on a local network. In practice life is rarely this simple and individually creating appropriate folder permissions to allow backups is enough of a problem that it rarely gets done. One alternate solution is to use the USB drive approach – add an appropriate USB fob to the instrument computer that grabs relevant data and transmits it to the server, probably over a dedicated WiFi network. There are a bunch of security issues in how best to design this but one option is a combined drive/WiFi fob where data can be stored and then transmitted to the server.
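To make the shape of this concrete, here is a minimal sketch of what the instrument-side agent could look like, assuming the lab box simply exposes a mapped network share. The paths, the polling interval and the share name are illustrative assumptions rather than a design decision:

```python
"""Sketch of an instrument-side sync agent: poll the local "data" folder and
copy anything new or changed across to the lab box. A real version would need
retries, server-side checksums and some notion of authentication."""

import shutil
import time
from pathlib import Path

DATA_DIR = Path("C:/Users/instrument/Desktop/data")  # local instrument data (hypothetical)
LABBOX_DIR = Path("Z:/labbox/hplc-01")               # mapped share exposed by the lab box (hypothetical)

def sync_once() -> int:
    """Copy files that are missing, or newer locally than on the lab box."""
    copied = 0
    for src in DATA_DIR.rglob("*"):
        if not src.is_file():
            continue
        dest = LABBOX_DIR / src.relative_to(DATA_DIR)
        if not dest.exists() or src.stat().st_mtime > dest.stat().st_mtime:
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(src, dest)  # copy2 preserves the original timestamps
            copied += 1
    return copied

if __name__ == "__main__":
    while True:
        sync_once()
        time.sleep(60)  # poll once a minute; event-based watching would also work
```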

Once on the server the data can be accessed and if necessary access controls applied. The server system need not be complex but it probably does at least need to be on the local network. This would require some support from institutional IT. Alternately a separate WiFi network could be run, isolating the system entirely from both the web and the local network.

The data directory of instrument computers is replicated to a server via a private WiFi network. The server then provides access to those files through a web interface.

Collecting Metadata

The beauty of capturing data files at the source is that a lot of metadata can be captured automatically. The core metadata of relevance is “What”, “Who”, and “When”: what kind of data has been collected, who was the researcher who collected it, and when was it collected. For the primary researcher use cases (finding data after the student has left, recovering lost data, finding your own data six months later) this metadata is sufficient. The What is easily dealt with, as is the When. We can collect the original source location of the data (and that tells us what instrument it came from) and the original file creation date. While these are not up to the standards of data curators, who might want a structured description of what the data is, it is enough for the user: it provides enough context for them to apply their local knowledge of how source and filetype relate to data type.

“Who” is a little harder, but it can be solved with some local knowledge. Every instrument control computer I have ever seen has a folder, usually on the desktop, helpfully called “data”. Inside that folder is a set of folders with lab members’ names on them. This convention is universal enough that with a little nudging it can be relied on to deliver reasonable metadata. If the system allows user registration and automatically creates the relevant folders, then saving files to the right folder, and thus providing that critical metadata, will be quite reliable.
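Given that convention, the server side can derive the core Who/What/When almost for free from the path and the file timestamps. A minimal sketch, assuming the lab box stores files as instrument/researcher/… under a backup root (the layout and paths are assumptions, not a fixed design):

```python
"""Sketch of deriving Who/What/When metadata from the folder convention alone.
Assumes files land on the lab box as <backup root>/<instrument>/<researcher>/...;
the root path is illustrative."""

import datetime
from pathlib import Path

BACKUP_ROOT = Path("/srv/labbox/backup")  # hypothetical root on the lab box

def extract_metadata(path: Path) -> dict:
    instrument, researcher = path.relative_to(BACKUP_ROOT).parts[:2]
    return {
        "what": {"instrument": instrument, "filetype": path.suffix.lstrip(".")},
        "who": researcher,
        "when": datetime.datetime.fromtimestamp(path.stat().st_mtime).isoformat(),
        "path": str(path),
    }

records = [
    extract_metadata(p)
    for p in BACKUP_ROOT.rglob("*")
    if p.is_file() and len(p.relative_to(BACKUP_ROOT).parts) >= 3
]
```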

The desired behaviour can be encouraged even further if files dropped into the correct folder are automatically replicated to a “home computer”, thus removing the need for transfer via USB stick. Again a “convention over configuration” approach can be taken in which the directory structure found on the instrument computers is simply inverted: a data folder is created in which a folder is provided for each instrument. As an additional bonus other folders could be added which would then be treated as if they were an instrument and therefore replicated back to the main server.
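The replication back to the researcher’s own machine is then just the same copy running in the other direction, with the directory structure inverted so the researcher sees data/instrument/… for their own files. A sketch under the same assumptions; the paths and the user name are purely illustrative:

```python
"""Sketch of the inverted replication: files stored on the server as
<instrument>/<researcher>/... reappear on that researcher's own machine as
data/<instrument>/... All names are hypothetical."""

import shutil
from pathlib import Path

SERVER_ROOT = Path("/srv/labbox/backup")   # server layout: instrument/researcher/...
HOME_DATA = Path.home() / "data"           # local layout: data/instrument/...
RESEARCHER = "astudent"                    # the registered user this machine belongs to

for instrument_dir in SERVER_ROOT.iterdir():
    user_dir = instrument_dir / RESEARCHER
    if not user_dir.is_dir():
        continue
    for src in user_dir.rglob("*"):
        if src.is_file():
            dest = HOME_DATA / instrument_dir.name / src.relative_to(user_dir)
            dest.parent.mkdir(parents=True, exist_ok=True)
            if not dest.exists() or src.stat().st_mtime > dest.stat().st_mtime:
                shutil.copy2(src, dest)
```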

How good is this?

If such a system can be made reliable (and that’s not easy – finding a secure way to ensure data gets transferred to the server and managing the dropbox-style functionality suggested above is not trivial) then it can solve a remarkable number of the use cases faced by the small scale laboratory on a regular basis. It doesn’t work for massive files or for instruments that write to a database. That said, in research labs researchers are often “saving as” some form of CSV, text or Excel file type even when the instrument does have a database. It isn’t trivial to integrate into existing shared data systems, for instance for departmental instruments. Although adaptors could be easily built they would likely need to be bespoke developments working with local systems. Again, though, what frequently happens in practice is that users make a local copy of the data in their own directory system.

The major limitations are that there is no real information on what the files really are, just an unstructured record of the instrument that they came from. This is actually sufficient for most local use cases (the users know what the instruments are and the file types that they generate) but isn’t sufficient to support downstream re-use or processing. However, as we’ve argued in some previous papers, this can be seen as a feature not a bug. Many systems attempt to enforce a structured view of what a piece of data is early in the creation process. This works in some contexts but often fails in the small lab setting. The lack of structure, while preserving enough contextual information to be locally useful, can be seen as a strength – any data type can be collected and stored without it needing to be available in some sort of schema. That doesn’t mean we can’t offer some of that structure, if and when there is functionality that gives an immediate benefit back to the user, but where there isn’t an immediate benefit we don’t need to force the user to do anything extra.

Offering a route towards more

At this point the seasoned data management people are almost certainly seething, if not shouting at their computers. This system does not actually solve many of the core issues we have in data management. That said, it does solve the problem that the community of adopters actually recognises. But it also has the potential to guide them to better practice. One of the points made in the LabTrove paper, which described work from the Frey group that I was involved in, was how setting up a virtuous cycle has the potential to encourage good metadata practice. If good metadata drives functionality that is available to the user then the user will put that metadata in. But more than that, if the schema is flexible enough they will also actively engage in improving it if that improves the functionality.

The system I’ve described has two weaknesses – limited structured metadata on what the digital objects themselves actually are, and as a result very limited possibilities for capturing the relationships between them. Because we’ve focussed on “merely” backing up, our provenance information beyond that intrinsic to the objects themselves is limited. It is reasonably easy to offer the opportunity to collect more structured information on the objects themselves – when configuring an instrument, offer a selection of known types. If a known type of instrument is selected then this can be flagged to the system, which will then know to post-process the data file to extract more metadata, perhaps just enriching the record but perhaps also converting it to some form of standard file type. In turn, automating this process means that the provenance of the transformation is easily captured. In standard form the system might offer in situ visualisation or other services direct to the users, providing an immediate benefit. Such a library of transformations and instrument types could be offered as a community resource, ideally allowing users to contribute to build the library up.
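As a sketch of how that might look: a small, community-extensible registry maps an instrument to a known type and a cheap enrichment step, and every transformation writes its own provenance record as a side effect. The registry contents, the enrichment and the record format are all assumptions for illustration, not a proposed standard:

```python
"""Sketch of a plugin-style registry of known instrument types. Registering an
instrument as a known type lets the server post-process new files (here, just
reading column names and counting rows in a CSV) and record the provenance of
that transformation automatically. Everything named here is illustrative."""

import csv
import datetime
from pathlib import Path

def enrich_csv(path: Path) -> dict:
    """Cheap enrichment for CSV-like output: column names and row count."""
    with path.open(newline="") as fh:
        reader = csv.reader(fh)
        header = next(reader, [])
        return {"columns": header, "rows": sum(1 for _ in reader)}

# Hypothetical community-contributed library of instrument types
INSTRUMENT_TYPES = {
    "uvvis-01": {"kind": "UV-Vis", "enrich": enrich_csv},
}

def post_process(path: Path, instrument: str) -> dict:
    info = INSTRUMENT_TYPES.get(instrument)
    if info is None:
        return {"kind": "unknown", "enriched": False}  # unknown types still get backed up
    return {
        "kind": info["kind"],
        "enriched": True,
        "detail": info["enrich"](path),
        "provenance": {
            "source": str(path),
            "transformation": info["enrich"].__name__,
            "run_at": datetime.datetime.now().isoformat(),
        },
    }
```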

Another way to introduce this “single step workflow” concept to the user is suggested above. Let us say that by creating an “analysis” folder on their local system this gets replicated to the backup server. The next logical step is to create a few different folders that receive the results of different types of analysis. These then get indexed by the server as separate types but without further metadata. So they might separate mass spectrometry analyses from UV-Vis but not have the facility to search either of these for peaks in specific ranges. If a “peak processing” module is available that provides a search interface or visualisation then the user has an interest in registering those folders as holding data that should be submitted to the module. In doing this they are saying a number of things about the data files themselves – firstly that they should have peaks, but also possibly getting them converted to a standard format, perhaps saying what the x and y coordinates of the data are, perhaps providing a link to calibration files.
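A sketch of what that registration might amount to: the user supplies only the extra structure the module needs (which columns are x and y), and that is precisely the metadata the system gains. The folder, column names and threshold are hypothetical, and the peak finder is deliberately naive:

```python
"""Sketch of registering an analysis folder with a hypothetical "peak processing"
module. The registration carries just enough structure (x and y columns) for the
module to index and search the files; all names and values are illustrative."""

import csv
from pathlib import Path

registration = {
    "folder": Path.home() / "data" / "analysis" / "uv-vis",  # folder the user registers
    "x_column": "wavelength_nm",
    "y_column": "absorbance",
}

def find_peaks(csv_path: Path, x_col: str, y_col: str, threshold: float = 0.5):
    """Return (x, y) pairs where y is a simple local maximum above the threshold."""
    with csv_path.open(newline="") as fh:
        rows = [(float(r[x_col]), float(r[y_col])) for r in csv.DictReader(fh)]
    return [
        (x1, y1)
        for (_, y0), (x1, y1), (_, y2) in zip(rows, rows[1:], rows[2:])
        if y1 > y0 and y1 > y2 and y1 >= threshold
    ]

if registration["folder"].exists():
    for f in registration["folder"].glob("*.csv"):
        print(f.name, find_peaks(f, registration["x_column"], registration["y_column"]))
```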

The transformed files themselves can be captured by the same data model as the main database, but because they are automatically generated the full provenance can be captured. As these full-provenance files populate the database and become easier to find and use, the user in turn will look at those raw files and ask what can be done to make them more accessible and findable. It will encourage them to think more carefully about the metadata collection for their local analyses. Overall it provides a path that at each step offers a clear return for capturing a little more metadata in a little more structured form. It provides the opportunity to take the user by the hand, solve their immediate problem and carry them further along that path.

Building the full provenance graph

Once single-step transformations are available then chaining them together to create workflows is an obvious next step. The temptation at this point is to try to build a complete system in which the researcher can work, but in my view this is doomed to failure. One of the reasons workflow systems tend to fail is that they are complex and fragile. They work well where the effort involved in a large scale data analysis justifies their testing and development but aren’t very helpful for ad hoc analyses. What is more, workflow systems generally require that all of the relevant information and transformations be available. This is rarely the case in an experimental lab, where some crucial element of the process is likely to be offline or physical.

Therefore in building up workflow capabilities within this system it is crucial to build in a way that creates value when only a partial workflow is possible. Again this is surprisingly common – but relatively unexplored. Collecting a set of data together, applying a single transformation, checking for known types of data artefacts: all of these can be useful without needing an end-to-end analysis pipeline. In many cases implied objects can be created and tracked without bothering the user until they are interested. For instance most instruments require samples. The existence of a dataset implies a sample. Therefore the system can create a record of a sample – perhaps of a type appropriate to the instrument, although that’s not necessary. If a given user does runs on three instruments, each with five samples over the course of a day, it’s a reasonable guess that those are the same five samples. The system could then offer to link those records together so a record is made that those datasets are related.
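A sketch of that heuristic, building on the illustrative metadata records sketched earlier: group what one person produced on one day, and if it spans more than one instrument, propose (not impose) links between those datasets. The record fields are the assumed ones from the earlier sketch:

```python
"""Sketch of the "implied sample" heuristic: if one researcher generates data on
several instruments in the same day, offer to link those records on the guess
that they share samples. Uses the illustrative record structure sketched above."""

from collections import defaultdict
from itertools import combinations

def propose_links(records):
    """Return pairs of file paths that the system should offer to link."""
    by_user_day = defaultdict(list)
    for rec in records:
        day = rec["when"][:10]  # the ISO date part, e.g. "2014-06-15"
        by_user_day[(rec["who"], day)].append(rec)

    proposals = []
    for (_who, _day), recs in by_user_day.items():
        instruments = {r["what"]["instrument"] for r in recs}
        if len(instruments) > 1:  # same person, same day, more than one instrument
            proposals.extend(
                (a["path"], b["path"])
                for a, b in combinations(recs, 2)
                if a["what"]["instrument"] != b["what"]["instrument"]
            )
    return proposals
```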

From here it’s not a big mental leap to recording the creation of those samples in another digital object, perhaps dropping it into a directory labelled “samples”. The user might then choose to link that record with the sample records. As the links propagate from record of generation, to sample record, to raw data, to analysis, parts of the provenance graph are recorded. The beauty of the approach is that if there is a gap in the graph it doesn’t reduce the base functionality the user enjoys, but if they make the effort to link it up then they can suddenly unlock all the additional functionality that is possible. This is essentially the data model that we proposed in the LabTrove paper, but without the need to record the connections up front.

Challenges and complications

None of the pieces of this system are terribly difficult to build. There’s an interesting hardware project – how best to build the box itself and interact with the instrument controller computers. All while keeping the price of the system as low as possible. There’s a software build to manage the server system and this has many interesting user experience issues to be worked through. The two clearly need to work well together and at the same time support an ecosystem of plugins that are ideally contributed by user communities.

The challenge lies in building something where all these pieces work well together and reliably. Actually delivering a system that works requires rapid iteration on all the components while working with (probably impatient and busy) researchers who want to see immediate benefits. The opportunity, if it can be got right, is immense. While I’m often sceptical of the big infrastructure systems in this space that are being proposed and built, they do serve specific communities well. The bit that is lacking is often the interface onto the day to day workings of an experimental laboratory. Something like this system could be the glue-ware that brings those much more powerful systems into play for the average researcher.

Some notes on provenance

These are not particularly new ideas, just an attempt to rewrite something that’s been a recurring itch for me for many years. Little of this is original and it builds on thoughts and work of a very large range of people including the Frey group current and past, members of the National Crystallography Service, Dan Hagon, Jennifer Lin, Jonathan Dugan, Jeff Hammerbacher, Ian Foster (and others at Chicago and the Argonne Computation Institute), Julia Lane, Greg Wilson, Neil Jacobs, Geoff Bilder, Kaitlin Thaney, John Wilbanks, Ross Mounce, Peter Murray-Rust, Simon Hodson, Rachel Bruce, Anita de Waard, Phil Bourne, Maryann Martone, Tim Clark, Titus Brown, Fernando Perez, Frank Gibson, Michael Nielsen and many others I have probably forgotten. If you want to dig into previous versions of this that I was involved in writing, then there are pieces in Nature (paywalled unfortunately), Automated Experimentation, the PLOS ONE paper mentioned above, as well as some previous blog posts.

 

A Prison Dilemma

Saint Foucault

I am currently on holiday. You can tell this because I’m writing, reading and otherwise doing things that I regard as fun. In particular I’ve been catching up on some reading. I’ve been meaning to read Danah Boyd‘s It’s Complicated for some time (and you can see some of my first impressions in the previous post) but I had held off because I wanted to buy a copy.

That may seem a strange statement. Danah makes a copy of the book available on her website as a PDF (under a CC BY-NC license) so I could (and in the end did) just grab a copy from there. But when it comes to books like this I prefer to pay for a copy, particularly where the author gains a proportion of their livelihood from publishing. Now I could buy a hardback or paperback edition but we have enough physical books. I can buy a Kindle edition from Amazon.co.uk but I object violently to paying a price similar to the paperback for something I can only read within Amazon software or hardware, and where Amazon can remove my access at any time.

In the end I gave up – I downloaded the PDF and read that. As I read it I found a quote that interested me. The quote was from Michel Foucault’s Discipline and Punish, a study of the development of the modern prison system – the quote, if anyone is interested, was about people’s response to being observed and was interesting in the context of research assessment.

Once I’d embarrassed myself by asking a colleague who knows about this stuff whether Foucault was someone you read, or just skimmed the summary version, I set out again to find myself a copy. Foucault died in 1984 so I’m less concerned about paying for a copy, but would have been happy to buy a reasonably priced and well formatted ebook. But again the only source was Amazon. In this case it’s worse than for Boyd’s book. You can only buy the eBook from the US Amazon store, which requires a US credit card. Even if I was happy with the Amazon DRM, and someone was willing to buy the copy for me, I would be technically violating territorial rights in obtaining that copy.

It was ironic that all this happened the same week that the European Commission released its report on submissions to the Public Consultation on EU Copyright Rules. The report quickly develops a pattern. Representatives of public groups, users and research users describe a problem with the current way that copyright works. Publishers and media organisations say there is no problem. This goes on and on for virtually every question asked:

In the print sector, book publishers generally consider that territoriality is not a factor in their business, as authors normally provide a worldwide exclusive licence to the publishers for a certain language. Book publishers state that only in the very nascent eBooks markets some licences are being territorially restricted.

As a customer I have to say it’s a factor for me. I can’t get the content in the form I want. I can’t get it with the rights I want, which means I can’t get the functionality I want. And I often can’t get it in the place I want. Maybe my problem isn’t important enough or there aren’t enough people like me for publishers to care. But with traditional scholarly monograph publishing apparently in a death spiral it seems ironic that these markets aren’t being actively sought out. When books only sell a few hundred copies every additional sale should matter. When books like Elinor Ostrom’s Governing the Commons aren’t easily available then significant revenue opportunities are being lost.

Increasingly it is exactly the relevant specialist works in social sciences and humanities that I’m interested in getting my hands on. I don’t have access to an academic library, the nearest I might get access to is a University focussed on science and technology and in any case the chance of any specific scholarly monograph being in a given academic library is actually quite low. Inter-library loans are brilliant but I can’t wait a week to check something.

I spent nearly half a day trying to find a copy of Foucault’s book that was in the form I wanted with the rights I wanted. I’ve spent hours trying to find a copy of Ostrom’s as well. In both cases it is trivial to find a copy online – it took me around 30 seconds. In both cases it’s relatively easy to find a second hand print copy. I guess for traditional publishers it’s easy to dismiss me as part of a small market, one that’s hard to reach and not worth the effort. After all, what would I know, I’m just the customer.

 

Remembering Jean-Claude Bradley

Jean-Claude

It takes me a while to process things. It often then takes me longer to feel able to write about them. Two weeks ago we lost one of the true giants of Open Science. Others have written about Jean-Claude’s work and his contributions – and I don’t feel able to add much to those reflections at the moment. I will also be participating in a conference panel in August at which Jean-Claude was going to speak and which will now be dedicated to his memory – I may have something more coherent to say by then.

In the end I believe that Jean-Claude would want to be remembered by the work that he left online for anyone to see, work with, and contribute to. He was fearless and uncompromising in working, exploring, and thinking online, always seeking new ways to more effectively expose the process of the research he was doing. He could also be a pain to work with. There was never any question that if you chose to work with him then you worked to his standards – perhaps rather he would choose to work with you only if you worked to his standards. By maintaining the principle of immediate and complete release of the research process, he taught many of us that there was little to fear and much to gain from this radical openness. Today I often struggle to sympathise with other people’s fears of what might happen if they open up, precisely because Jean-Claude forced me to confront those fears early on and showed me how ill founded many of them are.

In his constant quest to get more of the research process online as fast as possible Jean-Claude would grab whatever tools were to hand. Wiki platforms, YouTube, Blogger, SecondLife, Preprint servers, GoogleDocs and innumerable other tools were grasped and forced into service, linked together to create a web, a network of information and resources. Sometimes these worked and sometimes they didn’t. Often Jean-Claude was ahead of his time, pushing tools in their infancy to the limits and seeing the potential that in many cases is only beginning to be delivered now.

Ironically by appropriating whatever technology was to hand he spread the trace of his work across a wide range of services, many of them proprietary, simply because they were the best tools available at the time. If the best way to remember his work is through preserving that web of resources then we now face a serious challenge. How far does that trace spread? Do we have the rights to copy and preserve it? If so what parts? How much of the history do we lose by merely taking a copy? Jean-Claude understood this risk and engaged early on with the library at Drexel to archive core elements of his program – once again pushing the infrastructure of institutional repositories beyond what they had been intended to do. But his network spread far further than what has currently been preserved.

I want to note that in the hours after we heard the news I didn’t realise the importance of preserving Jean-Claude’s work. I think it’s important to recognise that it was information management professionals who immediately realised both the importance of preservation and the risks to the record, and set in motion the processes necessary to start that work. I remain, like most researchers I suspect, sloppy and lazy about proper preservation. We need the support of professionals who understand the issues and technical challenges, but who are also engaged with the preservation of works and media outside the scholarly mainstream, if science that is truly on the web is to have a lasting impact. The role of a research institution, if it is to have one in the future, is in part to provide that support, literally to institutionalise the preservation of digital scholarship.

The loss of Jean-Claude leaves a gaping hole. In life his determination to provide direction, to show what could be done if you chose, was a hard act to follow. His rigid (and I will admit to finding it sometimes too rigid) adherence to principles provided that direction – always demanding that we take each extra step. The need for preserving his work is showing us what we should have been doing all along. It should probably not be surprising that even in death he is still providing direction, and also not surprising that we will continue to struggle to realise his vision of what research could be.


Fork, merge and crowd-sourcing data curation

I like to call this one "Fork"

Over the past few weeks there has been a sudden increase in the amount of financial data on scholarly communications in the public domain. This was triggered in large part by the Wellcome Trust releasing data on the prices paid for Article Processing Charges by the institutions it funds. The release of this pretty messy dataset was followed by a substantial effort to clean that data up. This crowd-sourced data curation process has been described by Michelle Brook. Here I want to reflect on the tools that were available to us and how they made some aspects of this collective data curation easy, but also made some other aspects quite hard.

The data started its life as a csv file on Figshare. This is a very frequent starting point. I pulled that dataset and did some cleanup using OpenRefine, a tool I highly recommend as a starting point for any moderate to large dataset, particularly one that has been put together manually. I could use OpenRefine to quickly identify and correct variant publisher and journal name spellings, clean up some of the entries, and also find issues that looked like mistakes. It’s a great tool for doing that initial cleanup, but it’s a tool for a single user, so once I’d done that work I pushed my cleaned up csv file to github so that others could work with it.
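For those who prefer scripts to OpenRefine, the same first pass can be sketched in a few lines of Python. This is not what I actually ran, just an illustration of the kind of normalisation involved; the file name, column name and mapping are assumptions rather than the real dataset schema:

```python
"""Sketch of the kind of clean-up OpenRefine makes easy: collapsing variant
publisher spellings in an APC spreadsheet. File and column names are illustrative."""

import csv

CANONICAL = {
    "elsevier bv": "Elsevier",
    "elsevier ltd": "Elsevier",
    "wiley-blackwell": "Wiley",
    "john wiley & sons": "Wiley",
}

def normalise(name: str) -> str:
    key = " ".join(name.lower().split())  # collapse case and stray whitespace
    return CANONICAL.get(key, name.strip())

with open("apc_spend.csv", newline="") as src, open("apc_spend_clean.csv", "w", newline="") as out:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames)
    writer.writeheader()
    for row in reader:
        row["Publisher"] = normalise(row["Publisher"])
        writer.writerow(row)
```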

After pushing to github a number of people did exactly what I’d intended and forked the dataset. That is, they took a copy and added it to their own repository. In the case of code, people will fork a repository, add to or improve the code, and then make a pull request that notifies the original repository owner that there is new code that they might want to merge into their version of the codebase. The success of github has been built on making this process easy, even fun. For data the merge process can get a bit messy, but the potential was there for others to do some work and for us to be able to combine it back together.

But github is really only used by people comfortable with command line tools – my thinking was that people would use computational tools to enhance the data. But Theo Andrews had the idea to bring in many more people to manually look at and add to the data. Here an online spreadsheet that many people can work with at once, such as those provided by GoogleDocs, is a powerful tool, and it was through that adoption of the GDoc that somewhere over 50 people were able to add to the spreadsheet and annotate it to create a high value dataset that allowed the Wellcome Trust to do a much deeper analysis than had previously been the case. The dataset had been forked again, now to a new platform, and this tool enabled what you might call a “social merge”, collecting the individual efforts of many people through an easy to use tool.

The interesting thing was that exactly the facilities that made the GDoc attractive for manual crowdsourcing efforts made it very difficult for those of us working with automated tools to contribute effectively. We could take the data and manipulate it, forking again, but if we then pushed that re-worked data back we ran the risk of overwriting what anyone else had done in the meantime. That live online multi-person interaction, which works well for people, was actually a problem for computational processing. The interface that makes working with the data easy for people actually created a barrier to automation and a barrier to merging back what others of us were trying to do. [As an aside, yes we could in principle work through the GDocs API but that’s just not the way most of us work when doing this kind of data processing.]

Crowdsourcing of data collection and curation tends to follow one of two paths. Collection of data is usually done into some form of structured data store, supported by a form that helps the contributor provide the right kind of structure. Tools like EpiCollect provide a means of rapidly building these kinds of projects. At the other end, large scale data curation efforts, such as GalaxyZoo, tend to create purpose built interfaces to guide the users through the curation process, again creating structured data. Where there has been less tool building, and fewer big successes, is the space in the middle, where messy or incomplete data has been collected and a community wants to enhance it and clean it up. OpenRefine is a great tool, but isn’t collaborative. GDocs is a great collaborative platform but creates barriers to using automated cleanup tools. Github and code repositories are great for supporting the fork, work, and merge back patterns but don’t support direct human interaction with the data.

These issues are part of a broader pattern of issues with Open Access, Data, and Educational Resources more generally. With the right formats, licensing and distribution mechanisms we’ve become very very good at supporting the fork part of the cycle. People can easily take that content and re-purpose it for their own local needs. What we’re not so good at is providing the mechanisms, both social and technical, to make it easy to contribute those variations, enhancements and new ideas back to the original resources. This is both a harder technical problem and challenging from a social perspective. Giving stuff away, letting people use it, is easy because it requires little additional work. Working with people to accept their contributions back in takes time and effort, both often in short supply.

The challenge may be even greater because the means for making one type of contribution easier may make others harder. That certainly felt like the case here. But if we are to reap the benefits of open approaches then we need to do more than just throw things over the fence. We need to find the ways to gather back and integrate all the value that downstream users can add.


Open is a state of mind

William Henry Fox Talbot’s ‘The Open Door’ (Photo credit: Wikipedia)

“Open source” is not a verb

Nathan Yergler via John Wilbanks

I often return to the question of what “Open” means and why it matters. Indeed the very first blog post I wrote focussed on questions of definition. Sometimes I return to it because people disagree with my perspective. Sometimes because someone approaches similar questions in a new or interesting way. But mostly I return to it because of the constant struggle to get across the mindset that it encompasses.

Most recently I addressed the question of what “Open” is about in an online talk I gave for the Futurium Program of the European Commission (video is available). In this I tried to get beyond the definitions of Open Source, Open Data, Open Knowledge, and Open Access to the motivation behind them, something which is both non-obvious and conceptually difficult. All of these various definitions focus on mechanisms – on the means by which you make things open – but not on the motivations behind that. As a result they can often seem arbitrary and rules-focussed, and do become subject to the kind of religious wars that result from disagreements over the application of rules.

In the talk I tried to move beyond that, to describe the motivation and the mindset behind taking an open approach, and to explain why this is so tightly coupled to the rise of the internet in general and the web in particular. Being open, as opposed to making open resources (or making resources open), is about embracing a particular form of humility. For the creator it is about embracing the idea that – despite knowing more about what you have done than any other person – the use and application of your work is something that you cannot predict. Similarly, for someone working on a project, being open is understanding that – despite the fact you know more about the project than anyone else – crucial contributions and insights could come from unknown sources. At one level this is just a numbers game: given enough people it is likely that someone, somewhere, can use your work, or contribute to it in unexpected ways. As a numbers game it is rather depressing on two fronts. First, it feels as though someone out there must be cleverer than you. Second, it doesn’t help because you’ll never find them.

Most of our social behaviour and thinking feels as though it is built around small communities. People prefer to be a (relatively) big fish in a small pond, scholars even take pride in knowing the “six people who care about and understand my work”, the “not invented here” syndrome arises from the assumption that no-one outside the immediate group could possibly understand the intricacies of the local context enough to contribute. It is better to build up tools that work locally rather than put an effort into building a shared community toolset. Above all, the effort involved in listening for, and working to understand, outside contributions is assumed to be wasted. There is no point “listening to the public” because they will “just waste my precious time”. We work on the assumption that, even if we accept the idea that there are people out there who could use our work or could help, we can never reach them. That there is no value in expending effort to even try. And we do this for a very good reason; because for the majority of people, for the majority of history, it was true.

For most people, for most of history, it was only possible to reach and communicate with small numbers of people. And that means in turn that for most kinds of work, those networks were simply not big enough to connect the creator with the unexpected user, the unexpected helper with the project. The rise of the printing press, and then telegraph, radio, and television changed the odds, but only the very small number of people who had access to these broadcast technologies could ever reach larger numbers. And even they didn’t really have the tools that would let them listen back. What is different today is the scale of the communication network that binds us together. By connecting millions and then billions together, the probability that people who can help each other can be connected has risen to the point that, for many types of problem, they actually are.

That gap between “can” and “are” – the gap between the idea that there is a connection with someone, somewhere, that could be valuable, and actually making the connection – is the practical question that underlies the idea of “open”. How do we make resources discoverable and re-usable so that they can find those unexpected applications? How do we design projects so that outside experts can both discover them and contribute? Many of these movements have focussed on the mechanisms of maximising access, the legal and technical means to maximise re-usability. These are important; they are a necessary but not sufficient condition for making those connections. Making resources open enables re-use, enhances discoverability, and by making things more discoverable and more usable, has the potential to enhance both discovery and usability further. But beyond merely making resources open we also need to be open.

Being open goes in two directions. First we need to be open to unexpected uses. The Open Source community was first to this principle by rejecting the idea that it is appropriate to limit who can use a resource. The principle here is that by being open to any use you maximise the potential for use. Placing limitations always has the potential to block unexpected uses. But the broader open source community has also gone further by exploring and developing mechanisms that support the ability of anyone to contribute to projects. This is why Yergler says “open source” is not a verb. You can license code, you can make it “open”, but that does not create an Open Source Project. You may have a project to create open source code, an “Open-source project“, but that is not necessarily a project that is open, an “Open source-project“. Open Source is not about licensing alone, but about public repositories, version control, documentation, and the creation of viable communities. You don’t just throw the code over the fence and expect a project to magically form around it, you invest in and support community creation with the aim of creating a sustainable project. Successful open source projects put community building, outreach, both reaching contributors and encouraging them, at their centre. The licensing is just an enabler.

In the world of Open Scholarship, and I would include both Open Access and Open Educational Resources in this, we are a long way behind. There are technical and historical reasons for this but I want to suggest that a big part of the issue is one of community. It is in large part about a certain level of arrogance. An assumption that others, outside our small circle of professional peers, cannot possibly either use our work or contribute to it. There is a comfort in this arrogance, because it means we are special, that we uniquely deserve the largesse of the public purse to support our work because others cannot contribute. It means we do not need to worry about access because the small group of people who understand our work “already have access”. Perhaps more importantly it encourages the consideration of fears about what might go wrong with sharing over a balanced assessment of the risks of sharing versus the risks of not sharing: the risks of not finding contributors, of wasting time, of repeating what others already know will fail, or of simply never reaching the audience who can use our work.

It also leads to religious debates about licenses, as though a license were the point or copyright was really a core issue. Licenses are just tools, a way of enabling people to use and re-use content. But the license isn’t what matters, what matters is embracing the idea that someone, somewhere can use your work, that someone, somewhere can contribute back, and adopting the practices and tools that make it as easy as possible for that to happen. And that if we do this collectively that the common resource will benefit us all. This isn’t just true of code, or data, or literature, or science. But the potential for creating critical mass, for achieving these benefits, is vastly greater with digital objects on a global network.

All the core definitions of “open”, from the Open Source Definition, to the Budapest (and Berlin and Bethesda) Declarations on Open Access, to the Open Knowledge Definition, have a common element at their heart – that an open resource is one that any person can use for any purpose. This might be good in itself, but that’s not the real point. The point is that it embraces the humility of not knowing. It says: I will not restrict uses because that damages the potential of my work to reach others who might use it. And in doing this I provide the opportunity for unexpected contributions. With Open Access we’ve only really started to address the first part, but if we embrace the mindset of being open then both follow naturally.


What’s the right model for shared scholarly communications infrastructure?

Dollar (Photo credit: Wikipedia)

There have been a lot of electrons spilled over the Elsevier acquisition of Mendeley. I don’t intend to add too much to that discussion but it has provoked for me an interesting train of thought which seems worth thinking through. For what it’s worth my views of the acquisition are not too dissimilar to those of Jason Hoyt and John Wilbanks, and I recommend their posts. I have no doubt that the Mendeley team remain focussed on their vision and I hope they do well with it. And even with the cash reserves of Elsevier you don’t spend somewhere in the vicinity of $100M on something you intend to break.

But the question is not the intentions of individuals, or even the intentions of the two organisations, but whether the culture and promise of Mendeley can survive, or perhaps even thrive, within the culture and organisation of Elsevier. No-one can know whether that will work; we will simply have to wait and see. But that raises a broader question for me. A for-profit startup, particularly one funded by VCs, has a limited number of exit strategies: IPO, sale, or, more rarely, a gradual move to becoming a revenue-positive independent company. This means startups behave in certain ways, and it means that interacting with them, particularly depending on them, has certain risks, primarily that a big competitor could buy your important partner out from under you. It’s not just the community who are wondering what Elsevier will do with the data and community that Mendeley will bring them, it’s also the other big publishers who were seeing valuable traffic and data coming to them from Mendeley, it’s the whole ecology of organisations that came to rely on the API.

It can be tempting to think that the world would be a better place if this kind of innovation were done by non-profits rather than startups. Non-profits have their strengths: a statutory requirement to focus on mission, and the assurance that the promise of a big buy-out won’t change management behaviour. But non-profits have their weaknesses as well. That focus on mission can prevent the pivot that can make a startup. It can be much harder to raise capital. And where a non-profit is governed by a board drawn from a diverse community, conflicts of interest can make decision making glacial.

The third model is that of academic projects, and many useful tools have come from this route, but again there are problems. The peculiar nature of academic projects means that the financial imperatives that characterise the early stages of both for-profits and not-for-profits never really seem to bite. This can lead in turn to a lack of focus on user requirements and from there to a lack of adoption that condemns new tools to the category of interesting, even exciting, but not viable.

Of course all weaknesses are strengths in a different context. The freedom to explore in an academic context can enable exceptional leaps that would never be possible when you are focussed on finding next month’s rent. The promise of equity can bring in people whose salary you could never afford. The requirement for consensus can be painful, but it means that where agreement can be found it is so much more powerful.

Geoff Bilder at the Rigour and Openness meeting in Oxford last week commented that the board of Crossref was made up of serious commercial competitors who could struggle to reach agreement because of their different interests. The process of building ORCID was painfully and frustratingly slow for many of us because of the different and sometimes conflicting needs of the various stakeholder groups. But when agreement is reached it is so much more powerful because it is clear that there is strong shared need. And agreement is the sign that something really needs to be done.

What has struck me in the conversation of the last week or so is how the interests of a very diverse range of stakeholders (researchers, altmetrics advocates, and publishers both radical and traditional) seem to be coming into alignment, at least on some issues. We need a way to build up shared infrastructure that can be utilised by all of us. Community-run not-for-profits seem a good model for that, yet the innovation that builds new elements of infrastructure often comes from commercial startups. A for-profit can raise development capital to support a new tool, but it may never enjoy the trust from a potential user base that an academic project would.

What our sector lacks, and this might well be a more general problem, is a deep understanding of how these different development and governance models can be combined and applied in different places. We need incubators for non-profits, but we also need models where a community non-profit might be set up to buy out a startup. Various publishers have labs groups, and technology will continue to be a key point of competition, but is there a space to do what pharmaceutical companies are increasingly doing and make some parts of the drug development process pre-competitive, so that everyone benefits from a shared infrastructure?

I don’t have any answers, nor do I have experience of running either for-profit or non-profit startups. But it feels like we are at a moment in time where we are starting to see shared infrastructure needs for the whole sector. It isn’t in anyone’s long term interest for us to have to build this infrastructure more than once, which means we need to find the right way both to support innovative developments and to ensure that they end up in hands that everyone feels they can trust.

 


The challenge for scholarly societies

Cemetery Society (Photo credit: Aunt Owwee)

With major governments signalling a shift to Open Access, it seems like a good time to ask which organisations in the scholarly communications space will survive the transition. It is likely that the major current publishers will survive, although relative market shares and focus are likely to change. But the biggest challenges are faced by small to medium scholarly societies that depend on journal income for their current viability. What changes are necessary for them to navigate this transition, and can they survive?

The fate of scholarly societies is one of the most contentious, even emotional, issues in the open access landscape. Many researchers have strong emotional ties to their disciplinary societies, and these societies often play a crucial role in supporting meetings, providing travel stipends to young researchers, awarding prizes, and representing the community. At the same time they face a peculiar bind. The money that supports these efforts often comes from journal subscriptions. Researchers are very attached to the benefits but seem disinclined to countenance membership fees that would support them. This problem is seen across many parts of the research enterprise, where researchers, or at least their institutions, are paying for services through subscriptions but are unwilling to pay for them directly.

What options do societies have? Those with a large publication program could do worse in the short term than look very closely at the announcement from the UK Royal Society of Chemistry last week. The RSC is offering an institutional mechanism whereby institutions with a particular level of subscription receive an equivalent amount of publication services, set at a price of £1600 per paper. This is very clever for the RSC: it helps institutions prepare effectively for changes in UK policy, it costs the RSC nothing, and it lets them experiment with a route to transition to full open access at relatively low risk. Because the contribution of UK institutions with this particular subscription plan is relatively small it is unlikely to reduce subscriptions significantly in the short term, but if and when it does it positions the RSC to offer package deals on publication services with very similar terms. Tactically, moving early also allows the RSC to hold a higher price point than later movers will, and will help to increase its market share in the UK over that of the ACS.
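As a rough illustration of how such an offset arrangement works in practice, here is a minimal sketch that converts a qualifying subscription spend into publication credits at the stated £1600 per paper. Only that price point comes from the RSC announcement; the institutional spend figures below are entirely hypothetical.

```python
# Rough sketch of a subscription-to-publication offset scheme.
# Only the £1600 per-paper price comes from the RSC announcement;
# the subscription spend figures below are hypothetical.

PRICE_PER_PAPER_GBP = 1600

def publication_credits(qualifying_spend_gbp: float) -> int:
    """Number of open access papers covered by a qualifying subscription."""
    return int(qualifying_spend_gbp // PRICE_PER_PAPER_GBP)

if __name__ == "__main__":
    for spend in (8_000, 16_000, 48_000):  # hypothetical institutional spends
        print(f"£{spend:>6,} subscription -> {publication_credits(spend)} papers covered")
```

The design point is simply that the publisher gives up no subscription revenue today while pre-positioning a per-paper price for the day subscriptions do start to fall.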

Another route is for societies to explore the “indie band model”. Similar to bands that are trying to break through by giving away their recorded material but charging for live gigs, societies could focus on raising money through meetings rather than publications. Some societies already do this, having historically focussed on running large-scale international or national meetings. The “in person” experience is something that cannot yet be done cheaply over the internet, and “must attend” meetings offer significant income and sponsorship opportunities. There are challenges to be navigated here (ensuring commercial contributions don’t damage the brand or reduce the quality of meetings being a big one), but expect conference fees to rise as subscription incomes drop. Societies that currently run lavish meetings off the back of journal income will face a particular struggle over the next two to five years.

But even meetings are unlikely to offer a long term solution. It’s some way off yet, but rising costs of travel and the increasing quality of videoconferencing will start to eat into this market as well. If all the big speakers are dialling in remotely, is it still worth attending the meeting? So what are the real value offerings that societies can provide? What are the things that are unique to that community’s collection of expertise, that no-one else can provide?

Peer review (pre-, post-, or peri-publication) is one of them. Publication services are not. Publication, in the narrow sense of “making public”, will be commoditised, if it hasn’t been already. With new players like PeerJ and F1000 Research alongside the now fairly familiar landscape of the wide-ranging megajournal, the space for publication services to make fat profits is narrowing rapidly. This will, sooner or later, be a low margin business with a range of options to choose from when someone, whether a society or a single researcher, is looking for a platform to publish their work. While the rest of us may argue over whether this will happen next year or in a decade, for societies it is the long term that matters, and in the long term commoditisation will happen.

The unique offering that a society brings is the aggregation and organisation of expert attention. In a given space a scholarly society has a unique capacity to coordinate and organise assessment by domain experts. I can certainly imagine a society offering peer review as a core member service, independent of whether the thing being reviewed is already “published”. This might be a case where there are real benefits to operating at a small scale, both because of the peer pressure on each member of the community to pull their weight and because the scale of the community lends itself to being understood and managed as a small set of partly connected small-world networks. The question is really whether the sums add up. Will members pay $100 or $500 per year for peer review services? Would that provide enough income? What about younger members without grants? And perhaps crucially, how cheap would a separated publication platform have to be to make the sums look good?
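To make those questions concrete, here is a back-of-envelope sketch of the sums. Every figure in it (membership, fees, submission volume, administrative cost) is a hypothetical illustration rather than data from any real society; the point is only to show how the break-even cost of a separated publication platform falls out of the other assumptions.

```python
# Back-of-envelope: can membership fees fund peer review plus a cheap
# publication platform? All figures are hypothetical illustrations.

members = 2000                        # society membership (hypothetical)
fee_per_member = 100                  # annual fee in USD (hypothetical)
papers_per_year = 400                 # submissions handled by the society (hypothetical)
review_admin_cost_per_paper = 150     # editorial/administrative cost per paper (hypothetical)

income = members * fee_per_member
review_costs = papers_per_year * review_admin_cost_per_paper
surplus = income - review_costs

# The surplus sets how much could be spent per paper on a separate
# commodity publication platform before the sums stop adding up.
max_platform_cost_per_paper = surplus / papers_per_year

print(f"Membership income:                 ${income:,}")
print(f"Peer review administration costs:  ${review_costs:,}")
print(f"Break-even platform cost per paper: ${max_platform_cost_per_paper:,.0f}")
```

Under these invented numbers the platform would need to cost a few hundred dollars per paper or less, which is exactly why the commoditisation of publication services matters so much to this model.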

Societies are all about community. Arguably most societies completely missed the boat on the potential of the social web when they could have built community hubs of real value, and those that didn’t miss it entirely largely created badly built and ill-thought-through community forums well after the first flush of failed generic “Facebook for Science” clones had faded. But another chance is coming. As the ratchet moves on funder and government open access policies, society journals stuck in a subscription model will become increasingly unattractive options for publication. The slow rate of progress and disciplinary differences will allow some to hold on past the point of no return, and these societies will wither and die. Some societies will investigate transitional pricing models; I commend the example of the RSC to small societies as something to look at closely. Some may choose to move to publishing collections in larger journals where they retain editorial control. My bet is that those that survive will be the ones that find a way to make the combined expertise of the community pay, and I think the place to look for that will be those societies that find ways to decouple the value they offer through peer review from the costs of publication services.

This post was inspired by a Twitter conversation with Alan Cann and builds on many conversations I’ve had with people including Heather Joseph, Richard Kidd, David Smith, and others. Full disclosure: I’m interested, in my role as Advocacy Director for PLOS, in the question of how scholarly societies can manage a transition to an open access world. However, this post is entirely my own reflections on these issues.


Added Value: I do not think those words mean what you think they mean

There are two major strands to the position that traditional publishers have taken in justifying the process by which they will make the, now inevitable, transition to a system supporting Open Access. The first of these is that the transition will cost “more money”. The exact costs are not clear, but the broadly reasonable assumption is that there needs to be transitional funding available to support what will clearly be a mixed system over some transitional period. The argument, of course, is over how much money and where it will come from, as well as an issue that hasn’t yet been publicly broached: how long will it last? Expect lots of positioning on this over the coming months, with statements about “average paper costs” and “reasonable time frames”, with incumbent subscription publishers targeting figures of around $2,500-5,000 and ten years respectively, and those on my side of the fence suggesting figures of around $1,500 and two years. This will be fun to watch, but the key will be to see where this money comes from (and what subsequently gets cut), the mechanisms put in place to release this “extra” money, and the way in which they are set up so as to wind down and provide downwards price pressure.

The second arm of the publisher argument has been that they provide “added value” over what the scholarly community contributes to the publication process. It has become a common refrain of the incumbent subscription publishers that they are not doing enough to explain this added value. Most recently David Crotty has posted at Scholarly Kitchen saying that this was a core theme of the recent SSP meeting. This value exists, but clearly we disagree on its magnitude. The problem is that we never see any actual figures given. But I think there are some recent numbers that can help us put some bounds on what this added value really is, and ironically they have been provided by the publisher associations in their efforts to head off six month embargo periods.

When we talk about added value we can posit some imaginary “real” value, but this is not a useful number: there is no way we can determine it. What we can do is talk about realisable value, i.e. the amount that the market is prepared to pay for the additional functionality that is being provided. I don’t think we are in a position to pin that number down precisely, and clearly it will differ between publishers, disciplines, and workflows, but what I want to do is identify some points which I think help to bound it, both from the provider and the consumer side. In doing this I will use a few figures and reports as well as place an explicit interpretation on the actions of various parties. The key data points I want to use are as follows:

  1. All publisher associations and most incumbent publishers have actively campaigned against open access mandates that would make the final refereed version of a scholarly article (prior to typesetting, publication, indexing, and archival) available online in any form, either immediately or within six months after publication. The Publishers Association (UK) and ALPSP are both on record as stating that such a mandate would be “unsustainable”, and most recently that it would bankrupt publishers.
  2. In a survey run by ALPSP of research libraries (although there are a series of concerns that have to be raised about its methodology), a significant proportion of libraries stated that they would cut some subscriptions if the majority of research articles were available online six months after formal publication. The survey notes that most respondents appeared to assume that the freely available version would be the original author version, i.e. not the version that was peer reviewed.
  3. There are multiple examples of financially viable publishing houses running a pure Open Access programme with average author charges of around $1500. These are concentrated in the life and medical sciences where there is both significant funding and no existing culture of pre-print archives.
  4. The SCOAP3 project has created a formal journal publication framework which will provide open access to peer reviewed papers for a community that does have a strong pre-print culture utilising the ArXiv.

Let us start at the top. Publishers actively campaign against a reduction of embargo periods. This makes it clear that they do not believe that the product they provide, in transforming the refereed version of a paper into the published version, has sufficient value that their existing customers will pay for it at the existing price. That is remarkable, and a frightening hole at the centre of our current model. The service providers can only provide sufficient added value to justify the current price if they additionally restrict access to the “non-added-value” version. A supplier that was confident about the value it adds would have no such concerns; indeed it would be proud to compete with this prior version, confident that the additional price it was charging was clearly justified. That they do not should be a concern to all of us, not least the publishers.

Many publishers also seek to restrict access to any prior version, including the author’s original version prior to peer review. These publishers don’t even believe that their management of the peer review process adds sufficient value to justify the price they are charging. This is shocking. The ACS, for instance, has so little faith in the value that it adds that it seeks to control all prior versions of any paper it publishes.

But what of the customer? Well, the ALPSP survey, if we take the summary at face value as I have suggested above, suggests that libraries also doubt the value added by publishers. This is more of a quantitative argument, but the fact that some libraries would cancel some subscriptions shows that the community does not believe the current price is worth paying, even allowing for a six month delay in access. So broadly speaking we can see that neither the current service providers nor the current customers believe that the costs of the pure service element of subscription-based scholarly publication are justified by the value added through this service. In combination this means we can place an upper bound on the value added by publishers.

If we take the approximately $10B currently paid in cash costs to recompense publishers for their work in facilitating scholarly communications, neither the incumbent subscription publishers nor their current library customers believe that the value added by publishers justifies that cost, absent artificial restrictions on access to the non-value-added version.

This tells us not very much about what the realisable value of this work actually is, but it does provide an upper bound. But what about a lower bound? One approach would be to turn to the services provided to authors by Open Access publishers. These costs are willingly incurred by a paying customer, so it is tempting to use them directly as a lower bound. This is probably reasonable in the life and medical sciences, but as we move into other disciplinary areas, such as mathematics, it is clear that this cost level is not seen as attractive enough. In addition, the life and medical sciences have no tradition of wide availability of pre-publication versions of papers. That means that for these disciplines the willingness to pay the approximately $1500 average cost of APCs is in part bound up with the wish to make the paper effectively available through recognised outlets. We have not yet separated the value of making the original copy available from the added value provided by the publishing service. The $1000-1500 mark is however a touchstone worth bearing in mind for these disciplines.

To do a fair comparison we would need to find a space where there is a thriving pre-print culture and a demonstrated willingness to pay a defined price for added value, in the form of formal publication, over and above this existing availability. The Sponsoring Consortium for Open Access Publishing in Particle Physics (SCOAP3) is an example of precisely this. The particle physics community have essentially decided unilaterally to assume control of the journals for their area and have placed their service requirements out for tender. Unfortunately this means we don’t have the final prices yet, but we will soon, and the executive summary of the working party report suggests a reasonable price range of €1000-2000. If we assume the successful tender comes in at, or slightly below, the lower end of this range, we see an accepted price for added value, over that already provided by the ArXiv for this disciplinary area, that is not a million miles away from that figure of $1500.

Of course this is before real price competition in this space is factored in. The realisable value is a function of the market and as prices inevitably drop there will be downward pressure on what people are willing to pay. There will also be increasing competition from archives, repositories, and other services that are currently free or near free to use, as they inevitably increase the quality and range of the services they offer. Some of these will mirror the services provided by incumbent publishers.

A reasonable current lower bound for realisable added value by publication service providers is ~$1000 per paper. This is likely to drop as market pressures come to bear and existing archives and repositories seek to provide a wider range of low cost services.
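Pulling the two ends together, the sketch below contrasts the implied current average cash cost per paper (using the roughly $10B figure discussed above and an assumed global output of around two million articles a year, which is my own assumption for illustration rather than a figure from the reports cited) with the ~$1000-1500 touchstone suggested by APCs and the SCOAP3 tender range.

```python
# Rough bounds on realisable added value per paper.
# The ~$10B system cost is from the discussion above; the global article
# count (~2 million/year) is an assumption used only for illustration.

total_spend_usd = 10e9
articles_per_year = 2e6               # assumed global output of articles

upper_bound = total_spend_usd / articles_per_year   # implied current average spend
lower_bound_range = (1000, 1500)                    # APC / SCOAP3 touchstone

print(f"Implied current average spend per paper: ${upper_bound:,.0f}")
print(f"Suggested lower bound on realisable added value: "
      f"${lower_bound_range[0]:,}-{lower_bound_range[1]:,}")
```

On those assumptions the current average spend comes out at roughly $5,000 per paper, several times the lower bound, which is the gap the rest of this argument is about.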

Where does this leave us? Not with a clear numerical value we can ascribe to this added value, but that was always going to be a moving target. We can, however, get some sense of the bottom end of the range: it is currently $1000 or greater, at least in some disciplines, and likely to go down. It is also likely to diversify as new providers offer subsets of the services currently offered as one indivisible lump. At the top end, the actions of both customers and service providers suggest they believe that the added value is less than what we currently pay, and that it is only artificial controls over access to the non-value-added versions that justify the current price. What we need is a better articulation of the real value that publishers add, and an honest conversation about what we are prepared to pay for it.
