Against the 2.5% Commitment

Three things come together to make this post. The first is the paper The 2.5% Commitment by David Lewis, which argues, essentially, for top-slicing a percentage off library budgets to pay for shared infrastructures. There is much in the paper that I agree with: the need for resourcing infrastructure, the need for mechanisms to share that burden, and fundamentally the need to think about scholarly communications expenditures as investments. But I found myself disagreeing with the mechanism. What finally got me around to writing this is the recent publication of my own paper on the resourcing of collective goods in scholarly communications, which lays out the background to my concerns, and the announcement from the European Commission that it will tender for the provision of a shared infrastructure for communicating research funded through Horizon 2020 programs.

The problem of shared infrastructures

The challenge of resourcing and supporting infrastructure is well rehearsed elsewhere. We understand it is a collective action problem, and that these are hard to solve. We know that the provision of collective and public-like goods is not well addressed by markets, and is better addressed by the state (for true public goods with limited provisioning problems) or by communities and groups (for collective goods that are partly excludable or rivalrous). One of the key shifts in my position has been the realization that knowledge (or its benefits) is not a true public good. It is better seen as a club good or a common-pool resource: we have a normative aspiration to make it more public, but we can never truly achieve that.

This has important implications, chief amongst them that communities that understand and can work with knowledge products are better placed to support them than either the market or the state. The role of the state (or the funder, in our scholarly communications world) then becomes providing an environment that helps communities build these resources. It is the problem of which communities are best placed to do this that I discuss in my recent paper, focussing on questions of size, homogeneity, and solutions to the collective action problem, drawing on the models of Mancur Olson in The Logic of Collective Action. The short version is that large groups will struggle; smaller groups, and in particular groups which are effectively smaller due to the dominance of a few players, will be better at solving these problems. The shorter version: publishers, of which there are really only 5-8 that matter, will solve these problems much better than universities, of which there are thousands.

Any proposal that starts “we’ll just get all the universities to do X” is basically doomed from the start, unless coordination mechanisms are built into the proposal. The first mechanism is to abstract the problem up until there is a smaller number of players. ORCID is gaining institutional members in those countries where a national consortium has been formed, particularly in Europe, where a small number of countries have worked together to make this happen. The effective number of agents negotiating drops from thousands to around 10-20. The problem is that this is politically infeasible in the United States, where government-led national coordination is simply not possible. This, incidentally, isn’t an issue to do with Trump or Republicans but a long-standing reality in the USA: national-level coordination, even between the national funding agencies, is near impossible. The second mechanism is where a small number of players dominate the space. This works well for publishers (at least the big ones), although the thousands of tiny publishers do exploit the investment that the big ones make in shared infrastructures like Crossref. If the Ivy Plus group in the US did something, then maybe the rest would follow, but in practice that seems unlikely. The patchwork of university associations is too disparate in most cases to reach agreement on sharing the burden.

The final mechanism is one where there are direct benefits to contributors that arise as a side effect of contributing to the collective resource. More on that later.

A brief history of top-slicing proposals

The idea of top-slicing budgets to pay for scholarly infrastructure is not new. I first heard it explicitly proposed by Raym Crow at a meeting in the Netherlands that was seeking ways to fund OA infrastructures. I have been involved over many years in lobbying funders to do something along these lines, and the Open Access Network follows a similar argument. The problem then was the same as the problem now: what is the percentage? If I recall correctly, Raym proposed 1%; others have suggested 1.5%, 2%, and now 2.5%. Putting aside the inflation in the figure over the years, there is a real problem here for any collective action. How can we justify the “correct” figure?

There are generally two approaches to this. One is to put a fence around a set of “core” activities and define the total cost of supporting them as a target figure. This approach has been taken by a group of biomedical funders convened by the Human Frontier Science Program Organization, who are developing a shared approach to funding to ensure “that core data resources for the life sciences should be supported through a coordinated international effort(s) that better ensure long-term sustainability and that appropriately align funding with scientific impact”. It is important to note that the convening involved a relatively small number of the most important funders, in a specific discipline, and that it is a community organization, the ELIXIR project, that is working out how “core data resources” should be defined. This approach has strengths: it helps define, and therefore build, community through agreeing commitments to shared activities. It also has a weakness: once that community definition is made it can be difficult to expand or generalize. The decision on what counts as a “core data resource” in the biosciences is unlikely to map well to other disciplines, for instance.

The second approach attempts to get above the problem of disciplinary communities by defining some percentage of all expenditure to invest: effectively a tax on operations to support shared services. In many ways the purpose of the modern nation state is to connect such top-slicing, through taxation, to a shared identity, and therefore to an implicit and ongoing consent to the appropriation of those resources to support shared infrastructures that would otherwise not exist. That in turn is the problem in the scholarly communication space: such shared identities and notions of consent do not exist. The somewhat unproductive argument over whether it is libraries’ responsibility to cut subscriptions or academics’ responsibility to ask them to, illustrates this. It is actually a shared responsibility, but one that is not supported by a sense of shared identity and purpose, certainly not of shared governance. Notably, the success stories in cutting subscriptions all feature serious efforts to form and strengthen shared identity and purpose within an institution before taking action.

The problem lies in the politics of community. The first approach defines and delineates a community precisely based on what it sees as core. While this strengthens support for those goods seen as core, it can wall off opportunities to work with other communities, or make it difficult to expand (or indeed contract) without major ructions in the community. Top-slicing a percentage does two things. First, it presumes a much broader community with a sense of shared identity and a notion of consent to governance, which generally does not exist. This means that arguments over the percentage become a proxy for arguments about what is in and out, obscuring the central issue of building a sense of community. Second, in the absence of a shared identity, adoption requires unilateral action by the subgroup with their logistical hands on the budgets (in this case librarians). This is why percentages are favoured and why those percentages look small. In essence the political goal is to hide the agenda in the noise. That makes good tactical sense, but it leaves the strategic problem of building the shared sense of community that might lead to consent and consensus to grow ever bigger. It also stores up a bigger problem: it makes a shift towards shared infrastructures for larger proportions of the scholarly communications system impossible. The numbers might look big today, but they are a small part of managing an overall transition.

Investment returns to community as a model for internal incentives

Thus far you might take my argument as leading to the necessity of some kind of “coming together” for a broad consensus. But equally, if you follow the line of my argument, this is not practical. The politics of budgets within libraries are heterogeneous enough. Add in the agendas and disparate views of academics across disciplines (not to mention ignorance on both sides of what those look like) and this falls into the category of collective action problems that I would describe as “run away from, screaming”. That doesn’t mean that progress is not feasible though.

Look again at those cases where subscription cancellations have been successfully negotiated internally. These generally fall into two categories: small(ish) institutions where a concerted effort has been made to build community internally in response to an internally recognized cash crisis, and consortial deals which are either small enough (see the Netherlands) or where institutions have bound themselves to act together voluntarily (see Germany, and to some extent Finland). In both cases community is being created and identity strengthened. On this element I agree with Lewis’s paper: the idea of shared commitment is absolutely core. My disagreement is that making that shared commitment an abstract percentage is the wrong approach. Firstly because any number will be arbitrary, but more importantly because it assumes a common cause with academics that is not really there, rather than focusing on the easier task of building a community of action within libraries. This community can gradually draw in researchers, provided that it is attractive and sustainable, but to try to bridge that gap from the start is, to my mind, too big a leap.

This leads us to the third of Olson’s options for solving collective action problems: the generation of a byproduct, something which is an exclusive benefit to contributors, as part of the production of the collective good. I don’t think a flat percentage tax does this, but other financial models might. In particular, reconfiguring libraries’ thinking about expenditure towards community investment might provide a route. What happens when we ask about the return on investment, to the scholarly community as a whole and to the funding library, not for some percentage of the budget but for the whole budget?

When we talk about scholarly communications economics we usually focus on expenditure. How much money is going out of library or funder budgets? A question we don’t ask is how much of that money re-circulates within the community economy and how much leaves the system. To a first approximation, all the money paid to privately held and shareholder-owned commercial entities leaves the system. But this is not entirely true: the big commercials do invest in shared systems, infrastructure and standards. We don’t actually know how much, but there is at least some. You might think that non-profits are obliged to recirculate all the money, but this is not true either: the American Chemical Society spends millions on salaries and lobbying, and so does PLOS (on salaries, not lobbying; ‘publishing operations’ is largely staff costs). For some of these organizations we have much better information on how much money is reinvested, and how, because of reporting requirements, but it’s still limited.

You might think that two start-ups contribute similar value, whether they are for-profit or not-for-profit. But look closer: which ones are putting resources into the public domain as open access content, open source code, open data? How do those choices relate to quality of service? Here things get interesting. There’s a balance to be found between investing in high quality services, which may be built on closed resources to secure a return to (someone else’s) capital, and building the capital within the community. It’s perfectly legitimate for some proportion of our total budgets to leave the system, both as a route to securing investment from outside capital and because some elements of the system, particularly user-facing service interfaces, have traditionally been better provided by markets. It also provides a playing field on which internal players might compete to make this change.

This is a community benefit, but it is also a direct benefit. Echoing the proposal of the Open Access Network, this investment would likely include instruments for investment in services (like the Open Library of Humanities) and innovation (projects like Coko might be a good fit). Even if this capital is not exclusively accessible to projects affiliated with contributing libraries (which would be a legitimate approach, but probably a bit limiting), access to the governance of how that capital is allocated provides direct advantages to investing libraries. Access to the decisions that fund the systems that their specific library needs is an exclusive benefit. Carefully configured, this could also provide a route to draw in academics who want to innovate, as well as those who see the funding basis of their traditional systems crumbling. In the long term this couples direct benefits to the funding libraries, to community building within the academic library community, to a long game of drawing in academics (and funders!) to an ecosystem in which their participation implies the consent of the governed, ultimately justifying a regime of appropriate taxation.

Actually, in the end my proposal is not very different to Lewis’s. Libraries make a commitment to engage in, and publicly report on, their investment in services and infrastructure; in particular they report on how that portfolio provides investment returns to the community. There is competition for the prestige of doing better, and early contributors get to claim the prestige of being progressive and innovative. As the common investment pool grows there are more direct financial interests that bring more players in, until in the long term it may become an effectively compulsory part of the way academic libraries function. The big difference for me is not setting a fixed figure. Maybe setting targets would help, but in the first instance I suspect that simply reporting on community return on investment would change practice. One of the things Buchanan and Olson don’t address (or at least not to any great extent) is that identity and prestige are club goods. It is entirely economically rational for a library to invest in something that provides no concrete or immediate material return if, in doing so, it gains profile and prestige, bolstering its identity as “progressive” or “innovative” in a way that plays well to internal institutional politics, and therefore in turn to donors and funders. Again, here I agree with Lewis and the 2.5% paper that signalling (including costly signalling, from an evolutionary perspective) is a powerful motivator. Where I disagree is on the specifics of the signal, who it signals to, and how that can build community from a small start.

The inequity of traditional tenders

How does all of this relate to the European Commission statement that it will tender for an Open Science Platform? The key here is the terms of the tender. There is an assumption amongst many people I follow that F1000Research have this basically sewn up. Certainly, based on numbers I’m aware of, they are likely the best placed to offer a cost-effective service that is also high quality. The fact that Wellcome, Gates and the African Academy of Sciences have all bought into this offering is not an accident. The information note states that the “Commission is implementing this action through a public procurement procedure, which a cost-benefit analysis has shown to be the most effective and transparent tool”. What the information note does not say is that the production of community-owned resources and platforms will be part of the criteria. Framing this as an exercise in procuring a service could give very different results to framing it as an investment in community resources.

Such tender processes put community organisations at a disadvantage. Commercial organisations have access to capital that community groups often do not, and community groups are more restricted in risk-taking. But community groups are also more motivated, often as part of their mission, to produce community resources: open source platforms and systems, open data, as well as the open content which is the focus of the Commission’s thinking. We should not exclude commercial and non-community players from competing for these tenders, far from it. They may well be more efficient, or provide better services. But this should be tensioned against the resources created and made available as a result of the investment. Particularly for the Commission, with its goal of systemic change, that question should be central. The resources that are offered to the community as part of a tender should have a value placed on them, and that value should be considered in the context of pricing. The Commission needs to consider the quality of the investment it is making, as well as the price that it pays.

The key point of agreement: assessing investment quality

This leads to my key point of agreement with Lewis. The paper proposes a list of projects, approved for inclusion in the public reporting on investment. I would go further and develop an index of investment quality. The resources to support building this index would be the first project on the list, and members, having paid into that resource, get early and privileged access to the reporting, just as in investment markets. For any given project or service provider an assessment would be made on two main characteristics: how much of the money invested is re-circulated in the community, and the quality of governance and management that the investment delivers, including the risk on the investment (high for early-stage projects, low for stable infrastructures). Projects would get a rating, alongside an estimated percentage of investment that recirculates. Projects where all outputs get delivered with open licensing would get a high percentage (but not 100%; some will go on salaries and admin costs).
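To make the idea concrete, here is a minimal sketch of what an entry in such an index might look like. Everything here is illustrative: the fields, project names and figures are invented for the sake of example, not a proposed schema.

```python
from dataclasses import dataclass

@dataclass
class InvestmentRating:
    project: str
    recirculation: float  # estimated share of spend returned to the community
    governance: str       # assessed quality of governance and management
    risk: str             # "high" for early-stage projects, "low" for stable infrastructure

# Illustrative entries only; real figures would come from public reporting
portfolio = [
    InvestmentRating("Stable shared infrastructure", 0.85, "strong", "low"),
    InvestmentRating("Early-stage community platform", 0.70, "adequate", "high"),
    InvestmentRating("Commercial service contract", 0.15, "limited reporting", "low"),
]

for entry in portfolio:
    print(f"{entry.project}: ~{entry.recirculation:.0%} recirculated, "
          f"{entry.governance} governance, {entry.risk} risk")
```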

Commercial players are not excluded. They can contribute information on the resources that they circulate back to the community (perhaps money going to societies, but also contributions to shared infrastructures, work on standards, dataset contributions) that can be incorporated. And they may well claim that although the percentage is lower than for some projects, the services are better. Institutions can tension that, effectively setting a price on service quality, which will enhance competition. This requires more public reporting than commercial players are naturally inclined to provide, but equally it can be used as a signal of commitment to the community. One thing they might choose to do is contribute directly to the pool of capital, building shared resources that float all the boats. Funders could do the same; some already do by funding these projects, and that in turn could be reported.

The key to this is that it can start small. A single library, funding an effort to examine the return it achieves on its own investment, will see some modest benefits. A group working together will be able to tell a story about how they are changing the landscape. This is already why efforts like the Open Library of Humanities, Knowledge Unlatched and others work at all. Collective benefits would rise as the capital pool grew, even if investment was not directly coordinated but simply a side effect of libraries funding various of these projects. Members gain exclusive benefits: access to information, identity as a leading group of libraries and institutions, and the prestige that comes with that when arguing internally for budgets and externally for other funding.

2.5% is both too ambitious and not ambitious enough

In the end I agree with the majority of the proposals in the 2.5% Commitment paper; I just disagree with the headline. I think the goal is over-ambitious. It requires too many universities to sign up, and it takes political risks, both internally and externally, that are exactly the ones that have challenged open access implementation. It assumes power over budgets and an implicit consent from academics that likely doesn’t exist, and will be near impossible to gain, and then hides that assumption by choosing a small percentage. In turn that small percentage requires coordination across many institutions to achieve results, and as I argue in my paper, that seems unlikely to happen.

At the same time it is far from ambitious enough. If the goal is to shift investment in scholarly communications away from service contracts and content access to shared platforms, then locking in 2.5% as a target may doom us to failure. We don’t know what that figure should be, but I’d argue it is clear that if we want to save money overall, it will need to be at least an order of magnitude higher in percentage terms. What we need is a systemic optimisation that lets us reap the benefits of commercial provision, including external capital, quality of service, and competition, while progressively retaining more of the standards, platforms and interchange mechanisms in the community sphere. Shifting our thinking from purchasing to investment is one way to work towards that.

Walking the walk – How easily can a whole project be shared and connected?

One of the things I wanted to do with the IDRC Data Sharing Pilot Project that we’ve just published was to try and demonstrate some best practice. This became more important as the project progressed and our focus on culture change developed. As I came to understand more deeply how much this process was one of showing by doing, for all parties, it became clear how crucial it was to make a best effort.

This turns out to be pretty hard. There are lots of tools out there to help with pieces of the puzzle, but actually sharing a project end-to-end is pretty rare. This means that the tools that help with one part don’t tend to interconnect well with others, at multiple levels. What I ended up with was a tangled mess of dependencies that were challenging to work through. In this post I want to focus on the question of how easy it is to connect up the multiple objects in a project via citations.

The set of objects

The full set of objects to be published from the project was:

  1. The grant proposal – this acts as a representation of the project as a whole and we published it quite early on in RIO
  2. The literature and policy review – this was carried out relatively early on but took a while to publish. I wanted it to reference the materials from expert interviews that I intended to deposit, but this got complicated (see below)
  3. The Data Management Plan for the project – again published in RIO relatively early on in the process
  4. Seven Data Management Plans – one from each of the participating projects in the pilot
  5. Seven Case Studies – one on each of the participating projects. These in turn needed to be connected back to the data collected, and ideally would reference the relevant DMP.
  6. The final report – in turn this should reference the DMPs and the case studies
  7. The data – as a data sharing project we wanted to share data. This consisted of a mix of digital objects and I struggled to figure out how to package it up for deposit, leading in the end to this epic rant. The solution to that problem will be the subject for the next post.

Alongside this, we wanted as far as possible to connect things up via ORCID and to have good metadata flowing through Crossref and Datacite systems. In essence we wanted to push as far as is possible in populating a network of connections to build up the map of research objects.

The problem

In doing this, there’s a fairly obvious problem. If we use DOIs (whether from Crossref or DataCite) as identifiers, and we want the references between objects to be formal citations that are captured by Crossref as relationships between objects, then we can only reference in one direction. More specifically, you can’t reference an object until its DOI is registered. Zenodo lets you “reserve” a DOI in advance, but until it is registered formally (with DataCite in this case) none of the validation or lookup systems work. The citation graph is a directed graph, so ordering matters. On top of this, you can’t update a reference list after formal publication (or at least not easily).

The effect of this is that if you want to get the reference network right, and if “right” means highly interconnected, as was the case here, then publication has to proceed in a very specific order (in effect, a topological ordering of the reference graph, as the sketch after the list below shows).

In this case the order was:

  1. The grant proposal – all other objects needed to refer to “the project”
  2. The data set – assuming there was only one. It may have been better to split it, but it’s hard to say. The choice to have one package meant that it had to be positioned here.
  3. The review – because it needed to reference the data package
  4. The contributing project Data Management Plans – these could have gone either second or third but ended up being the blocker
  5. The Case Studies – because they need to reference the review, the data package, the project and the Data Management Plans
  6. The final report – has to go last because it refers to everything else
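Here is a minimal sketch of how that ordering can be derived, using shorthand labels of my own for the objects (graphlib is in the Python standard library):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Map each object to the objects it needs to cite (labels are shorthand)
cites = {
    "proposal": set(),
    "data_package": {"proposal"},
    "review": {"proposal", "data_package"},
    "dmps": {"proposal"},
    "case_studies": {"proposal", "data_package", "review", "dmps"},
    "final_report": {"proposal", "data_package", "review", "dmps", "case_studies"},
}

# A DOI must be registered before anything that cites it can reference it,
# so any valid release order places cited objects first
print(list(TopologicalSorter(cites).static_order()))
# e.g. ['proposal', 'data_package', 'dmps', 'review', 'case_studies', 'final_report']
```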

In practice, the blocker became the Data Management Plans. Many of the contributing projects are not academic; formal publishing doesn’t provide any particular incentive for them, and they had other things to worry about. But for me, if I was to formally reference them, they had to be published before I could release the Case Studies or final report. In the end four DMPs were published and three were referenced by the documents in the Data Package. Some of the contributors had ORCIDs that got connected up via the published DMPs, but not all.

(Figure: the citation graph connecting the project outputs)

The challenges and possible solutions

The issue here was that the only way in practice to make this work was to get everything lined up, and then release carefully over a series of days as each of the objects was released by the relevant publisher (Zenodo, and Pensoft Publishing for RIO). This meant a delay of months as I tried to get all the pieces fully lined up. For most projects this is impractical. The best way to get more sharing is to share early, as each object becomes ready. If we have to wait until everything is perfect it will never happen: researchers will lose interest, and much material will either not be shared, or not get properly linked up.

One solution to this is to reserve and pre-register DOIs. Crossref is working towards “holding pages” that allow a publisher to register a DOI before the document is formally released. However, this is intended to operate on acceptance. That might be fine for a journal like RIO, where peer review operates after publication and acceptance is based on technical criteria, but it wouldn’t have worked if I had sought to publish in a more traditional journal. Registering Crossref DOIs that will definitely get used in the near future may be acceptable; registering ones that may never resolve to a final, formally published document would be problematic.

The alternative solution is to somehow bundle all the objects up into a “holding package” which, once it is complete, triggers the process of formal publication and identifier registration. One form of this might be a system, a little like that imagined for machine-actionable DMPs, where the records get made and the connections stored until, when things are ready, the various objects get fired off to their formal homes in the right order. There are a series of problems here as well though. This mirrors the preprint debate: if it’s not a “real” version of the final thing, does that not create potential confusion? The system could still get blocked by a single object being slow in acceptance. This might be better than a researcher being blocked, since the system can keep trying without getting bored, but it will likely require researcher intervention.

A different version might involve “informal release” by the eventual publisher(s), with the updates being handled internally. Again, recent Crossref work can support this through the ability of DOIs for different versions of objects to point forward to the final one. But this illustrates the final problem: it requires all the publishers to adopt these systems.

Much of what I did with this project was only possible because of the systems at RIO. More commonly there are multiple publishers involved, all with subtly different rules and workflows. Operating across Zenodo and RIO was complex enough, with one link to be made. Working across multiple publishers with partial, and probably different, implementations of the most recent Crossref systems would be a nightmare. RIO was accommodating throughout when I made odd requests to fix things manually; a big publisher would likely not have bothered. The fiddly process of trying to get everything right was hard enough, and I knew what my goals were. Someone with less experience, trying to do this across multiple systems, would have given up much, much earlier.

Conclusions and ways forward (and backwards)

A lot of what I struggled with, I struggled with because I wanted to leverage the “traditional” and formal systems of referencing. I could have made much more progress with annotation or alternative modes of linking, but I wanted to push the formal systems. These formal systems are changing in ways that make what I was trying to do easier, but it’s slow going. Many people have been talking about “packages” for research projects or articles: the idea of being able to grab all the materials, code, data and documents relevant to a project with ease. For that we need much more effective ways of building up the reference graph in both directions.

The data package, which is the piece of the project where I had the most control over internal structure and outward references, does not actually point at the final report. It can’t, because we didn’t have the DOI for the report when we put the data package together. The real problem here is the directed nature of the graph. The solution lies in some mechanism that either lets us update reference lists after publication, or lets us capture references that point in both directions. The full reference graph can only be captured in practice if we look for links running both from citing to referenced object and from referenced object to citing object, because sometimes we have to publish them in the “wrong” order.
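As a minimal sketch of what that capture might look like: if systems recorded both “cites” and “is-cited-by” assertions, whichever one the depositor was able to make at the time, the full graph could be recovered by normalising them. The link records here are invented for illustration.

```python
# Link assertions as harvested; the direction is whatever could be asserted
# at deposit time (all labels below are invented for illustration)
asserted = [
    ("final_report", "cites", "data_package"),
    ("data_package", "is-cited-by", "final_report"),  # same link, other direction
    ("data_package", "is-cited-by", "case_study_1"),  # only assertable after the fact
]

# Normalise everything into forward "cites" edges and deduplicate
edges = set()
for source, relation, target in asserted:
    edges.add((source, target) if relation == "cites" else (target, source))

print(sorted(edges))
# [('case_study_1', 'data_package'), ('final_report', 'data_package')]
```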

How we manage that I don’t know. Updateable, versioned links that simply use web standards will be one suggestion. But that doesn’t provide the social affordances and systems of our deeply embedded formal referencing systems. The challenge is finding flexibility as well as formality, and making it easier to wire up the connections that build the full map.


As a researcher…I’m a bit bloody fed up with Data Management

The following will come across as a rant. Which it is. But it’s a well intentioned rant. Please bear in mind that I care about good practice in data sharing, documentation, and preservation. I know there are many people working to support it, generally under-funded, often having to justify their existence to higher-ups who care more about the next Glam Mag article than whether there’s any evidence to support the findings. But, and it’s an important but, those political fights won’t become easier until researchers know those people exist, value their opinions and input, and internalise the training and practice they provide. The best way for that to happen is to provide the discovery points, tools and support where researchers will find them, in a way that convinces us that you understand our needs. This rant is an attempt to illustrate how large that gap is at the moment.

As a researcher I…have a problem

I have a bunch of data. It’s a bit of a mess, but it’s not totally disorganised and I know what it is. I want to package it up neatly and put it in an appropriate data repository, because I’m part of the 2% of researchers that actually care enough to do it. I’ve heard of Zenodo. I know that metadata is a thing. I’d like to do it “properly”. I am just about the best case scenario: I know enough to know I need to know more, but not so much that I think I know everything.

More concretely, I have data from a set of interviews. I have audio and I have notes/transcripts. I have the interview prompt. I have decided this set of around 40 files is a good package to combine into one dataset on Zenodo. So my next step is to search for some guidance on how to organise and document that data. Interviews and notes must be a common form of data package, right? So, a quick search for a tutorial, or guidance, or best practice?

Nope. Give it a go. You either get a deep dive into metadata schemas (and remember, I’m one of the 2% who even know what those words mean) or you get very high level generic advice about data management in general. Maybe you get a few pages giving (inconsistent) advice on what audio file formats to use. What you don’t get is a set of instructions that says “this is the best way to organise these”, or good examples of how other people have done it. The latter would be ideal: just finding an example which is regarded as good, and copying the approach. I’m really trying to get advice on a truly basic question: should I organise by interview (audio files and notes together) or by file type (with interviews split up)?

As a researcher trying to do a good job of data deposition, I want an example of my kind of data being done well, so I can copy it and get on with my research

As a researcher…I’m late and I’m in a hurry. I don’t have the time to find you.

Now, a logical response to my whining is “get thee to your research data support office and get professional help”. Which isn’t bad advice. Again, I’m one of the 5-10% who know that my institution actually has data support. The fact that I’m in the wrong timezone is perhaps a bit unusual, but the fact that I’m in a hurry is not. I should have done this last week, or a month ago, or before the funder started auditing data sharing. In the UK, with the whole country in a panic, this is particularly bad at the moment, with data and scholarly communications support folks oscillating wildly between trying to get any attention from researchers and being swamped when we all simultaneously panic because reports are due.

Most researchers are going to reach for the web and search. And the resources I could find are, as a whole, woeful. Many of them are incomprehensible, even to me. But worse, virtually none of them are actually directed at my specific use case. I need step-by-step instructions, with examples to copy. I’m sure there are good sources out there, but they’re not easy to find. Many of the highest ranked hits are out of date and populated with dead links (more on that particular problem later). Yet the organisations providing that information are actually highly ranked by search engines. If those national archives and data support organisations worked together more, kept pages more updated, and frankly did a bit of old-fashioned Google bombing and SEO, it would help a lot. Again, remember that I kind of know what I’m looking for. User testing on search terms could go a long way.

As a researcher looking for best practice guidance online, I need clear, understandable, and up-to-date guidance to be at the top of my search results, so I can avoid the frustration that will lead me to just give up.

As a researcher…if I figure out what I should do, I want useable tools to help me.

Let’s imagine that I find the right advice and get on and figure out what I’m going to do. Let’s imagine that I’m going to create a structure with a top-level human-readable readme.txt, machine-readable DDI metadata, the interview prompt (rtf and txt) and then two directories, one for audio (FLAC, WAV), one for notes (rtf), with consistent filenames. I’m going to zip all that up and post it to Zenodo. I’m going to use the Community function at Zenodo to collect up all the packages I create for this project (because that provides an OAI-PMH endpoint for the project). Easy! Job done.
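For concreteness, here is a minimal sketch of that packaging and deposit step. The layout and filenames are illustrative, and the Zenodo calls assume a personal access token; check the current Zenodo REST API documentation before relying on the exact endpoints.

```python
import shutil
import requests

# Planned layout (names illustrative):
#   package/readme.txt              human-readable description
#   package/metadata.xml            machine-readable DDI metadata
#   package/prompt.rtf, prompt.txt  interview prompt
#   package/audio/*.flac            audio with consistent filenames
#   package/notes/*.rtf             notes/transcripts

archive = shutil.make_archive("interview-package", "zip", root_dir="package")

TOKEN = "..."  # personal access token (placeholder)

# Create an empty deposition, then attach the zipped package to it
r = requests.post("https://zenodo.org/api/deposit/depositions",
                  params={"access_token": TOKEN}, json={})
r.raise_for_status()
deposition_id = r.json()["id"]

with open(archive, "rb") as fp:
    r = requests.post(
        f"https://zenodo.org/api/deposit/depositions/{deposition_id}/files",
        params={"access_token": TOKEN},
        data={"name": "interview-package.zip"},
        files={"file": fp},
    )
    r.raise_for_status()
```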

Right. DDI metadata. There will be a tool for creating this, obviously. Go to the website…look for a “for researchers” link. Nope. Ok. Tools. That’s what I need. Ummm…ok…any of these called “writer”? No. “Editor”…ok, a couple. That link is broken. Those tools are Windows-only. This one I have to pay for, and it is Windows-only. This one has no installation instructions and seems to be hosted on Google Code.

Compared to some other efforts, DDI is actually pretty good. The website looks as though it has been updated recently. If you dig down into “getting started” there is at least some outline advice that is appropriate for researchers, but it actually gets more confusing as you go deeper in. Should I just be using Dublin Core? Can’t you just send me to a simple set of instructions for a minimal metadata set? If the aim is for a standard, any standard, to get into the hands of the average jobbing researcher, it has to be either associated with tools I can use, or come with examples I can cut and paste and adapt for my own needs.
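For illustration, this is the kind of cut-and-paste starting point I was hoping to find: a few lines that emit a minimal Dublin Core record. Every field value here is a placeholder, and whether this is the right minimal profile for interview data is exactly the question I couldn’t get answered.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
OAI_DC = "http://www.openarchives.org/OAI/2.0/oai_dc/"
ET.register_namespace("dc", DC)
ET.register_namespace("oai_dc", OAI_DC)

# Minimal record; every value below is a placeholder to replace
record = ET.Element(f"{{{OAI_DC}}}dc")
for element, value in [
    ("title", "Expert interviews: audio and notes"),
    ("creator", "Surname, Given"),
    ("date", "2017-01-01"),
    ("type", "Dataset"),
    ("format", "application/zip"),
    ("rights", "CC-BY 4.0"),
    ("description", "Audio recordings, notes and prompt from project interviews"),
]:
    ET.SubElement(record, f"{{{DC}}}{element}").text = value

ET.ElementTree(record).write("metadata.xml", xml_declaration=True, encoding="utf-8")
```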

I’m fully aware that the last point sends a chill down the spine of both curators and metadata folks, but the reality is that either your standard is widely used, or it is narrowly implemented for those cases where you have fully signed-up professional curators or a high quality tool, and both of those will only ever be niche. The majority of researchers will never fit the carefully built tools and pipelines. Most of us generate bitty datasets that don’t quite fit in the large scale specialised repositories. We will always have to adapt patterns and templates, and that’s always going to be a bit messy. But I definitely fall on the side of at least doing something reasonably well rather than doing nothing because it’s not perfect.

As a researcher who knows what metadata or documentation standard to use, I need good usable (and discoverable) tools or templates to generate it, so that I a) actually do it and b) get it as right as possible.

As a researcher…I’m a bit confused at this point.

The message we get is that “this data stuff matters”. But when we go looking, what we mostly find is badly documented and not well preserved: proliferating websites with out-of-date broken links and references to obsolete tools. It looks bad when the message is that high quality documentation and good preservation matter, but those of us shouting that message don’t seem to follow it ourselves. This isn’t the first time I’ve said this: it wouldn’t hurt for RDM folks to do a better job of managing our own resources to the standards we seek to impose on researchers.

I get exactly why this is: none of this is properly funded or institutionally supported. It’s infrastructure. And worse than that, it’s human and social infrastructure rather than sexy new servers and big iron for computation. As Nature noted in an editorial this week, there is a hypocrisy at the heart of the funder/government-led data agenda in failing to provide the kinds of support needed for this kind of underpinning infrastructure. It’s no longer the money itself so much as the right forms of funding to support things that are important, but not exciting. I’m less worried about new infrastructures than about properly preserving and integrating the standards and services we have.

But more than that, there’s a big gap. I’ve framed my headings in the form of user stories. Most of the data sharing infrastructure, tools and standards still fail to meet researchers where we actually are. I have a folder of data. I want to do a good job. What should I do, right now, because the funder is on my back about it!?!

Two resources would have solved my problem:

  1. First, an easily discoverable example of best practice for this kind of data collection: something that comes to the top of the search results when I search for “best practice for archiving depositing records of interviews”. An associated set of instructions and options would have been useful, but not critical.
  2. Having identified what form of machine-readable metadata is best practice, a simple web-based, platform-independent tool to generate that metadata, either through a Q&A form-based approach or some form of wizard. Failing that, at least a good example I could modify.

As a researcher, that’s what I really needed, and I really failed to find it. I’m sympathetic, moderately knowledgeable, and I can get the world’s best RDM advice by pinging off a tweet. And I still struggled. It’s little wonder that we’re losing most of the mainstream.

As a researcher concerned to develop better RDM practice, I need support to meet me where I am, so ultimately I can support you in making the case that it matters.

Transmission and mediation of knowledge

First contents page of Dr Brewer’s Guide to Science (photo credit: Wikipedia)

There’s an article doing the rounds today about public understanding and the rejection of experts and expertise. It was discussed in an article in the THES late last year (which, ironically, I haven’t read). I recommend reading the original article by Scharrer and co-workers, not least because the article itself is about how reading lay summaries can lead to a discounting of expertise. A lot of the reaction seems to be driven by two things. The first is a line in the introduction of the paper stating that the authors “share the normative position taken by Collins and Evans (2007) that experts with specialized deep-level knowledge and experience are the best sources when judging scientific claims”. The second is the finding of the project itself, that:

[…]we found that laypeople were more confident about their claim judgments after reading popularized depictions. This was indicated by a higher trust in their own judgment based on current knowledge and, conversely, a weaker desire for advice from a more knowledgeable source.

Scharrer et al (2016)

There is an interesting collision of politics and epistemology here. Questions of authority (in both the sense of who gets to make decisions and whose judgement is decisive on the answer to a question) and expertise are being combined in a way which is neither surprising given the current political climate, nor to my mind helpful.

By coincidence I was reading Ravetz’s Scientific Knowledge and its Social Problems last night, where he discusses what it is that makes what he calls “true scientific knowledge”. Ravetz makes a distinction between “facts”, assertions which are generally useful within a particular scientific domain, and “scientific knowledge”. The transformation of one into the other involves a process of abstraction and generalisation that, in Ravetz’s words:

Eventually a situation is reached where the original problem and its descendants are dead, but the fact lives on through a great variety of standardized versions, thrown off at different stages of its evolution, and themselves undergoing constant change in response to that of their uses. This family of versions of the facts[…]will show diversity in every respect: in their form, in the arguments whereby they can be related to experience, and in their logical relation to other facts. They will however, have certain basic features in common: the assertions and their objects will be simpler than those of their original, and frequently vulgarized out of recognition.

Ravetz (1971) Scientific Knowledge and its Social Problems, p234, emphasis added (1996 Edition, Transaction Publishers)

In this Ravetz echoes Ludwik Fleck in Genesis and Development of a Scientific Fact, where Fleck describes the way in which a claim must first circulate within an esoteric (expert) community and then be transmitted and transmuted by an “exoteric” community before being circulated back in modified form to the esoteric community. In both cases circulation, use, and transmutation beyond the expert community is critical to the social process of converting claims and statements into those that qualify as scientific knowledge. Fleck is less explicit about the necessary process of simplification and “vulgarisation”, but Ravetz is quite explicit.

Of course, both Ravetz and Fleck are talking about transmission to other expert communities, not to wider publics, but I am quite sympathetic to the idea of a social construction of knowledge that focusses on the question of which groups can deploy the knowledge. Ironically, my knowledge of Harry Collins’ work is based on lay summaries and one or two conversations, but from what I understand of his position he would argue that the “non-expert” might be able to articulate or parrot what Ravetz would call a “fact” but would be unable to deploy it. Ravetz himself focusses strongly on “craft knowledge” as an underestimated part of the whole (as does Collins). It is in this question of whether a transferred knowledge claim is “deployable” in some useful sense that a reconciliation of these positions will be found. Later in the same chapter Ravetz talks about the boundary across which knowledge may be transmitted to the point where it is no longer “true knowledge”:

[…]in a new environment [knowledge] may be adopted as a tool for the accomplishment of practical tasks or even for the solution of scientific problems; but in that case the objects of the knowledge will soon be recast so as to make the tool meaningful and effective to its new users. Moreover, only some parts of what seemed a coherent body of knowledge will survive the transfer; and it may be that those which were considered as the essential components will be rejected as false or meaningless by the borrowers.

ibid p238

Ravetz takes a very strong position that what appears as loss should actually be seen as a process of refinement. I would take a more relativist position and argue that when we look at claims of knowledge we need to evaluate both the degree to which such claims clarify existing “facts” or provide new capacities within the knowledge community (deme, in Cultural Science terms) that creates them, and how effective that community and the knowledge claim itself are at creating the social affordances that will allow its circulation, adoption, refinement and re-transmission to other communities/demes. At one level this is merely a restatement of a public engagement position: that effective communication is crucial. Alice Bell, as just one example, has long argued that effective communication is more important than mere access to traditional scientific outputs.

What seems to me to be missing from the engagement position is a commitment to the follow-on effort: to critically analyse, and to re-absorb, the forms of knowledge claim that do successfully circulate. What happens, in the work of Scharrer et al, if the experts listen to the non-expert descriptions of their work and incorporate “what works” back into the way they discuss it? In Collins’ work, do the physics experts that he studies note on which specific topics he most successfully “passes”, and use that to refine the way they describe their findings? If we understood better how the culture of different communities meant that specific knowledge claims (or specific forms of knowledge claims) created the social affordances that allowed enrichment of both communities, then maybe we could get past the current sterile arguments about whether experts know anything.

In that sense, what Scharrer et al are telling us is good news: we have ways to test which communications are working. The political question of who gets to make decisions or claim authority remains. But to my mind we have some more ways of testing whether we are truly creating knowledge in the sense that Ravetz and Fleck mean it. Our wider ability to trace how claims are flowing means we can ask: has this claim truly circulated beyond its source to sufficiently wide publics and back again? Is it constructed in a way that maximises its social affordances? And it’s starting to look like we may be able to evaluate (I hesitate to use ‘measure’) that.

Speculation: Sociality and “soundness”. Is this the connection across disciplines?

Network sociality (photo credit: Wikipedia)

A couple of ideas have been rumbling along in the background for me for a while. Reproducibility, and what it actually means or should mean, has been the issue du jour for a while. As we revised the Excellence manuscript in response to comments and review reports, we also needed to dig a bit deeper into what it is that distinguishes the qualities of the concept of “soundness” from “excellence”. Are they both merely empty and local terms, or is there something different about “proper scholarly practice” that we can use to help us?

At the same time I’ve been on a bit of a run reading some very different perspectives on the philosophy of knowledge (with an emphasis on science). I started with Fleck’s Genesis and Development of a Scientific Fact, followed up with Latour’s Politics of Nature and Shapin and Schaffer’s Leviathan and the Air-Pump, and am currently combining E O Wilson’s Consilience with Ravetz’s Scientific Knowledge and its Social Problems. Barbara Herrnstein Smith’s Contingencies of Value and Belief and Resistance are also in the mix. Books I haven’t read – at least not beyond skimming through – include key works by Merton, Kuhn, Foucault, Collins and others, but I feel like I’m getting a taste of the great divide of the 20th century.

I actually see more in common across these books than divides them. What every serious study of how science works agrees on is the importance of social and community processes in validating claims. While they disagree violently on what that validation actually means, all of these differing strands of work show that validation rests on explicit and implicit processes within communities, supported by explicit and implicit knowledge of how those communities operate: who is who, who gets to say what, and when. The Mertonian norm of “organised scepticism” might be re-cast by some as “organised critique within an implicit frame of knowledge and power”, but nobody is arguing that this process, which can be studied and critiqued, and crucially can be compared, is not dependent on community processes. Whatever else it is, scholarship is social, occurring within institutions – themselves the product of history – that influence the choices that individual scholars make.

In the Excellence pre-print we argued that “excellence” is an empty term, at best determined by a local opinion about what matters. But the obvious criticism of our suggesting “soundness” as an alternative is that soundness is equally locally determined and socially constructed: soundness in computational science is different to soundness in literature studies, or experimental science, or theoretical physics. This is true, but misses the point. There is an argument to be made that soundness is a quality of the process by which an output is created, whereas “excellence” is a quality of the output itself. If that argument is accepted, alongside the idea that the important part of the scholarly process is social, then we have a potential way to audit the idea of soundness proposed by any given community.

If the key to scholarly work is the social process of community validation then it follows that “sound research” follows processes that make the outputs social. Or to be more precise, sound research processes create outputs that have social affordances that support the processes of the relevant communities. Sharing data, rather than keeping it hidden, means an existing object has new social affordances. Subjecting work to peer review is to engage in a process that creates social affordances of particular types. The quality of description that supports reproducibility (at all its levels) provides enhanced social affordances. What all of these things have in common is that better practice involves making outputs more social.

“More social” on its own is clearly not enough. There is a question here of more social for whom? And the answer to that is going to be some variant of “the relevant scholarly community”. We can’t avoid the centrality of social construction, because scholarship is a social activity, undertaken by people, within networks of power and resource relationships. What forms of social affordance are considered necessary or sufficient is going to differ from one community to another, but this may be something that they all have in common. And that may be something we can work with.


FAIR enough? FAIR for one? FAIR for all!

Brussels Accessible Art Fair logo (photo credit: Wikipedia)

The development of the acronym “FAIR” to describe open data was a stroke of genius. Standing for “Findable, Accessible, Interoperable and Reusable”, it describes four aspirational attributes of datasets, aimed at achieving machine readability and re-use for an open data world. The shorthand of four attributes, together with a familiar and friendly word, has led to its adoption as a touchstone for funders and policy groups including the G20 Hangzhou Consensus, the Amsterdam Call for Action on Open Science, the NIH Data Commons and the European Open Science Cloud.

At the FORCE11 Workshop on the Scholarly Commons this week in San Diego, inclusion was a central issue. The purpose of the workshop was to work towards a shared articulation of principles or requirements that would define this shared space. To make any claim of a truly shared and global conception of this “scholarly commons” we clearly need to bake in inclusive processes. In particular we need our systems, rules and norms to remind us, at every turn, to consider different perspectives, needs and approaches. It is easy to sign up to principles that say there should be no barriers to involvement, but much harder to maintain awareness of barriers that we don’t see or experience.

The coining of FAIR was led by a community that wanted to emphasise that we need to expand our idea of audiences to include machine readers. As the Scholarly Commons discussion proceeded, and FAIR kept returning as a touch point, I wondered whether we could use its traction for a further expansion: as a mnemonic that would remind us to consider the barriers that we don’t see ourselves. Can we embed in the idea of FAIR the inclusion of users and contributors from different geographies, cultures, backgrounds, and levels of access to research? And might something along the lines of making research “FAIR for All” achieve that?

As I looked at the component parts of FAIR it seemed like this could be a really productive approach:

Accessible

Originally conceived of as “available”, accessibility lends itself easily to expanding in scope to fit with this agenda. Can it be accessed without pay barriers online? Is it accessible to a machine? To a person without web access? To a speaker of a different language? To a non-expert? To someone with limited sight? There are many different types of accessibility but by forcing ourselves to consider a wider scope we can enhance inclusion. Many people have made excellent arguments that “access is not accessibility” and we can build on that strong base.

Interoperable

In the original FAIR Data Principles, interoperability is concerned mainly with the use of standard description languages and how resources make reference to related resources. For our purposes we can ask which systems and cultures a project or resource can interoperate with. Is it useable by policy makers? Can it be disseminated via print shops where internet access is not appropriate? Does it link into widely used information systems like Wikipedia (and, in the future, Wikidata)? Does the form of the resource, or the project, make it incompatible with other efforts?

Re-usable

For machine use of data, re-usability is reasonably easily defined. If we seek to expand the definition it gets more diffuse. This is about more than licensing (although open licensing can help); it also relates to formats and design. Is software designed to be multilingual, and are resources provided in a form that supports translation? Are documents provided in editable form as well as print or PDF? While accessibility, interoperability and re-usability are all clearly related, they each give us a different lens through which to check our commitment to inclusion.

Findability

As I worked through the four components it seemed at first that discoverability might not fit the agenda well, but on reflection it became clear that discoverability is perhaps the most important aspect to consider. As an extreme example, something indexed in Google, or available via Wikimedia Commons, doesn’t help if there is no network access. More generally, the way in which we all search for information shapes the things we discover, and is the first necessary condition for engagement. From the challenges of getting Open Access books into library catalogues, to the question of how patients can efficiently search for relevant research, via the systemic problem of consumer search engines increasingly failing to provide clear provenance for information, the issue of inclusion and engagement starts, and far too often ends, with the challenges of discovery.

Conclusions

A few things become clear in considering this expansion of scope. The FAIR Data Principles provide some clear prescriptions and tests for compliance. Issues of inclusion are much more open-ended. When have we done enough? What audiences do we need to consider? In that sense it becomes much more a direction of travel than an easily definable goal to reach. But actually that was the initial goal, to prompt and provoke us to think more.
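That contrast is easy to make concrete. A machine-level FAIR check can be written down in a few lines of code, whereas there is no equivalent test suite for the inclusion questions above. Here is a toy sketch; the metadata field names are my own invention for illustration, not any official validator or standard:

```python
# A toy illustration (not an official FAIR validator): mechanical checks
# like these are possible for FAIR data, but no analogous test exists
# for "FAIR for All" questions of inclusion.

OPEN_FORMATS = {"csv", "json", "xml", "rdf", "txt"}

def fair_data_checks(record: dict) -> dict:
    """Run naive machine-level FAIR checks on a metadata record.

    `record` is a hypothetical metadata dict; the field names used here
    are assumptions for illustration only.
    """
    return {
        "findable": bool(record.get("doi")),  # has a persistent identifier
        "accessible": str(record.get("url", "")).startswith("http"),  # standard protocol
        "interoperable": record.get("format", "").lower() in OPEN_FORMATS,  # open format
        "reusable": bool(record.get("license")),  # explicit licence attached
    }

if __name__ == "__main__":
    dataset = {
        "doi": "10.1234/example",
        "url": "https://example.org/data.csv",
        "format": "csv",
        "license": "CC-BY-4.0",
    }
    print(fair_data_checks(dataset))  # all four checks pass for this record
```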

It also expands the question of what it is that is FAIR. For FAIR data we need only consider the resource, generally a dataset. With this expansion it is clear that we need to consider both resources and the projects that generate them. A project could generate FAIR outputs without being FAIR itself. But again, this is a journey, not a destination. If we can hold ourselves to a higher standard then we will make progress towards that goal. With limited resources there will be difficult choices to make, but we can still use this idea as a prompt, to ask ourselves whether we can do better.

If our goal is to do research that is “FAIR for All”, then we can test ourselves as we work towards that goal by continuing to ask, at each stage:

Is this FAIR enough?

 

 

Canaries in the Elsevier Mine: What to watch for at SSRN

Yellow canary (Photo credit: Wikipedia)

Just to declare various conflicts of interest: I regard Gregg Gordon, CEO of SSRN, as a friend and have always been impressed at what he has achieved at SSRN. From the perspective of what is best for the services SSRN can offer to researchers, selling to Elsevier was a smart opportunity and probably the best of the options available given the need for investment. My concerns are at the ecosystem level. What does this mean for the system as a whole?

 

 

The first two paragraphs of this post have been edited following its initial posting. The original version can be found at the bottom of the post.

The response to the sale of SSRN to Elsevier is following a pattern that will be very familiar to those who watched the sale of Mendeley a few years back. As I noted in my UKSG talk, we face real communications problems in scholarly communication because different groups are concerned about very different things, and largely fail to understand those differences. This plays out in commentary that argues that “the other side” has “failed to understand” the “real issues” at stake, while mostly missing the issues that are at the centre of concerns for that “other side”.

To be clear, I’m obviously doing exactly the same thing in this post; it’s just that I’m pointing the finger at everyone, and at a slightly higher level of abstraction. What I hope to do in this post is use one particular strand of the arguments being raised to illustrate how different stakeholders focus on quite different issues when they ask the question “is this purchase good or bad?”. The real questions to ask are “what is it we care about?” and “what difference does the purchase make to those aspects?”. What we can do is look at this particular strand of the conversation and what it tells us about the risks, both for SSRN as a trusted repository for a range of research communities, and for the ecosystem as a whole.

The strand I’m interested in illustrates the gaping divide between the narratives on the two sides, and also shows how quickly we forget history. But it’s useful because it also points us to some criteria for judgement. That strand goes like this: “Elsevier did a great job of supporting and looking after Mendeley [ed. as someone noted, I originally had Elsevier here, arguably a Freudian slip…], so you can expect SSRN to be the same”. That’s a paraphrase rather than a direct quote, but it captures the essence of a narrative that focuses on the idea that all the fears that arose when Elsevier purchased Mendeley turned out to be unfounded.

Except of course it depends which fears you mean. And that’s where the narrative gets interesting.

What were the fears for Mendeley’s future?

First the credit side. Elsevier invested massively in Mendeley, including a desperately needed top-to-bottom re-engineering of the code base. It’s not widely known just how shaky the Mendeley infrastructure was at the point of purchase, and without a massive capital investment from somewhere it’s questionable whether it could have stayed afloat. Elsevier have also maintained Mendeley as a free service for users – many assumed that it would convert to a fully paid model. The free-for-users offering sits alongside an institutional business offering, but that was already the primary focus of the team before the purchase. Indeed, for full disclosure, I (amongst, I am sure, many others) strongly recommended that Mendeley shift its business focus from individual memberships to institutions long before the purchase.

Yes, there was a fear that Elsevier would just crush Mendeley, that it would go the way of many previous purchases, but at the very least that would never have been the intention, and many of those fears seem to have been unfounded.

Except that they weren’t the fears that at least some of us had. It’s actually telling that the narrative has legs, because it shows how much recent history has been forgotten. What has been largely forgotten is that Mendeley was once positioned as a major and very public force in driving wider access. The description of Mendeley in the Elsevier Connect blog post about the SSRN purchase illustrates this:

As a start-up, Mendeley was focused on building its reference manager workflow tool, gaining a loyal following of users thankful that it made their lives easier. Mendeley continues to evolve from offering purely software and content to being a sophisticated social, informational and analytical tool.

SSRN—the leading social science and humanities repository and online community—joins Elsevier – Gregg Gordon

It’s not that this isn’t true. But it excises from history one of the significant strands of Mendeley’s early development. Compare and contrast that with this from Ricardo Vidal on the official Mendeley blog in 2010 (emphasis added):

Keeping with the Open Access week spirit, we’re taking this opportunity to show you how to publicly share your own research on Mendeley. Making it openly available for others to easily access means they are more likely to cite you in their own publications, and also allows your colleagues to build upon your work faster.

When you sign up for a Mendeley user account, a researcher profile is created for you. On this page, along with your name, academic status, and short bio, you will also see a section titled “Publications”. This section is where you can display work you’ve published or perhaps even work that’s not yet published.

Or this from William Gunn in 2012 (again emphasis added):

Requiring the published results of taxpayer-funded research to be posted on the Internet in human and machine readable form would provide access to patients and caregivers, students and their teachers, researchers, entrepreneurs, and other taxpayers who paid for the research. Expanding access would speed the research process and increase the return on our investment in scientific research.

We have a brief, critical window of opportunity to underscore our community’s strong commitment to expanding the NIH Public Access Policy across all U.S. Federal Science Agencies. The administration is currently considering which policy actions are priorities that they will act on before the 2012 Presidential Election season swings into high gear later this summer. We need to ensure that Public Access is one of those priorities.

From 2008, when I met Victor Henning for the first time, until 2013, Mendeley was a major, and very public, advocate for Public Access and Open Access. It was a big part of the sales pitch for the service and a big part of the drive to increase the user base, particularly amongst younger researchers. Mendeley provided a means for researchers to share their articles on a public profile, and it did this in the context of a service that was useable and gaining traction. Alongside this it made a very public commitment to open data, being one of the first scholarly communications services to have a functional API providing data on bookmarks. It was the future of those activities that many of us feared for in 2013 when the purchase went through. So how have they fared?

Has Mendeley thrived as an “open access” organisation?

Public sharing of articles is long gone. Long enough that many people have forgotten it ever existed. I can’t actually figure out exactly when this happened but it was around the time of the purchase. A chunk of a Scholarly Kitchen post on the sale is very sceptical of how this could be managed inside Elsevier. Now any public links on researcher profiles are to the publisher versions of the articles. Note that shift: an available version of a manuscript was removed from public view and replaced with a link to the publisher website (not augmented, not “offering the better version”, replaced). Certainly Mendeley does not provide any means of publicly sharing documents any more.

Has Mendeley retained an individual voice as an advocate of Public and Open Access? Well, to be fair, there is at least some evidence of independence. This post from William Gunn on the STM/Elsevier policies on article sharing at least raises some things that Mendeley was not happy with. But more usual are posts that align with the overall Elsevier position. This one on the UK Access to Research Program, for instance, is a straightforward articulation of the Publishers Association position (and completely at odds with those earlier positions above). A search for “open access” gives this post on “Myths of Open Access“, a rather muted re-hash of a much older BioMedCentral post. The next several hits are posts from before the purchase, followed by the Access to Research post. There is nothing wrong with this, priorities change, but Mendeley is no longer the place to go for full-throated Open Access advocacy.

What about the API? As was noted of both the Mendeley and SSRN purchases, it’s the data that really matters. The real diagnostic of whether Elsevier was committed to the same vision as the founders would have been strong support for the API and Open Data. Now, the API never went away, but nor was it exactly the priority for development resources. For those who used that data, reliability went steadily downhill from around 2013 onwards. What had once been the jewel in the crown of open usage data in scholarly communications was for periods basically unusable. The priority was working with internal Elsevier data, building the vertical integration from Scopus through to Mendeley to support future products. Outside users (and therefore competitors) were a distant second. Things have improved significantly recently, a sign perhaps of investment priorities changing, but it is also coupled with some other troubling developments that indicate a desire to maintain control over information.
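For those who never used it, the value of that open usage data is easy to illustrate. The sketch below is roughly the kind of thing third-party developers built on it. The endpoint, parameters and response fields here are my assumptions based on the public catalog API as documented around that time, so treat it as illustrative rather than as documentation:

```python
# A sketch of the kind of open readership query outside developers built
# on the Mendeley API. The endpoint, parameters, and response fields are
# assumptions for illustration; an OAuth2 access token is required (the
# client-credentials flow that obtains one is not shown).
import requests

API_BASE = "https://api.mendeley.com"  # assumed base URL

def reader_count(doi: str, access_token: str) -> int | None:
    """Look up a paper by DOI and return its reader count, if found."""
    resp = requests.get(
        f"{API_BASE}/catalog",
        params={"doi": doi, "view": "stats"},  # assumed parameter names
        headers={"Authorization": f"Bearer {access_token}"},
        timeout=10,
    )
    resp.raise_for_status()
    results = resp.json()  # assumed: a JSON array of matching records
    return results[0].get("reader_count") if results else None
```

It was exactly this kind of simple, open query that made the bookmark data usable by services Elsevier did not control, which is why its reliability is such a useful diagnostic.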

Within the STM/Elsevier Article Sharing Policy is an implication of a requirement for information tracking: “Publishers and libraries should be able to measure the amount and type of sharing, using standards such as COUNTER, to better understand the habits of their readers and quantify the value of the services they provide”. Read that next to this post on the LibLicense List on growing restrictions on the re-use of COUNTER data imposed by publishers. Then ask why there is a need for a private mechanism for pushing usage data, one that is intended to support the transport of “personally identifiable data about usage”. Business models where “you are the product” benefit from scale and enclosed data. Innovative and small-scale market entrants benefit from an ecosystem of open data – closed data benefits large-scale, vertically integrated companies, of which there is really only one example in the Scholarly Communications space (for the future watch SpringerNature and the inevitable re-integration of DigitalScience after the IPO, but they’re not there yet).

If the argument for SSRN continuing successfully within Elsevier depends on whether Mendeley has succeeded within Elsevier, then that argument depends entirely on what you mean by “succeeded”. Mendeley has changed: opinion will divide rather sharply between those who think that change is overwhelmingly good and those who think it is overwhelmingly bad. About now, one group of readers (if they made it this far) will be exasperated with the “wooly” and “political” nature of the comments above. They will point to Mendeley “growing up” and “becoming serious”, properly integrated into a real business. Another group will be exasperated by the failure to criticise any and all commercial involvement, and the lack of commitment to the purity of the cause. If you fall into one of those camps then it tells me more about you than it does about either Mendeley or SSRN. Far more interesting is the question of whether each element of that change is good, or bad, and for whom, and at what level? Either way, the argument that “Mendeley thrived within Elsevier, therefore SSRN will” is certainly not a slam dunk.

What does this tell us about SSRN?

So what of the canaries? What does this tell us about what we should be watching for at SSRN? Certainly the purchase makes sense in the context of establishing stronger and broader data flows that can be converted to product for Elsevier. But what would be a sign that the service is drifting away from its original purpose? It’s those same three things: the elements that I have argued were lost after Mendeley was purchased.

  1. Advocacy: SSRN has always occupied a quite different space in its disciplinary fields to that of Mendeley, and has never had a strong policy/advocacy stance. Nonetheless, look for shifts in policy or narrative that align with the STM Article Sharing Policy or other policy initiatives driven from within Elsevier. Particularly in the light of recent developments with the Cancer Moonshot in the US, look for efforts to use SSRN to show that “these disciplines are different to science/medicine”.
  2. Data: SSRN doesn’t have an API, and access to data on usage is already moderately restrictive. One way for SSRN and Elsevier to prove me wrong would be to make a covenant on releasing data, or better still to build a truly open API. In the meantime, watch for changes in the terms of use on the pages that provide the rankings. When the site is updated with the major refresh that is almost certainly coming, check for the tell-tale signs of obfuscation that make pages hard to scrape (a simple monitoring sketch follows this list). These would be the signs of a gradual effort to lock down the data.
  3. Redirection: This is the big one. SSRN is a working paper repository; that is its central function. In that way it is different to Mendeley, where you could always argue that the public access element was a secondary function. Watch very closely for when (not if) links to publisher versions of articles appear. Watch how those are presented, and whether there is a move towards removing versions that might (perhaps) relate closely to a publisher version. Ask the question: is the fundamental purpose of this repository changing, based on the way it directs the attention of users seeking content?
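For the data point in particular, this kind of watching can be automated. A minimal sketch follows; the URL is a placeholder rather than a real SSRN address, and any real monitoring should of course respect the site’s terms of use and robots.txt:

```python
# A minimal watcher for the kind of quiet changes described in point 2:
# it snapshots a rankings page and flags when the markup changes, which
# is a prompt to look at what changed (new terms, new obfuscation).
# The URL is a placeholder, not a real SSRN endpoint.
import hashlib
import json
import pathlib

import requests

WATCH_URL = "https://example.org/ssrn-rankings"  # placeholder URL
STATE_FILE = pathlib.Path("rankings_snapshot.json")

def check_for_changes() -> bool:
    """Return True if the watched page differs from the stored snapshot."""
    html = requests.get(WATCH_URL, timeout=10).text
    digest = hashlib.sha256(html.encode("utf-8")).hexdigest()
    previous = (
        json.loads(STATE_FILE.read_text())["digest"] if STATE_FILE.exists() else None
    )
    STATE_FILE.write_text(json.dumps({"digest": digest}))  # store new snapshot
    return previous is not None and previous != digest

if __name__ == "__main__":
    print("Page changed since last check:", check_for_changes())
```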

SSRN is a good service, and the Elsevier investment will improve it substantially. Whether the existing community of users will continue to trust and use it is an open question. I’ve heard more disquiet from real researchers beyond the Twitter echo chamber than I would have expected, but any reading of history suggests that it will lose some users and retain a substantial number. Service improvements offer the opportunity for future growth, although SSRN already has substantial market penetration, something that wasn’t true of Mendeley and its subsequent growth. It’s not clear to me that integration with the wider Elsevier ecosystem is something that user base wants, but improved interfaces and information flow to other systems certainly are. The real question is the extent to which Elsevier exerts control over information and user flows, and whether it uses that control to foster competition or to suppress it.

Note on changes made post-publication: I have modified the initial paragraph because it was pointed out that it could be read as me being dismissive of the quality of arguments being made in other posts on the SSRN sale. That wasn’t my intent. What I had intended to say was that the different sides of those arguments build on quite separate narratives and world-views and that I had seen little evidence of real understanding of what the issues are across the stakeholder and cultural divides.  The original first paragraph is below.

The response to the sale of SSRN to Elsevier is pretty much entirely predictable. As I noted in my UKSG talk different groups are simply talking past each other, completely failing to understand what matters to the others, and for the most part failing to see the damage that some of that communication is doing. But that’s not what I want to cover here. What I’m interested is a particular strand of the conversation and what it tells us about the risks, both for SSRN as a trusted repository for a range of research communities, and for the ecosystem as a whole.

Scholarly Communications: Less of a market, more like general taxation?

Plot of the top bracket of U.S. Federal marginal income tax (Photo credit: Wikipedia)

 

This is necessarily speculative, and I can’t claim to have bottomed out all the potential issues with this framing. It therefore stands as one of the “thinking in progress” posts I promised earlier in the year. Nonetheless it does seem like an interesting framing through which to pursue an understanding of scholarly communications.

The idea of “the market” in scholarly communications has rubbed me up the wrong way for a long time. Two things grate: the moral and political superiority claimed by private and commercial players through the presumption that markets should be unregulated and protected from “public interference”, and the simplistic notion (acknowledging I’ve been guilty of this myself) of pointing to the market as “broken”.

A market, whether meant in the sense of a system of exchange that optimises complex pricing for a (set of) good(s), or in the political sense of a protected space for free private exchange, requires the ability for free exchange across that “market”. In its political sense it should also be somehow “organic”, that is, it develops free from direct subsidy or indirect government control. None of these are true of the scholarly communications system: there is no free exchange amongst providers or purchasers, there is, as we shall see, effective compulsion on both purchasers and providers, and the whole system only exists because of government subsidy.

Pointing to market failure isn’t helpful if there isn’t really a market. Nor is claiming that markets support better innovation and customer service. It’s poor politics, but it’s also poor systems analysis. If we want to make things better, we need a framing that actually explains, and ideally predicts, how changes might play out. After a conversation this morning I’m wondering whether general taxation is a better framing for the scholarly communications system.

Mancur Olson, in The Logic of Collective Action, lays out the problem of how large groups can provision what he calls a “collective good”; today we would use the term “public good”. While he focusses on other examples, he keeps the case of general taxation in the background. His conclusion is that large groups, which he describes as “latent”, need some form of compulsion to provide public goods. Buchanan, Ostrom and others would later work on the ways in which smaller groups can successfully manage public-like goods that have some (differing) elements of private good-like natures. Buchanan focussed on “local public goods”, what we now call “toll goods” or “club goods”, which are non-rivalrous but excludable; Ostrom on “common pool resources” (CPRs), which are non-excludable but rivalrous. In both cases they diagnose limits on the scale of groups that can successfully manage these resources. Neither Ostrom nor Buchanan looked in great detail at the gradation from their ideal types (pure CPRs and club goods) as goods become more public-like. We therefore need an economics of “public-making”.
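Olson’s core argument can be sketched in a couple of lines. This is my notation, loosely adapted from the book rather than quoted from his formalism:

```latex
% A toy formalisation of Olson's provision condition (notation loosely
% adapted from The Logic of Collective Action, not a quotation of it).
%   V_g : value of the collective good to the group as a whole
%   F_i : member i's fraction of that value, F_i = V_i / V_g
%   C   : cost of providing the good
% Member i provides the good unaided only if their own share of the
% benefit exceeds the full cost:
\[
  F_i \, V_g > C
\]
% As the group grows, F_i shrinks towards zero for every member, so in
% a large ("latent") group no individual's share covers the cost, and
% the good goes unprovided without compulsion or selective incentives.
```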

The scholarly community (at least as currently configured) is far too large to provide public goods without compulsion or to manage CPRs or club goods. Yet research is about the generation of public goods (or at least public-like club goods). We therefore have a provisioning problem, and according to Olson one that requires compulsion to resource. We need tax.

That publishers provide a public-like good is easy to demonstrate for subscription articles. Once a PDF is created it is an almost pure public good in the wild. It can be distributed infinitely (non-rivalrous) and is almost impossible to stop (non-excludable). Equally the provision of metadata that is freely available for re-use is also a public good provision. The benefits of the review process are available to all who can access the article so that is also a public good. But herein lies a paradox. If these are public goods then classical economics tells us they won’t be provided by private enterprise. So what is going on?

One of Olson’s key insights is to separate the issue of public vs private (which is a vexed division in scholarly communications anyway) from the systems that can impose the compulsion that provides public or collective goods. He notes for instance that in the early to mid-20th century, workers in many US firms voted overwhelmingly in secret ballots to impose union membership on themselves and to create closed shops. Although we don’t get a vote on it, the world of academic publishing is not so different from the closed industrial shops of 50s America. If you don’t publish in the right places, you’re not part of the club. So there is a “tax” on the supply side, although the dues are paid in the effort of authorship (and refereeing) rather than as union membership fees. That compulsion is how knowledge as a club good is made more public-like.

But what about the demand side? What about the real money? Publishers are also providing a public good for institutions. It’s the system that matters from an institutional perspective, the exposure that their researchers get, the credit for publication and therefore “membership”. One journal more or less doesn’t really matter. Again, classical economics says there should be a substantial free rider problem, with each library gradually cutting back on subscriptions to bring down expenditure (this argument has nothing to do with the serials crisis; it would still follow with a flat, or even dropping, overall expenditure). In practice libraries don’t really have a choice about dealing with the Big 4 publishers, and not much choice about what they pay. Fees are configured so that there are only marginal savings in cutting the big deal and going a la carte. Indeed fees are based on a bizarre set of historical factors that have no real basis in reality. Effectively there is compulsion. It is more like tax than a market.
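A toy calculation, with entirely invented numbers, shows how that fee configuration works:

```latex
% Entirely invented numbers, purely to illustrate the pricing logic.
% Suppose a big deal offers 2{,}000 titles for \$1{,}000{,}000, while
% the a la carte list price averages \$2{,}500 per title. Subscribing
% individually to just the 500 most-used titles then already costs
\[
  500 \times \$2{,}500 \;=\; \$1{,}250{,}000 ,
\]
% more than the whole bundle. With list prices set this way, cancelling
% the big deal and going a la carte yields no saving: effective compulsion.
```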

If the Big 4 are a bit like Federal income tax in the US system (yes, the analogy breaks down a bit there but there are multiple strands of national tax in some countries, such as “National Insurance” here in the UK) then the smaller publishers are a bit similar to US state income taxes. If you want you can move to a state with low or no tax, and you can equally have an institution that does no humanities, or no chemistry and avoid paying out to SAGE or the ACS, but those are not the regular choices, and they limit your opportunities to operate. Article Processing Charges are equivalent to a sales or consumption tax. The analogy works right down to the political argument about how consumption taxes can in principle reduce income tax evasion (or redistribute the burden of the scholcomms system back towards the biggest players) because even those who can afford good accountants (or in our case subscription negotiators) still consume goods (or publish articles), but often in practice also raise access barriers for the poor, particularly those seeking to raise their income above the local tax-free threshold.

The serials crisis is then the unconstrained rise in collective provisioning. Far from being a market response to demand, it is a self-reinforcing increase in provision that can’t possibly be subject to proper market discipline, because a market can never truly operate. I don’t need to spell out the political consequences of this argument, but I find it fascinating that we can frame the system as a European general taxation system run by organisations that espouse American fiscal politics. It certainly creates a moral equivalence between “private” provision of these public goods and collective provisioning, such as through repository systems. There really isn’t a difference. It’s simply a question of which form of provision is better suited and better value.

The economic consequence is that we cannot constrain costs through market-oriented actions. The only way to constrain runaway collective provisioning is through collective action. But collective action is hard because we operate in an existing collective system, with its existing mechanisms of compulsion. The evolution of systems that provide public goods is slow compared to that of private goods (I am minded to compare manuscript submission systems with government websites as an example). Change will come locally, with local changes of rules. The local governments of the system, the scholarly societies, perhaps some institutions and University Presses, are where change can start: these are the small groups that Buchanan and Ostrom, as well as Olson, identify as able to act.

Ostrom in particular notes that very large scale management of common pool resources is possible within communities (and that privatisation and nationalisation both tend to fail) as long as institutions are built that can coordinate groups of groups. Federations of societies and groups of institutions alongside collective mechanisms like Open Library of Humanities and Knowledge Unlatched offer some possible directions, as does an interesting increase in the number of endowed journals being launched. Ostrom’s prescription is to build institutions that support communities working together effectively, so they can build these large collectives.

And the big governments? Well they can be changed or their orientation shifted from “big” to “small”. And if I recall correctly issues with taxation are often at the root of that change.

Open Access Progress: Anecdotes from close to home

Solution extinction measurements of (A) faceted and (B) Au@MNPs, and (C) photos of the particles. From Silva et al., Chem. Commun., 2016, DOI: 10.1039/C6CC03225G

It has become rather fashionable in some circles to complain about the lack of progress on Open Access, and particularly to decry the apparent failure of UK policies to move things forward. I’ve been guilty of frustration at various stages in the past, and one thing I’ve always found useful is thinking back to where things were. So with that in mind, here’s an anecdote or two suggesting not just progress but a substantial shift in the underlying practice.

I live with a chemist, a group not known for their engagement with Open Access. More than most other disciplines in my experience there is a rigid hierarchy of journals, a mechanistic view of productivity, and – particularly in those areas not awash with pharmaceutical funding – not a huge amount of money. Combine that with a tendency to think everything is – or at least should be – patentable (which tends to rule out preprints) and this is not fertile ground for OA advocacy.

Over the years we’ve had our fair share of disagreements. A less than ideal wording on the local institutional mandate meant that archiving was off the menu for a while (the agreement to deposit required all staff to deposit but also required the depositor to take personal responsibility for any copyright breaches) and a lack of funds (and an institutional decision to concentrate RCUK funds and RSC vouchers on only the journals at the top of that rigid hierarchy) meant that OA publication in the journals of choice was not feasible either. That argument about whether you choose to pay an APC or buy reagents for the student was not a hypothetical in our household.

But over the past year things have shifted. A few weeks ago: “You know, I just realised my last two papers were published Open Access”. The systems and the funds are starting to work, starting to reach even into those corners of resistance, yes, even into chemistry. Yes, it’s still the natural sciences, and yes, it’s only two articles out of who knows how many (I’m not the successful scientist in the house), but it’s quite a substantial shift from being totally out of the question.

But at around the same time came something I found even more interesting. Glimpsed over a shoulder, I saw something odd: searching on a publisher website, which is strange enough, and searching only for Open Access content. A query raised the response: “Yeah, these CC BY articles are great; I can use the images directly in my lectures without having to worry. I just cite the article, which after all I would obviously have done anyway”. It turns out that with lecture video capture now becoming standard, universities are getting steadily more worried about copyright. The Attribution-licensed content meant there was no need to worry.

Sure, these are just anecdotes, but to me they’re indicative of a shift in the narrative. A shift from “this is expensive and irrelevant to me” to “the system takes care of it and I’m seeing benefits”. Of course we can complain that it’s costing too much, that much of the system is flaky at best and absent at worst, or that the world could be so much better. We can and should point to all the things that are sub-optimal. But just as the road may stretch out some distance ahead, with roadblocks and barriers in front of us, there is also a long stretch of road behind, with the barriers cleared or overcome.

As much as anything it was the sense of “that’s just how things are now” that made me feel like real progress has been made. If that is spreading, even if slowly, then the shift towards a new normal may finally be underway.

PolEcon of OA Publishing VI: Economies of Scale

Victory Press type used by SFPP (Photo credit: Wikipedia)

I think I committed to one of these every two weeks, didn’t I? So, already behind? Some of what I intended for this section was already covered in What are the assets of a journal? and in Critiquing the Standard Analytics Paper, so this is headed in a slightly different direction than originally planned.

There are two things you frequently hear in criticism of scholarly publishers. One is “why can’t they do X? It’s trivial. Service Y does this for free and much better!”. I covered some of the reasons this is less true than you might think in What’s the technical problem with reforming scholarly publishing? In particular I argued that it is often the scale of systems, and the complexities of their interconnection, that mean things that appear simple (and often are for a one-off case) are much more complex in context.

In turn this argument, that big systems become more rigid and require greater investment in management and maintenance, raises the other oft-heard comment: that the industry is “ripe for disruption” by new, nimble players. There is growing criticism that Christensen’s narrative of industrial disruption isn’t very general (see e.g. articles in the New Yorker and Sloan Business Review [paywall]), but for the purposes of this piece let’s assume that it is a good model; that large industrial players tend towards a state of maintenance and incremental improvement, where they are vulnerable to smaller, more innovative players who have less to lose and less investment in existing systems and customers, and can therefore implement radical change. This kind of narrative is definitely appealing, even empowering, to those agitating for change. It is also picked up and applied by analysts who take a more traditional approach.

So why hasn’t it happened?

The news pieces heralding the imminent demise of Elsevier started in the mid-90s. And people in the know will admit that internally there was a real fear that the business would be in big trouble. But despite this the big players remain remarkably stable. Indeed consolidation in the industry has increased that stability with a rapidly decreasing number of independent companies publishing a substantial proportion of the world’s research literature. The merger of Springer and Nature is just the latest in a long run of mergers and purchases.

A conventional market watcher will say that such mergers indicate there are economies of scale that can be harnessed to deliver better returns. Cynics often point out that what is really happening is a recapitalisation that benefits a small number of people but doesn’t generate overall returns. Certainly in scholarly publishing there are serious questions about whether combining very different cultures creates benefits. It is not an accident that mergers and purchases are often followed by a mass exodus of staff. In the case of the Springer-Nature merger, the departure of many senior ex-Nature staff is a clear indication of which of the two cultures is dominating the transition.

If we take Christensen seriously, then a lack of disruption alongside consolidation implies that there are real economies of scale to be achieved. What are they? And are they enough to continue providing a sufficient barrier to entry that disruption won’t happen in the future?

Capital

Possibly the biggest economy of scale is the most obvious: simply having access to resources. Elsevier’s public reports show that it has access to substantial amounts of cash (albeit as a credit facility). The 2015 RELX Annual Report [pdf] (p57) notes that “Development costs of £242m (2014: £203m) were capitalised within internally developed intangible assets, most notably investment in new products and related infrastructure in the Legal and Scientific, Technical & Medical businesses”, giving a sense of the scale of investment. It is these resources that allow large corporations to invest large amounts of money in internal projects and also to buy up external developments, whether to add to the technology stack, bring in new expertise, or remove potential competitors from the market. Sufficient cash provides a lot of flexibility and opportunity for experimentation.

It’s not just the big players. PLOS is probably best characterised as a “medium-sized” publisher, but it benefits from having access to capital built up during the period 2010-13 when it had significant surpluses. In the 2015 Annual Update PLOS reported $3.7M (8% of $46.5M total expenses) spent on R&D in FY 2014, and with a new platform launching imminently this probably saw a large uptick in FY 2015. eLife has not reported 2015 figures, but again substantial development has gone into a recently launched platform, supported by access to capital. Smaller publishers, and particularly new market entrants, don’t generally have access to the kind of capital that enables this scale of technology development. Community development is starting to make inroads, and Open Source community projects are the most likely to challenge existing capital concentration, but it is slow progress.

Broad Business Intelligence Base

Scholarly publishing is both slow and tribal. Publishers with a broad-based set of journals have a substantial business intelligence advantage in understanding how publication behaviour and markets are changing. They also have privileged access to large numbers of authors, editors, and readers. They don’t always use this information well – the number of poorly constructed and misinterpreted surveys is appalling – and they sometimes get tunnel vision based on the information they have, but nonetheless this is an incredible asset.

At best, a disruptive market entrant will have deep insight into a specific community, and may be well placed to serve that specific niche better. The tradeoff is that they frequently struggle to scale beyond that base. This is in fact generally true of technology interventions in scholarly communications: what developers think is general often does not work outside the proof-of-concept community. A variant of this is showing that something works using the literature in PubMed Central or Europe PubMed Central as a demonstrator, and then failing to understand how the lack of infrastructure outside that narrow disciplinary space makes generalisation incredibly difficult.

Seeing a broad landscape, and being able to see future trends developing, is a powerful advantage of the big players. Used well it can drive the big strategic shifts (Springer’s big bet on APC-funded Open Access, Elsevier’s massive shift in investment away from content and publishing into integrating Scopus/SciVal and Mendeley for information management). Used less effectively it can lead to stasis and confusion. But this information is an enormous strategic asset and one that smaller players struggle to compete with. Disruption in this context has to wait for the unforced error.

Diversified financial risks

Having the scale to develop a range of revenue sources is a huge benefit. Springer have a strong books business, Elsevier a significant revenue source in information. They are also diversified across disciplinary areas, some growing, some shrinking, and generally have more coverage of varied geographies. New players like PeerJ, eLife, or Open Library of Humanities tend to have at best a few revenue sources, and often a restricted disciplinary focus. A lack of revenue diversity is certainly a risk factor for any growing business. On the other side, most advice to start-ups is to focus on developing one single revenue source; the transition from that start-up mode to diversification is often the challenge.

The other form of scale comes from having a sufficiently diverse and large journal portfolio to make up a big deal. It can be difficult for a small publisher even to get libraries to give them time when the amounts of money are relatively small; effort is concentrated on the big negotiations. Having a stable of journals that includes “must-have” subscription titles alongside volume (whether in mastheads or article numbers) that can be used to justify the annual price increases has been a solid strategy. The arguments for market failure in subscriptions are well rehearsed.

Ironically, the serials crisis is leading to a new market emerging: a market in big deals themselves. With libraries increasingly asking “which big deal do we cancel?”, the question of which of the big four is offering the worst deal becomes important. The deals being sought also differ. In North America it is usually Wiley or Springer deals being cancelled; the Elsevier big deal for subscription content generally appears to be better value for money in that context. In Europe, where the deal being sought includes provision of large-scale Open Access publishing in some form, it is the Elsevier big deal that more frequently looks at risk of being dropped. This kind of competition over big deals isn’t yet happening at large scale, but it does pose some risk that the economies of scale gained from a diverse journal list become a liability if large sets of (low volume and low prestige) titles become less attractive.

Successful Challengers

If we look at new and emerging entrants to the market that have succeeded, we can see that in many cases their success lies in having a way around some of these economies of scale. Capital injection, directly through grants (PLOS, eLife, OLH) or from investors (Ubiquity, PeerJ), is common. Building on existing Open Source technology stacks (Ubiquity, Co-Action) and/or applying deep technical knowledge to reduce start-up costs (PeerJ, Pensoft) is a strategy for avoiding the capital barrier.

Successful startups have often created a new market in some form (PLOS ONE being the classic example), built on deep experience of a specific segment (PeerJ, CoAction), or found a new insight into how finances could work (Ubiquity, OLH). Many big players are poor at fully exploiting the business intelligence they have at their disposal. For whatever reason, scholarly publishing is not a strongly data-led business in the way that term is usually used in the technology industry. Missed opportunities remain one of the routes to success for the new, smaller players.

Looking across the crop of existing and emerging new players, revenue diversification remains a serious weakness, and this limits the scale of any disruption they might lead. In this sense it could be argued that there are not yet any mature businesses amongst them. Ubiquity Press is probably the best example of a new publisher developing diversified revenue streams. Ubiquity’s offerings include an in-house OA publishing business with a low enough cost base to support a range of funding models including APCs, endowments, and community supported arrangements. It also provides services to a growing number of University Presses, as well as underpinning the operations of OLH.

Real disruption of the big players will need a combination of financial stability, a very low cost base, and technical systems that can truly scale. All of these need to come together, at the same time as an ability to either co-opt or appropriate the existing markers of prestige. Christensen’s disruption narrative presumes that there is a parallel space or new market that can be created, separate from existing assumptions about “quality” in the disrupted market. But when “quality” is really “prestige”, that is, in a luxury goods market, this is much harder to achieve. The financial and business capacity to disrupt is not enough when quality, prestige and price are all coupled together in ways that don’t necessarily make any rational economic or business sense. That’s the piece I’ll move on to tackling next.