They. Just. Don’t. Get. It…

6 March 2012 22 Comments

…although some are perhaps starting to see the problems that are going to arise.

Last week I spoke at a Question Time style event held at Oxford University and organised by Simon Benjamin and Victoria Watson called “The Scientific Evolution: Open Science and the Future of Publishing” featuring Tim Gowers (Cambridge), Victor Henning (Mendeley), Alison Mitchell (Nature Publishing Group), Alicia Wise (Elsevier), and Robert Winston (mainly in his role as TV talking head on science issues). You can get a feel for the proceedings from Lucy Pratt’s summary but I want to focus on one specific issue.

As is common for me recently, I emphasised that networked research communication needs to be different to what we are used to. I drew a comparison with the printing press: when it was developed, one of the first things that happened was that people created facsimiles of handwritten manuscripts. It took hundreds of years for someone to come up with the idea of a newspaper, and to some extent our current use of the network is exactly that – digital facsimiles of paper objects, not truly networked communication.

It’s difficult to predict exactly what form a real networked communication system will take, in much the same way that asking a 16th century printer how newspaper advertising would work would not provide a detailed and accurate answer, but there are some principles of successful network systems that we can see emerging. Effective network systems distribute control and avoid centralisation; they are loosely coupled and distributed. This is very different to the centralised systems for control of access that we have today.

This is a difficult concept and one that scholarly publishers simply don’t get for the most part. This is not particularly surprising because truly disruptive innovation rarely comes from incumbent players. Large and entrenched organisations don’t generally enable the kind of thinking that is required to see the new possibilities. This is seen in publishers’ statements that they are providing “more access than ever before” via “more routes”, but these are all routes under tight centralised control, with control systems that don’t scale. By insisting on centralised control over access publishers are setting themselves up to fail.

Nowhere is this going to play out more starkly than in the area of text mining. Bob Campbell from Wiley-Blackwell walked into this – but few noticed it – with the now familiar claim that “text mining is not a problem because people can ask permission”. Centralised control, failure to appreciate scale, and failure to understand the necessity of distribution and distributed systems. I have with me a device capable of holding the text of perhaps 100,000 papers. It also has the processor power to mine that text. It is my phone. In 2-3 years our phones, hell our watches, will have the capacity to not only hold the world’s literature but also to mine it, in context, for what I want right now. Is Bob Campbell ready for every researcher, indeed every interested person in the world, to come into his office and discuss an agreement for text mining? Because the mining I want to do and the mining that Peter Murray-Rust wants to do will be different, and what I will want to do tomorrow is different to what I want to do today. This kind of personalised mining is going to be the accepted norm of handling information online very soon and will be at the very centre of how we discover the information we need. Google will provide a high quality service for free, subscription based scholarly publishers will charge an arm and a leg for a deeply inferior one – because Google is built to exploit network scale.

The problem of scale has also just played out in fact. Heather Piwowar writing yesterday describes a call with six Elsevier staffers to discuss her project and needs for text mining. Heather of course now has to have this same conversation with Wiley, NPG, ACS, and all the other subscription based publishers, who will no doubt demand different conditions, creating a nightmare patchwork of different levels of access on different parts of the corpus. But the bit I want to draw out is at the bottom of the post where Heather describes the concerns of Alicia Wise:

At the end of the call, I stated that I’d like to blog the call… it was quickly agreed that was fine. Alicia mentioned her only hesitation was that she might be overwhelmed by requests from others who also want text mining access. Reasonable.

Except that it isn’t. It’s perfectly reasonable for every single person who wants to text mine to want a conversation about access. Elsevier, because they demand control, have set themselves up as the bottleneck. This is really the key point: because the subscription business model implies an imperative to extract income from all possible uses of the content, it sets up a need for control of access for differential uses. This means in turn that each different use, and especially each new use, has to be individually negotiated, usually by humans, apparently about six of them. This will fail because it cannot scale in the same way that the demand will.

The technology exists today to make this kind of mass distributed text mining trivial. Publishers could push content to BitTorrent servers and then publish regular deltas to notify users of new content. The infrastructure for this already exists. There is no infrastructure investment required. The problem that publishers raise of their servers not coping is one that they have created for themselves. The catch is that distributed systems can’t be controlled from the centre and giving up control requires a different business model. But this is also an opportunity. The publishers also save money if they give up control – no more need for six people to sit in on each of hundreds of thousands of meetings. I often wonder how much lower subscriptions would be if they didn’t need to cover the cost of access control, sales, and legal teams.
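
To make the “regular deltas” idea a little more concrete, here is a minimal sketch in Python. It assumes, purely for illustration, that each release of a corpus is a directory of article files and that a delta is simply the list of paths added, changed, or removed since the previous release. The function names `build_manifest` and `compute_delta` are hypothetical and are not part of any publisher’s system or any BitTorrent client.

```python
import hashlib
from pathlib import Path


def build_manifest(corpus_dir):
    """Map each file's path (relative to corpus_dir) to the SHA-256 of its contents."""
    root = Path(corpus_dir)
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compute_delta(old_manifest, new_manifest):
    """Describe what a mirror holding old_manifest needs to fetch or drop."""
    added = sorted(p for p in new_manifest if p not in old_manifest)
    removed = sorted(p for p in old_manifest if p not in new_manifest)
    changed = sorted(
        p
        for p in new_manifest
        if p in old_manifest and new_manifest[p] != old_manifest[p]
    )
    return {"added": added, "changed": changed, "removed": removed}
```

A publisher would only need to publish the small delta alongside each release; anyone holding the previous release could then fetch just the added and changed files from whichever peers have them, with no negotiation and no load on a central server.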

We are increasingly going to see these kinds of failures. Legal and technical incompatibility of resources, contractual requirements at odds with local legal systems, and above all the claim “you can just ask for permission” without the backing of the hundreds or thousands of people that would be required to provide a timely answer. And that’s before we deal with the fact that the most common answer will be “mumble”. A centralised access control system is simply not fit for purpose in a networked world. As demand scales, people making legitimate requests for access will have the effect of a distributed denial of service attack. The clue is in the name; the demand is distributed. If the access control mechanisms are manual, human and centralised, they will fail. But if that’s what it takes to get subscription publishers to wake up to the fact that the networked world is different then so be it.


  • Bjoern Brembs said:

    “I often wonder how much lower subscriptions would be if they didn’t
    need to cover the cost of access control, sales, and legal teams.”

    lol :-) What makes you think that lowered costs would mean lowered subscriptions and not increased profit margins? Reading all these Elsevier press releases must have gotten to you somehow :-)

  • Ross Mounce said:

    Just for the sake of providing a *real* supporting example -> decentralized journal/article distribution is ALREADY HAPPENING!

    I have 20,000+ PLoS articles on my computer right now. You can get them too – via http://www.biotorrents.net/browse.php
    When compressed it’s <16GB of files – I can take PLoS on a USB stick with me wherever I go. It was easy to download too via my high-speed institutional connection – and didn't overload PLoS's servers because I didn't *get* the articles from their servers. With peer-2-peer file sharing the load is balanced between seeders (and in turn, I'm now seeding this torrent too, to help share the load). If all institutions or libraries agreed to seed the world's research literature, without copyright restriction on electronic redistribution (which we could do tomorrow if it weren't for the legal copyright barriers imposed by certain publishers), doing literature research would be pretty much frictionless!

    Institutions already agree to help distribute code e.g. R and its packages http://cran.r-project.org/mirrors.html – this is hugely beneficial, and helps share the costs associated with bandwidth used; why not for research publications? The PLoS corpus is great to try out mining ideas – it shows you how easy academic life *could* be. I've run some simple scripts on it myself. I'm not sure the simple things I did such as string matching could be classified as 'text mining' – but one thing I do know is – it was 100,000 times easier/quicker doing this locally, machine-reading files, rather than doing it paper by paper, negotiating paywalls and getting cut off by publishers. It's worth pointing out as well that once you have all the literature you need on your computer – you don't even need the internet to do your research! For research in less economically developed countries, with weaker telecoms infrastructure – I'd imagine this would be a real boon.

    It's a window on the world that *could* be possible if we just changed our attitude WRT to copyright and research publishing. That PLoS uses the CC-BY licence makes this all possible http://creativecommons.org/licenses/by/3.0/

    The rights to electronically redistribute, and machine-read texts are vital for 21st century research – yet currently we academics often relinquish these rights to publishers. This has got to stop. The world is networked, thus scholarly literature should move with the times and be openly networked too.

  • Barend Mons said:

    Cameron, like this.
    However, we have a ‘million thieves approach’:
    1. We all have a bookmarklet, we don’t have to wait for the iPhone here
    2. We all have a subscription to the biggies
    3. We mine and only focus on facts (essentially triples) that we do not know yet from public databases (the vast majority can be found anyway)
    4. We highlight the sentences containing ‘assertions we do not yet have in Open Access’
    5. We ‘cite’ them legally and create a nanopublication in RDF with proper provenance to acknowledge the source article…done

    If 2 M scientists have that app, we will soon be done with text mining full text (for what it’s worth anyway).

  • Peter Murray-Rust said:

    Great analysis. Poor Heather. She’s been drawn into this without realising the issues. Anyone who hasn’t seen it before doesn’t see the issue of control. I’ve blogged it at:
    I’ll comment on UBC tomorrow although Cameron has done much of it already

  • Cameron Neylon said:

    Oh Heather is fully aware of the issues. I wouldn’t worry about that. She’s just pursuing a different path which I think is a good thing to do. You’re pushing out the principles, she’s doing what the publishers say she should – and is fully aware of the problems that it will create. I see this as a multipronged approach. Publishers do have their preferred route – we should demonstrate to them why it won’t work – while pointing out that the alternatives are easy and straightforward…and right.

  • Cameron Neylon said:

    Barend, agree this is a good stop gap measure but it still centralises the process. So it takes us a long way but doesn’t let us deal with the need for me to mine something of interest to me, that isn’t of interest to others, right now. So while it is an effective approach in my view it only takes us half way. That said it’s a valuable way of proving the point that the access will have real value as will the mined data.
    Bottom line, we shouldn’t need these stop gap approaches, we should just be able to get on with it and get back to the research.

  • Cameron Neylon said:

    :-) Ok, that should have been a “could” not a “would”. I think there is a point to thinking around what the costs of publication really are and how low they can in fact go.

  • Bill Hooker said:

    It’s a good job Cameron got there first because all I want to say to this is, don’t be such a patronizing git.  If you think Heather is a helpless naif then you don’t know nearly as much about schol comm as you think you do. 

  • Alicia Wise said:

    Hi Cameron,
    Where to start?  Well… I agree with you that text mining won’t scale if we (at Elsevier and every other publisher or content creator) require a conversation in advance with each and every person who wants to mine every time they want to do so for any reason.  There need to be easy, seamless ways of doing this.
    It’s actually not a new problem.  When photocopying emerged in the late 60’s and early 70’s, for example, there was consternation that authors and publishers would need to be found and asked permission each and every time a photocopy was made.  Of course that wouldn’t scale either, and so there emerged enabling solutions – like collective licensing organisations – that could scale.
    We are actively working on this enabling framework for text mining.  Heather Piwowar was extremely helpful – her discussion with my six colleagues helped us build some cross-disciplinary understanding and consensus internally on how best to continue developing our systems at Elsevier to better match her requirements and those of others who wish to text mine.
    What the digital age needs is better, automated infrastructure to handle all the messy business of rights and permissions online.  This should be seamless and invisible to users and simply work well.  To illustrate this think about making an old-fashioned phone call from that phone of yours.  You pay a subscription or as you go, and you make calls which are completed correctly and quickly each and every time.  A whole series of service providers and different telecomms companies are paid depending on who you call and where they are located – but you don’t have to know about all of that complex messiness.  It just works well.  That’s what we want for text mining (and other uses/users) too.
    There has been some interesting work done in this space, but it is not yet seamless and joined-up and functioning at scale.  For example, standards work has been done (the ONIX-PL work for example under the auspices of EDItEUR and NISO – see http://www.editeur.org/21/ONIX-PL/ and I've written a bit about this area – for example see http://copyright-debate.co.uk/?p=172).  Thoughts, constructive comments, collaboration would all be very welcome!
    With kind wishes,

  • Mike Taylor said:

    “Of course [photocopying permission inquiries] wouldn’t scale either, and so there emerged enabling
    solutions – like collective licensing organisations – that could scale.” … “What the digital age needs is better, automated infrastructure to handle all the messy business of rights and permissions online.”

    First of all, Alicia, I want to thank you for keeping on coming back into these discussions.  In the face of some of the hostility you’ve seen, it can’t be easy, and I want to place my appreciation on record.

    With that said, your answer fills me with dismay.  Faced with the current tangle of permission barriers, the publishers’ solution is to build more efficient machines for climbing over the barriers at particular points; whereas the scientist’s goal is to get rid of the barriers and get some work done.  It seems as though the area of text-mining might be the one that most lays bare the conflicting goals of publishers and scientists.

    I know it’s a mug’s game to guess someone else’s motivations, but from the outside, publishers’ reaction to text-mining looks like fear.  Fear that someone, somewhere, might be making some money without giving the publishers a slice.  So the fear reflex kicks in and the barriers go up; and of course 99 times out of 100, what happens is not that the work continues but publishers get paid; what happens is that the work never happens, or it happens on some other corpus of material.

    Imposing barriers to text-mining is never going to allow publishers to win.  It’s just going to make everyone else lose.

    There’s another factor that comes in here.  In the discussions surrounding the RWA, publishers were very keen to draw a distinction between research products (facts, information) and publication products (the PDF that you download from Science Direct).  Leaving aside the fact that this is rather a specious distinction in practice, since the former is nearly always incarnated in the latter, it must be clear that text-mining is interested only in the former — in the facts that scientists produced, not in the typesetting that publishers imposed.  (In fact the latter is often an active impediment).  So the publishers’ contribution is precisely what miners don’t want; and what they do want, publishers have no legitimate claim over.


  • Cameron Neylon said:

    Hi Alicia

    Just wanted to echo what Mike said, that we do value the engagement in this space. Also I realise that what I wrote probably came across as snarking at you and what you’d set up, which wasn’t really my intention. I agree that what you did was the right thing and the most positive you can do given the constraints that you work under, particularly for the content which you were talking about going backwards. What really drove me to the post was comments from Bob Campbell and Graham Taylor saying “there is no access/text mining/whatever problem”. Clearly if you want to talk to people there are issues, both technical and policy, and it’s great that they are being discussed and I’m glad the conversation took place. But the Publishers Association in particular keeps making statements that the problems that I face day to day in my work don’t exist and that _really_ gets my goat.
    In particular the fact that you get so few text mining requests is often presented as evidence that there is no problem – rather than prima facie evidence that there is a serious problem. We could do cool stuff today, and it would solve real problems that I have – I would really like to be text mining all the literature all the time based on what I’m actually doing. I want that running on my desktop so when I find something interesting it just pops up with all the related stuff that would provide context. This isn’t even very hard with stuff that exists today, and yet it sounds like some sort of fanciful fairy tale. It’s something everybody wants, but nobody has. That’s got to be indicative of a really serious underlying problem.
    I’ll take some time to read the copyright debate piece but I’ll get this off now. My initial response to it remains the same though. This is a terribly complex system proposed to solve a problem that would not exist if we shifted the business models. If it works _perfectly_ then it’s fine, but any point of failure will lead to the same problems we have with DRM: everyone just cracks it and gets on with their life. If we align the business models with the web then these problems just go away. I’m not saying that’s simple but I think the argument for mixed models will fail because it won’t scale – and neither will complex negotiation systems. Even if you get the rights element to work, you then need to solve the problem of privacy, legal obligations over data and freedom of information, and those are much _much_ harder. But all that said – I’ll read it in full because I may well be blathering based on a quick skim.

  • bentoth said:

    The creation of the Copyright Licensing Agency in no way sets a good precedent for what should happen to text mining. Essentially, the CLA has obscured user rights and brought in large sums from the public sector by creating a curious organisation that indemnifies licensees against prosecution from itself.  

  • hacking the scholarly method 2: hack harder | cottage labs said:

    […] The function of writing up some research should be one of aggregating bits and pieces of other research, distilling them, adding something new, and presenting them as another piece of the scholarly landscape. It should be easy to link to and represent relevant works and supporting data, to collect feedback and suggested alterations on works, and to present works as being at a particular version – both in terms of changes and in terms of concept e.g. early days, in progress, “published”, updated. This is radically different to the paper paradigm, and is not just a digitisation attempt – it is something new […]

  • Alex Tolley said:

    Perhaps a more constructive idea for the publishers is how do they make money offering a scalable, free[?]  system to allow text mining?   If the remuneration is not directly from the enhanced access, where is it coming from?  It might be defensive – more subscriptions versus competitors, especially in a saturated market, but perhaps more creative ways could be found to create such an incentive.

    The online newspaper model experience must look decidedly unpalatable, but it is being driven in that direction from amateur sources of information about “news”.   What are the scientific publishers thinking, apart from manning the battlements?

  • Cameron Neylon said:

    Oh it’s very simple. Offer Service Level Agreement access to the systems with rapid updates, notifications, and other goodies that don’t really matter to the average user, at least not at that level, and charge for those. Create an entire ecosystem of businesses that feed off your information and combine it with others to create fully aggregated sets – compete with other publishers to push your content out as effectively as possible because that’s the service you are offering to authors. Plenty of opportunities to make money here – just none of them involve selling subscriptions to journals, or reducing access – they involve increasing access and bandwidth.

  • Pete Carroll said:

    Interesting quote from Martin Brassell post today (Creative Industries KTN Intellectual Property and Open Source official group)

    “The paper also states that the Government is open to proposals on alternative solutions that would overcome the problem of seeking multiple permissions for analytics and data mining. This could involve the proposed Digital Copyright Exchange, currently subject to a feasibility study, and the separate change being proposed to introduce extended collective licensing”

    The background to this is sections 7.93 and 7.94 of the IPO copyright consultation document http://www.ipo.gov.uk/consult-2011-copyright.pdf

    “7.93 However, under current conditions, in some cases research projects could require specific permissions from a very large number (potentially hundreds) of publishers in order to proceed. The current requirement for specific permissions from each publisher may be an insurmountable obstacle, preventing some research from taking place at all.
    7.94 The Government is not aware that publishers currently offer a collective solution that overcomes this difficulty. Therefore the current arrangements for using analytic technologies may well not be the best way of serving the overall public good, and the overall public benefit of a text and data mining exception appears to outweigh the harm to the licensing market. However, the Government will be very interested to hear of any alternative solutions which solve this “hold-up” problem of multiple permissions”

    The subtext here appears to be: hey publishers, if you can’t get your act together on licensing, we’ll judge that an exception is worthwhile. The Digital Copyright Exchange is not explicitly mentioned but it would be logical. It would be quite a coup for the new DCE to run a (?global) ECL for presumably both commercial and non-commercial text mining of STM material.

    Read Alicia Wise’s comment below in the light of the above:

    “What the digital age needs is better, automated infrastructure to handle all the messy business of rights and permissions online.  This should be seamless and invisible to users and simply work well” 

    and I think you can see where Reed Elsevier may be now coming from wrt to the copyright consultation. This is quite a change from their line in their submission to parliamentary BIS select committee last autumn. http://www.publications.parliament.uk/pa/cm201012/cmselect/cmbis/writev/1498/m57.htm

    which was basically, oooh the overload, oh the pirates, don’t let go of nurse for fear of finding something worse – and didn’t mention “one click” collective licensing etc at all.

    I wonder if there have been some little chats going on? Quite how the open access community should respond to a Digital Copyright Exchange solution I am not sure. Don’t forget consultation closes this Wed. 21st March

    pete carroll

  • Carl Boettiger said:

    Hi Cameron,

    Just finished watching the panel discussion.  Issues of text-mining aside, I was struck most by the comments from the Nature rep on the economics of open access publishing, also reflected in the recent Science piece (http://www.sciencemag.org/content/335/6074/1279.full).  Nature editor Alison Mitchell mentioned that an author-pays model for Nature would mean a fee of $10,000 – $30,000, and that few authors would pay such a cost.

    Actually I was fine with the first statement by itself – but the second statement seems to say, in the publishers’ own assessment, that the cost of their product exceeds the added value to the author.  The publishers seem to be suggesting that their expenses are “fixed costs.”  In a market economy, if my costs for making a product exceed what customers are willing to pay for the product, I’d be run out of business by competitors whose prices match demand.

    Any publisher who says that an author-pays model would be too expensive to work is saying that their business model is inefficient — that their product costs more than it’s worth.  In a gold open access system, these inefficiencies would be eliminated and publishing would become both cheaper overall and publication costs would be forced to reflect added value.  Only by putting costs on readers (as well as authors) and then by bundling subscriptions together have publishers avoided market pressure to be efficient. 

    The mere existence of long-lived gold open access journals demonstrates that added value can match production costs.  The fact that these journals have different fees probably reflects their added value.  Indeed, I believe that the publication fee should be a more accurate indicator of the “added value” of a journal than “impact factor”; it works for cars.

    Certainly Nature or Science can claim to add more value, and charge higher publication fees.  But when they say publication costs would be so high that no one would submit, it sounds like an honest admission that they are overrated.

    I’m no economist — did I miss something in my argument?

  • Anna Sharman said:

    I’ve blogged about one possible solution to this issue: submission fees. For journals with high rejection rates, which are the ones whose publication fees would have to be large, submission fees make a lot of sense. They spread the costs between more people and are closer to being proportional to the cost of peer review. http://sharmanedit.wordpress.com/2012/03/21/submission-fees/

  • Carl Boettiger said:

    Interesting idea.  Certainly higher rejection rates raise operating costs, while potentially increasing the “added value” of the journal.  This increased value could justify higher author fees, whether at submission time or publication time.

    I don’t think it actually matters much *when* that fee is applied: that’s just a question of whether the risk of not being published is passed to the author or the journal, and would be expected to come out in the price.  (If the journal adds the extra fee at submission time, the author assumes the risk of it not being published, and will therefore be willing to pay less than if the fee is applied as part of the publication cost.)  Therefore, I think it’s actually irrelevant when the fee is applied.

    Either way, the market will force the fees to reflect the added value of the journal (including the risk if that fee is paid upon submission, the added value of selectivity, the name recognition, the peer review, the editorial board, etc). 

    Perhaps it is more interesting to ask where the equilibrium will fall between higher rejection rates and higher added value.  Now that we can measure added value in terms of real dollars (author’s willingness to pay the publication fee) instead of impact factor, etc, we could start to learn the dollar value of selectivity — how much more are authors willing to pay to publish in a journal with high selectivity vs one that is less selective?

  • Cameron Neylon said:

    Carl, you might also like this piece I wrote for the LSE Impact Blog:


  • Cameron Neylon said:

    Yes, I’ve been saying this for a while but it doesn’t seem to dawn on people what it means – despite everything else, even Nature think that authors don’t get enough out of their product to actually pay for it…this is really why I am in favour of authors dealing with costs – it will see a reordering of the market that will at least push us in the right general direction…
    See also: http://blogs.lse.ac.uk/impactofsocialsciences/2012/03/01/real-cost-overpaying-journals/

  • Cameron Neylon said:

    I agree, submission fees have always been the obvious route for those journals. But they’re too scared to even try it basically. Someone needs to jump first and they’re all nervously eyeing each other. I suspect eLife may well be the first to go down that route because they can, being new.