They. Just. Don’t. Get. It…

[Image: Traffic jam in Delhi, via Wikipedia]

…although some are perhaps starting to see the problems that are going to arise.

Last week I spoke at a Question Time-style event held at Oxford University, organised by Simon Benjamin and Victoria Watson, called “The Scientific Evolution: Open Science and the Future of Publishing”. The panel featured Tim Gowers (Cambridge), Victor Henning (Mendeley), Alison Mitchell (Nature Publishing Group), Alicia Wise (Elsevier), and Robert Winston (mainly in his role as TV talking head on science issues). You can get a feel for the proceedings from Lucy Pratt’s summary, but I want to focus on one specific issue.

As is common for me recently, I emphasised that networked research communication needs to be different to what we are used to. I drew a comparison with the early days of the printing press, when one of the first things that happened was that people created facsimiles of handwritten manuscripts. It took hundreds of years for someone to come up with the idea of a newspaper, and to some extent our current use of the network is exactly that – digital facsimiles of paper objects, not truly networked communication.

It’s difficult to predict exactly what form a real networked communication system will take – much as asking a 16th-century printer how newspaper advertising would work would not have produced a detailed and accurate answer – but some principles of successful network systems are emerging. Effective network systems distribute control and avoid centralisation; they are loosely coupled and distributed. That is very different to the centralised systems of access control we have today.

This is a difficult concept, and one that scholarly publishers, for the most part, simply don’t get. This is not particularly surprising, because truly disruptive innovation rarely comes from incumbent players. Large and entrenched organisations don’t generally enable the kind of thinking required to see the new possibilities. You can see this in publishers’ statements that they are providing “more access than ever before” via “more routes” – but all of those routes are under tight centralised control, with control systems that don’t scale. By insisting on centralised control over access, publishers are setting themselves up to fail.

Nowhere is this going to play out more starkly than in the area of text mining. Bob Campbell from Wiley-Blackwell walked into this – though few noticed it – with the now familiar claim that “text mining is not a problem because people can ask permission”. Centralised control, failure to appreciate scale, and failure to understand the necessity of distribution and distributed systems. I have with me a device capable of holding the text of perhaps 100,000 papers. It also has the processor power to mine that text. It is my phone. In two or three years our phones, hell our watches, will have the capacity not only to hold the world’s literature but to mine it, in context, for what I want right now. Is Bob Campbell ready for every researcher, indeed every interested person in the world, to come into his office and discuss an agreement for text mining? Because the mining I want to do and the mining that Peter Murray-Rust wants to do will be different, and what I will want to do tomorrow is different to what I want to do today. This kind of personalised mining is going to be the accepted norm for handling information online very soon, and it will be at the very centre of how we discover the information we need. Google will provide a high-quality service for free; subscription-based scholarly publishers will charge an arm and a leg for a deeply inferior one – because Google is built to exploit network scale.
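To make that concrete, here is a minimal sketch of what personalised, local mining could look like, assuming nothing more than a folder of full-text papers on the device. The directory name and the crude relevance score are my own illustrative assumptions, not any publisher’s API:

```python
# Minimal sketch: personalised text mining over a local corpus.
# Assumes a hypothetical folder of plain-text papers ("papers/");
# the scoring is deliberately crude and purely illustrative.
import os
import re
from collections import Counter

CORPUS_DIR = "papers"  # hypothetical local store of full texts

def mine(query_terms, top_n=10):
    """Rank locally stored papers by how often they mention the query terms."""
    scores = Counter()
    for fname in os.listdir(CORPUS_DIR):
        if not fname.endswith(".txt"):
            continue
        with open(os.path.join(CORPUS_DIR, fname), encoding="utf-8") as f:
            text = f.read().lower()
        # Crude relevance: total occurrences of all query terms.
        scores[fname] = sum(len(re.findall(re.escape(term.lower()), text))
                            for term in query_terms)
    return [paper for paper, score in scores.most_common(top_n) if score > 0]

# Today's question; tomorrow's will be a different one entirely.
print(mine(["mono-olein", "esterification", "yield"]))
```

The point is not these few lines of code; it is that nothing here requires a conversation with a publisher – only access to the text.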

The problem of scale has, in fact, just played out. Heather Piwowar, writing yesterday, describes a call with six Elsevier staffers to discuss her project and her needs for text mining. Heather, of course, now has to have this same conversation with Wiley, NPG, ACS, and all the other subscription-based publishers, who will no doubt demand different conditions, creating a nightmare patchwork of different levels of access to different parts of the corpus. But the bit I want to draw out is at the bottom of the post, where Heather describes the concerns of Alicia Wise:

At the end of the call, I stated that I’d like to blog the call… it was quickly agreed that was fine. Alicia mentioned her only hesitation was that she might be overwhelmed by requests from others who also want text mining access. Reasonable.

Except that it isn’t. It’s perfectly reasonable for every single person who wants to text mine to want a conversation about access. Elsevier, because they demand control, have set themselves up as the bottleneck. This is really the key point: because the subscription business model implies an imperative to extract income from every possible use of the content, it creates a need to control access for each differential use. That means each different use, and especially each new use, has to be individually negotiated, usually by humans – apparently about six of them. This will fail, because it cannot scale in the way that the demand will.

The technology exists today to make this kind of mass distributed text mining trivial. Publishers could push content to BitTorrent servers and then publish regular deltas to notify users of new content. The infrastructure for this already exists; no infrastructure investment is required. The problem publishers raise of their servers not coping is one they have created for themselves. The catch is that distributed systems can’t be controlled from the centre, and giving up control requires a different business model. But this is also an opportunity. Publishers would also save money if they gave up control – no more need for six people to sit in on each of hundreds of thousands of meetings. I often wonder how much lower subscriptions would be if they didn’t need to cover the cost of access control, sales, and legal teams.
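As a sketch of how simple the plumbing could be, suppose a publisher exposed a dated delta manifest listing new articles alongside BitTorrent magnet links. The feed URL and the manifest’s field names below are hypothetical assumptions for illustration; the point is that a client needs nothing beyond polling and standard peer-to-peer tooling:

```python
# Sketch of a delta-plus-BitTorrent distribution scheme. The feed URL and
# the manifest's field names are hypothetical, purely for illustration.
import json
import urllib.request

FEED_URL = "https://publisher.example.org/deltas/latest.json"  # hypothetical

def new_articles(last_seen_date):
    """Fetch the delta manifest and return entries newer than last_seen_date."""
    with urllib.request.urlopen(FEED_URL) as resp:
        manifest = json.load(resp)
    return [entry for entry in manifest["articles"]
            if entry["date"] > last_seen_date]

for entry in new_articles("2012-01-01"):
    # Hand the magnet link to any BitTorrent client; the bulk transfer then
    # happens peer-to-peer and never touches the publisher's servers.
    print(entry["doi"], entry["magnet"])
```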

We are increasingly going to see these kinds of failures: legal and technical incompatibility of resources, contractual requirements at odds with local legal systems, and above all the claim that “you can just ask for permission” without the backing of the hundreds or thousands of people that would be required to provide a timely answer. And that’s before we deal with the fact that the most common answer will be “mumble”. A centralised access control system is simply not fit for purpose in a networked world. As demand scales, people making legitimate requests for access will have the effect of a distributed denial of service attack. The clue is in the name: the demand is distributed. If the access control mechanisms are manual, human, and centralised, they will fail. But if that’s what it takes to get subscription publishers to wake up to the fact that the networked world is different, then so be it.


An Open Letter to David Willetts: A bold step towards opening British research

[Image: Open Access logo, via Wikipedia]

On the 8th of December David Willetts, the Minister of State for Universities and Science, announced new UK government strategies to develop innovation and research to support growth. The whole document is available online, and you can see more analysis at the links at the bottom of the post. A key aspect for Open Access advocates was the section discussing a wholesale move by the UK to an author-pays system for freely accessible research literature, with SCOAP3 raised as a possible model. The report refers not to Open Access but to freely accessible content, and I think this misses a massive opportunity for Britain to take a serious lead in defining the future direction of scholarly communication. That’s the case I attempt to lay out in this open letter. This post should be read in the context of my usual disclaimer.

Minister of State for Universities and Science

Department for Business, Innovation and Skills

Dear Mr Willetts,

I am writing in the first instance to congratulate you on your stance on developing routes to freely accessible research outputs. I cannot say I am a great fan of many current government positions, and I might have wished for greater protection of the UK science budget, but in times of resource constraint for research I believe your focus on ensuring the efficiency of access to, and exploitation of, research outputs in the widest sense is the right one.

The position you have articulated offers a real opportunity for the UK to take a lead in this area. But along with the opportunities there are risks, and those risks could entrench the existing inefficiencies of our scholarly communication system. They could also reduce the value for money that the public purse – and it will be the public purse, one way or another – gets for its investment. In our current circumstances this would be unfortunate. I would therefore ask you to consider the following as the implementation pathway for this policy is developed.

Firstly, the research community will be buying a service. This is a significant change from the current system, in which the community buys a product: the published journal. The purchasing exercise should be seen in this light, and best practice in service procurement applied.

Secondly, the nature of this service must be made clear. The service being purchased must provide for any and all downstream uses, including commercial use, text mining, indeed any use that might be developed at some point in the future. We are paying for this service and we must dictate its terms. Incumbent publishers will say in response that they need to retain commercial rights, or text-mining rights, to ensure their viability, as indeed they have done in response to the Hargreaves Review.

This, not to put too fine a point on it, is hogwash. PLoS and BioMed Central both operate financially viable businesses in which no downstream rights beyond appropriate attribution are retained by the publisher, and their author charges are lower than many of the notionally equivalent, but actually far more limited, offerings of more traditional publishers. High-quality scholarly communication can be supported by reasonable author charges without any need for publishers to retain rights beyond those protected by their trademarks. An effective marketplace could therefore be expected to bring down the average cost of this form of scholarly communication.

The reason for supporting a system that demands that any downstream use of the communication be enabled is that we need innovation and development within the publishing system as well as innovation and development arising from its content. Our scholarship is currently being held back by a morass of retained rights that prevents the development of research projects, of new technology startups, and potentially of new industries. The government consultation document of 14 December on the Hargreaves report explicitly notes that enabling downstream uses of content, and scholarly content in particular, can support new economic activity. It can also support new scholarly activity. The exploitation of our research outputs requires new approaches to indexing, mining, and parsing the literature. The shame of our current system is that much of this is possible today: the technology exists but is prevented from being exploited at scale by the logistical impossibility of clearing the required rights. These new approaches will require money, and it is entirely appropriate, indeed desirable, that some of this work therefore occurs in the private sector. Experimentation will require freedom to act as well as freedom to develop new business models. Our content, its accessibility, and its reusability must support this.

Finally, I ask you to look beyond the traditional scholarly publishing industry to the range of experimentation occurring globally in academic spaces, non-profits, and commercial endeavours. The potential leaps in functionality, as well as the potential cost reductions, are enormous. We need to encourage this experimentation and develop a diverse and vibrant market that both provides the quality assurance and stability we are used to and encourages technical experimentation and the improvement of business models. What we don’t need is a five- or ten-year deal that cements in existing players, systems, and practices.

Your government’s philosophy is based around the effectiveness of markets. The recent history of major government procurement exercises is not a glorious one; this is one we should work to get right. We should take our time to do so and ensure a deal that delivers on its promise. The vision of a Britain led by innovation and development, supported by a vibrant and globally leading research community, is, I believe, the right one. Please ensure that this innovation isn’t cut off at the knees by agreeing terms that prevent our research communication tools from being re-used to improve the effectiveness of that communication. And please ensure that the process of procuring these services is one that supports innovation and development in scholarly communication itself.

Yours truly,

Cameron Neylon


How to waste public money in one easy step…

[Image: Oleic acid, via Wikipedia]

Peter Murray-Rust has sparked off another round in the discussion of the value that publishers bring to the scholarly communication game, telling a particular story of woe and pain inflicted by the incumbent publishers. On the day he posted it I had my own experience of just how inefficient and ineffective our communication systems are, wasting the better part of a day trying to find some information. I thought it might be fun to encourage people to post their own stories of problems and frustrations with access to the literature and the downstream issues that creates, so here is mine.

I am by no means a skilled organic chemist, but I’ve done a bit of synthesis in my time and I certainly know enough to read synthetic chemistry papers and decide whether a particular synthesis is accessible. On this particular day I was interested in deciding whether it was easy or difficult to make deuterated mono-olein. This molecule can be made by connecting glycerol to oleic acid. Glycerol is cheap, and I should have some deuterated oleic acid in my hands in the next month or so. The chemistry for connecting acids to alcohols is straightforward – I’ve even done it myself – but this is a slightly special case. Firstly, the standard methods tend to be wasteful of the acid, which in my case is the expensive bit. Secondly, glycerol has three alcohol groups. I only want to modify one, leaving the other two unchanged, so it is important to find a method that gives me mostly what I want and only a little of what I don’t.

So the question for me is: is there a high-yielding reaction that will give me mostly what I want while wasting as little as possible of the oleic acid? And if there is a good technique, is it accessible given the equipment I have in the lab? Simple question; a quick trip to Google Scholar turned up reams of likely looking papers, not one of which I had full-text access to. The abstracts are nearly useless in this case because I need the details of yields and methodology, so I had several hundred papers and no means of figuring out which might be worth an inter-library loan. I spent hours trying to parse the abstracts to figure out which were the most promising, and in the end I broke… I asked someone to email me a couple of PDFs because I knew they had access. Bear in mind that what I wanted was to spend a quick 30 minutes or so deciding whether this was worth pursuing in detail. What it took was about three hours, which at the full economic cost of my time comes to about £250. That’s about £200 of UK taxpayers’ money down the toilet because, on the site of the UK’s premier physical and biological research facilities, I don’t have access to those papers. Yes, I could have asked someone else to look, but that would have taken up their time.

But you know what’s really infuriating? I shouldn’t even have been looking at the papers at all during my initial search. What I should have been able to do was ask the question:

Show me all syntheses of mono-olein ranked first by purity of the product and secondly by the yield with respect to oleic acid.

There should be a database where I can get this information. In fact there is, but we can’t afford access to the ACS’s information services here. These are incredibly expensive because it used to be necessary for this information to be culled from papers by hand. Today that’s no longer necessary: it could be done cheaply and rapidly. In fact I’ve seen it done cheaply and rapidly, by tools developed in Peter’s group that achieve around 95% accuracy and 80% recall over synthetic organic chemistry. Those are hit rates that would have solved my problem easily and effectively.
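For concreteness, the question I wanted to ask is essentially a one-line query over the kind of database those tools could build. The database file, table, and column names below are my own hypothetical schema, sketched with SQLite:

```python
# Sketch: the mono-olein question as a query over a hypothetical database
# of syntheses mined from the literature. Schema and file name are assumed.
import sqlite3

conn = sqlite3.connect("mined_reactions.db")  # hypothetical mined database
rows = conn.execute(
    """
    SELECT doi, method, product_purity, yield_vs_oleic_acid
    FROM syntheses
    WHERE product = 'mono-olein'
    ORDER BY product_purity DESC,     -- first by purity of the product
             yield_vs_oleic_acid DESC -- then by yield w.r.t. oleic acid
    """
).fetchall()

for doi, method, purity, yld in rows:
    print(f"{doi}: {method} (purity {purity}%, yield {yld}%)")
```

Thirty seconds of query time instead of three hours of abstract-parsing.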

Unfortunately, despite the fact that those tools exist, that they could be deployed easily and cheaply, and that they could save researchers vast amounts of time, research is being held back by a lack of access to the literature – and, where there is access, by contracts that prevent us collating, aggregating, and analysing our own work. The public pays for the research to be done, the public pays for researchers to be able to read it, and in most cases the public has to pay again should they want to read it themselves. But what is most infuriating is the way the public pays yet again when I and a million other scientists waste our time, the public’s time, because tools that exist and work cannot be deployed.

How many researchers in the UK, or worldwide, are losing hours or even days every week because of these inefficiencies? How many new tools or techniques are never developed because they can’t legally be deployed? And how many hundreds of millions of dollars of public money does that add up to?
