
It’s not information overload, nor is it filter failure: It’s a discovery deficit

8 July 2010

Clay Shirky’s famous soundbite has helped to focus minds on the way information on the web needs to be tackled, and on a move towards managing the process of selecting and prioritising information. But in the research space I’m getting a sense that it is fuelling a focus on preventing publication in a way that is analogous to the conventional filtering process involved in peer reviewed publication.

Most recently this surfaced at the Chronicle of Higher Education, to which there were many responses, Derek Lowe’s being one of the most thought out. But this is not an isolated case.

@JISC_RSC_YH: How can we provide access to online resources and maintain quality of content?  #rscrc10 [twitter via@branwenhide]

Me: @branwenhide @JISC_RSC_YH isn’t the point of the web that we can decouple the issues of access and quality from each other? [twitter]

There is a widely held assumption that putting more research onto the web makes it harder to find the research you are looking for. In fact the opposite is true: publishing more makes discovery easier.

The great strength of the web is that you can allow publication of anything at very low marginal cost without limiting the ability of people to find what they are interested in, at least in principle. Discovery mechanisms are good enough, while being a long way from perfect, to make it possible to mostly find what you’re looking for while avoiding what you’re not looking for.  Search acts as a remarkable filter over the whole web through making discovery possible for large classes of problem. And high quality search algorithms depend on having a lot of data.

It is very easy to say there is too much academic literature – and I do. But the solution which seems to be becoming popular is to argue for an expansion of the traditional peer review process. To prevent stuff getting onto the web in the first place. This is misguided for two important reasons. Firstly it takes the highly inefficient and expensive process of manual curation and attempts to apply it to every piece of research output created. This doesn’t work today and won’t scale as the diversity and sheer number of research outputs increase tomorrow. Secondly it doesn’t take advantage of the nature of the web. The way to do this efficiently is to publish everything at the lowest cost possible, and then enhance the discoverability of work that you think is important. We don’t need publication filters, we need enhanced discovery engines. Publishing is cheap, curation is expensive whether it is applied to filtering or to markup and search enhancement.

Filtering before publication worked and was probably the most efficient place to apply the curation effort when the major bottleneck was publication. Value was extracted from the curation process of peer review by using it to reduce the costs of layout, editing, and printing through simply printing less. But it created new costs, and invisible opportunity costs where a key piece of information was not made available. Today the major bottleneck is discovery. Of the 500 papers a week I could read, which ones should I read, and which ones just contain a single nugget of information which is all I need? In the Research Information Network study of costs of scholarly communication the largest component of the publication creation and use cycle was peer review, followed by the cost of finding the articles to read which represented some 30% of total costs. On the web, the place to put in the curation effort is in enhancing discoverability, in providing me the tools that will identify what I need to read in detail, what I just need to scrape for data, and what I need to bookmark for my methods folder.

The problem we have in scholarly publishing is an insistence on applying this print paradigm publication filtering to the web alongside an unhealthy obsession with a publication form, the paper, which is almost designed to make discovery difficult. If I want to understand the whole argument of a paper I need to read it. But if I just want one figure, one number, the details of the methodology then I don’t need to read it, but I still need to be able to find it, and to do so efficiently, and at the right time.

Currently scholarly publishers vie for the position of biggest barrier to communication. The stronger the filter the higher the notional quality. But being a pure filter play doesn’t add value because the costs of publication are now low. The value lies in presenting, enhancing, curating the material that is published. If publishers instead vied to identify, mark up, and make it easy for the right people to find the right information they would be working with the natural flow of the web. Make it easy for me to find the piece of information, feature work that is particularly interesting or important, re-interpret it so I can understand it coming from a different field, preserve it so that when a technique becomes useful in 20 years the right people can find it. The brand differentiator then becomes which articles you choose to enhance, what kind of markup you do, and how well you do it.

All of these are things that publishers already do. And they are services that authors and readers will be willing to pay for. But at the moment the whole business and marketing model is built around filtering, and selling that filter. By impressing people with how much you are throwing away. Trying to stop stuff getting onto the web is futile, inefficient, and expensive. Saving people time and money by helping them find stuff on the web is an established and successful business model both at scale, and in niche areas. Providing credible and respected quality measures is a viable business model.

We don’t need more filters or better filters in scholarly communications – we don’t need to block publication at all. Ever. What we need are tools for curation and annotation and re-integration of what is published. And a framework that enables discovery of the right thing at the right time. And the data that will help us to build these. The more data, the more research published, the better. Which is actually what Shirky was saying all along…


  • http://friendfeed.com/mrgunn Mr. Gunn

    What I read Michael Nielsen to be saying isn’t anything so much to do with semantics, RDFa, or any of those wonderful things under development, but rather integrating with the web 1.0. Links in a published research paper almost always go to somewhere else within that paper or on the publisher’s site, and rarely to another paper or another website. Even citation links go to the bibliography entry at the bottom of the paper instead of the actual paper being cited. Publishers have had almost 2 decades to get that right, so any argument saying essentially "it’s hard but we want to do it and we’re making progress" needs to be backed up a little better.

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/eabrown25 Elizabeth Brown

    I agree Mr Gunn that the publishers keep the publication links sequestered. I would explain this by history – traditionally other outlets (like WoS, etc.) added this info. Even today I don’t think they feel it’s worth their time to do it, even if there’s promises to do so. The discipline-based abstract sources added interlinked citations earlier – I recall even these vendors took several years to add them after being on the web.

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/danielmietchen Daniel Mietchen

    And now in chorus: "We don’t need [pre-]publication filters, we need enhanced discovery engines."

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/brembs Björn Brembs

    @Mr. Gunn: see slides 10/11 on http://www.slideshare.net/brembs/whats-wrong-with-scholarly-publishing-today-ii :-)

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/cameronneylon Cameron Neylon

    @Euan, @Jill – yes agree things are changing and the next twelve months are going to be very interesting. But just to be a little more precise in response to Euan, I didn’t mean necessarily that we needed semantic integration (although I believe that will be the ultimate route) but exposure of elements that support the state of the art in internet search – I used "rich snippets" advisedly, not because I like that approach particularly but because it implies working to enhance the ability of search engines to really dig into the content in a rich and faceted way. This is Search Engine Optimization in a true sense – working to optimize the ability of people to find what they are looking for via third party providers.

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/cameronneylon Cameron Neylon

    On the open data front I would disagree. Signing contracts isn’t just a PITA, it simply doesn’t scale to web scale effectively unless they are open and presumptive contracts (actually contracts just scare me in a federated world – I’m not allowed to sign contracts relating to work stuff because I’m not competent – and every time contracts people get involved the amount of time to get things sorted is enormous. I can’t see that working at scale). But then I would say that ;-)

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/egonw Egon Willighagen

    @Cameron: Very much agreed! This PITA is a serious threat to civilization… I wonder how much money is currently lost on the legal department because of needless licensing, acquisition and defending of patents, … instead of actual service providing …

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/tillje Jim Till

    Disclosure: a link to this FF thread has been included in my blog post: "Finding influential OATP news items" (July 25, 2010), http://tillje.wordpress.com/2010/07/25/finding-influential-oatp-news-items/

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/danielmietchen Daniel Mietchen

    I’d like to see more of such backtracking to ff threads (ideally in some automated manner), but why label it "disclosure"?

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/tillje Jim Till

    Because there are no automated trackbacks, I chose to add one manually. Perhaps I should have used "Trackback" instead of "Disclosure"? (Use of "Disclosure" was probably influenced by past experience with "Internet research ethics" – do all those who have contributed comments to this thread regard FF as a "public forum"? If so, no need to use the word "Disclosure").

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/cameronneylon Cameron Neylon

    Trackback might have made some more sense to me – I did wonder. My personal view is that this is public but we’ve had discussions in the past about linking in and it does upset some people who feel this is "semi-private". I don’t think it’s an issue for this thread certainly. In any case, thanks for the link and the thoughtful blog post!

    This comment was originally posted on FriendFeed

  • Pingback: Twenty Million Papers in PubMed: A Triumph or a Tragedy? « O'Really?

  • Pingback: Bibliotheken en het online leven in Juli 2010 « Dee'tjes