Reflections on Science 2.0 from a distance – Part I

Some months ago now I gave a talk at a very exciting symposium organized by Greg Wilson as a closer for the Software Carpentry course he was running at the University of Toronto. It was exciting because of the lineup, but also because it represented a real coming together of views on how developments in computer science and infrastructure, as well as new social capabilities brought about by computer networks, are changing scientific research. I talked, as I have several times recently, about the idea of a web-native laboratory record, thinking about what the paper notebook would look like if it were re-invented with today’s technology. Jon Udell gave a two-tweet summary of my talk which I think captured the two key aspects of my viewpoint perfectly. In this post I want to explore the first of these.

@cameronneylon: “The minimal publishable unit of science — the paper — is too big, too monolithic. The useful unit: a blog post.”#osci20

The key to the semantic web, linked open data, and indeed the web and the internet in general, is the ability to address objects. URLs in and of themselves provide an amazing resource, making it possible to identify and relate digital objects and resources. The “web of things” expands this idea to include addresses that identify physical objects. In science we aim to connect physical objects in the real world (samples, instruments) to data (digital objects) via concepts and models. All of these can be made addressable at any level of granularity we choose. But the level of detail is important. From a practical perspective too much detail means that the researcher won’t, or even can’t, record it properly. Too little detail and the objects aren’t flexible enough to allow re-wiring when we discover we’ve got something wrong.

A single sample deserves an identity. A single data file requires an identity, although it may be wrapped up within a larger object. The challenge comes when we look at process, descriptions of methodology, and claims. A traditionally published paper is too big an object, something shown clearly by how imprecise citations to papers are. A paper will generally contain multiple claims and multiple processes, and a citation could refer to any of them. At the other end I have argued that a tweet, 140 characters, is too small, because while you can make a statement it is difficult to provide context in the space available. To be a unit of science a tweet really needs to contain a statement and two references or citations, providing the relationship between two objects. It can be done but it’s just a bit too tight in my view.
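To make that concrete, here is a rough sketch (in Python, using the rdflib library) of what a minimal “statement plus two references” might look like as an addressable triple. The URLs and the vocabulary term are invented purely for illustration.

```python
# A minimal sketch: one scientific "statement" expressed as a triple
# linking two addressable objects. The URLs and the predicate are
# hypothetical, purely for illustration.
from rdflib import Graph, URIRef, Namespace

EX = Namespace("http://example.org/vocab/")  # made-up vocabulary

sample = URIRef("http://example.org/lab/sample/42")            # physical object
datafile = URIRef("http://example.org/lab/data/run-17.csv")    # digital object

g = Graph()
# The statement: "this data file was measured from this sample"
g.add((datafile, EX.measuredFrom, sample))

for s, p, o in g:
    print(s, p, o)
```

The point is not the particular syntax, but that the statement and both of the things it relates each have an address that anyone can point at.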

So I proposed that the natural unit of science research is the blog post. There are many reasons for this. Firstly the length is elastic, accommodating something (nearly) as short as a tweet, up to thousands of lines of data, code, or script. But equally there is a broad convention of approximate length, ranging from a few hundred to a few thousand words, about the length in fact of a single lab notebook page, and about the length of a simple procedure. The second key aspect of a blog post is that it natively comes with a unique URL. The blog post is a first class object on the web, something that can be pointed at, scraped, and indexed. And crucially the blog post comes with a feed, and a feed that can contain rich and flexible metadata, again in agreed and accessible formats.
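As a rough illustration of what the feed gives you for free, here is a small sketch using the Python feedparser library to pull the title, permanent URL, and tags out of a blog’s feed. The feed URL is a placeholder; any Atom or RSS feed would do.

```python
# Sketch: reading the machine-readable metadata a blog feed exposes for free.
# The feed URL is a placeholder for illustration.
import feedparser

feed = feedparser.parse("http://example.org/lab-notebook/feed/")

for entry in feed.entries:
    print(entry.title)               # human-readable title
    print(entry.link)                # the unique, citable URL of the post
    print(entry.get("updated", ""))  # when it was last changed, if given
    # tags/categories, if the blog provides them
    for tag in entry.get("tags", []):
        print("  tag:", tag.term)
```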

If we are to embrace the power of the web to transform the laboratory and scientific record then we need to think carefully about what the atomic components of that record are. Get this wrong and we make a record which is inaccessible, and which doesn’t take advantage of the advanced tooling that the consumer web now provides. Get it right and the ability to Google for scientific facts will come for free. And that would just be the beginning.

If you would like to read more about these ideas I have a paper just out in the BMC Journal Automated Experimentation.

A question of trust

I have long been sceptical of the costs and value delivered by our traditional methods of peer review. This is really on two fronts: firstly that the costs, where they have been estimated, are extremely high, representing a multi-billion dollar subsidy by governments of the scholarly publishing industry. Secondly, the value that is delivered through peer review, the critical analysis of claims and informed opinion on the quality of the experiments, is largely lost. At best it is wrapped up in the final version of the paper. At worst it is simply completely lost to the final end user. A part of this, which the more I think about it the more I find bizarre, is that the whole process is carried on under a shroud of secrecy. This means that, as an end user, I do not know who the peer reviewers are, and do not necessarily know what process has been followed or even the basis of the editorial decision to publish. As a result I have no means of assessing the quality of peer review for any given journal, let alone any specific paper.

Those of us who see this as a problem have a responsibility to provide credible and workable alternatives to traditional peer review. So far, despite many ideas, we haven’t, to be honest, had very much success. Post-publication commenting, open peer review, and Digg-like voting mechanisms have been explored but have yet to have any large success in scholarly publishing. PLoS is leading the charge on presenting article-level metrics for all of its papers, but these remain papers that have also been through a traditional peer review process. Very little has yet been seen that is both radical, in the decision to publish and the means of publishing, and successful in getting traction amongst scientists.

Out on the real web it has taken non-academics to demonstrate the truly radical when it comes to publication. Whatever you may think of the accuracy of Wikipedia in your specific area, and I know it has some weaknesses in several of mine, it is the first location that most people find, and the first location that most people look for, when searching for factual information on the web. Roderic Page put up some interesting statistics when he looked this week at the top Google hits for over 5,000 mammal names. Wikipedia took the top spot 48% of the time and was in the top ten in virtually every case (97%). If you want to place factual information on the web, Wikipedia should be your first port of call. Anything else is largely a waste of your time and effort. This doesn’t, incidentally, mean that other sources are not worthwhile or don’t have a place, but that we need to work with the assumption that people’s first landing point will be Wikipedia.

“But”, I hear you say, “how do we know whether we can trust a given Wikipedia article, or specific statements in it?”

The traditional answer has been to say you need to look in the logs, check the discussion page, and click back through the profiles of the people who made specific edits. However this is inaccessible to many people, simply because they do not know how to process the information. Very few universities have an “Effective Use of Wikipedia 101” course, mostly because very few people would be able to teach it.

So I was very interested in an article on Mashable about marking up and colouring Wikipedia text according to its “trustworthiness”. Andrew Su kindly pointed me in the direction of the group doing the work and their papers and presentations. The system they are using, which can be added to any MediaWiki installation, measures two things: how long a specific piece of text has stayed in situ, and who either edited it or left it in place. People who write long-lasting edits get higher status, and this in turn promotes the text that they have “approved” by editing around but not changing.
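To illustrate the two signals described above (text longevity and editor reputation), here is a deliberately simplified toy sketch in Python. It is not the actual algorithm used by the MediaWiki trust plugin, just a way of showing how surviving text and editor karma could reinforce each other.

```python
# Toy sketch of the two signals: text that survives successive revisions
# gains trust, and editors whose text survives gain reputation.
# This is an illustration only, not the real trust-colouring algorithm.

def update_trust(revisions):
    """revisions: list of (editor, set_of_text_fragments) in time order."""
    reputation = {}   # editor -> karma score
    trust = {}        # text fragment -> trust score
    for editor, fragments in revisions:
        rep = reputation.setdefault(editor, 1.0)
        for frag in fragments:
            if frag in trust:
                # Fragment left in place: it gains trust, scaled by the
                # reputation of the editor who implicitly approved it.
                trust[frag] += min(rep, 1.0)
            else:
                # New text starts with low trust.
                trust[frag] = 0.1 * rep
        # Editors earn reputation for their text that is still present.
        for frag, score in trust.items():
            if frag in fragments:
                reputation[editor] += 0.01 * score
    return trust, reputation

history = [
    ("alice", {"The sky is blue.", "Water boils at 100 C."}),
    ("bob",   {"The sky is blue.", "Water boils at 100 C.", "Pigs can fly."}),
    ("carol", {"The sky is blue.", "Water boils at 100 C."}),
]
trust, reputation = update_trust(history)
print(trust)       # long-lived sentences end up with higher scores
print(reputation)  # editors who leave good text alone gain karma
```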

This to me is very exciting because it provides extra value and information for both users and editors without requiring anyone to do any more work than install a plugin. The editors and writers simply continue working as they have. The user can access an immediate view of the trustworthiness of the article with a high level of granularity, essentially at the level of single statements. And most importantly the editor gets a metric, a number that is consistently calculated across all editors, that they can put on a CV. Editors are peer reviewers, they are doing review, on a constantly evolving and dynamic article that can both change in response to the outside world and also be continuously improved. Not only does the Wikipedia process capture most of the valuable aspects of traditional peer review, it jettisons many of the problems. But without some sort of reward it was always going to be difficult to get professional scientists to be active editors. Trust metrics could provide that reward.

Now there are many questions to ask about the calculation of this “karma” metric. Should it be subject-specific, so we know that highly ranked editors have relevant expertise, or should it be general, so as to discourage highly ranked editors from modifying text that is outside their expertise? What should the mathematics behind it be? It will clearly take time for such metrics to be respected as a scholarly contribution, but equally I can see the ground shifting very rapidly towards a situation where a lack of engagement, a lack of interest in contributing to the publicly accessible store of knowledge, is seen as a serious negative on a CV. However this particular initiative pans out, to me this is one of the first and most natural replacements for peer review that could be effective within dynamic documents, solving most of the central problems without requiring significant additional work.

I look forward to the day when I see CVs with a Wikipedia Karma Rank on them. If you happen to be applying for a job with me in the future, consider it a worthwhile thing to include.

Where is the best place in the Research Stack for the human API?

Interesting conversation yesterday on Twitter with Evgeniy Meyke of EarthCape, prompted in part by my last post. We started talking about what a Friendfeed replacement might look like and how it might integrate more directly with scientific data. Is it possible to build something general, or will it always need to be domain specific? Might this in fact be an advantage? Evgeniy asked:

@CameronNeylon do you think that “something new” could be more vertically oriented rather then for “research community” in general?

His thinking being, as I understand it, that getting at domain-specific underlying data is always likely to take local knowledge. As he said in his next tweet:

@CameronNeylon It might be that the broader the coverage the shallower is integration with underlining research data, unless api is good

This led me to thinking about integration layers between data and people, and recalled something that I said in jest to someone some time ago:

“If you’re using a human as your API then you need to work on your user interface.”

Thinking about the way Friendfeed works, there is a real sense in which the system talks to a wide range of automated APIs, but at the core there is a human layer that first selects feeds of interest and then, when presented with other feeds, selects specific items from them. What Friendfeed does very well, in some senses, is provide a flexible API between feeds and the human brain. But Evgeniy made the point that this “works only 4 ‘discussion based’ collaboration (as in FF), not 4 e.g. collab. taxonomic research that needs specific data inegration with taxonomic databases”.

Following from this was an interesting conversation [Webcite Archived Version] about how we might best integrate the “human API” for some imaginary “Science Stream” with domain-specific machine APIs that work at the data level. In a sense this is the core problem of scientific informatics. How do you optimise the ability of machines to abstract and use data and meaning while at the same time fully exploiting the ability of the human scientist to contribute their own unique skills: pattern recognition, insight, lateral thinking? And how do you keep these in step with each other so both are optimally utilised? Thinking in computational terms about the human as a layer in the system with its own APIs could be a useful way to design systems.

Friendfeed in this view is a peer-to-peer system for pushing curated and annotated data streams. It mediates interactions with the underlying stream but also with other known and unknown users. Friendfeed seems to get three things very right: 1) optimising the interaction with the incoming data stream; 2) facilitating the curation and republication of data into a new stream for consumption by others, creating a virtuous feedback loop in fact; and 3) facilitating discovery of new peers. Friendfeed is actually a BitTorrent for sharing conversational objects.

This conversational layer, a research discourse layer if you like, is at the very top of the stack, keeping the humans at a high, abstracted level of conversation, where we are probably still at our best. And my guess is that something rather like Friendfeed is pretty good at being the next layer down, the API to feeds of interesting items. But Evgeniy’s question was more about the bottom of the stack, where the data is being generated and needs to be turned into a useful and meaningful feed, ready to be consumed. The devil is always in the details, and vertical integration is likely to help here. So what do these vertical segments look like?

In some domains these might be lab notebooks, in some they might be specific databases, or they might be a mixture of both and of other things. At the coal face it is likely to be difficult to find a way of describing the detail that is both generic enough to be comprehensible and detailed enough to be useful. The needs of the data generator are likely to be very different to those of a generic data consumer. But if there is a curation layer, perhaps human or machine mediated, that partly abstracts this, then we may be on the way to generating the generic feeds that will finally be consumed at the top layer. This curation layer would enable semantic markup, ideally automatically, require domain-specific tooling to translate from the specific to the generic, and provide a publishing mechanism. In short it sounds (again) quite a bit like Wave. Actually it might just as easily be Chem4Word or any other domain-specific semantic authoring tool, or just a translation engine that takes in detailed domain-specific information and correlates it with a wider vocabulary.
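As a very rough sketch of what such a curation layer might do, here is a toy translation step in Python that maps a hypothetical domain-specific record onto a generic, addressable feed item. The field names and vocabulary mapping are invented for illustration.

```python
# Sketch of a "curation layer": translating a domain-specific record
# (here an invented crystallography-style dict) into a generic feed item
# that an upstream aggregator could consume. Field names and the
# vocabulary mapping are assumptions for illustration only.
from datetime import datetime, timezone

GENERIC_VOCAB = {              # domain term -> generic term (assumed mapping)
    "space_group": "symmetry",
    "resolution_A": "resolution",
}

def to_feed_item(record: dict, base_url: str) -> dict:
    """Translate a domain-specific record into a generic, feed-ready item."""
    item = {
        "id": f"{base_url}/{record['local_id']}",   # make it addressable
        "title": record.get("title", "Untitled dataset"),
        "published": datetime.now(timezone.utc).isoformat(),
        "metadata": {},
    }
    for key, value in record.items():
        generic_key = GENERIC_VOCAB.get(key)
        if generic_key:
            item["metadata"][generic_key] = value
    return item

raw = {"local_id": "xtal-0042", "title": "Lysozyme dataset",
       "space_group": "P212121", "resolution_A": 1.8}
print(to_feed_item(raw, "http://example.org/lab"))
```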

One of the things that appeals to me about Wave, and Chem4Word, is that they can (or at least have the potential to) hide the complexities of the semantics within a straightforward and comprehensible authoring environment. Wave can be integrated into domain-specific systems via purpose-built Robots, making it highly extensible. Both are capable of “speaking web” and generating feeds that can be consumed and processed in other places and by other services. At the bottom layer we can chew off the problem one piece at a time, including human processing where it is appropriate and avoiding it where we can.

The middleware is, of course, as always, the problem. The middleware is agreed and standardised vocabularies and data formats. While in the past I have thought this near intractable, it actually seems as though many of the pieces are falling into place. There is still a great need for standardisation, and perhaps a need for more meta-standards, but it seems like a lot of this is in fact on the way. I’m still not convinced that we have a useful vocabulary for actually describing experiments, but enough smart people disagree with me that I’m going to shut up on that one until I’ve found the time to have a closer look at the various things out there in more detail.

These are half-baked thoughts, but I think the question of where we optimally place the human in the system is a useful one. It also hasn’t escaped my notice that I’m talking about something very similar to the architecture that Simon Coles of Amphora Research Systems always puts up in his presentations on Electronic Lab Notebooks. Fundamentally, that is because the same core drivers are there.

The trouble with business models (Facebook buys Friendfeed)

…is that someone needs to make money out of them. It was inevitable that at some point Friendfeed would take a route that led it towards mass adoption and away from the needs of the (rather small) community of researchers that have found a niche that works well for them. I had thought it more likely that Friendfeed would gradually move away from the aspects that researchers found attractive rather than being absorbed wholesale by a bigger player, but then I don’t know much about how Silicon Valley really works. It appears that Friendfeed will continue in its current form as the two companies work out how they might integrate the functionality into Facebook, but in the long term it seems unlikely that the current service will survive. In a sense the sudden break may be a good thing, because it forces some of the issues about providing this kind of research infrastructure out into the open in a way a gradual shift probably wouldn’t.

What is it about Friendfeed that makes it particularly attractive to researchers? Based more on hunches than hard data, I think a few things stand out in comparison with services like Twitter and Facebook.

  1. Conversations are about objects. At the core of the way Friendfeed works are digital objects, images, blog posts, quotes, thoughts, being pushed into a shared space. Most other services focus on the people and the connections between them. Friendfeed (at least the way I use it) is about the objects and the conversations around them.
  2. Conversation is threaded and aggregated. This is where Twitter loses out. It is almost impossible to track a specific conversation via Twitter unless you do so in real time. The threaded nature of FF makes it possible to track conversations days or months after they happen (as long as you can actually get into them).
  3. Excellent “person discovery” mechanisms. The core functionality of Friendfeed means that you discover people who “like” and comment on things that either you, or your friends like and comment on. Friendfeed remains one of the most successful services I know of at exploiting this “friend of a friend” effect in a useful way.
  4. The community. There is a specific community, with a strong information technology, information management, and bioinformatics/structural biology emphasis, that grew up and aggregated on Friendfeed. That community has immense value and it would be sad to lose it in any transition.

So what can be done? One option is to sit back and wait to be absorbed into Facebook. This seems unlikely to be either feasible or popular. Many people in the FF research community don’t want this, for reasons ranging from concerns about privacy, through the fundamentals of how Facebook works, to just not wanting to mix work and leisure contacts. All reasonable, and all things I agree with.

We could build our own. Technically feasible, but probably not financially. Let’s assume a core group of say 1,000 people (probably overoptimistic), each prepared to pay maybe $25 a year subscription as well as do some maintenance or coding work. That’s still only $25k, not enough to pay a single person to keep a service running, let alone actually build something from scratch. Might the FF team make some of the codebase open source? Obviously not what they’re taking to Facebook, but maybe an earlier version? That would help, but there would still need to be either a higher subscription or many more subscribers to keep it running, I suspect. Chalk one up for the importance of open source services, though.

Reaggregating around other services and distributing the functionality would perhaps be feasible. A combination of Google Reader and Twitter with services like Tumblr, Posterous, and StoryTlr? The community would be likely to diffuse, but such a distributed approach could be more stable and less susceptible to exactly this kind of buyout. Nonetheless these are all commercial services that can easily disappear. Google Wave has been suggested as a solution, but I think it has fundamental differences in design that make it at best a partial replacement. And it would still require a lot of work.

There is a huge opportunity for existing players in the research web space to make a play here. NPG, Research Gate, and Seed, as well as other publishers or research funders and infrastructure providers (you know who you are), could fill this gap if they had the resource to build something. Friendfeed is far from perfect: the barrier to entry is quite high for most people, and the effective usage patterns are unclear to new users. Building something that really works for researchers is a big opportunity, but it would still need a business model.

What is clear is that there is a significant community of researchers now looking for somewhere to go. People with a real critical eye for the best services and functionality, and people who may even be prepared to pay something towards it. And who will actively contribute to help guide design decisions and make it work. Build it right and we may just come.

Talking to the next generation – NESTA Crucible Workshop

Yesterday I was privileged to be invited to give a talk at the NESTA Crucible Workshop being held in Lancaster. You can find the slides on Slideshare. NESTA, the National Endowment for Science, Technology and the Arts, is an interesting organisation funded via a UK government endowment to support innovation and enterprise, and more particularly the generation of a more innovative and entrepreneurial culture in the UK. Among the programmes it runs in pursuit of this is the Crucible programme, where a small group of young researchers, generally looking for or just in their first permanent or independent positions, attend a series of workshops to get them thinking broadly about the role of their research in the wider world and to help them build new networks for support and collaboration.

My job was to talk about “Science in Society” or “Open Science”. My main theme was the question of how we justify taxpayer expenditure on research, and the fact that to me this implies an obligation to maximise the efficiency of how we do our research. Research is worth doing, but we need to think hard about how and what we do. Not surprisingly I focussed on the potential of using web-based tools and open approaches to make things happen cheaper, quicker, and more effectively; to reduce waste and try to maximise the amount of research output for the money spent.

Also not surprisingly there was significant pushback, much of it where you would expect: concerns over data theft, over how “non-traditional” contributions might appear (or not) on a CV, and over the costs in time were all mentioned. However what surprised me most was the pushback against the idea of putting material on the open web versus traditional journal formats. There was a real sense that the group had a respect for the authority of the printed, versus online, word, which really caught me out. I often use a gotcha moment in talks to try and illustrate how our knowledge framework is changed by the web. It goes: “How many people have opened a physical book for information in the last five years?”, followed by “And how many haven’t used Google in the last 24 hours?”. This is shamelessly stolen from Jamie Boyle, incidentally.

Usually you get three or four sheepish hands going up, admitting a personal love of real physical books. Generally it is around 5-10% of the audience, and this has been pretty consistent amongst mid-career scientists in both academia and industry, and people in publishing. In this audience about 75% put their hands up. Some of these were specialist “tool” books, mathematical forms, algorithmic recipes; many were specialist texts, and many referred to the use of undergraduate textbooks. Interestingly they also brought up an issue that I’ve never had an audience raise before: how do you find a good route into a new subject area that you know little about, but that you can trust?

My suspicion is that this difference comes from three places. Firstly, these researchers were already biased towards being less discipline-bound by the fact that they’d applied for the workshop. They were therefore more likely to be discipline hoppers, jumping into new fields where they had little experience and needed a route in. Secondly, they were at a stage of their career where they were starting to teach, again possibly slightly outside their core expertise, and were therefore looking for good, reliable material to base their teaching on. Finally, though, there was a strong sense of respect for the authority of the printed word. The printing of the German Wikipedia was brought up as evidence that printed matter was, at least perceived to be, more trustworthy. Writing this now I am reminded of the recent discussion of the hold that the PDF has over the imagination of researchers. There is a real sense that print remains authoritative in a way that online material is not. Even though the journal may never be printed, the PDF provides the impression that it could or should be. I would also guess that the group were young enough to be slightly less cynical about authority in general.

Food for thought, but it was certainly a lively discussion. We actually had to be dragged off to lunch because it went way over time (and not I hope just because I had too many slides!). Thanks to all involved in the workshop for such an interesting discussion and thanks also to the twitter people who replied to my request for 140 character messages. They made a great way of structuring the talk.

Euan Adie asks for help characterising PLoS comments

Euan Adie has asked for some help to do further analysis of the comments made on PLoS ONE articles. He is doing this via crowdsourcing, through a specially written app hosted on appspot, to get people to characterise all the comments in PLoS ONE. Euan is very good at putting these kinds of things together, and again this shows the power of Friendfeed as a way of getting the message out: dividing the job up into bite-sized chunks so people can help even with a little bit of time, providing the right tools, and getting them into the hands of people who care enough to dedicate that time. If anything counts as Science 2.0 then this must be pretty close.

Call for submissions for a project on The Use and Relevance of Web 2.0 Tools for Researchers

The Research Information Network has put out a call for expressions of interest in running a research project on how Web 2.0 tools are changing scientific practice. The project will be funded up to £90,000. Expressions of interest are due on Monday 3 November (yes, next week) and the projects are due to start in January. You can see the call in full here, but in outline RIN is seeking evidence on whether Web 2.0 tools are:

• making data easier to share, verify and re-use, or otherwise facilitating more open scientific practices;

• changing discovery techniques or enhancing the accessibility of research information;

• changing researchers’ publication and dissemination behaviour (for example, due to the ease of publishing work-in-progress and grey literature);

• changing practices around communicating research findings (for example through opportunities for iterative processes of feedback, pre-publishing, or post-publication peer review).

Now we as a community know that there are cases where all of these are occurring and have fairly extensively documented examples. The question is obviously one of the degree of penetration. Again we know this is small – I’m not exactly sure how you would quantify it.

My challenge to you is whether it would be possible to use the tools and community we already have in place to carry out the project. In the past we’ve talked a lot about aggregating project teams and distributed work, but the problem has always been that people don’t have the time to spare. We would need some help from social scientists on the process and design of the investigation, but with £90,000 there is easily enough money to pay people properly for their time. Indeed I know there are some people out there freelancing who are in many ways working on these issues already. So my question is: are people interested in pursuing this? And if so, what do you think your hourly rate is?

Convergent evolution of scientist behaviour on Web 2.0 sites?

A thought sparked off by a comment from Maxine Clarke at Nature Network, where she posted a link to a post by David Crotty. The thing that got me thinking was Maxine’s statement:

I would add that in my opinion Cameron’s points about FriendFeed apply also to Nature Network. I’ve seen lots of examples of highly specific questions being answered on NN in the way Cameron describes for FF…But NN and FF aren’t the same: they both have the same nice feature of discussion of a partiular question or “article at a URL somewhere”, but they differ in other ways,…[CN- my emphasis]

Alright, in isolation this doesn’t look like much (read through both David’s post and the comments, and then come back to Maxine’s), but what struck me was that on many of these sites many different communities seem to be using very different functionality to do very similar things. In Maxine’s words, “…discussion of a…particular URL somewhere…”. And that leads me to wonder the extent to which all of these sites are failing to do what it is that we actually want them to do. And the obvious follow-on question: what is it we want them to do?

There seem to be two parts to this. One, as I wrote in my response to David, is that a lot of this is about the coffee room conversation, a process of building and maintaining a social network. It happens that this network is online, which makes it tough to drop into each other’s office, but these conversational tools are the next best thing. In fact they can be better, because they let you choose when someone can drop into your office, a choice you often don’t have in the physical world. Many services (Friendfeed, Twitter, Nature Network, Facebook, or a combination) can do this quite well. Indeed the conversation spreads across many services, helping the social network (which bear in mind probably has fewer than 500 members in total) to grow, form, and strengthen the connections between people.

Great. So the social bit, the bit we have in common with the general populace, is sorted. What about the science?

I think what we want as scientists is two things. Firstly we want the right URL delivered at the right time to our inbox (I am assuming anything important is a resource on the web; this may not be true now, but give it 18 months and it will be). Secondly we want a rapid and accurate assessment of this item: its validity, its relevance, and its importance to us, judged by people we trust and respect. Traditionally this was managed by going to the library and reading the journals, and then going to the appropriate conference and talking to people. We know that the volume of material, and the speed at which we need to deal with it, is now far too great for that. Nothing new there.

My current thinking is that we are failing in building the right tools because we keep thinking of these two steps as separate, when actually combining them into one integrated process would provide efficiency gains for both phases. I need to sleep on this to get it straight in my head; there are issues of resource discovery, timeframes, and social network maintenance that are not falling into place for me at the moment, so that will be the subject of another post.

However, whether I am right or wrong in that particular line of thought, if it is true that we are reasonably consistent in what we want, then it is not surprising that we try to bend the full range of services available into achieving those goals. The interesting question is whether we can discern what the killer app would be by looking at the details of what people do with different services and where those services are failing. In a sense, if there is a single killer app for science then it should be discernible what it would do, based on what scientists try to do with different services…

How to make Connotea a killer app for scientists

So Ian Mulvaney asked, and as my solution did not fit into the margin I thought I would post it here. Following on from the two rants of a few weeks back and many discussions at Scifoo, I have been thinking about how scientists might be persuaded to make more use of social web-based tools. What does it take to get enough people involved so that the network effects become apparent? I had a discussion with Jamie Heywood of Patients Like Me at Scifoo because I was interested in why people with chronic diseases were willing to share detailed and very personal information in a forum that is essentially public. His response was that these people had an ongoing and extremely pressing need to optimise their treatment regime and lifestyle as far as possible, and that by correlating their experiences with others they got to the required answers quicker. Essentially, successful management of their lives required rapid access to high quality information, sliced and diced in a way that made sense to them, and presented in as efficient and timely a manner as possible. Which obviously left me none the wiser as to why scientists don’t get it…

Nonetheless there are some clear themes that emerge from that conversation and others looking at uptake and use of web based tools. So here are my 5 thoughts. These are framed around the idea of reference management but the principles I think are sufficiently general to apply to most web services.

  1. Any tool must fit within my existing workflows. Once adopted I may be persuaded to modify or improve my workflow, but to be adopted it has to fit to start with. For citation management this means that it must have one-click filing (ideally from any place I might find an interesting paper) but should also monitor other ways of marking papers, e.g. shared items from Google Reader, ‘liked’ items on Friendfeed, or scraped tags in del.icio.us.
  2. Any new tool must clearly outperform all the existing tools that it will replace in the relevant workflows, without requiring network or social effects. It’s got to be absolutely clear on first use that I am going to want to use this instead of e.g. Endnote. That means I absolutely have to be able to format and manage references in a word processor or publication document. Technically a nightmare I am sure (you’ve got to worry about integration with Word, Open Office, Google Docs, TeX) but an absolute necessity for widespread uptake. And this has to be absolutely clear the first time I use the system, before I have created any local social network and before you have a large enough user base for these effects to kick in.
  3. It must be near 100% reliable with near 100% uptime. Web services have a bad reputation for going down. People don’t trust their network connection and are still much happier with local applications. Don’t give them an excuse to go back to a local app because the service goes down. Addendum: make sure people can easily back up and download their stuff in a form that will be useful even if your service disappears. Obviously they’ll never need to, but it will make them feel better (and don’t scrimp on this, because they will check whether it works).
  4. Provide at least one (but not too many) really exciting new feature that makes people’s life better. This is related to #2 but is taking it a step further. Beyond just doing what I already do better I need a quick fix of something new and exciting. My wishlist for Connotea is below.
  5. Prepopulate. Build in publicly available information before the users arrive. For a publications database this is easy, and it is something that BioMedExperts got right. You have a pre-existing social network and pre-existing library information. Populate ‘ghost’ accounts with a library that includes people’s papers (it doesn’t matter if it’s not 100% accurate) and connections based on co-authorships; a rough sketch of what this could look like follows below. This will give people an idea of what the social aspect can bring and encourage them to bring more people on board.
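Here is the rough sketch promised in point 5: a few lines of Python showing how ‘ghost’ libraries and a co-authorship network could be prepopulated from an existing publication list. The input format and names are invented for illustration.

```python
# Sketch of point 5: prepopulating "ghost" accounts and a co-authorship
# network from an existing publication list. The input format and the
# example authors/DOIs are invented for illustration.
from collections import defaultdict
from itertools import combinations

publications = [
    {"doi": "10.1000/xyz1", "authors": ["A. Smith", "B. Jones"]},
    {"doi": "10.1000/xyz2", "authors": ["B. Jones", "C. Lee", "A. Smith"]},
]

libraries = defaultdict(set)   # ghost account -> papers in their library
coauthors = defaultdict(set)   # ghost account -> connected accounts

for pub in publications:
    for author in pub["authors"]:
        libraries[author].add(pub["doi"])
    for a, b in combinations(pub["authors"], 2):
        coauthors[a].add(b)
        coauthors[b].add(a)

print(dict(libraries))   # each person arrives to find their papers waiting
print(dict(coauthors))   # and a ready-made social network to build on
```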

So far that is so much motherhood and apple pie, and nothing that Ian didn’t already know (unlike some other developers who I shan’t mention). But what about those cool features? Again I would take a back-to-basics approach. What do I actually want?

Well, what I want is a service that will do three quite different things. I want it to hold a library of relevant references that I can search and use, and to use this to format and reference documents as I write them. I want it to help me manage the day-to-day process of dealing with the flood of literature that is coming in (real-time search). And I want it to help me be more effective when I am researching a new area or trying to get to grips with something (offline search). Real-time search I think is a big problem that isn’t going to be solved soon. The library and document-writing aspects I think are a given and need to be the first priority. The third problem is the one that I think is amenable to some new thinking.

What I would really like to see here is a way of pivoting my view of the literature around a specific item. This might be a paper, a dataset, or a blog post. I want to be able to click once and see everything that item cites, click again and see everything that cites it. Pivot away from that to look at what GoPubmed thinks the paper is about and see what it has which is related and then pivot back and see how many of those two sets are common. What are the papers in this area that this review isn’t citing? Is there a set of authors this paper isn’t citing? Have they looked at all the datasets that they should have? Are there general news media items in this area, books on Amazon, books in my nearest library, books on my bookshelf? Are they any good? Have any of my trusted friends published or bookmarked items in this area? Do they use the same tags or different ones for this subject? What exactly is Neil Saunders doing looking at that gene? Can I map all of my friends tags onto a controlled vocabulary?

Essentially what I am asking for is the ability to traverse the graph of how all these things are interconnected. Most of these connections are already explicit somewhere, but nowhere are they all brought together in a way that lets the user slice and dice them the way they want. My belief is that if you can start to understand how people use that graph effectively to find what they want, then you can start to automate the process, and that that will be the route towards real-time search that actually works.
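As a toy illustration of what “traversing the graph” could mean in practice, here is a sketch in Python using networkx. The identifiers are placeholders; a real system would populate the graph from citation databases and the other sources mentioned above.

```python
# Sketch of "pivoting" around an item in the citation graph: one hop out
# to what a paper cites, one hop in to what cites it. The identifiers are
# placeholders, purely for illustration.
import networkx as nx

G = nx.DiGraph()   # edge A -> B means "A cites B"
G.add_edge("paper:review-2009", "paper:smith-2007")
G.add_edge("paper:review-2009", "dataset:pdb-1abc")
G.add_edge("paper:lee-2010", "paper:review-2009")
G.add_edge("paper:lee-2010", "paper:other-2008")
G.add_edge("blog:neylon-post", "paper:review-2009")

item = "paper:review-2009"
cited_by_item = set(G.successors(item))     # everything this item cites
citing_item = set(G.predecessors(item))     # everything that cites it

print("cites:", cited_by_item)
print("cited by:", citing_item)

# A simple "what is this review missing?" pivot: things cited by items
# that cite the review, but not cited by the review itself.
candidates = set()
for neighbour in citing_item:
    candidates |= set(G.successors(neighbour)) - cited_by_item - {item}
print("possibly missing:", candidates)
```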

…but you’ll struggle with uptake…