A personal view of Open Science – Part II – Tools

The second instalment of the paper (first part here), in which I discuss building tools for Open (or indeed any) Science.

Tools for open science – building around the needs of scientists

It is the rapid expansion and development of tools loosely categorised under the banner of ‘Web2.0’ or the ‘Read-write web’ that makes the sharing of research material possible. Many of the generic tools, particularly those that provide general document authoring capabilities, have been adopted and used by a wide range of researchers. Online office tools enable collaborative development of papers and proposals without the need for emailing documents to multiple recipients and the resulting headaches over which version is which. Storing spreadsheets, databases, or data online means that collaborators have easy access to the most recent versions and can see how these are changing. More generally, the use of RSS feed readers and bookmarking sites to share papers of interest and, to some extent, to distribute the task of triaging the literature is catching on in some communities. Microblogging platforms such as Twitter, and aggregation and conversational tools such as Friendfeed, have recently been used very effectively to provide coverage of conferences in progress, including collaborative note-taking. In combination with streamed or recorded video, as well as screencasts and presentations shared online, the idea of a distributed conference, while not an everyday reality, is becoming feasible.

However it is often the case that, while useful, generic web-based services do not provide the desired functionality or do not fit well into the existing workflows of researchers. Here there is the opportunity, and sometimes the necessity, to build specialised or adapted tools. Collaborative preparation of papers is a good example. Conventional web bookmarking services, such as del.icio.us, provide a great way of sharing the literature or resources that a paper builds on with other authors, but they do not automatically capture and recognise the metadata associated with published papers (journal, date, author, volume, page numbers). Specialised services such as CiteULike and Connotea have been developed to enable one-click bookmarking of published literature, and these have been used effectively, for example by adopting a specific tag for references associated with a particular paper in progress. The problem with these services as they exist at the moment is that they do not provide the crucial step in the workflow for which scientists aggregate the references in the first place: the formatting of the references in the finalised paper. Indeed the lack of reference-formatting functionality in GoogleDocs, the most widely used collaborative writing tool, means that in practice the finalised document is usually cut and pasted into Word and the references formatted using proprietary software such as Endnote. In short, the available tools do not provide the required functionality.
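To make the missing piece concrete, here is a minimal sketch of the kind of structured record a bookmarking service captures for a published paper, and how a formatted reference could be generated from it. The field names and the citation style are illustrative assumptions, not the actual schema of Connotea or CiteULike.

```python
# A minimal sketch of the structured metadata a bookmarking service
# might capture for a published paper, and how a formatted reference
# could be generated from it. Field names and citation style are
# illustrative assumptions, not any real service's schema.

from dataclasses import dataclass

@dataclass
class Reference:
    authors: list        # e.g. ["Smith, J.", "Jones, A."]
    title: str
    journal: str
    year: int
    volume: int
    pages: str           # e.g. "123-130"

def format_reference(ref: Reference) -> str:
    """Render one reference in a simple journal-style format."""
    author_str = ", ".join(ref.authors)
    return (f"{author_str} ({ref.year}) {ref.title}. "
            f"{ref.journal} {ref.volume}, {ref.pages}.")

paper = Reference(
    authors=["Smith, J.", "Jones, A."],
    title="An example paper",
    journal="J. Open Sci.",
    year=2008,
    volume=12,
    pages="123-130",
)
print(format_reference(paper))
# Smith, J., Jones, A. (2008) An example paper. J. Open Sci. 12, 123-130.
```

The point is that once the metadata is captured in a structured form at bookmarking time, formatting for the finalised paper becomes a mechanical step rather than a cut-and-paste exercise.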

A number of groups and organisations have investigated the use of blogs and wikis as collaborative and shareable laboratory notebooks. However, few of these systems offer good functionality ‘out of the box’. While there are many electronic laboratory notebook systems sold commercially, most are designed around securing data rather than sharing it, so they are not of interest here. The group of Jean-Claude Bradley has used the freely hosted Wikispaces as a laboratory notebook without further modification, though much of the data and analysis is hosted on other services, including YouTube, Flickr, and GoogleDocs. The OpenWetWare group has made extensive modifications to the MediaWiki system to provide laboratory notebook functionality, whereas Garrett Lisi has adapted the TiddlyWiki framework as a way of presenting his notebook. The Chemtools collaboration at the University of Southampton has developed a specialised blog platform. Commercial offerings in the area of web-based lab notebooks are also starting to appear. All of these different systems have developed because of the specialised needs of recording the laboratory work of the scientists they were designed for. The different systems make different assumptions about where they fit in the workflow of the research scientist, and what that workflow looks like. They are all, however, built around the idea that they need to satisfy the needs of the user.

This creates a tension in tool building. General tools that can be used across a range of disciplines are extremely challenging to design, because workflows, and the perception of how they work, differ between disciplines. Specialist tools can be built for specific fields but often struggle to translate into new areas. Because the market in any one field is small, the natural desire of designers is to make tools as general as possible. However, in the process of building for a sufficiently general workflow, applicability to specific workflows is often lost. There is a strong argument here for building interoperable modules, rather than complete systems, that will allow domain specialists to stitch together specific solutions for specific fields or even specific experiments. Interoperability of systems, and the standards that enable it, is a criterion that is sometimes lost in the development process, but it is absolutely essential to making tools and processes shareable. Workflow management tools, such as Taverna, Kepler, and VisTrails, have an important role to play here.
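As a toy illustration of the interoperable-modules idea, the sketch below chains small functions with clearly defined inputs and outputs into a pipeline. The individual steps are hypothetical placeholders, not part of any real tool; the point is that a domain specialist can swap modules in and out to build a solution for a specific experiment.

```python
# A sketch of the "interoperable modules" idea: each step is a small
# function with a defined input and output, and a domain specialist
# stitches them into a pipeline. The steps are hypothetical placeholders.

def load_data(path):
    """Read raw instrument output into a list of numbers."""
    with open(path) as f:
        return [float(line) for line in f if line.strip()]

def normalise(values):
    """Scale values so the largest is 1.0."""
    peak = max(values)
    return [v / peak for v in values]

def summarise(values):
    """Reduce a series to a simple summary statistic."""
    return sum(values) / len(values)

def run_pipeline(initial_input, steps):
    """Apply each module in turn; any module whose input type matches
    the previous module's output can be swapped in."""
    result = initial_input
    for step in steps:
        result = step(result)
    return result

# A specific solution for a specific experiment is just a list of modules:
# mean_signal = run_pipeline("run42.txt", [load_data, normalise, summarise])
```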

While these tools are not yet at a stage where they are widely configurable by end users, the vision behind them has the potential both to make data analysis much more straightforward for experimental scientists and to solve many of the problems involved in sharing process, as opposed to data. The idea of visually wiring up online or local analysis tools into data processing pipelines is compelling. The reason most experimental scientists use spreadsheets for data analysis is that they do not wish to learn programming languages. Providing visual programming tools, along with services with clearly defined inputs and outputs, will make it possible for a much wider range of scientists to use more sophisticated and powerful analysis tools. What is more, the ability to share, version, and attribute workflows will go some significant distance towards solving the problem of sharing process. Services like MyExperiment, which provide an environment for sharing and versioning Taverna workflows, offer a natural way of sharing the details of exactly how a specific analysis is carried out. Combined with an electronic notebook to record each specific use of a given workflow or analysis procedure (which can be achieved automatically through an API), the full details of the raw data, analysis procedure, and any specific parameters used can be recorded. This combination offers a potential route out of the serious problem of sharing research processes, if the appropriate support infrastructure can be built up.
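A rough sketch of what recording a single workflow run might look like is given below. The record structure and the notebook ‘API’ (here just a JSON file on disk) are assumptions for illustration; MyExperiment and Taverna have their own interfaces, and the workflow URI shown is hypothetical.

```python
# A sketch of recording one use of a shared workflow in an electronic
# notebook, as the text suggests could happen automatically through an
# API. The record structure and the notebook call are assumptions for
# illustration; real services have their own interfaces.

import json
from datetime import datetime, timezone

def record_workflow_run(workflow_uri, raw_data_files, parameters, outputs):
    """Bundle everything needed to reproduce one analysis run."""
    record = {
        "workflow": workflow_uri,      # which shared, versioned workflow
        "raw_data": raw_data_files,    # the exact input files
        "parameters": parameters,      # the specific settings used
        "outputs": outputs,            # what came out
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }
    # A real notebook would receive this via its API; here we write JSON.
    with open("notebook_entry.json", "w") as f:
        json.dump(record, f, indent=2)
    return record

record_workflow_run(
    workflow_uri="http://www.myexperiment.org/workflows/42",  # hypothetical
    raw_data_files=["run42.txt"],
    parameters={"threshold": 0.05},
    outputs=["mean_signal.txt"],
)
```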

Also critical to successful sharing is a shared language or vocabulary. The development of ontologies, controlled vocabularies, and design standards is important in sharing knowledge and crucial to achieving the ultimate goal of making this knowledge machine readable. While there are divisions in the technical development and user communities over the development and use of controlled vocabularies, there is little disagreement over the fact that good vocabularies combined with good tools are useful. The disagreements tend to lie in how they are best developed, when they should be applied, and whether they are superior or complementary to other approaches such as text mining and social tagging. An integrated and mixed approach to the use of controlled vocabularies and standards is the most likely to succeed. In particular, it is important to match the degree of structure in the description to the natural degree of structure in the object or objects being described. Highly structured and consistent data types, such as crystal structures and DNA sequences, can benefit greatly from highly structured descriptions, which are relatively straightforward to create and in many cases are the standard outputs of an analysis process. For large-scale experimental efforts, the scale of the data and sample management problem makes an investment in detailed and structured descriptions worthwhile. In a small laboratory doing unique work, however, there may be a strong case for using local descriptions and vocabularies that are less rigorous but easier to apply and able to grow to fit the changing situation on the ground, ideally designed in such a way that mapping onto an external vocabulary is feasible if it is required or useful in the future.
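As a small illustration of how a loose local vocabulary could later be mapped onto an external one, consider the sketch below. The local tags and the external term identifiers are invented for the example; a real mapping would target a published ontology.

```python
# A sketch of the "local vocabulary that can later be mapped onto an
# external one" idea. The local tags and external terms are invented
# examples; a real mapping would target a published ontology.

local_to_external = {
    # local lab tag   ->   external controlled-vocabulary term (hypothetical)
    "pcr":          "obo:PolymeraseChainReaction",
    "gel":          "obo:GelElectrophoresis",
    "o/n culture":  "obo:OvernightCellCulture",
}

def map_tags(local_tags):
    """Translate local tags, keeping unmapped ones flagged for review."""
    mapped, unmapped = [], []
    for tag in local_tags:
        term = local_to_external.get(tag.lower())
        (mapped if term else unmapped).append(term or tag)
    return mapped, unmapped

mapped, unmapped = map_tags(["PCR", "gel", "new-assay"])
# mapped   -> ["obo:PolymeraseChainReaction", "obo:GelElectrophoresis"]
# unmapped -> ["new-assay"]  (the vocabulary grows with the situation
#                             on the ground; mapping can come later)
```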

Making all of this work requires that researchers adopt these tools and that a community develops that is big enough to provide the added value these tools might deliver. For a broad enough community to adopt these approaches, the tools must fit well into their existing workflows and help to deliver the things that researchers are already motivated to produce. For most researchers, published papers are the measure of career success and the basis of their reward structures. Therefore tools that make it easier to write papers, or that help researchers to write better papers, are likely to gain traction. As expectations of the quality and completeness of the data supporting published papers increase, tools that make it easier for the researcher to collate and curate the record of their research will become important. It is the process of linking the record of what happened in the laboratory, or study, to the first-pass interpretation and analysis of data, through further rounds of analysis, until a completed version is submitted for review, that is currently poorly supported by available tools, and it is this need that will drive the development of improved tools. These tools will enable the disparate elements of the record of research, currently scattered between paper notebooks, various data files on multiple hard drives, and unconnected electronic documents, to be chained together. Once this record is primarily electronic, and probably stored online in a web-based system, the choice to make the record public at any stage, from the moment the record is made to the point of publication, will be available. The reason to link this to publication is to tie it into an existing workflow in the first instance. Once the idea is embedded, the steps involved in making the record even more open are easily taken.

Part III covers social issues around Open Science.

How to make Connotea a killer app for scientists

So Ian Mulvaney asked, and as my solution did not fit into the margin I thought I would post it here. Following on from the two rants of a few weeks back, and many discussions at Scifoo, I have been thinking about how scientists might be persuaded to make more use of social web-based tools. What does it take to get enough people involved so that the network effects become apparent? I had a discussion with Jamie Heywood of Patients Like Me at Scifoo because I was interested in why people with chronic diseases were willing to share detailed and very personal information in a forum that is essentially public. His response was that these people had an ongoing and extremely pressing need to optimise their treatment regime and lifestyle as far as possible, and that by correlating their experiences with others they got to the required answers quicker. Essentially, successful management of their lives required rapid access to high-quality information, sliced and diced in a way that made sense to them, and presented in as efficient and timely a manner as possible. Which obviously left me none the wiser as to why scientists don’t get it….

Nonetheless, there are some clear themes that emerge from that conversation, and from others looking at the uptake and use of web-based tools. So here are my five thoughts. These are framed around the idea of reference management, but the principles, I think, are sufficiently general to apply to most web services.

  1. Any tool must fit within my existing workflows. Once adopted I may be persuaded to modify or improve my workflow, but to be adopted it has to fit to start with. For citation management this means it must offer one-click filing (ideally from any place I might find an interesting paper), but it should also monitor other means of marking papers, e.g. shared items from Google Reader, ‘liked’ items on Friendfeed, or scraped tags in del.icio.us.
  2. Any new tool must clearly outperform all the existing tools it will replace in the relevant workflows, without relying on network or social effects. It’s got to be absolutely clear on first use that I am going to want to use this instead of e.g. Endnote. That means I absolutely have to be able to format and manage references in a word processor or publication document. Technically a nightmare I am sure (you’ve got to worry about integration with Word, Open Office, GoogleDocs, TeX) but an absolute necessity for widespread uptake. And this has to be absolutely clear the first time I use the system, before I have created any local social network and before you have a large enough user base for these to be effective.
  3. It must be near 100% reliable with near 100% uptime. Web services have a bad reputation for going down. People don’t trust their network connection and are still much happier with local applications. Don’t give them an excuse to go back to a local app because the service goes down. Addendum: make sure people can easily back up and download their stuff in a form that will be useful even if your service disappears. Obviously they’ll never need to, but it will make them feel better (and don’t scrimp on this, because they will check whether it works).
  4. Provide at least one (but not too many) really exciting new feature that makes people’s lives better. This is related to #2 but takes it a step further. Beyond just doing what I already do better, I need a quick fix of something new and exciting. My wishlist for Connotea is below.
  5. Prepopulate. Build in publicly available information before the users arrive. For a publications database this is easy, and it is something that BioMedExperts got right. You have a pre-existing social network and pre-existing library information. Populate ‘ghost’ accounts with a library that includes people’s papers (it doesn’t matter if it’s not 100% accurate) and connections based on co-authorships (see the sketch after this list). This will give people an idea of what the social aspect can bring and encourage them to bring more people on board.
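To make point 5 concrete, here is a toy sketch of seeding ghost libraries and co-authorship connections from publication metadata. The paper records are invented for the example; a real service would pull them from a source such as PubMed.

```python
# A sketch of point 5: seeding "ghost" accounts and connections from
# publicly available publication metadata before users arrive. The
# paper records here are invented stand-ins for a real bibliographic
# source.

from itertools import combinations
from collections import defaultdict

papers = [
    {"title": "Paper A", "authors": ["Smith, J.", "Jones, A."]},
    {"title": "Paper B", "authors": ["Jones, A.", "Lee, K."]},
]

libraries = defaultdict(list)   # ghost account -> their papers
connections = defaultdict(set)  # ghost account -> co-authors

for paper in papers:
    for author in paper["authors"]:
        libraries[author].append(paper["title"])
    # Every pair of co-authors on a paper becomes a connection.
    for a, b in combinations(paper["authors"], 2):
        connections[a].add(b)
        connections[b].add(a)

# libraries["Jones, A."]   -> ["Paper A", "Paper B"]
# connections["Jones, A."] -> {"Smith, J.", "Lee, K."}
```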

So far, that is so much motherhood and apple pie. And nothing that Ian didn’t already know (unlike some other developers who I shan’t mention). But what about those cool features? Again I would take a back-to-basics approach. What do I actually want?

Well, what I want is a service that will do three quite different things. I want it to hold a library of relevant references in a way I can search and use, and to let me format and reference documents as I write them. I want it to help me manage the day-to-day process of dealing with the flood of incoming literature (real-time search). And I want it to help me be more effective when I am researching a new area or trying to get to grips with something (offline search). Real-time search, I think, is a big problem that is not going to be solved soon. The library and document-writing aspects are a given and need to be the first priority. The third problem is the one that I think is amenable to some new thinking.

What I would really like to see here is a way of pivoting my view of the literature around a specific item. This might be a paper, a dataset, or a blog post. I want to be able to click once and see everything that item cites, click again and see everything that cites it. Pivot away from that to look at what GoPubmed thinks the paper is about, see what related items it suggests, and then pivot back and see how much those two sets overlap. What are the papers in this area that this review isn’t citing? Is there a set of authors this paper isn’t citing? Have they looked at all the datasets that they should have? Are there general news media items in this area, books on Amazon, books in my nearest library, books on my bookshelf? Are they any good? Have any of my trusted friends published or bookmarked items in this area? Do they use the same tags or different ones for this subject? What exactly is Neil Saunders doing looking at that gene? Can I map all of my friends’ tags onto a controlled vocabulary?

Essentially, what I am asking for is the ability to traverse the graph of how all these things are interconnected. Most of these connections are already explicit somewhere, but nowhere are they all brought together in a way that lets the user slice and dice them the way they want. My belief is that if you can start to understand how people use that graph effectively to find what they want, then you can start to automate the process, and that this will be the route towards real-time search that actually works.
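To make the pivoting idea concrete, here is a toy sketch of one pivot step over a citation graph: one hop out to what an item cites, one hop in to what cites it, and an overlap check against an external ‘related items’ set such as a subject service might provide. The graph and the related set are invented stand-ins for real data sources.

```python
# A sketch of "pivoting" around one item in the citation graph. The
# graph and the related-items set are toy stand-ins for real sources.

cites = {                          # item -> items it cites
    "review": {"paperA", "paperB"},
    "paperC": {"review", "paperA"},
}

def cited_by(item):
    """Invert the cites relation: who cites this item?"""
    return {src for src, targets in cites.items() if item in targets}

def pivot(item, related):
    """One pivot step: outgoing cites, incoming cites, and the overlap
    with an external 'related items' set (e.g. a subject service)."""
    outgoing = cites.get(item, set())
    incoming = cited_by(item)
    return {
        "cites": outgoing,
        "cited_by": incoming,
        "related_but_uncited": related - outgoing,  # gaps in the review?
    }

view = pivot("review", related={"paperA", "paperD"})
# view["related_but_uncited"] -> {"paperD"}: a paper the review isn't citing
```

Each question in the wishlist above is, in this view, just a different traversal or set operation over the same graph.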

…but you’ll struggle with uptake…