A little bit of federated Open Notebook Science

Jean-Claude Bradley is the master when it comes to organising collaborations around diverse sets of online tools. The UsefulChem and Open Notebook Science Challenge projects both revolved around the use of wikis, blogs, GoogleDocs, video, ChemSpider and whatever tools are appropriate for the job at hand. This is something that has grown up over time but is at least partially formally organised. At some level the tools that get used are the ones Jean-Claude decides will be used and it is in part his uncompromising attitude to how the project works (if you want to be involved you interact on the project’s terms) that makes this work effectively.

At the other end of the spectrum is the small scale, perhaps random collaboration that springs up online, generates some data and continues (or not) towards something a little more organised. By definition such “projectlets” will be distributed across multiple services, perhaps uncoordinated, and certainly opportunistic. Just such a project has popped up over the past week or so and I wanted to document it here.

I have for some time been very interested in the potential of visualising my online lab notebook as a graph. The way I organise the notebook is such that it, at least in a sense, automatically generates linked data, and for me this is an important part of its potential power as an approach. I often use a very old graph visualisation in talks I give about the notebook as a way of trying to indicate that potential, which I wrote about previously, but we've not really taken it any further than that.

A week or so ago, Tony Hirst (@psychemedia) left a comment on a blog post which sparked a conversation about feeds and their use for generating useful information. I pointed Tony at the feeds from my lab notebook but didn’t take it any further than that. Following this he posted a series of graph visualisations of the connections between people tweeting at a set of conferences and then the penny dropped for me…sparking this conversation on twitter.

@psychemedia You asked about data to visualise. I should have thought about our lab notebook internal links! What formats are useful? [link]

@CameronNeylon if the links are easily scrapeable, it’s easy enough to plot the graph eg http://blog.ouseful.info/2010/08/30/the-structure-of-ouseful-info/ [link]

@psychemedia Wouldn’t be too hard to scrape (http://biolab.isis.rl.ac.uk/camerons_labblog) but could possibly get as rdf or xml if it helps? [link]

@CameronNeylon structured format would be helpful… [link]

At this point the only part of the whole process that isn’t publicly available takes place as I send an email to find out how to get an XML download of my blog and then report back  via Twitter.

@psychemedia Ok. XML dump at http://biolab.isis.rl.ac.uk/camerons_labblog/index.xml but I will try to hack some Python together to pull the right links out [link]

Tony suggests I pull out the date and I respond that I will try to get the relevant information into some sort of JSON format, and that I'll try to do that over the weekend. Friday afternoons being what they are, and Python being what it is, I actually manage to do this much quicker than I expect and so I tweet that I've made the formatted data, raw data, and script publicly available via DropBox. Of course this is only possible because Tony tweeted the link above to his own blog describing how to pull out and format data for Gephi and it was easy for me to adapt his code to my own needs, an open source win if there ever was one.
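
As an aside, the scraping itself is not complicated. The sketch below gives a flavour of the kind of thing involved; it is not the actual script, and it assumes (hypothetically) that the XML dump exposes each post as an <item> with a <link> and an HTML body in <description>, which may not match the real dump format.

```python
import csv
import re
import xml.etree.ElementTree as ET

BLOG_ROOT = "http://biolab.isis.rl.ac.uk/camerons_labblog"

def extract_edges(xml_path):
    """Collect (source, target) pairs for links between notebook posts."""
    edges = []
    for item in ET.parse(xml_path).iter("item"):
        source = item.findtext("link", default="").strip()
        body = item.findtext("description", default="")
        for target in re.findall(r'href="([^"]+)"', body):
            # keep only links that point back into the notebook itself
            if target.startswith(BLOG_ROOT) and target != source:
                edges.append((source, target))
    return edges

def write_gephi_csv(edges, out_path="notebook_edges.csv"):
    # Gephi imports a simple Source,Target edge table directly
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target"])
        writer.writerows(edges)

if __name__ == "__main__":
    write_gephi_csv(extract_edges("index.xml"))
```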

Despite the fact that Tony took time out to put the kettle on and have dinner, and that I went to a rehearsal, by the time I went to bed on Friday night Tony had improved the script and made it available (with revisions) via a Gist, identified some problems with the data, and posted an initial visualisation. On Saturday morning I transfer Tony's alterations into my own code, set up a local Git repository, push to a new Github repository, and run the script over the XML dump as is (results pushed to Github). I then "fix" the raw data by manually removing the result of a SQL injection attack – note that because I commit and push to the remote repository I get data versioning for free – this "fixing" is transparent and recorded. Then I re-run the script, pushing again to Github. I've just now updated the script and committed once more following further suggestions from Tony.

So over a couple of days we used Twitter for communication, DropBox, GitHub, Gists, and Flickr for sharing data and code, and the whole process was carried out publicly. I wouldn't have even thought to ask Tony about this if he hadn't been publicly posting his visualisations (indeed I remember, but can't find, an ironic tweet from Tony a few weeks back about how it would clearly be much better to publish in a journal in 18 months' time when no-one could even remember what the conference he was analysing was about…).

So another win for open approaches. Again, something small, something relatively simple, but something that came together because people were easily connected in a public space and were routinely sharing research outputs, something that by default spread into the way we conducted the project. It never occurred to me at the time, I was just reaching for the easiest tool at each stage, but at every stage every aspect of this was carried out in the open. It was just the easiest and most effective way to do it.

The triumph of document layout and the demise of Google Wave

I am frequently overly enamoured of the idea of where we might get to, forgetting that there are a lot of people still getting used to where we’ve been. I was forcibly reminded of this by Carole Goble on the weekend when I expressed a dislike of the Utopia PDF viewer that enables active figures and semantic markup of the PDFs of scientific papers. “Why can’t we just do this on the web?” I asked, and Carole pointed out the obvious, most people don’t read papers on the web. We know it’s a functionally better and simpler way to do it, but that improvement in functionality and simplicity is not immediately clear to, or in many cases even useable by, someone who is more comfortable with the printed page.

In my defence I never got to make the second part of the argument, which is that with the new generation of tablet devices, led by the iPad, there is a tremendous potential to build active, dynamic and (under the hood, hidden from the user) semantically backed representations of papers that are both beautiful and functional. The technical means, and the design basis, to suck people into web-based representations of research are falling into place and this is tremendously exciting.

However, while the triumph of the iPad in the medium term may seem assured, my record on predicting the impact of technical innovations is not so good, given the decision by Google to pull out of further development of Wave, primarily due to lack of uptake. Given that I was amongst the most bullish and positive of Wave advocates and yet hadn't managed to get onto the main site for perhaps a month or so, this cannot be terribly surprising, but it is disappointing.

The reasons for lack of adoption have been well rehearsed in many places (see the Wikipedia page or Google News for criticisms): the interface was confusing, it was never clear what Wave was for, and building something useful simply demanded too much user contribution. Nonetheless Wave remains for me an extremely exciting view of the possibilities. Above all it was the ability for users or communities to build dynamic functionality into documents, and to make this part of the fabric of the web, that was important to me. Indeed one of the most important criticisms for me was PT Sefton's complaint that Wave didn't leverage HTML formatting, that it was in a sense not a proper part of the document web ecosystem.

The key for me about the promise of Wave was its ability to interact with web-based functionality, to be dynamic; fundamentally, to treat a growing document as data and present that data in new and interesting ways. In the end this was probably too abstract a concept for users to grab hold of. While single demonstrations were easy to put together, building graphs, showing chemistry, marking up text, it was the bigger picture, that this was generally possible, that never made it through.

I think this is part of the bigger problem, similar to the one we experience in trying to break people out of the PDF habit: we are conceptually stuck in a world of communicating through static documents. There is an almost obsessive need to control the layout and look of documents. This can become hilarious when you see TeX users complaining about having to use Word and Word users complaining about having to use TeX for fundamentally the same reason, that they feel a loss of control over the layout of their document. Documents that move, resize, or respond really seem to put people off. I notice this myself with badly laid out pages with dynamic sidebars that shift around, inducing a strange form of motion sickness.

There seems to be a higher aesthetic bar that needs to be reached for dynamic content, something that has been rarely achieved on the web until recently and virtually never in the presentation of scientific papers. While I philosophically disagree with Apple’s iron grip over their presentation ecosystem I have to admit that this has made it easier, if not quite yet automatic, to build beautiful, functional, and dynamic interfaces.

The rapid development of tablets that we can expect, as the rough and ready but more flexible and open platforms do battle with the closed but elegant and safe environment provided by the iPad, offers real possibilities that we can overcome this psychological hurdle. Does this mean that we might finally see the end of the hegemony of the static document, that we can finally consign the PDF to the dustbin of temporary fixes where it belongs? I'm not sure I want to stick my neck out quite so far again, quite so soon, and say that this will happen, or offer a timeline. But I hope it does, and I hope it does soon.

Writing a Wave Robot – Some thoughts on good practice for Research Robots

ChemSpidey lives! Even in the face of Karen James’ heavy irony I am still amazed that someone like me with very little programming experience was able to pull together something that actually worked effectively in a live demo. As long as you’re not actively scared of trying to put things together it is becoming relatively straightforward to build tools that do useful things. Building ChemSpidey relied heavily on existing services and other people’s code but pulling that together was a relatively straightforward process. The biggest problems were fixing the strange and in most cases undocumented behaviour of some of the pieces I used. So what is ChemSpidey?

ChemSpidey is a Wave robot that can be found at chemspidey@appspot.com. The code repository is available at Github and you should feel free to re-use it in any way you see fit, although I wouldn't really recommend it at the moment; it isn't exactly the highest quality code. One of the first applications I see for Wave is to make it easy to author (semi-)semantic documents which link objects within the document to records on the web. In chemistry it would be helpful to link the names of compounds through to records about those compounds in the relevant databases.

If ChemSpidey is added to a wave it watches for text of the form "chem[ChemicalName{;weight {m}g}]" where the curly bracketed parts are optional. When a blip is submitted by hitting the "done" button ChemSpidey searches through the blip looking for this text and, if it finds it, strips out the name and sends it to the ChemSpider SimpleSearch service. ChemSpider returns a list of database ids and the robot currently just pulls the top one off the list and adds the text ChemicalName (csid:####) to the wave, where the id is linked back to ChemSpider. If there is a weight present it asks the ChemSpider MassSpec API for the nominal molecular weight, calculates the number of moles, and inserts that as well. You can see video of it working here (look along the timeline for the ChemSpidey tag).
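
For those who want a feel for the text handling, here is a minimal sketch of the parsing and mole calculation, not the actual ChemSpidey code. The chemspider_top_hit helper is a hypothetical stand-in for the calls to the ChemSpider SimpleSearch and MassSpec services, and only the milligram case is handled.

```python
import re

# Matches e.g. "chem[benzene]" or "chem[benzene; 78 mg]" (milligrams only)
CHEM_PATTERN = re.compile(r"chem\[([^;\]]+)(?:;\s*([\d.]+)\s*mg)?\]")

def process_blip_text(text, chemspider_top_hit):
    """Replace chem[...] markup with linked text and, if a weight was given,
    a millimole count. chemspider_top_hit(name) is assumed to return a
    (csid, nominal_molecular_weight) tuple."""
    def substitute(match):
        name, weight_mg = match.group(1).strip(), match.group(2)
        csid, mol_weight = chemspider_top_hit(name)
        result = "%s (csid:%s)" % (name, csid)
        if weight_mg is not None:
            millimoles = float(weight_mg) / mol_weight  # mg / (g/mol) = mmol
            result += ", %.3g mmol" % millimoles
        return result
    return CHEM_PATTERN.sub(substitute, text)
```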

What have I learned? Well, some stuff that is probably obvious to anyone who is a proper developer. Use the current version of the API. Google AppEngine pushes strings around as unicode, which broke my tests because I had developed things using standard Python strings. But I think it might be useful to start drawing some more general lessons about how best to design robots for research, so to kick off the discussion here are my thoughts, many of which came out of discussions with Ian Mulvany as we prepared for last week's demo.

  1. Always add a Welcome Blip when the Robot is added to a wave. This makes the user confident that something has happened, lets you notify users if a new version has been released (which might change the way the robot works), and lets you provide some short instructions. It's good to include a version number here as well.
  2. Have some help available. Ian’s Janey robot responds to the request (janey help) in a blip with an extended help blip explaining context. Blips are easily deleted later if the user wants to get rid of them and putting these in separate blips keeps them out of the main document.
  3. Where you modify text leave an annotation. I've only just started to play with annotations but it seems immensely useful to at least attempt to leave a trace of what you've done that makes it easy for either your own Robot, or others, or just human users to see who did what. I would suggest leaving annotations that identify the robot, include any text that was parsed, and ideally provide some domain information. We need to discuss how to set up some namespaces for this.
  4. Try to isolate the "science handling" from the "wave handling" (see the sketch after this list). ChemSpidey mixes up a lot of things into one Python script. Looking back at it now it makes much more sense to isolate the interaction with the wave from the routines that parse text or do mole calculations. This means both that the different layers of code become easier for others to re-use, and also that if Wave doesn't turn out to be the one system to rule them all we can still re-use the code. I am no architecture expert and it would be good to get some clues from some good ones about how best to separate things out.
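
By way of illustration, here is a rough sketch of the separation I have in mind, not the current ChemSpidey layout; the function names are illustrative and the robot framework call is hidden behind a callback rather than spelled out.

```python
# --- science handling: ordinary, testable functions that know nothing of Wave
def millimoles(weight_mg, molecular_weight):
    """Mass in milligrams -> millimoles, given a molecular weight in g/mol."""
    return weight_mg / molecular_weight

def annotate_chemistry(text, lookup):
    """Return (new_text, trace) where trace records what was changed.
    The parsing and database lookup would go here, as in the earlier sketch."""
    return text, []

# --- wave handling: the only layer that touches blips and events ------------
def on_blip_submitted(blip_text, write_back, lookup):
    new_text, trace = annotate_chemistry(blip_text, lookup)
    # write_back wraps whatever the robot framework needs to update the blip,
    # and could also leave an annotation (point 3) recording the trace
    write_back(new_text, trace)
```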

These are just some initial thoughts from a very novice Python programmer. My code satisfies essentially none of these suggestions but I will make a concerted attempt to improve on that. What I really want to do is kick the conversation on from where we are at the moment, which is basically playing around, into how we design an architecture that allows rapid development of useful and powerful functionality.

Reflecting on a Wave: The Demo at Science Online London 2009

Yesterday, along with Chris Thorpe and Ian Mulvany I was involved in what I imagine might be the first of a series of demos of Wave as it could apply to scientists and researchers more generally. You can see the backup video I made in case we had no network on Viddler. I’ve not really done a demo like that live before so it was a bit difficult to tell how it was going from the inside but although much of the tweetage was apparently underwhelmed the direct feedback afterwards was very positive and perceptive.

I think we struggled to get across an idea of what Wave is, which confused a significant proportion of the audience, particularly those who weren’t already aware of it or who didn’t have a pre-conceived idea of what it might do for them. My impression was that those in the audience who were technically oriented were excited by what they saw. If I was to do a demo again I would focus more on telling a story about writing a paper – really give people a context for what is going on. One problem with Wave is that it is easy to end up with a document littered with chat blips and I think this confused an audience more used to thinking about documents.

The other problem is perhaps that a bunch of things "just working" is underwhelming when people are used to the idea of powerful applications that they do their work in. Developers get the idea that this is all happening and working in a generic environment, not a special purpose-built one, and that is exciting. Users just expect things to work or they're not interested. Especially scientists. And it would be fair to say that the robots we demonstrated, mostly the work of a few hours or a few days, aren't incredibly impressive on the surface. In addition, when it is working at its best the success of Wave is that it can make things look easy, if not yet polished. Because it looks easy, people then assume it is easy, and so not worth getting excited about. The point is not that it is possible to automatically mark up text, pull in data, and then process it. It is that you can do this effectively in your email inbox with unrelated tools that are pretty easy to build, or at least adapt. But we also clearly need some flashier demos for scientists.

Ian pulled off a great coup in my view by linking up the output of one Robot to a visualization provided by another. Ian has written a robot called Janey which talks to the Journal/Author Name Estimator service. It can either suggest what journal to send a paper to based on the abstract or suggest other articles of interest. Ian had configured the Robot the night before so it could also get the co-authorship graph for a set of papers and put that into a new blip in the form of a list of edges (or co-authorships).
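
The edge list itself is the simplest possible representation of a co-authorship graph. A toy sketch of the idea, with made-up author names and nothing to do with Ian's actual Janey code, might look like this:

```python
from itertools import combinations

# Hypothetical example input: papers with their author lists
papers = [
    {"title": "Paper A", "authors": ["Smith", "Jones", "Lee"]},
    {"title": "Paper B", "authors": ["Jones", "Patel"]},
]

def coauthor_edges(papers):
    """Yield one edge per pair of co-authors on each paper."""
    for paper in papers:
        for a, b in combinations(sorted(set(paper["authors"])), 2):
            yield a, b

for a, b in coauthor_edges(papers):
    print("%s -- %s" % (a, b))  # e.g. "Jones -- Smith"
```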

The clever bit was that Ian had found another Robot, written by someone entirely different, that visualizes connection graphs. Ian set the blip that Janey was writing to as one that the Graph robot was watching, so the automatically pulled data was automatically visualized [see a screencast here]. Two Robots written by different people for different purposes can easily be hooked up together and just work. I'm not even sure whether Ian had had a chance to test it prior to the demo…but it looked easy, and why wouldn't people expect two data processing tools to work seamlessly together? I mean, it should just work.

The idea of a Wave as a data processing workflow was implicit in what I have written previously, but Ian's demo, and a later conversation with Alan Cann, really sharpened that up in my mind. Alan was asking about different visual representations of a wave. The current client essentially uses the visual metaphor of an email system. One of the points that came from the demo is that it will probably be necessary to write specific clients that make sense for specific tasks. Alan asked about the idea of a Yahoo Pipes type of interface. This suggests a different way of thinking about Wave: instead of a set of text or media elements, it becomes a way to wire up Robots, automated connections to web services. Essentially, with a set of Robots and an appropriate visual client, you could build a visual programming engine, a web service interface, or indeed a visual workflow editing environment.

The Wave client has to walk a very fine line between presenting a view of the Wave that the target user can understand and work with, and the risk of constraining the user's thinking about what can be done. The amazing thing about Wave as a framework is that these things are not only do-able but often very easy. The challenge is actually thinking laterally enough to even ask the question in the first place. The great thing about a public demo is that the challenges you get from the audience make you look at things in different ways.

Allyson Lister blogged the session, there was a FriendFeed discussion, and there should be video available at some point.

The trouble with business models (Facebook buys Friendfeed)

…is that someone needs to make money out of them. It was inevitable at some point that Friendfeed would take a route that led it towards mass adoption and away from the needs of the (rather small) community of researchers that have found a niche that works well for them. I had thought it more likely that Friendfeed would gradually move away from the aspects that researchers found attractive rather than being absorbed wholesale by a bigger player, but then I don't know much about how Silicon Valley really works. It appears that Friendfeed will continue in its current form as the two companies work out how they might integrate the functionality into Facebook, but in the long term it seems unlikely that the current service will survive. In a sense the sudden break may be a good thing because it forces some of the issues about providing this kind of research infrastructure out into the open in a way a gradual shift probably wouldn't.

What is it about Friendfeed that makes it particularly attractive to researchers? I think there are a few things that stand out, based more on hunches than hard data, when comparing it with services like Twitter and Facebook.

  1. Conversations are about objects. At the core of the way Friendfeed works are digital objects, images, blog posts, quotes, thoughts, being pushed into a shared space. Most other services focus on the people and the connections between them. Friendfeed (at least the way I use it) is about the objects and the conversations around them.
  2. Conversation is threaded and aggregated. This is where Twitter loses out. It is almost impossible to track a specific conversation via Twitter unless you do so in real time. The threaded nature of FF makes it possible to track conversations days or months after they happen (as long as you can actually get into them)
  3. Excellent “person discovery” mechanisms. The core functionality of Friendfeed means that you discover people who “like” and comment on things that either you, or your friends like and comment on. Friendfeed remains one of the most successful services I know of at exploiting this “friend of a friend” effect in a useful way.
  4. The community. There is a specific community, with a strong information technology, information management, and bioinformatics/structural biology emphasis, that grew up and aggregated on Friendfeed. That community has immense value and it would be sad to lose it in any transition.

So what can be done? One option is to sit back and wait to be absorbed into Facebook. This seems unlikely to be either feasible or popular. Many people in the FF research community don't want this for reasons ranging from concerns about privacy, through the fundamentals of how Facebook works, to just not wanting to mix work and leisure contacts. All reasonable, and all things I agree with.

We could build our own. Technically feasible but probably not financially. Let's assume a core group of, say, 1000 people (probably overoptimistic) each prepared to pay maybe $25 a year subscription as well as do some maintenance or coding work. That's still only $25k, not enough to pay a single person to keep a service running, let alone actually build something from scratch. Might the FF team make some of the codebase Open Source? Obviously not what they're taking to Facebook, but maybe an earlier version? That would help but there would still need to be either a higher subscription or many more subscribers to keep it running, I suspect. Chalk one up for the importance of open source services though.

Re-aggregating around other services and distributing the functionality would perhaps be feasible. A combination of Google Reader and Twitter, with services like Tumblr, Posterous, and StoryTlr perhaps? The community would be likely to diffuse, but such a distributed approach could be more stable and less susceptible to exactly this kind of buy-out. Nonetheless these are all commercial services that can easily disappear. Google Wave has been suggested as a solution but I think it has fundamental differences in design that make it at best a partial replacement. And it would still require a lot of work.

There is a huge opportunity for existing players in the research web space to make a play here. NPG, ResearchGate, and Seed, as well as other publishers or research funders and infrastructure providers (you know who you are), could fill this gap if they had the resource to build something. Friendfeed is far from perfect: the barrier to entry is quite high for most people, and the effective usage patterns are unclear to new users. Building something that really works for researchers is a big opportunity, but it would still need a business model.

What is clear is that there is a significant community of researchers now looking for somewhere to go. People with a real critical eye for the best services and functionality, and people who may even be prepared to pay something towards it. And who will actively contribute to help guide design decisions and make it work. Build it right and we may just come.

Watching the future…student demos at University of Toronto

On Wednesday morning I had the distinct pleasure of seeing a group of students in the Computer Science department at the University of Toronto giving demos of tools and software that they have been developing over the past few months. The demos themselves were of a consistently high standard throughout, in many ways more interesting and more real than some of the demos that I saw the previous night at the "professional" DemoCamp 21. Some, and I emphasise only some, of the demos were less slick and polished, but in every case the students had a firm grasp of what they had done and why, and were ready to answer criticisms or explain design choices succinctly and credibly. The interfaces and presentation of the software were consistently not just good, but beautiful to look at, and the projects generated real running code that solved real and immediate problems. Steve Easterbrook has given a run down of all the demos on his blog but here I wanted to pick out three that really spoke to problems that I have experienced myself.

I mentioned Brent Mombourquette's work on Breadcrumbs yesterday (details of the development of all of these demos are available on the students' linked blogs). John Pipitone demonstrated this Firefox extension that tracks your browsing history and then presents it as a graph. This appealed to me immensely for a wide range of reasons: firstly that I am very interested in trying to capture, visualise, and understand the relationships between online digital objects. The graphs displayed by Breadcrumbs immediately reminded me of visualisations of thought processes, with branches, starting points, and the return to central nodes all being clear. In the limited time for questions the applications in improving and enabling search, recording and sharing collections of information, and even in identifying when thinking has got into a rut and needs a swift kick were all covered. The graphs can be published from the browser, and the possibilities presented by sharing and analysing these are still popping up as new ideas in my head several days later. In common with the rest of the demos my immediate response was, "I want to play with that now!"

The second demo that really caught my attention was a MediaWiki extension called MyeLink, written by Maria Yancheva, that aims to find similar pages on a wiki. This was particularly aimed at researchers keeping a record of their work and wanting to understand how one page, perhaps describing an experiment that didn't work, was different from a similar page describing an experiment that did. The extension identifies similar pages in the wiki based on either structure (based primarily on headings I think) or on the text used. Maria demonstrated comparing pages as well as faceted browsing of the structure of the pages in line with the extension. The potential here for helping people manage their existing materials is huge. Perhaps more exciting, particularly in the context of yesterday's post about writing up stories, is the potential to assist people with preparing summaries of their work. It is possible to imagine the extension first recognising that you are writing a summary based on the structure, and then recognising that in previous summaries you've pulled text from a different specific class of pages, all the while helping you to maintain a consistent and clear structure.

The last demo I want to mention was from Samar Sabie, of a second MediaWiki extension called VizGraph. Anyone who has used a MediaWiki or a similar framework for recording research knows the problem. Generating tables, let alone graphs, sucks big time. You have your data in a CSV or Excel file and you need to transcribe it, by hand, into a fairly incomprehensible and, more importantly, badly fault-intolerant syntax to generate any sort of sensible visualisation. What you want, and what VizGraph supplies, is a simple wizard that allows you to upload your data file (CSV or Excel naturally), steps you through a few simple questions that are familiar from the Excel chart wizards, and then drops that back into the page as structured text data that is rendered via the Google Chart API. Once it is there you can, if you wish, edit the structured markup to tweak the graph.
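
To give a feel for what is being automated away, here is a rough sketch of the sort of transformation involved: one numeric column of a CSV file turned into a chart URL. This is my own illustration, not VizGraph's code, and the Google Chart API parameter names are quoted from memory, so treat them as an assumption.

```python
import csv
from urllib.parse import urlencode

def chart_url_from_csv(path, column, size="400x200"):
    """Read one numeric column from a CSV file and build a line-chart URL
    for the (now long-deprecated) Google Chart image API."""
    with open(path, newline="") as f:
        values = [float(row[column]) for row in csv.DictReader(f)]
    top = max(values, default=0.0) or 1.0
    params = {
        "cht": "lc",    # chart type: line chart
        "chs": size,    # image size in pixels
        # data, scaled to the 0-100 range the simple text encoding expects
        "chd": "t:" + ",".join("%.1f" % (100.0 * v / top) for v in values),
    }
    return "http://chart.apis.google.com/chart?" + urlencode(params)
```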

Again, this was a great example of just solving the problem for the average user, fitting within their existing workflow and making it happen. But that wasn’t the best bit. The best bit was almost a throwaway comment as we were taken through the Wizard; “and check this box if you want to enable people to download the data directly from a link on the chart…”. I was sitting next to Jon Udell and we both spontaneously did a big thumbs up and just grinned at each other. It was a wonderful example of “just getting it”. Understanding the flow, the need to enable data to be passed from place to place, while at the same time make the user experience comfortable and seamless.

I am sceptical about the rise of a mass "Google Generation" of tech-savvy and sophisticated users of web based tools and computation. But what Wednesday's demos showed me in no uncertain terms was that when you provide a smart group of people, who grew up with the assumption that the web functions properly, with tools and expertise to effectively manipulate and compute on the web, then amazing things happen. That these students make assumptions about how things should work, and most importantly that they should work, that editing and sharing should be enabled by default, and that good user experience is a basic requirement, was brought home by a conversation we had later in the day at the Science 2.0 symposium.

The question was "what does Science 2.0 mean anyway?". A question that is usually answered by reference to Web 2.0 and collaborative web based tools. Steve Easterbrook's opening gambit in response was "well you know what Web 2.0 is don't you?" and this was met with slightly glazed stares. We realized that, at least to a certain extent, for these students there is no Web 2.0. It's just the way that the web, and indeed the rest of the world, works. Give people with these assumptions the tools to make things and amazing stuff happens. Arguably, as Jon Udell suggested later in the day, we are failing a generation by not building this into a general education. On the other hand I think it is pretty clear that these students at least are going to have a big advantage in making their way in the world of the future.

Apparently screencasts for the demoed tools will be available over the next few weeks and I will try and post links here as they come up. Many thanks to Greg Wilson for inviting me to Toronto and giving me the opportunity to be at this session and the others this week.

A personal view of Open Science – Part II – Tools

The second installment of the paper (first part here) where I discuss building tools for Open (or indeed any) Science.

Tools for open science – building around the needs of scientists

It is the rapid expansion and development of tools that are loosely categorised under the banner of 'Web 2.0' or the 'Read-write web' that has made the sharing of research material feasible. Many of the generic tools, particularly those that provide general document authoring capabilities, have been adopted and used by a wide range of researchers. Online office tools can enable collaborative development of papers and proposals without the need for emailing documents to multiple recipients and the resultant headaches over which version is which. Storing spreadsheets, databases, or data online means that collaborators have easy access to the most recent versions and can see how these are changing. More generally, the use of RSS feed readers and bookmarking sites to share papers of interest and, to some extent, to distribute the task of triaging the literature is catching on in some communities. Microblogging platforms such as Twitter, and aggregation and conversational tools such as Friendfeed, have recently been used very effectively to provide coverage of conferences in progress, including collaborative note-taking. In combination with streamed or recorded video, as well as screencasts and sharing of presentations online, the idea of a distributed conference, while not an everyday reality, is becoming feasible.

However it is often the case that, while useful, generic web based services do not provide the desired functionality or do not fit well into the existing workflows of researchers. Here there is the opportunity, and sometimes the necessity, to build specialised or adapted tools. Collaborative preparation of papers is a good example of this. Conventional web bookmarking services, such as del.icio.us, provide a great way of sharing the literature or resources that a paper builds on with other authors, but they do not automatically capture and recognise the necessary metadata associated with published papers (journal, date, author, volume, page numbers). Specialised services such as CiteULike and Connotea have been developed to enable one-click bookmarking of published literature and these have been used effectively, for example by using a specific tag for references associated with a particular paper in progress. The problem with these services as they exist at the moment is that they don't provide the crucial element in the workflow for which scientists want to aggregate the references in the first place: the formatting of the references in the finalised paper. Indeed the lack of formatting functionality in GoogleDocs, the most widely used collaborative writing tool, means that in practice the finalised document is usually cut and pasted into Word and the references formatted using proprietary software such as Endnote. The available tools do not provide the required functionality.

A number of groups and organisations have investigated the use of blogs and wikis as collaborative and shareable laboratory notebooks. However few of these systems offer good functionality 'out of the box'. While there are many electronic laboratory notebook systems sold by commercial interests, most are actually designed around securing data rather than sharing it, so they are not of interest here. While the group of Jean-Claude Bradley has used the freely hosted Wikispaces as a laboratory notebook without further modification, much of the data and analysis is hosted on other services, including YouTube, Flickr, and GoogleDocs. The OpenWetWare group has made extensive modifications to the MediaWiki system to provide laboratory notebook functionality, whereas Garrett Lisi has adapted the TiddlyWiki framework as a way of presenting his notebook. The Chemtools collaboration at the University of Southampton has developed a specialised blog platform. Commercial offerings in the area of web based lab notebooks are also starting to appear. All of these different systems have developed because of the specialised needs of recording the laboratory work of the scientists they were designed for. The different systems make different assumptions about where they fit in the workflow of the research scientist, and what that workflow looks like. They are all, however, built around the idea that they need to satisfy the needs of the user.

This creates a tension in tool building. General tools, which can be used across a range of disciplines, are extremely challenging to design because workflows, and the perception of how they work, differ between disciplines. Specialist tools can be built for specific fields but often struggle to translate into new areas. Because the market in any one field is small, the natural desire for designers is to make tools as general as possible. However in the process of trying to build for a sufficiently general workflow, applicability to specific workflows is often lost. There is a strong argument based on this for building interoperable modules, rather than complete systems, that will allow domain specialists to stitch together specific solutions for specific fields or even specific experiments. Interoperability of systems, and the standards that enable it, is a criterion that is sometimes lost in the development process, but it is absolutely essential to making tools and processes shareable. Workflow management tools, such as Taverna, Kepler, and VisTrails, have an important role to play here.

While not yet at a stage where they are widely configurable by end users, the vision behind them has the potential both to make data analysis much more straightforward for experimental scientists and to solve many of the problems involved in sharing process, as opposed to data. The idea of visually wiring up online or local analysis tools to enable data processing pipelines is compelling. The reason most experimental scientists use spreadsheets for data analysis is that they do not wish to learn programming languages. Providing visual programming tools, along with services with clearly defined inputs and outputs, will make it possible for a much wider range of scientists to use more sophisticated and powerful analysis tools. What is more, the ability to share, version, and attribute workflows will go some significant distance towards solving the problem of sharing process. Services like MyExperiment, which provide an environment for sharing and versioning Taverna workflows, offer a natural way of sharing the details of exactly how a specific analysis is carried out. Along with an electronic notebook to record each specific use of a given workflow or analysis procedure (which can be achieved automatically through an API), the full details of the raw data, analysis procedure, and any specific parameters used can be recorded. This combination offers a potential route out of the serious problem of sharing research processes, if the appropriate support infrastructure can be built up.
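
As a purely illustrative sketch, the kind of record that a notebook could capture automatically for each workflow run need not be complicated; the field names below are my own invention rather than any particular system's schema.

```python
import datetime
import hashlib
import json

def record_workflow_run(workflow_uri, version, parameters, input_files):
    """Build a minimal provenance record for one run of a shared workflow."""
    def digest(path):
        with open(path, "rb") as f:
            return hashlib.sha1(f.read()).hexdigest()
    return json.dumps({
        "workflow": workflow_uri,      # e.g. a myExperiment workflow URI
        "version": version,            # which revision of the workflow was run
        "parameters": parameters,      # the specific settings used
        "inputs": {p: digest(p) for p in input_files},  # raw data fingerprints
        "run_at": datetime.datetime.utcnow().isoformat() + "Z",
    }, indent=2)
```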

Also critical to successful sharing is a shared language or vocabulary. The development of ontologies, controlled vocabularies, and design standards is important in sharing knowledge and crucial to achieving the ultimate goal of making this knowledge machine readable. While there are divisions in the technical development and user communities over the development and use of controlled vocabularies, there is little disagreement over the fact that good vocabularies combined with good tools are useful. The disagreements tend to lie in how they are best developed, when they should be applied, and whether they are superior or complementary to other approaches such as text mining and social tagging. An integrated and mixed approach to the use of controlled vocabularies and standards is the most likely to be successful. In particular it is important to match the degree of structure in the description to the natural degree of structure in the object or objects being described. Highly structured and consistent data types, such as crystal structures and DNA sequences, can benefit greatly from highly structured descriptions which are relatively straightforward to create, and in many cases are the standard outputs of an analysis process. For large scale experimental efforts the scale of the data and sample management problem makes an investment in detailed and structured descriptions worthwhile. In a small laboratory doing unique work, however, there may be a strong case for using local descriptions and vocabularies that are less rigorous but easier to apply and able to grow to fit the changing situation on the ground, ideally designed in such a way that mapping onto an external vocabulary is feasible if it is required or useful in the future.

Making all of this work requires that researchers adopt these tools and that a community develops that is big enough to provide the added value that these tools might deliver. For a broad enough community to adopt these approaches, the tools must fit well into their existing workflow and help to deliver the things that researchers are already motivated to produce. For most researchers, published papers are the measure of their career success and the basis of their reward structures. Therefore tools that make it easier to write papers, or that help researchers to write better papers, are likely to get traction. As the expectations of the quality and completeness of supporting data for published papers increase, tools that make it easier for the researcher to collate and curate the record of their research will become important. It is the process of linking the record of what happened in the laboratory, or study, to the first-pass interpretation and analysis of data, through further rounds of analysis until a completed version is submitted for review, that is currently poorly supported by available tools, and it is this need that will drive the development of improved tools. These tools will enable the disparate elements of the record of research, currently scattered between paper notebooks, various data files on multiple hard drives, and unconnected electronic documents, to be chained together. Once this record is primarily electronic, and probably stored online in a web based system, the choice to make the record public at any stage, from the moment the record is made to the point of publication, will be available. The reason to link this to publication is to tie it into an existing workflow in the first instance. Once the idea is embedded, the steps involved in making the record even more open are easily taken.

Part III covers social issues around Open Science.

An open letter to the developers of Social Network and ‘Web 2.0’ tools for scientists

My aim is to email this to all the email addresses that I can find on the relevant sites over the next week or so, but feel free to diffuse more widely if you feel it is appropriate.

Dear Developer(s)

I am writing to ask your support in undertaking a critical analysis of the growing number of tools being developed that broadly fall into the category of social networking or collaborative tools for scientists. There has been a rapid proliferation of such tools and significant investment in time and effort for their development. My concern, which I wrote about in a recent blog post (here), is that the proliferation of these tools may lead to a situation where, because of a splitting up of the potential user community, none of these tools succeed.

One route forward is to simply wait for the inevitable consolidation phase where some projects move forward and others fail. I feel that this would be missing an opportunity to critically analyse the strengths and weaknesses of these various tools, and to identify the desirable characteristics of a next generation product. To this end I propose to write a critical analysis of the various tools, looking at architecture, stability, usability, long term funding, and features. I have proposed some criteria and received some comments and criticisms of these. I would appreciate your views on what the appropriate criteria are and would welcome your involvement in the process of writing this analysis. This is not meant as an attack on any given service or tool, but as a way of getting the best out of the development work that has already taken place, and taking the opportunity to reflect on what has worked and what has not in a collaborative and supportive fashion.

I will also be up front and say that I have an agenda on this. I would like to see a portable and agreed data model that would enable people to utilise the best features of all these services without having to rebuild their network within each site. This approach is very much part of the data portability agenda and would probably have profound implications for the design architecture of your site. My feeling, however, is that this would be the most productive architectural approach. It does not mean that I am right of course and I am prepared to be convinced otherwise if the arguments are strong.

I hope you will feel free to take part in this exercise and contribute. I do believe that if we take a collaborative approach then it will be possible to identify the features and range of services that the community needs and wants. Please comment at the blog post or request access to the GoogleDoc where we propose to write up this analysis.

Yours sincerely,

Cameron Neylon

Policy for Open Science – reflections on the workshop

Written on the train on the way from Barcelona to Grenoble. This life really is a lot less exotic than it sounds… 

The workshop that I've reported on over the past few days was both positive and inspiring. There is a real sense that the ideas of Open Access and Open Data are becoming mainstream. As several speakers commented, within 12-18 months it will be very unusual for any leading institution not to have a policy on Open Access to its published literature. In many ways, as far as Open Access to the published literature is concerned, the war has been won. There remain battles to be fought over green and gold routes, over the role of licenses and the need to be able to text mine; successful business models remain to be made demonstrably sustainable; and there will be pain as the inevitable restructuring of the publishing industry continues. But be under no illusions: this restructuring has already begun and it will continue in the direction of more openness as long as the poster children of the movement like PLoS and BMC continue to be successful.

Open Data remains further behind, both with respect to policy and awareness. Many people spoke over the two days about Open Access and then added, almost as an addendum ‘Oh and we need to think about data as well’. I believe the policies will start to grow and examples such as the BBSRC Data Sharing Policy give a view of the future. But there is still much advocacy work to be done here. John Wilbanks talked about the need to set achievable goals, lines in the sand which no-one can argue with. And the easiest of these is one that we have discussed many times. All data associated with a published paper, all analysis, and all processing procedures, should be made available. This is very difficult to argue with – nonetheless we know of examples where the raw data of important experiments is being thrown away. But if an experiment cannot be shown to have been done, cannot be replicated and checked, can it really be publishable? Nonetheless this is a very useful marker, and a meme that we can spread and promote.

In the final session there was a more critical analysis of the situation. A number of serious questions were raised but I think they divide into two categories. The first involves the rise of the 'Digital Natives' or the 'Google Generation'. The characteristics of this new generation (a gross simplification in its own right) are often presented as a pure good: better networked, more sharing, better equipped to think in the digital network. But there are some characteristics that ought to give pause. A casualness about attribution, a sense that if something is available then it is fine to just take it (it's not stealing after all, just copying). There is perhaps a need to recover the roots of 'Mertonian' science, to, as I think James Boyle put it, publicise and embed the attitudes of the last generation of scientists, for whom science was a public good and a discipline bounded by strict rules of behaviour. Some might see this as harking back to an elitist past, but if we are constructing a narrative about what we want science to be then we can take the best parts of all of our history and use it to define and refine our vision. There is certainly a place for a return to the compulsory study of the history and philosophy of science.

The second major category of issues discussed in the last session revolved around the question of what do we actually do now. There is a need to move on many fronts, to gather evidence of success, to investigate how different open practices work – and to ask ourselves the hard questions. Which ones do work, and indeed which ones do not. Much of the meeting revolved around policy with many people in favour of, or at least not against, mandates of one sort or another. Mike Carroll objected to the term mandate – talking instead about contractual conditions. I would go further and say that until these mandates are demonstrated to be working in practice they are aspirations. When they are working in practice they will be norms, embedded in the practice of good science. The carrot may be more powerful than the stick but peer pressure is vastly more powerful than both.

So the key questions to me revolve around how we can convert aspirations into community norms. What is needed in terms of infrastructure, in terms of incentives, and in terms of funding to make this stuff happen? One thing is to focus on the infrastructure and take a very serious and critical look at what is required. It can be argued that much of the storage infrastructure is in place. I have written on my concerns about institutional repositories but the bottom line remains that we probably have a reasonable amount of disk space available. The network infrastructure is pretty good, so these are two things we don't need to worry about. What we do need to worry about, and what wasn't really discussed very much in the meeting, is the tools that will make it easy and natural to deposit data and papers.

The incentive structure remains broken – this is not a new thing – but if sufficiently high profile people start to say this should change, and act on those beliefs, and they are, then things will start to shift. It will be slow, but bit by bit we can imagine getting there. Can we take shortcuts? Well, there are some options. I've raised in the past the idea of a prize for Open Science (or in fact two, one for an early career researcher and one for an established one). Imagine if we could make this a million dollar prize, or at least enough for someone to take a year off. High profile, significant money, and visible success for someone each year. Even without money this is still something that will help people – give them something to point to as recognition of their contribution. But money would get people's attention.

I am sceptical about the value of ‘microcredit’ systems where a person’s diverse and perhaps diffuse contributions are aggregated together to come up with some sort of ‘contribution’ value, a number by which job candidates can be compared. Philosophically I think it’s a great idea, but in practice I can see this turning into multiple different calculations, each of which can be gamed. We already have citation counts, H-factors, publication number, integrated impact factor as ways of measuring and comparing one type of output. What will happen when there are ten or 50 different types of output being aggregated? Especially as no-one will agree on how to weight them. What I do believe is that those of us who mentor staff, or who make hiring decisions should encourage people to describe these contributions, to include them in their CVs. If we value them, then they will value them. We don’t need to compare the number of my blog posts to someone else’s – but we can ask which is the most influential – we can compare, if subjectively, the importance of a set of papers to a set of blog posts. But the bottom line is that we should actively value these contributions – let’s start asking the questions ‘Why don’t you write online? Why don’t you make your data available? Where are your protocols described? Where is your software, your workflows?’

Funding is key, and for me one of the main messages to come from the meeting was the need to think in terms of infrastructure and, in particular, to distinguish what is infrastructure and what is science or project driven. In one conversation over coffee I discussed the problem of how to fund development projects where the two are deeply intertwined, and how this raises challenges for funders. We need new funding models to make this work. It was suggested in the final panel that as these tools become embedded in projects there will be less need to worry about them in infrastructure funding lines. I disagree. Coming from an infrastructure support organisation, I think there is a desperate need for critical strategic oversight of the infrastructure that will support all science – physical facilities, network and storage infrastructure, tools, and data. This could be done effectively using a federated model and need not be centralised, but I think there is a need to support the assumption that the infrastructure is available, and this should not be done on a project by project basis. We build central facilities for a reason – maybe the support and development of software tools doesn't fit this model but I think it is worth considering.

This 'infrastructure thinking' goes wider than disk space and networks, wider than tools, and wider than the data itself. The concept of 'law as infrastructure' was briefly discussed. There was also a presentation looking at different legal models of a 'commons': the public domain, a contractually reconstructed commons, escrow systems, and so on. In retrospect I think there should have been more of this. We need to look critically at different models, what they are good for, and how they work. 'Open everything' is a wonderful philosophical position but we need to be critical about where it will work, where it won't, where it needs contractual protection, and where such contractual protection is counterproductive. I spoke to John Wilbanks about our ideas on taking Open Source Drug Discovery into undergraduate classes and schools and he was critical of the model I was proposing, not from the standpoint of the aims or where we want to be, but because it wouldn't be effective at drawing in pharmaceutical companies and protecting their investment. His point was, I think, that by closing off the right piece of the picture with contractual arrangements you bring in vastly more resources and give yourself greater ability to ensure positive outcomes. That sometimes to break the system you need to start by working within it, in this case by making it possible to patent a drug. This may not be philosophically in tune with my thinking but it is pragmatic. There will be moments, especially when we deal with the interface with commerce, where we have to make these types of decisions. There may or may not be 'right' answers, and if there are they will change over time, but we need to know our options and know them well so as to make informed decisions on specific issues.

But finally, as is my usual wont, I come back to the infrastructure of tools: the software that will actually allow us to record and order the data that we are supposed to be sharing. Again there was relatively little on this in the meeting itself. Several speakers recognised the need to embed the collection of data and metadata within existing workflows but there was very little discussion of good examples of this. As we have discussed before, this is much easier for big science than for 'long tail' or 'small science'. I stand by my somewhat provocative contention that for the well described central experiments of big science this is essentially a solved problem – it just requires the will and resources to build the language to describe the data sets, their formats, and their inputs. But the problem is that even for big science, the majority of the workflow is not easily automated. There are humans involved, making decisions moment by moment, and these need to be captured. The debate over institutional repositories and self-archiving of papers is instructive here. Most academics don't deposit because they can't be bothered. The idea of a negative click repository – where deposit is a natural part of the workflow – can circumvent this. And if well built it can make the conventional process of article submission easier. It is all a question of getting into the natural workflow of the scientist early enough that not only do you capture all the contextual information you want, but you can offer assistance that makes them want to put that information in.

The same is true for capturing data. We must capture it at source. This is the point where it has the potential to add the greatest value to the scientist’s workflow by making their data and records more available, by making them more consistent, by allowing them to reformat and reanalyse data with ease, and ultimately by making it easy for them to share the full record. We can and we will argue about where best to order and describe the elements of this record. I believe that this point comes slightly later – after the experiment – but wherever it happens it will be made much easier by automatic capture systems that hold as much contextual information as possible. Metadata is context – almost all of it should be possible to catch automatically. Regardless of this we need to develop a diverse ecosystem of tools. It needs to be an open and standards based ecosystem and in my view needs to be built up of small parts, loosely coupled. We can build this – it will be tough, and it will be expensive but I think we know enough now to at least outline how it might work, and this is the agenda that I want to explore at SciFoo.

John Wilbanks had the last word, and it was a call to arms. He said ‘We are the architects of Open’. There are two messages in this. The first is we need to get on and build this thing called Open Science. The moment to grasp and guide the process is now. The second is that if you want to have a part in this process the time to join the debate is now. One thing that was very clear to me was that the attendees of the meeting were largely disconnected from the more technical community that reads this and related blogs. We need to get the communication flowing in both directions – there are things the blogosphere knows, that we are far ahead on, and we need to get the information across. There are things we don’t know much about, like the legal frameworks, the high level policy discussions that are going on. We need to understand that context. It strikes me though that if we can combine the strengths of all of these communities and their differing modes of communication then we will be a powerful force for taking forward the open agenda.