End of Feed

Friendfeed icon (Photo credit: Wikipedia, Public Domain)

Following on from (but unrelated to) my post last week about feed tools we have two posts, one from Deepak Singh, and one from Neil Saunders, both talking about ‘friend feeds’ or ‘lifestreams’. The idea here is of aggregating all the content you are generating (or is being generated about you?) into one place. There are a couple of these about but the main ones seem to be Friendfeed and Profiliac. See Deepak’s post (or indeed his Friendfeed) for details of the conversations that can come out of these types of things.

A (small) feeding frenzy – Cameron Neylon, Science in the Open – 10 March 2008

Half the links in that quote are dead. I wrote the post above seven years ago today, and it very much marked a beginning. Friendfeed went on to become the coffee house for a broad community of people interested in Open Science and became the place where, for me at least, many of the key discussions took place. Friendfeed was one of a number of examples of “life feed” services. The original intent was as an aggregation point for your online activity but the feed itself rapidly became the focus. Facebook in particular owes a debt to the user experience of Friendfeed. Facebook bought Friendfeed for the team in 2009 and rapidly started incorporating its ideas.

Yesterday Facebook announced they were going to shutter the service, which, to be fair, they have kept going for many years now with no revenue source and no doubt declining user numbers. Of course those communities that remained are precisely the ones that most loved what the service offered. The truly shocking thing is that, although nothing has been done to the interface or services that Friendfeed offers for five years, it still remains a best-in-class experience. Louis Gray had some thoughts on what was different about Friendfeed. It remains, in my view, the best technical solution and user experience for enabling the kind of sharing that researchers actually want to do. I remember reading that Robert Scoble disliked the way that Friendfeed worked, and thinking “all those things are a plus for researchers…”. Twitter is ok, Facebook really not up to the job, Figshare doesn’t have the social features, and all the other “facebooks for science” simply don’t have critical mass. Of course, neither did Friendfeed once everyone left…but while there was a big community there we had a glimpse of what might be possible.

It’s also a reminder, as discussed in the Principles for Scholarly Infrastructures that Geoff Bilder, Jennifer Lin and I released a week or so back, that relying on the largesse of third parties is not a reliable foundation to build on. If we want to take care of our assets as a community, we need to take responsibility for them as well. In my view there is some important history buried in the records of Friendfeed and I’m going to make some effort to build an archive. This script appears to do a good job of grabbing public feeds. It doesn’t pull discussions (i.e. the comments on other people’s posts) unless you have the “remote key” for that account. If anyone wants to send me their remote key (log in to Friendfeed and navigate to http://friendfeed.com/remotekey) I’ll take a shot at grabbing their discussions as well. Otherwise I’ll just try to prioritize the most important accounts from my perspective to archive.
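For what it’s worth, the kind of archiving script I mean might look something like the sketch below. It is an illustration rather than the actual script linked above, and it assumes the FriendFeed v2 API conventions as I remember them (a feed endpoint at friendfeed-api.com/v2/feed/&lt;nickname&gt; with start/num paging, and HTTP Basic auth with the remote key for private content); the nickname and output filename are just examples.

```python
# A minimal sketch of a Friendfeed archiver. Assumptions (the service is
# now shutting down, so treat these as historical): the v2 API served
# public feeds at http://friendfeed-api.com/v2/feed/<nickname> with
# start/num paging, and discussions required Basic auth with the remote key.
import json
import requests

API = "http://friendfeed-api.com/v2/feed/{nick}"

def archive_feed(nick, remote_key=None, page_size=100):
    """Page through a feed and return the accumulated entries."""
    auth = (nick, remote_key) if remote_key else None
    entries, start = [], 0
    while True:
        resp = requests.get(API.format(nick=nick),
                            params={"start": start, "num": page_size},
                            auth=auth)
        resp.raise_for_status()
        batch = resp.json().get("entries", [])
        if not batch:
            break
        entries.extend(batch)
        start += page_size
    return entries

if __name__ == "__main__":
    data = archive_feed("cameronneylon")  # example nickname
    with open("friendfeed-archive.json", "w") as f:
        json.dump(data, f, indent=2)
```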

Is it recent history or is it ancient? We lost Jean-Claude Bradley last year, one of the original thinkers, and perhaps more importantly do-ers, of many strands in Open Research. Much of his thinking from 2008-2011 was on Friendfeed. For me, it was the space in which the foundations for a lot of my current thinking were laid. And where I met many of the people who helped me lay those foundations. And a lot of my insights into how technology does and does not help communities were formed by watching how much better Friendfeed was than many other services. Frankly, a lot of the half-baked crap out there today could learn a lot by looking at how this nearly decade-old website works. And it still works for those communities that have stayed on in strength.

But that is the second lesson. It is the combination of functionality and the community that makes the experience so rich. My community, the Open Science group, left en masse after Facebook acquired Friendfeed. That community no longer trusted that the service would stay around (c.f. again those principles on trust). The librarian community stayed and had an additional five years of rich interactions. It’s hardly new to say that you need both community and technology working together to build a successful social media experience. But it still makes me sad to see it play out like this. And sad that the technology that demonstrably had the best user experience for research and scholarship in small(ish) communities never achieved the critical mass that it needed to succeed.

 

What would scholarly communications look like if we invented it today?

Image by cameronneylon via Flickr

I’ve largely stolen the title of this post from Daniel Mietchen because it helped me to frame the issues. I’m giving an informal talk this afternoon and will, as I frequently do, use this to think through what I want to say. Needless to say this whole post is built to a very large extent on the contributions and ideas of others that are not adequately credited in the text here.

If we imagine what the specification for building a scholarly communications system would look like there are some fairly obvious things we would want it to enable. Registration of ideas, data or other outputs for the purpose of assigning credit and priority to the right people is high on everyone’s list. While researchers tend not to think too much about it, those concerned with the long term availability of research outputs would also place archival and safekeeping high on the list as well. I don’t imagine it will come as any surprise that I would rate the ability to re-use, replicate, and re-purpose outputs very highly as well. And, although I won’t cover it in this post, an effective scholarly communications system for the 21st century would need to enable and support public and stakeholder engagement. Finally this specification document would need to emphasise that the system will support discovery and filtering tools so that users can find the content they are looking for in a huge and diverse volume of available material.

So, filtering, archival, re-usability, and registration. Our current communications system, based almost purely on journals with pre-publication peer review, doesn’t do too badly at archival, although the question of who is actually responsible for doing the archiving, and hence paying for it, doesn’t always seem to have a clear answer. Nonetheless the standards and processes for archiving paper copies are better established, and probably better followed in many cases, than those for digital materials in general, and certainly for material on the open web.

The current system also does reasonably well on registration, providing a firm date of submission, author lists, and increasingly descriptions of the contributions of those authors. Indeed the system defines the registration of contributions for the purpose of professional career advancement and funding decisions within the research community. It is a clear and well understood system with a range of community expectations and standards around it. Of course this is circular as the career progression process feeds the system and the system feeds career progression. It is also to some extent breaking down as wider measures of “impact” become important. However for the moment it is an area where the incumbent has clear advantages over any new system, around which we would need to grow new community standards, expectations, and norms.

It is on re-usability and replication where our current system really falls down. Access and rights are big issues here, but ones that we are gradually pushing back on. The real issues are much more fundamental. It is essentially assumed, in my experience, by most researchers that a paper will not contain sufficient information to replicate an experiment or analysis. Just consider that. Our primary means of communication, in a philosophical system that rests almost entirely on reproducibility, does not enable even simple replication of results. A lot of this is down to the boundaries created by the mindset of a printed multi-page article. Mechanisms to publish methods, detailed laboratory records, or software are limited, often leading to a lack of care in keeping and annotating such records. After all, if it isn’t going in the paper why bother looking after it?

A key advantage of the web here is that we can publish a lot more with limited costs and we can publish a much greater diversity of objects. In principle we can solve the “missing information” problem by simply making more of the record available. However those important pieces of information need to be captured in the first place. Because they aren’t currently valued, because they don’t go in the paper, they often aren’t recorded in a systematic way that makes it easy to ultimately publish them. Open Notebook Science, with its focus on just publishing everything immediately, is one approach to solving this problem but it’s not for everyone, and causes its own overhead. The key problem is that recording more, and communicating it effectively requires work over and above what most of us are doing today. That work is not rewarded in the current system. This may change over time, if as I have argued we move to metrics based on re-use, but in the meantime we also need much better, easier, and ideally near-zero burden tools that make it easier to capture all of this information and publish it when we choose, in a useful form.

Of course, even with the perfect tools, if we start to publish a much greater portion of the research record then we will swamp researchers already struggling to keep up. We will need effective ways to filter this material down to reduce the volume we have to deal with. Arguably the current system is an effective filter. It almost certainly reduces the volume and rate at which material is published. Of all the research that is done, some proportion is deemed “publishable” by those who have done it, a small portion of that research is then incorporated into a draft paper, and some proportion of those papers are ultimately accepted for publication. Up until 20 years ago, when the resource pinch point was the decision of whether or not to publish something, this is exactly what you would want. The question of whether it is an effective filter, whether it is actually filtering the right stuff out, is somewhat more controversial. I would say the evidence for that is weak.

When publication and distribution were the expensive part, that was the logical place to make the decision. Now that these steps are cheap, the expensive part of the process is either peer review, the traditional process of making a decision prior to publication, or conversely the curation and filtering after publication that is seen more widely on the open web. As I have argued, I believe that using the patterns of the web will ultimately be a more effective means of enabling users to discover the right information for their needs. We should publish more, much more and much more diversely, but we also need to build effective tools for filtering and discovering the right pieces of information. Clearly this also requires work, perhaps more than we are used to doing.

An imaginary solution

So what might this imaginary system that we would design look like? I’ve written before about both key aspects of this. Firstly, I believe we need recording systems that as far as possible record and publish the creation of objects, be they samples, data, or tools. As far as possible these should make a reliable, time-stamped, attributable record of the creation of these objects as a byproduct of what the researcher needs to do anyway. A simple concept, for instance, is a label printer that, as a byproduct of printing off a label, makes a record of who, what, and when, publishing this simultaneously to a public or private feed.
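As a rough illustration, not a description of any actual system, the label-printer idea could be as little as this: every print also appends a who/what/when record to a local feed file that could later be exposed publicly or privately. The filename and record format here are invented.

```python
# A minimal sketch of the label-printer idea: printing a label also
# appends a who/what/when record to a local feed file as a byproduct.
import json
from datetime import datetime, timezone
from pathlib import Path

FEED_FILE = Path("label_feed.jsonl")  # hypothetical local store

def print_label(who, what):
    """Pretend to print a label and record the event as a feed item."""
    record = {
        "who": who,
        "what": what,
        "when": datetime.now(timezone.utc).isoformat(),
    }
    # ... send `what` to the actual label printer here ...
    with FEED_FILE.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return record

print_label("cn", "Sample 42: lysozyme stock, 10 mg/ml")
```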

Publishing rapidly is a good approach, not just for ideological reasons of openness but also for some very pragmatic ones. It is easier to publish at the source than to remember to go back and do it later; things that aren’t done immediately are almost invariably forgotten or lost. Secondly, rapid publication has the potential both to enable re-use efficiently and to reduce scooping risks by providing a time-stamped, citable record. This of course would require people to cite these outputs, and for those citations to be valued as a contribution, requiring a move away from considering the paper as the only form of valid research output (see also Michael Nielsen‘s interview with me).

It isn’t enough, though, just to publish the objects themselves. We also need to be able to understand the relationships between them. In a semantic web sense this means creating the links between objects, recording the context in which they were created, and what their inputs and outputs were. I have alluded a couple of times in the past to the OREChem Experimental Ontology and I think this is potentially a very powerful way of handling these kinds of connections in a general way. In many cases, particularly in computational research, recording workflows or generating batch and log files could serve the same purpose, as long as a general vocabulary could be agreed to make this exchangeable.
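To make that concrete, here is a sketch of what those links might look like as triples. The vocabulary is a made-up stand-in rather than the OREChem ontology itself; the point is only that inputs and outputs become explicit, machine-readable relationships.

```python
# A sketch of recording relationships as RDF triples. The vocabulary and
# record URLs are invented placeholders, not the OREChem ontology.
from rdflib import Graph, Namespace

EX = Namespace("http://example.org/vocab/")    # hypothetical vocabulary
LAB = Namespace("http://example.org/mylab/")   # hypothetical record URLs

g = Graph()
procedure = LAB["procedure/pcr-2011-03-10"]
g.add((procedure, EX.hasInput, LAB["sample/template-dna-17"]))
g.add((procedure, EX.hasInput, LAB["sample/primer-mix-3"]))
g.add((procedure, EX.hasOutput, LAB["data/gel-image-2011-03-10.png"]))

print(g.serialize(format="turtle"))
```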

As these objects get linked together they will form a network, both within and across projects and research groups, providing the kind of information that makes Google work, a network of citations and links that make it possible to directly measure the impact of a single dataset, idea, piece of software, or experimental method through its influence over other work. This has real potential to help solve both the discovery problem and the filtering problem. Bottom line, Google is pretty good at finding relevant text and they’re working hard on other forms of media. Research will have some special edges but can be expected in many ways to fit patterns that mean tools for the consumer web will work, particularly as more social components get pulled into the mix.

On the rare occasions when it is worth pulling together a whole story, for a thesis, or a paper, authors would then aggregate objects together, along with text and good visuals to present the data. The telling of a story then becomes a special event, perhaps one worthy of peer review in its traditional form. The forming of a “paper” is actually no more than providing new links, adding grist to the search and discovery mill, but it can retain its place as a high status object, merely losing its role as the only object worth considering.

So in short, publish fragments, comprehensively and rapidly. Weave those into a wider web of research communication, and from time to time put in the larger effort required to tell a more comprehensive story. This requires tools that are hard to build, standards that are hard to agree, and cultural change that at times seems like spitting into a hurricane. Progress is being made, in many places and in many ways, but how can we take this forward today?

Practical steps for today

I want to write more about these ideas in the future but here I’ll just sketch out a simple scenario that I hope can be usefully implemented locally but provide a generic framework to build out without necessarily requiring a massive agreement on standards.

The first step is simple: make a record, ideally an address on the web, for everything we create in the research process. For data and software, just the files themselves on a hard disk are a good start. Pushing them to some sort of web storage, be it a blog, GitHub, an institutional repository, or some dedicated data storage service, is even better because it makes step two easy.

Step two is to create feeds that list all of these objects, their addresses, and as much standard metadata as possible; who and when would be a good start. I would make these open by choice, mainly because dealing with feed security is a pain, but this would still work behind a firewall.
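As an illustration of step two, such a feed could be as simple as the following sketch, which writes a minimal Atom feed listing each object’s address, creator, and timestamp. The objects and URLs are invented; any feed library would do the same job.

```python
# A minimal sketch of "step two": an Atom feed listing objects with an
# address, a creator, and a timestamp. Object list and URLs are examples.
from xml.etree import ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM)

objects = [  # hypothetical records: (title, url, who, when)
    ("Gel image, run 12", "http://example.org/data/gel12.png",
     "cn", "2011-03-10T14:02:00Z"),
    ("PCR protocol v3", "http://example.org/procedures/pcr-v3",
     "cn", "2011-03-09T09:30:00Z"),
]

feed = ET.Element(f"{{{ATOM}}}feed")
ET.SubElement(feed, f"{{{ATOM}}}title").text = "Lab object feed"
ET.SubElement(feed, f"{{{ATOM}}}id").text = "http://example.org/feed"

for title, url, who, when in objects:
    entry = ET.SubElement(feed, f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM}}}id").text = url
    ET.SubElement(entry, f"{{{ATOM}}}link", href=url)
    ET.SubElement(entry, f"{{{ATOM}}}updated").text = when
    author = ET.SubElement(entry, f"{{{ATOM}}}author")
    ET.SubElement(author, f"{{{ATOM}}}name").text = who

print(ET.tostring(feed, encoding="unicode"))
```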

Step three gets slightly harder. Where possible configure your systems so that inputs can always be selected from a user-configurable feed. Where possible automate the pushing of outputs to your chosen storage systems so that new objects are automatically registered and new feeds created.
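A sketch of what step three might look like from the user’s side: read a configurable feed and offer its entries as the menu of possible inputs. The feed URL is a placeholder, and feedparser is just one convenient way to read Atom/RSS.

```python
# A sketch of "step three": let the user pick an input from a
# user-configurable feed. The feed URL is an invented example.
import feedparser

FEED_URL = "http://example.org/feed"  # hypothetical object feed

def choose_input(feed_url):
    """Show feed entries as a numbered menu and return the chosen URL."""
    feed = feedparser.parse(feed_url)
    for i, entry in enumerate(feed.entries):
        print(f"{i}: {entry.get('title', '(untitled)')}")
    choice = int(input("Select an input: "))
    return feed.entries[choice].link

# input_url = choose_input(FEED_URL)
```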

This is extraordinarily simple conceptually: create feeds, and use them as inputs for processes. It’s not so straightforward to build such a thing into an existing tool or framework, but it doesn’t need to be terribly difficult either. And it doesn’t need to bother the user: feeds should be created automatically and presented as drop-down menus.

The step beyond this, creating a standard framework for describing the relationships between all of these objects, is much harder. Not because it’s technically difficult, but because it requires agreement on standards for how to describe those relationships. This is do-able, and I’m very excited by the work at Southampton on the OREChem Experimental Ontology, but the social problems are harder. Others prefer the Open Provenance Model or argue that workflows are the way to manage this information. Getting agreement on standards is hard, particularly if we’re trying to maximise their effective coverage, but if we’re going to build a computable record of science we’re going to have to tackle that problem. If we can crack it and get coverage of the records via a compatible set of models that tell us how things are related, then I think we will be well placed to solve the cultural problem of actually getting people to use them.


The integrated lab record – or the web native lab notebook

At Science Online 09 and at the SMi Electronic Laboratory Notebook meeting in London later in January I talked about how laboratory notebooks might evolve. At Science Online 09 the session was about Open Notebook Science, and there I wanted to take the idea of what a “web native” lab record could look like and show that if you go down this road you will get the most out of it if you are open. At the ELN meeting, which was aimed mainly at traditional database-backed ELN systems for industry, I wanted to show the potential of a web native way of looking at the laboratory record, and in passing to show that these approaches work best when they are open, before beating a retreat back to the position of “but if you’re really paranoid you can implement all of this behind your firewall” so as not to scare them off too much. The talks are quite similar in outline and content and I wanted to work through some of the ideas here.

The central premise is one that is similar to that of many web-service start-ups: “The traditional paper notebook is to the fully integrated web based lab record as a card index is to Google”. Or to put it another way, if you think in a “web-native” way then you can look to leverage the power of interlinked networks, tagging, social network effects, and other things that don’t exist on a piece of paper, or indeed in most databases. This means stripping back the lab record to basics and re-imagining it as though it were built around web based functionality.

So what is a lab notebook? At core it is a journal of events, a record of what has happened. Very similar to a blog in many ways: an episodic record containing dates, times, and bits and pieces of often disparate material, cut and pasted into a paper notebook. It is interesting that in fact most people who keep online notebooks based on existing services use wikis rather than blogs. This is for a number of reasons: better user interfaces, sometimes better services and functionality, proper versioning, or just personal preference. But there is one thing that wikis tend to do very badly that I feel is crucial to thinking about the lab record in a web native way: they generate at best very ropey RSS feeds. Wikis are well suited to report writing and to formalising and sharing procedures, but they don’t make very good diaries. At the end of the day it ought to be possible to do clever things with a single back-end database being presented as both blog and wiki, but I’ve yet to see anything really impressive in this space, so for the moment I am going to stick with the idea of blog as lab notebook, because I want to focus on feeds.

So we have the idea of a blog as the record – a minute to minute and day to day record. We will assume we have a wonderful backend and API and a wide range of clients that suit different approaches to writing things down and different situations where this is being done. Witness, for instance, the plethora of clients for Twittering in every circumstance and mode of interaction. We’ll assume tagging functionality as well as key-value pairs that are exposed as microformats and RDF as appropriate, widgets for ontology look-up and autocompletion if desired, and the ability to automatically generate input forms from any formal description of what an experiment should look like. But above all, this will be exposed in a rich machine readable format in an RSS/Atom feed.

What we don’t need is the ability to upload data. Why not? Because we’re thinking web native. On a blog you don’t generally upload images and video directly; you host them on an appropriate service and embed them on the blog page. All of the issues are handled for you and a nice viewer is put in place. The hosting service is optimised for handling the kind of content you need: Flickr for photos, YouTube (Viddler, Bioscreencast) for video, Slideshare for presentations, and so on. In a properly built ecosystem there would be a data hosting service, ideally one optimised for your type of data, that would provide cut-and-paste embed codes providing the appropriate visualisations. The lab notebook only needs to point at the data; it doesn’t need to know anything much about that data beyond the fact that it is related to the stuff going on around it and that it comes with some HTML code to embed a visualisation of some sort.

That pointing is the next thing we need to think about. In the way I use the Chemtools LaBLog I use a one post, one item system. This means that every object gets its own post. Each sample, each bottle of material, has its own post and its own identity. This creates a network of posts that I have written about before. What it also means is that it is possible to apply PageRank-style algorithms, and link analysis more generally, to large quantities of posts. Most importantly it encodes the relationships between objects, samples, procedures, data, and analysis in the way the web is tooled up to understand: the relationships are encoded in links. This is a lightweight way of starting to build up a web of data – it doesn’t matter so much to start with whether this is in hardcore RDF, as long as there is enough contextual data to make it useful. Some tagging or key-value pairs would be a good start. Crucially, it also means that it doesn’t matter at all where our data files are, as long as we can point at them with sufficient precision.
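As a toy example of that kind of link analysis, here is PageRank run over a small, invented network of one-object-one-post records; in practice the edges would be extracted from the actual links between LaBLog posts.

```python
# A sketch of applying PageRank-style link analysis to a one-post-one-item
# record. The post network here is invented for illustration.
import networkx as nx

G = nx.DiGraph()
# post -> post it links to (procedure uses sample, data comes from procedure, ...)
G.add_edge("procedure/pcr-12", "sample/template-dna-17")
G.add_edge("procedure/pcr-12", "sample/primer-mix-3")
G.add_edge("data/gel12.png", "procedure/pcr-12")
G.add_edge("analysis/band-sizes", "data/gel12.png")

for post, score in sorted(nx.pagerank(G).items(), key=lambda kv: -kv[1]):
    print(f"{score:.3f}  {post}")
```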

But if we’re moving the data files off the main record, then what about the information about samples? Wouldn’t it be better to use the existing Laboratory Information Management System, or sample management system, or database? Well again, as long as you can point at each sample independently with the precision you need, it doesn’t matter. You can use a Google Spreadsheet if you want to – you can give a URL for each cell, and there is a powerful API that would let you build services to make putting the links in easy. We use the LaBLog to keep information on our samples because we have such a wide variety of different materials put to different uses that the flexibility of using that system, rather than a database with a defined schema, is important for our way of working. But for other people this may not be the case. It might even be better to use multiple different systems: a database for oligonucleotides, a spreadsheet for environmental samples, and a full-blown LIMS for barcoding and following the samples through preparation for sequencing. As long as it can be pointed at, it can be used. As with the data case, it is best to use the system that is best suited to the specific samples. These systems are better developed than they are for data, but many of the existing systems don’t allow a good way of pointing at specific samples from an external document, and very few make it possible to do this via a simple HTTP-compliant URL.

So we’ve passed off the data, and we’ve passed off the sample management. What we’re left with is the procedures, which after all are the core of the record, right? Well, no. Procedures are also just documents. Maybe they are text documents, but perhaps they are better expressed as spreadsheets or workflows (or rather the record of running a workflow). Again these may well be better handled by external services, be they word processors, spreadsheets, or specialist services. They just need to be somewhere we can point at.

What we are left with is the links themselves, arranged along a timeline. The laboratory record is reduced to a feed which describes the relationships between samples, procedures, and data. This could be a simple feed containing links, or a sophisticated and rich XML feed which points out in turn to one or more formal vocabularies that describe the semantic relationships between items. It can all be wired together, some parts less tightly coupled than others, but in principle it can at least be connected. And that takes us one significant step towards wiring up the data web that many of us dream of.

The beauty of this approach is that it doesn’t require users to shift from the applications and services that they already use, like, and understand. What it does require is intelligent and specific repositories for the objects they generate, repositories that know enough about the object type to provide useful information and context. It also requires good plugins, applications, and services to help people generate the lab record feed, and a minimal and arbitrarily extensible way of describing the relationships. This could be as simple as HTML links with tagging of the objects (once you know an object is a sample and it is linked to a procedure, you know a lot about what is going on), but there is a logic in having a minimal vocabulary that describes relationships (what you don’t know explicitly in the tagging version is whether the sample is an input or an output). But it can also be fully semantic if that is what people want. And while the loosely tagged material won’t be easily and tightly coupled to the fully semantic material, the connections will at least be there. A combination of both is not perfect, but it’s a step on the way towards the global data graph.

Friendfeed, lifestreaming, and workstreaming

As I mentioned a couple of weeks or so ago I’ve been playing around with Friendfeed. This is a ‘lifestreaming’ web service which allows you to aggregate ‘all’ of the content you are generating on the web into one place (see here for mine). This is interesting from my perspective because it maps well onto our ideas about generating multiple data streams from a research lab. This raw data then needs to be pulled together and turned into some sort of narrative description of what happened.

Slashing and hacking feeds

More follow-up on feeds. I’ve set up a Friendfeed, which is available here. Amongst other things I’ve been trying to aggregate my comments on other people’s blogs. The way I am doing this is by creating a tag in Google Reader which I have made public. When I leave a comment I try to subscribe to a comment feed for that blog, then tag it as ‘Comments feed’, which aggregates the feeds and makes them public as an RSS feed. I then suck that feed through Yahoo Pipes to find all the comments that are mine, which generates another feed, which I have included on my Friendfeed.
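For anyone curious, the filtering the Pipe does can be sketched in a few lines: merge the comment feeds and keep only the entries whose author matches mine. The feed URLs and author string below are placeholders.

```python
# A sketch of the Yahoo Pipes step described above: merge several comment
# feeds and keep only the comments written by a particular author.
import feedparser

COMMENT_FEEDS = [  # hypothetical comment feeds from blogs I comment on
    "http://example.blogspot.com/feeds/comments/default",
    "http://another.example.org/comments/feed",
]
ME = "Cameron Neylon"

def my_comments(feeds, author):
    for url in feeds:
        for entry in feedparser.parse(url).entries:
            if entry.get("author", "") == author:
                yield entry.title, entry.link

for title, link in my_comments(COMMENT_FEEDS, ME):
    print(title, link)
```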

So a couple of noob questions. Is there a better way to do this? Yes, I could edit the Pipe every time I leave a comment to add the new feed, and this would be neater, but it involves many more mouse clicks. It also means that any blog that doesn’t have an obvious comments feed (BBGM is the most obvious example, but also Open Reading Frame and a number of others) I can’t do this with. Including, as it happens, this blog (am I wrong about this, do we have a comments feed somewhere?). Can anyone advise whether there are hidden comment feeds on blogs that I am missing? Incidentally, for reference, the comments feed for any Blogspot blog can be found at xxx.blogspot.com/feeds/comments/default if it’s not immediately obvious.

I also do a few other things, for instance aggregating all the posts and comments on chemtools blogs that are mine. And naturally my lab book is available through the friendfeed, although at the moment that just shows that I haven’t done any real work for a while…

A (small) Feeding Frenzy

Following on from (but unrelated to) my post last week about feed tools we have two posts, one from Deepak Singh, and one from Neil Saunders, both talking about ‘friend feeds’ or ‘lifestreams’. The idea here is of aggregating all the content you are generating (or is being generated about you?) into one place. There are a couple of these about but the main ones seem to be Friendfeed and Profiliac. See Deepak’s post (or indeed his Friendfeed) for details of the conversations that can come out of these types of things.

What piqued my interest though was the comment Neil made at the bottom of his post about Workstreams.

Here’s a crazy idea – the workstream:

* Neil parsed SwissProt entry Q38897 using parser script swiss2features.pl
* Bob calculated all intersubunit contacts in PDB entry 2jdq using CCP4 package contact

This is exactly the kind of thing I was thinking about as the raw material for the aggregators that would suggest things that you ought to look at, whether it be a paper, a blog post, a person, or a specific experimental result. This type of system will rely absolutely on the willingness of people to make public what they are reading, doing, even perhaps thinking. Indeed I think this is the raw information that will make another one of Neil’s great suggestions feasible.

Following on from Neil’s post I had a short conversation with Alf in the comments about blogging (or Twittering) machines. Alf pointed out a really quite cool example. This is something that we are close to implementing in the open in the lab at RAL. We hope to have the autoclave, PCR machine, and balances all blogging out what they are seeing. This will generate a data feed that we can use to pull specific data items down into the LaBLog.

Perhaps more interesting is the idea of connecting this to people. At the moment the model is that the instruments are doing the blogging. This is probably a good way to go because it keeps a straightforward, identifiable data stream. Currently the trigger for the instruments to blog is a button. However at RAL we use RFID proximity cards for access to the buildings. This means we have an easy identifier for people, so what we aim to do is use the RFID card to trigger data collection (or data feeding).

If this could be captured and processed there is the potential to record a lot of the detail of what has happened in the laboratory. Combine this with a couple of Twitter posts giving a little more personal context and it may be possible to reconstruct a pretty complete record of what was done and precisely when. The primary benefit of this would be in troubleshooting, but if we could get a little bit of processing into this, and if there are specific actions with agreed labels, then it may be possible to automatically create a large portion of the lab book record.

This may be a great way of recording the kind of machine readable description of experiments that Jean-Claude has been posting about. Imagine a simplistic Twitter interface where you have a limited set of options (I am stirring, I am mixing, I am vortexing, I have run a TLC, I have added some compound). Combine this with a balance, a scanner, and a heating mantle which are blogging out what they are currently seeing, and a barcode reader (and printer) so as to identify what is being manipulated and which compound is which.
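A sketch of how simplistic that interface could be: a fixed vocabulary of actions, each logged as a structured, time-stamped record together with whatever barcode was scanned. The action names, record format, and destination are all invented for illustration; a real system would post the record out to the lab’s feed.

```python
# A toy version of the limited-option logging interface described above.
import json
from datetime import datetime, timezone

ACTIONS = {"stirring", "mixing", "vortexing", "ran TLC", "added compound"}

def log_action(who, action, barcode=None):
    """Return a structured, time-stamped record of a single lab action."""
    if action not in ACTIONS:
        raise ValueError(f"unknown action: {action}")
    record = {
        "who": who,                        # e.g. from the RFID proximity card
        "action": action,
        "sample": barcode,                 # from the barcode reader, if any
        "when": datetime.now(timezone.utc).isoformat(),
    }
    return json.dumps(record)

print(log_action("cn", "added compound", barcode="SAMPLE-0042"))
```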

One of the problems we have with our lab books is that they can never be detailed enough to capture everything that somebody might be interested in one day. However, at the same time, they are too detailed for easy reading by third parties. I think there is general agreement that on top of the lab book you need an interpretation layer, an extra blog that explains what is going on to the general public. Perhaps by capturing all the detailed bits automatically we can focus on planning and thinking about the experiments rather than worrying about how to capture everything manually. Then anyone can mash up the results, or the discussion, or the average speed of the stirrer bar, any way they like.