Third party data repositories – can we/should we trust them?

This is a case of a comment that got so long (and so late) that it probably merited its own post. David Crotty and Paul (Ling-Fung Tang) note some important caveats in comments on my last post about the idea of the “web native” lab notebook. I probably went a bit strong in that post with the idea of pushing content onto outside specialist services in my effort to explain the logic of the lab notebook as a feed. David makes an important point about any third party service (do read the whole comment at the post):

Wouldn’t such an approach either:
1) require a lab to make a heavy investment in online infrastructure and support personnel, or
2) rely very heavily on outside service providers for access and retention of one’s own data? […]

Any system that is going to see mass acceptance is going to have to give the user a great deal of control, and also provide complete and redundant levels of back-up of all content. If you’ve got data scattered all over a variety of services, and one goes down or out of business, does that mean having to revise all of those other services when/if the files are recovered?

This is a very wide problem that I’ve also seen in the context of the UK web community that supports higher education (see for example Brian Kelly’s risk assessment for use of third party web services). Is it smart, or even safe, to use third party services? The general question divides into two parts: is the service more or less reliable than your own hard drive or locally provided server capacity (technical reliability, or uptime); and how likely is the service to remain viable in the long term (business/social model reliability). Flickr probably has higher availability than your local institutional IT services but there is no guarantee that it will still be there tomorrow. This is why data portability is very important. If you can’t get your data out, don’t put it in there in the first place.

In the context of my previous post these data services could be local, they could be provided by the local institution, or by a local funder, or they could even be a hard disk in the lab. People are free to make those choices and to find the best balance of reliability, cost, and maintenance that suits them. My suspicion is that after a degree of consolidation we will start to see institutions offering local data repositories as well as specialised services on the cloud that can provide more specialised and exciting functionality. Ideally these could all talk to each other so that multiple copies are held in these various services.

David says:

I would worry about putting something as valuable as my own data into the “cloud” […]

I’d rather rely on an internally controlled system and not have to worry about the business model of Flickr or whether Google was going to pull the plug on a tool I regularly use. Perhaps the level to think on is that of a university, or company–could you set up a system for all labs within an institution that’s controlled (and heavily backed up) by that institution? Preferably something standardized to allow interaction between institutions.

Then again, given the experiences I’ve had with university IT departments, this might not be such a good approach after all.

Which I think encapsulates a lot of the debate. I actually have greater faith in Flickr keeping my pictures safe than my own hard disk. And more faith in both than in institutional repository systems that don’t currently provide good data functionality and that I don’t understand. But I wouldn’t trust any of them in isolation. The best situation is to have everything everywhere, using interchange standards to keep copies in different places: specialised services out on the cloud to provide functionality (not every institution will want to provide a visualisation service for XAFS data), IRs providing backup, archival, and server space for anything that doesn’t fit elsewhere, and ultimately still probably local hard disks for a lot of the short to medium term storage. My view is that the institution has the responsibility of aggregating, making available, and archiving the work of its staff, but I personally see this role as more harvester than service provider.
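To make the “everything everywhere” idea a little more concrete, here is a minimal sketch of keeping copies in several places and checking that they still match. It assumes each store (lab disk, institutional repository, cloud service) can be treated as a plain directory, which a real service with its own API would not allow directly. The function names `sha256_of`, `replicate`, and `audit` are hypothetical and chosen only for illustration.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate(source: Path, stores: list[Path]) -> dict[str, bool]:
    """Copy `source` into every store and report whether each copy verifies.

    Each store is assumed to be a directory standing in for a service.
    """
    expected = sha256_of(source)
    report = {}
    for store in stores:
        store.mkdir(parents=True, exist_ok=True)
        copy = store / source.name
        shutil.copy2(source, copy)
        report[str(store)] = sha256_of(copy) == expected
    return report


def audit(source: Path, stores: list[Path]) -> list[str]:
    """List the stores whose copy is missing or no longer matches `source`."""
    expected = sha256_of(source)
    stale = []
    for store in stores:
        copy = store / source.name
        if not copy.is_file() or sha256_of(copy) != expected:
            stale.append(str(store))
    return stale
```

Running `audit` on a schedule would flag a store that has lost or altered its copy while the other copies can still be used to restore it, which is the practical payoff of not trusting any one place in isolation.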

All of which will turn on the question of business models. If the data stores are local, what is the business model for archival? If they are institutional, how much faith do you have that the institution won’t close them down? And if they are commercial or non-profit third parties, or even directly government funded services, do the economics make sense in the long term? We need a shift in science funding if we want to archive and manage data in the longer term. And as with any market some services will rise and some will die. The money has to come from somewhere and ultimately that will always be the research funders. Until there is a stronger call from them for data preservation, and the resources to back it up, I don’t think we will see much interesting development. Some funders are pushing fairly hard in this direction so it will be interesting to see what develops. A lot will turn on who has the responsibility for ensuring data availability and sharing. The researcher? The institution? The funder?

In the end you get what you pay for. Always worth remembering that sometimes even things that are free at point of use aren’t worth the price you pay for them.

Friendfeed for scientists: What, why, and how?

There has been lots of interest amongst some parts of the community about what has been happening on FriendFeed. A growing number of people are signed up and lots of interesting conversations are happening. However it was suggested that as these groups grow they become harder to manage and the perceived barriers to entry get higher. So this is an attempt to provide a brief intro to FriendFeed for the scientist who may be interested in using it: what it is, why it is useful, and some suggestions on how to get involved without getting overwhelmed. These are entirely my views and your mileage may obviously vary.

What is FriendFeed?

FriendFeed is a ‘lifestreaming’ service or, more simply, a personal aggregator. It takes data streams that you generate and brings them all together into one place where people can see them. You choose which of the feeds you already generate to include (Flickr stream, blog posts, favorited YouTube videos, and lots of other integrated services). In addition you can post links to specific web pages or just comments into your stream. A number of these types of services have popped up in recent months, including Profilactic and Social Thing, but FriendFeed has two key aspects that have brought it to the fore. Firstly the commenting facilities enable rapid and effective conversations and secondly there was rapid adoption by a group of life scientists which has created a community. As with anything, some of the other services have advantages and probably have their own communities, but for science and in particular the life sciences FriendFeed is where it is at.

My FriendFeed

As well as allowing other people to look at what you have been doing, FriendFeed allows you to subscribe to other people and see what they have been doing. You have the option of ‘liking’ particular items and commenting on them. In addition to seeing the items of your friends (the people you are subscribed to) you also see items that they have liked or commented on. This helps you to find new people you may be interested in following. It also helps people to find you. In addition, items with comments or likes get popped up to the top of the feed, so items that are generating a conversation keep coming back to your attention.

These conversations can happen very fast. Some conversations balloon within minutes; most take place at a more sedate pace over a couple of hours or days, but it is important to be aware that many people are live most of the time.

Why is FriendFeed useful?

So how is FriendFeed useful to a scientist? First and foremost it is a great way of getting rapid notification of interesting content from people you trust. Obviously this depends on there being people who are interested in the same kinds of things that you are, but this is something that will grow as the community grows. A number of FriendFeed users stream their del.icio.us bookmarks as well as papers or web articles they have added to citeulike or Connotea, or simply shared in Google Reader. You can also get information that people have shared on opportunities, meetings, or just interesting material on the web. Think of it as an informal but continually running journal club – always searching for the next thing you will need to know about.

Notifications of interesting material on friendfeed

But FriendFeed is about much more than finding things on the web. One of its most powerful features is the conversations that can take place. Queries can be answered very rapidly, going some way towards making possible the rapid formation of collaborative networks that can come together to solve a specific problem. It’s not there yet but there are a growing number of examples where specific ideas were encouraged or developed, or problems solved quickly, by bringing the right expertise to bear.

One example is shown in the following figure where I was looking for some help in building a particular protein model for a proposal. I didn’t really know how to go about this and didn’t have the appropriate software to hand. Pawel Szczesny offered to help and was able to quickly come up with what I wanted. In the future we hope to generate data which Pawel may be able to help us analyse. You can see the whole story and how it unfolded after this at http://friendfeed.com/search?q=mthk

MthK model by Friendfeed

We are still a long way from the dream of just putting out a request and getting an answer but it is worth pointing out that the whole exchange here lasted about four hours. Other collaborative efforts have also formed, most recently leading to the formation of BioGang, a collaborative area for people to work up and comment on possible projects.

So how do I use it? Will I be able to cope?

FriendFeed can be as high volume as you want it to be but if it’s going to be useful to you it has to be manageable. If you’re the kind of person who already manages 300 RSS feeds, your twitter account, Facebook and everything else then you’ll be fine. In fact you’re probably already there. For those of you who are looking for something a little less high intensity the following advice may be helpful.

  1. Pick a small number of your existing feeds as a starting point to see what you feel comfortable with sharing. Be aware that if you share e.g. Flickr or YouTube feeds it will also include your favourites, including old ones. Do share something – even if only some links – otherwise people won’t know that you’re there.
  2. Subscribe to someone you know and trust and stick with just one or two people for a while as you get to understand how things work. As you see extra stuff coming in from other people (friends of your friends) start to subscribe to one or two of them that you think look interesting. Do not subscribe to Robert Scoble if you don’t want to get swamped.
  3. Use the hide button. You probably don’t need to know about everyone’s favourite heavy metal bands (or perhaps you do). The hide button can get rid of a specific service from a specific person, but you can set it so that you still see an item if other people like it.
  4. Don’t worry if you can’t keep up. Using the Best of the Day/Week/Month button will let you catch up on what people thought was important.
  5. Find a schedule that suits you and stick to it. While the current users are dominated by the ‘always on’ brigade that doesn’t mean you need to do it the same way. But also don’t feel that because you came in late you can’t comment. It may just be that you are needed to kick that conversation back onto some people’s front page.
  6. Join the Life Scientists Room and share interesting stuff. This provides a place to put particularly interesting links and is followed by a fair number of people, probably more than you are. If it is worthy of comment then put it in front of people. If you aren’t sure whether it’s relevant, ask; you can always start a new room if need be.
  7. Enjoy, comment and participate in a way you feel comfortable with. This is a (potential) work tool. If it works for you, great! If not well so be it – there’ll be another one along in a minute.

Friendfeed, lifestreaming, and workstreaming

As I mentioned a couple of weeks or so ago I’ve been playing around with Friendfeed. This is a ‘lifestreaming’ web service which allows you to aggregate ‘all’ of the content you are generating on the web into one place (see here for mine). This is interesting from my perspective because it maps well onto our ideas about generating multiple data streams from a research lab. This raw data then needs to be pulled together and turned into some sort of narrative description of what happened.