How I got into open science – a tale of opportunism and serendipity

So Michael Nielsen, one morning at breakfast at Scifoo, asked one of those questions that never have a short answer: 'So how did you get into this open science thing?' I realised that although I have told the story to many people I have never written it down. Perhaps this is a meme worth exploring more generally, but I thought others might be interested in my story, partly because it illustrates how funding drives scientists, and partly because it shows how opportunism and serendipity can be successful bedfellows.

In late 2004 I was spending a lot of my time on the management of a large collaborative research project and had had a run of my own grant proposals rejected. I had a student interested in doing a PhD but no direct access to funds to cover the consumables cost of the proposed project. Jeremy Frey had been on at me for a while to look at implementing the electronic lab notebook system whose development he had led, and at the critical moment he pointed out a special call from the BBSRC for small projects to prototype, develop, or implement e-science technologies in the biological sciences. It was a light-touch review process and a relatively short application. More to the point, it was a way of funding some consumables.

So the grant was written. I wrote the majority of it, which makes for somewhat interesting reading in retrospect. I didn't really know what I was talking about at the time (which seems to be a theme with my successful grants). The original plan was to use the existing, fully semantic, RDF-backed electronic lab notebook and develop models for its use in a standard biochemistry lab. We would then develop systems to extract a relational database from the RDF representation and present this on the web.

The grant was successful but the start was delayed, due both to shenanigans over the studentship that was going to support the grant and to the move of part of the large project to another institution along with one of the investigators. Partly because of the resulting mess I applied for the job I ultimately accepted at RAL and, after some negotiation, organised an 80:20 split between RAL and Southampton.

By the time we had a student in place and had got the grant started it was clear that the existing semantic ELN was not in a state that would enable us to implement new models for our experiments. However, by this stage a blog system had been developed in Jeremy's group, and it seemed an interesting experiment to use this as a notebook: almost the precise opposite of the RDF-backed ELN. Looking back at it now I would describe it as taking the opportunity to compare a Web 2.0 approach to the notebook with a Web 3.0 approach, but bear in mind that at the time I had little or no idea what these terms meant, let alone the care with which they need to be used.

The blog-based system was great for me because it meant I could follow the student's work online, and in doing so I gradually became aware of blogs in general and of feed readers. The RSS feed of the LaBLog was a great help, as it made following the details of experiments remotely straightforward. This was important because by now I was spending three or four days a week at RAL while the student was based in Southampton. As we started to use the blog, at first in a very naïve way, we found problems and issues which ultimately led us to think about and design the organisational approach I have written about elsewhere [1, 2]. By this stage I had started to look at other services online and was playing around with OpenWetWare and a few other services, becoming vaguely aware of Creative Commons licenses and getting a grip on the Web 2.0 versus Web 3.0 debate.

To implement our newly designed approach to organising the LaBLog we decided the student would start afresh with a clean slate in a new blog. By this stage I was using the blog for other things and had discovered that the ID authentication we were using didn't always work through the RAL firewall. I ended up needing complicated VPN setups, particularly when working from home, where I couldn't log on to the blog and have my email online at the same time. This, obviously, was a pain, and as we were moving to a new blog which could have new security settings I said, 'stuff it, let's just make it completely visible and be done with it'.

So there you go. The critical decision to move to Open Notebook status was taken as the result of a firewall. Serendipity, or at least the effect of outside pressures, was what made it happen. I would like to say it was a carefully thought out philosophical decision but, although my awareness of the open access movement, Creative Commons, OpenWetWare, and others no doubt prepared the background that led me to think down that route, it was essentially the result of frustration.

So, so far, opportunism and serendipity, which brings us back to opportunism again, or at least to seizing an opportunity. Having made the decision to 'go open', two things clicked in my mind. Firstly, the fact that this was rather radical. Secondly, the fact that all of these Web 2.0 tools combined with an open approach could lead to a marked improvement in the efficiency of collaborative science, a kind of 'Science 2.0' [yes, I know, don't laugh, this would have been around March 2007]. Here was an opportunity to get my name on a really novel and revolutionary concept! A quick Google search revealed that, funnily enough, I wasn't the first person to think of this (yes! I'd been scooped!), but more importantly it led me to what I think ought to be three of the Standard Works of Open Science: Bill Hooker's three-part series on Open Science at 3 Quarks Daily [1, 2, 3], Jean-Claude Bradley's presentation on Open Notebook Science at Nature Precedings (and the associated original blog post coining the term), and Deepak Singh's talk on Open Science at Ignite Seattle. From there I was inspired to seize the opportunity, get a blog of my own, and get involved. The rest of my story, so far, is more or less available online here and via the usual sources.

Which leads me to ask: what got you involved in the 'open' movement? What, for you, were the 'primary texts' of open science and open research? There is value in recording this, or at least our memories of it, for ourselves, to learn from our mistakes and perhaps discern the direction going forward. Perhaps it isn't even too self-serving to think of it as history in the making. Or perhaps, more in line with our own aims as 'open scientists', we would simply be doing a poor job if we didn't record what brought us to where we are and what is influencing our thinking going forward. I think the blogosphere does a pretty good job of the latter, but perhaps a little more recording of the former would be helpful.

6 Replies to “How I got into open science – a tale of opportunism and serendipity”

  1. Great to meet you yesterday, Cameron. In the spirit of your call for a historical perspective, here is a distillation of what you and other science bloggers said about online notebooks between February and June of this year — the period when that was a major focus of what you were discussing. I put it together originally for my own benefit – to help me make sense of the key themes that were emerging in the discussion and to tease out the different perspectives that different people had on different aspects of the issue.

    Although it is all things that you and the others said (and in that sense obviously old hat), you might find it interesting to see someone else’s efforts to organize what I saw as the key threads in a discussion that ranged over a variety of blogs into a ‘coherent’ (!) dialog.

    The title I gave my notes was:

    The Science Bloggers on the Future of Research in Biology and Chemistry (February – June 2008)

    The topic headings are

    1. Objective/focus
    2. Vision for what ‘it’ looks like: From ELN to ‘Authoring Tool’ to ‘Virtual Research Environment’
    3. How to get started/build it
    4. Technical/desired features and capabilities
    5. Databases and combining the ‘ELN’ with a database
    6. Data aggregation, automation and lifestreaming
    7. Adoption: Barriers and drivers
    8. Attitudes to commercial ELNs, other products and approaches

    And here is the entire distillation:

    The Science Bloggers on the Future of Research in Biology and Chemistry (February – June 2008)

    1. Objective/Focus

    Martin Fenner

    At least three different goals here: 1) creating a common data format, 2) building a software platform that does these things, and 3) having most or all data openly available. 3) is a desired goal, but is not necessarily connected to 1) and 2). You propose an open repository, but technically there is no reason to do this (although it might be easier).

    Interesting that most of the suggestions are about data aggregation. And David’s detailed blog post has some promising examples. My own thinking was rather along the lines of facilitating the daily tasks of a scientist: doing experiments, searching the literature, writing papers, going to seminars, etc.

    Cameron Neylon

    It's true that there is no technical reason for making the repositories open; it just happens that that is my agenda. For me, that is the end game. My aim is to make these tools available because they will help to expand the data commons. It is the case, however, that by making these tools you are also making a good data storage and processing system for a closed repository. And someone could make money from that as a way to recoup the development costs.

    We use blogs to record experiments, and while that is at a very crude level at the moment you can see where it is going. The key for me, as David says, is bringing it all together and processing it where and how you like. That's why I focus on aggregation (of anything: what's going on, who's said what, where I'm supposed to be [oops, in the lab :)], and what people are writing and thinking). These kinds of things work, but the filtering bit is where the real work needs to be done in my view.

    Jean-Claude Bradley

    If I dare speak for Cameron – openness is central to his objectives. . . . Some of us think that there will be a qualitative change in what science can accomplish and how it gets done when data are truly open and free. Distributed intelligence is extremely difficult with limited access to information.

    Deepak Singh

    Where are the mashups (something like EpiSpider) that make our lives better, where are the APIs that make it easy for us to use the web as a platform?

    David Crotty

    Focus on creating tools that increase efficiency, rather than just copying things that have been popular in other arenas (MySpace).

    Neil Saunders

    You might argue that web 2.0 is a “way of doing”, rather than software per se. However, it’s apps that make things happen. What we need is more input from biologists. Web programmers have no end of ideas for apps, but they need to be told what’s useful and what isn’t. Alternatively, biologists need to program and as you suggest, it doesn’t take too much work to get your head around an API, read a few tutorials and give it a go yourself.

    My personal view is that we need less social, “Facebook for scientists” ideas and more apps aimed at data: aggregation, interoperability, transfer, search, online analysis.

    2. Vision for what ‘it’ looks like: From ELN to ‘Authoring Tool’ to ‘Virtual Research Environment’

    Neil

    All experiments share common features such as: (1) a date/time when they were performed; (2) an aim ("generate PCR product", "run crystal screen for protein X"); (3) the use of protocols and instruments; (4) a result (correct size band on a gel, crystals in well plate A2). The only free-form part is the interpretation. Is the result good, bad, expected? What to do next? My simplistic view is that an XML element named "notes" of data type "string" covers anything free-form that somebody might want to say about their experiment. Now we just have to design the schema, build a nice forms-based web interface and force everyone in the lab to use it. (March 08)

    Martin, indeed, many of the popular tools are very useful for science and perhaps one goal is to use those APIs to bring them together. You can imagine an electronic lab notebook cobbled together from an online wiki, editor, calendar and source code repository, for example. I'm hoping to see some great science apps out of Google App Engine any day now… http://network.nature.com/blogs/user/mfenner/2008/05/10/web-2-0-for-scientists-where-are-the-applications
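
    (To make the schema idea above concrete for myself, here is a minimal sketch in Python of what such a record might look like: a few structured elements plus a single free-form "notes" string. The element names and values are my own invented examples, not a schema anyone in the discussion proposed.)

        import xml.etree.ElementTree as ET
        from datetime import datetime

        exp = ET.Element("experiment")
        ET.SubElement(exp, "performed").text = datetime(2008, 3, 14, 10, 30).isoformat()
        ET.SubElement(exp, "aim").text = "generate PCR product"
        ET.SubElement(exp, "protocol").text = "standard 30-cycle PCR"
        ET.SubElement(exp, "instrument").text = "thermocycler 2"
        ET.SubElement(exp, "result").text = "correct size band on a gel"
        # The one free-form part: a plain string element for anything else.
        ET.SubElement(exp, "notes").text = "Band a little faint; repeat with more template?"
        print(ET.tostring(exp, encoding="unicode"))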

    Cameron

    What I want here is a marker on a web document that says 'I am a scientific experiment' ('page' was a poor term to use – I simply mean any web document, generally accessed through discrete web pages). This will allow aggregation and distribution of the notebook a la PostGenomic or Chemical Blogspace. To me this is more important than the format of the underlying data.

    I lean towards lightweight approaches and very free form capture.

    What we need are systems that help us capture what happened (which is much more than the data) and then help us to report it in a coherent way.

    I believe we can propose something much more ambitious with the tools we have available. We can build the systems, or at least the prototypes of systems, that will be so useful for recording and sharing data that people will want to use them . . .

    The fundamental problem is how to capture the data and associated metadata, and do the wrapping, in a way that is easy and convenient enough that scientists will actually do it . . .

    We need to build a lab book system, or a virtual research environment if you prefer, because this will encompass much more than the traditional paper notebook. It will also encompass much more than the current crop of electronic lab notebooks. . . . this system will consist of a feed aggregator (not unlike FriendFeed) where the user can subscribe to various data streams, some from people, some from instruments or sensors, and some that they have generated themselves: pictures at Flickr, comments on Twitter. The lab book component itself will look more or less like a word processor, or a blog, or a wiki, depending on what the user wants. In this system the user will generate documents (or pages); some will describe samples, some procedures, and some ideas. Critically, this authoring system will have an open plugin architecture supported by an open source community that enables the user to grab extra functionality automatically. . .

    This environment will have a proper versioning system, web integration to push documents into the wider world, and a WYSIWYG interface. It will expose human-readable and machine-readable versions of the files, and will almost certainly use an underlying XML document format. It will probably expose microformats, RDF, and free-text tags.
    The system at core is an authoring tool (or tools) that can access a common plugin architecture. These tools do two things. They take the record that we create by ordering the datastreams we generate, wrap it, and place it in an appropriate repository or repositories. And they also enable us to do the same thing with external datastreams (literature, data, social network tips, conversations etc). It is both an authoring and planning tool, and a layer through which we can interact with our own data (wherever it happens to be) and anyone else's data (wherever it happens to be) . . .
    In my mind it looks a bit like the offspring of a liaison between Word and Google Home page but I think it's important that it should be flexible. Ideally it should work as a layer on top of anything you want: Word, wiki, blog, text editor. It's just a set of panels in your web browser if you like. . . .
    The ideal situation would be everything loosely coupled, with people able to use whatever authoring and ordering tools they like. This is at the core of the 'web as platform' concept.
    Our lab book is not quite unstructured; it does have metadata and it does define different aspects of the process. We are also very much in favour of the use of templates for repeated processes, which can be used to structure specific aspects of the capture process. Don't get too focussed on the notion that a wiki or blog is free text. It is just like FuGE. It can be completely free or it can be a highly structured XML document. It depends on how you are interacting with it and what layers are between you and the underlying database.

    The key to our blog system is that the structure is arbitrary; the system itself is semantically unaware. How you choose to process the information is up to you. New keys and key values can be added at any time. We have found this crucial, and it is this which makes me sceptical about any semantically structured data model in this context.
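
    (An aside from me: a tiny sketch of how I read this 'semantically unaware' key-value idea. The keys and values are invented examples; the point is that new keys can appear at any time and structure is imposed only when the posts are processed.)

        posts = [
            {"title": "PCR of construct 12", "post_type": "procedure"},
            {"title": "Oligo batch 0805", "post_type": "material",
             "material_type": "oligonucleotide", "supplier": "ACME"},  # new keys, no schema change
            {"title": "SDS-PAGE of fraction 3", "post_type": "analysis", "gel_percentage": "12%"},
        ]
        # Structure only at processing time, e.g. pulling a catalogue of oligos out of the notebook:
        oligos = [p["title"] for p in posts if p.get("material_type") == "oligonucleotide"]
        print(oligos)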

    Martin

    What will your proposed system look like at its core? Is it a wiki? Or a platform with plugins à la Facebook? Or a set of network protocols used by many different applications?

    Jean-Claude

    There are already many commercial Electronic Notebook Systems and Content Management Systems out there but they are designed with precisely the opposite intent: to keep people out.
    This was made very clear a few weeks ago when I met with a scientific software distributor. There was not even an option to store data in anything other than an encrypted, proprietary format.

    3. How to get started/build it

    Martin

    Where do we start? Probably with one of the existing tools you mentioned. But you could also start with a publishing platform such as TOPAZ, used by PLoS, and then work your way backwards.
    We can wait for one of the big players or a clever startup to have a great idea. But one of the attractive features of Web 2.0 is user participation. We need more discussions between scientists and software developers on what is needed and what can be done. These discussions are of course already taking place, but science bloggers can do more to collect interesting ideas and articulate them. We want the integration of reference managers in online writing tools such as Google Docs or Buzzword, but how do we make our voice heard?
    Secondly, we can write applications ourselves. The barriers to entry have become really low, and one reason is the APIs (application programming interfaces) of both science applications and conventional Web 2.0 apps:
    • Pubmed
    • Scopus
    • Connotea
    • Google Maps
    • Google Docs
    • YouTube
    • Flickr
    • Technorati
    • Facebook
    • eBay
    • Amazon
    • Yahoo

    Deepak

    I think the key is to keep it simple. There is a lot that can be done, but we need to go with the basic building blocks and use the existing fabric.

    Cameron

    Deepak, absolutely it needs to be kept simple and as far as possible it should involve tools that either people are already using or that are very easy to transfer to. I’m thinking less of an application and more of a platform that gives people a way of pulling things together.

    4. Technical/Desired features and capabilities

    Cameron

    One of the problems we have with our lab books is that they can never be detailed enough to capture everything that somebody might be interested in one day. However at the same time they are too detailed for easy reading by third parties. I think there is general agreement that on top of the lab book you need an interpretation layer, an extra blog that explains what is going on to the general public. Perhaps by capturing all the detailed bits automatically we can focus on planning and thinking about the experiments rather than worrying about how to capture everything manually. Then anyone can mash up the results, or the discussion, or the average speed of the stirrer bar, any way they like.

    Another issue is the tables. My original thinking was that if we had a data model for tables then most of our problems would go away. If there was a bit of code that could be used to describe the columns and rows then this would enable both the presentation of a table to input data and the rendering of the table once complete. This would reduce the 'time suck' associated with data entry. However there are problems with this data model being associated with the rendering of the table. Some people like tables going one way, some another. There are many different possible formats for representing the same information and some are better for some purposes than others. Therefore the data model for the table should encode the meaning behind the table, usually input materials (or data), amounts, and products (materials or data). . . .
    The table: again, this is a central user interface issue for us. Capturing an experiment in the wet lab, whether noting it as it happens or planning what you are going to do in advance, is often most easily done with a table. Tables are not well implemented in the wiki and blog frameworks we are using for these systems. Therefore providing a table to capture the experiment is critical if you actually want anyone to use your system. Our users consistently identify this as the single biggest barrier to them using our system.
    Now the heavyweight approach to this is to say: 'That's why you need a data model. Once you have that you can generate a nice web form to capture the necessary data.' The problem with this comes when you do something slightly different. As an example, I had a template set up in our system for capturing the setup of SDS-PAGE gels. This would go and look for anything that was tagged as 'protein' as a potential sample and present these in a drop-down menu. This was fine until the day I wanted to run a DNA-protein conjugate on the gel. Essentially I had broken my own data model. This could be fixed, and I did fix it, by changing the way my template looked for potential samples. But in the cut and thrust of real lab work (as opposed to an academic pottering under sufferance of his students) this isn't feasible. We can't extend the data model every time we do something new – we are always doing something new.
    Bottom line: tables are a familiar and easy way for users to structure information visually. If you can't provide them, people will walk away.
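
    (My own illustration of the point about tables: model the meaning, i.e. inputs, amounts and products, rather than any particular rendering, and let the orientation be chosen per reader. The class and field names are mine, not part of any LaBLog data model.)

        from dataclasses import dataclass

        @dataclass
        class TableEntry:
            input_material: str  # material or data going in
            amount: str          # quantity, with units kept as plain text
            product: str         # material or data coming out

        entries = [
            TableEntry("plasmid pET-28a", "50 ng", "digest D1"),
            TableEntry("enzyme NdeI", "1 uL", "digest D1"),
        ]

        # Render one way (one row per entry)...
        for e in entries:
            print(f"{e.input_material:>18} | {e.amount:>6} | {e.product}")

        # ...or transpose for readers who prefer the other orientation; the model doesn't care.
        print(list(zip(*[(e.input_material, e.amount, e.product) for e in entries])))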

    5. Databases and Combining the ‘ELN’ with a database

    Cameron

    One of the features that people liked about our blog-based framework at the workshop last week was the idea that they got a catalogue of various different items (chemicals, oligonucleotides, compound types) for free once the information was in the system and properly tagged. Now this is true, but you don't get the full benefits of a database for searching, organisation, presentation etc. We have been using DabbleDB to handle a database of lab materials, and one of our future goals has been to automatically update the database. What I hadn't realised before last week was the potential to use user-configured RSS feeds to set up multiple databases within DabbleDB and so provide a more sophisticated laboratory stocks database.

    DabbleDB can be set up to read RSS or other XML or JSON feeds to update itself, as was pointed out to me by Lucy Powers at the workshop. To update a database all we need is a properly configured RSS feed. As long as our templates are stable the rest of the process is reasonably straightforward, and we can generate databases of materials of all sorts along with expiry dates, lot numbers, ID numbers, safety data etc. The key to this is rich feeds that carry as much information as possible, and in particular as much of the information we have chosen to structure as possible. We don't even need the feeds to be user-configurable within the system itself, as we can use Pipes to easily configure custom feeds.

    DabbleDB (dabbledb.com) is a not-too-bad point-and-click web-based database system that can upload Excel spreadsheets and then do some simple search and relationship-type queries on the data. You might find that a reasonable halfway house en route. We use it for our stocks and strains database (neylonlaboratory.dabbledb.com – I can give you access to the underlying database if you want a look).
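
    (A sketch from me of what a 'rich' feed item carrying the structured fields might look like, built here in Python. The namespace and field names are invented for illustration; the point is only that the feed carries whatever structured information the database should map onto columns.)

        import xml.etree.ElementTree as ET

        LAB = "http://example.org/labstock"  # hypothetical namespace for the structured fields
        ET.register_namespace("lab", LAB)

        item = ET.Element("item")
        ET.SubElement(item, "title").text = "Ampicillin stock, 100 mg/mL"
        ET.SubElement(item, "link").text = "http://example.org/lablog/post/1234"
        ET.SubElement(item, f"{{{LAB}}}lot_number").text = "A-2008-017"
        ET.SubElement(item, f"{{{LAB}}}expiry").text = "2009-06-30"
        ET.SubElement(item, f"{{{LAB}}}safety").text = "Irritant; store at -20 C"
        print(ET.tostring(item, encoding="unicode"))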

    Neil

    Oh boy, there’s a lot to get through here. Starting with “what Pierre said”. It’s entirely possible and indeed preferable to analyse genome-wide screens without Excel spreadsheets.
    It’s natural to look at data and try to force it into the tools with which we’re familiar. You see rows and columns, you think “spreadsheet”. Other people see rows and columns, they think “database table”, or “R data frame”. First point then: if you know that there’s a better way to perform a task, don’t you owe it to yourself to at least try and find out about it?
    To answer Jennifer’s questions: getting your data into a database table (such as MySQL) is incredibly beneficial, because it allows you to query it in all sorts of ways. So yes, you would be able to ask questions such as “show me all orthologs of gene X from species Y and list the paralogs too”. If your data are structured, designing relevant queries is easier.
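
    (To illustrate the kind of query Neil means, here is a toy example using SQLite so it runs anywhere. The table layout is invented for the example, not a recommended schema.)

        import sqlite3

        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE relation (gene TEXT, related TEXT, species TEXT, kind TEXT)")
        db.executemany(
            "INSERT INTO relation VALUES (?, ?, ?, ?)",
            [
                ("geneX", "geneX_y1", "speciesY", "ortholog"),
                ("geneX", "geneX_y2", "speciesY", "paralog"),
                ("geneX", "geneX_z1", "speciesZ", "ortholog"),
            ],
        )
        # "Show me all orthologs of gene X from species Y and list the paralogs too."
        for related, kind in db.execute(
            "SELECT related, kind FROM relation WHERE gene = ? AND species = ?",
            ("geneX", "speciesY"),
        ):
            print(kind, related)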

    Deepak

    The challenge is always authoring tools. People don't necessarily need to learn SQL, but they do need to understand how to conceptualize queries and relationships.
    Can life scientists get by using Excel? No, it is far too limited, designed for finance people and not for complex scientific data mining. On the other hand, with tools like DabbleDB and Blist you have the ability to use Excel-like front ends, i.e. familiar UIs with relational backends that permit querying. These tools are still young, but you get the picture.
    Of course, we could do a much better job of building applications that we can mine through simple query builders. That’s where Cameron’s notions of collaboration come into play. You don’t have to have a formal collaboration. There are enough people out there who will do it for you.

    6. Data aggregation, Automation and Lifestreaming

    Neil

    particularly with respect to tracking people in the lab and connecting a datafeed from an instrument (say a PCR machine or an autoclave) to a person. The guys at Southampton have already put together the blogging lab (see http://chemtools.chem.soton.ac.uk/projects/blog/blogs.php/blog_id/24 for an example). Now if we could connect that up to, say, a Twitter feed ('I am doing a PCR reaction: date, time') then we may be able to reconstruct what happens in the lab at a very detailed level. We are thinking of applications in safety (recording that the autoclave is working properly) but this will also be great for general troubleshooting (did the PCR machine do what it was supposed to, has someone fiddled with my programme, what was the temperature in the incubator, etc.)
    In many ways the technical problems of providing data feeds are largely solved. There are two outstanding problems: getting more people to generate interesting feeds, and developing clever, easy-to-use tools to filter them (which might well involve building better tools that help scientists to generate feeds with proper tagging).
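
    (A small sketch of the filtering half of that problem as I understand it: pull a couple of feeds and keep only the entries tagged 'PCR'. It uses the third-party feedparser package, and the feed URLs and the tag are placeholders.)

        import feedparser  # third-party package

        FEEDS = [
            "http://example.org/lablog/feed",       # hypothetical notebook feed
            "http://example.org/instruments/feed",  # hypothetical instrument feed
        ]

        for url in FEEDS:
            for entry in feedparser.parse(url).entries:
                tags = {t.get("term", "") for t in entry.get("tags", [])}
                if "PCR" in tags:
                    print(entry.get("published", "?"), entry.get("title", "(untitled)"))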

    Martin

    When I can query the data web with a question like ‘what is the relationship between protein purity and crystal size’, or ‘has anyone made a novel compound that might bind to my drug target based on an in silico screen’, or as Neil says ‘These people also searched for GO accession GO:0050421 (nitrite reductase (cytochrome) activity)’ then we will have the tools that make ‘Web2.0’ relevant for science. . . .

    Cameron

    We use blogs to record experiments, and while that is at a very crude level at the moment you can see where it is going. The key for me, as David says, is bringing it all together and processing it where and how you like. That's why I focus on aggregation (of anything: what's going on, who's said what, where I'm supposed to be [oops, in the lab :)], and what people are writing and thinking). These kinds of things work, but the filtering bit is where the real work needs to be done in my view.

    The system at core is an authoring tool (or tools) that can access a common plugin architecture. These tools do two things. They take the record that we create by ordering the datastreams we generate, wrap it, and place it in an appropriate repository or repositories. And they also enable us to do the same thing with external datastreams (literature, data, social network tips, conversations etc). It is both an authoring and planning tool, and a layer through which we can interact with our own data (wherever it happens to be) and anyone else's data (wherever it happens to be) . . .

    Neil

    Perhaps the web at large defines “social” differently to scientists. Most social web2.0 apps mean “keeping in touch with your friends”. Perhaps web2.0 for scientists apps means “keeping in touch with data”.

    Michael Barton

    Combined with an RSS WordPress plugin, my most recent research activity from Twitter is displayed as a stream on my blog. Taking this a step further, I included feeds for my research-tagged figures on Flickr, my paper bibliography on CiteULike, and discussion of my research on my blog. This stream is available at http://www.michaelbarton.me.uk/research-stream/, and shows the general idea of what I'm trying to do. I like this because in bioinformatics it's sometimes difficult to know what other people are doing, but now I hope that other people in my group can have a quick glance to see what I've been up to. Furthermore this all works passively: I'm already using these services in my research anyway, and the only thing I had to do was use Yahoo Pipes to aggregate the already existing information.
    Because bioinformatics work is amenable to being displayed, shared, and edited on the web I think that the field should be at the bleeding edge of using Web 2.0 services like this.
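
    (And a rough code equivalent of the Yahoo Pipes aggregation Michael describes: merge several service feeds into one stream, newest first. Again feedparser is a third-party package and the URLs are placeholders for the Twitter, Flickr and CiteULike feeds.)

        import feedparser  # third-party package

        FEEDS = [
            "http://example.org/twitter.rss",
            "http://example.org/flickr.rss",
            "http://example.org/citeulike.rss",
        ]

        entries = []
        for url in FEEDS:
            entries.extend(e for e in feedparser.parse(url).entries if "published_parsed" in e)

        entries.sort(key=lambda e: e.published_parsed, reverse=True)  # newest first
        for e in entries[:20]:
            print(e.published, "-", e.title)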

    Martin

    MySpace Data Availability, Facebook Connect and Google Friend Connect were announced just in the last few days and also promise better data integration.
    My suggestion: we hold a Yahoo Pipes contest for the most interesting science mashup.

    7. Adoption: Barriers and Drivers

    Cameron

    I don’t think we’ll convince the general science community to buy in unless they see the network effects that we know and love in Web2.0. And I don’t think we’ll see that until the data is available.

    Neil

    My experience of discussing electronic lab notebooks, which is mostly from biochemistry/molecular biology labs, is that many biologists are quite resistant to the idea of structured data. I think one reason that the paper notebook persists is that people like free-form notes. You may believe that a lab notebook is a highly-ordered record of experiments but trust me, it’s not uncommon to see notes such as “Bollocks! Failed again! I’m so sick of this purification…” scrawled in the margins.
    My take on the problem is that biologists spend a lot of time generating, analysing and presenting data, but they don’t spend much time thinking about the nature of their data. When people bring me data for analysis I ask questions such as: what kind of data is this? ASCII text? Binary images? Is it delimited? Can we use primary keys? Not surprisingly this is usually met with blank stares, followed by “well…I ran a gel…”.

    David Crotty

    “I can barely keep up with . . . what my labmates are doing.”

    Also consider ways for them to export their information from your site. If you can provide functionality so the effort they put in at your site can be re-used elsewhere, your site suddenly becomes much more attractive.

    For the lab notebook – which is easier: going to the lab's shared drive, grabbing a Word document used to keep the protocols and annotating as needed, or going to the Wiki site, creating an account, signing in and editing there? Add in the worries of publicly exposing that Wiki both to annoying trolls (easily fixed, but annoying nonetheless) and, more importantly, to competing labs. Factor in the need for someone to monitor and do upkeep on the Wiki. How is the Wiki any better than the Word document for an average lab member? Why should they spend the time and effort learning something new if the payoff is not going to be significant?

    Heather Etchevers

    None of my lab mates wants to bother, and I don’t have the time to build the entire site single-handedly. What do they not want to bother with?
    • learning wiki (or html, or any other) markup – despite what I consider some enormous advantages
    • writing or retyping lab protocols – I insist here, but it's always like pulling teeth and is always done in Word
    • updating information in writing that they can get orally from the other end of the bench from one another or by the phone from me
    • potentially revealing scoopable information to other competitor labs (despite the fact that no one can really directly compete with us anyhow on most of our subjects)

    8. Attitudes to commercial ELNs, other products and approaches

    Jean Claude

    There are already many commercial Electronic Notebook Systems and Content Management Systems out there but they are designed with precisely the opposite intent: to keep people out.
    This was made very clear a few weeks ago when I met with a scientific software distributor. There was not even an option to store data in anything other than an encrypted, proprietary format.

    Cameron

    Technically I agree the build would be easier in a closed ecosystem with a defined repository architecture, but as JC says, that to a certain extent already exists. The technical challenges in building (and getting people to buy into) an open system are much larger, but both JC and myself, amongst others, believe that the benefits are exponentially larger as well.
    But also there is no technical reason why this system shouldn't be able to cope with internal or closed data. The only problem is the additional security and authentication issues. My argument is that we will get buy-in by actually paying people to make their data open. At the moment they're not interested in making data available, closed, open, or whatever. By creating a significant carrot to do so, and thus making it part of the mainstream of science practice (as is anything that generates money), I think we take the agenda forward faster than it would otherwise naturally go.
    We have been using DabbleDB to handle a database of lab materials, and one of our future goals has been to automatically update the database. What I hadn't realised before last week was the potential to use user-configured RSS feeds to set up multiple databases within DabbleDB and so provide a more sophisticated laboratory stocks database.

    One of the problems with wikis is that they don't generally provide an aggregated or user-configurable feed of the site as a whole or of a name space such as a single lab book. They also don't readily provide a means of tagging or adding metadata. Neither wikis nor blogs provide immediately accessible tools for configuring multiple RSS feeds, at least not in the world of freely hosted systems.

    That little exercise in historical recovery and analysis demonstrates, I think, that there is lasting value in at least some blog discussions — all of the points that were discussed back in February–June are still relevant to the discussion about online tools and approaches that Jean-Claude, Heather and Bob O'Hara led yesterday at Sciblog 2008.

    Anyway, it was good to make contact. Now that e-CAT is ready for beta testing (and we at Axiope are in a position to show that we are not all talk and no action!) I look forward to participating in the discussion as it continues to evolve.

  2. Great to meet you yesterday, Cameron. In the spirit of your call for a historical perspective, here is a distillation of what you and other science bloggers said about online notebooks between February and June of this year — the period when that was a
    major focus of what you were discussing. I put it together originally for my own benefit – to help me make sense of the key themes that were emerging in the discussion and to tease out the different perspectives that different people had on different aspects of the issue.

    Although it is all things that you and the others said (and in that sense obviously old hat), you might find it interesting to see someone else’s efforts to organize what I saw as the key threads in a discussion that ranged over a variety of blogs into a ‘coherent’ (!) dialog.

    The title I gave my notes was:

    The Science Bloggers on the Future of Research in Biology and Chemistry (February – June 2008)

    The topic headings are

    1. Objective/focus
    2. Vision for what ‘it’ looks like: From ELN to ‘Authoring Tool’ to ‘Virtual Research Environment’
    3. How to get started/build it
    4. Technical/desired features and capabilities
    5. Databases and combining the ‘ELN’ with a database
    6. Data aggregation, automation and lifestreaming
    7. Adoption: Barriers and drivers
    8. Attitudes to commercial ELNs, other products and approaches

    And here is the entire distillation:

    The Science Bloggers on the Future of Research in Biology and Chemistry (February – June 2008)

    1. Objective/Focus

    Martin Fenner

    At least three different goals here: 1) creating a common data format, 2) building a software platform that does these things and 3) have most or all data openly available. 3) is a desired goal, but is not necessarily connected to 1) and 2). You propose an open repository, but technically there is no reason to do this (although it might be easier).

    Interesting that most of the suggestions are about data aggregation. And David’s detailed blog post has some promising examples. My own thinking was rather along the lines of facilitating the daily tasks of a scientist: doing experiments, searching the literature, writing papers, going to seminars, etc.

    Cameron Neylon

    Its true that there is no technical reason for making the respositories open, it just happens that that is my agenda. For me, that is the end game. My aim is to make these tools available because they will help to expand the data commons. It is the case, however, that by making these tools you are also making a good data storage and processing system for a closed repository. And someone could make money from that as a way to recoup the development costs.

    We use blogs to record experiments but that is at a very crude level at the moment really but you can see where it is going. The key to me as David says is bring it all together and processing it where and how you like. That’s why I focus on aggregation (of anything, whats going on, who’s said what, where I’m supposed to be [oops, in the lab :)], and what people are writing and thinking). These kind of work, but the filtering bit is where the real work needs to be done in my view.

    Jean-Claude Bradley

    If I dare speak for Cameron – openness is central to his objectives. . . . Some of us think that there will be a qualitative change in what science can accomplish and how its gets done when data are truly open and free. Distributed intelligence is extremely difficult with limited access to information.

    Deepak Singh

    Where are the mashups (something like EpiSpider) that make our lives better, where are the APIs that make it easy for us to use the web as a platform?

    David Crotty

    Focus on creating tools that increase efficiency, rather than just copying things that have been popular in other arena (Myspace).

    Neil Saunders

    You might argue that web 2.0 is a “way of doing”, rather than software per se. However, it’s apps that make things happen. What we need is more input from biologists. Web programmers have no end of ideas for apps, but they need to be told what’s useful and what isn’t. Alternatively, biologists need to program and as you suggest, it doesn’t take too much work to get your head around an API, read a few tutorials and give it a go yourself.

    My personal view is that we need less social, “Facebook for scientists” ideas and more apps aimed at data: aggregation, interoperability, transfer, search, online analysis.

    2. Vision for what ‘it’ looks like: From ELN to ‘Authoring Tool’ to ‘Virtual Research Environment’

    Neil

    All experiments share common features such as: (1) a date/time when they were performed; (2) an aim (”generate PCR product”, “run crystal screen for protein X”); (3) the use of protocols and instruments; (4) a result (correct size band on a gel, crystals in well plate A2). The only free-form part is the interpretation. Is the result good, bad, expected? What to do next? My simplistic view is that an XML element named “notes” of data type “string” covers anything free-form that somebody might want to say about their experiment. Now we just have to design the schema, build a nice forms-based web interface and force everyone in the lab to use it (March 08)

    Martin, indeed, many of the popular tools are very useful for science and perhaps one goal is to use those APIs to bring them together. You can imagine an electronic lab notebook cobbled together from an online wiki, editor, calendar and source code repository, for example. I’m hoping to see some great science apps out of Google App Engine any day now… http://network.nature.com/blogs/user/mfenner/2008/05/10/web
    -2-0-for-scientists-where-are-the-applications

    Cameron

    What I want here is a marker on a web document that says ‘I am a scientific experiment’ (page was a more term to use – I simply mean any web document, generally accessed through discrete web pages). This will allow aggregation and distribution of the notebook a la PostGenomic or Chemical Blogspace. To me this is more important than the format of the underlying data.

    I lean towards lightweight approaches and very free form capture.

    What we need are systems that help us capture what a happened (which is much more than the data) and then help us to report it in a coherent way.

    I believe we can propose something much more ambitious with the tools we have available. We can build the systems, or at least the prototypes of systems, that will be so useful for recording and sharing data that people will want to use them . . .

    The fundamental problem is how to capture the data, associated metadata, and do the wrapping in a way that is an easy and convenient enough that scientists will actually do it . . .

    Need to build a lab book system, or a virtual research environment if you prefer, because this will encompass much more than the traditional paper notebook. It will also encompass much more than the current crop of electronic lab notebooks. . . . this system will consist of a feed aggregator (not unlike Friendfeed) where the user can subscribe to various data streams, some from people, some from instruments or sensors, and some that they have generated themselves, pictures at Flickr, comments on Twitter. The lab book component itself will look more or less like a word processor, or a blog, or a wiki, depending on what the user wants. In this system the user will generate documents (or pages); some will describe samples, some procedures, and some ideas. Critically this authoring system will have an open plugin architecture supported by an open source community that enables the user to grab extra functionality automatically. . .

    This environment will have a proper versioning system, web integration to push documents into the wider world, and a WYSIWYG interface. It will expose human readable and machine readable versions of the files, and will almost certainly use an underlying XML document format. It will probably expose both microformats, rdf, and free text tags.
    The system at core is an authoring tool (or tools) that can access a common plugin architecture. These tools do two things. They take the record that we create by ordering the datastreams we generate and to wrap that and place it in an appropriate repository or repositories. And then it also enables us to do the same thing with external datastreams (literature, data, social network tips, conversations etc). It is both an authoring and planning tool, and a layer through which we can interact with our own data (wherever it happens to be) and anyone elses data (wherever it happens to be) . . .
    In my mind it looks a bit like the offspring of a liasion between Word and Google Home page but I think its important that it should be flexibile. Ideally it should work as a layer on top of anything you want, Word, Wiki, Blog, Text editor. It’s just a set of panels in your web browser if you like. . . .
    The ideal situation would be everything loosely coupled with people able to use whatever authoring and ordering tools they like. This is at the core of the ‘web as platform’ concept.
    Our lab book is not quite unstructured, it does have metadata and it does define different aspects of the process. We are also very in favour of the use of templates for repeated processes which can be used to structure specific aspects of the capture process. Don’t get too focussed on the notion that a wiki or blog is free text. It is just like FuGE. It can be completely free or it can be a highly structured XML document. It depends on how you are interacting with it and what layers are between you and the underlying database.

    The key to our blog system is that the structure is arbitrary; the system itself is semantically unaware. How you choose to process the information is up to you. New keys and key values can be added at any time. We have found this crucial and it is this which makes me sceptical about any semantically structure data model in this context.

    Martin

    What will your proposed system look like at its core? Is it a wiki? Or a platform with plugins à la facebook? Or a set of network protocols used my many different applications?

    Jean-Claude

    There are already many commercial Electronic Notebook Systems and Content Management Systems out there but they are designed with precisely the opposite intent: to keep people out.
    This was made very clear a few weeks ago when I met with a scientific software distributor. There was not even an option to store data in a format not encrypted in a proprietary format.

    3. How to get started/build it

    Martin

    Where do we start? Probably with one of the existing tools you mentioned. But you could also start with a publishing platform such as TOPAZ used by PLoS and then work yourself backwards
    We can wait that either one of the big players or a clever startup has a great idea. But one of the attractive features of Web 2.0 is user participation. We need more discussions between scientists and software developers on what is needed and what can be done. These discussions are of course already taking place, but science bloggers can do more to collect interesting ideas and articulate them. We want the integration of reference managers in online writing tools such as Google Docs or Buzzword, but how do we make our voice heard?
    Secondly, we can write applications ourselves. The barriers of entry have become really low, and one reason are the APIs (application programming interfaces) of both science applications or conventional Web 2.0 apps:
    • Pubmed
    • Scopus
    • Connotea
    • Google Maps
    • Google Docs
    • YouTube
    • Flickr
    • Technorati
    • Facebook
    • eBay
    • Amazon
    • Yahoo

    Deepak

    I think the key is to keep it simple. There is a lot that can be done, but we need to go with the basic building blocks and use the existing fabric.

    Cameron

    Deepak, absolutely it needs to be kept simple and as far as possible it should involve tools that either people are already using or that are very easy to transfer to. I’m thinking less of an application and more of a platform that gives people a way of pulling things together.

    4. Technical/Desired features and capabilities

    Cameron

    One of the problems we have with our lab books is that they can never be detailed enough to capture everything that somebody might be interested in one day. However at the same time they are too detailed for easy reading by third parties. I think there is general agreement that on top of the lab book you need an interpretation layer, an extra blog that explains what is going on to the general public. Perhaps by capturing all the detailed bits automatically we can focus on planning and thinking about the experiments rather than worrying about how to capture everything manually. Then anyone can mash up the results, or the discussion, or the average speed of the stirrer bar, any way they like.

    Another issue is the tables. My original thinking was that if we had a data model for tables then most of our problems would go away. If there was a bit of code that could be used to describe the columns and rows then this would enable both the presentation of a table to input data as well as the rendering of the table once complete. This would reduce the ‘time suck’ associated with data entry. However there are problems with this data model being associated with the rendering of the table. Some people like tables going one way, some another. There are may different possible formats for representing the same information and some are better for some purposes than another. Therefore the data model for the table should encode the meaning behind the table, usually input materials (or data), amounts, and products (materials or data). . . .
    The table: Again this is a central user interface issue for us. Capturing an experiment in the wet lab, whether noting it as it happens or planning what you are going to in advance, is often most easily done with a table. Tables are not well implemented in the wiki and blog frameworks we are using for these systems. Therefore providing a table to capture the experiment is critical if you actually want anyone to use your system. Our users consistently identify this as the single biggest barrier to them using our system.
    Now the heavyweight approach to this is to say; ‘That’s why you need a data model. Once you have that you can generate a nice web form to capture the necessary data’. The problem with this comes when you do something slightly different. As an example I had a template set up in our system for capturing the setup of SDS-PAGE gels. This would go and look for anything that tagged as ‘protein’ as potential samples and present these in a drop down menu. This was fine until the day I wanted to run a DNA-protein conjugate on the gel. Essentially I had broken my own data model. This could be fixed, and I did fix it, by changing the way my template looked for potential samples. But in the cut and thrust of real lab work (as opposed to an academic pottering under sufferance of his students) this isn’t feasible. We can’t extend the data model every time we do something new – we are always doing something new.
    Bottom line is that tables are a familiar and easy way for users to structure information visually. If you can’t provide them people will walk away.

    5. Databases and Combining the ‘ELN’ with a database

    Cameron

    One of the features that people liked about our Blog based framework at the workshop last week was the idea that they got a catalogue of various different items (chemicals, oligonucleotides, compound types) for free once the information was in the system and properly tagged. Now this is true but you don’t get the full benefits of a database for searching, organisation, presentation etc. We have been using DabbleDB to handle a database of lab materials and one of our future goals has been to automatically update the database. What I hadn’t realised before last week was the potential to use user configured RSS feeds to set up multiple databases within DabbleDB to provide more sophisticated laboratory stocks database.

    DabbleDB can be set up to read RSS or other XML or JSON feeds to update as was pointed out to me by Lucy Powers at the workshop. To update a database all we need is a properly configured RSS feed. As long as our templates are stable the rest of the process is reasonably straightforward and we can generate databases of materials of all sorts along with expiry dates, lot numbers, ID numbers, safety data etc etc. The key to this is rich feeds that carry as much information as possible, and in particular as much of the information we have chosen to structure as possible. We don’t even need the feeds to be user configurable within the system itself as we can use Pipes to easily configure custom feeds.

    dabbledb (dabbledb.com)s a not too bad point and click web based database system that can upload excel spreadsheets and then do some simple search and relationship type queries on the data. You might find that a reasonable halfway house en route. We use it for our stocks and strains database (neylonlaboratory.dabbledb.com – can give you access to the underlying database if you want a look)

    Neil

    Oh boy, there’s a lot to get through here. Starting with “what Pierre said”. It’s entirely possible and indeed preferable to analyse genome-wide screens without Excel spreadsheets.
    It’s natural to look at data and try to force it into the tools with which we’re familiar. You see rows and columns, you think “spreadsheet”. Other people see rows and columns, they think “database table”, or “R data frame”. First point then: if you know that there’s a better way to perform a task, don’t you owe it to yourself to at least try and find out about it?
    To answer Jennifer’s questions: getting your data into a database table (such as MySQL) is incredibly beneficial, because it allows you to query it in all sorts of ways. So yes, you would be able to ask questions such as “show me all orthologs of gene X from species Y and list the paralogs too”. If your data are structured, designing relevant queries is easier.

    Deepak

    The challenge is always authoring tools. People don’t necessarily need to learn SQL, but they do need to learn to understand how to conceptualize queries and relationships.
    Can life scientists get by using excel? No, it is far to limited, designed for finance people and not complex scientific data mining. Now on the other hand, with tools like dabbledb and Blist you have the ability to use excel like front ends, i.e. familiar UI’s with relational backends that permit querying. These tools are still young, but you get the picture.
    Of course, we could do a much better job of building applications that we can mine through simple query builders. That’s where Cameron’s notions of collaboration come into play. You don’t have to have a formal collaboration. There are enough people out there who will do it for you.

    6. Data aggregation, Automation and Lifestreaming

    Neil

    particularly with respect to tracking people in the lab and connecting a datafeed from an instrument (say a PCR machine or an autoclave) to a person. The guys at Southampton have already put together the blogging lab (see http://chemtools.chem.soton.ac.uk/projects/blog/blogs.php/blog_id/24 for an example). Now if we could connect that up to, say, a Twitter feed (’I am doing a PCR reaction:Date, time’ then we may be able to reconstruct what happens in the lab at a very detailed level. We are thinking of applications in safety (recording the autoclave is working properly) but this will also be great for general trouble shooting (did the PCR machine do what it was supposed to, has someone fiddled with my programme, what was the temperature in the incubator etc etc)
    In many ways the technical problems of providing data feeds are largely solved. There are two outstanding problems. Getting more people to generate interesting feeds. And developing clever and easy to use tools to filter them (which might well involve building better tools that help scientists to generate them with proper tagging).

    Martin

    When I can query the data web with a question like ‘what is the relationship between protein purity and crystal size’, or ‘has anyone made a novel compound that might bind to my drug target based on an in silico screen’, or as Neil says ‘These people also searched for GO accession GO:0050421 (nitrite reductase (cytochrome) activity)’ then we will have the tools that make ‘Web2.0’ relevant for science. . . .

    Cameron

    We use blogs to record experiments, but that is at a very crude level at the moment really; you can see where it is going, though. The key to me, as David says, is bringing it all together and processing it where and how you like. That’s why I focus on aggregation (of anything: what’s going on, who’s said what, where I’m supposed to be [oops, in the lab :)], and what people are writing and thinking). These kind of work, but the filtering bit is where the real work needs to be done in my view.

    The system at its core is an authoring tool (or tools) that can access a common plugin architecture. These tools do two things. They take the record that we create by ordering the datastreams we generate, wrap that, and place it in an appropriate repository or repositories. They also enable us to do the same thing with external datastreams (literature, data, social network tips, conversations, etc.). It is both an authoring and planning tool, and a layer through which we can interact with our own data (wherever it happens to be) and anyone else’s data (wherever it happens to be) . . .
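Cameron's authoring layer over a common plugin architecture can be sketched in outline. The class and method names below are invented for illustration; nothing like this exists yet as described:

```python
from abc import ABC, abstractmethod

class StreamPlugin(ABC):
    """A source of records: our own datastreams or external ones
    (literature, data, social network tips, conversations, ...)."""
    @abstractmethod
    def records(self):
        ...

class RepositoryPlugin(ABC):
    """Somewhere to deposit the wrapped record."""
    @abstractmethod
    def deposit(self, record):
        ...

class AuthoringTool:
    """Orders the datastreams, wraps the result, and places it in
    one or more repositories."""
    def __init__(self, streams, repositories):
        self.streams = streams
        self.repositories = repositories

    def publish(self):
        # Gather every record from every stream and order them in time.
        combined = sorted(
            (rec for stream in self.streams for rec in stream.records()),
            key=lambda rec: rec["timestamp"])
        wrapped = {"type": "lab-record", "entries": combined}
        for repo in self.repositories:
            repo.deposit(wrapped)
        return wrapped
```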

    Neil

    Perhaps the web at large defines “social” differently to scientists. Most social web2.0 apps mean “keeping in touch with your friends”. Perhaps web2.0 apps for scientists mean “keeping in touch with data”.

    Michael Barton

    Combined with an RSS WordPress plugin, my most recent research activity from Twitter is displayed as a stream on my blog. Taking this a step further, I included feeds for my research-tagged figures on Flickr, my paper bibliography on CiteULike, and discussion of my research on my blog. This stream is available at http://www.michaelbarton.me.uk/research-stream/, and shows the general idea of what I’m trying to do. I like this because in bioinformatics it’s sometimes difficult to know what other people are doing, but now I hope that other people in my group can have a quick glance to see what I’ve been up to. Furthermore this all works passively: I’m already using these services in my research anyway, and the only thing I had to do was use Yahoo Pipes to aggregate the already existing information.
    Because bioinformatics work is amenable to being displayed, shared, and edited on the web I think that the field should be at the bleeding edge of using Web 2.0 services like this.
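Michael's research stream can be reproduced without Yahoo Pipes. A rough equivalent using the feedparser library, with placeholder URLs standing in for the Twitter, Flickr, CiteULike and blog feeds:

```python
import calendar
import feedparser  # pip install feedparser

# Placeholder URLs standing in for the real per-service feeds.
FEEDS = [
    "https://example.org/twitter.rss",
    "https://example.org/flickr-figures.rss",
    "https://example.org/citeulike.rss",
    "https://example.org/blog-research.rss",
]

def research_stream(feed_urls, limit=20):
    """Merge several RSS/Atom feeds into one stream, newest first."""
    entries = []
    for url in feed_urls:
        parsed = feedparser.parse(url)
        for entry in parsed.entries:
            entries.append({
                "source": parsed.feed.get("title", url),
                "title": entry.get("title", ""),
                "link": entry.get("link", ""),
                "published": entry.get("published_parsed"),
            })
    entries.sort(
        key=lambda e: calendar.timegm(e["published"]) if e["published"] else 0,
        reverse=True)
    return entries[:limit]

for item in research_stream(FEEDS):
    print(item["source"], "-", item["title"])
```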

    Martin

    MySpace Data Availability, Facebook Connect and Google Friend Connect were announced just in the last few days and also promise better data integration.
    My suggestion: we hold a Yahoo Pipes contest for the most interesting science mashup.

    7. Adoption: Barriers and Drivers

    Cameron

    I don’t think we’ll convince the general science community to buy in unless they see the network effects that we know and love in Web2.0. And I don’t think we’ll see that until the data is available.

    Neil

    My experience of discussing electronic lab notebooks, which is mostly from biochemistry/molecular biology labs, is that many biologists are quite resistant to the idea of structured data. I think one reason that the paper notebook persists is that people like free-form notes. You may believe that a lab notebook is a highly-ordered record of experiments but trust me, it’s not uncommon to see notes such as “Bollocks! Failed again! I’m so sick of this purification…” scrawled in the margins.
    My take on the problem is that biologists spend a lot of time generating, analysing and presenting data, but they don’t spend much time thinking about the nature of their data. When people bring me data for analysis I ask questions such as: what kind of data is this? ASCII text? Binary images? Is it delimited? Can we use primary keys? Not surprisingly this is usually met with blank stares, followed by “well…I ran a gel…”.

    David Crotty

    “I can barely keep up with . . . what my labmates are doing.”

    Also consider ways for them to export their information from your site. If you can provide functionality so the effort they put in at your site can be re-used elsewhere, your site suddenly becomes much more attractive.

    For the lab notebook–which is easier, going to the lab’s shared drive, grabbing a Word document used to keep the protocols and annotating as needed, or going to the Wiki site, creating an account, signing in and editing there? Add in the worries of publicly exposing that Wiki to both annoying trolls (easily fixed, but annoying nonetheless), and more importantly, to competing labs. Factor in the need for someone to monitor and do upkeep on the Wiki. How is the Wiki any better than the Word document for an average lab member? Why should they spend the time and effort in learning something new if the payoff is not going to be significant?

    Heather Etchevers

    None of my lab mates wants to bother, and I don’t have the time to build the entire site single-handedly. What do they not want to bother with?
    • learning wiki (or html, or any other) markup – despite what I consider some enormous advantages
    • writing or retyping lab protocols – I insist here, but it’s always pulling teeth and is always done in Word
    • updating information in writing that they can get orally from the other end of the bench from one another or by the phone from me
    • potentially revealing scoopable information to other competitor labs (despite the fact that no one can really directly compete with us anyhow on most of our subjects)

    8. Attitudes to commercial ELNs, other products and approaches

    Jean Claude

    There are already many commercial Electronic Notebook Systems and Content Management Systems out there but they are designed with precisely the opposite intent: to keep people out.
    This was made very clear a few weeks ago when I met with a scientific software distributor. There was not even an option to store data in a format that was not encrypted and proprietary.

    Cameron

    Technically I agree the build would be easier in a closed ecosystem with a defined repository architecture, but as JC says, that to a certain extent already exists. The technical challenges in building (and getting people to buy into) an open system are much larger, but both JC and myself, amongst others, believe that the benefits are exponentially larger as well.
    But also there is no technical reason why this system shouldn’t be able to cope with internal or closed data. The only problem is the additional security and authentication issues. My argument is that we will get buy-in by actually paying people to make their data open. At the moment they’re not interested in making data available, closed, open, or whatever. By creating a significant carrot to do so, and thus making it part of the mainstream of science practice (as is anything that generates money), I think we take the agenda forward faster than it would otherwise naturally go.
    We have been using DabbleDB to handle a database of lab materials, and one of our future goals has been to automatically update the database. What I hadn’t realised before last week was the potential to use user-configured RSS feeds to set up multiple databases within DabbleDB and so provide a more sophisticated laboratory stocks database.
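As an aside, producing the kind of feed a tool like DabbleDB could pick up is straightforward. A minimal sketch that turns a list of stock entries into an RSS 2.0 document; the field names are invented, and any real integration would depend on what the importing tool expects:

```python
from datetime import datetime, timezone
from xml.sax.saxutils import escape

def stocks_to_rss(stocks, title="Lab stocks"):
    """Turn a list of stock dicts into a minimal RSS 2.0 document that a
    feed-aware database (or anything else) could poll."""
    now = datetime.now(timezone.utc).strftime("%a, %d %b %Y %H:%M:%S GMT")
    items = []
    for stock in stocks:
        items.append(
            "<item><title>{name}</title>"
            "<description>{desc}</description>"
            "<pubDate>{date}</pubDate></item>".format(
                name=escape(stock["name"]),
                desc=escape("{location}; {quantity}".format(**stock)),
                date=now))
    return ("<?xml version='1.0'?><rss version='2.0'><channel>"
            "<title>{0}</title>{1}</channel></rss>".format(
                escape(title), "".join(items)))

print(stocks_to_rss([{"name": "pET-28a glycerol stock",
                      "location": "-80C freezer, box 3",
                      "quantity": "5 tubes"}]))
```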

One of the problems with wikis is that they don’t generally provide an aggregated or user-configurable feed of the whole site or of a namespace such as a single lab book. They also don’t readily provide a means of tagging or adding metadata. Neither wikis nor blogs provide immediately accessible tools for configuring multiple RSS feeds, at least not in the world of freely hosted systems.

That little exercise in historical recovery and analysis demonstrates, I think, that there is lasting value in at least some blog discussions: all of the points that were discussed back in February–June are still relevant to the discussion about online tools and approaches that Jean-Claude, Heather and Bob O’Hara led yesterday at Sciblog 2008.

    Anyway, it was good to make contact. Now that e-CAT is ready for beta testing (and we at Axiope are in a position to show that we are not all talk and no action!) I look forward to participating in the discussion as it continues to evolve.
