Reflections on Science 2.0 from a distance – Part I

Some months ago now I gave a talk at a very exciting symposium organized by Greg Wilson as a closer for the Software Carpentry course he was running at the University of Toronto. It was exciting because of the lineup but also because it represented a real coming together of views on how developments in computer science and infrastructure, as well as new social capabilities brought about by computer networks, are changing scientific research.

I talked, as I have several times recently, about the idea of a web-native laboratory record, thinking about what the paper notebook would look like if it were re-invented with today’s technology. Jon Udell gave a two-tweet summary of my talk which I think captured the two key aspects of my viewpoint perfectly. In this post I want to explore the first of these.

@cameronneylon: “The minimal publishable unit of science — the paper — is too big, too monolithic. The useful unit: a blog post.”#osci20

The key to the semantic web, linked open data, and indeed the web and the internet in general, is the ability to be able to address objects. URLs in and of themselves provide an amazing resource making it possible to identify and relate digital objects and resources. The “web of things” expands this idea to include addresses that identify physical objects. In science we aim to connect physical objects in the real world (samples, instruments) to data (digital objects) via concepts and models. All of these can be made addressable at any level of granularity we choose. But the level of detail is important. From a practical perspective too much detail means that the researcher won’t, or even can’t, record it properly. Too little detail and the objects aren’t flexible enough to allow re-wiring when we discover we’ve got something wrong.

A single sample deserves an identity. A single data file requires an identity, although it may be wrapped up within a larger object. The challenge comes when we look at process, descriptions of methodology and claims. A traditionally published paper is too big an object, something that is shown clearly by how ambiguous citations to papers are. A paper will generally contain multiple claims, and multiple processes. A citation could refer to any of these. At the other end I have argued that a tweet, 140 characters, is too small, because while you can make a statement it is difficult to provide context in the space available. To be a unit of science a tweet really needs to contain a statement and two references or citations, providing the relationship between two objects. It can be done but it’s just a bit too tight in my view.

So I proposed that the natural unit of scientific research is the blog post. There are many reasons for this. Firstly the length is elastic, accommodating anything from something (nearly) as short as a tweet to thousands of lines of data, code, or script. But equally there is a broad convention of approximate length, ranging from a few hundred to a few thousand words, about the length in fact of a single lab notebook page, and about the length of a simple procedure. The second key aspect of a blog post is that it natively comes with a unique URL. The blog post is a first class object on the web, something that can be pointed at, scraped, and indexed. And crucially the blog post comes with a feed, and a feed that can contain rich and flexible metadata, again in agreed and accessible formats.
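To make the "URL plus feed entry" idea concrete, here is a minimal sketch of a single notebook post expressed as an Atom entry. It assumes Python's standard `xml.etree.ElementTree`; the function name `build_entry`, the example.org URLs, the date, and the tags are all hypothetical, and a real feed would need more (an author, for instance) to be valid Atom.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Serialise Atom elements without a prefix (default namespace).
ET.register_namespace("", ATOM_NS)


def build_entry(url, title, updated, tags, related):
    """Return an Atom <entry> element for one post (illustrative helper)."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    # The post's own URL doubles as its identifier.
    ET.SubElement(entry, f"{{{ATOM_NS}}}id").text = url
    ET.SubElement(entry, f"{{{ATOM_NS}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM_NS}}}updated").text = updated
    ET.SubElement(entry, f"{{{ATOM_NS}}}link", rel="alternate", href=url)
    # Tags travel with the entry as Atom categories.
    for tag in tags:
        ET.SubElement(entry, f"{{{ATOM_NS}}}category", term=tag)
    # Links to other addressable objects (samples, data, procedures).
    for href in related:
        ET.SubElement(entry, f"{{{ATOM_NS}}}link", rel="related", href=href)
    return entry


# Hypothetical post describing one sample, pointing at its purification.
entry = build_entry(
    url="https://example.org/notebook/sample-42",
    title="Sample 42: purified protein",
    updated="2009-07-01T12:00:00Z",
    tags=["sample", "protein"],
    related=["https://example.org/notebook/purification-7"],
)
print(ET.tostring(entry, encoding="unicode"))
```

The point of the sketch is only that the post, its metadata, and its links to other objects all fit in one small, machine-readable unit.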

If we are to embrace the power of the web to transform the laboratory and scientific record then we need to think carefully about what the atomic components of that record are. Get this wrong and we make a record which is inaccessible, and which doesn’t take advantage of the advanced tooling that the consumer web now provides. Get it right and the ability to Google for scientific facts will come for free. And that would just be the beginning.

If you would like to read more about these ideas I have a paper just out in the BMC Journal Automated Experimentation.

Where is the best place in the Research Stack for the human API?

Interesting conversation yesterday on Twitter with Evgeniy Meyke of EarthCape prompted in part by my last post. We started talking about what a Friendfeed replacement might look like and how it might integrate more directly into scientific data. Is it possible to build something general or will it always need to be domain specific? Might this in fact be an advantage? Evgeniy asked:

@CameronNeylon do you think that “something new” could be more vertically oriented rather then for “research community” in general?

His thinking being, as I understand it, that getting at domain specific underlying data is always likely to take local knowledge. As he said in his next tweet:

@CameronNeylon It might be that the broader the coverage the shallower is integration with underlining research data, unless api is good

This led me to thinking about integration layers between data and people, and I recalled something that I said in jest to someone some time ago:

“If you’re using a human as your API then you need to work on your user interface.”

Thinking about the way Friendfeed works there is a real sense in which the system talks to a wide range of automated APIs but at the core there is a human layer that firstly selects feeds of interest and then when presented with other feeds selects from them specific items. What Friendfeed does very well in some senses is provide a flexible API between feeds and the human brain. But Evgeniy made the point that this “works only 4 ‘discussion based’ collaboration (as in FF), not 4 e.g. collab. taxonomic research that needs specific data inegration with taxonomic databases”.

Following from this was an interesting conversation [Webcite Archived Version] about how we might best integrate the “human API” for some imaginary “Science Stream” with domain specific machine APIs that work at the data level. In a sense this is the core problem of scientific informatics. How do you optimise the ability of machines to abstract and use data and meaning while at the same time fully exploiting the ability of the human scientist to contribute their own unique skills, pattern recognition, insight, lateral thinking. And how do you keep these in step with each other so both are optimally utilised? Thinking in computational terms about the human as a layer in the system with its own APIs could be a useful way to design systems.

Friendfeed in this view is a peer to peer system for pushing curated and annotated data streams. It mediates interactions with the underlying stream but also with other known and unknown users. Friendfeed seems to get three things very right: 1) Optimising the interaction with the incoming data stream; 2) Facilitating the curation and republication of data into a new stream for consumption by others, creating a virtuous feedback loop in fact; and 3) Facilitating discovery of new peers. Friendfeed is actually a bittorrent for sharing conversational objects.

This conversational layer, a research discourse layer if you like, is at the very top of the stack, keeping the humans at a high, abstracted level of conversation, where we are probably still at our best. And my guess is that something rather like Friendfeed is pretty good at being the next layer down, the API to feeds of interesting items. But Evgeniy’s question was more about the bottom of the stack, where the data is being generated and needs to be turned into a useful and meaningful feed, ready to be consumed. The devil is always in the details and vertical integration is likely to help here. So what do these vertical segments look like?

In some domains these might be lab notebooks, in some they might be specific databases, or they might be a mixture of both and of other things. At the coal face it is likely to be difficult to find a way of describing the detail in a way that is both generic enough to be comprehensible and detailed enough to be useful. The needs of the data generator are likely to be very different to those of a generic data consumer. But if there is a curation layer, perhaps human or machine mediated, that partly abstracts this then we may be on the way to generating the generic feeds that will be finally consumed at the top layer.  This curation layer would enable semantic markup, ideally automatically, would require domain specific tooling to translate from the specific to the generic, and provide a publishing mechanism. In short it sounds (again) quite a bit like Wave. Actually it might just as easily be Chem4Word or any other domain specific semantic authoring tool, or just a translation engine that takes in detailed domain specific info and correlates it with a wider vocabulary.

One of the things that appeals to me about Wave, and Chem4Word, is that they can (or at least have the potential to) hide the complexities of the semantics within a straightforward and comprehensible authoring environment. Wave can be integrated into domain specific systems via purpose built Robots, making it highly extensible. Both are capable of “speaking web” and generating feeds that can be consumed and processed in other places and by other services. At the bottom layer we can chew the problem off one piece at a time, including human processing where it is appropriate and avoiding it where we can.

The middleware is of course, as always, the problem. The middleware is agreed and standardised vocabularies and data formats. While in the past I have thought this near intractable, it seems as though many of the pieces are actually falling into place. There is still a great need for standardisation and perhaps a need for more meta-standards but it seems like a lot of this is in fact on the way. I’m still not convinced that we have a useful vocabulary for actually describing experiments but enough smart people disagree with me that I’m going to shut up on that one until I’ve found the time to have a closer look at the various things out there in more detail.

These are half baked thoughts – but I think the idea of where we optimally place the human in the system is a useful question. It also hasn’t escaped my notice that I’m talking about something very similar to the architecture that Simon Coles of Amphora Research Systems always puts up in his presentations on Electronic Lab Notebooks. Fundamentally because the same core drivers are there.

Google Wave in Research – Part II – The Lab Record

In the previous post  I discussed a workflow using Wave to author and publish a paper. In this post I want to look at the possibility of using it as a laboratory record, or more specifically as a human interface to the laboratory record. There has been much work in recent years on research portals and Virtual Research Environments. While this work will remain useful in defining use patterns and interface design my feeling is that Wave will become the environment of choice, a framework for a virtual research environment that rewrites the rules, not so much of what is possible, but of what is easy.

Again I will work through a use case but I want to skip over a lot of what is by now I think becoming fairly obvious. Wave provides an excellent collaborative authoring environment. We can explicitly state and register licenses using a robot. The authoring environment has all the functionality of a wiki already built in so we can assume that, and granular access control means that different elements of a record can be made accessible to different groups of people. We can easily generate feeds from a single wave and aggregate content in from other feeds. The modular nature of the Wave, made up of Wavelets, themselves made up of Blips, may well make it easier to generate comprehensible RSS feeds from a wiki-like environment, something which has up until now proven challenging. I will also assume that, as seems likely, both spreadsheet and graphing capabilities will soon be available as embedded objects within a Wave.

Let us imagine an experiment of the type that I do reasonably regularly, where we use a large facility instrument to study the structure of a protein in solution. We set up the experiment by including the instrument as a participant in the wave. This participant is a Robot which fronts a web service that can speak to the data repository for the instrument. It drops into the Wave a formatted table which provides options and spaces for inputs based on a previously defined structured description of the experiment. In this case it calls for a role for this particular run (is it a background or an experimental sample?) and asks where the description of the sample is.

The purification of the protein has already been described in another wave. As part of this process a wavelet was created that represents the specific sample we are going to use. This sample can be directly referenced via a URL that points at the wavelet itself making the sample a full member of the semantic web of objects. While the free text of the purification was being typed in another Robot, this one representing a web service interface to appropriate ontologies, automatically suggested using specific terms adding links back to the ontology where suggestions were accepted, and creating the wavelets that describe specific samples.

The wavelet that defines the sample is dragged and dropped into the table for the experiment. This copying process is captured by the internal versioning system and creates in effect an embedded link back to the purification wave, linking the sample to the process that it is being used in. It is rather too much at this stage to expect the instrument control to be driven from the Wave itself but the Robot will sit and wait for the appropriate dataset to be generated and check with the user it has got the right one.

Once everyone is happy the Robot will populate the table with additional metadata captured as part of the instrumental process, create a new wavelet (creating a new addressable object) and drop in the data in the default format. The robot naturally also writes a description of the relationships between all the objects in an appropriate machine readable form (RDFa, XML, or all of the above) in a part of the Wave that the user doesn’t necessarily need to see. It may also populate any other databases or repositories as appropriate. Because the Robot knows who the user is it can also automatically link the experimental record back to the proposal for the experiment. Valuable information for the facility but not of sufficient interest to the user for them to be bothered keeping a good record of it.
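As a rough illustration of what that hidden, machine-readable description might contain, the relationships can be written as subject-predicate-object statements. The sketch below serialises them as N-Triples lines rather than RDFa or XML, purely because that is the simplest form to show; the example.org URLs, the vocabulary terms, and the helper name `to_ntriples` are all hypothetical and not part of any Wave API.

```python
def to_ntriples(triples):
    """Serialise (subject, predicate, object) URI triples, one per line."""
    return "\n".join(f"<{s}> <{p}> <{o}> ." for s, p, o in triples)


# Hypothetical addresses for wavelets and a made-up vocabulary.
BASE = "https://example.org/wave/"
VOCAB = "https://example.org/vocab#"

triples = [
    # The dataset was measured from a specific sample...
    (BASE + "data/run-1", VOCAB + "measuredFrom", BASE + "sample/protein-A"),
    # ...which was produced by a purification described elsewhere...
    (BASE + "sample/protein-A", VOCAB + "producedBy", BASE + "purification-1"),
    # ...and the run is linked back to the facility proposal.
    (BASE + "data/run-1", VOCAB + "recordedUnder", BASE + "proposal/1"),
]

print(to_ntriples(triples))
```

Each statement only connects addresses, which is why giving every sample, dataset, and proposal its own URL matters so much.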

The raw data is not yet in a useful form though, we need to process it, remove backgrounds, that kind of thing. To do this we add the Reduction Robot as a participant. This Robot looks within the wave for a wavelet containing raw data, asks the user for any necessary information (where is the background data to be subtracted) and then runs a web service that does the subtraction. It then writes out two new wavelets, one describing what it has done (with all appropriate links to the appropriate controlled vocab obviously), and a second with the processed data in it.

I need to do some more analysis on this data, perhaps fit a model to start with, so again I add another Robot that looks for a wavelet with the correct data type, does the line fit, once again writes out a wavelet that describes what it has done, and a wavelet with the result in it. I might do this several times, using a range of different analysis approaches, perhaps doing some structural modelling and deriving some parameter from the structure which I can compare to my analytical model fit. Creating a wavelet with a spreadsheet embedded I drag and drop the parameter from the model fit and from the structure and format the cells so that it shows green if they are within 5% of each other.
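The pattern in the last few paragraphs, where each Robot writes out one object describing what it did and another holding the result, can be sketched in a few lines. This is an illustration under stated assumptions and not Wave code: `Record`, `reduce_data`, and `agree` are hypothetical names, the numbers are made up, and "within 5% of each other" is interpreted here as a difference of at most 5% of the larger value.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One addressable object: data, or a description of a processing step."""
    name: str
    content: object
    derived_from: list = field(default_factory=list)


def reduce_data(raw, background):
    """Subtract background point by point; return (description, data)."""
    if len(raw.content) != len(background.content):
        raise ValueError("raw and background must have the same length")
    reduced = [r - b for r, b in zip(raw.content, background.content)]
    sources = [raw.name, background.name]
    # One record says what was done, the other holds the processed data;
    # both keep pointers back to their inputs.
    description = Record("reduction-log", "background subtraction", sources)
    data = Record("reduced-data", reduced, sources)
    return description, data


def agree(a, b, rel_tol=0.05):
    """True if a and b differ by at most rel_tol of the larger magnitude."""
    return abs(a - b) <= rel_tol * max(abs(a), abs(b))


raw = Record("raw-run-1", [10.0, 12.0, 15.0])
background = Record("background-run-1", [1.0, 1.0, 1.5])
log, reduced = reduce_data(raw, background)

print(reduced.content, reduced.derived_from)
# The spreadsheet cell would show green when this is True.
print(agree(2.05, 2.00))
```

Because every output carries `derived_from`, the provenance chain from fitted parameter back to raw data comes along for free.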

Ok, so far so cool. Lots of drag and drop and using of groovy web services but nothing that couldn’t be done with a bit of work with a Workflow engine like Taverna and a properly set up spreadsheet. I now make a copy of the wave (properly versioned, its history is clear as a branch off the original Wave) and I delete the sample from the top of the table. The Robots re-process and realize there is not sufficient data to do any processing so all the data wavelets and any graphs and visualizations, including my colour-coded spreadsheet  go blank. What have I done here? What I have just created is a versioned, provenanced, and shareable workflow. I can pass the Wave to a new student or collaborator simply by adding them as a participant. I can then work with them, watching as they add data, point out any mistakes they might make and discuss the results with them, even if they are on the opposite side of the globe. Most importantly I can be reasonably confident that it will work for them, they have no need to download software or configure anything. All that really remains to make this truly powerful is to wrap this workflow into a new Robot so that we can pass multiple datasets to it for processing.

When we’ve finished the experiment we can aggregate the data by dragging and dropping the final results into a new wave to create a summary we can share with a different group of people. We can tweak the figure that shows the data until we are happy and then drop it into the paper I talked about in the previous post. I’ve spent a lot of time over the past 18 months thinking and talking about how we capture what is going on and at the same time create granular web-native objects and then link them together to describe relationships between them. Wave does all of that natively and it can do it just by capturing what the user does. The real power will lie in the web services behind the robots but the beauty of these is that the robots will make using those existing web services much easier for the average user. The robots will observe and annotate what the user is doing, helping them to format and link up their data and samples.

Wave brings three key things: proper collaborative documents which will encourage referring rather than cutting and pasting; proper version control for documents; and document automation through easy access to webservices. Commenting, version control and provenance, and making a cut and paste operation actually a fully functional and intelligent embed are key to building a framework for a web-native lab notebook. Wave delivers on these. The real power comes with the functionality added by Robots and Gadgets that can be relatively easily configured to do analysis. The ideas above are just what I have thought of in the last week or so. Imagination really is the limit I suspect.

OMG! This changes EVERYTHING! – or – Yet Another Wave of Adulation

Yes, I’m afraid it’s yet another over the top response to yesterday’s big announcement of Google Wave, the latest paradigm shifting gob-smackingly brilliant piece of technology (or PR depending on your viewpoint) out of Google. My interest, however, is pretty specific: how can we leverage it to help us capture, communicate, and publish research? And my opinion is that this is absolutely game changing – it makes a whole series of problems simply go away, and potentially provides a route to solving many of the problems that I was struggling to see how to manage.

Firstly, let’s look at the grab bag of generic issues that I’ve been thinking about. Most recently I wrote about how I thought “real time” wasn’t the big deal; what matters is giving the user control back over the timeframe in which streams come in to them. I had some vague ideas about how this might look but Wave has working code. When the people who you are in conversation with are online and looking at the same wave they will see modifications in real time. If they are not in the same document they will see the comments or changes later, but can also “re-play” changes. A lot of thought has clearly gone into the default views based on when and how a person first comes into contact with a document.

Another issue that has frustrated me is the divide between wikis and blogs. Wikis have generally better editing functionality, but blogs have workable RSS feeds, Wikis have more plugins, blogs map better onto the diary style of a lab notebook. None of these were ever fundamental philosophical differences but just historical differences of implementations and developer priorities. Wave makes most of these differences irrelevant by creating a collaborative document framework that easily incorporates much of the best of all of these tools within a high quality rich text and media authoring platform. Bringing in content looks relatively easy and pushing content out in different forms also seems to be pretty straightforward. Streams, feeds, and other outputs, if not native, look to be easily generated either directly or by passing material to other services. The Waves themselves are XML which should enable straightforward parsing and tweaking with existing tools as well.

One thing I haven’t written much about but have been thinking about is the process of converting lab records into reports and onto papers. While there wasn’t much on display about complex documents a lot of just nice functionality, drag and drop links, options for incorporating and embedding content was at least touched on. Looking a little closer into the documentation there seems to be quite a strong provenance model, built on a code repository style framework for handling document versioning and forking. All good steps in the right direction and with the open APIs and multitouch as standard on the horizon there will no doubt be excellent visual organization and authoring tools along very soon now. For those worried about security and control, a 30 second mention in the keynote basically made it clear that they have it sorted. Private messages (documents? mecuments?) need never leave your local server.

Finally the big issue for me has for some time been bridging the gap between unstructured capture of streams of events and making it easy to convert those to structured descriptions of the interpretation of experiments. The audience was clearly wowed by the demonstration of inline real time contextual spell checking and translation. My first thought was – I want to see that real-time engine attached to an ontology browser or DbPedia and automatically generating links back to the URIs for concepts and objects. What really struck me most was the use of Waves with a few additional tools to provide authoring tools that help us to build the semantic web, the web of data, and the web of things.

For me, the central challenges for a laboratory recording system are capturing objects, whether digital or physical, as they are created, and then serving those back to the user as they need them to describe the connections between them. As we connect up these objects we will create the semantic web. As we build structured knowledge against those records we will build a machine-parseable record of what happened that will help us to plan for the future. As I understand it each wave, and indeed each part of a wave, can be a URL endpoint; an object on the semantic web. If they aren’t already it will be easy to make them that. As much as anything it is the web native collaborative authoring tool, which will make embedding and pass by reference the default approach rather than cut and paste, that will make the difference. Google don’t necessarily do semantic web but they do links and they do embedding, and they’ve provided a framework that should make it easy to add meaning to the links. Google just blew the door off the ELN market, and they probably didn’t even notice.

Those of us interested in web-based and electronic recording and communication of science have spent a lot of the last few years trying to describe how we need to glue the existing tools together: mailing lists, wikis, blogs, documents, databases, papers. The framework was never right so a lot of attention was focused on moving things backwards and forwards, how to connect one thing to another. That problem, as far as I can see, has now ceased to exist. The challenge now is in building the right plugins and making sure the architecture is compatible with existing tools. But fundamentally the framework seems to be there. It seems like it’s time to build.

A more sober reflection will probably follow in a few days ;-)

What are the services we really want for recording science online?

There is an interesting meta-discussion going on in a variety of places at the moment which touch very strongly on my post and talk (slides, screencast) from last week about “web native” lab notebooks. Over at Depth First, Rich Apodaca has a post with the following little gem of a soundbite:

Could it be that both Open Access and Electronic Laboratory Notebooks are examples of telephone-like capabilities being used to make a better telegraph?

Web-Centric Science offers a set of features orthoginal to those of paper-centric science. Creating the new system in the image of the old one misses the point, and the opportunity, entirely.

Meanwhile a discussion on The Realm of Organic Synthesis blog was sparked off by a post about the need for a Wikipedia inspired chemistry resource (thanks again to Rich and Tony Williams for pointing the post and discussion out respectively). The initial idea here was something along the lines of;

I envision a hybrid of Doug Taber’s Organic Chemistry Portal, Wikipedia and a condensed version of SciFinder.  I’ll gladly contribute!  How do we get the ball rolling?

This in turn has led into a discussion of whether Chemspider and ChemMantis partly fill this role already. The key point being made here is the problem of actually finding and aggregating the relevant information. Tony Williams makes the point in the comments that Chemspider is not about being a central repository in the way that J proposes in the original TROS blog post but that if there are resources out there they can be aggregated into Chemspider. There are two problems here, capturing the knowledge into one “place” and then aggregating.

Finally there is an ongoing discussion in the margins of a post at RealClimate. The argument here is over the level of “working” that should be shown when doing analyses of data, something very close to my heart. In this case both the data and the MatLab routines used to process the data have been made available. What I believe is missing is the detailed record, or log, of how those routines were used to process the data. The argument rages over the value of providing this, the amount of work involved, and whether it could actually have a chilling effect on people doing independent validation of the results. In this case there is also the political issue of providing more material for antagonistic climate change skeptics to pore over and look for minor mistakes that they will then trumpet as holes. It will come as no surprise that I think the benefits of making the log available outweigh the problems, but that we need the tools to do it. This is beautifully summed up in one comment by Tobias Martin at number 64:

So there are really two main questions: if this [making the full record available – CN] is hard work for the scientist, for heaven’s sake, why is it hard work? (And the corrolary: how are you confident that your results are correct?)

And the answer – it is hard work because we still think in terms of a paper notebook paradigm, which isn’t well matched to the data analysis being done within MatLab. When people actually do data analysis using computational systems they very rarely keep a systematic log of the process. It is actually a rather difficult thing to do – even though in principle the system could (and in some cases does) keep that entire record for you.

My point is that if we completely re-imagine the shape and functionality of the laboratory record, in the way Rich and I, and others, have suggested; if the tools are built to capture what happens and then provide that to the outside world in a useful form (when the researcher chooses), then not only will this record exist but it will provide the detailed “clickstream” records that Richard Akerman refers to in answer to a Twitter proposal from Michael Barton:

Michael Barton: Website idea: Rank scientific articles by relevance to your research; get uptakes on them via citations and pubmed “related articles”

Richard Akerman: This is a problem of data, not of technology. Amazon has millions of people with a clear clickstream through a website. We’ve got people with PDFs on their desktops.

Exchange “data files” and “Matlab scripts” for PDFs and you have a statement of the same problem that the guys at RealClimate face. Yes it is there somewhere, but it is a pain in the backside to get it out and give it to someone.

If that “clickstream” and “file use stream” and “relationship” stream were automatically captured then we would get closer to the thing that I think many of us are yearning for (and have been for some time): the Amazon recommendation tool for science.

The integrated lab record – or the web native lab notebook

At Science Online 09 and at the Smi Electronic Laboratory Notebook meeting in London later in January I talked about how laboratory notebooks might evolve. At Science Online 09 the session was about Open Notebook Science and here I wanted to take the idea of what a “web native” lab record could look like and show that if you go down this road you will get the most out if you are open. At the ELN meeting, which was aimed mainly at traditional database backed ELN systems for industry, I wanted to show the potential of a web native way of looking at the laboratory record, and in passing to show that these approaches work best when they are open, before beating a retreat back to the position of “but if you’re really paranoid you can implement all of this behind your firewall” so as not to scare them off too much. The talks are quite similar in outline and content and I wanted to work through some of the ideas here.

The central premise is one that is similar to that of many web-service start ups: “The traditional paper notebook is to the fully integrated web based lab record as a card index is to Google”. Or to put it another way, if you think in a “web-native” way then you can look to leverage the power of interlinked networks, tagging, social network effects, and other things that don’t exist on a piece of paper, or indeed in most databases. This means stripping back the lab record to basics and re-imagining it as though it were built around web based functionality.

So what is a lab notebook? At core it is a journal of events, a record of what has happened. Very similar to a Blog in many ways. An episodic record containing dates, times, bits and pieces of often disparate material, cut and pasted into a paper notebook. It is interesting that in fact most people who use online notebooks based on existing services use Wikis rather than blogs. This is for a number of reasons: better user interfaces, sometimes better services and functionality, proper versioning, or just personal preference. But there is one thing that Wikis tend to do very badly that I feel is crucial to thinking about the lab record in a web native way; they generate at best very ropey RSS feeds. Wikis are well suited to report writing and formalising and sharing procedures but they don’t make very good diaries. At the end of the day it ought to be possible to do clever things with a single back end database being presented as both blog and wiki but I’ve yet to see anything really impressive in this space so for the moment I am going to stick with the idea of blog as lab notebook because I want to focus on feeds.

So we have the idea of a blog as the record – a minute to minute and day to day record. We will assume we have a wonderful backend and API and a wide range of clients that suit different approaches to writing things down and different situations where this is being done. Witness the plethora of clients for Twittering in every circumstance and mode of interaction for instance. We’ll assume tagging functionality as well as key value pairs that are exposed as microformats and RDF as appropriate. We’ll also assume widgets for ontology look up and autocompletion if desired, and the ability to automatically generate input forms from any formal description of what an experiment should look like. But above all, this will be exposed in a rich machine readable format in an RSS/Atom feed.

What we don’t need is the ability to upload data. Why not? Because we’re thinking web native. On a blog you don’t generally upload images and video directly, you host them on an appropriate service and embed them on the blog page. All of the issues are handled for you and a nice viewer is put in place. The hosting service is optimised for handling the kind of content you need; Flickr for photos, YouTube (Viddler, Bioscreencast) for video, Slideshare for presentations etc. In a properly built ecosystem there would be a data hosting service, ideally one optimised for your type of data, that would provide cut and paste embed codes providing the appropriate visualisations. The lab notebook only needs to point at the data; it doesn’t need to know anything much about that data beyond the fact that it is related to the stuff going on around it and that it comes with some html code to embed a visualisation of some sort.

That pointing is the next thing we need to think about. In the way I use the Chemtools LaBLog I use a one post, one item system. This means that every object gets its own post. Each sample, each bottle of material, should have its own post and its own identity. This creates a network of posts that I have written about before. What it also means is that it is possible to apply page rank style algorithms and link analysis more generally in looking at large quantities of posts. Most importantly it encodes the relationship between objects, samples, procedures, data, and analysis in the way the web is tooled up to understand: the relationships are encoded in links. This is a lightweight way of starting to build up a web of data – it doesn’t matter so much to start with whether this is in hardcore RDF as long as there is enough contextual data to make it useful. Some tagging or key-value pairs would be a good start. Most importantly it means that it doesn’t matter at all where our data files are as long as we can point at them with sufficient precision.
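As a rough illustration of what link analysis over a one post, one item notebook could look like, here is a minimal Python sketch. The post identifiers and the links between them are invented for the example, and the ranking is a plain power-iteration PageRank written for this post, not anything the LaBLog itself does.

```python
# Hypothetical notebook: each key is a post, each value lists the posts it links to.
links = {
    "procedure/pcr-run-12": ["sample/template-dna-3", "sample/primer-mix-7"],
    "data/gel-image-12": ["procedure/pcr-run-12"],
    "analysis/band-sizes-12": ["data/gel-image-12", "sample/template-dna-3"],
    "sample/template-dna-3": [],
    "sample/primer-mix-7": [],
}


def pagerank(graph, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over a dict of outbound links."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node, targets in graph.items():
            if targets:
                # Split this post's rank evenly between the posts it points at.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A post with no outbound links spreads its rank over every post.
                share = damping * rank[node] / n
                for other in nodes:
                    new_rank[other] += share
        rank = new_rank
    return rank


ranks = pagerank(links)
most_linked = max(ranks, key=ranks.get)
print(most_linked)
```

In this toy network the template DNA sample comes out on top, simply because both a procedure and an analysis point at it. The point is only that once relationships are links, standard web tooling applies.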

But if we’re moving the datafiles off the main record then what about the information about samples? Wouldn’t it be better to use the existing Laboratory Information Management System, or sample management system or database? Well again, as long as you can point at each sample independently with the precision you need then it doesn’t matter. You can use a Google Spreadsheet if you want to – you can give a URL for each cell, and there is a powerful API that would let you build services to make putting the links in easy. We use the LaBLog to keep information on our samples because we have such a wide variety of different materials put to different uses that the flexibility of using that system rather than a database with a defined schema is important for our way of working. But for other people this may not be the case. It might even be better to use multiple different systems: a database for oligonucleotides, a spreadsheet for environmental samples, and a full blown LIMS for barcoding and following the samples through preparation for sequencing. As long as it can be pointed at, it can be used. Similar to the data case, it is best to use the system that is best suited to the specific samples. These systems are better developed than they are for data – but many of the existing systems don’t allow a good way of pointing at specific samples from an external document – and very few make it possible to do this via a simple HTTP compliant URL.

So we’ve passed off the data, we’ve passed off the sample management. What we’re left with is the procedures, which after all are the core of the record, right? Well no. Procedures are also just documents. Maybe they are text documents, but perhaps they are better expressed as spreadsheets or workflows (or rather the record of running a workflow). Again these may well be better handled by external services, be they word processors, spreadsheets, or specialist services. They just need to be somewhere where we can point at them.

What we are left with is the links themselves, arranged along a timeline. The laboratory record is reduced to a feed which describes the relationships between samples, procedures, and data. This could be a simple feed containing links or a sophisticated and rich XML feed which points out in turn to one or more formal vocabularies to describe the semantic relationship between items. It can all be wired together, some parts less tightly coupled than others, but in principle it can at least be connected. And that takes us one significant step towards wiring up the data web that many of us dream of.
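To make the "simple feed containing links" end of that range concrete, here is a minimal sketch using Python's standard library to build an Atom feed in which each entry does nothing but point at objects held elsewhere. All of the URLs, titles, and the author name are placeholders I have made up, and I have not run the output through a feed validator.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
ET.register_namespace("", ATOM_NS)


def atom(tag):
    """Qualify a tag name with the Atom namespace."""
    return f"{{{ATOM_NS}}}{tag}"


def build_feed(title, feed_id, author, entries):
    """Build an Atom feed whose entries only point at external objects."""
    feed = ET.Element(atom("feed"))
    ET.SubElement(feed, atom("title")).text = title
    ET.SubElement(feed, atom("id")).text = feed_id
    ET.SubElement(ET.SubElement(feed, atom("author")), atom("name")).text = author
    # Assumes at least one entry; timestamps share one format so max() is the newest.
    ET.SubElement(feed, atom("updated")).text = max(e["updated"] for e in entries)
    for e in entries:
        entry = ET.SubElement(feed, atom("entry"))
        ET.SubElement(entry, atom("title")).text = e["title"]
        ET.SubElement(entry, atom("id")).text = e["id"]
        ET.SubElement(entry, atom("updated")).text = e["updated"]
        for href in e["links"]:
            ET.SubElement(entry, atom("link"), {"rel": "related", "href": href})
    return feed


# Hypothetical notebook entries: the record holds links, not the objects themselves.
entries = [
    {
        "title": "Ran PCR on template 3",
        "id": "https://notebook.example.org/posts/101",
        "updated": "2009-02-03T10:15:00Z",
        "links": [
            "https://samples.example.org/template-dna-3",
            "https://procedures.example.org/standard-pcr",
        ],
    },
    {
        "title": "Gel image of PCR products",
        "id": "https://notebook.example.org/posts/102",
        "updated": "2009-02-03T14:40:00Z",
        "links": ["https://data.example.org/gel-image-12"],
    },
]

feed = build_feed("Lab record", "https://notebook.example.org/feed", "A. Researcher", entries)
feed_xml = ET.tostring(feed, encoding="unicode")
print(feed_xml)
```

Everything of substance lives at the other end of the links; the feed just carries the timeline and the connections.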

The beauty of this approach is that it doesn’t require users to shift from the applications and services that they are already using, like, and understand. What it does require is intelligent and specific repositories for the objects they generate that know enough about the object type to provide useful information and context. What it also requires is good plugins, applications, and services to help people generate the lab record feed. It also requires a minimal and arbitrarily extensible way of describing the relationships. This could be as simple as HTML links with tagging of the objects (once you know an object is a sample and it is linked to a procedure you know a lot about what is going on) but there is a logic in having a minimal vocabulary that describes relationships (what you don’t know explicitly in the tagging version is whether the sample is an input or an output). But it can also be fully semantic if that is what people want. And while the loosely tagged material won’t be easily and tightly coupled to the fully semantic material, the connections will at least be there. A combination of both is not perfect, but it’s a step on the way towards the global data graph.
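Here is a small sketch of what such a minimal relationship vocabulary might buy you over plain tagged links. The three relationship names and the post and sample identifiers are my own invention for the example, not an existing standard.

```python
from collections import namedtuple

# One typed link: a post, the kind of relationship, and the object pointed at.
Relation = namedtuple("Relation", ["subject", "predicate", "target"])

# A deliberately tiny, made-up vocabulary; a real one could be extended freely.
VOCABULARY = {"has_input", "has_output", "follows_procedure"}


def relate(subject, predicate, target):
    """Create a typed link, rejecting relationship names outside the vocabulary."""
    if predicate not in VOCABULARY:
        raise ValueError(f"unknown relationship: {predicate}")
    return Relation(subject, predicate, target)


def targets(record, subject, predicate):
    """List everything a post points at with a given relationship."""
    return [r.target for r in record if r.subject == subject and r.predicate == predicate]


record = [
    relate("post/digest-41", "follows_procedure", "procedure/restriction-digest"),
    relate("post/digest-41", "has_input", "sample/plasmid-9"),
    relate("post/digest-41", "has_output", "sample/digested-plasmid-10"),
]

inputs = targets(record, "post/digest-41", "has_input")
outputs = targets(record, "post/digest-41", "has_output")
print(inputs, outputs)
```

With tags alone you would know that two samples and a procedure are connected to the post. With even this much vocabulary you also know which sample went in and which came out.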

Recording the fiddly bits of experimental and data analysis work

We are in the slow process of gearing up within my group at RAL to adopt the Chemtools LaBLog system and in the process moving properly to an Open Notebook status. This has taken much longer than I had hoped but there have been some interesting lessons along the way. Here I want to think a bit about a problem that has been troubling me for a while.

I haven’t done a very good job of recording what I’ve been doing in the times that I have been in a lab over the past couple of months. Anyone who has been following along will have seen small bursts of apparently unrelated activity where nothing much ever seems to come to a conclusion. This has been divided up mainly into a) a SANS experiment we did in early November which has now moved into a data analysis phase, b) some preliminary, and thus far fairly unconvincing experiments, attempting to use a very new laser tweezers setup at the Central Laser Facility to measure protein-DNA interactions at the single molecule level and c) other random odds and sods that have come by. None of these have been very well recorded for a variety of reasons.

Data analysis, particularly when it uses a variety of specialist software tools, is something I find very challenging to record. A common approach is to take some relatively raw data, run it through some software, and repeat, while fiddling with parameters to get a feel for what is going on. Eventually the analysis is run “for real” and the finalised (at least for the moment) structure/number/graph is generated. The temptation is obviously just to formally record the last step, but while this might be ok as a minimum standard if only one person is involved, when more people are working through data sets it makes sense to try and keep track of exactly what has been done and which data has been partially processed in which ways. This both helps us to quickly track where we are with the process and reduces the risk of replicating effort.

The laser tweezers experiment involves a lot of optimising of buffer conditions, bead loading levels, instrumental parameters and whatnot. Essentially a lot of fiddling, rapid shifts from one thing to another and not always being too sure exactly what is going on. We are still at the stage of getting a feel for things rather than stepping through a well ordered experiment. Again the recording tends to be haphazard as you try one thing and then another. We’re not even absolutely sure what we should be recording for each “run” or indeed really what a “run” is yet.

The common theme here is “fiddling” and the difficulty of recording it efficiently, accurately, and usefully. What I would prefer to be doing is somehow capturing the important aspects of what we’re doing as we do it. What is less clear is what the best way to do that is. In the case of data analysis we have a good model for how to do this well. Good use of repositories and the use of versioned scripts for handling data conversions, in the way that Michael Barton in particular has talked about, provide an example of good practice. Unfortunately it is good practice that is almost totally alien to experimental biochemists and is also not easily compatible with a lot of the software we use.
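For the scriptable part of an analysis, the core of that good practice can be sketched in a few lines: fingerprint the script, the data, and the parameters for every run, so that each bit of fiddling leaves a trace. This is my own minimal illustration rather than Michael Barton's actual setup, and the function names and the example script, data, and parameters are all hypothetical.

```python
import hashlib
import json


def fingerprint(content: bytes) -> str:
    """Short, stable identifier for a script, a data file, or a run description."""
    return hashlib.sha256(content).hexdigest()[:12]


def describe_run(script: bytes, data: bytes, parameters: dict) -> dict:
    """Describe one analysis run by what went into it, not by when it happened."""
    record = {
        "script": fingerprint(script),
        "data": fingerprint(data),
        "parameters": parameters,
    }
    # The run id changes whenever the script, the data, or any parameter changes.
    canonical = json.dumps(record, sort_keys=True).encode("utf-8")
    record["run_id"] = fingerprint(canonical)
    return record


# Hypothetical contents; in practice these would be read from files in a repository.
script = b"fit_scattering_curve(data, q_min, q_max)"
raw_data = b"q,intensity\n0.01,120.4\n0.02,98.7\n"

first_try = describe_run(script, raw_data, {"q_min": 0.01, "q_max": 0.20})
second_try = describe_run(script, raw_data, {"q_min": 0.02, "q_max": 0.20})

# One JSON line per run is enough to append to a log or post to a notebook.
print(json.dumps(first_try, sort_keys=True))
print(json.dumps(second_try, sort_keys=True))
```

The two runs share script and data fingerprints but get different run ids, which is exactly the kind of record that tends to be lost when only the final "for real" run is written down. It does nothing, of course, for the software that cannot be scripted.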

The ideal would be a work bench using a graphical representation of data analysis tools and data repositories that would automatically generate scripts and deposit these and versioned data files into an appropriate repository. This would enable the “docking” of arbitrary web services, software packages and whatever, as well as connection to shared data stores. The purpose of the workbench would be to record what is done, although it might also provide some automation tools. In many ways this is what I think of when I look at work flow engines like Taverna and platforms for sharing workflows like MyExperiment.

It’s harder in the real world. Here the workbench is, well, the workbench but the idea of recording everything along with contextual metadata is pretty similar. The challenge lies in recording enough different aspects of what is going on to capture the important stuff without generating a huge quantity of data that can never be searched effectively. It is possible to record multiple video streams, audio, and screencasts of any control computers, but it will be almost impossible to find anything in these data streams.

A challenge that emerges over and over again in laboratory recording is that you always seem to not be recording the thing that you really now need to have. Yet if you record everything you still won’t have it because you won’t be able to find it. Video, image, and audio search will one day make a huge difference to this but in the meantime I think we’re just going to keep muddling on.

The distinction between recording and presenting – and what it means for an online lab notebook

Something that has been bothering me for quite some time fell into place for me in the last few weeks. I had always been slightly confused by my reaction to the fact that on UsefulChem Jean-Claude actively works to improve and polish the description of the experiments on the wiki. Indeed this is one of the reasons he uses a wiki as the process of making modifications to posts on blogs is generally less convenient and in most cases there isn’t a robust record of the different versions. I have always felt uncomfortable about this because to me a lab book is about the record of what happened – including any mistakes in recording you make along the way. There is some more nebulous object (probably called a report) which aggregates and polishes the description of the experiments together.

Now this is fine, but the point is that the full history of a UsefulChem page is immediately available from the history. So the full record is very clearly there – it is just not what is displayed. In our system we tend to capture a warts and all view of what was recorded at the time and only correct typos or append comments or observations to a post. This tends not to be very human readable in most cases – to understand the point of what is going on you have to step above this to a higher level – one which we are arguably not very good at describing at the moment.

I had thought for a long time that this was a difference between our respective fields. The synthetic chemistry of UsefulChem lends itself to a slightly higher level description where the process of a chemical reaction is described in a fairly well defined, community accepted, style. Our biochemistry is more a set of multistep processes where each of those steps is quite stereotyped. In fact for us it is difficult to define where the ‘experiment’ begins and ends. This is at least partly true, but actually if you delve a little deeper and also have a look at Jean-Claude’s recent efforts to use a controlled vocabulary to describe the synthetic procedures a different view arises. Each line of one of these ‘machine readable’ descriptions actually maps very well onto each of our posts in the LaBLog. Something that maps on even better is the log that appears near the bottom of each UsefulChem page. What we are actually recording is rather similar. It is simply that Jean-Claude is presenting it at a different level of abstraction.

And that I think is the key. It is true that synthetic chemistry lends itself to a slightly different level of abstraction than biochemistry and molecular biology, but the key difference actually comes in motivation. Jean-Claude’s motivation from the beginning has been to make the research record fully available to other scientists; to present that information to potential users. My focus has always been on recording the process that occurs in the lab and in particular on capturing the connections between objects and data files. Hence we have adopted a fine grained approach that provides a good record, but does not necessarily make it easy for someone to follow the process through. On UsefulChem the ideal final product contains a clear description of how to repeat the experiment. On the LaBLog this will require tracking through several posts to pick up the thread.

This also plays into the discussion I had some months ago with Frank Gibson about the use of data models. There is a lot to be said for using a data model to present the description of an experiment. It provides all sorts of added value to have an agreed model of what these descriptions look like. However it is less clear to me that it provides a useful way of recording or capturing the research process as it happens, at least in a general case. Stream of consciousness recording of what has happened, rather than stopping halfway through to figure out how what you are doing fits into the data model, is what is required at the recording stage. One of the reasons people feel uncomfortable with electronic lab notebooks is that they feel they will lose the ability to scribble such ‘free form’ notes – the lack of any presuppositions about what the page should look like is one of the strengths of pen and paper.

However, once the record, or records, have been made then it is appropriate to pull these together and make sense of them – to present the description of an experiment in a structured and sensible fashion. This can of course be linked back to the primary records and specific data files but it provides a comprehensible and fine grained description of the rationale for and conduct of the experiment as well as placing the results in context. This ‘presentation layer’ is something that is missing from our LaBLog but could relatively easily be pulled together by writing up the methodology section for a report. This would be good for us and good for people coming into the system looking for specific information.

The trouble with semantics…

…is knowing what you mean…

I posted last week about the spontaneous CMLReact hackfest held around Peter Murray-Rust’s dining room table the day after Science Blogging in London. There were a number of interesting things that came out of the exercise for me. The first was that it would be relatively easy to design a moderately strict, but pretty standard, description format for a synthetic chemistry lab notebook that could be automatically scraped into CMLReact.

Automatic conversions from lab book to machine readable XML

CMLReact files have (roughly) three sections. In the first, all the molecules that are relevant to the description are described, or in the ideal semantic web world pointed to at an external authority such as Chemspider, PubChem, or other source. In the second section the relationships between input materials, solvents, products, and samples are described. In general all of these will be molecules which are referred to in the first section but this is not absolutely required (and this will be important later). The final section describes observables, procedures, yields, and other descriptions of what happened or what was measured.

If we take a look at the UsefulChem experiment that we converted to CMLReact you can see that most of this information is available in one form or another. The molecules are described via InChI/InChIKey at the bottom of the page. These could be used as they are to populate the molecules section. A little additional markup to distinguish between reactants, solvents, reagents, and products would make it possible to start populating the second section describing the relationships between these molecules.

The third section is the most tricky, and this will always be an 80:20 game. The object is to abstract as much information as can be reasonably garnered without putting in the vast amount of work required to get close to 100% retrieval. At the end of the day, if someone wants the real detail they can go back to the lab book. Peter has demonstrated text scraping tools that do a pretty good job of extracting a lot of this information. In combination with a bit of markup it is reasonable to expect that some basic information (amounts of reagents, yield, temperature of reaction, some descriptive terms) could be extracted. Again, getting 80-90% of a subset of regularly used terms would be very powerful.
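To give a feel for the cheap end of that 80:20 game, here is a deliberately naive sketch that pulls amounts, temperatures, and yields out of free text with regular expressions. This is not one of Peter's tools, which are far more capable; the patterns, the short list of units, and the notebook sentence are all made up for the illustration.

```python
import re

# Naive patterns covering only a handful of common units.
AMOUNT = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|g|mmol|mol|mL|L)\b")
TEMPERATURE = re.compile(r"(-?\d+(?:\.\d+)?)\s*°C")
YIELD = re.compile(r"(\d+(?:\.\d+)?)\s*%\s*yield", re.IGNORECASE)


def scrape(text):
    """Extract the easy numbers from a free-text description of a reaction."""
    return {
        "amounts": [(float(value), unit) for value, unit in AMOUNT.findall(text)],
        "temperatures_c": [float(value) for value in TEMPERATURE.findall(text)],
        "yields_percent": [float(value) for value in YIELD.findall(text)],
    }


# An invented notebook sentence, not taken from UsefulChem.
entry = (
    "Benzaldehyde (106 mg, 1.0 mmol) was dissolved in methanol (2 mL) "
    "and stirred at 25 °C for 48 h. The precipitate was filtered to give "
    "the product in 62% yield."
)

scraped = scrape(entry)
print(scraped)
```

Note what it misses: it has no idea which amount belongs to which compound, and it ignores the reaction time entirely. That is where a little markup in the notebook, or a proper chemistry-aware parser, earns its keep.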

But what are we describing?

There is a problem with grabbing this descriptive information from the lab notebook however, and it is a problem that is very general and something I believe we need to grapple with urgently. There is a fundamental question as to what it is that this file is describing. Does it describe the plan of the experiment? The record of carrying out a specific example of this experiment? An ‘averaged’ description of a set of equivalent experiments? A general description of the reaction? Or a description of a model of what we expect or think is happening?

If you look closely at the current version of the CMLReact file you will see that the yield is expressed as a percentage with a standard deviation. This is actually describing the average of three independent reactions but that is not actually made explicit anywhere in this file. Is this important? Well I think it is because it has an effect on what any outward links back to the lab book mean. There is a significant difference between – ‘this link points to an example of this kind of reaction’ (which might in fact be significantly different in the details) and ‘this link points to this exact experiment’ or indeed ‘this link points to an index of relevant experimental results’. Those distinctions need to be encoded in the links, or perhaps more likely made explicit in the abstracted file.
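One way to picture making this explicit: when the abstracted description is generated, it states what kind of thing it is describing and what each link back to the lab book means. The sketch below uses field names and link kinds that I have invented for the purpose (they are not part of CMLReact), and the URLs and yields are placeholders.

```python
from statistics import mean, stdev


def summarise_replicates(runs):
    """Summarise replicate reactions while saying explicitly what is described.

    `runs` is a list of (lab_book_url, yield_percent) pairs; at least two
    replicates are needed for a standard deviation.
    """
    yields = [value for _, value in runs]
    return {
        # States outright that this is an average, not a single experiment.
        "describes": "average_of_replicates",
        "n_replicates": len(yields),
        "yield_mean_percent": round(mean(yields), 1),
        "yield_stdev_percent": round(stdev(yields), 1),
        # Each link says what it points at: here, the exact experiments averaged.
        "links": [{"kind": "exact_experiment", "href": url} for url, _ in runs],
    }


# Hypothetical lab book pages and yields for three independent reactions.
runs = [
    ("https://notebook.example.org/exp/150a", 60.0),
    ("https://notebook.example.org/exp/150b", 64.0),
    ("https://notebook.example.org/exp/150c", 62.0),
]

summary = summarise_replicates(runs)
print(summary["yield_mean_percent"], summary["yield_stdev_percent"])
```

A link of kind "example_of_reaction" or "index_of_results" would carry a quite different meaning for anyone, or anything, following it, which is the whole point of recording the kind.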

The CMLReact file is an abstraction of the experimental record. It is therefore important to make it clear what the level of abstraction is and what has been abstracted out of that description. This relates to the distinction I have made before between the flexibility required to record an experiment versus the ability to use a more structured vocabulary to describe the experiment after it has happened. My impression is that people who work in developing these controlled vocabularies are focussed on description rather than recording and don’t often make the distinction between the two. There is also often a lack of distinction between describing an experiment and describing a model of what happened in that experiment. This is important because the model may need to be modified in the future whereas the description of the experiment should be accurate.

Summary

My view remains that when recording an experiment the system used should be as flexible as possible. Structure can be added to this primary record when convenient to make the process of abstracting from this primary record to a controlled vocabulary easier. The primary goal for me, for the moment, remains making a human readable record available. The process of converting the primary record into a controlled vocabulary, such as CMLReact, FuGE, or workflow system such as Taverna, should be enabled via domain specific automated or semi-automated tools that help the user to structure their description of the experiment in a way that makes it more directly useful to them but maintains the links with the primary record. Where the same controlled vocabulary is used for more abstracted descriptions of studies, experiments, or the models that purport to describe them, this distinction must be made clear.

Semantics depends absolutely on being clear about what you are describing. There is absolutely no point in having absolute clarity about the description of an object if the nature of that object is fuzzy. Get it right and we could have a very sophisticated description of the scientific record. Get it wrong and that description could be at best unclear and at worst downright misleading.

Another lab notebook framework

Just a very brief note, which really follows on from the vigorous discussion in Jennifer Rohn’s blog at Nature Networks this week, to say that the guys at OpenWetWare appear to have gone live with some of the new functionality for laboratory notebooks on the wiki. Check it out from the OWW main page. I will have a closer look and make some comments as and when I have a bit of time but it looks like a good start.