I’ve been interested for some time in capturing information in the lab, along with the context in which that information is created. The question of how to build an efficient and usable laboratory recording system is fundamentally one of how much information it is necessary to record, and how much of that can be captured while bothering the researcher as little as possible.
The Beyond the PDF mailing list has, since the meeting a few weeks ago, been partly focused on attempts to analyse human-written text and to annotate it as structured assertions, or nanopublications. This is also the approach that many Electronic Lab Notebook systems attempt to take, capturing an electronic version of the paper notebook and in some cases trying to capture all the information in it in a structured form. I can’t help but feel that, while this is important, it’s almost precisely backwards. By definition any summary of a written text will throw away information; the only question is how much. Rather than trying to capture arbitrary and complex assertions in written text, it seems better to me to ask what simple vocabulary can be provided that expresses enough of what people want to say to be useful.
In classic 80/20 style we ask: what is useful enough to interest researchers, how much would we lose, and what would that loss be? This neatly sidesteps the questions of truth (though not of likelihood) and context that are the real challenge of structuring human-authored text via annotation, because the limited vocabulary, together with the collection of structured statements already made, provides an explicit context.
This kind of approach turns out to work quite well in the lab. In our blog-based notebook we use a one item-one post approach where every research artifact gets its own URL. Both the verbs (the procedures) and the nouns (the data and materials) have unique identifiers, and the relationships between verbs and nouns are provided by simple links. Thus the structured vocabulary of the lab notebook is [Material] was input to [Process] which generated [Data] (where Material and Data can be interchanged depending on the process). This is not so much 80/20 as 30/70, but even in this very basic form it can be quite useful. Along with records of who did something and when, and some basic tagging, this actually makes quite an effective lab notebook system.
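To make that concrete, here is a minimal sketch, with entirely hypothetical URLs and relation names, of how the one item-one post model might be represented: every material, process, and data file is a post with its own URL, and the relationships are just typed links between posts.

```python
from dataclasses import dataclass, field

@dataclass
class Post:
    """One research artifact = one post with its own URL."""
    url: str                                    # unique identifier
    kind: str                                   # "material", "process", or "data"
    author: str
    created: str                                # ISO 8601 timestamp
    links: list = field(default_factory=list)   # (relation, target_url) pairs

# [Material] was input to [Process] which generated [Data]
sample = Post("http://notebook.example/sample-42", "material", "cn",
              "2011-02-21T09:30:00Z")
pcr = Post("http://notebook.example/pcr-run-7", "process", "cn",
           "2011-02-21T09:45:00Z",
           links=[("has_input", "http://notebook.example/sample-42")])
gel = Post("http://notebook.example/gel-image-3", "data", "cn",
           "2011-02-21T12:10:00Z",
           links=[("output_of", "http://notebook.example/pcr-run-7")])
```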
The question is, how can we move beyond this to create a record which is rich enough to provide a real step up, but doesn’t bother the user any more than is necessary and justified by the extra functionality that they’re getting? In fact, ideally we’d capture a richer and more useful record while bothering the user less. Part of the solution lies in the work that Jeremy Frey’s group have done with blogging instruments. By having an instrument create a record of its state, inputs, and outputs, the user is freed to focus on what they’re doing, and only needs to link into that record when they start their analysis.
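This isn’t Frey’s group’s actual system, but a sketch of the shape such a record might take, with made-up field names and a hypothetical notebook endpoint: the instrument posts its own run record, and the researcher links to it afterwards.

```python
import json
import time
import urllib.request

def post_instrument_record(endpoint, instrument_id, state, inputs, outputs):
    """POST a structured run record to a (hypothetical) notebook endpoint."""
    record = {
        "instrument": instrument_id,
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "state": state,      # e.g. temperature, method file in use
        "inputs": inputs,    # URLs of the material posts that went in
        "outputs": outputs,  # URLs of the data files produced
    }
    req = urllib.request.Request(endpoint,
                                 data=json.dumps(record).encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    return urllib.request.urlopen(req)
```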
Another route is the approach that Peter Murray-Rust’s group are exploring with interactive lab equipment, particularly a fume cupboard that can record spoken instructions and comments and track where objects are, monitoring an entire process in detail. The challenge in this approach lies in translating that information into something that is easy to use downstream. Audio and video remain difficult to search and to work with, and speech recognition doesn’t yet produce well-formatted, clearly presented text.
In the spirit of a limited vocabulary, another approach is to use a lightweight infrastructure to record short comments, either structured or free text. A bakery in London has a switch on its wall which can be turned to one of a small number of baked goods as a batch goes into the oven. This is connected to a very basic Twitter client which then tells the world that there are freshly baked baguettes coming in about twenty minutes. Because this output data is structured, it would in principle be possible to track the different baking times, and preferences for muffins vs doughnuts over the day and over the year.
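As a toy illustration (the switch positions and baking times here are invented), the switch only has to map a position to one structured, machine-readable message; because every message has the same shape, tallying muffins against doughnuts later is trivial.

```python
# Toy sketch of the bakery switch: one position, one structured message.
OVEN_MINUTES = {"baguette": 20, "muffin": 25, "doughnut": 15}  # assumed times

def switch_turned(position):
    if position not in OVEN_MINUTES:
        raise ValueError(f"unknown switch position: {position}")
    return f"Fresh {position}s in about {OVEN_MINUTES[position]} minutes #fresh"

print(switch_turned("baguette"))
```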
The lab is slightly more complex than a bakery. Different processes take different inputs. Our hypothetical structured vocabulary would need to enable the construction of sentences with subjects, predicates, and objects, but as we’ve learnt with the lab notebook, even the simple predicates “is input to” and “is output of” can be very useful. “I am doing X”, where X is one of a relatively small set of options, provides real-time bounds on when important events happened. A little more sophistication could go a long way. A very simple Twitter client that provided a relatively small range of structured statements could be very useful, and these statements could be processed downstream into a more directly usable record.
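A minimal sketch of what such a client might record, assuming a made-up vocabulary of predicates: each statement becomes a timestamped subject-predicate-object record that downstream tools can process.

```python
import time

# An assumed, deliberately tiny vocabulary
PREDICATES = {"doing", "is_input_to", "is_output_of"}

def statement(who, predicate, subject, obj=None):
    if predicate not in PREDICATES:
        raise ValueError(f"not in the vocabulary: {predicate}")
    return {"who": who,
            "when": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
            "predicate": predicate,
            "subject": subject,
            "object": obj}

statement("cn", "doing", "PCR setup")                    # real-time bound on an event
statement("cn", "is_input_to", "sample-42", "pcr-run-7")
```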
Last week I recorded the steps that I carried out in the lab via the hashtag #tweetthelab. These free-text tweets make a serviceable, if not perfect, record of the day’s work. What is missing is a URI for each sample and output data file, and links between the inputs, the processes, and the outputs. But this wouldn’t be too hard to generate, particularly if the instruments themselves were actually blogging or tweeting their outputs. A simple client on a tablet, phone, or locally placed computer would make it easy both to capture and to structure the lab record. There is still a need for free-text comments, and no structured description will be able to capture everything, but the potential for capturing a lot of the detail of what is happening in a lab, as it happens, is significant. And it’s the detail that often isn’t recorded terribly well: the little bits and pieces of exactly when something was done, what the balance really read, which particular bottle of chemical was picked up.
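Turning such tweets into a usable record is mostly parsing; here is a rough sketch (the field names are assumptions about the shape of the tweet data, not any particular Twitter API) that pulls out the timestamp, the free text, and any URIs that were linked in.

```python
import re

def tweet_to_event(tweet):
    """tweet: a dict with 'text' and 'created_at' keys (assumed shape)."""
    text = tweet["text"]
    if "#tweetthelab" not in text:
        return None
    uris = re.findall(r"https?://\S+", text)  # sample/data URIs linked in the tweet
    return {"when": tweet["created_at"],
            "what": text.replace("#tweetthelab", "").strip(),
            "links": uris}

tweet_to_event({"text": "Loaded gel, 30 min run #tweetthelab",
                "created_at": "2011-02-21T14:05:00Z"})
```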
Twitter is often derided as trivial, as lowering the barrier to shouting banal fragments to the world, but in the lab we need tools that will help us collect, aggregate and structure exactly those banal pieces so that we have them when we need them. Add a little bit of structure to that, but not too much, and we could have a winner. Starting from human discourse always seemed too hard for me, but starting with identifying the simplest things we can say that are also useful to the scientist on the ground seems like a viable route forward.
Cameron, thanks for an excellent piece about how your lab and others are thinking about capturing a “serviceable, if not perfect, record of the day’s work.” I use my lab notebook entries to quite intentionally summarize/prune/process what I’ve attempted that day, so I’ve also started using Twitter and Flickr to capture a more detailed version of my day’s work.
My simulations can automatically push their status and results to twitter, which I have found more useful than conventional logs for many reasons. I discuss a bit how I’m using machines that twitter here: http://www.carlboettiger.info/archives/375
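Not Carl’s actual setup, but one common way to wire this up in Python is via the tweepy library; the credentials and message format here are placeholders.

```python
import tweepy

# Placeholder credentials: register an app with Twitter to get real ones
auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth)

def report(run_id, step, message):
    """Push one status line from a running simulation."""
    api.update_status(status=f"[{run_id}] step {step}: {message}"[:140])

report("sim-007", 1000, "converged, writing results")
```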
I’m interested in learning how to better capture the flow of ideas -> inputs -> process -> outputs. Thanks for your thought-provoking post!
Michael Barton was also doing something ages ago when he had an automatic tweet (or was it a FriendFeed post?) go off each time he committed data or code to his repository; a simple post-commit hook would do it (there’s a sketch below). But any of these services provide an interesting framework for capturing those little bits and pieces.

I agree the challenge is moving from pieces to summaries, or in the other direction from ideas to action… that’s still a big unsolved issue for me.
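A guess at the kind of hook involved (the details of Michael Barton’s setup are assumed, and the actual posting step is stubbed out with a print):

```python
#!/usr/bin/env python
# Save as .git/hooks/post-commit and mark it executable.
import subprocess

# Grab the short hash and subject line of the commit that just landed
last = subprocess.check_output(
    ["git", "log", "-1", "--pretty=format:%h %s"]).decode("utf-8")

# Stand-in for posting to Twitter, FriendFeed, or an internal feed
print(f"committed: {last}")
```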
Totally love the idea of a lightweight client on all my lab computers (the ones connected to instruments etc.) where I could tag and write notes, bug reports, observations, ideas, proposals (and other braindumps). It would make documentation much less of a pain, and easier to search.
I talked with PMR about recording lab activity in Excel, and what could be done with that. Essentially all you need for a lab book are events with corresponding times. If you have that correspondence, you can build the record of an experiment any way you like, and you could reconstruct a full English-language account automatically, without having to write it. Your tweets were useful for this reason: they were time-stamped events.
Agreed, you only need a few standard items (who, when, content) and then you can layer specific terms on top of that as appropriate to the domain or group. I’d argue that Excel is actually a poor tool to use for this. It would be very easy to build something simple enough to use that would take away some of the uncertainty (e.g. why rely on the user for the time, or ideally even for their identity?), but the basic concept is very general. I like having things that generate a feed of some description, because that makes them very easy to process and there are a lot of tools out there. Less so for Excel.
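For instance, a minimal sketch of the who/when/content core, with the timestamp supplied automatically rather than by the user; anything domain-specific goes in tags layered on top, and the whole log serializes straight into a feed.

```python
import json
import time

EVENTS = []

def log_event(who, content, tags=()):
    """The timestamp comes from the system, not the user."""
    EVENTS.append({"who": who,
                   "when": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
                   "content": content,
                   "tags": list(tags)})

def feed():
    """Serialize the event log for downstream tools to consume."""
    return json.dumps({"title": "lab events", "entries": EVENTS}, indent=2)

log_event("cn", "started overnight digest", tags=["sample-42"])
print(feed())
```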
In any case it would probably be worth mocking something up along these lines…
At some level most operating systems now offer something along these lines but yes, I’ve been trying to get my head around some sort of scrobbler system plus online bookmarking system that lets you bring objects into a space and tag, categorize, and push them to other places. I don’t even think that a basic version would be too hard to build.
Yes – let’s do it. I’d obviously like to describe one of our current open experiments in a form that relies less on a Full English Sentence, and is merely a collection of events and times. I’d then like to see how those things could be rearranged or aggregated. The question is how to collect action:time couples. The time has to be an automatic stamp. Let’s include photos, also time-stamped. What about an audio recording too, describing the conclusion of the experiments?
Err, did I just volunteer to do another project… :-) A good place to start would be a mockup of what would be useful and some ideas about what kind of device to use. It shouldn’t be too hard to throw up some sort of RoR app, particularly if you’ve got a clear idea of what you want up front. Also, do you have personal ID cards or something like that that could be used to easily identify the person?
This is a very interesting discussion. Arxspan’s ArxLab product is web based (i.e. platform-neutral; only a browser is required) and allows for just this sort of input. It captures the relevant data for IP protection, such as user name and time stamp, and allows the scientist to “just type” instead of navigating to the correct notebook/experiment/page. Currently that is only through the user interface, but we are working to extend it to SMS and other streams like Twitter. Check it out at http://www.arxspan.com – user feedback is always welcomed!
Here’s a bit of a start. https://github.com/mr01/Labtrove-Misc
The PHP script reads the tweets from a given account, looking for a hashtag and a yfrog image; once found, it posts to LabTrove.
Posting images works fine; the actual tweets tend to end up miscoded somewhere along the line (Latin-1 database, UTF-8 XML).
That’s cool. Do you have a target blog functioning anywhere at the moment? I think the character encoding has been an issue on a number of fronts which we should probably sort out. Not sure why UTF8 isn’t good for everything?
I’ve only got a blog running on localhost with some modifications I’m afraid, nothing yet hosted on a domain. The encoding is a tricky one: I tried re-encoding to Latin-1, but that means the XML is no longer in UTF-8. I’m not sure if my only option is to write some regex strings and then PHP-replace some of the characters (shudder).
Found a solution to the issue after a bit of trial and error. The LabTrove API (api/rest/rest.php line 32) encodes to UTF-8 twice rather than once, which produces the malformed characters: the first time when simplexml loads the string, and the second within the utf8_encode call. Removing the call to the utf8_encode function results in all being well.
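For anyone hitting the same thing, here’s a quick illustration of the failure mode in Python rather than PHP: utf8_encode assumes Latin-1 input, so applying it to bytes that are already UTF-8 encodes them a second time.

```python
s = "\u201cis input to\u201d"      # text containing curly quotes
utf8_once = s.encode("utf-8")      # correct UTF-8 bytes
# The spurious second encode: treat the UTF-8 bytes as Latin-1 and re-encode
utf8_twice = utf8_once.decode("latin-1").encode("utf-8")
print(utf8_twice.decode("utf-8"))  # mojibake instead of curly quotes
```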