Evolving usage patterns on the Southampton Lab Blog Book
I am in the process of preparing the talk I am giving at Drexel next month and have been going over the early versions of our Lab Blog, getting a clearer picture of how our usage has evolved. I wanted to record this, so I will write some notes as I go.
To reiterate, our key aims were to capture and record the data and to make it available in a way that would allow machine processing. For example, we wanted to be able to link automatically from the data for an enzyme assay through to the purification of the enzyme, the characterisation of its DNA sequence, and right through to the QA data for the DNA primer used to prepare that specific mutant gene.
Our usage of the blog has evolved through a series of clearly defined stages. Some of the changes were consciously thought through, but several seem to have just occurred. Here I want to record these phases and try to capture the thought processes that led to them.
1. Using the blog as a simple journal
In the first phases of use the blog was used as a simple journal. The original blog journal that was used at this stage is ‘Beta-Glu‘ (look at entries in Nov/Dec 06). Experimental procedures were recorded either in free text or in tables. Data, for the most part pictures, was uploaded into the procedure. At this stage there was no means of modifying a post once it had been submitted. Therefore comments were often used to add detail as the course of the experiment proceeded. See for example:
Each post at this stage covers a much larger set of actions than is the case at later stages. In some cases this leads to confusion as to what is what. Where multiple analyses occur it is also often not obvious which reaction is which, e.g. which lane on a gel corresponds to which reaction. Stability and usability issues also meant that much of the data was not actually uploaded at all.
2. Starting to capture metadata
In the first phase the metadata ‘section’ was used in a freeform way to describe the overall experiment. Two different processes were being carried out in parallel. One was labelled ‘beta-galactosidase preparation and assays‘ and the other ‘Beta glucuronidase‘. This makes it easy to define and separate the two experiments – the first real example of moving beyond the capabilities of a paper notebook. Specific classes of material can also have their own sections (Buffers, Cell strain, Enzymes, Primers).
However, none of this provides a link from one procedure to the next. For example it is not immediately obvious whether this transformation used the plasmid DNA prepared from this ligation or this ligation. To achieve this a new piece of metadata called ‘Sample parent’ was created. It quickly became evident that this would not work as the system only allows one entry per post for each piece of metadata (i.e. for a given post ‘Sample Parent’ can only have one value).
While this is a limitation of the system it raises a more general issue for procedures carried out in parallel. If a procedure involves several parallel actions on different, but equivalent materials (e.g. five digestions on different PCR products) then associating multiple ‘sample parents’ with the procedure will not make it (directly) possible to tell which reaction is being done on which sample. As the object was to make these kinds of links machine readable this was not the best approach to use.
A similar problem became clear with commonly used input materials such as buffers or primers. One of our aims was to provide a link between each procedure and the input materials so as to enable later analysis to check for example whether a specific buffer had gone off, and if so when. Without having a very large number of metadata types it would not be possible under the current system to provide those links.
At the end of this phase it was clear that the current category system was not functioning well and that there was a need to identify individual samples more precisely. These realisations led to a radical re-organisation of the system and the start of an entirely new blog so as to begin with a clean sheet.
3. Implementing a ‘One item-one post’ system
Phase 3 represented a complete departure from the original approach and was therefore implemented in a new Blog, ‘Investigations into neutral drift‘.
The key realisation made in phase 2 was that to enable a clear indication of what materials were inputs into a procedure it was necessary for each such material to have a clear identity. For many generic input materials this identity was already provided by a specific post (e.g. XhoI enzyme second batch). What was missing was equivalent posts for each product generated as part of the experiment. Generating these posts led to the ‘one item-one post’ system that was then adopted. Broadly speaking the philosophy adopted was that if there was a tube (bottle, container etc.) of material then it should have its own post.
If every possible input material has its own post then these can easily be referred to by links between posts. This provides a structure in which it is possible to say ‘procedure A took materials B, C, and D to generate product X’. In addition, by providing categories for the posts it is possible to add context: ‘PCR reaction A used primers B and C, on the template plasmid D to generate the PCR product X’. This is potentially very powerful, as the above statement can be decomposed into triples, which may make it possible to deploy all the available technology for machine reasoning. For example: ‘PCR reaction A utilised primer B’; ‘PCR reaction A generated PCR product X’; therefore ‘Product X was generated using primer B’; and so ‘if there is a problem with B there is potentially a problem with X’.
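The reasoning over triples described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the tuple representation, the predicate names ‘utilised’ and ‘generated’, the second procedure (Digestion E) and the function name are all hypothetical, and the blog itself stores these relationships as links between posts rather than in this form.

```python
# Hypothetical triples of the form (subject, predicate, object).
# In the blog these would be derived from links between posts.
triples = [
    ("PCR reaction A", "utilised", "primer B"),
    ("PCR reaction A", "utilised", "primer C"),
    ("PCR reaction A", "utilised", "plasmid D"),
    ("PCR reaction A", "generated", "PCR product X"),
    ("Digestion E", "utilised", "PCR product X"),
    ("Digestion E", "generated", "Digestion product Y"),
]


def products_affected_by(material, triples):
    """Return every product downstream of `material`.

    A product is downstream if some procedure utilised the material
    (or an already affected product) and generated that product.
    """
    affected = set()
    frontier = [material]
    while frontier:
        current = frontier.pop()
        # Procedures that took `current` as an input.
        procedures = {s for s, p, o in triples if p == "utilised" and o == current}
        for s, p, o in triples:
            if s in procedures and p == "generated" and o not in affected:
                affected.add(o)
                frontier.append(o)
    return affected


print(sorted(products_affected_by("primer B", triples)))
```

If primer B turned out to be faulty, a query like this would flag PCR product X and anything later made from it, which is the kind of check on input materials that the one item-one post structure is meant to enable.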
The value of this depends a great deal on how well the posts are categorised. The initial scheme involved two pieces of Metadata. The ‘Section’ described the general class of post as Procedure, Material, Product, or Note. Materials are generally things that are bought in and Products are things made in the laboratory. Procedures take Materials and Products and generate further Products. Notes are anything else; a general description of what is going on, a specific comment, or just an explanation of why nothing has happened. Templates are discussed below and Safety was added at a later stage to contain Safety information.
In this initial scheme the metadata ‘Post Type’ then had to carry all the additional information as to what kind of procedure, or what sort of material, a post was. This rapidly extended to a long list that was not easily categorised. An attempt to use Post Type to create nested categories can be seen in the ‘Sortase Cloning’ blog where, for example, there is a series of Post Types labelled ‘DNA_####’. This is less than ideal and the current phase addresses this by using a larger series of metadata.
The central problem with the one item one post approach is that it requires the generation of a large number of posts. This combined with the difficulty of editing and generating easily readable posts (generating tables is a nightmare) led to development being focussed on improving ease of use. The first development was the generation of templates that could provide a rapid way of generating common types of post and easily generating the correct set of links between them. This led to a significant change in usage patterns and thus into Phase 4, the current phase.
4. Using and improving templates and their effect on blog organisation
Text based web documents such as blogs and wikis work well for plain text with the odd incorporated picture but are very cumbersome for the introduction of tables. In many biochemical procedures, especially where a series of reactions are carried out in parallel, a table is the natural way to present the input materials. This combined with the need to improve the ease of inserting links between posts led to the development of templates.
Biochemical procedures are often quite stereotyped: PCR reactions, restriction digestions and gel electrophoresis are generally carried out the same way many times over. It is therefore appropriate to provide templates that enable the rapid generation of procedures. Such a system would ideally provide the table of reactions ready made and make it easy to select and insert links to the appropriate input materials. The template system implemented within the Southampton Blog provides a flexible and powerful way of rapidly inserting the appropriate links.
A template is prepared similarly to a normal procedure post with markers where specific materials or amounts need to be inserted. When the template is then selected for use these markers are rendered, either as text boxes for the insertion of amounts or post ID codes, or as drop down menus displaying possible posts for insertion. The marker [[XX:yy]] provides a drop down menu of posts for which metadata XX=yy (% provides a wild card). The generated post can also be stamped with specific metadata via the tag [[XX>yy]].
This is tremendously powerful as it enables posts to be automatically tagged with appropriate metadata, provides a rapid and easy way of generating posts with a consistent format, and reduces the need for the user to interact with the blog through the raw coding language. However, to be effective, the categorisation of posts needs to be well organised.
In the ‘Neutral Drift’ blog templates are generally written so that input materials are introduced via their blog post ID. For example, see the gel template. In this case the input DNA material is introduced by giving the appropriate blog ID ([[blog]] tag in template). It would be preferable to provide the link via a drop down menu. However this can’t be achieved because a DNA gel might include samples in the categories Plasmid or PCR_product or Digestion_product. Using the Blog ID means that anything can be added but it requires that the user knows the ID number.
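To make the two marker forms concrete, here is a minimal Python sketch of how a template might be scanned for them. It is not the blog's actual implementation: the regular expression, the function name and the example template text are assumptions made for illustration, and only the [[XX:yy]] and [[XX>yy]] forms described above are handled.

```python
import re

# Assumed marker grammar: [[key:value]] selects posts, [[key>value]] stamps metadata.
MARKER = re.compile(r"\[\[([^\[\]:>]+)([:>])([^\[\]]+)\]\]")


def parse_markers(template_text):
    """Split the markers in a template into selectors and stamps.

    selectors: (key, pattern) pairs that would be rendered as drop down menus
    stamps:    metadata that would be attached to the generated post
    """
    selectors = []
    stamps = {}
    for key, operator, value in MARKER.findall(template_text):
        if operator == ":":
            selectors.append((key, value))
        else:
            stamps[key] = value
    return selectors, stamps


# Hypothetical template text, invented for this example.
template = "Load [[Post_type:DNA%]] onto the gel. [[Section>Procedure]]"
selectors, stamps = parse_markers(template)
print(selectors)
print(stamps)
```

Markers without a ‘:’ or ‘>’ are deliberately ignored by this sketch; a fuller version would also need to render the text boxes used for amounts and post IDs.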
In the ‘Sortase cloning’ blog this is handled slightly differently as the categories were organised after the introduction of templates. The gel template here looks for any post where Post Type starts with DNA ([[Post_type:DNA%]], where % is a wildcard), allowing anything characterised as DNA to be included. This works well except in cases where an item may fall into two categories (e.g. Protein-DNA conjugates which might be run on both protein and DNA gels). The central problem here is that ‘Post Type’ is carrying too much weight for a single piece of metadata.
The currently preferred approach, which is still to be implemented in an ‘active’ blog, is that seen in the BioBlog Sandpit. Here the gel template looks for any post which has the metadata ‘DNA’ set to a non-empty value. The potential problem is that the categories will multiply. Our current pattern of use is one blog per project, which generally means one blog per person, but it remains to be established what the best approach is for a significant sized lab using this system.
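The difference between the two selection strategies can be illustrated with a small Python sketch. Everything here is hypothetical: the post records, their IDs and metadata values, and the function names are invented for the example, and the wildcard handling is simply one plausible reading of how % behaves.

```python
import re

# Hypothetical posts; each carries a single 'Post_type' plus optional per-class keys.
posts = [
    {"id": 101, "Post_type": "DNA_plasmid", "DNA": "plasmid"},
    {"id": 102, "Post_type": "DNA_PCR_product", "DNA": "PCR product"},
    {"id": 103, "Post_type": "Protein_enzyme", "Protein": "enzyme"},
    # A conjugate can only have one Post_type, but can carry both class keys.
    {"id": 104, "Post_type": "Protein_conjugate", "DNA": "conjugate", "Protein": "conjugate"},
]


def wildcard_match(pattern, value):
    """True if `value` matches `pattern`, where % stands for any run of characters."""
    regex = ".*".join(re.escape(part) for part in pattern.split("%"))
    return re.fullmatch(regex, value) is not None


def select_by_value(posts, key, pattern):
    """IDs of posts whose metadata `key` matches the wildcard pattern."""
    return [p["id"] for p in posts if wildcard_match(pattern, p.get(key, ""))]


def select_by_key(posts, key):
    """IDs of posts where metadata `key` is set to a non-empty value."""
    return [p["id"] for p in posts if p.get(key, "")]


print(select_by_value(posts, "Post_type", "DNA%"))  # misses the conjugate
print(select_by_key(posts, "DNA"))                  # includes the conjugate
print(select_by_key(posts, "Protein"))              # also includes the conjugate
```

With a separate key per class, the conjugate appears in the menus for both DNA and protein gels, at the cost of one more metadata key for every class of material.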
A problem with the use of the one item one post approach is the sheer number of posts that are generated. This ultimately creates problems for readability. However the metadata approach that we have now adopted makes it possible, at least in principle, to generate readable forms of a set of experiments. Further issues of usability and readability need to be addressed but significant progress has been made.
The use of tagging and metadata has evolved significantly over the time the Blog has been used as a lab book. It is now approaching the point where the blog can be parsed by a machine, although this has led to issues with readability from a human perspective. However the metadata approach makes it possible to ‘refine and remix’ the material in the blog to provide different versions for different readers, from the human to the machine and from the graduate student to the supervisor (or funding agency).