This is the second of two posts discussing the talk I gave at the Science 2.0 Symposium organized by Greg Wilson in Toronto in July. As I described in the last post, Jon Udell pulled out the two key points from my talk and tweeted them. The first suggested some ideas about what the limiting unit of science, or rather of science communication, might be. The second takes me into rather more controversial areas:
@cameronneylon uses tags to classify records in a bio lab wiki. When emergent ontology doesn’t match the standard, it’s useful info. #osci20
It may surprise many to know that I am a great believer in ontologies and controlled vocabularies. This is because I am a great believer in effectively communicating science, and without agreed language effective communication isn’t possible. Where I differ from many is in the assumption that because an ontology exists it provides the best means of recording my research. This position is born of my experience trying to figure out how to apply existing data models and structured vocabularies to my own research work. Very often the fit isn’t very good, and more seriously, it is rarely clear why, or how to go about adapting or choosing the right ontology or vocabulary.
What I was talking about in Toronto was the use of key-value pairs within the Chemtools LaBLog system and the way we use them in templates. To re-cap briefly: the templates were initially developed so that users could avoid having to manually mark up posts for common procedures, particularly ones with tables. Because we were using a one item, one post system, we knew that the important inputs into such a table would each have their own post, and that the entry in the table could link to that post. This in turn meant that we could provide the user of a template with a drop-down menu populated with post titles. We filter those on the basis of tags, in the form of key-value pairs, so as to offer the right set of possible items to the user. This creates a remarkably flexible, user-driven system with a strong positive reinforcement cycle. To make the templates work well, and to make your life easier, you need to have the metadata properly recorded for your research objects; but in turn you can create templates for your objects that make sure the metadata is recorded correctly.
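To make that concrete, here is a minimal sketch of the filtering logic, with made-up post titles, tag names, and function names rather than the actual Chemtools LaBLog code: each post carries key-value tags, and a template field is populated by filtering posts on the pairs it requires.

```python
# A minimal sketch (illustrative names, not the LaBLog implementation).
# Each post is tagged with key-value pairs; a template field offers
# only those posts whose tags match the pairs the field requires.

posts = [
    {"title": "Oligo AB-f primer", "tags": {"DNA": "oligonucleotide"}},
    {"title": "Plasmid pET-15b prep", "tags": {"DNA": "plasmid"}},
    {"title": "Lysozyme stock 10 mg/ml", "tags": {"Material": "solution"}},
]

def dropdown_options(posts, required_tags):
    """Titles of posts whose tags include every required key-value pair."""
    return [
        p["title"]
        for p in posts
        if all(p["tags"].get(k) == v for k, v in required_tags.items())
    ]

# A PCR primer field would filter for oligonucleotides specifically.
print(dropdown_options(posts, {"DNA": "oligonucleotide"}))
# -> ['Oligo AB-f primer']
```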
The effectiveness of the templates clearly depends very strongly on the organization of the metadata. The more closely the pattern of organization maps onto the realities of how substances and data files are used in the local research process, and the more the templates reflect the details of that process, the more effective they are. We went through a number of cycles of template and metadata re-organization. We would re-organize, thinking we had things settled, and then we would come across another instance of a template breaking or not working effectively. The motivation to re-organize was to make the templates work well and save effort. The system aided us in this by allowing us to make organizational changes without breaking any of the previous schemes.
Through repeated cycles of modification and adaptation we identified an organizational scheme that worked effectively. Essentially it is a scheme that categorizes objects based on what they can be used for. A sample may be in the material form of a solution, but it may also be some form of DNA. Some procedures can usefully be applied to any solution; some are only usefully applied to DNA. If it is a form of DNA then we can ask whether it is a specific form, such as an oligonucleotide, that can be used in specific types of procedure, such as a PCR. So we ended up with a classification of DNA types based on what they might be used for (any DNA can be a PCR template; only a relatively short single-stranded DNA can be used as a – conventional – PCR primer). However, in my work I also had to allow for the fact that something that was DNA might also be protein; I have done work on protein-DNA conjugates, and I might want to run these on both a protein gel and a DNA gel.
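Something like the following sketch captures the logic we converged on (all tag and procedure names here are hypothetical): each procedure declares the most general classification it can accept, and an object qualifies for every procedure whose test its tags pass.

```python
# Illustrative encoding of "classify by what it can be used for".
# Tag and procedure names are assumptions, not the lab's actual terms.

procedures = {
    "PCR_template": lambda tags: "DNA" in tags,  # any DNA will do
    "PCR_primer": lambda tags: tags.get("DNA") == "oligonucleotide",
    "protein_gel": lambda tags: "Protein" in tags,
    "DNA_gel": lambda tags: "DNA" in tags,
}

# A protein-DNA conjugate carries both keys, so it is offered for
# both the protein gel and the DNA gel.
conjugate = {"Protein": "conjugate", "DNA": "oligonucleotide"}
print([name for name, test in procedures.items() if test(conjugate)])
# -> ['PCR_template', 'PCR_primer', 'protein_gel', 'DNA_gel']
```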
We had, in fact, built our own small-scale laboratory ontology, one that maps onto what we actually do in our laboratory. Little or no design went into this, only thinking about how to make our templates work. What was interesting was the process of then mapping our terms and metadata onto designed vocabularies. The example I used in the talk was the Sequence Ontology terms relating to categories of DNA. We could map the SO term plasmid onto our key-value pair DNA:plasmid, meaning a double-stranded circular DNA capable in principle of transforming bacteria. SO:ss_oligo maps onto DNA:oligonucleotide (kind of – I’ve just noticed that synthetic oligo is another term in SO).
But we ran into problems with our type DNA:double_stranded_linear. In SO there is more than one term that might apply, including restriction fragments and PCR products. This distinction was not useful to us; in fact it would create a problem. For our purposes restriction fragments and PCR products were equivalent in terms of what we could do with them. The distinction SO makes is about where they come from, not what they can do. Our schema is driven by what we can do with them. Where they came from and how they were generated is also implicit in our schema, but it is separated from what an object can be used for.
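The mismatch is easy to see if you write the mapping down. A rough sketch, using the SO term names as I have discussed them here rather than verified entries from the current release:

```python
# Post-hoc mapping from our local key-value pairs to Sequence Ontology
# terms (a sketch; SO names as discussed in the text, not verified).

local_to_so = {
    "DNA:plasmid": ["plasmid"],
    "DNA:oligonucleotide": ["ss_oligo"],  # synthetic oligo may also fit
    # One local term fans out to several SO terms: SO splits linear
    # double-stranded DNA by origin, our schema splits by usability.
    "DNA:double_stranded_linear": ["restriction_fragment", "PCR_product"],
}
```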
There is another distinction here. The drop-down menus in our templates do not have an “or” logic in the current implementation. This drives us to classify the possible uses of objects in as general a way as possible. We might wish to distinguish between “flat-ended” linear double-stranded DNA (most PCR products) and “sticky-ended” or overhanging linear dsDNA (many restriction fragments), but we are currently obliged to have at least one key-value pair that places these together, as many standard procedures can be applied to both. In ontology construction there is a desire to describe as much detail as possible. Our framework drives us towards being as general as possible. Both approaches have their uses, and neither is correct. They are built for different purposes.
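A small sketch (hypothetical tag names again) shows why the single-match filter forces generality: a menu filter is one key-value test, so two subtypes only appear in the same menu if they share a more general pair.

```python
# Why the absence of "or" logic pushes tags toward generality.
# Tag names are hypothetical.

pcr_product = {"DNA": "double_stranded_linear", "end": "blunt"}
restriction_fragment = {"DNA": "double_stranded_linear", "end": "overhang"}

def menu(posts, key, value):
    # A filter is a single match; there is no way to ask for
    # end == "blunt" OR end == "overhang".
    return [p for p in posts if p.get(key) == value]

# A ligation template therefore filters on the shared, general pair.
both = menu([pcr_product, restriction_fragment], "DNA", "double_stranded_linear")
assert len(both) == 2
```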
The bottom line is that for a structured vocabulary to be useful and used, it has to map well onto two things. First, the processes that the user is operating and the inputs and outputs of those processes: that is, it must match the mental model of the user. Secondly, it must map well onto the tools that the user has to work with. Most existing biological ontologies do not map well onto our LaBLog system, although we can usually map to them relatively easily for specific purposes in a post-hoc fashion. However, I think our system is mapped quite well by some upper ontologies.
I’m currently very intrigued by an idea that I heard from Allyson Lister, which matches well with some other work I’ve recently heard about involving “just in time” and “per-use” data integration. It also maps onto the argument I made in my recent paper that we need to separate the issues of capturing research from those involved in describing and communicating research. The idea is that for any given document or piece of work, rather than trying to fit it into a detailed existing ontology, you build a single-use local ontology describing what is happening in this specific case, based on a more general ontology – perhaps OBO, perhaps something even more general. This local description can then be mapped onto more widely used and more detailed ontologies for specific purposes.
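In code terms the idea might look something like this sketch (names and mapping invented for illustration): the record is captured in the local dialect, and a purpose-specific mapping is applied only at the point of use.

```python
# "Capture locally, translate per use" - a sketch under assumed names.

capture = {"DNA": "double_stranded_linear"}  # the local dialect

# Built at translation time for one specific purpose, once we know
# which distinctions the target vocabulary cares about.
to_sequence_ontology = {"DNA:double_stranded_linear": "PCR_product"}

def translate(tags, mapping):
    """Map each local key:value pair onto a target-vocabulary term."""
    return [mapping.get(f"{k}:{v}") for k, v in tags.items()]

print(translate(capture, to_sequence_ontology))  # -> ['PCR_product']
```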
At the end of the day the key is effective communication. We don’t all speak the same language, and we’re not going to. But if we have tools that help us capture our research in an appropriate local dialect, in a way that makes it easy for us, and others, to translate into whatever lingua franca is best for a given purpose, then we will make progress.