
Semantics in the real world? Part II – Probabilistic reasoning on contingent and dynamic vocabularies

8 April 2008

[Image: rendering of a human brain]

And other big words I learnt from mathematicians…

The observant amongst you will have realised that the title of my previous post, pushing a boat out into the area of semantics and RDF, implied there was more to come. Those of you who followed the reaction [comments in original post, 1, 2, 3] will also be aware that there are much smarter and more knowledgeable people out there thinking about these problems. Nonetheless, in the spirit of thinking aloud, I want to explore these ideas a little further because they underpin the way I think about the LaBLog and its organisation. As with the last post, this comes with the health warning that I don’t really know what I’m talking about.

If the problem with RDF is that it requires a controlled vocabulary, then the same problem applies to ontologies. These take years of very careful work to put together, requiring a lot of effort and a lot of people’s time to generate something which is still likely to be very domain specific. Very often what you have to hand is a lot of semi-structured data, or multiple sets of differently structured data that you want to use together.

If I look at one of our LaBLogs, we have quite a lot of fairly structured metadata. But if we want to use all the tools out there that people have developed, we need to convert it to something useful like XML or RDF. To do this we need a vocabulary. Now, we could attempt to adopt something that already exists, but as I have posted, I have concerns about this. My background is (at least partially) in evolution, and the development of our data organisation system has certainly gone through periods of punctuated equilibrium. So I would like to see a more organic approach, where we adapt our controlled vocabulary over time to suit what we need to do with it (as we figure out what that is).
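
To make this concrete, here is a minimal sketch, in Python with the rdflib library, of what recording a post’s metadata under an ad-hoc, locally defined vocabulary might look like. The namespace, URIs, and property names are invented for illustration; this is not the actual LaBLog schema.

```python
# A minimal sketch (invented vocabulary, not the real LaBLog schema):
# metadata as RDF triples under a local namespace that can be remapped later.
from rdflib import Graph, Literal, Namespace, URIRef

LAB = Namespace("http://example.org/lablog/vocab/")  # hypothetical namespace

g = Graph()
g.bind("lab", LAB)

post = URIRef("http://example.org/lablog/posts/42")  # hypothetical post URI
g.add((post, LAB.usedMaterial,
       URIRef("http://example.org/lablog/materials/buffer-a")))
g.add((post, LAB.procedure, Literal("PCR amplification")))
g.add((post, LAB.performedBy, Literal("C. Neylon")))

print(g.serialize(format="turtle"))
```

Because the whole vocabulary lives in one namespace, a later script could rewrite lab:usedMaterial to a community-agreed term without touching the underlying data, which is exactly the kind of organic adaptation I have in mind.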

A good vocabulary, used by lots of people, will be useful. Tags, hashtags, and all their variants work when people agree on them, but they carry very little information with them. I am still using Zemanta to write these posts, and it does seem to be getting better, as I guess more people use it and the backend builds up a better understanding of the relationships between names and items. Controlled vocabularies are great, but in my view they require far too much effort to deploy effectively (maybe this will change as instrument manufacturers in particular come on board).

So what if there were a halfway house? The structure of a properly defined ontology/controlled vocabulary is (I am assuming) well known and machine checkable. Is it possible to mine all of our highly structured, partially structured, and unstructured data and information to put together best-guess ontologies for particular uses? Could those then be deployed in systems, like Zemanta, which offer you a guess as to what you mean? Below my editing screen Zemanta has come up with a range of possible links for this post. I’ve picked six of the seven it offered me. The first time I used it (last week) I picked none.
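
As a toy illustration of what ‘offering a guess’ might mean in practice, a sketch along these lines could rank vocabulary terms by how often users have previously accepted them alongside words in the current document. All the counts and terms below are made up.

```python
# Toy "best guess" suggestion: rank vocabulary terms by how often they were
# previously accepted alongside the words in the current document.
from collections import Counter

# accepted[(context_word, term)] -> times a user accepted that pairing
accepted = Counter({
    ("pcr", "PolymeraseChainReaction"): 12,
    ("pcr", "Plasmid"): 3,
    ("buffer", "Reagent"): 7,
})

def suggest(words, k=3):
    scores = Counter()
    for (context_word, term), n in accepted.items():
        if context_word in words:
            scores[term] += n
    return [term for term, _ in scores.most_common(k)]

print(suggest({"pcr", "buffer"}))
# ['PolymeraseChainReaction', 'Reagent', 'Plasmid']
```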

I’m not suggesting text mining or natural language processing here per se, but rather an attempt to pull information out of structured and partially structured sources, and more specifically out of the links between documents on the web. This could include grabbing identifiers, microformats, RDF, XML, and ‘real’ ontologies where they exist, and attempting to build a best-guess model of how they all relate together.

This could be built as an ecosystem of competing, contingent ontologies, all trying their best to represent specific domains. As people used these offerings and confirmed or denied the relevance of specific links or descriptions, branches would be chopped or would sprout. By checking what a user chose to do when they rejected the offered description, the service could find new options and refine itself. In a sense this would look like the bastard offspring of a spell checker and a Greasemonkey script: offering you possible descriptions of existing pages, and checking to see whether you found them useful; offering you links and ways of categorising your own offerings when you are editing documents, and checking to see whether you used them or chose something else.
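
The accept/reject loop might look something like the following sketch: each candidate description carries a tally of confirmations and rejections, its confidence is a smoothed acceptance rate, and low-confidence branches get chopped. The smoothing and threshold here are arbitrary choices for illustration.

```python
# Sketch of the feedback loop: candidates accumulate accept/reject tallies,
# and branches whose confidence falls below a threshold are pruned.
class Candidate:
    def __init__(self, label):
        self.label = label
        self.accepted = 0
        self.rejected = 0

    def record(self, was_accepted):
        if was_accepted:
            self.accepted += 1
        else:
            self.rejected += 1

    def confidence(self):
        # Laplace-smoothed acceptance rate: new candidates start at 0.5
        return (self.accepted + 1) / (self.accepted + self.rejected + 2)

def prune(candidates, threshold=0.2):
    return [c for c in candidates if c.confidence() >= threshold]

c = Candidate("PolymeraseChainReaction")
c.record(True); c.record(True); c.record(False)
print(c.confidence())  # 0.6
```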

The underlying system to run this would involve hideous storage and processing issues. I am essentially proposing to remap the whole web as RDF, with the predicates associated with every link, whether authored or inferred, being held in a database of their own (something for the new Google offering to chew on perhaps).
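
A crude sketch of what each record in such a store might carry (the field names are invented for illustration); the point is that every statement drags provenance and confidence along with it:

```python
# Every link becomes a statement annotated with how it was asserted and how
# much we trust it; field names here are invented for illustration.
from dataclasses import dataclass

@dataclass
class Statement:
    subject: str
    predicate: str
    obj: str
    source: str       # "authored" (explicit markup) or "inferred" (guessed)
    confidence: float

store = [
    Statement("http://a.example/post1", "lab:usedMaterial",
              "http://a.example/buffer-a", "authored", 0.95),
    Statement("http://a.example/post1", "lab:relatedTo",
              "http://b.example/paper7", "inferred", 0.40),
]

# Even the crudest query has to decide what to do with confidence.
likely = [s for s in store if s.confidence > 0.5]
```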

Another problem is that such ontologies would necessarily be probabilistic rather than absolute, and machine reasoning approaches over them would have to take account of this. Again, though, this more closely mirrors our own view of the world. Might it be a more natural way of interacting with these systems?
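
For instance, under the (large) simplifying assumption that confidences behave like independent probabilities, chaining two uncertain statements gives a conclusion weaker than either premise, something absolute reasoners never have to worry about:

```python
# Chaining uncertain statements: assuming independence (a big simplification),
# the derived confidence is the product of the premises' confidences.
def chain(conf_ab, conf_bc):
    return conf_ab * conf_bc

# "post1 describes sample X" (0.9) and "sample X is an enzyme" (0.8)
print(chain(0.9, 0.8))  # 0.72, weaker than either premise
```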

We keep talking about data models as though they are models of our data. But they’re not, they’re models of the way we think about our data. That’s a very different thing.

Image is from Wikipedia via Zemanta


3 Comments »

  • Jean-Claude Bradley said:

    I applaud your philosophy of trying to get things going even if not everything is in place – things never seem to get completely in place anyway :)
    In terms of the difficulty of getting lab notebooks represented in a semantically rich format, some things are harder than others. The way I look at it, the easiest part of this is recording the log portion of the experiment relating to what was done. We’re simplifying the problem further by limiting our reports to the particular experimental design we’re using now, with vortexing vials and filtering products.
    The bottleneck in all this is getting students to carefully convert their freeform writing into machine-readable text. I’m not that concerned about the format we’re using because we can always translate it to anything else quickly using a script.

    Recording the interpretation of the experimental results is a much more challenging issue and I’m going to wait on that one until we do the low level recording properly first.

  • Anna said:

    I was just presented this morning with a far more complex challenge. ‘Opennotebook. Sounds great. Now do it in Welsh.’

    We have had a small discussion about meta-data translation and term linking to Y Termiadur. It might work, so we’ll have to see whether it’s practical or not. But I want to get the thing working and loaded first *sigh*

  • Cameron Neylon said:

    @ Anna: ummmmm, not something I had thought about, I have to admit.

    @ Jean-Claude: absolutely, and this ties into the conversation with Frank Gibson obviously. At some level the vocabulary is unimportant because you can always translate it. It’s the structure of the vocabulary that’s crucial. But without nice, easy systems to capture what people are doing (and asking people to report on anything twice is not going to be popular) we’re not going to get anywhere.

    At the end of the day as long as we actually _are_ capturing something we’re doing better than we were in many ways. And I am very much in favour of systems that can adapt or be adapted as you go.
