Semantics in the real world? Part II – Probabilistic reasoning on contingent and dynamic vocabularies
The observant amongst you will have realised that the title of my previous post pushing a boat out into the area of semantics and RDF implied there was more to come. Those of you who followed the reaction [comments in original post, 1, 2, 3] will also be aware that there are much smarter and more knowledgeable people out there thinking about these problems. Nonetheless, in the spirit of thinking aloud I want to explore these ideas a little further because they underpin the way I think about the LaBLog and its organization. As with the last post this comes with the health warning that I don’t really know what I’m talking about.
If the problem with RDF is that it requires a controlled vocabulary, then the same problem applies to ontologies. These take years of very careful work to put together, requiring a lot of people's time and effort to generate something that is still likely to be very domain specific. Very often what you have to hand is a lot of semi-structured data, or multiple sets of differently structured data that you want to use together.
If I look at one of our LaBLogs we have quite a lot of fairly structured metadata. But if we want to use all the tools out there that people have developed, we need to convert it to something useful like XML or RDF, and to do that we need a vocabulary. Now we could attempt to adopt something that already exists, but as I posted previously I have concerns about this. My background is (at least partially) in evolution, and the development of our data organisation system has certainly gone through periods of punctuated equilibrium, so I would like to see a more organic approach in which we adapt our controlled vocabulary over time to suit what we need to do with it (as we figure out what that is).
A good vocabulary, used by lots of people, will be useful. Tags, hashtags, and all their variants work when people agree on them, but they carry very little information with them. I am still using Zemanta to write these posts, and it does seem to be getting better, as I guess more people use it and the backend builds up a better understanding of the relationships between names and items. Controlled vocabularies are great, but in my view they require far too much effort to deploy effectively (maybe this will change as instrument manufacturers in particular come on board).
So what if there were a halfway house? The structure of a properly defined ontology or controlled vocabulary is (I am assuming) well known and machine checkable. Is it possible to mine all of our highly structured, partially structured, and unstructured data and information to put together best-guess ontologies for particular uses? Could those then be deployed in systems, like Zemanta, which offer you a guess as to what you mean? Below my editing screen Zemanta has come up with a range of possible links for this post. I've picked six of the seven it offered me. The first time I used it (last week) I picked none.
I'm not suggesting text mining or natural language processing here per se, but rather an attempt to pull out structured and partially structured information, and more specifically the links between documents on the web. This could include grabbing identifiers, microformats, RDF, XML, and 'real' ontologies where they exist, and attempting to build a best-guess model of how they all relate together.
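To make this concrete, here is a minimal sketch of what harvesting best-guess triples from link markup might look like. It only looks at `<a rel="...">` hints in HTML, using Python's standard library; the page URL, the fallback `references` predicate, and the markup are all illustrative assumptions, not a real vocabulary-mining service.

```python
from html.parser import HTMLParser

class HintHarvester(HTMLParser):
    """Collect (subject, predicate, object) guesses from a page's links."""
    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.triples = []  # best-guess (subject, predicate, object) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            # rel="..." is the author's own hint about what the link means;
            # fall back to a generic 'references' guess when it is absent.
            predicate = attrs.get("rel", "references")
            self.triples.append((self.page_url, predicate, attrs["href"]))

page = ('<p>See <a rel="license" href="http://example.org/cc-by">the license</a> '
        'and <a href="http://example.org/data">the data</a>.</p>')
h = HintHarvester("http://example.org/post")
h.feed(page)
for t in h.triples:
    print(t)
```

A real harvester would also pull RDFa properties, microformat classes, and embedded RDF where they exist, but the shape of the output is the same: candidate triples of varying trustworthiness, not authoritative statements.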
This could be built as an ecosystem of competing, contingent ontologies, each trying its best to represent a specific domain. As people used these offerings and confirmed or denied the relevance of specific links or descriptions, branches would be chopped or would sprout. By checking what a user chose to do when they rejected the offered description, the service could find new options and refine itself. In a sense this would look like the bastard offspring of a spell checker and a Greasemonkey script: offering you possible descriptions of existing pages, and checking to see whether you found them useful; offering you links and ways of categorising your own offerings when you are editing documents, and checking to see whether you used them or chose something else.
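The accept/reject loop above can be sketched as a toy class, assuming a suggestion is just a predicate string with an accumulated weight. The class name, the weight increments, and the pruning threshold are all hypothetical choices for illustration.

```python
class ContingentVocabulary:
    """A vocabulary whose terms grow or get pruned by user feedback."""
    def __init__(self):
        self.weights = {}  # predicate -> accumulated evidence

    def suggest(self, n=3):
        # Offer the current best guesses, like a spell checker's top matches.
        ranked = sorted(self.weights, key=self.weights.get, reverse=True)
        return ranked[:n]

    def feedback(self, predicate, accepted):
        # Accepting a suggestion grows that branch; rejecting shrinks it.
        delta = 1.0 if accepted else -0.5
        self.weights[predicate] = self.weights.get(predicate, 0.0) + delta
        if self.weights[predicate] <= 0:
            del self.weights[predicate]  # branch chopped

vocab = ContingentVocabulary()
for p in ["cites", "cites", "describes", "refutes"]:
    vocab.feedback(p, accepted=True)
vocab.feedback("refutes", accepted=False)
print(vocab.suggest())  # ['cites', 'describes', 'refutes']
```

The point is not the arithmetic but the shape of the interaction: the system never claims to be right, it only ranks its guesses, and every acceptance or rejection is itself new data.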
The underlying system to run this would involve hideous storage and processing issues. I am essentially proposing to remap the whole web as RDF, but with the predicate associated with every link, whether authored or inferred, held in its own database (something for the new Google offering to chew on perhaps).
Another problem is that such ontologies would necessarily be probabilistic rather than absolute, and machine reasoning over them would have to take account of this. Then again, this more closely mirrors our own view of the world. Might it be a more natural way of interacting with these systems?
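What would reasoning over probabilistic rather than absolute assertions look like? A minimal sketch, assuming each triple carries a confidence and that confidences along an inference chain simply multiply (a naive independence assumption, purely for illustration; the sample data is invented):

```python
# Each assertion is (subject, predicate, object, confidence), not a bare fact.
triples = [
    ("sampleA", "produced_by", "experiment1", 0.9),
    ("experiment1", "described_in", "post42", 0.7),
]

def chain(triples, start, path):
    """Follow a sequence of predicates from `start`, accumulating confidence."""
    node, conf = start, 1.0
    for wanted in path:
        step = next((t for t in triples if t[0] == node and t[1] == wanted), None)
        if step is None:
            return None, 0.0  # no supporting assertion: the inference fails
        node, conf = step[2], conf * step[3]
    return node, conf

target, confidence = chain(triples, "sampleA", ["produced_by", "described_in"])
print(target, round(confidence, 2))  # post42 0.63
```

The answer comes back hedged, just as a person would hedge it: "sampleA is probably described in post42, but I'm only 63% sure." A reasoner built this way degrades gracefully instead of refusing to answer when the vocabulary is imperfect.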
We keep talking about data models as though they are models of our data. But they’re not, they’re models of the way we think about our data. That’s a very different thing.
Image is from Wikipedia via Zemanta