I’ve been mulling over this for a while, and seeing as I am home sick (can’t you tell from the rush of posts?) I’m going to give it a go. This definitely comes with a health warning as it goes way beyond what I know much about at any technical level. This is therefore handwaving of the highest order. But I haven’t come across anyone else floating the same ideas so I will have a shot at explaining my thoughts.
The Semantic Web, RDF, and XML are all the product of computer scientists thinking about computers and information. You can tell this because they deal with straightforward declarations that are absolute: X has property Y. Putting aside all the issues with the availability of tools and applications, and the fact that triple stores don’t scale well, the central issue with applying these types of strategy to the real world is not technical at all: absolutes don’t exist. I may assert that X has property Y, but what happens when I change my mind, or when I realise I made a mistake, or when I find out that the underlying data wasn’t taken properly? How do we get this to work in the real world?
I am going to try and frame this around RDF but I think the issues are fairly generalisable. The concept of the triple is a natural one that fits with our understanding of language and is comprehensible to most people. Within a defined system it is possible to make simple assertions. However, in the real world it becomes obvious that authority is an issue. Who are you to make an assertion?
The triple to the quad
I think the following is reasonably well accepted and uncontentious. That is: provenance matters. Any assertion is of limited value if it isn’t clear who has made it. The concept of the quad (subject, predicate, object: provenance) is, I think, reasonably well established. I first saw it in a talk given by Paul Miller at the Unilever Centre in Cambridge (via Peter Murray-Rust’s blog). If this is to work we need absolute control over identity and respected, usable clearing houses for identity management. It is to be hoped that a combination of FOAF and OpenID will make some of this possible. We already have good systems for rating the value of different people’s contributions. Expanding that is reasonably clear.
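As a rough illustration, a provenance-carrying quad might look something like this in the TriG named-graph notation (all the prefixes, the graph name, and the terms here are invented for the example, not anything anyone has deployed as far as I know):

@prefix : <http://example.org/terms#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

# all the assertions I make go into a graph named for me
:myAssertions {
    :proteinA :interactsWith :proteinB .
}

# provenance is then attached to the graph itself
:myAssertions dc:creator :me ;
    dc:date "2008-04-01" .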
The quad to the quint: confidence levels
Provenance isn’t the killer though and the issues associated with it are being worked through. The big problem is that you cannot make absolute assertions. I am very confident that the sun will rise tomorrow, but there are plausible, if hopefully very unlikely, circumstances through which it might not. Closer to home I can’t think of any set of experimental data that has ever been entirely consistent. The skill in science lies in figuring out why the inconsistencies are there. More generally it is fundamental to scientific practice that any assertion is only provisional.
We therefore require a way of attaching confidence levels to assertions. This doesn’t fit well within an RDF framework, as I understand it, because you are applying an operator to the predicate. It is straightforward to make assertions about your confidence in an object or subject, to make a comment about a node on the graph (via a predicate). However, what we need here is to apply a weight to an edge. The visual analogy is a good one: we have gone from a graph with edges to a graph in which the various edges (predicates) connecting objects (nodes) have different strengths. Indeed, if we implement the provenance ideas we have a set of graphs with perhaps very different weights depending on who we do, or don’t, trust.
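One partial workaround within existing RDF, as I understand it, is reification: the statement itself becomes a resource that you can then hang a weight on. A rough sketch, again with invented vocabulary:

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <http://example.org/terms#> .

# the edge, not the nodes, is what carries the weight here
:assertion42 a rdf:Statement ;
    rdf:subject :proteinA ;
    rdf:predicate :interactsWith ;
    rdf:object :proteinB ;
    :assertedBy :me ;
    :confidence "0.7" .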
None of this is fatal, although it complicates things a great deal. In principle (again based on my somewhat vague understanding) there is a whole branch of mathematics based on Bayesian approaches that deals with incomplete and even contradictory data by building the most likely structure and placing bounds on the confidence or probability of specific states within a general model. Indeed this actually mirrors the practice of science more closely than the notion of perfectly logical machine reasoning on a perfect abstracted data set. We muddle through a set of messy data with the aim of trying to come up with the best model that fits the most data (conflicting protein-protein association data sets anyone?). Using a Bayesian approach to generate synthetic assertions with precise confidence intervals would be a huge step forward. I shudder to imagine the computational costs but hey, that’s someone else’s problem.
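To make that concrete with a toy example (and a crude independence assumption): if two unrelated methods each support the same interaction with confidences of 0.8 and 0.7, and we start from an agnostic prior of 0.5, a naive Bayesian combination gives 0.8 × 0.7 / (0.8 × 0.7 + 0.2 × 0.3) ≈ 0.90, and a trusted third source disagreeing would pull that straight back down. The particular numbers don’t matter; the point is that the machinery for propagating this sort of thing already exists.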
The quint to the sext and on: what does my assertion rely on, and is it out of date?
Another aspect of how we practice science isn’t captured well by the triple: what is the basis on which I am making an assertion? What is the underlying data? Again this is a characteristic of the predicate rather than of an object, so it is not straightforward to capture. If we imagine our perfectly honed scientific machine (with added probabilistic reasoning) then we would want it automatically to reassess assertions, either their confidence or their correctness, if the body of data they rely on is found to be flawed in some way (e.g. if the confidence level drops because a trusted authority disagrees).
It may be possible to capture the reasoning behind an assertion by creating a synthetic predicate: the properties of X, Y, and Z, when processed this way, give me answer A, which justifies my assertion. But in the real world you can only do this after the event, when you’ve winnowed out the irrelevant or misleading data. The natural language statement is ‘X has relationship A with Y because this set of other data suggests it’. This may in fact be very difficult to capture in any reasonably practical framework. Triples are good at capturing relationships between simple objects and simple concepts. Whether they can cope with the somewhat woolly objects and concepts that make up scientific models is another question.
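To give a flavour of what I mean, a very rough sketch (again with invented terms) in which the assertion points back at the data and the method it rests on, so that it can be re-weighed if any of those are later downgraded:

@prefix : <http://example.org/terms#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .

# the assertion carries pointers to the evidence and method behind it
:assertion42 :derivedFrom :massSpecRun17 ;
    :derivedFrom :pullDownExperiment3 ;
    :derivedUsing :someScoringMethod ;
    dc:date "2008-04-01" .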
If it could be achieved this would be tremendously exciting, because it would provide a much more objective way of weighing together different types of data that might be contradictory. It can never be completely objective, because at the end of the day there is an element of trust in the quality of the raw data and its description, but it has the potential to be rigorous and transparent: two things that don’t often feature in clashes over scientific models. If automated reasoning fails whenever it hits a contradiction it will be close to useless in the real world. Being able to handle these problems within the system, allowing different pieces of data to be balanced against one another, would be much better than the ad hoc discarding of specific contradictory elements within a data set.
There is another simple but critical aspect that we need to incorporate into our thinking, and that is a date stamp. A common occurrence in science is that something which is out of date or misleading makes it into a review, and then into a textbook, and from there becomes received wisdom, simply because the secondary and tertiary literature is always 2-10 years out of date. If we can connect the literature up, ideally ultimately back down to the primary data, then it is possible to imagine a future textbook system in which the correction or updating of a concept in the primary data sweeps through, updating the textbook (and possibly, one might hope, even the lectures) on the fly. These connections no doubt already exist in part, but are they being effectively used?
Conclusions
None of this should be read as suggesting that someone ought to implement this. It is more a thought experiment. How do we make the data structure more closely relate to what we really do in practice? Out there in the wet lab world we rarely have anything as absolute as a sequence or a simple ID number. We have messy samples and messy concepts out of which we try to abstract something a little simpler and hopefully a little bit more beautiful. How might we connect the way we work and think into these systems that have the potential to make our life a lot easier?
But the concept of probabilistic or weighted assertions, and in Part II probabilistic ontologies and vocabularies, is something that I think bears thinking about. My life doesn’t fit inside a triple store, at least not one that doesn’t contain multiple direct contradictions. How can we go about turning that into something useful?
Thanks, but I’ll just stick with triples and let the market prevail as to provenance and confidence.
There is an easy solution to all these problems. You just replace the object of the triplet by another record that also contains date, author, and whatever additional information you want. For example, instead of writing (N3 code):
:proteinA :interactswith :proteinB .
You write:
:proteinA :hasInteraction :interactionAB .
And then you wrap the interaction partner together with the meta-information:
:interactionAB
    :interactionPartner :proteinB ;
    :experiment someplace:massspec01232007x ;
    :experiment elsewhere:pulldownXYZ ;
    :Bayes_confidence "0.87" ;
    :lastUpdate "01.04.2008" .
It’s not as beautiful but it works with the existing RDF. This is how I managed protein interaction data from different in-silico predictions some years ago and it worked like a charm.
(As a technical side note, instead of outsourcing things into a separate “interactionAB” entry, the N3 notation also allows “anonymous” blocks of triplets that are directly embedded into the referring statement.)
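Roughly, that embedded blank-node form would look something like this (using the same invented terms as above):

:proteinA :hasInteraction [
    :interactionPartner :proteinB ;
    :experiment someplace:massspec01232007x ;
    :Bayes_confidence "0.87" ;
    :lastUpdate "01.04.2008"
] .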
This is an issue I’ve been working through for some time. I still don’t have an answer other than to say that the scenarios you describe don’t necessarily want to be attached to the triple.
Others made useful comments to my post here: http://www.dynamicorange.com/blog/archives/semantic-web/reification_tri.html
Some of the things you want to do can be addressed using named graphs (provenance being one) and others by reification (confidence, for example).
rob