Connecting the dots – the well posed question and code as a liability

Just a brief thought prompted by two, partly related, things streaming past my nose. Firstly Michael Nielsen discussed the views of Aristotle and Sunstein on collective intelligence. The thing that caught my attention was the idea that deliberation can make group functioning worse, leading to a collective decision that is muddled rather than actually identifying the best answer presented by members of the community. The exception to this is well posed questions, where deliberation can help. In science we are familiar with the idea that getting the question right (correct design of experiment, well organised theory) can be more important than the answer.

The second item was a blog post entitled “Data is good, code is a liability” from Greg Linden that was shared by Deepak Singh. Greg discussed a talk given by Peter Norvig, which focusses on the idea that it is better to take a good-sized dataset and use very sparing code to get at an answer than to attempt to get at the answer de novo via complex code. Quoting from the post:

In one of several examples, Peter put up a slide showing an excerpt for a rule-based spelling corrector. The snippet of code, that was just part of a much larger program, contained a nearly impossible to understand let alone verify set of case and if statements that represented rules for spelling correction in English. He then put up a slide containing a few line Python program for a statistical spelling correction program that, given a large data file of documents, learns the likelihood of seeing words and corrects misspellings to their most likely alternative. This version, he said, not only has the benefit of being simple, but also easily can be used in different languages.
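To make the contrast concrete, here is a minimal sketch of what a statistical corrector of this kind might look like. It is not the code from Peter's slide: the function names and the tiny inline corpus are my own assumptions, and a real version would count words from a large file of documents rather than two sentences.

```python
import re
from collections import Counter

# Stand-in corpus (an assumption for illustration only); a real run would
# read a large text file so that the word counts mean something.
CORPUS = """the quick brown fox jumps over the lazy dog
the spelling of a word is corrected to the most likely known word"""

# The "model" is nothing more than how often each word was seen.
WORDS = Counter(re.findall(r"[a-z]+", CORPUS.lower()))
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return every string one edit (delete, swap, replace, insert) away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [left + right[1:] for left, right in splits if right]
    swaps = [left + right[1] + right[0] + right[2:]
             for left, right in splits if len(right) > 1]
    replaces = [left + c + right[1:]
                for left, right in splits if right for c in LETTERS]
    inserts = [left + c + right for left, right in splits for c in LETTERS]
    return set(deletes + swaps + replaces + inserts)


def known(candidates):
    """Keep only the candidates that were seen in the corpus."""
    return {w for w in candidates if w in WORDS}


def correct(word):
    """Pick the most frequent known word within one edit, else give up."""
    candidates = known([word]) or known(edits1(word)) or {word}
    return max(candidates, key=lambda w: WORDS[w])
```

With this toy corpus, `correct("speling")` and `correct("teh")` resolve to words the corpus contains, and an unknown string with no near neighbours is returned unchanged. Nothing in the code knows any English; switching language means switching the data.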

What struck me was the connection between being able to write a short, readable snippet of code, and the “well posed question”. The dataset provides the collective intelligence. So is it possible to propose the following?

“A well posed question is one which, given an appropriate dataset, can be answered by easily prepared and comprehensible code”

This could also possibly be turned on its head as “a good programming environment is one in which well posed questions can be readily converted to programs”. But it also raises an important point about how the structure of datasets relates to the questions you want to ask. The challenge in recording data is to structure it in such a way that the widest possible set of questions can be asked of that data. Data models all pre-suppose the kind of questions that will be asked. And any sufficiently general data model will be inefficient for most specific types of query.

Rajarshi Guha and Pierre Lindenbaum have been busy preparing different datastores for the solubility data being generated as part of the Open Notebook Science Challenge announced by Jean-Claude Bradley (more on this later). Rajarshi’s form-based input has an SQL backend while Pierre has been working to extract the information as RDF. The point is not that one approach is better than the other, but that we need both, and possibly many more formats, and ideally we need to interconvert between them on the fly. A well posed question can easily founder on an inappropriately structured dataset (this is actually just a rephrasing of the Saunders Principle). It will be by enabling easy conversion between different formats that we might approach a situation where the aphorism I have suggested could become true.
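As a rough illustration of what converting on the fly could mean, the sketch below reads rows from an SQL table and emits them as N-Triples-style statements. Everything specific in it is an assumption of mine: the table layout, the `example.org` namespace, and the placeholder rows are hypothetical and do not reflect Rajarshi's actual schema, Pierre's RDF vocabulary, or any real measurements.

```python
import sqlite3

# Hypothetical namespace, used only for this illustration.
BASE = "http://example.org/solubility/"
XSD_DOUBLE = "http://www.w3.org/2001/XMLSchema#double"


def build_store():
    """Create an in-memory SQL store with a hypothetical table layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE measurement ("
        "id INTEGER PRIMARY KEY, solute TEXT, solvent TEXT, conc_molar REAL)"
    )
    # Placeholder rows, not real data.
    conn.executemany(
        "INSERT INTO measurement (solute, solvent, conc_molar) VALUES (?, ?, ?)",
        [("solute_a", "solvent_x", 0.5), ("solute_b", "solvent_x", 1.25)],
    )
    return conn


def rows_to_triples(conn):
    """Turn each row into three subject-predicate-object statements."""
    cursor = conn.execute(
        "SELECT id, solute, solvent, conc_molar FROM measurement ORDER BY id"
    )
    triples = []
    for row_id, solute, solvent, conc in cursor:
        subject = f"<{BASE}measurement/{row_id}>"
        triples.append(f'{subject} <{BASE}solute> "{solute}" .')
        triples.append(f'{subject} <{BASE}solvent> "{solvent}" .')
        triples.append(
            f'{subject} <{BASE}concentrationMolar> "{conc}"^^<{XSD_DOUBLE}> .'
        )
    return triples


if __name__ == "__main__":
    for line in rows_to_triples(build_store()):
        print(line)
```

The conversion itself is short because the question ("what does each row say?") is well posed against the table. The harder part, which this sketch sidesteps, is agreeing on the vocabulary that the predicates should come from.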