So in the last post I got all abstract about what a record of process might require and what it might look like. In this post I want to describe a concrete implementation that could be built with existing tools, and to think about the architecture required to capture all of this information.
The example I am going to use is very simple. We will take some data and do a curve fit to it. We start with a data file, which we assume we can reference with a URI, and load it up into our package. That’s all, keep it simple. What I hope to start working on in Part III is to build a toy package that would do that and maybe fit some data to a model. I am going to assume that we are using some sort of package that utilizes a command line because that is the most natural way of thinking about generating a log file, but there is no reason why a similar framework can’t be applied to something using a GUI.
Our first job is to get our data. This data will naturally be available via a URI, properly annotated and described. In loading the data we will declare it to be of a specific type, in this case something that can be represented as two columns of numbers. So we have created an internal object that contains our data. Assuming we are running some sort of automatic logging program on our command line, our log file will now look something like:
> start data_analysis
...loading data_analysis
...Version 0.1
...Package at: http://mycodeversioningsystem.org/myuserid/data_analysis/0.1
...Date is: 01/01/01
...Local environment is: Mac OS 10.5
...Machine: localhost
...Directory: /usr/myuserid/Documents/Data/some_experiment
> data = load_data(URI)
...connecting to URI
...found data
...created two column data object "data"
...pushed "data" to http://myrepository.org/myuserid/new_object_id
..."data" aliased to http://myrepository.org/myuserid/new_object_id
Those last couple of lines are important because we want all of our intermediates to be accessible via a URI on the open web. The load_data routine therefore includes pushing the newly created object, in some usable form, to an external repository. Existing services that could provide this functionality include a blog or wiki with an API, a code repository like GitHub, Google Code, or SourceForge, an institutional or disciplinary repository, or a service like MyExperiment.org. The key thing is that the repository must then expose the data set in a form that can be readily extracted by the data analysis tool being used. The tool then works with that publicly exposed form (or an internal representation of the same object for offline work).
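To make that concrete, here is a rough Python sketch of what such a load_data routine might look like. The log format follows the example above; the push_to_repository helper and the repository URI it returns are placeholders of mine, standing in for whatever API the chosen repository actually exposes:

import urllib.request

LOGFILE = open("data_analysis.log", "a")

def log(message):
    # Append a line to the human-readable log file.
    LOGFILE.write("..." + message + "\n")
    LOGFILE.flush()

def push_to_repository(obj):
    # Placeholder: serialise the object and push it to an external
    # repository via its API, returning the new public URI. Here it
    # just returns a fixed URI so the sketch runs offline.
    return "http://myrepository.org/myuserid/new_object_id"

def load_data(uri):
    # Fetch a two-column data set, log each step, and push the
    # resulting object out to a public repository.
    log("connecting to " + uri)
    raw = urllib.request.urlopen(uri).read().decode("utf-8")
    log("found data")
    data = [tuple(float(v) for v in line.split())
            for line in raw.splitlines() if line.strip()]
    log('created two column data object "data"')
    public_uri = push_to_repository(data)
    log('pushed "data" to ' + public_uri)
    log('"data" aliased to ' + public_uri)
    return data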
At the same time a script file is being created which, if run within the correct version of data_analysis, should generate the same results.
# Script Record
# Package: http://mycodeversioningsystem.org/myuserid/data_analysis/0.1
# User: myuserid
# Date: 01/01/01
# System: Mac OS 10.5
data = load_data(URI)
The script might well include some system scripting that checks whether the correct environment (e.g. Python) for the tool is available, and that downloads and starts up the tool itself if the script is executed directly from a GUI or command line environment. The script does not care what the new URI created for the data object was, because when it is re-run it will create a new one. The script should run independently of any previous execution of the same workflow.
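The script record itself could be written by a couple of simple hooks that the tool calls as it runs. A minimal sketch (the file name and function names here are mine, not part of any existing package):

import datetime
import platform

SCRIPT_FILE = "data_analysis_session.py"

def start_script(package_uri, user):
    # Write the header comments of the script record.
    with open(SCRIPT_FILE, "w") as f:
        f.write("# Script Record\n")
        f.write("# Package: %s\n" % package_uri)
        f.write("# User: %s\n" % user)
        f.write("# Date: %s\n" % datetime.date.today().isoformat())
        f.write("# System: %s %s\n" % (platform.system(), platform.release()))

def record_command(command_text):
    # Append the command just executed so that re-running the file
    # replays the whole workflow from scratch.
    with open(SCRIPT_FILE, "a") as f:
        f.write(command_text + "\n")

# For example, after the user runs load_data:
# record_command('data = load_data(URI)')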
Finally there is the graph. What we have done so far is to take one data object and convert it to a new object which is a version of the original. That is then placed online to generate an accessible URI. We want our graph to assert that http://myrepository.org/myuserid/new_object_id is a version of URI (excuse my probably malformed RDF).
<data_analysis:data_object
    rdf:about="http://myrepository.org/myuserid/new_object_id">
  <data_analysis:data_type>two_column_data</data_analysis:data_type>
  <data_analysis:generated>
    <data_analysis:generated_from rdf:resource="URI"/>
    <data_analysis:generated_by_command>load_data</data_analysis:generated_by_command>
    <data_analysis:generated_by_version rdf:resource="http://mycodeversioningsystem.org/myuserid/data_analysis/0.1"/>
    <data_analysis:generated_in_system>Mac OS 10.5</data_analysis:generated_in_system>
    <data_analysis:generated_by rdf:resource="http://myuserid.name"/>
    <data_analysis:generated_on_date dc:date="01/01/01"/>
  </data_analysis:generated>
</data_analysis:data_object>
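For what it's worth, here is a rough sketch of how that graph could be emitted programmatically, using rdflib. The namespace URI, the placeholder source URI, and the property names are assumptions of mine that simply mirror the toy vocabulary above:

from rdflib import BNode, Graph, Literal, Namespace, URIRef

# Placeholder namespace for the toy data_analysis vocabulary.
DA = Namespace("http://mycodeversioningsystem.org/ns/data_analysis#")

def describe_step(graph, new_object_uri, source_uri, command,
                  package_uri, system, user_uri, date):
    # Assert that new_object_uri was generated from source_uri by the
    # given command, and add that assertion to the provenance graph.
    obj = URIRef(new_object_uri)
    generation = BNode()  # groups the provenance details for this step
    graph.add((obj, DA.data_type, Literal("two_column_data")))
    graph.add((obj, DA.generated, generation))
    graph.add((generation, DA.generated_from, URIRef(source_uri)))
    graph.add((generation, DA.generated_by_command, Literal(command)))
    graph.add((generation, DA.generated_by_version, URIRef(package_uri)))
    graph.add((generation, DA.generated_in_system, Literal(system)))
    graph.add((generation, DA.generated_by, URIRef(user_uri)))
    graph.add((generation, DA.generated_on_date, Literal(date)))

g = Graph()
g.bind("data_analysis", DA)
describe_step(g,
              "http://myrepository.org/myuserid/new_object_id",
              "http://example.org/original_data",  # stands in for URI
              "load_data",
              "http://mycodeversioningsystem.org/myuserid/data_analysis/0.1",
              "Mac OS 10.5",
              "http://myuserid.name",
              "01/01/01")
print(g.serialize(format="xml"))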
Now this is obviously a toy example. It is relatively trivial to set up the data analysis package so as to write out these three different types of descriptive files. Each time a step is taken, that step is described and appended to each of the three descriptions. Things will get more complicated if a process requires multiple inputs or generates multiple outputs, but this is really only a question of setting up a vocabulary that makes reasonable sense. In principle multiple steps can be collapsed by combining a script file and the RDF as follows:
<data_analysis:generated_by_command
rdf:resource="http://myrepository/myuserid/location_of_script"/>
I don’t know much about theoretical computer science, but it seems to me that any data analysis package that works through a stepwise process of running previously defined commands could be described in this way. And given that this is how computer programs run, it suggests that any data analysis process can be logged this way. The tool obviously has to be implemented so that it writes out the files, but in many cases this may not even be too hard; building it in at the beginning is obviously better. The hard part is building vocabularies that make sense locally and are specific enough, yet are appropriately wired into wider and more general vocabularies. It is clear that the reference to data_analysis:data_type = “two_column_data” above should probably point to some external vocabulary that describes generic data formats and their representations (in this case probably a Python pickled two-column array). It is less obvious where that should live, or whether something appropriate already exists.
This then provides a clear set of descriptive files that can be used to characterise a data analysis process. The log file provides a record of exactly what happened; it is reasonably human readable and can be hacked with regular expressions if desired. There is no reason in principle why this couldn’t take the form of an XML file with a style sheet for human readability. The script file provides the concept of what happened as well as the instructions for repeating the process. It could usefully be compared to a plan, which would look very similar but might have informative differences. The graph is a record of the relationships between the objects that were generated. It is machine readable and can additionally be used to automate the reproduction of the process, but first and foremost it is a record of what happened.
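As a trivial example of that sort of hacking, a couple of lines of Python could pull every repository URI out of a log in the format sketched above:

import re

with open("data_analysis.log") as f:
    log_text = f.read()

# Every URI that an intermediate object was pushed or aliased to.
object_uris = re.findall(r'(?:pushed ".*?" to|aliased to) (http://\S+)', log_text)
print(object_uris)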
The graph is immensely powerful because it can ultimately be used to parse across multiple sets of data generated by different versions of the package, or even by completely different packages used by different people (provided the vocabularies have some common reference). It enables the comparison of analyses carried out in parallel by different people.
But what is most powerful about the idea of an RDF-based graph file of the process is that it can be automated and completely hidden from the user. The file may be presented to the user in some pleasant and readable form, but they need never know they are generating RDF. The process of wiring up the data web, and the following process of wiring up the web of things in experimental science, will rest on having the connections captured from, not created by, the user. This approach seems to provide a way towards making that happen.
What does this tell us about what a data analysis tool should look like? Well, ideally it will be open source, but at a minimum there must be a set of versions that can be referenced. Ideally these versions would be available on an appropriate code repository configured to enable automatic download. The tool must provide, at a minimum, a log file, and preferably both script and graph versions of this log (in principle the script can be derived from either of the other two, and the log and graph can be derived from each other, but the log and graph can't be derived from the script; see the sketch below). The local vocabulary must be available online and should preferably be well wired into the wider data web. The majority of this should be trivial to implement for most command-line-driven tools, and not terribly difficult for GUI-driven tools. The more complicated aspects lie in pushing the intermediate objects and the finalized logs out to appropriate online repositories.
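To illustrate the claim that the script can be recovered from the graph, here is a rough sketch, assuming the toy vocabulary above and a single input per step: walk the generated_from links backwards from the final object and emit the recorded commands in order.

from rdflib import Namespace, URIRef

DA = Namespace("http://mycodeversioningsystem.org/ns/data_analysis#")  # placeholder vocabulary

def script_from_graph(graph, final_object_uri):
    # Reconstruct an ordered command list by following the chain of
    # generated_from links backwards from the final object.
    commands = []
    current = URIRef(final_object_uri)
    while True:
        generation = graph.value(subject=current, predicate=DA.generated)
        if generation is None:
            break
        commands.append(str(graph.value(subject=generation,
                                        predicate=DA.generated_by_command)))
        current = graph.value(subject=generation, predicate=DA.generated_from)
    return list(reversed(commands))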
A range of currently available services could play these roles, from code repositories such as SourceForge and GitHub, through to the Internet Archive, to data and process repositories such as MyExperiment.org and the Talis Connected Commons, or to locally provided repositories. Many of these have sensible APIs and/or REST interfaces that should make this relatively easy. For new analysis tools this shouldn’t be terribly difficult to implement; implementing it in existing tools could be more challenging, but not impossible. It’s a question of will rather than severe technical barriers as far as I can see. I am going to start trying to implement some of this in a toy data-fitting package in Python, which will be hosted on GitHub, as soon as I get some specific feedback on just how bad that RDF is…