Building the perfect data repository…or the one that might get used

[Image: A UV-vis readout of bis(triphenylphosphine) ni..., via Wikipedia]

While there has been a lot of talk about data repositories and data publication, there remains a real lack of good tools that are truly attractive to research scientists and that also provide a route to more general and effective data sharing. Peter Murray-Rust has recently discussed the deficiencies of the traditional institutional repository as a research data repository in some depth [1, 2, 3, 4].

Data publication approaches and tools are appearing, including Dryad, Figshare, BuzzData, and more traditional venues such as GigaScience from BioMed Central, but these are all formal mechanisms that involve significant additional work alongside an active decision to “publish the data”. The research repository of choice remains a haphazard file store and the data sharing mechanism of choice remains email. How do we bridge this gap?

One of the problems with many efforts in this space is how they are conceived and sold to the user. “Making it easy to put your data on the web” and “helping others to find your data” solve problems that most researchers don’t think they have. Most researchers don’t want to share at all, preferring to retain as much of an advantage through secrecy as possible. Those who do see a value in sharing are for the most part highly skeptical that the vast majority of research data can be used outside the lab in which it was generated. The small remainder who see a value in wider research data sharing are painfully aware of how much work it is to make that data useful.

A successful data repository system will start by solving a different problem, a problem that all researchers recognize they have, and will then nudge the users into doing the additional work of recording (or allowing the capture of) the metadata that could make that data useful to other researchers. Finally it will quietly encourage them to make the data accessible to other researchers. Both the nudge and the encouragement will come from offering the user immediate benefits in the form of automated processing, derived data products, or other direct incentives.

But first the problem to solve. The problem that researchers recognize they have, in many cases prompted by funder data sharing plan requirements, is to properly store and back up their data. A further problem most PIs realize they have is getting access to their group’s data in a useful form. So the initial sales pitch for a data repository is going to be local and secure backup and sharing within the group. This has to be dead simple and ideally automated.

Such a repository will capture as much data as possible at source, as it is generated: grabbing the file and storing it with the minimal contextual data of who (which directory it was saved in), when (taken from the initial file datestamp), and what (where it has come from), then backing it up and exposing it to the research group (or subsets of it) via a simple web service. It will almost certainly involve some form of Dropbox-style system which synchronises a user’s own data across their own devices. Here is an immediate benefit: I don’t have to get the pen drive out of my pocket if I’m confident my new data will be on my laptop when I get back to it.

It will allow for simple configuration on each instrument that sets up a target directory and registers a filetype so that the system can recognize what instrument, or computational process, a file came from (the what). The who can be more complex, but can be built from a combination of designated directories (where a user has their own directory of data on a specific instrument), login info, and, where required, low-irritation request or claiming systems. The when is easy. The sell here is in two parts: directory synching across computers means less mucking around with USB keys, and the backup makes everyone feel better, the researcher in the lab, the PI, and the funder.
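To make that concrete, the capture client could be as little as a directory watcher that records the who, when, and what, and posts the file to the group service. A rough sketch (assuming Python and the watchdog library; the directory mapping and repository URL are invented for illustration):

```python
# Sketch of a capture client: watch configured instrument directories, record
# who/when/what, and push the file plus metadata to a hypothetical group web service.
import datetime
import pathlib

import requests
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

# Local configuration that would, in a real system, be fetched from the central repository.
WATCHED = {
    "/instruments/uv-vis/alice": {"instrument": "uv-vis-01", "user": "alice"},
    "/instruments/uv-vis/bob": {"instrument": "uv-vis-01", "user": "bob"},
}
REPOSITORY_URL = "https://datastore.example.org/api/files"  # hypothetical endpoint

class CaptureHandler(FileSystemEventHandler):
    def __init__(self, config):
        self.config = config

    def on_created(self, event):
        if event.is_directory:
            return
        path = pathlib.Path(event.src_path)
        metadata = {
            "who": self.config["user"],          # the who: the designated directory
            "what": self.config["instrument"],   # the what: which instrument it came from
            "when": datetime.datetime.fromtimestamp(path.stat().st_mtime).isoformat(),
            "filename": path.name,
        }
        with path.open("rb") as fh:
            requests.post(REPOSITORY_URL, data=metadata, files={"file": fh})

if __name__ == "__main__":
    observer = Observer()
    for directory, config in WATCHED.items():
        observer.schedule(CaptureHandler(config), directory, recursive=True)
    observer.start()
    observer.join()
```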

So far so good, and there are in fact examples of systems like this that exist in one form or another or are being developed, including DataStage within the DataFlow project from David Shotton’s group at Oxford, the Clarion project (PMR group at Cambridge), and Smart Research Frameworks (which I have an involvement with) led by Jeremy Frey at Southampton. I’m sure there are dozens of other systems or locally developed tools that do similar things, and these are a good starting point.

The question is how you take systems like this and push them to the next level. How do you capture, or encourage the user to provide, enough metadata to actually make the stored data more widely useful? Particularly when they don’t have any real interest in sharing or data publication? I think there is significant potential in offering downstream processing of the data.

If This Then That (IFTTT) is a startup that has got quite a bit of attention over the past few weeks as it has come into public beta. The concept is very simple. For a defined set of services there are specific triggers (posting a tweet, favouriting a YouTube video) that can be used to set off another action at another service (send an email, bookmark the URL of the tweet or the video). What if we could offer data processing steps to the user? If the processing steps happen automatically but require a bit more metadata, will that provide the incentive to get that data in?
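To make the analogy concrete, a repository rule might pair a trigger (a filetype arriving from a known source) with a processing action, something like this toy sketch (the rule format and action names are invented, not IFTTT’s or any real system’s):

```python
# Hypothetical "if this then that" rules for a group data repository:
# each rule pairs a trigger (a filetype arriving from a source) with an action.
RULES = [
    {"if": {"source": "uv-vis-01", "filetype": ".csv"},
     "then": {"action": "calculate_protein_concentration"}},
    {"if": {"source": "plate-reader", "filetype": ".xlsx"},
     "then": {"action": "extract_growth_curves"}},
]

def match_rules(event, rules=RULES):
    """Return the actions triggered by a new-file event."""
    return [r["then"]["action"] for r in rules
            if r["if"]["source"] == event["source"]
            and event["filename"].endswith(r["if"]["filetype"])]

# Example: a new CSV from the UV-Vis instrument queues the concentration calculation.
print(match_rules({"source": "uv-vis-01", "filename": "sample42.csv"}))
```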

This concept may sound a lot like the functionality provided by workflow engines, but there is a difference. Workflow systems are generally very difficult for the general user to set up. This is mostly because they solve a general problem, that of putting any object into any suitable process. IFTTT offers something much simpler, a small set of common actions on common objects, and that solves the 80/20 problem. Workflows are hard because they can do anything with any object. And that flexibility comes at a price, because it is difficult to know whether that CSV file is from a UV-Vis instrument, a small-angle x-ray experiment, or a simulated data set.

But locally, within a research group, there is a more limited set of data objects. With a local (or localized) repository it is possible to imagine plugins that do common single steps on common files. And because the configuration is local there is much less metadata required; in turn, that configuration provides metadata. If a particular filetype from a directory is configured for automated calculation of A280 for protein concentrations then we know that those data files are UV-Vis spectra. What is more, once we know that, we can offer an automated protein concentration calculator. This will only work if the system knows what protein you are measuring, an incentive to identify the sample when you do the measurement.
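The calculation itself is just the Beer-Lambert law; the only extra piece of information the plugin needs is the extinction coefficient of the protein being measured, which is exactly why identifying the sample becomes worth the user’s while. A minimal sketch (assuming the spectrum is a simple two-column CSV of wavelength and absorbance):

```python
import csv

def absorbance_at(spectrum_path, wavelength=280.0):
    """Read a two-column CSV (wavelength, absorbance) and return the absorbance
    closest to the requested wavelength. Non-numeric rows (headers) are skipped."""
    rows = []
    with open(spectrum_path, newline="") as fh:
        for row in csv.reader(fh):
            try:
                rows.append((float(row[0]), float(row[1])))
            except (ValueError, IndexError):
                continue
    return min(rows, key=lambda row: abs(row[0] - wavelength))[1]

def protein_concentration(a280, extinction_coeff, path_length_cm=1.0):
    """Beer-Lambert: concentration = A / (epsilon * l).

    With a molar extinction coefficient (M^-1 cm^-1) the result is molar;
    with a mass coefficient (mL mg^-1 cm^-1) it is in mg/mL.
    """
    return a280 / (extinction_coeff * path_length_cm)

# Example: BSA has a mass extinction coefficient of roughly 0.667 mL mg^-1 cm^-1 at 280 nm,
# so an A280 of 0.5 in a 1 cm cuvette corresponds to about 0.75 mg/mL.
print(protein_concentration(0.5, 0.667))
```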

The architecture of such a system would be relatively straightforward. A web service provides the cloud based backup and configuration for captured data files and users. Clients that sit on group users’ personal computers as well as on instruments grab their configuration information from the central repository. They might simply monitor specified directories, or they might pop up with a defined set of questions to capture additional metadata. Users register the instruments that they want to “follow” and when a new data file is generated with their name on it, it is also synchronized back to their registered devices.

The web service provides a plugin architecture where appropriate plugins for the group can be added from some sort of online marketplace. Plugins that process data to generate additional metadata (e.g. by parsing a log file) can add that to the record of the data file. Those that generate new data will need to be configured as to where that data should go and who should have it synchronised. The plugins will also generate a record of what they did, providing an audit and provenance trail. Finally, plugins can provide notification back to the users, via email, the web service, or a desktop client, of queued processes that need more information to proceed. The user can mute these, but equally the encouragement is there to provide a little more info.
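A toy sketch of what the plugin contract might look like (all names here are illustrative, not an existing API; a real system would also deal with queuing, permissions and synchronisation):

```python
import datetime
from dataclasses import dataclass, field

@dataclass
class ProvenanceRecord:
    """Audit trail entry written every time a plugin runs."""
    plugin: str
    inputs: list
    outputs: list
    extracted_metadata: dict
    ran_at: str = field(default_factory=lambda: datetime.datetime.utcnow().isoformat())

class Plugin:
    """Base class for repository plugins: consume a data file, emit metadata,
    derived files, and a provenance record."""
    name = "base"

    def required_metadata(self, record):
        """Return the metadata keys still missing before this plugin can run;
        the web service turns these into notifications to the user."""
        return []

    def run(self, file_path, record):
        raise NotImplementedError

class LogFileParser(Plugin):
    name = "instrument-log-parser"

    def run(self, file_path, record):
        # Illustrative only: a real parser would pull settings out of the log file.
        metadata = {"instrument_settings": "parsed from log"}
        return ProvenanceRecord(self.name, inputs=[file_path], outputs=[],
                                extracted_metadata=metadata)
```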

Finally, the question of sharing and publication. For individual data file sharing sites like Figshare, plugins might enable straightforward direct submission of files dropped into a specific directory. For collections of data, such as those supported by Dryad, there will need to be a way to group files together, but again this could be as simple as creating a directory. Even if files are copied and pasted or removed from their “proper directories” the system stands a reasonable chance of recognizing files it has already seen and inferring their provenance.
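Recognising a file the system has already seen is cheap with content hashing, which is all that provenance inference needs to get started. A small sketch (in practice the index of digests would live in the central web service):

```python
import hashlib

def file_digest(path):
    """SHA-256 of a file's contents, so a copied, renamed, or moved file is still recognised."""
    h = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def lookup_provenance(path, seen):
    """seen maps digest -> original record id; a match means the "new" file
    inherits the provenance of the record it was first captured under."""
    return seen.get(file_digest(path))
```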

By making it easy to share, and easier to satisfy data sharing requests, by pushing data to a public space (while still retaining some ability to see how it is being used and by whom), we provide the substrate required to build better plugins, more functionality, and above all better discovery tools, and in turn those tools will start to develop. As the tools and functionality develop, the value gained by sharing will rise, creating a virtuous circle that encourages good data management practice, good metadata provision, and good sharing.

This path starts with things we can build today, and in some cases already exist. It becomes more speculative as it goes forward. There are issues with file synching and security. Things will get messy. The plugin architecture is nothing more than hand waving at the moment and success will require a whole ecosystem of repositories and tools for operating on them. But it is a way forward that looks plausible to me. One that solves the problems researchers have today and guides them towards a tomorrow where best practice is a bit closer to common practice.

 


Capturing and connecting research objects: A pitch for @sciencehackday

[Image: Capture and Connect: automated data capture, by cameronneylon via Flickr]

Jon Eisen asked a question on Friendfeed last week that sparked a really interesting discussion of what an electronic research record should look like. The conversation is worth a look as it illustrates different perspectives and views on what is important. In particular I thought Jon’s statement of what he wanted was very interesting:

I want a system where people record EVERYTHING they are doing in their research with links to all data, analyses, output, etc [CN – my italics]. And I want access to it from anywhere. And I want to be able to search it intelligently. Dropbox won’t cut it.

This is interesting to me because it maps onto my own desires: simple systems that make it very easy to capture digital research objects as they are created, and easy-to-use tools that make it straightforward to connect these objects up. This is in many ways the complement of the Research Communication as Aggregation idea that I described previously. By collecting all the pieces and connecting them up correctly we create a Research Record as Aggregation, making it easy to wrap pieces of this up and connect them to communications. It also provides a route towards bridging the divide between research objects that are born digital and those that are physical objects that need to be represented by digital records.

Ok. So much handwaving – what about building something? What about building something this weekend at ScienceHackDay? My idea is that we can use three pieces that have recently come together to build a demonstrator of how such a system might work. Firstly, the DropBox API is now available (and I have a developer key). DropBox is a great tool that delivers on the promise of doing one thing well. It sits on your computer and synchronises directories with the cloud and any other device you put it on. Just works. This makes it a very useful entry point for the capture of digital research objects. So Step One:

Build a web service on the DropBox API that enables users (or instruments) to subscribe and captures new digital objects, creating an exposed feed of resources.
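As a rough sketch of what that polling service might look like (this uses the Dropbox Python SDK as it exists today, which may differ from the API version mentioned above; the folder layout and access token are invented):

```python
# Sketch: poll a DropBox folder for new files so they can be republished as a feed of resources.
import dropbox

dbx = dropbox.Dropbox("ACCESS_TOKEN")  # hypothetical token

def new_objects(folder="/instruments", cursor=None):
    """Return (files, cursor): files added since the last call, plus a cursor
    to pass back next time so only newly appeared objects are returned."""
    result = (dbx.files_list_folder(folder, recursive=True) if cursor is None
              else dbx.files_list_folder_continue(cursor))
    files = [(entry.path_lower, entry.server_modified)
             for entry in result.entries
             if isinstance(entry, dropbox.files.FileMetadata)]
    return files, result.cursor

# First call lists everything; subsequent calls with the returned cursor yield only new files,
# which the web service would expose as its feed of captured research objects.
files, cursor = new_objects()
```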

This will enable us to capture and surface research objects with users simply dropping files into directories on local computers. Using DropBox means these can be natively synchronised across multiple user computers, which is nice. But then we need to connect these objects up, ideally in an automatic way. To do this we need a robust and general way of describing relationships between them. As part of the OREChem project, a collaboration between Cambridge, Southampton, Indiana, Penn State and Cornell Universities and PubChem, supported by Microsoft, Mark Borkum has developed an ontology that describes experiments (unfortunately there is nothing available on the web as yet – but I am promised there will be soon!). Nothing so new there, it has been done before. What is new here is that the OREChem vocabulary describes both plans and instances of carrying out those plans. It is very simple, essentially describing each part of a process as a “stage” which takes in inputs and emits outputs. The detailed description of these inputs and outputs is left to other vocabularies. The plan and the record can have a one-to-one correspondence but don’t need to. It is possible to ask whether a record satisfies a plan and, alternatively, given evidence that a plan has been carried out, to infer that all the required inputs must have existed at some point.

Why does this matter? It matters because for a particular experiment we can describe a plan. For instance a UV-Vis spectrophotometer measurement requires a sample, a specific instrument, and emits a digital file, usually in a specific format. If our webservice above knows that a particular DropBox account is associated with a UV-Vis instrument and it sees a new file of the right type it knows that the plan of a UV-Vis measurement must have been carried out. It also knows which instrument was used (based on the DropBox account) and might know who it was who did the measurement (based on the specific folder the file appeared in). The web service is therefore able to infer that there must exist (or have existed) a sample. Knowing this it can attempt to discover a record of this sample from known resources, the public web, or even by emailing the user, asking them for it, and then creating a record for them.
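A toy rendering of that plan/record idea and the inference it enables (this is my own illustration, not the OREChem vocabulary itself):

```python
from dataclasses import dataclass

@dataclass
class Stage:
    """A step in a plan: named inputs consumed, named outputs emitted."""
    name: str
    inputs: list
    outputs: list

# The plan for a UV-Vis measurement, in the spirit of the stage-based model described above.
UV_VIS_PLAN = Stage(
    name="uv-vis measurement",
    inputs=["sample", "uv-vis instrument"],
    outputs=["spectrum file"],
)

def missing_inputs(plan, observed):
    """Given evidence the plan ran (e.g. we saw the spectrum file appear), list the
    inputs that must have existed but for which we hold no record yet."""
    return [i for i in plan.inputs if i not in observed]

# A spectrum file appeared from a known instrument's DropBox folder, so we can infer
# that a sample must exist and go looking for (or ask the user for) its record.
print(missing_inputs(UV_VIS_PLAN, observed=["uv-vis instrument", "spectrum file"]))
```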

A quick and dirty way of building a data model and linking it to objects on the web is to use Freebase and the Freebase API. This also has the advantage that we can leverage Freebase Gridworks to add records from spreadsheets (e.g. sample lists) into the same data model. So Step Two:

Implement OREChem experiment ontology in Freebase. Describe a small set of plans as examples of particular experimental procedures.

And then Step Three:

Expand the web service built in Step One to annotate digital research objects captured in Freebase and connect them to plans. Attempt to build in automatic discovery of inferred resources from known and unknown sources, and a system that falls back to asking the user directly.

Freebase and DropBox may not be the best way to do this, but both provide a documented API that could enable something to be lashed up quickly. I’m equally happy to be told that SugarSync, Open Calais, or Talis Connected Commons might be better ways to do this, especially if someone with expertise in them will be at ScienceHackDay. Demonstrating something like this could be extremely valuable, as it would actually leverage semantic web technology to do something useful for researchers, linking their data into a wider web, while not actually bothering them with the details of angle brackets.
