
Building the perfect data repository…or the one that might get used

27 October 2011
[Image, via Wikipedia: a UV-vis readout of bis(triphenylphosphine) ni...]

While there has been a lot of talk about data repositories and data publication there remains a real lack of good tools that are truly attractive to research scientists and also provide a route to more general and effective data sharing. Peter Murray-Rust has recently discussed the deficiencies of the traditional institutional repository as a research data repository in some depth [1, 2, 3, 4].

Data publication approaches and tools are appearing, including Dryad, Figshare, BuzzData, and more traditional venues such as GigaScience from BioMed Central, but these are all formal mechanisms that involve significant additional work alongside an active decision to “publish the data”. The research repository of choice remains a haphazard file store and the data sharing mechanism of choice remains email. How do we bridge this gap?

One of the problems with many efforts in this space is how they are conceived and sold to the user. “Making it easy to put your data on the web” and “helping others to find your data” solve problems that most researchers don’t think they have. Most researchers don’t want to share at all, preferring to retain as much of an advantage through secrecy as possible. Those who do see a value in sharing are for the most part highly skeptical that the vast majority of research data can be used outside the lab in which it was generated. The small remainder who see a value in wider research data sharing are painfully aware of how much work it takes to make that data useful.

A successful data repository system will start by solving a different problem, a problem that all researchers recognize they have, and will then nudge the users into doing the additional work of recording (or allowing the capture of) the metadata that could make that data useful to other researchers. Finally it will quietly encourage them to make the data accessible to other researchers. Both the nudge and the encouragement will work by offering the user immediate benefits in the form of automated processing, derived data products, or other incentives.

But first the problem to solve. The problem that researchers recognize they have, in many cases prompted by funders’ data sharing plan requirements, is to properly store and back up their data. A further problem most PIs realize they have is getting access to the data of their group in a useful form. So the initial sales pitch for a data repository is going to be local and secure backup, plus sharing within the group. This has to be dead simple and ideally automated.

Such a repository will capture as much data as possible at source, as it is generated. It will just grab the file and store it with minimal contextual data: who (which directory was it saved in), when (acquired from the initial file datestamp), and what (where has it come from), backing it up and exposing it to the research group (or subsets of it) via a simple web service. It will almost certainly involve some form of Dropbox-style system which synchronises a user’s own data across their own devices. Here is an immediate benefit: I don’t have to get the pen drive out of my pocket if I’m confident my new data will be on my laptop when I get back to it.
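To make this concrete, here is a minimal sketch of what that capture step might look like, assuming a Python client built on the watchdog library. The directory layout, the user mapping, and the push_to_repository call are all illustrative, not taken from any of the systems mentioned below.

```python
import os
import time

from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer


class CaptureHandler(FileSystemEventHandler):
    """Grab each new file and record the minimal who/when/what metadata."""

    def __init__(self, instrument, user_from_dir):
        self.instrument = instrument        # the "what": which instrument produced the file
        self.user_from_dir = user_from_dir  # sub-directory name -> researcher (the "who")

    def on_created(self, event):
        if event.is_directory:
            return
        path = event.src_path
        parent = os.path.basename(os.path.dirname(path))
        record = {
            "path": path,
            "who": self.user_from_dir.get(parent, "unknown"),
            "when": os.path.getmtime(path),  # the initial file datestamp
            "what": self.instrument,
        }
        push_to_repository(record)


def push_to_repository(record):
    # Stand-in for the backup / web-service upload step.
    print("captured:", record)


if __name__ == "__main__":
    observer = Observer()
    observer.schedule(CaptureHandler("UV-Vis", {"alice": "alice"}), "/data/uvvis", recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    finally:
        observer.stop()
        observer.join()
```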

It will allow for simple configuration on each instrument that sets up a target directory and registers a filetype, so that the system can recognize what instrument, or computational process, a file came from (the what). The who can be more complex, but it can be built from a combination of designated directories (where a user has their own directory of data on a specific instrument), login info, and, where required, low-irritation request or claiming systems. The when is easy. The sell here is in two parts: directory synching across computers means less mucking around with USB keys, and the backup makes everyone feel better, the researcher in the lab, the PI, and the funder.
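The per-instrument configuration could be as small as a mapping like the following; the instrument names, paths, and “who rules” are made up purely for illustration.

```python
# Illustrative per-instrument configuration (names and paths are invented):
# each entry registers a watch directory and filetypes so the system can
# infer the "what", plus a rule for deriving the "who".
INSTRUMENT_CONFIG = {
    "uvvis-lab2": {
        "watch_dir": "/instruments/uvvis/output",
        "filetypes": [".csv"],
        "who_rule": "subdirectory",  # per-user directories on the instrument PC
    },
    "nmr-400": {
        "watch_dir": "C:/nmrdata",
        "filetypes": [".fid"],
        "who_rule": "login",  # take the identity from the instrument login
    },
}
```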

So far so good, and there are in fact examples of systems like this that already exist in one form or another, or are being developed, including DataStage within the DataFlow project from David Shotton’s group at Oxford, the Clarion project (the PMR group at Cambridge), and Smart Research Frameworks (which I have an involvement with) led by Jeremy Frey at Southampton. I’m sure there are dozens of other systems or locally developed tools that do similar things, and these are a good starting point.

The question is how you take systems like this and push them to the next level. How do you capture, or encourage the user to provide, enough metadata to actually make the stored data more widely useful? Particularly when they don’t have any real interest in sharing or data publication? I think there is significant potential in offering downstream processing of the data.

If This Then That (IFTTT) is a startup that has got quite a bit of attention over the past few weeks as it has come into public beta. The concept is very simple. For a defined set of services there are specific triggers (posting a tweet, favouriting a YouTube video) that can be used to set off another action at another service (send an email, bookmark the URL of the tweet or the video). What if we could offer data processing steps to the user in the same way? If the processing steps happen automatically but require a bit more metadata, will that provide the incentive to get that metadata in?
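Translated to the data setting, an IFTTT-style rule is just a trigger (a filetype arriving from a known source) paired with an action (a processing step). A sketch, purely illustrative and reusing the record shape from the capture example above:

```python
# Illustrative trigger/action rules: "if a file of this type arrives from
# this source, then run this processing step". Not an existing API.
RULES = [
    {"if": {"filetype": ".csv", "source": "UV-Vis"},
     "then": "calculate_protein_concentration"},
    {"if": {"filetype": ".fid", "source": "nmr-400"},
     "then": "process_spectrum_and_link_to_notebook"},
]


def dispatch(record, rules, actions):
    """Run every registered action whose trigger matches a captured file record."""
    for rule in rules:
        trigger = rule["if"]
        if record["path"].endswith(trigger["filetype"]) and record["what"] == trigger["source"]:
            actions[rule["then"]](record)
```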

This concept may sound a lot like the functionality provided by workflow engines, but there is a difference. Workflow systems are generally very difficult for the general user to set up. This is mostly because they solve a general problem, that of putting any object into any suitable process. IFTTT offers something much simpler, a small set of common actions on common objects, that solves the 80/20 problem. Workflows are hard because they can do anything with any object, and that flexibility comes at a price: a generic system has no way of knowing whether a given CSV file is from a UV-Vis instrument, a small-angle X-ray experiment, or a simulated data set.

But locally, within a research group, there is a much more limited set of data objects. With a local (or localized) repository it is possible to imagine plugins that do common single steps on common files. And because the configuration is local, much less metadata is required; in turn, that configuration itself provides metadata. If a particular filetype from a directory is configured for automated calculation of A280 protein concentrations, then we know that those data files are UV-Vis spectra. What is more, once we know that, we can offer an automated protein concentration calculator. This will only work if the system knows what protein you are measuring, which is an incentive to identify the sample when you do the measurement.
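The calculation behind such a plugin is just the Beer-Lambert law, and it makes the incentive concrete: the extinction coefficient is protein-specific, so the step cannot run until the sample has been identified. A minimal worked example, with the lysozyme coefficient given only approximately:

```python
def protein_concentration(a280, extinction_coeff, path_length_cm=1.0):
    """Molar concentration from absorbance at 280 nm (Beer-Lambert: A = e * c * l)."""
    return a280 / (extinction_coeff * path_length_cm)


# Example: hen egg-white lysozyme, epsilon_280 roughly 38,000 M^-1 cm^-1
conc_molar = protein_concentration(0.76, 38000)
print(f"{conc_molar * 1e6:.1f} uM")  # ~20 uM
```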

The architecture of such a system would be relatively straightforward. A web service provides the cloud based backup and configuration for captured data files and users. Clients that sit on group users’ personal computers as well as on instruments grab their configuration information from the central repository. They might simply monitor specified directories, or they might pop up with a defined set of questions to capture additional metadata. Users register the instruments that they want to “follow” and when a new data file is generated with their name on it, it is also synchronized back to their registered devices.
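On the client side this need be nothing more than fetching configuration from the central service and then watching the directories it names. A sketch, with the URL and the JSON shape entirely assumed:

```python
import requests


def load_client_config(base_url, user):
    """Fetch this user's watch directories and followed instruments from the web service."""
    resp = requests.get(f"{base_url}/config", params={"user": user}, timeout=10)
    resp.raise_for_status()
    return resp.json()  # e.g. {"follow": ["nmr-400"], "watch_dirs": ["~/data"]}


# config = load_client_config("https://repo.example.org", "alice")
```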

The web service provides a plugin architecture where appropriate plugins for the group can be added from some sort of online marketplace. Plugins that process data to generate additional metadata (e.g. by parsing a log file) can add that to the record of the data file. Those that generate new data will need to be configured as to where that data should go and who should have it synchronised. The plugins will also generate a record of what they did, providing an audit and provenance trail. Finally, plugins can provide notification back to the users, via email, the web service, or a desktop client, of queued processes that need more information to proceed. The user can mute these, but equally the encouragement is there to provide a little more info.
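A plugin contract that supports all of this could be very small: each plugin declares the metadata it needs (which is what drives the notifications) and returns derived data plus a provenance entry. The interface below is entirely illustrative:

```python
import datetime


class Plugin:
    """Base contract: declare required metadata, run on a captured file record."""

    name = "base"
    requires = []  # metadata fields that must be present before the plugin can run

    def run(self, record):
        raise NotImplementedError


class A280ConcentrationPlugin(Plugin):
    name = "a280-concentration"
    requires = ["protein_id"]  # a missing field queues a notification to the user

    def run(self, record):
        derived = {"concentration_uM": 20.0}  # placeholder for the real calculation
        provenance = {
            "plugin": self.name,
            "input": record["path"],
            "ran_at": datetime.datetime.utcnow().isoformat(),
        }
        return derived, provenance
```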

Finally, the question of sharing and publication. For individual data file sharing sites like Figshare, plugins might enable straightforward direct submission of files dropped into a specific directory. For collections of data, such as those supported by Dryad, there will need to be a way to group files together, but again this could be as simple as creating a directory. Even if files are copied and pasted or removed from their “proper directories”, the system stands a reasonable chance of recognizing files it has already seen and inferring their provenance.
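Recognizing previously seen files is straightforward with a content hash keyed against the original capture record; a sketch:

```python
import hashlib


def file_fingerprint(path, chunk_size=1 << 20):
    """Content hash of a file, read in 1 MB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# seen = {fingerprint: original capture record}; a match lets provenance be
# inferred even when a file has been copied, renamed, or moved elsewhere.
```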

By making it easy to share, and easier to satisfy data sharing requests by pushing data to a public space (while still retaining some ability to see how it is being used and by whom), we provide the substrate required to build better plugins, more functionality, and above all better discovery tools, and in turn those tools will start to develop. As the tools and functionality develop, the value gained by sharing will rise, creating a virtuous circle that encourages good data management practice, good metadata provision, and good sharing.

This path starts with things we can build today, some of which already exist. It becomes more speculative as it goes forward. There are issues with file synching and security. Things will get messy. The plugin architecture is nothing more than hand waving at the moment, and success will require a whole ecosystem of repositories and tools for operating on them. But it is a way forward that looks plausible to me. One that solves the problems researchers have today and guides them towards a tomorrow where best practice is a bit closer to common practice.

 


6 Comments

  • Matthew Todd said:

    Right – agreed. Nice link with ifttt which seems curiously useful in ways I don’t understand. This is a very central issue for open science, and the things you mention are looking very interesting. However, as I’ve argued here:

    http://intermolecular.wordpress.com/2011/09/18/the-broader-chemical-communitys-view-of-uploading-data/

    the issue (at least in my discipline) is that the natural place to share data is in a lab book, managed by the user. i.e. raw data, timestamped and frequently updated. It is enough to expect the person generating the data to include some metadata since that is to that user’s advantage – e.g. a chemist will want to apply structural information to files so that the lab book can be searched. However it goes one step too far if there is an expectation that the scientist would then spend time sharing/preparing the data more widely than that – there just isn’t the time in the day. The crucial element that’s missing at the moment is a tool to mine data that are being posted in lab books, without being alerted to them being there. A lab book crawler. Like I say in the above post – Imagine if Google had said “Once you’ve created a web page, just send us the details and we’ll put it in our index.”

    I don’t know how to implement a lab book crawler, as you know…

  • Cameron Neylon said:

    Agree absolutely. But I think there is an avenue to getting people to do a little more work immediately, as long as they do get something back immediately. So to speak to your example. If, when dropping an NMR spectrum (raw off the machine) into a folder, a little popup asks which experiment it relates to, but in exchange it processes the spectrum, connects it to the lab notebook entry, and provides it in a form that can then be easily accessed from the web in a nice little widget, I think that’s plausible.
    The problem with crawlers, which I agree are a necessity, is that they rely on some form of structured data. They work for the web because the links are structured data. They won’t, in and of themselves, solve the problem of having good indexes because that requires some structured information (e.g. this compound is an input and this compound is an expected output of this synthesis, and actually this is what we got). A crawler might be able to take a stab at that but it won’t do a good job unless that info is recorded in some form. So going back to your comment about interfaces it is critical to build things that make it very easy and natural to record that information with the lowest possible overhead to the user, which probably means a custom interface for every procedure, which is hopefully what the Plan and Enactment stuff in Blog3 should deliver (coming real soon now…)

  • Around the Web: Irreverant scientists, Bookstores & choices, Myths about women in tech and more : Confessions of a Science Librarian said:

    […] Building the perfect data repository…or the one that might get used […]

  • Keys to a Successful Data Repository « BioMed 2.0 said:

    […] Cameron Neylon posted an interesting article on his blog, reflecting on some of the challenges in building a data repository: One of the […]

  • Friday’s reading – Carl Boettiger said:

    […] Neylon: on building useful databases. insightful, practical, an so […]

  • Software Carpentry » The Best vs. the Good said:

    […] Neylon recently posted another thought-provoking piece, this one titled, “Building the perfect data repository…or the one that might get used“. In it, he talks about why big institutional efforts to create scientific data repositories […]