Building the perfect data repository…or the one that might get used

While there has been a lot of talk about data repositories and data publication, there remains a real lack of good tools that are truly attractive to research scientists and also provide a route to more general and effective data sharing. Peter Murray-Rust has recently discussed the deficiencies of the traditional institutional repository as a research data repository in some depth [1, 2, 3, 4].

Data publication approaches and tools are appearing, including Dryad, Figshare, BuzzData and more traditional venues such as GigaScience from BioMedCentral, but these are all formal mechanisms that involve significant additional work alongside an active decision to “publish the data”. The research repository of choice remains a haphazard file store and the data sharing mechanism of choice remains email. How do we bridge this gap?

One of the problems with many efforts in this space is how they are conceived and sold to the user. “Making it easy to put your data on the web” and “helping others to find your data” solve problems that most researchers don’t think they have. Most researchers don’t want to share at all, preferring to retain as much of an advantage through secrecy as possible. Those who do see a value in sharing are for the most part highly skeptical that the vast majority of research data can be used outside the lab in which it was generated. The small remainder who see a value in wider research data sharing are painfully aware of how much work it is to make that data useful.

A successful data repository system will start by solving a different problem, a problem that all researchers recognize they have, and will then nudge the users into doing the additional work of recording (or allowing the capture of) the metadata that could make that data useful to other researchers. Finally it will quietly encourage them to make the data accessible to other researchers. Both the nudge and the encouragement will arise by offering back to the user immediate benefits in the form of automated processing, derived data products, or other incentives.

But first, the problem to solve. The problem that researchers recognize they have, in many cases prompted by funder data sharing plan requirements, is to properly store and back up their data. A further problem most PIs realize they have is access to the data of their group in a useful form. So the initial sales pitch for a data repository is going to be local and secure backup and sharing within the group. This has to be dead simple and ideally automated.

Such a repository will capture as much data as possible at source, as it is generated: just grabbing the file and storing it with the minimal contextual data of who (which directory was it saved in), when (acquired from the initial file datestamp), and what (where has it come from), backing it up and exposing it to the research group (or subsets of it) via a simple web service. It will almost certainly involve some form of Dropbox-style system which synchronises a user’s own data across their own devices. Here is an immediate benefit: I don’t have to get the pen drive out of my pocket if I’m confident my new data will be on my laptop when I get back to it.
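To make that concrete, here is a minimal sketch of what such a capture client might do, assuming each user saves into their own directory on the instrument PC. The paths, instrument name, and metadata fields are illustrative assumptions, not any existing tool’s conventions.

```python
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

WATCHED_DIR = Path("/instruments/uv-vis/alice")  # "who": the user's own directory on the instrument
BACKUP_DIR = Path("/group-repository/incoming")  # where the repository keeps its copy
INSTRUMENT = "uv-vis-01"                         # "what": configured once per watched directory


def capture_new_files(already_seen: set) -> None:
    """Grab any new file, record minimal who/when/what metadata, and back it up."""
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    for path in WATCHED_DIR.glob("*"):
        if path in already_seen or path.is_dir():
            continue
        record = {
            "who": WATCHED_DIR.name,  # owner inferred from the directory the file was saved in
            "when": datetime.fromtimestamp(
                path.stat().st_mtime, tz=timezone.utc
            ).isoformat(),            # acquired from the file datestamp
            "what": INSTRUMENT,       # the instrument or computational process it came from
            "original_path": str(path),
        }
        backup = BACKUP_DIR / path.name
        shutil.copy2(path, backup)    # the backup itself
        backup.with_name(backup.name + ".meta.json").write_text(json.dumps(record, indent=2))
        already_seen.add(path)
```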

It will allow for simple configuration on each instrument that sets up a target directory and registers a filetype so that the system can recognize what instrument, or computational process, a file came from (the what). The who can be more complex, but a combination of designated directories (where a user has their own directory of data on a specific instrument), login info, and, where required, low-irritant request or claiming systems can be built. The when is easy. The sell here is in two parts: directory synching across computers means less mucking around with USB keys, and the backup makes everyone feel better: the researcher in the lab, the PI, and the funder.
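The per-instrument configuration itself could be as simple as a small table like the hypothetical one sketched below; the instrument names, directories, and filetypes are made up for illustration.

```python
# Hypothetical per-instrument configuration. Each entry registers a target
# directory and filetypes (the "what") and says how to work out the "who".
INSTRUMENT_CONFIG = {
    "uv-vis-01": {
        "watch_dir": "/instruments/uv-vis",   # target directory set up on the instrument PC
        "file_extensions": [".csv", ".jdx"],  # registered filetypes for this instrument
        "who_from": "subdirectory",           # each user saves into their own folder
    },
    "hplc-02": {
        "watch_dir": "C:/HPLC/exports",
        "file_extensions": [".arw"],
        "who_from": "login",                  # fall back to the instrument login instead
    },
}
```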

So far so good, and there are in fact examples of systems like this that exist in one form or another or are being developed, including DataStage within the DataFlow project from David Shotton’s group at Oxford, the Clarion project (PMR group at Cambridge) and Smart Research Frameworks (that I have an involvement with) led by Jeremy Frey at Southampton. I’m sure there are dozens of other systems or locally developed tools that do similar things, and these are a good starting point.

The question is how you take systems like this and push them to the next level. How do you capture, or encourage the user to provide, enough metadata to actually make the stored data more widely useful? Particularly when they don’t have any real interest in sharing or data publication? I think there is significant potential in offering downstream processing of the data.

If This Then That (IFTTT) is a startup that has got quite a bit of attention over the past few weeks as it has come into public beta. The concept is very simple. For a defined set of services there are specific triggers (posting a tweet, favouriting a YouTube video) that can be used to set off another action at another service (send an email, bookmark the URL of the tweet or the video). What if we could offer data processing steps to the user in the same way? If the processing steps happen automatically but require a bit more metadata, will that provide the incentive to get that data in?

This concept may sound a lot like the functionality provided by workflow engines, but there is a difference. Workflow systems are generally very difficult for the general user to set up, mostly because they solve a general problem: that of putting any object into any suitable process. IFTTT offers something much simpler, a small set of common actions on common objects, which solves the 80/20 problem. Workflows are hard precisely because they can do anything with any object, and that flexibility comes at a price: in the general case it is difficult to know whether a given csv file is from a UV-Vis instrument, a small-angle x-ray experiment, or a simulated data set.
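The shape of the simpler, IFTTT-like alternative might look something like the sketch below: a handful of locally configured trigger/action pairs rather than a general workflow engine. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List


@dataclass
class Rule:
    """An IFTTT-style rule: if a file of this type appears here, then run that action."""
    trigger_dir: Path                 # "this": where the file appears
    trigger_suffix: str               # "this": what kind of file it is
    action: Callable[[Path], None]    # "that": the processing step to run


def apply_rules(new_file: Path, rules: List[Rule]) -> None:
    for rule in rules:
        if new_file.suffix == rule.trigger_suffix and rule.trigger_dir in new_file.parents:
            # Because the rule is configured locally, matching it also tells us
            # what kind of data the file contains.
            rule.action(new_file)
```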

But locally, within a research group, there is a more limited set of data objects. With a local (or localized) repository it is possible to imagine plugins that do common single steps on common files. And because the configuration is local there is much less metadata required; in turn, that configuration itself provides metadata. If a particular filetype from a particular directory is configured for automated calculation of A280 for protein concentrations, then we know that those data files are UV-Vis spectra. What is more, once we know that, we can offer an automated protein concentration calculator. This will only work if the system knows what protein you are measuring, which is an incentive to identify the sample when you do the measurement.
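As a sketch of what such a plugin might do (the CSV layout and function name are assumptions, not any particular instrument’s format), the concentration step is just the Beer-Lambert law, c = A280 / (ε × l), where ε is the molar extinction coefficient of the protein at 280 nm and l the path length. The extinction coefficient is exactly the protein-specific piece of information the system needs, which is why identifying the sample becomes worthwhile.

```python
import csv
from pathlib import Path


def protein_concentration(spectrum_csv: Path,
                          molar_extinction: float,      # ε at 280 nm, in M^-1 cm^-1
                          path_length_cm: float = 1.0) -> float:
    """Return protein concentration in mol/L via Beer-Lambert: c = A280 / (ε · l).

    Assumes a plain two-column CSV of wavelength (nm), absorbance.
    """
    a280 = None
    with spectrum_csv.open() as fh:
        for row in csv.reader(fh):
            try:
                wavelength, absorbance = float(row[0]), float(row[1])
            except (ValueError, IndexError):
                continue  # skip headers or malformed lines
            if abs(wavelength - 280.0) < 0.5:
                a280 = absorbance
    if a280 is None:
        raise ValueError("No data point near 280 nm; is this really a UV-Vis spectrum?")
    return a280 / (molar_extinction * path_length_cm)
```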

The architecture of such a system would be relatively straightforward. A web service provides the cloud based backup and configuration for captured data files and users. Clients that sit on group users’ personal computers as well as on instruments grab their configuration information from the central repository. They might simply monitor specified directories, or they might pop up with a defined set of questions to capture additional metadata. Users register the instruments that they want to “follow”, and when a new data file is generated with their name on it, it is also synchronised back to their registered devices.

The web service provides a plugin architecture where appropriate plugins for the group can be added from some sort of online marketplace. Plugins that process data to generate additional metadata (e.g. by parsing a log file) can add that to the record of the data file. Those that generate new data will need to be configured as to where that data should go and who should have it synchronised. The plugins will also generate a record of what they did, providing an audit and provenance trail. Finally, plugins can provide notification back to the users, via email, the web service, or a desktop client, of queued processes that need more information to proceed. The user can mute these, but equally the encouragement is there to provide a little more info.
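A plugin contract along these lines might look roughly like the following sketch. The class and field names are invented for illustration, but they show how the provenance record and the “needs more information” notification can fall naturally out of the same interface.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Dict, List


@dataclass
class ProvenanceRecord:
    plugin: str
    input_file: str
    outputs: List[str]
    ran_at: str
    needs_user_input: List[str] = field(default_factory=list)  # queued questions for the user


class Plugin:
    name = "base"

    def required_metadata(self) -> List[str]:
        """Metadata keys that must be present before this plugin can run."""
        return []

    def run(self, data_file: Path, metadata: Dict[str, str]) -> ProvenanceRecord:
        raise NotImplementedError


def execute(plugin: Plugin, data_file: Path, metadata: Dict[str, str]) -> ProvenanceRecord:
    missing = [key for key in plugin.required_metadata() if key not in metadata]
    if missing:
        # Don't fail silently: record a queued process and notify the user
        # that a little more information is needed before it can proceed.
        return ProvenanceRecord(plugin.name, str(data_file), [],
                                datetime.now(timezone.utc).isoformat(),
                                needs_user_input=missing)
    return plugin.run(data_file, metadata)  # the returned record doubles as the audit trail
```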

Finally, the question of sharing and publication. For individual data file sharing sites like Figshare, plugins might enable straightforward direct submission of files dropped into a specific directory. For collections of data, such as those supported by Dryad, there will need to be a way to group files together, but again this could be as simple as creating a directory. Even if files are copied and pasted or removed from their “proper” directories, the system stands a reasonable chance of recognizing files it has already seen and inferring their provenance.
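Recognising a previously seen file is the easy part: hashing the file contents gives an identity that survives renaming, copying, and moving, so the repository can look the hash up in its index and recover the original who/when/what. A minimal sketch, with the index structure assumed rather than prescribed:

```python
import hashlib
from pathlib import Path
from typing import Dict, Optional


def content_hash(path: Path) -> str:
    """Hash the file contents so its identity survives renames and copies."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()


def infer_provenance(path: Path, index: Dict[str, dict]) -> Optional[dict]:
    """Look the hash up in the repository's index and return the stored record, if any."""
    return index.get(content_hash(path))
```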

By making it easy to share, and easier to satisfy data sharing requests by pushing data to a public space (while still retaining some ability to see how it is being used and by whom), such a system provides the substrate required to build better plugins, more functionality, and above all better discovery tools, and in turn those tools will start to develop. As the tools and functionality develop, the value gained by sharing will rise, creating a virtuous circle that encourages good data management practice, good metadata provision, and good sharing.

This path starts with things we can build today, some of which already exist. It becomes more speculative as it goes forward. There are issues with file synching and security. Things will get messy. The plugin architecture is nothing more than hand waving at the moment, and success will require a whole ecosystem of repositories and tools for operating on them. But it is a way forward that looks plausible to me. One that solves the problems researchers have today and guides them towards a tomorrow where best practice is a bit closer to common practice.

 


Wears the passion? Yes it does rather…

Quite some months ago an article in Cancer Biology & Therapy by Scott Kern of Johns Hopkins kicked up an almighty online stink. The article, entitled “Where’s the passion?”, bemoaned the lack of hard-core dedication amongst the younger researchers the author saw around him…starting with:

It is Sunday afternoon on a sunny, spring day.

I’m walking the halls—all of them—in a modern $59 million building dedicated to cancer research. A half hour ago, I completed a stroll through another, identical building. You see, I’m doing a survey. And the two buildings are largely empty.

The point being that if they really cared, those young researchers would be there day in, day out, working their hearts out to get to the key finding. At one level this is risible: expecting everyone to work 24×7 is not a good or efficient way to get results. Furthermore you have to wonder why these younger researchers have “lost their passion”. Why doesn’t the environment create that naturally? What messages are the tenured staff sending through their actions? But I’d be being dishonest if I said there wasn’t a twinge of sympathy on my part as well. Anyone who’s run a group has had that thought at the back of their mind: “if only they’d work harder/smarter/longer we’d be that much further ahead…”.

But all of that has been covered by others. What jumped out of the piece for me at the time were some other passages, ones that really got me angry.

When the mothers of the Mothers March collected dimes, they KNEW that teams, at that minute, were performing difficult, even dangerous, research in the supported labs. Modern cancer advocates walk for a cure down the city streets on Saturday mornings across the land. They can comfortably know that, uh…let’s see here…, some of their donations might receive similar passion. Anyway, the effort should be up to full force by 10 a.m. or so the following Monday.

[…]

During the survey period, off-site laypersons offer comments on my observations. “Don’t the people with families have a right to a career in cancer research also?” I choose not to answer. How would I? Do the patients have a duty to provide this “right”, perhaps by entering suspended animation?

Now these are all worthy statements. We’d all like to see faster development of cures and I’ve no doubt that the people out there pounding the streets are driven to do all they can to see those cures advance. But is the real problem here whether the postdocs are here on a Sunday afternoon, or are there other things we could do to advance this? Maybe there are other parts of the research enterprise that could be made more efficient…like, I don’t know, making the results of research widely available and ensuring that others are in the best position possible to build on their results?

It would be easy to pick on Kern’s record on publishing open access papers. Has he made all the efforts that would enable patients and doctors to make the best decisions they can on the basis of his research? His lab generates cell lines that can support further research. Are those freely available for others to use and build on? But to pick on Kern personally is to completely miss the point.

No, the problem is that this is systemic. Researchers across the board seem to have no interest whatsoever in looking closely at how we might deliver outcomes faster. No-one is prepared to think about how the system could be improved so as to deliver more, because everyone is too focussed on climbing up the greasy pole: writing the next big paper and landing the next big grant. What is worse is that it is precisely in those areas where there is most public effort to raise money, where there is a desperate need, that attitudes towards making research outputs available are at their worst.

What made me absolutely incandescent about this piece was a small piece of data that some of us have known about for a while but which has only just been published. Heather Piwowar, who has done a lot of work on how and where people share, took a close look at the sharing of microarray data: what kinds of things are correlated with data sharing? The paper bears close reading (Full disclosure: I was the academic editor for PLoS ONE on this paper) but one thing has stood out for me as shocking since the first time I heard Heather discuss it: microarray data linked to studies of cancer is systematically less shared.

This is not an isolated case. Across the board there are serious questions to be asked about why it seems so difficult to get the data from studies that relate to cancer. I don’t want to speculate on the reasons because whatever they are, they are unacceptable. I know I’ve recommended this video of Josh Sommer speaking many times before, but watch it again. Then read Heather’s paper. And then decide what you think we need to do about it. Because this cannot go on.


It wasn’t supposed to be this way…

I’ve avoided writing about the Climatic Research Unit email leak for a number of reasons. Firstly, it is clearly a sensitive issue with personal ramifications for some, and for many others just a very highly charged issue. Probably more importantly, I simply haven’t had the time or energy to look into the documents myself. I haven’t, as it were, examined the raw data for myself, only other people’s interpretations. So I’ll try to stick to a very general issue here.

There appear to be broadly two responses from the research community to this saga. One is to close ranks and, to a certain extent, say “nothing was done wrong here”. This is, at some level, the tack taken by the Nature editorial of 3 December, which was headed up with “Stolen e-mails have revealed no scientific conspiracy…”. The other response is that the scandal has exposed the shambolic way that we deal with collecting, archiving, and making available both data and analysis in science, as well as the endemic issues around the hoarding of data by those who have collected it.

At one level I belong strongly in the latter camp, but I also appreciate the dismay that must be felt by those who have looked at, and understand, what the emails actually contain, and their complete inability to communicate this into the howling winds of what seems to a large extent to be a media beat-up. I have long felt that the research community would one day be shocked by the public response when, for whatever reason, the media decided to make a story about the appalling data sharing practices of publicly funded academic researchers like myself. If I’d thought about it more deeply I should have realised that this would most likely be around climate data.

Today the Times reports on its front page that the UK Meteorological Office is to review 160 years of climate data and has asked a range of contributing organisations to allow it to make the data public. The details of this are hazy, but if the UK Met Office is really going to make the data public this is a massive shift. I might be expected to be happy about this but I’m actually profoundly depressed. While it might in the longer term lead to more strongly worded and enforced policies, it will also lead to data sharing being forever associated with “making the public happy”. My hope has always been that the sharing of the research record would come about because people started to see the benefits, because they could see the possibilities of partnership with the wider community, and because it made their research more effective. Not because the tabloids told us we should.

Collecting the best climate data and doing the best possible analysis on it is not optional. If we get this wrong and don’t act effectively then, with some probability that is significantly above zero, our world ends. The opportunity is there to make this the biggest, most important, and most effective research project ever undertaken. To actively involve the wider community in measurement. To get an army of open source coders to re-write, audit, and re-factor the analysis software. Even to involve the (positively engaged) sceptics, using their interest and ability to look for holes and issues. Whether politicians will act on the data is not an issue that the research community can or should address; what we need to be clear on is that we provide the best data, the best analysis, and an honest view of the uncertainties, along with the ability of anyone to critically analyse the basis for those conclusions.

There is a clear and obvious problem with this path. One of the very few credible objections to open research that I have come across is that by making material available you open your inbox to a vast community of people who will just waste your time: the people who can’t be bothered to read the background literature or learn to use the tools; the ones who just want the right answer. This is nowhere more the case than with climate research, and it forms the basis of the most reasonable explanation of why the CRU (and every other repository of climate data, as far as I am aware) has not made more data or analysis software directly available.

There are no simple answers here, and my concern is that in a kneejerk response to suddenly make things available no-one will think to put in place the social and technical infrastructure that we need to support positive engagement, and to protect active researchers, both professional and amateur, from time-wasters. Interestingly, I think this infrastructure might look very similar to that which we need to build to effectively share the research we do, and to effectively discover the relevant work of others. Infrastructure is never sexy, particularly in the middle of a crisis. But there is one thing in the practice of research that we forget at our peril: any given researcher needs to earn the right to be taken seriously, but no-one ever earns the right to shut people up. Picking out the objection that happens to be important is something we have to at least attempt to build into our systems.

Show us the data now damnit! Excuses are running out.

A very interesting paper from Caroline Savage and Andrew Vickers was published in PLoS ONE last week detailing an empirical study of data sharing by PLoS journal authors. The results themselves, that only one out of ten corresponding authors provided data, are not particularly surprising, mirroring as they do previous studies of data sharing, both formal [pdf] and informal (also from Vickers; I assume this is a different data set).

Nor are the reasons why data was not shared particularly new. Two authors couldn’t be tracked down at all. Several did not reply, and the remainder came up with the usual excuses: “too hard”, “need more information”, “university policy forbids it”. The numbers in the study are small and it is a shame it wasn’t possible to do a wider study that might have teased out discipline, gender, and age differences in attitude. Such a study really ought to be done, but it isn’t clear to me how to do it effectively, properly, or indeed ethically. The reason small numbers were chosen was both to focus on PLoS authors, who might be expected to have more open attitudes, and to make the request made to the authors, that the data was to be used in a Masters educational project, plausible.

So while helpful, the paper itself doesn’t provide much that is new. What will be interesting will be to see how PLoS responds. These authors are clearly violating stated PLoS policy on data sharing (see e.g. the PLoS ONE policy). The papers should arguably be publicly pulled from the journals. Most journals have similar policies on data sharing, and most have no corporate interest in actually enforcing them. I am unaware of any cases where a paper has been retracted due to the authors’ unwillingness to share (if there are examples I’d love to know about them! [Ed. Hilary Spencer from NPG pointed us in the direction of some case studies in a presentation from Philip Campbell]).

Is it fair that a small group be used as a scapegoat? Is it really necessary to go for the nuclear option and pull the papers? As was said in a Friendfeed discussion thread on the paper: “IME [In my experience] researchers are reeeeeeeally good at calling bluffs. I think there’s no other way“. I can’t see any other way of raising the profile of this issue. Should PLoS take the risk of being seen as hardline on this? Risking the consequences of people not sending papers there because of the need to reveal data?

The PLoS offering has always been about quality: high-profile journals delivering important papers and, at PLoS ONE, critical analysis of the quality of the methodology. The perceived value of that quality is compromised by authors who do not make data available. My personal view is that PLoS would win by taking a hard line and the moral high ground. Your paper might be important enough to get into Journal X, but is the data of sufficient quality to make it into PLoS ONE? Other journals would be forced to follow, at least those that take quality seriously.

There will always be cases where data cannot or should not be made available. But these should be carefully delineated exceptions and not the rule. If you can’t be bothered putting your data into a shape worthy of publication then the conclusions you have based on that data are worthless. You should not be allowed to publish. End of. We are running out of excuses. The time to make the data available is now. If it isn’t backed by the data then it shouldn’t be published.

Update: It is clear from this editorial blog post from the PLoS Medicine editors that PLoS do not in fact know which papers are involved. As was pointed out by Steve Koch in the Friendfeed discussion, there is an irony that Savage and Vickers have not, in a sense, provided their own raw data, i.e. the emails and names of correspondents. However, I would accept that to do so would be an unethical breach of presumed privacy, as the correspondents might reasonably have expected these were private emails, and to publish names would effectively be entrapment. Life is never straightforward and this is precisely the kind of grey area we need more explicit guidance on.

Savage CJ, Vickers AJ (2009) Empirical Study of Data Sharing by Authors Publishing in PLoS Journals. PLoS ONE 4(9): e7078. doi:10.1371/journal.pone.0007078

Full disclosure: I am an academic editor for PLoS ONE and have raised the issue of insisting on supporting data for all charts and graphs in PLoS ONE papers in the editors’ forum. There is also a recent paper with my name on it in which the words “data not shown” appear. If anyone wants that data I will make sure they get it, and as soon as Nature enable article commenting we’ll try to get something up there. The usual excuses apply, and don’t really cut the mustard.