
As a researcher…I’m a bit bloody fed up with Data Management

16 June 2017

The following will come across as a rant. Which it is. But it’s a well-intentioned rant. Please bear in mind that I care about good practice in data sharing, documentation, and preservation. I know there are many people working to support it, generally under-funded, often having to justify their existence to higher-ups who care more about the next Glam Mag article than whether there’s any evidence to support the findings. But, and it’s an important but, those political fights won’t become easier until researchers know those people exist, value their opinions and input, and internalise the training and practice they provide. The best way for that to happen is to provide the discovery points, tools and support where researchers will find them, in a way that convinces us that you understand our needs. This rant is an attempt to illustrate how large that gap is at the moment.

As a researcher I…have a problem

I have a bunch of data. It’s a bit of a mess, but it’s not totally disorganised and I know what it is. I want to package it up neatly and put it in an appropriate data repository because I’m part of the 2% of researchers that actually care enough to do it. I’ve heard of Zenodo. I know that metadata is a thing. I’d like to do it “properly”. I am just about the best case scenario. I know enough to know I need to know more, but not so much I think I know everything.

More concretely I specifically have data from a set of interviews. I have audio and I have notes/transcripts. I have the interview prompt. I have decided this set of around 40 files is a good package to combine into one dataset on Zenodo. So my next step is to search for some guidance on how to organise and document that data. Interviews, notes, must be a common form of data package right? So a quick search for a tutorial, or guidance or best practice?

Nope. Give it a go. You either get a deep dive into metadata schema (and remember I’m one of the 2% who even know what those words mean) or you get very high level generic advice about data management in general. Maybe you get a few pages giving (inconsistent) advice on what audio file formats to use. What you don’t get is a set of instructions that says “this is the best way to organise these” or good examples of how other people have done it. The latter would be ideal: just finding an example which is regarded as good, and copying the approach. I’m really trying to get advice on a truly basic question: should I organise by interview (audio files and notes together) or by file type (with interviews split up)?

As a researcher trying to do a good job of data deposition, I want an example of my kind of data being done well, so I can copy it and get on with my research.

As a researcher…I’m late and I’m in a hurry. I don’t have the time to find you.

Now a logical response to my whining is “get thee to your research data support office and get professional help”. Which isn’t bad advice. Again, I’m one of the 5-10% who know that my institution actually has data support. The fact that I’m in the wrong timezone is perhaps a bit unusual, but the fact that I’m in a hurry is not. I should have done this last week, or a month ago, or before the funder started auditing data sharing. In the UK, with the whole country in a panic, this is particularly bad at the moment, with data and scholarly communications support folks oscillating wildly between trying to get any attention from researchers and being swamped when we all simultaneously panic because reports are due.

Most researchers are going to reach for the web and search. And the resources I could find are woeful as a whole. Many of them are incomprehensible, even to me. But worse, virtually none of them are actually directed at my specific use case. I need step by step instructions, with examples to copy. I’m sure there are good sources out there, but they’re not easy to find. Many of the highest ranked hits are out of date, and populated with dead links (more on that particular problem later). But the organisations providing that information are actually highly ranked by search engines. If those national archives and data support organisations worked together more, kept pages up to date and, frankly, did a bit of old-fashioned Google Bombing and SEO, it would help a lot. Again, remember I kind of know what I’m looking for. User testing on search terms could go a long way.

As a researcher looking for best practice guidance online, I need clear, understandable, and up to date guidance to be at the top of my search results, so I can avoid the frustration that will lead me to just give up.

As a researcher…if I figure out what I should do I want useable tools to help me.

Let’s imagine that I find the right advice and get on and figure out what I’m going to do. Let’s imagine that I’m going to create a structure with a top level human-readable readme.txt, machine readable DDI metadata, the interview prompt (rtf and txt) and then two directories, one for audio (FLAC, WAV), one for notes (rtf) with consistent filenames. I’m going to zip all that up and post it to Zenodo. I’m going to use the Community function at Zenodo to collect up all the packages I create for this project (because that provides an OAI-PMH end point for the project). Easy! Job done.
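For what it’s worth, the layout described above can be sketched in a few lines of Python using only the standard library. This is a minimal sketch, not a recommended standard: the file names (`metadata_ddi.xml`, `interview_01.flac` and so on) and the function names are invented for illustration, and the DDI file is assumed to have been produced by some other means.

```python
import zipfile
from pathlib import Path


def planned_paths(interview_ids):
    """Return the relative paths for the layout described above.

    All file names here are illustrative assumptions, not a standard.
    """
    paths = [
        "readme.txt",            # top level human-readable description
        "metadata_ddi.xml",      # machine readable metadata (created elsewhere)
        "interview_prompt.rtf",
        "interview_prompt.txt",
    ]
    for i in interview_ids:
        paths.append(f"audio/interview_{i:02d}.flac")
        paths.append(f"audio/interview_{i:02d}.wav")
        paths.append(f"notes/interview_{i:02d}.rtf")
    return paths


def zip_package(source_dir, zip_path):
    """Zip every file under source_dir, keeping paths relative to it.

    zip_path should live outside source_dir so the archive does not
    try to include itself.
    """
    source_dir = Path(source_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in sorted(source_dir.rglob("*")):
            if f.is_file():
                zf.write(f, f.relative_to(source_dir).as_posix())
    return zip_path
```

Something like `zip_package("interviews_package", "interviews_package.zip")` would then produce the single file to upload. Comparing `planned_paths(range(1, 41))` against what is actually on disk is a cheap way to spot a missing recording before depositing.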

Right. DDI metadata. There will be a tool for creating this obviously. Go to website…look for a “for researchers” link. Nope. Ok. Tools. That’s what I need. Ummm…ok…any of these called “writer”? No. “Editor”…ok a couple. That link is broken, those tools are Windows only. This one I have to pay for and is Windows only. This one has no installation instructions and seems to be hosted on Google Code.

Compared to some other efforts DDI is actually pretty good. The website looks as though it has been updated recently. If you dig down into “getting started” there is at least some outline advice that is appropriate for researchers, but that actually gets more confusing as you get deeper in. Should I just be using Dublin Core? Can’t you just send me to a simple set of instructions for a minimal metadata set? If the aim is for a standard, any standard, to get into the hands of the average jobbing researcher, it has to be either associated with tools I can use, or give examples where I can cut and paste to adapt for my own needs.
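To make “an example I can cut and paste” concrete, here is a hedged sketch of what a minimal Dublin Core record might look like when generated with Python’s standard library. The namespace URI is the real Dublin Core elements namespace; everything else is an assumption for illustration: the wrapper element name `metadata`, the choice of fields, and all of the placeholder values. It is not a claim about what a curator would consider sufficient.

```python
import xml.etree.ElementTree as ET

# The Dublin Core Metadata Element Set namespace.
DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def dublin_core_xml(fields):
    """Build a simple Dublin Core record from {element name: value or list}."""
    # The wrapper element name is an assumption; repositories differ here.
    root = ET.Element("metadata")
    for name, values in fields.items():
        if isinstance(values, str):
            values = [values]
        for value in values:
            element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
            element.text = value
    return ET.tostring(root, encoding="unicode")


# Placeholder values only; replace every one of them with your own.
example_record = {
    "title": "Interview recordings and notes (placeholder title)",
    "creator": "Surname, Given Name",
    "description": "Audio, notes and the interview prompt for a set of interviews.",
    "type": "Dataset",
    "format": ["FLAC audio", "RTF text"],
    "rights": "Licence name goes here",
}

record_xml = dublin_core_xml(example_record)
```

Writing `record_xml` to a file next to the readme gives a machine readable record that can be adapted by editing the dictionary rather than the XML.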

I’m fully aware that the last thing there sends a chill down the spine of both curators and metadata folks, but the reality is either your standard is widely used or it is narrowly implemented for those cases where you have fully signed up professional curators or you have a high quality tool. Both of these will only ever be niche. The majority of researchers will never fit the carefully built tools and pipelines. Most of us generate bitty datasets that don’t quite fit in the large scale specialised repositories. We will always have to adapt patterns and templates and that’s always going to be a bit messy. But I definitely fall on the side of at least doing something reasonably well rather than doing nothing because it’s not perfect.

As a researcher who knows what metadata or documentation standard to use, I need good usable (and discoverable) tools or templates to generate it, so that I a) actually do it and b) get it as right as possible.

As a researcher…I’m a bit confused at this point.

The message we get is that “this data stuff matters”. But when we go looking, what we mostly find is badly documented and not well preserved. Proliferating websites with out of date broken links and reference to obsolete tools. It looks bad when the message is that high quality documentation and good preservation matter, but those of us shouting that message don’t seem to follow it ourselves. This isn’t the first time I’ve said this: it wouldn’t hurt for RDM folks to do a better job of managing our own resources to the standards we seek to impose on researchers.

I get exactly why this is: none of this is properly funded or institutionally supported. It’s infrastructure. And worse than that, it’s human and social infrastructure rather than sexy new servers and big iron for computation. As Nature noted in an editorial this week, there is a hypocrisy at the heart of the funder/government-led data agenda in failing to provide the kinds of support needed for this kind of underpinning infrastructure. It’s no longer the money itself so much as the right forms of funding to support things that are important, but not exciting. I’m less worried about new infrastructures than actually properly preserving and integrating the standards and services we have.

But more than that there’s a big gap. I’ve framed my headings in the form of user stories. Most of the data sharing infrastructure, tools and standards are still failing to meet researchers where we actually are. I have some folder of data. I want to do a good job. What should I do, right now because the funder is on my back about it!?!

Two resources would have solved my problem:

  1. First an easily discoverable example of best practice for this kind of data collection. Something that came to the top of the search results when I search for “best practice for archiving depositing records of interviews”. An associated set of instructions and options would have been useful but not critical.
  2. Having identified what form of machine readable metadata was best practice a simple web-based platform independent tool to generate that metadata, either through a Q&A form based approach or some form of wizard. Failing that at least a good example I could modify.
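The Q&A form idea in the second item can be sketched very simply. What follows is a toy, command-line stand-in for the web-based wizard described above, under loudly stated assumptions: the questions, the field names and the JSON output are all invented for illustration and are not taken from DDI, Dublin Core or any real tool.

```python
import json

# Hypothetical questions: (field name, prompt, required?).
# A real tool would derive these from whichever standard is best practice.
QUESTIONS = [
    ("title", "What is the dataset called?", True),
    ("creator", "Who created it?", True),
    ("description", "What does it contain, in a sentence or two?", True),
    ("rights", "Under what licence can others reuse it?", False),
]


def run_wizard(ask=input):
    """Ask each question in turn, re-asking required ones until answered."""
    record = {}
    for key, prompt, required in QUESTIONS:
        answer = ask(prompt + " ").strip()
        while required and not answer:
            answer = ask("(required) " + prompt + " ").strip()
        if answer:
            record[key] = answer
    return record


def render(record):
    """Serialise the answers as JSON, ready to save alongside the data."""
    return json.dumps(record, indent=2, ensure_ascii=False)
```

Running `print(render(run_wizard()))` in a terminal walks through the questions and prints the resulting record. The point of the design is that the researcher only ever sees plain-language questions; mapping the answers onto a formal schema is the tool’s job.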

As a researcher that’s what I really needed and I really failed to find it. I’m sympathetic, moderately knowledgeable, and I can get the world’s best RDM advice by pinging off a tweet. And I still struggled. It’s little wonder that we’re losing most of the mainstream.

As a researcher concerned to develop better RDM practice, I need support to meet me where I am, so ultimately I can support you in making the case that it matters.


  • Mike Taylor said:

    I absolutely feel your pain. This is a horribly familiar story.

    Only one thing …

    > What should I do, right now because the funder is on my back about it!?!

    What you should do now is not just package your data in the format that seems best to you, but document that format — and let that documentation become the zeroth draft of a standard (whether formal or informal) for those who follow. And of course your example will illustrate it.

    Tools are good, too, but start with the spec.

  • terracerulean said:

    Thanks for posting this.
    I’ve worked on EResearch systems and services (and policy) for over a decade, and the issues you raise are endemic.
    It will be no comfort, but there is a systemic gap in how these systems are conceived, designed and implemented.
    Basically it’s the result of a conflict in perspective between the customer and users of these projects – where the customer is the person (senior management usually) who pays for the thing to be built and the users are people like yourself to whom these things are delivered.
    The thing that always seems to get lost is the simple principle that while the customer transfers value (by paying for the service) it’s the users who create value.
    Unfortunately the whole conceive, design, build, operate circus is set up to be more beholden to the customer than the users.
    It’s changing – slowly – but there it is.
    Those of us who do the design work often care a great deal about the users, and we are frustrated that we can’t put their interests ahead of the customers’.
    I will at least promise to save your post as a “user story” to argue for more attention to user needs in any future work I might do.

  • Peter Sefton said:

    Hi Cameron,

    We hear you!

    As it happens we’re currently working on data packaging at UTS. We’re not done, but I am due to talk about an alpha release of a tool that’s designed to make it easy to make a data package based on BagIt (a super-simple way of laying out content) with basic re-use and discovery metadata in JSON-LD, as well as HTML. I’d love to try it on your data, and get some requirements from you. You know where to find me if you’re interested.

    I had a go at this before


  • David Mellor said:

    I love a good rant! To answer one question: “I’m really trying to get advice on a truly basic question: should I organise by interview (audio files and notes together) or by file type (with interviews split up).” How do you organize your data? I suspect that the most effective way to share (and encourage others to share) is to simply use the structure that is most useful for you. If you collect with one structure (e.g. by interview, with audio and notes together) but then analyze after reorganizing into file types, simply share the structure that best represents what you have when you are ready to share. Standards exist after competing strategies are used, their relative merits are evaluated, and a big player or a community agrees on a standard, all of which require experience and trial and error.

    As an aside, I am not aware of publicly shared audio transcripts because of privacy issues, but here’s a simple dataset shared on the OSF of the scored data (https://osf.io/vj3h6/). Obviously the raw data would be more useful, but this is a good example of sharing what you can and maybe sharing more in the future.

    Here are some general advice articles on file organizing and creating data codebooks. http://help.osf.io/m/bestpractices I doubt any of it is news to you, but perhaps something there is useful or worth building upon.

  • William Kilbride said:

    I have grown used to sitting in sessions where various senior officials refer to ‘carrots’ and ‘sticks’ as if it were the only metaphor for how we might encourage researchers to engage in research data management activities. I have an underlying dislike of cliche, but also wonder what it says about us if we casually and routinely cast the brightest and best in the role of the donkey. And in any case it seems to me that this particular highly-skilled and highly-motivated donkey has worked hard to ensure it is already in the right place. Down with cliches: let’s look after the donkey where it is.

  • Cameron Neylon said:

    Hi David,

    Yes, that kind of practical “just be sensible” advice turned out to be hard to find. For instance that OSF page didn’t show up in my results, partly because I thought I was looking for something more specific, and partly because it was buried in pages of results of metadata schema.

    The more I’ve thought about this the more I think it’s really about discovering and coordinating resources that actually do exist online but are not well connected and don’t rise to the top of search results. Again, this is the argument for the professional support person who knows how to find this stuff. The challenge is in guiding the user towards them.

  • Cameron Neylon said:

    I always liked the line that “any sufficiently large carrot can also be deployed as a stick”…but yes the point is well taken. It’s a little worse than that in the sense that all of the various groups view all of the others (admin and library as seen by researchers, admin and researchers as seen by library etc etc) as “donkeys” in the sense of not seeing what they ought to as the really important bit. It’s the failure of imagination to realise that we all have different skills and figuring out how to bring them together effectively is the key…

  • William Kilbride said:

    This ^