Avoid the pain and embarrassment – make all the raw data available

A story of two major retractions from a well known research group has been getting a lot of play over the last few days with a News Feature (1) and Editorial (2) in the 15 May edition of Nature. The story turns on the claim that Homme Hellinga’s group was able to convert the E. coli ribose binding protein into a triose phosphate isomerase (TIM) using a computational design strategy. Two papers on the work appeared, one in Science (3) and one in J Mol Biol (4). However another group, having obtained plasmids for the designed enzymes, could not reproduce the claimed activity. After many months of work the group established that the supposed activity appeared to be that of the bacteria’s native TIM and not that of the designed enzyme. The papers were retracted and Hellinga went on to accuse the graduate student who did the work of fabricating the results, a charge of which she was completely cleared.

Much of the heat the story is generating is about the characters involved and possible misconduct of various players, but that’s not what I want to cover here. My concern is about how much time, effort, and tears could have been saved if all the relevant raw data had been made available in the first place. Demonstrating a new enzymatic activity is very difficult work. It is absolutely critical to rigorously exclude the possibility of any contaminating activity and in practice this is virtually impossible to guarantee. Therefore a negative control experiment is very important. It appears that this control experiment was carried out, but possibly only once, against a background of significant variability in the results. All of this led to another group wasting on the order of twelve months trying to replicate these results. Well, not wasting, but correcting the record, arguably a very important activity, but one for which they will get little credit in any meaningful sense (an issue for another post and mentioned by Noam Harel in a comment at the News Feature online).

So what might have happened if the original raw data were available? Would it have prevented the publication of the papers in the first place? It’s very hard to tell. The referees were apparently convinced by the quality of the data. But if this was ‘typical data’ (using the special scientific meaning of typical, viz. ‘the best we’ve got’) and the referees had seen the raw data with greater variability then maybe they would have wanted to see more or better controls; perhaps not. Certainly if the raw data were available the second group would have realised much sooner that something was wrong.

And this is a story we see over and over again. The selective publication of results without reference to the full set of data; a slight shortcut taken or potential issues with the data somewhere that is not revealed to referees or to the readers of the paper; other groups spending months or years attempting to replicate results or simply use a method described by another group. And in the meantime graduate students and postdocs get burnt on the pyre of scientific ‘progress’ discovering that something isn’t reproducible.

The Nature editorial is subtitled ‘Retracted papers require a thorough explanation of what went wrong in the experiments’. In my view this goes nowhere near far enough. There is no longer any excuse for not providing all the raw and processed data as part of the supplementary information for published papers. Even in the form of scanned lab book pages this could have made a big difference in this case, immediately indicating the degree of variability and the purity of the proteins. Many may say that this is too much effort, that the data cannot be found. But if this is the case then serious questions need to be asked about the publication of the work. Publishers also need to play a role by providing more flexible and better indexed facilities for supplementary information, and making sure they are indexed by search engines.

Some of us go much further than this, and believe that making the raw data immediately available is a better way to do science. Certainly in this case it might have reduced the pressure to rush to publish, might have forced a more open and more thorough scrutiny of the underlying data. This kind of radical openness is not for everyone perhaps but it should be less prone to gaffes of the sort described here. I know I can have more faith in the work of my group where I can put my fingers on the raw data and check through the detail. We are still going through the process of implementing this move to complete (or as complete as we can be) openness and it’s not easy. But it helps.

Science has moved on from the days where the paper could only contain what would fit on the printed pages. It has moved on from the days when an informal circle of contacts would tell you which group’s work was repeatable and which was not. The pressures are high and the potential for career disaster probably higher. In this world the reliability and completeness of the scientific record is crucial. Yes there are technical difficulties in making it all available. Yes it takes effort, and yes it will involve more work, and possibly fewer papers. But the only thing that ultimately can really be relied on is the raw data (putting aside deliberate fraud). If the raw data doesn’t form a central part of the scientific record then we perhaps need to start asking whether the usefulness of that record in its current form is starting to run out.

  1. Wenner, M. Nature 453, 271–275 (2008).
  2. Editorial. Nature 453, 258 (2008).
  3. Dwyer, M. A., Looger, L. L. & Hellinga, H. W. Science 304, 1967–1971 (2004).
  4. Allert, M., Dwyer, M. A. & Hellinga, H. W. J. Mol. Biol. 366, 945–953 (2007).

A new type of chemistry journal: Nature Chemistry requests input

As has been noted in a few places, Neil Withers, one of the editors of the soon-to-be newest Nature journal, Nature Chemistry, put out a request last week for input on a range of issues to do with how people use journals, formats, and technical widgets. Egon Willighagen, Rich Apodaca, and Oscar the Journal Munching Robot (masquerading as Peter Murray-Rust, or is that the other way around?) have already posted responses. Here I want to add my own thoughts and possibly amplify some of the points others have made.

More on the science exchange – or building and capitalising a data commons

Banknotes from all around the world donated by visitors to the British Museum, London (image from Wikipedia via Zemanta)

Following on from the discussion a few weeks back kicked off by Shirley at One Big Lab and continued here I’ve been thinking about how to actually turn what was a throwaway comment into reality:

What is being generated here is new science, and science isn’t paid for per se. The resources that generate science are supported by governments, charities, and industry but the actual production of science is not supported. The truly radical approach to this would be to turn the system on its head. Don’t fund the universities to do science, fund the journals to buy science; then the system would reward increased efficiency.

There is a problem at the core of this. For someone to pay for access to the results, there has to be a monetary benefit to them. This may be through increased efficiency of their research funding but that’s a rather vague benefit. For a serious charitable or commercial funder there has to be the potential to either make money, or at least see that the enterprise could become self sufficient. But surely this means monetizing the data somehow? Which would require restrictive licences, which is not, in the end, what we’re about.

The other story of the week has been the, in the end very useful, kerfuffle caused by ChemSpider moving to a CC-BY-SA licence, and the confusion that has been revealed regarding data, licensing, and the public domain. John Wilbanks, whose comments on the ChemSpider licence sparked the discussion, has written two posts [1, 2] which I found illuminating and which have made things much clearer for me. His point is that data naturally belongs in the public domain and that the public domain and the freedom of the data itself need to be protected from the erosion, both legal and conceptual, that could be caused by our obsession with licences. What does this mean for making an effective data commons, and the Science Exchange that could arise from it, financially viable?

Attribution for all! Mechanisms for citation are the key to changing the academic credit culture

A reviewer at the National Institutes of Health evaluates a grant proposal (image via Wikipedia)

Once again a range of conversations in different places have collided in my feed reader. Over on Nature Networks, Martin Fenner posted on Researcher ID which led to a discussion about attribution and in particular Martin’s comment that there was a need to be able to link to comments and the necessity of timestamps. Then DrugMonkey posted a thoughtful blog about the issue of funding body staff introducing ideas from unsuccessful grant proposals they have handled to projects which they have a responsibility in guiding.

The serious amateur and the cult of ignorance

The opening of the six-part fugue from The Musical Offering, in Bach’s hand

Amongst the other things that I do I am a fairly serious amateur musician. I sing regularly and irregularly in choirs, have occasionally done some solo vocal work, conduct a bit, and have in the past written fairly substantial pieces of music for orchestra and choir. When I started university I made a choice between doing music or doing science. Like a lot of other scientists I suspect I chose to go down the science route because it is much easier to be an amateur musician than an amateur scientist. I don’t regret the decisions I made then but like anyone I do think back to what might have been.

One of the criticisms of open practice in science and Open Notebook Science in particular is that we open up ourselves to harassment by ‘nutjobs’, ‘ignorant plebs’, and assorted other people who don’t appreciate a) how clever we are or b) how busy we are. There are two sides to this argument with merits on both. It is possible to get bogged down continually dealing with people who genuinely wish to explain to you how universal crystal harmonics explain the periodicity of the elements, or how their understanding of the interstitial spiritual lamina demonstrates the inadvisability of human cloning. There is no getting over the fact that there are nutters out there. On the other hand we do little to encourage the amateur scientist beyond allowing them occasional access to our hallowed existence through TV, NewScientist, and Wired. I wondered whether an exploration of the parallels between amateur music and amateur science might be interesting. I should note that I am using the term professional in rather a loose way here, not to mean someone who gets paid to do something, but someone who can devote the majority of their time to a specific pursuit, be that music, science, or anything else.

Protocols for Open Science

Interior detail, Stata Center, MIT, just outside the Science Commons offices.

One of the strong messages that came back from the workshop we held at the BioSysBio meeting was that protocols and standards of behaviour were something that people would appreciate having available. There are many potential issues that are raised by the idea of a ‘charter’ or ‘protocol’ for open science but these are definitely things that are worth talking about. I thought I would throw a few ideas out and see where they go. There are some potentially serious contradictions to be worked through.

The economic case for Open Science

I am thinking about how to present the case for Open Science, Open Notebook Science, and Open Data at Science in the 21st Century, the meeting being organised by Sabine Hossenfelder and Michael Nielsen at the Perimeter Institute for Theoretical Physics. I’ve put up a draft abstract and as you might guess from this I wanted to make an economic case that the waste of resources, both human and monetary, is not something that is sustainable for the future. Here I want to rehearse that argument a bit further as well as explore the business case that could be presented to Google/Gates Foundation as a package that would include the development of the Science Exchange ideas that I blogged about last week.

Somewhat more complete report on BioSysBio workshop

The Queen's Tower, Imperial College (image via Wikipedia)

This has taken me longer than expected to write up. Julius Lucks, John Cumbers, and I led a workshop on Open Science on Monday 21st at the BioSysBio meeting at Imperial College London. I had hoped to record screencast, audio, and possibly video as well but in the end the laptop I am working off couldn’t cope with both running the projector and Camtasia at the same time with reasonable response rates (it’s a long story but in theory I get my ‘proper’ laptop back tomorrow so hopefully better luck next time). We had somewhere between 25 and 35 people throughout most of the workshop and the feedback was all pretty positive. What I found particularly exciting was that, although the usual issues of scooping, attribution, and the general dishonesty of the scientific community were raised, they were only in passing, with a lot more of the discussion focussing on practical issues.

BioSysBio conference and workshop

Tomorrow I and a few of the usual suspects, who I have finally met in person, are giving a workshop on ‘Open Science’ as part of BioSysBio 2008. If anyone else who I haven’t met yet is about at the meeting then feel free to introduce yourself, even if you can’t make it to the workshop. The workshop abstract is up on OpenWetWare if you want to have a look. I hope to be able to record screencast and video of the session to make it available to all of you who can’t make it. If you want to make comments in advance or raise any issues then drop a comment here or in the usual places.