Avoid the pain and embarrassment – make all the raw data available


A story of two major retractions from a well-known research group has been getting a lot of play over the last few days, with a News Feature (2) and an Editorial (1) in the 15 May edition of Nature. The story turns on the claim that Homme Hellinga’s group was able to convert the E. coli ribose binding protein into a triose phosphate isomerase (TIM) using a computational design strategy. Two papers on the work appeared, one in Science (3) and one in J Mol Biol (4). However, another group, having obtained plasmids for the designed enzymes, could not reproduce the claimed activity. After many months of work that group established that the supposed activity appeared to be that of the bacterium’s native TIM rather than that of the designed enzyme. The papers were retracted, and Hellinga went on to accuse the graduate student who did the work of fabricating the results, a charge of which she was completely cleared.

Much of the heat the story is generating is about the characters involved and the possible misconduct of various players, but that’s not what I want to cover here. My concern is how much time, effort, and tears could have been saved if all the relevant raw data had been made available in the first place. Demonstrating a new enzymatic activity is very difficult work. It is absolutely critical to rigorously exclude the possibility of any contaminating activity, and in practice this is virtually impossible to guarantee. A negative control experiment is therefore very important. It appears that this control experiment was carried out, but possibly only once, against a background of significant variability in the results. All of this led to another group wasting on the order of twelve months trying to replicate the results. Well, not wasting, but correcting the record, arguably a very important activity, but one for which they will get little credit in any meaningful sense (an issue for another post, and one mentioned by Noam Harel in a comment on the News Feature online).

So what might have happened if the original raw data had been available? Would it have prevented the publication of the papers in the first place? It’s very hard to tell. The referees were apparently convinced by the quality of the data. But if this was ‘typical data’ (using the special scientific meaning of typical, viz. ‘the best we’ve got’) and the referees had seen the raw data with its greater variability, then maybe they would have wanted to see more or better controls; perhaps not. Certainly, if the raw data were available, the second group would have realised much sooner that something was wrong.

And this is a story we see over and over again: the selective publication of results without reference to the full set of data; a slight shortcut taken, or potential issues with the data, that are not revealed to referees or to the readers of the paper; other groups spending months or years attempting to replicate results or simply to use a method described by another group. And in the meantime graduate students and postdocs get burnt on the pyre of scientific ‘progress’, discovering that something isn’t reproducible.

The Nature editorial is subtitled ‘Retracted papers require a thorough explanation of what went wrong in the experiments’. In my view this goes nowhere near far enough. There is no longer any excuse for not providing all the raw and processed data as part of the supplementary information for published papers. Even in the form of scanned lab book pages this could have made a big difference in this case, immediately indicating the degree of variability and the purity of the proteins. Many may say that this is too much effort, or that the data cannot be found. But if that is the case then serious questions need to be asked about the publication of the work. Publishers also need to play a role by providing more flexible and better-indexed facilities for supplementary information, and by making sure that it is indexed by search engines.

Some of us go much further than this and believe that making the raw data immediately available is a better way to do science. Certainly in this case it might have reduced the pressure to rush to publish, and might have forced a more open and more thorough scrutiny of the underlying data. This kind of radical openness is not for everyone, perhaps, but it should be less prone to gaffes of the sort described here. I know I can have more faith in the work of my group when I can put my fingers on the raw data and check through the detail. We are still going through the process of implementing this move to complete (or as complete as we can manage) openness and it’s not easy. But it helps.

Science has moved on from the days when a paper could only contain what would fit on the printed pages. It has moved on from the days when an informal circle of contacts would tell you which group’s work was repeatable and which was not. The pressures are high and the potential for career disaster probably higher. In this world the reliability and completeness of the scientific record is crucial. Yes, there are technical difficulties in making it all available. Yes, it takes effort, and yes, it will involve more work, and possibly fewer papers. But the only thing that can ultimately be relied on is the raw data (putting aside deliberate fraud). If the raw data doesn’t form a central part of the scientific record, then perhaps we need to start asking whether the usefulness of that record in its current form is running out.

  1. Editorial. Nature 453, 258 (2008).
  2. Wenner, M. Nature 453, 271–275 (2008).
  3. Dwyer, M. A., Looger, L. L. & Hellinga, H. W. Science 304, 1967–1971 (2004).
  4. Allert, M., Dwyer, M. A. & Hellinga, H. W. J. Mol. Biol. 366, 945–953 (2007).

3 Replies to “Avoid the pain and embarrassment – make all the raw data available”

  1. Cameron – thanks for taking the time to write about this.

    One of the things I noticed is that after reading that article I don’t have a good sense of what really happened here. There are references to closed investigations and interview statements but we still can’t see the original lab notebooks to judge for ourselves. I don’t think the reporter had a choice in the matter but wouldn’t it be better all around if those documents were available?

    It is easy to focus on the scandalous aspects and I am sure many reports will focus on that. But you are right: the more important story is the role of openness in the current scientific process.

    And it is certainly more fair to start research knowing that your lab notebook is going to be public rather than having to disclose it at a later date.

    I don’t know if you are completely on the same page with me about this, but I think any efforts to force people to do Open Notebook Science will result in behaviors to circumvent the process – for example “double note-keeping”. Researchers have to do it because they buy into the idea that they will benefit from doing it. And this is a good example of a case where it would have benefited.

  2. It’s a good point, Jean-Claude. To me the whole thing hinges on this negative control and whether it was done thoroughly enough. Without the raw data it is very hard to see through what is largely a set of accusations of misconduct and/or behaviour that falls short of ‘best practice’ of various different types.

    I’m certainly not in favour of forcing people to make raw data available; I am rather more on the side of making the case for why this should be the social norm. However, I am conscious that progress on these issues is usually driven by journal submission requirements. These in turn need to be driven by community requests.

    In a conversation a few weeks ago Timo Hannay (Nature Publishing Group) made the point that journals, even those such as Nature, cannot take it upon themselves to dictate behaviour, and indeed to do so would potentially be commercially inadvisable. They are in a ‘market’ for papers, and if the costs get too high people will go elsewhere.

    On the other hand, there is a thin line between setting new ‘community norms’ that are supported by journals and outright compulsion. In most cases journals already require that authors be prepared to make the data available anyway.

    My argument would be that we should be encouraging and supporting the deposition of raw data with papers, and developing the tools to make it easy. Over time this will become the expected standard, giving us a better record, a better literature, and more data in the public domain.

    But I agree that simply setting a policy like this (however that might be done) would lead to the sort of double bookkeeping you suggest and would be a bad route to go down.

  3. Cameron,
    A main objection often heard about this is the additional burden on researchers of depositing raw data. I have to admit that I would not be enthusiastic about reformatting all of our raw data into a form that complies with a journal’s requirements for supplementary information. However, it costs me nothing to provide links to our live lab notebook pages (which will be edited as new information comes to bear; of course, wiki page versions keep track of when this new information gets added).

    This is really a key selling point of Open Notebook Science: researchers already have to keep a lab notebook so there is really no fundamentally new burden involved. In reading the difficulties reported by some people concerning the time burden of ONS, it is generally not about the core ONS concept but about related activities, like re-writing and re-formatting lab data for consumption by larger audiences. Yes, I do write summaries of our projects periodically on the UsefulChem blog but that is not an essential component of ONS – the lab notebook pages on the wiki are sufficient.

    As we move towards more automation it will become increasingly difficult to have people write detailed summaries of what is going on as it happens.
