It wasn’t supposed to be this way…

5 December 2009

I’ve avoided writing about the Climate Research Unit emails leak for a number of reasons. Firstly it is clearly a sensitive issue with personal ramifications for some and for many others just a very highly charged issue. Probably more importantly I simply haven’t had the time or energy to look into the documents myself. I haven’t, as it were, examined the raw data for myself, only other people’s interpretations. So I’ll try to stick to a very general issue here.

There appear to be broadly two responses from the research community to this saga. One is to close ranks and to a certain extent say “nothing was done wrong here”. This is, at some level, the tack taken by the Nature Editorial of 3 December, which was headed up with “Stolen e-mails have revealed no scientific conspiracy…”. The other response is that the scandal has exposed the shambolic way that we deal with collecting, archiving, and making available both data and analysis in science, as well as the endemic issues around the hoarding of data by those who have collected it.

At one level I belong strongly in the latter camp, but I also appreciate the dismay that must be felt by those who have looked at, and understand what the emails actually contain, and their complete inability to communicate this into the howling winds of what seems to a large extent a media beatup. I have long felt that the research community would one day be shocked by the public response when, for whatever reason, the media decided to make a story about the appalling data sharing practices of publicly funded academic researchers like myself. If I’d thought about it more deeply I should have realised that this would most likely be around climate data.

Today the Times reports on its front page that the UK Meteorological Office is to review 160 years of climate data and has asked a range of contributing organisations to allow it to make data public. The details of this are hazy but if the UK Met Office is really going to make the data public this is a massive shift. I might be expected to be happy about this but I’m actually profoundly depressed. While it might in the longer term lead to more strongly worded and enforced policies it will also lead to data sharing being forever associated with “making the public happy”. My hope has always been that the sharing of the research record would come about because people started to see the benefits, because they could see the possibilities in partnership with the wider community, and because it made their research more effective. Not because the tabloids told us we should.

Collecting the best climate data and doing the best possible analysis on it is not optional. If we get this wrong and don’t act effectively then, with some probability that is significantly above zero, our world ends. The opportunity is there to make this the biggest, most important, and most effective research project ever undertaken. To actively involve the wider community in measurement. To get an army of open source coders to re-write, audit, and re-factor the analysis software. Even to involve the (positively engaged) sceptics, to use their interest and ability to look for holes and issues. Whether politicians will act on data is not the issue that the research community can or should address; what we need to be clear on is that we provide the best data, the best analysis, and an honest view of the uncertainties, along with the ability for anyone to critically analyse the basis for those conclusions.

There is a clear and obvious problem with this path. One of the very few credible objections to open research that I have come across is that by making material available you open your inbox to a vast community of people who will just waste your time. The people who can’t be bothered to read the background literature or learn to use the tools; the ones who just want the right answer. This is nowhere more the case than it is with climate research and it forms the basis for the most reasonable explanation of why the CRU (and every other repository of climate data as far as I am aware) have not made more data or analysis software directly available.

There are no simple answers here, and my concern is that in a kneejerk response to suddenly make things available no-one will think to put in place the social and technical infrastructure that we need to support positive engagement, and to protect active researchers, both professional and amateur from time-wasters. Interestingly I think this infrastructure might look very similar to that which we need to build to effectively share the research we do, and effectively discover the relevant work of others. Infrastructure is never sexy, particularly in the middle of a crisis. But there is one thing in the practice of research that we forget at our peril. Any given researcher needs to earn the right to be taken seriously. No-one ever earns the right to shut people up. Picking out the objection that happens to be important is something we have to at least attempt to build into our systems.


  • Some areas of research do open data (I’m thinking protein data bank and DNA sequencing) and a lot of tools are open. However, I know that for my prior academic research open data would require a large curation effort on my part, and the only people that would practically benefit would be my research competitors. I’d argue that useful open data requires a very large body of something homogeneous. I really can’t imagine someone coming along and helpfully tidying up my analysis code.

    I see the data demands of the climate change contrarians as the equivalent of the demand for intermediate fossil forms by creationists. There was always going to be more they could demand.

    I believe that the IPCC are doing work on central depositories for GCM output.

  • Hi Ian, and thanks for the retweet as well. My argument would be that we need to actually collect and capture our research process in a way that removes the need for post-hoc curation, and ideally makes sure that the curation we have to do delivers benefits back for us. But the simple answer to the increasing demands of the denialists is just make everything public – do everything in the open.

    But I don’t buy the competitors argument. If we’re configuring the communication of our research for our own benefit career-wise rather than maximising the effectiveness of the public spend on research then we don’t deserve the public money.

    That’s a big social change for the research community, but my feeling is that if you dig into the public response (as opposed to the media beatup) around this and other similar scandals you find real, serious, and justified anger about what we do with publicly funded data. Keep this up and research funding will become an easy political target. That way lies the dark ages.

    Now we’re not going to overnight turn researchers into hippies who do everything out of the goodness of their hearts, we’re human too, so we need effective markets and funding mechanisms that drive behaviour towards the most effective use of public money. Full disclosure – I write grants to get money to build the tools that might lead to these kinds of things :-)

  • I think dissociated from climategate this is a rather more interesting question, are you targeting a particular field?

    As an industrial scientist there’s quite a lot of interest in the lab at which I work in systems of work that allow us to collect meta-data enabling data re-use. The big question is at what point does this mechanism kick in? A lot of science is one-shot stuff: you try something out in the lab and it works or it doesn’t work; your question is answered either way.

    There’s also the question of uptake: scientists seem surprisingly bad at picking up new technology. For example where I work revision control software and wikis have scarcely made an impact and my experience in academia from 5 years ago was that they had scarcely made great inroads there.

  • Oh absolutely, the bigger general question you are asking is exactly what interests me. I would turn your question on its head – why don’t we build capture systems that do a good job of capturing metadata with as little effort on the part of the user as possible. Then when there is a story to tell you have the pieces and you can add more structure and more metadata to enable you to communicate it.

    My belief is that people mix up these two phases and then get despondent when they “can’t” persuade people to collect enough metadata to make publication an automatic process. What is better is capturing all that you can, and then providing tools that help the author tell a story, capturing more of the context and ideas along the way as they do that.

    On adoption and uptake, absolutely agree. Virtually no inroads – most people use paper notebooks for heaven’s sake! Even worse, they’re better than most electronic ones. I think there is a combination of poor development practice and poor UI design that contributes here. But that is another adoption issue…so yes it is a big problem. The answer in my view is to build tools that solve problems that people recognise they have rather than beating them over the head with a metadata-shaped shovel until they fit into your data model…

  • S

    I work in Genomics and a lot of the data we gather is fairly automatically deposited into open databases. I cannot understand why fairly raw climate data is not similarly deposited. This should not take too much effort from researchers; the fact that it may be privileged is no excuse when it is of such import and largely publicly funded.

    On the other hand there is a great deal of secondary genomic data – data that involves experimentation – that is badly recorded (e.g. microarray, ChIP, SNP). This is largely because the available databases require complex and time-consuming standards and forms. People would rather avoid the soul-destroying bureaucracy of populating the tables of someone else’s database.

    However, there seems to be a much simpler solution than an ‘open source project’: if there are problems with the analysis software, isn’t that because journals have allowed papers to be published based on incomplete reasoning? The software should always have been available for download and checking with any paper, as it is a major component of the analysis.

    If the authors then decide to enlist outside help too, then fine.

  • Hi S, in response to your comments. Yes, yes, yes, and yes. We’re working on getting funding to try and make some aspects of this easier at the moment – less forms, more metadata is my mantra. But I have to say I’m getting tired of talking about it and want to get on and build more stuff.

    As for software and poor peer review I can only say that I agree with you. And watch this space – hope to be making an exciting announcement sometime quite soon.

  • “One of the very few credible objections to open research that I have come across is that by making material available you open your inbox to a vast community of people who will just waste your time.”

    That strikes me as a straw man argument. The data could be (would have to be for climate data) in a public repository, adequately documented. This is no different than the data at the NCBI or the Bureau of Economic Research. Furthermore, any data that is accrued with public monies should be put in the public domain. Access and use are very separate issues from expertise and understanding.

  • It’s a bit of a straw man but it comes down to the fact that data isn’t just data. DNA sequence data is a very special case because the very raw information remains useful (i.e. BLAST works). Almost all other data has lots of ancillary stuff, and making it (truly) available often means making a commitment to providing other bits and pieces to make it useful to others.

    Now of course, if _everything_ is out in the open in a proper repository this isn’t so hard – but any partial measures become hard to maintain effectively. Partly this is about building better tools and repositories, partly it is about taking a more open approach.

    But my personal view is that putting data out there in a way which is useful also involves some personal commitment to helping people use it – and in the case of climate data can involve receiving hundreds of emails a day from people objecting to you giving them MATLAB scripts because they don’t know what they are.

    Let me put it this way – it’s not an argument that I like, but it’s a point of view that I can understand.

  • I think this piece says it well:

    http://www.edge.org/discourse/digital_maoism.html

    Digital Maoism by Jaron Lanier

  • The argument that providing open data would waste the time of researchers, because most people wouldn’t know how to use the data and would require support, is analogous to newbies showing up at free software repositories and wasting the time of hackers by demanding support. This was dealt with by ignoring clueless questions on mailing lists and by building a community infrastructure that enabled contributions at every level. Underlying all that was a desire amongst the free software community to increase mindshare and the use of its products.

    Most forms of scientific enterprise are closed clubs with tightly controlled entry with no such desire. This might be acceptable with endeavours that are not of general interest, but is hardly desirable with research of overarching interest such as climatology.

    It is said that “science is not a democracy”; true, the skills to participate fully are not shared by everyone. But if knowledge is not made public (and every intermediate step along the path that led to its existence) then humanity departs towards some sort of elitist technofascism, which can only end badly.

  • Julius, you won’t get any argument from me on that. But then I try to get funding to build those kinds of community infrastructure. There are some reasons why the analogy to open source doesn’t hold, particularly the cost of entry and the lack of modularity, and John Wilbanks has written a bit on that recently.

    But what we can say is that if we could lower the costs of entry, particularly where they don’t need to be high, and build that infrastructure we could not only do a lot more science but a lot better science. And nowhere is that more important than in climate science.
