The Signal and the Noise: The Problem of Reproducibility
Once again, reproducibility is in the news. Most recently we hear that irreproducibility is irreproducible and thus everything is actually fine. The most recent round was kicked off by a criticism of the Reproducibility Project followed by claim and counter claim on whether one analysis makes more sense than the other. I’m not going to comment on that but I want to tease apart what the disagreement is about, because it shows that the problem with reproducibility goes much deeper than whether or not a particular experiment replicates.
At the centre of the disagreement are two separate issues. The most easy to understand is the claim that the Reproducibility Project did not faithfully replicate the original studies. Of course this raises the question of what “replicate” means. Is a replication seeking to precise re-run the same test or to test the claim more generally? Being fuzzy about which is meant lies at the bottom of many disagreements about whether as experiment is “replicated”.
The second issue is the one that has generated the most interesting commentary, but which is the hardest for a non-expert to tease apart. At its core is a disagreement over how “replication” should be statistically defined. That is, what is the mathematical analysis to be applied to an original test, and its repeat, to provide a Yes/No answer to the question “did it replicate”. These issues were actually raised from the beginning. It is far from obvious what the right analysis would be, or indeed whether the idea of “a right answer” actually makes sense. That’s because, again, there are serious differences of definition in what is meant by replication, and these questions cut to the core of how we use statistics and mathematics to analyse the world in which our experiments are run.
Problems of language I: Levels of replication and re-test
When there is a discussion of terminology it often focusses around this issue. When we speak of “replicating” an experiment do we mean carrying it out exactly the same, at its most extreme simply re-running a computational process in “exactly the same” environment, in the laboratory perhaps using the same strains, primers, chemicals or analytical techniques? Or do we mean doing “the same” experiment but without quite the same level of detail, testing the precise claim but not the precise methodology, perhaps using a different measurement technique, differing antibodies, or an alternative statistical test or computational algorithm. Or perhaps we mean testing the general claim, not that this test in this mouse strain delivers this result, but that chemical X is a regulator of gene Y in general.
Often we make a distinction between “replicating” and “reproducing” a result which corresponds to the distinction between the first and second level. We might need a third term, perhaps “generalising” to describe the third. But the ground between these levels is a quagmire and depending on your concerns you might categorise the same re-test as a replication or a reproduction. Indeed this is part of the disagreement in the current case. Is it reasonable to call a re-test of flag priming in a different culture? One group might say that are testing the general claim made for flag priming but is it the same test if the group is different, even if the test is identical? What is test protocol and what is subject. Would that mean we need to run social psychology replications on the same subjects?
Problems of language II: Reproducible experiments should not (always) reproduce
The question of what a direct replication in social pyschology would require is problematic. It cuts to the heart of the claims that social pyschology makes, that it detects behaviours and modes of thinking that are general. But how general and across what boundaries? Those questions are ones that social pyschologists probably have intuitive answers for (as do stem cell biologists, chemists and economists when faced with similar issues) but not ones that are easy, or perhaps even possible, to explain across disciplinary boundaries.
Layered on top of this are differences of opinion, suspicions about probity and different experiences within each of these three categories. Do some experiments require a “craft” or “green fingers” to get right? How close should a result be to be regarded as “the same”? What does a well designed experiment look like? What is the role of “elegance”? Disagreements about what matters, and ultimately rather different ideas about what the interaction between “reality” and “experiment” looks like also contribute. In the context of strong pseudo- and anti-science agendas having the complex discussion about how we judge which contradictory results to take seriously seems to be challenging. These problems are hard because tackling them requires dealing with the social aspects of science in a way that many scientists are uncomfortable with.
Related to this is the point raised above, that its not clear what we mean by replication. Given that this often relates to claims that are supported by p-values, we should first note that these are often mis-used, or at very least misrepresented. Certainly we need to get away from “p = 0.05 means its 95% reliable”. The current controversy rests in part on the question of the distinction between the replication experiment providing a result in the 95% confidence interval of the initially reported effect size, or that initially reported effect size lying in the 95% confidence interval of the replication. Both arguments can lead to some fairly peculiar results. An early criticism of the initial Reproducibility Project paper suggested a Bayesian approach to testing reproducibility but that had its own problems.
At the core of this is a fundamental problem in science. Even if a claim is “true” we don’t expect to get the same results from the same experiment. Random factors, uncontrolled factors all can play a role. This is why we do experiments and repeat them and do statistical analysis. It’s a way of trying to ensure we don’t fool ourselves. But it increasingly seems that we’re facing a rather toxic mixture of the statistical methods not being up to the job, and the fact that many of us are not up to the job of using them properly. That fact that the vast majority of us researchers using those statistical tools aren’t qualified to figure out which is probably instructive.
Either way it shouldn’t be surprising that if there isn’t a clear definition of what it means for a repeated test “to replicate” a previous one that the way we talk about these processes can get confusing.
Problems in philosophy and language: Turtles all the way down
When people talk about the “reproducibility” of a claim they’re actually talking about at least four different kinds of things. The first is whether it is expected that additional tests will confirm a claim. That is, is the claim “true” in some sense. The second is whether a specific re-run of a specific trial will (or did) give “the same” result. The third is whether the design or a trial (or its analysis) is such that repeated trials would be expected to give the same result (whatever that is). That is, is the experiment properly designed so as to give “reliable” results. Finally we have the question of whether the description of an experiment is sufficiently good to allow it to be re-run.
That the relationship between “truth” and empirical approaches that use repeated testing to generating evidence to support claims is a challenging problem is hardly a new issue in the philosophy of science. The question of whether the universe is reliable is a question asked in a social context by humans, and the tests we run to try and understand that universe are designed by humans with an existing context. But equally those tests play out in an arena that appears to have some consistencies. Technologies, for good or ill, tend to work when built and tested, not when purely designed in the abstract. Maths seems useful. It really is turtles all the way down.
However, even without tackling the philosophical issues there are some avenues for solving problems of the language we use. “Reproducible” is often used as interchangeable with “true” or even “honest”. Sometimes it refers to the description of the experiment, sometimes to the ability of re-tests to get the same result. “Confidence” is often referred to as absolute, not relative, and discussed without clarity as to whether it refers to the claim itself or the predicted result of a repeated test. And all of these are conflated across the various levels of replication, reproduction and generalisation discussed above.
Problems of description vs problems of results
Discussion of whether an article is “replicable” can mean two quite different things. Is the description of the test sufficient to enable it be re-run (or more precisely, does it meet some normative standard of community expectation of sufficient description)? This is a totally different question to the one of what is expected (or happens) when such a test is actually re-run. Both in turn are quite different, although there is a relation, to the question of whether the test is well designed, either to generate consistently the same result, or to provide information on the claim.
Part of the solution may lie in a separation of concerns. We need much greater clarity in the distinction between “described well enough to re-do” and “seems to be consistent in repeated tests”. We need to maintain those distinctions for each of the different levels above: replication, reproduction and generalisation. All of these in turn are separate to the question of whether a claim is true, or the universe (or our senses) reliable, or the social context and power relations that guided the process of claim making and experiment. It is at the intersections between these different issues that things get interesting: does the design of an experiment may answer a general question, or is it much more specific? Is the failure to get the same result, or our expectation of how often we “should” get the same result a question of experimental design, what is true, or the way our experience guides us to miss, or focus on specific issues? In our search for reliable, and clearly communicated, results have we actually tested a different question to the one that really matters?
A big part of the problem is that by using sloppy language we make it unclear what we are talking about. If we’re going to make progress there will need to be a more precise, and more consistent way of talking about exactly what problem is under investigation at any point in time.
This is a working through of some conversations with a range of people, most notably Jennifer Lin. It is certainly not original – and doesn’t contain many answers – but is an effort to start organising thoughts around how to talk about the problems around reproducibility, which is after all the interesting problem. Any missing links and credits entirely my own fault and happy to make corrections.