What measurement does to us…

A thermometer showing −17°C. (Photo credit: Wikipedia)

Over the past week this tweet has been doing the rounds. I’m not sure where it comes from or precisely what its original context was, but it appeared in my feed from folks in various student analytics and big data crowds. The intended message seemed to be “measurement looks complicated until you pin it down”.

But what I took from this was something a bit different. Once upon a time the idea of temperature was a complex thing. It was subjective: people could reasonably disagree on whether today was hotter or colder than yesterday. Note the differences between types of “cold” and “hot”: damp, dank, frosty, scalding, humid, prickly. This looks quaint to us today because we can look at a digital readout and get a number. But what really happened is that our internal conception of temperature changed: a much richer and more nuanced concept was collapsed onto a single linear scale. To re-create that richness weather forecasters invent things like “wind chill” and “feels like” to capture different nuances, but we have in fact lost something: the idea that different people respond differently to the same conditions.

https://twitter.com/SilverVVulpes/status/850439061273804801

Last year I did some work looking at the theoretical underpinnings of the meaning we give to referencing and citation indicators in the academy. What I found was something rather similar. Up until the 1970s the idea of what made for “quality” or “excellence” in research was much more contextual. The development of much better data sources for citation counting (basically, more reliable thermometers) led to intense debates about whether the data had any connection to the qualities of research at all, let alone whether anything could be based on those numbers. The conception of research “quality” was much richer, including the idea that different people might have different responses.

In the 1970s and 80s something peculiar happened. The questioning of whether citations can represent the qualities of research disappeared, replaced by the assumption that they can. A rear-guard action continued to question this, but on the grounds that people are doing many different things when they reference, not that counting such things is a fundamentally questionable activity in and of itself. Suddenly citations became a “gold standard”, the linear scale against which everything was measured, and our ideas about the qualities of research became consequently impoverished.

At the same time it is hard to deny that a simple linear scale of defined temperature has enabled massive advances. We can track global weather against agreed standards, including how it is changing, and quantify the effects of climate change. We can calibrate instruments against each other and control conditions in ways that allow everything from the safekeeping of drugs and vaccines to ensuring that our food is cooked to precisely the right degree. On top of that, of course, we have to acknowledge that temperature isn’t as simple a concept as it’s made out to be either. Definitions always break down somewhere.

https://twitter.com/StephenSerjeant/status/851016277992894464

It seems to me important to note that these changes in meaning can affect the way we think and talk about things. Quantitative indicators can help us to share findings and analysis, to argue more effectively, and, most importantly, to share claims and evidence in a way that is much more reliably useful. At the same time, if we aren’t careful, those indicators can change the very things that we think are important. They can change the underlying concept of what we are talking about.

Ludwik Fleck, in The Genesis and Development of a Scientific Fact, explains this very effectively through the history of the concept of “syphilis”. He explains how our modern conception (a disease with specific symptoms caused by infection with a specific transmissible agent) would be totally incomprehensible to those who thought of diseases in terms of how they were treated (in this case, classified as a disease treated with mercury). The concept itself being discussed changes when the words change.

None of this is of course news to people in Science and Technology Studies, history of science, or indeed much of the humanities. But for scientists it often seems to undermine our conception of what we’re doing. It doesn’t need to. But you need to be aware of the problem.

This ramble brought to you in part by a conversation with @dalcashdvinksy and @StephenSerjeant

Metrics and Money

Image via Wikipedia

David Crotty, over at the Scholarly Kitchen, has an interesting piece on metrics, arguing that many of them haven’t been thought through because they don’t give researchers any concrete motivation to care about them. He’s focused mainly on exchange mechanisms, means of persuading people that doing high-quality review is worth their while by giving them something in exchange, but the argument extends to all sorts of metrics. Why would you care about any given measure if achieving on it doesn’t translate into more resources, time, or glory?

You might expect me to disagree with a lot of this but for the most part I don’t. Any credible metric has to be real, it has to mean something. It has to matter. This is why connecting funders, technologists, data holders, and yes, even publishers, is at the core of the proposal I’m working with at the moment. We need funders to want to have access to data and to want to reward performance on those measures. If there’s money involved then researchers will follow.

Any time someone talks about a “system” using the language of currency there is a key question you have to ask: can this “value” be translated into real money? If it can’t, it is unlikely people will take it seriously. Currency has to be credible; it has to be taken seriously or it doesn’t work. How much is the cash in your pocket actually worth? Cash has to embody transferable value, and many of these schemes don’t provide anything more than basic barter.

But equally the measures of value, or of cost, have to be real. Confidence in the reality of community measures is crucial, and this is where I part company with David, because at the centre of his argument is what seems to me a massive hole.

“The Impact Factor, flawed though it may be, at least tries to measure something that directly affects career advancement–the quality and impact of one’s research results.  It’s relevant because it has direct meaning toward determining the two keys to building a scientific career, jobs and funding.”

The second half of this I agree with (but resent). But it depends absolutely on the first part being widely believed, and the first part simply isn’t true. The Thomson Reuters Journal Impact Factor does not try to measure the quality and impact of individual research results; TR themselves are shouting this from the rooftops at the moment. We know that in practice it is at best an extremely poor measure of individual performance. In economic terms, our dependence on the JIF is a bubble. And bubbles burst.

The reason people are working on metrics is that they figure replacing one rotten measure at the heart of the system with ones that are self-evidently technically superior should be easy. Of course this isn’t true. Changing culture, particularly reward culture, is very difficult. You have to tackle the self-reinforcement that these measures thrive on, and you need to work very carefully to allow the bubble to deflate in a controlled fashion.

There is one further point where I disagree with David. He asks a rhetorical question:

“Should a university deny tenure to a researcher who is a poor peer reviewer, even if he brings in millions of dollars in grants each year and does groundbreaking research?  Should the NIH offer to fund poorly designed research proposals simply because the applicant is well-liked and does a good job interpreting the work of others?”

It’s interesting that David even asks these questions, because the answers seem obvious, self-evident even. The answer, at least to the underlying question, is in both cases yes. The ultimate funders should fund people who excel at review even if they are poor at other parts of the enterprise. The work of review must be valued or it simply won’t be done. I have heard heads of university departments tell researchers to do less reviewing and write more grants, and I can tell you that in the current UK research funding climate review is well off the top of my own priority list. If there is no support for reviewing then in the current economic climate we will see less of it done; if there is no space in our community for people who excel at reviewing, then who will teach it? Or do we continue the pernicious myth that the successful scientist is a master of all of their trades? Aside from anything else, basic economics tells us that specialisation leads to efficiency gains, even when one of the specialists is the superior practitioner in both areas. Shouldn’t we be seeking those efficiency gains?
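That last point is just Ricardian comparative advantage, and a toy calculation makes it concrete. The productivity numbers below are invented purely for illustration: researcher A is better than researcher B at both grant-writing and reviewing, yet the community still gains when B specialises in reviewing, because B gives up fewer grants per review.

```python
def output(grants_per_month, reviews_per_month, grant_fraction):
    """Monthly output for a researcher splitting time between
    grant-writing and reviewing."""
    return (grants_per_month * grant_fraction,
            reviews_per_month * (1 - grant_fraction))

def total(*outputs):
    """Sum (grants, reviews) tuples across researchers."""
    return (sum(g for g, _ in outputs), sum(r for _, r in outputs))

# A dominates B at both tasks: 4 grants or 8 reviews per month vs 2 or 6.
A = (4, 8)
B = (2, 6)

# No specialisation: both split their time evenly between the two tasks.
even = total(output(*A, 0.5), output(*B, 0.5))   # 3 grants, 7 reviews

# Specialisation by comparative advantage: B reviews full-time (each
# review costs B only 1/3 of a grant, vs 1/2 for A), A mostly writes.
spec = total(output(*A, 0.75), output(*B, 0.0))  # 3 grants, 8 reviews
```

Same total effort, same number of grants, one extra review: the gain exists even though A is the stronger performer at both tasks.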

Because the real question is not whether reviewers should be funded, by someone in some form, but what the relative size of that investment should be in a balanced portfolio covering all the contributions needed across the scientific enterprise. The question is how we balance these activities, how we tension them. The answer is that we need a market, and to have a functioning market we need a functioning currency. That currency may just be money, but reputation can be converted, albeit not directly, into funds. Successfully hacking research reputation would make a big difference to tensioning more effectively between different important and valid scientific research roles, and that’s why people are so interested in trying to do it.


Metrics of use: How to align researcher incentives with outcomes

slices of carrot
Image via Wikipedia

It has become reflexive in the open communities to talk about a need for “cultural change”. The obvious next step is to find strong and widely respected advocates of change, to evangelise to young researchers, and to hope for change to follow. Inevitably this process is slow, perhaps so slow as to be ineffective. So beyond grassroots evangelism we move towards policy change as a top-down mechanism for driving improved behaviour. If funders demand that data be open, and that papers be accessible to the wider community, as a condition of funding, then this will happen. The NIH mandate and the work of the Wellcome Trust on Open Access show that this can work, and indeed that mandates in some form are necessary to raise compliance to acceptable levels.

But policy is a blunt instrument, and researchers, being who they are, don’t like to be pushed around. Passive-aggressive responses from researchers are relatively ineffectual in the peer-reviewed article space: a paper is a paper, if it’s under the right licence then things will probably be ok, and a specific licence is easy to mandate. Data, though, is a different fish. It is very easy to comply with a data availability mandate yet provide the data in a form which is totally useless; indeed it is rather hard work to provide it in a form that is useful. Data, software, reagents, and materials are incredibly diverse, and it is difficult to write policy that is both effective and specific enough while remaining general enough to be useful. So beyond the policy-mandate stick, which will only ever secure a minimum level of compliance, how do we motivate researchers to put the effort into making their outputs available in a useful form? How do we encourage them to want to do the right thing? After all, what we want to enable is re-use.

We need more sophisticated motivators than blunt policy instruments, so we arrive at metrics: measuring the outputs of researchers. There has been a wonderful animation illustrating a Daniel Pink talk doing the rounds in the past week. It is well worth a look and important stuff, but I think a naive application of it to researchers’ motivations would miss two important aspects. Firstly, money is never “off the table” in research; we are always to some extent limited by resources. Secondly, the intrinsic motivators, the internal metrics that matter to researchers, are tightly tied to the metrics valued by their communities, and those metrics in turn are tightly tied to resource allocation. Most researchers value their papers, the places they are published, and the citations received as measures of their value, because that’s what their community values. The system is highly leveraged towards rapid change, if and only if a research community starts to value a different set of metrics.

What might the metrics we would like to see look like? I would suggest that they should focus on what we want to see happen. We want return on the public investment, we want value for money, but above all we want to maximise the opportunity for research outputs to be used and to be useful. We want to optimise the usability and re-usability of research outputs and we want to encourage researchers to do that optimisation. Thus if our metrics are metrics of use we can drive behaviour in the right direction.

If we optimise for re-use then we automatically value access, and we automatically value the right licensing arrangements (or lack thereof). If we value and measure use then we optimise for the release of data in useful forms and for the release of open source research software. If we optimise for re-use, for discoverability, and for value added, then we can tension the loss of access inherent in publishing in Nature or Science against the enhanced discoverability and editorial contribution, and put a real value on those aspects. We would stop arguing about whether tenure committees should value blogging and start asking how much those blogs were used by others to provide outreach, education, and research outcomes.

For this to work there would need to be mechanisms that automatically credit the use of a much wider range of outputs. We would need to cite software and data, would need to acknowledge the providers of metadata that enabled our search terms to find the right thing, and we would need to aggregate this information in a credible and transparent way. This is technically challenging, and technically interesting, but do-able. Many of the pieces are in place, and many of the community norms around giving credit and appropriate citation are in place, we’re just not too sure how to do it in many cases.

Equally this is a step back towards what the mother of all metrics, the Impact Factor, was originally about. The IF was intended as a way of measuring the use of journals by counting citations, as a means of helping librarians choose which journals to subscribe to. Article-level metrics are in many ways the obvious return to this where we want to measure the outputs of specific researchers, and the h-index, for all its weaknesses, is a measure of re-use of outputs through formal citations. Influence and impact are already important motivators at the policy level. Measuring use is actually a quite natural way to proceed. If we can get it right it might also provide the motivation we want: aligning researcher interests with the wider community and optimising access to research for both researchers and the public.
