What measurement does to us…

A thermometer showing −17°C. (Photo credit: Wikipedia)

Over the past week this tweet has been doing the rounds. I’m not sure where it comes from or precisely what its original context was, but it appeared in my feed from folks in various student analytics and big data crowds. The message it seemed intended to convey was “measurement looks complicated until you pin it down”.

But what I took from this was something a bit different. Once upon a time the idea of temperature was a complex thing. It was subjective; people could reasonably disagree on whether today was hotter or colder than yesterday. Note the differences between types of “cold” and “hot”: damp, dank, frosty, scalding, humid, prickly. This looks funny to us today because we can look at a digital readout and get a number. But what really happened is that our internal conception of what temperature is changed: a much richer and more nuanced concept was collapsed onto a single linear scale. To re-create that richness weather forecasters invent things like “wind chill” and “feels like” to capture different nuances, but we have in fact lost something: the idea that different people respond differently to the same conditions.

https://twitter.com/SilverVVulpes/status/850439061273804801

Last year I did some work where I looked at the theoretical underpinnings of the meaning we give to referencing and citation indicators in the academy. What I found was something rather similar. Up until the 70s the idea of what made “quality” or “excellence” in research was much more contextual. The development of much better data sources, basically more reliable thermometers, for citation counting led to intense debates about whether this data had any connection to the qualities of research at all, let alone whether anything could be based on those numbers. The conception of research “quality” was much richer, including the idea that different people might have different responses.

In the 1970s and 80s something peculiar happens. This questioning of whether citations can represent the qualities of research disappears, to be replaced by the assumption that they can. A rear-guard action continues to question this, but it is based on the idea that people are doing many different things when they reference, not on the idea that counting such things is a fundamentally questionable activity in and of itself. Suddenly citations became a “gold standard”, the linear scale against which everything was measured, and our ideas about the qualities of research became consequently impoverished.

At the same time it is hard to deny that a simple linear scale of defined temperature has created massive advances. We can track global weather against agreed standards, including how it is changing, and quantify the effects of climate change. We can calibrate instruments against each other and control conditions in ways that allow everything from the safekeeping of drugs and vaccines to ensuring that our food is cooked to precisely the right degree. Of course, on top of that, we have to acknowledge that temperature isn’t as simple a concept as it’s made out to be either. Definitions always break down somewhere.

https://twitter.com/StephenSerjeant/status/851016277992894464

It seems to me that it’s important to note that these changes in meaning can affect the way we think and talk about things. Quantitative indicators can help us to share findings and analysis, to argue more effectively, and, most importantly, to share claims and evidence in a way which is much more reliably useful. At the same time, if we aren’t careful, those indicators can change the very things that we think are important. They can change the underlying concept of what we are talking about.

Ludwik Fleck, in The Genesis and Development of a Scientific Fact, explains this very effectively in terms of the history of the concept of “syphilis”. He explains how our modern conception (a disease with specific symptoms caused by infection with a specific transmissible agent) would be totally incomprehensible to those who thought of diseases in terms of how they were to be treated (in this case, classification as a disease treated with mercury). The concept under discussion itself changes when the words change.

None of this is of course news to people in Science and Technology Studies, history of science, or indeed much of the humanities. But for scientists it often seems to undermine our conception of what we’re doing. It doesn’t need to. But we do need to be aware of the problem.

This ramble brought to you in part by a conversation with @dalcashdvinksy and @StephenSerjeant.

Draft White Paper – Researcher identifiers

National Science Foundation (NSF) logo. (Image via Wikipedia)

On April 26 I am attending a joint meeting of the NSF and EuroHORCs (European Heads of Research Councils) on “Changing the Conduct of Science in the Information Age”. I have been asked to submit a one-page white paper in advance of the meeting and have been struggling a bit with this. This is stage one, a draft document relating to researcher identifiers. I’m not happy with it, but I reckon that other people out there may well be able to help where I am struggling. I may write a second one on metrics, or at least a brief literature collection. Any and all comments welcome.

Summary

Citation lies at the core of research practice, both recognizing the contributions that others have made to the development of a specific piece of work and linking related knowledge together. The technology of the web makes it technically feasible to radically improve the precision, accuracy, and completeness of these links. Such improvements are crucial to the successful implementation of any system that purports to measure research outputs or quality.

Additionally, the web offers the promise of extended and distributed projects involving diverse parties, many of whom may not be professional researchers. Nonetheless such contributions deserve the same level of recognition as comparable contributions from professional researchers. Providing an open system in which people can contribute to research efforts and receive credit raises significant social issues of control and validation of identities.

Such open and federated systems are exposed to potential failure through lack of technical expertise on the part of users, particularly where a person loses an identity or creates a new one. This is in many ways already the case where we use institutional email addresses as proxies for researcher identity. Is D.C.Neylon@####.ac.uk (a no longer functioning email address) the same person as Cameron.Neylon@####.ac.uk? It is technically feasible to consider systems in which the actual identity token used is widely available and compatible with the wider consumer web, but centralised and trusted authorities provide validation services to confirm specific claims around identity. Such a broker or clearing house would play a similar role for identities to that which CrossRef plays for scholarly articles via the DOI.

General points

  • By adding the concept of a semantic-web-ready researcher identifier, i.e. an identifier that provides a URL endpoint uniquely representing a specific researcher, the existing technical capacity of the semantic web stack can be used to provide a linked data representation of contributions to existing published research objects that are accessible at URL endpoints (a sketch of what this might look like follows this list). Such a representation could readily be expanded beyond authorship to funder contributions.
  • In this view, crediting a researcher as a contributor to a specific object published on the web is a specific form of citation or linking.
  • The authoring tools to support the linking and publishing of research objects in this form do not currently exist in a widely usable form.
  • Semantic web technology provides an extensible means of adding and recognising diverse forms of contribution.
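
A minimal sketch of what such a linked-data representation could look like, in Python with the rdflib library. The researcher and paper URLs here are invented placeholders, not endpoints of any existing service:

```python
# Sketch only: expressing "Person X authored Paper Y" as linked data.
# The URLs below are hypothetical placeholders for a researcher
# identifier endpoint and a published research object.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import DCTERMS, FOAF, RDF

g = Graph()

researcher = URIRef("https://example.org/researcher/0001")  # hypothetical identifier
paper = URIRef("https://example.org/papers/42")             # hypothetical research object

# The researcher is a person with a name...
g.add((researcher, RDF.type, FOAF.Person))
g.add((researcher, FOAF.name, Literal("A. Researcher")))

# ...and is credited as a contributor to the paper. Other forms of
# contribution (funder, data producer) would use different predicates.
g.add((paper, DCTERMS.creator, researcher))

print(g.serialize(format="turtle"))
```

Because the claim is just a triple linking two URLs, the same machinery extends to any contribution type for which a predicate exists, which is the point of the extensibility argument above.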

Authorisation, validation, and control

  • Access to such identifiers must be under the control of the researcher and not limited to those with institutional affiliations. Any person must be able to obtain and control a unique researcher identifier that refers to them.
  • Authorisation and validation of claimed rights of access or connections to specific institutions can be technically handled separately from the provision of identifiers.

Technical and legal issues

  • OpenID and OAuth are developing internet standards that provide technical means to achieve distributed availability of identifiers and to separate issues of authentication from those of identification. They are currently the leading candidates for federated identity and authorisation solutions on the consumer web.
  • OpenID and OAuth do not currently provide the levels of security required in several jurisdictions for personal or sensitive information (e.g. the UK Data Protection Act). Such federated systems may fall foul of jurisdictions with strong generic privacy requirements, e.g. Canada.
  • To interoperate with the wider web, and to enable a wider notion of citation as a declaration of a piece of knowledge (“Person X authored Paper Y”), identifiers must resolve on the web, in the sense of being a clickable hyperlink that takes a human or machine reader to a page containing information representing that person (see the sketch after this list).
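
As a sketch of what “resolving on the web” could mean in practice, again in Python (the identifier URL is hypothetical, and the media type a real service offers may differ):

```python
# Sketch only: dereference a researcher identifier and ask, via HTTP
# content negotiation, for a machine-readable description of the person.
# A browser following the same link would receive an HTML profile page.
import requests

identifier = "https://example.org/researcher/0001"  # hypothetical endpoint

response = requests.get(
    identifier,
    headers={"Accept": "text/turtle"},  # ask for RDF rather than HTML
    timeout=10,
)
response.raise_for_status()

print(response.headers.get("Content-Type"))
print(response.text)  # triples describing the person, if the service supports it
```

The same URL thus serves both audiences: a human reader gets a profile page, a machine gets data it can link into the wider web of research objects.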

Social issues

  • There are profound social issues of trust in the maintenance of such identifiers, especially for non-professional researchers in the longer term.
  • A centralised trusted authority (or authorities) that validates specific claims about identity (a “CrossRef for people”) might provide a trusted broker for identity transactions in the research space that solves many of these trust problems.
  • Issues around trust and jurisdiction as well as scope and control are likely to limit and fragment any effort to coordinate, federate, or integrate differing identity solutions in the research space. Therefore interoperability of any developed system with the wider web must be a prime consideration.

Conclusions

Identity, unique identifiers, authorisation of access, and validation of claims are issues that need to be solved before any transparent and believable metric system can be reliably implemented. In the current world, ill-considered, non-transparent, and irreproducible metric systems will almost inevitably lead to legal claims. At the same time there is a massive opportunity for wider involvement in research, for which a much more diverse range of people’s contributions will need recognition.

A system in which recognition and citation take the form of a link to a specified address on the web that represents a person has the potential both to make it much easier to unambiguously confirm who is being credited and to leverage an existing stack of tools and services to aggregate and organize information relating to identity. This is in fact a specific example of a wider view of addressable research objects on the web that can be part of a web of linked data. In this view a person is simply another object that can have specified relationships (links) to other objects.

Partial technical solutions in the form of OAuth and OpenID exist that solve some subset of these problems. These systems are currently not technically secure to a level compatible with handling the transfer of sensitive data, although they can interoperate with more secure transfer systems. They provide a federated and open system that enables any person to obtain and assert an identity and to control the appearance of that identity. Serious social issues around trust and persistence exist for this kind of system; these may be addressed through trusted centralized repositories that can act as reliable brokers.

Given the expected difficulties with uptake of any single system, interoperability with competing or complementary offerings will be crucial.


Quick update from International Digital Curation Conference

Just a quick note from the IDCC, given I was introduced as “one of those people who are probably blogging the conference”. I spoke this morning, giving a talk on Radical Sharing – Transforming Science? A version of the slides is available at slideshare. It seemed to go reasonably well and I got some positive comments. The highlight for me today was John Wilbanks speaking this evening – John always gives a great talk (slides will also be on his slideshare at some point) and I invariably learn something. Today that was the importance of distinguishing between citation (which is a term from the scholarly community) and attribution (which is a term with specific legal meaning in copyright law). Having used the two interchangeably in my talk (no recording unfortunately), John made the point that it is important to distinguish the two practices, particularly the reasons that motivate them and the different enforcement frameworks.

There were interesting talks this afternoon on costing for digital curation – not something I have spent a lot of time thinking about, but clearly something that is rather important. Also this morning there were talks on CARMEN and iPLANT, projects that are delivering on infrastructure for sharing and re-using data. Tonight we are off to Edinburgh Castle for the dinner, which should be fun, and tomorrow I make an early getaway to get to more meetings.

Links as the source code of our thinking – Tim O’Reilly

I just wanted to point to a post that Tim O’Reilly wrote just before the US election a few weeks back. There was an interesting discussion about the rights and wrongs of him posting his political views, and of that post being linked to from the O’Reilly Media front page. In amongst the abuse that you have come to expect in public political discussions there is some thought-provoking stuff. But what I wanted to point out, and hopefully revive a discussion of, is a point he makes right near the bottom.

[I have conflated two levels of comments here (Tim is quoting his own comment) – see the original post for the context]

“Thanks to everyone for wading in, especially those of you who are marshalling reasoned arguments and sharing actual sources and references, showing you’ve done your homework, and helping other people to see the data that helped to shape your point of view. We need a LOT more of that in this discussion, rather than slinging unsupported allegations back and forth.

Bringing this back to tech – showing the data behind your argument is a lot like open source. It’s a way of verifying the “code” that’s inside your head. If you can’t show us your code, it’s a lot harder to trust your results!”

Links as source code for your thinking: that’s a meme that should survive the particulars of this particular debate!

In a sense Tim is advocating the wholesale adoption of the very strong attribution culture we (like to think we) have in academic research. The importance of acknowledging your sources is clear, but it also has much more value than that. By tracing back the influences that have brought someone to a specific conclusion or belief, it is possible for other people to gain a much deeper insight into how those ideas evolved. Being able to parse the dependencies between ideas, data, samples, papers, and knowledge in an automatic, machine-readable way is the promise of the semantic web, but in the meantime just helping the poor old humans to trace back and understand where someone is coming from is very helpful.

The problem of academic credit and the value of diversity in the research community

This is the second in a series of posts (first one here) in which I am trying to process and collect ideas that came out of Scifoo. This post arises out of a discussion I had with Michael Eisen (UC Berkeley) and Sean Eddy (HHMI Janelia Farm) at lunch on the Saturday. We had drifted from a discussion of the problem of attribution stacking and citing datasets (and datasets made up of datasets) into the problem of academic credit. I had trotted out the usual spiel about the need to give credit for datasets and for tool development.

Michael made two interesting points. The first was that he felt people got too much credit for datasets already and that making them more widely citeable would actually devalue the contribution. The example he cited was genome sequences. This is a case where, for historical reasons, the publication of a dataset as a paper in a high-ranking journal is considered appropriate.

In a sense I agree with this case. The problem here is that for this specific case it is allowable to push a dataset-sized peg into a paper-sized hole. This has arguably led to an overvaluing of the sequence data itself and an undervaluing of the science it enables. Small molecule crystallography is similar in some regards, with the publication of crystal structures in paper form bulking out the publication lists of many scientists. There is a real sense in which having a publication stream for data, making the data itself directly citeable, would lead to a devaluation of these contributions. On the other hand it would lead to a situation where you would cite what you used, rather than the paper in which it was, perhaps peripherally, described. I think more broadly that the publication of data will lead to greater efficiency in research generally and more diversity in the streams to which people can contribute.

Michael’s comment on tool development was more telling though. As people at the bottom of the research tree (and I count myself amongst this group) it is easy to say ‘if only I got credit for developing this tool’, or ‘I ought to get more credit for writing my blog’, or any one of a thousand other things we feel ‘ought to count’. The problem is that there is no such thing as ‘credit’. Hiring decisions and promotion decisions are made on the basis of perceived need. And the primary needs of any academic department are income and prestige. If we believe that people who develop tools should be more highly valued then there is little point in giving them ‘credit’ unless that ‘credit’ will be taken seriously in hiring decisions. We have this almost precisely backwards. If a department wanted tool developers then it would say so, and would look at CVs for evidence of this kind of work. If we believe that tool developers should get more support then we should be saying so at a higher, strategic level, not just trying to get it added as a standard section in academic CVs.

More widely there is a question as to why we might think that blogs, or public lectures, or code development, or more open sharing of protocols are things for which people should be given credit. There is often a case to be made for the contribution of a specific person in a non-traditional medium, but that doesn’t mean that every blog written by a scientist is a valuable contribution. In my view it isn’t the medium that is important, but the diversity of media and the concomitant diversity of contributions that they enable. In arguing that these contributions are significant, what we are actually arguing for is diversity in the academic community.

So is diversity a good thing? The tightening and concentration of funding has, in my view, led to a decrease in diversity, both geographical and social, in the academy. In particular there is a tendency towards large groups clustered together in major institutions, generally led by very smart people. There is a strong argument that these groups can be more productive, more effective, and crucially offer better value for money. Scifoo is a place where those of us who are less successful come face to face with the fact that there are many people a lot smarter than us, and that these people are probably more successful for a reason. And you have to question whether your own small contribution with a small research group is worth the taxpayer’s money. In my view this is something you should question anyway as an academic researcher – there is far too much comfortable complacency and sense of entitlement – but that’s a story for another post.

So the question is: do I make a valid contribution? And does that provide value for money? And again, for me, Scifoo provides something of an answer. I don’t think I spoke to any person over the weekend without at least giving them something new to think about, a slightly different view on a situation, or just an introduction to something they hadn’t heard of before. These contributions were in very narrow areas, ones small enough for me to be expert in, but my background and experience provided a different view. What does this mean for me? Probably that I should focus more on what makes my background and experience unique – that I should build out from that in the directions most likely to provide a complementary view.

But what does it mean more generally? I think it means that a diverse set of experiences, contributions, and abilities will improve the quality of the research effort. At one session of Scifoo, on how to support ground-breaking science, I made the tongue-in-cheek comment that I thought we needed more incremental science, more filling in of tables, more laying of foundations properly. The more I think about this the more I think it is important. If we don’t have proper foundations, filled out with good data and thought through in detail, then there are real risks in building new skyscrapers. Diversity adds reinforcement by providing better tools, better datasets, and different views from which to examine the current state of opinion and knowledge. There is an obvious tension between delivering radical new technologies and knowledge and the incremental process of filling in, backing up, and checking over the details. But too often the discussion is purely about how to achieve the first, with no attention given to the importance of the second. This is about balance, not absolutes.

So to come back around to the original point, the value of different forms of contribution is not due to the fact that they are non-traditional, or because of the medium per se; it is because they are different. If we value diversity at hiring committees, and I think we should, then by looking at a diverse set of contributions, and at the contribution a given person is likely to make in the future based on their CV, we can assess more effectively how they will differ from the people we already have. The tendency of ‘the academy’ to hire people in its own image is well established. No monoculture can ever be healthy; certainly not in a rapidly changing environment. So diversity is something we should value for its own sake, something we should try to encourage, and something that we should search CVs for evidence of. Then the credit for these activities will flow of its own accord.

Data is free or hidden – there is no middle ground

Science Commons and others are organising a workshop on Open Science issues as a satellite meeting of the European Science Open Forum meeting in July. This is pitched as an opportunity to discuss issues around policy, funding, and social issues with an impact on the ‘Open Research Agenda’. In preparation for that meeting I wanted to continue to explore some of the conflicts that arise between wanting to make data freely available as soon as possible and the need to protect the interests of the researchers who have generated the data and (perhaps) have a right to the benefits of exploiting it.

John Cumbers proposed the idea of a ‘Protocol’ for open science that included the idea of a ‘use embargo’: the idea that when data is initially made available, no-one else should work on it for a specified period of time. I proposed more generally that people could ask that others leave their data alone for a particular period of time, but that there ought to be an absolute limit on this type of embargo to prevent data being tied up. These kinds of ideas revolve around the need to forge community norms – standards of behaviour that are expected, and to some extent enforced, by a community. The problem is that these need to evolve naturally, rather than be imposed by committee. If there isn’t community buy-in then proposed standards have no teeth.

An alternative approach to solving the problem is to adopt some sort of ‘licence’: a legal or contractual framework that creates obligations about how data can be used and re-used. This could impose embargoes of the type that John suggested, perhaps as flexible clauses in the licence. One could imagine an ‘Open data – six month analysis embargo’ licence. This is attractive because it apparently gives you control over what is done with your data while also allowing you to make it freely available. This is why people who first come to the table with an interest in sharing content always start with CC-BY-NC. They want everyone to have their content, but not to make money out of it. It is only later that people realise what other effects this restriction can have.

I had rejected the licensing approach because I thought it could only work in a walled garden, something which goes against my view of what open data is about. More recently John Wilbanks has written some wonderfully clear posts on the nature of the public domain, and the place of data in it, that make clear it can’t even work in a walled garden. Because data is in the public domain, no contractual arrangement can protect your ability to exploit that data; it can only give you a legal right to punish someone who does something you haven’t agreed to. This has important consequences for the idea of Open Science licences and standards.

If we argue as an ‘Open Science Movement’ that data is in, and must remain in, the public domain then, if we believe this is in the common good, we should also argue for the widest possible interpretation of what counts as data. The results of an experiment, regardless of how clever its design might be, are a ‘fact of nature’, and therefore in the public domain (although not necessarily publicly available). Therefore if any person has access to that data they can do whatever they like with it, as long as they are not bound by a contractual arrangement. If someone breaks a contractual arrangement and makes the data freely available there is no way you can get that data back. You can punish the person who made it available if they broke a contract with you. But you can’t recover the data. The only way you can protect the right to exploit data is by keeping it secret. This is entirely different to creative content, where if someone ignores or breaks licence terms then you can legally recover the content from anyone who has obtained it.

Why does this matter to the Open Science movement? Aren’t we all about making the data available for people to do whatever anyway? It matters because you can’t place any legal limitations on what people do with data you make available. You can’t put something up and say ‘you can only use this for X’ or ‘you can only use it after six months’ or even ‘you must attribute this data’. Even in a walled garden, once there is one hole, the entire edifice is gone. The only way we can protect the rights of those who generate data to benefit from exploiting it is through the hard work of developing and enforcing community norms that provide clear guidelines on what can be done. It’s that or simply keep the data secret.

What is important is that we are clear about this distinction between legal and ethical protections. We must not tell people that their data can be protected, because essentially it can’t be. And this is a real challenge to the ethos of open data, because it means that our only absolutely reliable method for protecting people is hiding data. Strong community norms will, and do, help, but there is a need to be careful about how we encourage people to put data out there. And we need to be very strong in condemning people who do the ‘wrong’ thing. Which is why a discussion of what we believe is ‘right’ and ‘wrong’ behaviour is incredibly important. I hope that discussion kicks off in Barcelona and continues globally over the next few months. I know that not everyone can make the various meetings that are going on – but between them, the blogosphere, and the ‘streamosphere’ we have the tools, the expertise, and hopefully the will, to figure these things out.


Attribution for all! Mechanisms for citation are the key to changing the academic credit culture

A reviewer at the National Institutes of Health evaluates a grant proposal. (Image via Wikipedia)

Once again a range of conversations in different places have collided in my feed reader. Over on Nature Network, Martin Fenner posted on ResearcherID, which led to a discussion about attribution, and in particular Martin’s comment that there was a need to be able to link to comments and the necessity of timestamps. Then DrugMonkey posted a thoughtful blog about the issue of funding-body staff introducing ideas from unsuccessful grant proposals they have handled into projects which they have a responsibility for guiding.