
Draft White Paper – Researcher identifiers

15 April 2010

On April 26 I am attending a joint meeting of the NSF and EuroHORCS (European Heads of Research Councils) on “Changing the Conduct of Science in the Information Age”. I have been asked to submit a one page white paper in advance of the meeting and have been struggling a bit with this. This is stage one, a draft document relating to researcher identifiers. I’m not happy with it but reckon that other people out there may well be able to help where I am struggling. I may write a second one on metrics or at least a brief literature collection. Any and all comments welcome.

Summary

Citation lies at the core of research practice, both recognizing the contributions that others have made to the development of a specific piece of work and linking related knowledge together. The technology of the web makes it technically feasible to radically improve the precision, accuracy, and completeness of these links. Such improvements are crucial to the successful implementation of any system that purports to measure research outputs or quality.

Additionally, the web offers the promise of extended and distributed projects involving diverse parties, many of whom may not be professional researchers. Their contributions nonetheless deserve the same level of recognition as comparable contributions from professional researchers. Providing an open system in which people can contribute to research efforts and receive credit raises significant social issues of control and validation of identities.

Such open and federated systems are exposed to potential failure through lack of technical expertise on the part of users, particularly where a person loses an identity or creates a new one. In many ways this is already the case where we use institutional email addresses as proxies for researcher identity. Is D.C.Neylon@####.ac.uk (a no longer functioning email address) the same person as Cameron.Neylon@####.ac.uk? It is technically feasible to consider systems in which the actual identity token used is widely available and compatible with the wider consumer web, but centralised and trusted authorities provide validation services to confirm specific claims around identity. Such a broker or clearing house would play a similar role for identities to the one CrossRef plays for scholarly articles via the DOI.

General points

  • Adding the concept of a semantic-web-ready researcher identifier, i.e. an identifier that provides a URL endpoint uniquely representing a specific researcher, allows the existing technical capacity of the semantic web stack to be used to provide a linked data representation of contributions to existing published research objects that are accessible at URL endpoints. Such a representation could be readily expanded beyond authorship to funder contributions.
  • In this view, crediting a researcher as a contributor to a specific object published on the web is a specific form of citation or linking.
  • The authoring tools to support the linking and publishing of research objects in this form do not currently exist in a widely usable form.
  • Semantic web technology provides an extensible means of adding and recognising diverse forms of contribution.
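
As a rough illustration of the linked data view above, the sketch below stores contributions as subject-predicate-object statements in plain Python. Every URL and relationship name here is hypothetical and chosen only for the example; a real deployment would more likely use an established vocabulary and an RDF toolkit.

```python
from collections import namedtuple

# One statement of the form "subject --predicate--> object".
Triple = namedtuple("Triple", ["subject", "predicate", "obj"])

# Hypothetical URL endpoints for a person, a funder, and two research objects.
PERSON_X = "https://example.org/people/person-x"
FUNDER_Z = "https://example.org/funders/funder-z"
PAPER_Y = "https://example.org/papers/paper-y"
PAPER_W = "https://example.org/papers/paper-w"

# Hypothetical relationship types; new ones can be added without changing the structure.
AUTHORED = "https://example.org/terms/authored"
REVIEWED = "https://example.org/terms/reviewed"
FUNDED = "https://example.org/terms/funded"

graph = [
    Triple(PERSON_X, AUTHORED, PAPER_Y),
    Triple(PERSON_X, REVIEWED, PAPER_W),
    Triple(FUNDER_Z, FUNDED, PAPER_Y),
]


def contributions_of(agent_url, triples):
    """Return (relationship, object) pairs credited to one person or funder."""
    return [(t.predicate, t.obj) for t in triples if t.subject == agent_url]


def contributors_to(object_url, triples):
    """Return (agent, relationship) pairs linked to one research object."""
    return [(t.subject, t.predicate) for t in triples if t.obj == object_url]


if __name__ == "__main__":
    print(contributions_of(PERSON_X, graph))
    print(contributors_to(PAPER_Y, graph))
```

Because a funder is just another URL in the subject position, extending the representation beyond authorship needs only a new relationship type, not a new data structure.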

Authorisation, validation, and control

  • Access to such identifiers must be under the control of the researcher and not limited to those with institutional affiliations. Any person must be able to obtain and control a unique researcher identifier that refers to them.
  • Authorisation and validation of claimed rights of access or connections to specific institutions can be technically handled separately from the provision of identifiers.
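
One way to picture this separation is sketched below, under loud assumptions: the identifier is simply a URL that anyone can hold, and a separate validating authority records which specific claims about that identifier it is prepared to confirm. The class names, URLs, and claim wording are all hypothetical and do not describe how OpenID, OAuth, or any existing broker actually works.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Claim:
    """A statement about the holder of an identifier (hypothetical structure)."""
    identifier: str  # URL that represents the researcher
    statement: str   # the specific thing being claimed


@dataclass
class Validator:
    """A trusted authority that confirms claims but does not issue identifiers."""
    name: str
    confirmed: set = field(default_factory=set)

    def confirm(self, claim: Claim) -> None:
        self.confirmed.add(claim)

    def is_confirmed(self, claim: Claim) -> bool:
        return claim in self.confirmed


# Anyone can obtain an identifier; no institution is involved at this step.
researcher_id = "https://example.org/people/person-x"

affiliation = Claim(researcher_id, "is affiliated with Example University")
authorship = Claim(researcher_id, "authored https://example.org/papers/paper-y")

# Validation is a separate step carried out by a separate party.
university = Validator("Example University")
university.confirm(affiliation)

if __name__ == "__main__":
    print(university.is_confirmed(affiliation))
    print(university.is_confirmed(authorship))
```

The identifier remains usable whether or not any authority has confirmed anything about it, so validation can be added, changed, or withdrawn without reissuing the identifier.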

Technical and legal issues

  • OpenID and OAuth are developing internet standards that provide the technical means to make identifiers available in a distributed way and to separate issues of authentication from those of identification. They are current leaders among federated identity and authorisation solutions on the consumer web.
  • OpenID and OAuth do not currently provide the levels of security required in several jurisdictions for personal or sensitive information (e.g. the UK Data Protection Act). Such federated systems may fall foul of jurisdictions with strong generic privacy requirements, e.g. Canada.
  • To interoperate with the wider web and enable a wider notion of citation as a declaration of a piece of knowledge “Person X authored Paper Y”, identities must resolve on the web, in the sense of being a clickable hyperlink that takes a human or machine reader to a page containing information representing that person.

Social issues

  • There are profound social issues of trust in the maintenance of such identifiers, especially for non-professional researchers in the longer term.
  • A centralised trusted authority (or authorities) that validates specific claims about identity (a “CrossRef for people”) might provide a trusted broker for identity transactions in the research space that solves many of these trust problems.
  • Issues around trust and jurisdiction as well as scope and control are likely to limit and fragment any effort to coordinate, federate, or integrate differing identity solutions in the research space. Therefore interoperability of any developed system with the wider web must be a prime consideration.

Conclusions

Identity, unique identifiers, authorisation of access, and validation of claims are issues that need to be solved before any transparent and believable metric systems can be reliably implemented. In the current world, ill-considered, non-transparent, and irreproducible metric systems will almost inevitably lead to legal claims. At the same time there is a massive opportunity for wider involvement in research, for which a much more diverse range of people’s contributions will need recognition.

A system in which recognition and citation take the form of a link to a specified address on the web that represents a person has the potential both to make it much easier to confirm unambiguously who is being credited and to leverage an existing stack of tools and services to aggregate and organize information relating to identity. This is in fact a specific example of a wider view of addressable research objects on the web that can be part of a web of linked data. In this view a person is simply another object that can have specified relationships (links) to other objects.

Partial technical solutions in the form of OAuth and OpenID exist that solve some subset of these problems. These systems are currently not technically secure to a level compatible with handling the transfer of sensitive data, but they can interoperate with more secure transfer systems. They provide a federated and open system that enables any person to obtain and assert an identity and to control the appearance of that identity. Severe social issues around trust and persistence exist for this kind of system. These may be addressed through trusted centralized repositories that can act as a reliable broker.

Given expected issues with uptake of any system, systems that are interoperable with competitive or complementary offerings are crucial.


  • Two things I'd be cautious about:

    I'd be careful about proposing centralized solutions. They can easily become bottlenecks for innovation.

    I'd be careful about tying proposals to the semantic web. If the semantic web were a startup it would have gone out of business years ago. That's not necessarily a criticism – lots of things take time to get right, and aren't suited for development by a company. But I think it's worth thinking about.

    Finally, it needs to be easy to aggregate this information for purposes of CVs, hiring etc. Preferably the aggregation functions aren't owned by anyone, but rather are owned by everyone, so that people can come up with different ways of aggregating.

    On the issue of metrics: I'll repeat something that's repeated often, but bears repeating: metrics should only ever be secondary criteria. Organizations that rely on them as primary criteria will quickly start to do second- or third-rate work, as people begin gaming the system. The economist Robert Solow has a great story about how the Soviet committee on paper production used to pay mills according to the weight of the paper they produced. The result: paper too thick to put in a typewriter. So they changed to reward the number of sheets produced. The result: paper transparently thin. Metrics can always be gamed.

    (This leaves the question: how do you judge what is good research? I wish I had a good answer.)

  • band

    First, this is an important discussion to have. This post highlights several aspects of the conversations we need. Thanks.

    Second, I agree with all of Michael's cautions. Centralized solutions can become single-points-of-failure. Maybe a distributed directory system like the DNS is a better way to start?

    Third, you say above that the failure of open, distributed systems is the result “of a lack of technical expertise on the part of the users ….” It is my responsibility to keep track of my logins, but I think the systems problem is that I have to keep track of this information. I view this as a system usability problem, not a lack of competence on my part. Maybe I'm not understanding you.

    Fourth, why are the social issues of trust greater for non-professional researchers? I think the “professional/non-professional” distinction isn't the best frame for the trust issues. I'm not sure what distinctions and categories will help us talk about the issues of trust. This is a BIG issue, and I'd like to have a more nuanced way to talk about it.

    Fifth, when we talk about “interoperability” can we specify or define what an interoperable scenario looks like? I think I know when systems are not inter-operating to my satisfaction. But I'm less successful in defining what interoperability is. (This reminds me of industrial design discussions of how to define a comfortable chair.)

    -Bill

  • Hi Michael, thanks for the comments. I agree about centralization; I am really against it personally, but listening to Geoff Bilder has convinced me that there is some need for an authority to look after some aspects of persistence and disambiguation. I think there is a way of squaring the circle that combines the federated aspect of OpenID (or something similar) with an authorisation or validation mechanism (perhaps OAuth) to get the best aspects of both. The technical devil is in the details.

    My point about the semantic web aspect is that we should be able to do simple things (that don't need to bother the average user) that actually make effective use of the stack that is already out there. Not that this is an argument for the semantic web but that there are tools that could be deployed that are very general and already available.

    Metrics…well yes, my main intention was to try and ask what different purposes we have for measuring things and how we might usefully enable it. The idea is to make a distinction between political or demonstrative metrics (CVs, presentations to politicians, where you want to make a case, primarily through aggregating and selecting material) and actually useful things like – “how many papers should I have read last week?”, “how far behind am I on looking at important datasets? How do I prioritise them?”

  • Bill, thanks, these are very good points on where I haven't been clear. I want to support the “non-professional researcher” and motivate their contributions so I definitely shouldn't point to them as less trustworthy!

    Definitely agree that distributed is a good start, but I think there is a place for a trusted authority. That might be technically distributed and that could be a good solution, but I think the cultural issues around control by institutions, funders, and researchers will require some sort of (set of?) authorities that can say “yes we know this person exists and they really did do this other work”.

  • band

    Cameron, among people, trust and authority are both socially constructed and frequently re-negotiated, so I think this problem is a juicy, socio-technical one. Using a tool like Shibboleth can move some of the identity verification work to other trusted institutions. But I don't see how to guarantee that, say, “band.myopenid.com” actually did a piece of work. Is the onus on me to prove I'm not lying? Hmmm ….

  • Oh I don't think we can guarantee it absolutely but within the research sphere there are trusted enough authorities (the journals themselves for instance) that we can rely on claims about authorship validated by them. Research is in a sense both a horrible place to do this because of the culture and technical issues but also good because we have some reasonable understanding about who trusts who about what.

    So to take your example – if a record label agrees that band.myopenid.com was a main contributor to the song you've just downloaded then that is good enough for us for most cases. If we're doing a fraud investigation then the level of proof would need to be much higher and you'd dig much deeper into those claims and the provenance record of the objects themselves (when was the track recorded, is there external evidence that band.myopenid.com was in the place where the track metadata says it was recorded).

  • There's “semantic web” – which was a buzzword and is now passé … then there's providing semantic meaning to data (really doing the semantic web) which seems to slowly, slowly be getting some traction. I'm interested in also tracking different forms of contribution – having different relationships. person x <reviewed> object y, person x <authored> object y, person x <commented on> object y, object y <incorporates> object z <authored by> person x and so on.

  • authority and trust are very much interesting socio-technical issues – however, there are some interesting network and computational ways to help people decide. Once everything's a network, you can use social rank or Bonacich measures with some sort of scale (the guy might be an idiot, but he's authored 10 journal articles on the same subject so maybe read his comment, etc.)
