
Draft White Paper – Researcher identifiers

15 April 2010

On April 26 I am attending a joint meeting of the NSF and EuroHORCS (European Heads of Research Councils) on “Changing the Conduct of Science in the Information Age”. I have been asked to submit a one-page white paper in advance of the meeting and have been struggling a bit with this. This is stage one, a draft document relating to researcher identifiers. I’m not happy with it but reckon that other people out there may well be able to help where I am struggling. I may write a second one on metrics, or at least a brief literature collection. Any and all comments welcome.

Summary

Citation lies at the core of research practice, both recognizing the contributions that others have made in the development of a specific piece of work and linking related knowledge together. The technology of the web makes it technically feasible to radically improve the precision, accuracy, and completeness of these links. Such improvements are crucial to the successful implementation of any system that purports to measure research outputs or quality.

Additionally the web offers the promise of extended and distributed projects involving diverse parties, many of which may not be professional researchers. Nonetheless such contributions deserve the same level of recognition as comparable contributions from professional researchers. Providing an open system in which people can contribute to research efforts and receive credit raises significant social issues of control and validation of identities.

Such open systems and federated systems are exposed to potential failure through lack of technical expertise on the part of users, particularly where a person loses or creates a new identity. This in many ways is already the case where we use institutional email addresses as proxies for researcher identity. Is D.C.Neylon@####.ac.uk (a no longer functioning email address), the same person as Cameron.Neylon@####.ac.uk? It is technically feasible to consider systems in which the actual identity token used is widely available and compatible with the wider consumer web but centralised and trusted authorities provide validation services to confirm specific claims around identity. Such a broker or clearing house would provide a similar role for identities as CrossRef provides for scholarly articles via the DOI.
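As a rough illustration of the broker idea, here is a minimal sketch in which several identity tokens (an old email address, a new one, an OpenID URL) are linked to one persistent identifier. The class name, the `person:` identifiers, and the example addresses are all hypothetical, and a real clearing house would need validation, persistence, and governance that are not modelled here.

```python
class IdentityBroker:
    """Hypothetical clearing house that links identity tokens to one person."""

    def __init__(self):
        # Maps a normalised token (email address, OpenID URL, ...) to a
        # persistent person identifier chosen for illustration only.
        self._tokens = {}

    @staticmethod
    def _normalise(token):
        # Treat tokens as case-insensitive and ignore surrounding whitespace.
        return token.strip().lower()

    def register(self, person_id, token):
        """Link `token` to `person_id`, refusing conflicting links."""
        key = self._normalise(token)
        existing = self._tokens.get(key)
        if existing is not None and existing != person_id:
            raise ValueError(f"{token!r} is already linked to {existing!r}")
        self._tokens[key] = person_id

    def resolve(self, token):
        """Return the person identifier for `token`, or None if unknown."""
        return self._tokens.get(self._normalise(token))

    def same_person(self, token_a, token_b):
        """True only if both tokens are known and point at the same person."""
        a, b = self.resolve(token_a), self.resolve(token_b)
        return a is not None and a == b


broker = IdentityBroker()
broker.register("person:0001", "A.Researcher@old-institution.example")
broker.register("person:0001", "ann.researcher@new-institution.example")
broker.register("person:0001", "https://ann-researcher.example/openid")
```

With those links in place, `broker.same_person(...)` can answer the "is this the same person?" question for a dead address and a current one, without either address itself being the identifier.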

General points

  • By adding the concept of a semantic-web ready researcher identifier, i.e. an identifier that provides a URL endpoint that uniquely represents a specific researcher, the existing technical capacity of the semantic web stack can be used to provide a linked data representation of contributions to existing published research objects that are accessible at URL endpoints. Such a representation could be readily expanded beyond authorship to funder contributions.
  • Crediting a researcher as a contributor to a specific object published on the web is, in this view, a specific form of citation or linking.
  • The authoring tools to support the linking and publishing of research objects in this form do not currently exist in a widely useable form.
  • Semantic web technology provides an extensible means of adding and recognising diverse forms of contribution.
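To make the linked-data view in the points above concrete, here is a small sketch that stores contributions as subject-predicate-object triples, with people, funders, and research objects all represented by URLs. The URLs and the predicate names (`authored`, `reviewed`, `funded`) are invented for illustration and are not taken from any real vocabulary; a real system would use an established RDF toolchain rather than plain tuples.

```python
from collections import defaultdict

# Hypothetical URL endpoints standing in for a person, a funder, and objects.
PERSON_X = "https://id.example.org/person/x"
FUNDER_F = "https://id.example.org/funder/f"
PAPER_Y = "https://objects.example.org/paper/y"
DATASET_Z = "https://objects.example.org/dataset/z"

# Each statement is a (subject, predicate, object) triple.
triples = [
    (PERSON_X, "authored", PAPER_Y),
    (PERSON_X, "reviewed", DATASET_Z),
    (FUNDER_F, "funded", PAPER_Y),
]


def contributions_by(subject, statements):
    """Group the objects a subject is linked to by type of contribution."""
    grouped = defaultdict(list)
    for s, predicate, obj in statements:
        if s == subject:
            grouped[predicate].append(obj)
    return dict(grouped)


def contributors_to(obj, statements):
    """List (contributor, contribution type) pairs for one research object."""
    return sorted((s, p) for s, p, o in statements if o == obj)
```

Adding a new form of contribution is then just a matter of adding a triple with a new predicate; neither function needs to change.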

Authorisation, validation, and control

  • Access to such identifiers must be under the control of the researcher and not limited to those with institutional affiliations. Any person must be able to obtain and control a unique researcher identifier that refers to them.
  • Authorisation and validation of claimed rights of access or connections to specific institutions can be technically handled separately from the provision of identifiers.

Technical and legal issues

  • OpenID and OAuth are developing internet standards that provide a technical means to distribute the availability of identifiers and to separate issues of authentication from those of identification. They are current leaders among federated identity and authorisation solutions on the consumer web.
  • OpenID and OAuth do not currently provide the levels of security required in several jurisdictions for personal or sensitive information (e.g. the UK Data Protection Act). Such federated systems may fall foul of jurisdictions with strong generic privacy requirements, e.g. Canada.
  • To interoperate with the wider web and enable a wider notion of citation as a declaration of a piece of knowledge “Person X authored Paper Y”, identities must resolve on the web, in the sense of being a clickable hyperlink that takes a human or machine reader to a page containing information representing that person.

Social issues

  • There are profound social issues of trust in the maintenance of such identifiers, especially for non-professional researchers in the longer term.
  • A centralised trusted authority (or authorities) that validates specific claims about identity (a “CrossRef for people”) might provide a trusted broker for identity transactions in the research space that solves many of these trust problems.
  • Issues around trust and jurisdiction as well as scope and control are likely to limit and fragment any effort to coordinate, federate, or integrate differing identity solutions in the research space. Therefore interoperability of any developed system with the wider web must be a prime consideration.
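One way to picture a “CrossRef for people” is as a registry that keeps identity claims separate from the authorities that vouch for them. The sketch below is only that, a sketch: the `Claim` and `ClaimRegistry` names, the example URLs, and the `minimum` threshold are assumptions made for illustration, and it says nothing about how an authority would come to be trusted in the first place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """A statement of the form "person <relation> object"."""
    person: str
    relation: str
    obj: str


class ClaimRegistry:
    """Hypothetical broker recording which authorities endorse which claims."""

    def __init__(self, trusted_authorities):
        self.trusted = set(trusted_authorities)
        self.endorsements = {}  # Claim -> set of endorsing parties

    def endorse(self, claim, authority):
        """Record that `authority` vouches for `claim` (anyone may endorse)."""
        self.endorsements.setdefault(claim, set()).add(authority)

    def validators(self, claim):
        """Return only the endorsers that this registry treats as trusted."""
        return self.endorsements.get(claim, set()) & self.trusted

    def is_validated(self, claim, minimum=1):
        """True if at least `minimum` trusted authorities endorse the claim."""
        return len(self.validators(claim)) >= minimum


registry = ClaimRegistry(trusted_authorities={"https://journal.example.org"})
claim = Claim("https://id.example.org/person/x", "authored",
              "https://objects.example.org/paper/y")
registry.endorse(claim, "https://id.example.org/person/x")  # self-assertion
registry.endorse(claim, "https://journal.example.org")      # trusted validation
```

Here the self-assertion is recorded but does not count towards validation; only the endorsement from the trusted journal does. The identifier itself stays under the person's control, while validation is handled separately.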

Conclusions

Identity, unique identifiers, authorisation of access and validation of claims are issues that need to be solved before any transparent and believable metric systems can be reliably implemented. In the current world ill-considered, non-transparent, and irreproducible metric systems will almost inevitably lead to legal claims. At the same time there is a massive opportunity for wider involvement in research for which a much more diverse range of people’s contributions will need recognition.

A system in which recognition and citation take the form of a link to a specified address on the web that represents a person has the potential both to make it much easier to confirm unambiguously who is being credited and to leverage an existing stack of tools and services to aggregate and organize information relating to identity. This is in fact a specific example of a wider view of addressable research objects on the web that can be part of a web of linked data. In this view a person is simply another object that can have specified relationships (links) to other objects.

Partial technical solutions in the form of OAuth and OpenID exist that solve some subset of these problems. These systems are currently not technically secure to a level compatible with handling the transfer of sensitive data, but they can interoperate with more secure transfer systems. They provide a federated and open system that enables any person to obtain and assert an identity and to control the appearance of that identity. Severe social issues around trust and persistence exist for this kind of system. These may be addressed through trusted centralized repositories that can act as reliable brokers.

Given expected issues with uptake of any system, systems that are interoperable with competitive or complementary offerings are crucial.


  • http://friendfeed.com/jillmwo Jill O’Neill

    I realize that you are pushing for open identifiers which is fine, but in the interests of full disclosure, should you leave out all references to other existing initiatives such as ORCID?

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/cameronneylon Cameron Neylon

    Jill, no I shouldn’t but I was struggling with where to put that and other initiatives. It’s a complicated thing to describe what they are. Intention was certainly to include some sort of review of existing stuff.

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/mfenner Martin Fenner

    You touch on many of the issues relevant for researcher identifiers. I would take a slightly different approach: design a researcher identifier system that has the best chances to be successful. It should be as simple as possible, have wide support from many stakeholders, integrate with existing systems, and leave ultimate control with the researcher.

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/mfenner Martin Fenner

    And I’m looking forward to the discussion about researcher identifiers at the workshop, where I will give a 5 min introduction to ORCID.

    This comment was originally posted on FriendFeed

  • http://michaelnielsen.org/blog Michael Nielsen

    Two things I'd be cautious about:

    I'd be careful about proposing centralized solutions. They can easily become bottlenecks for innovation.

    I'd be careful about tying proposals to the semantic web. If the semantic web were a startup it would have gone out of business years ago. That's not necessarily a criticism – lots of things take time to get right, and aren't suited for development by a company. But I think it's worth thinking about.

    Finally, it needs to be easy to aggregate this information for purposes of CVs, hiring etc. Preferably the aggregation functions aren't owned by anyone, but rather are owned by everyone, so that people can come up with different ways of aggregating.

    On the issue of metrics: I'll repeat something that's repeated often, but bears repeating: metrics should only ever be secondary criteria. Organizations that rely on them as primary criteria will quickly start to do second- or third-rate work, as people begin gaming the system. The economist Robert Solow has a great story about how the Soviet committee on paper production used to pay mills according to the weight of the paper they produced. The result: paper too thick to put in a typewriter. So they changed to reward the number of sheets produced. The result: paper transparently thin. Metrics can always be gamed.

    (This leaves the question: how do you judge what is good research? I wish I had a good answer.)

  • http://friendfeed.com/band Bill Anderson

    Martin makes a key point: the tools and techniques that are simple to understand and use are the ones that will be put into practice.

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/joergkurtwegner joergkurtwegner

    Why are we not simply using (distributed) FOAF? http://en.wikipedia.org/wiki/FOAF_(software) Especially the foaf:mbox entries look pretty personalized to me. The only drawback is that each person can have multiple foaf:mbox entries. Anyway, they are public and I do not mind registering all of them. – http://foaf-visualizer.org/?uri=http://www.joergkurtwegner.de/foaf.xml&hl=en

    This comment was originally posted on FriendFeed

  • band

    First, this is an important discussion to have. This post highlights several aspects of the conversations we need. Thanks.

    Second, I agree with all of Michael's cautions. Centralized solutions can become single-points-of-failure. Maybe a distributed directory system like the DNS is a better way to start?

    Third, you say above that the failure of open, distributed systems is the result “of a lack of technical expertise on the part of the users ….” It is my responsibility to keep track of my logins, but I think the systems problem is that I have to keep track of this information. I view this as a system usability problem, not a lack of competence on my part. Maybe I'm not understanding you.

    Fourth, why are the social issues of trust greater for non-professional researchers? I think the “professional/non-professional” distinction isn't the best frame for the trust issues. I'm not sure what distinctions and categories will help us talk about the issues of trust. This is a BIG issue, and I'd like to have a more nuanced way to talk about it.

    Fifth, when we talk about “interoperability” can we specify or define what an interoperable scenario looks like? I think I know when systems are not inter-operating to my satisfaction. But I'm less successful in defining what interoperability is. (This reminds me of industrial design discussions of how to define a comfortable chair.)

    -Bill

  • http://friendfeed.com/cameronneylon Cameron Neylon

    Joerg, one reason is that we’ve just got a document here at STFC saying that our email addresses are private and sensitive information so we can’t give out others – that might be a silly answer but there are problems with uptake even with "simple" solutions. But I like Martin’s point. I think in the middle of "these are the things we might want a system to do" and "design a system that has a chance of uptake and success" we can find a good middle road. Like I said – I’m not really happy with this as written so the feedback is _really_ helpful!

    This comment was originally posted on FriendFeed

  • http://cameronneylon.net Cameron Neylon

    Hi Michael, thanks for the comments. I agree about centralization, I am really against it personally but listening to Geoff Bilder has convinced me that there is some need for an authority to look after some aspects of persistence and disambiguation. I think there is a way of squaring the circle that combines the federated aspect of OpenID (or something similar) with an authorisation or validation mechanism (perhaps OAuth) that can combine the best aspects of both. The technical devil is in the details.

    My point about the semantic web aspect is that we should be able to do simple things (that don't need to bother the average user) that actually make effective use of the stack that is already out there. Not that this is an argument for the semantic web but that there are tools that could be deployed that are very general and already available.

    Metrics…well yes, my main intention was to try and ask what different purposes we have for measuring things and how we might usefully enable it. The idea is to make a distinction between political or demonstrative metrics (CVs, presentations to politicians, where you want to make a case, primarily through aggregating and selecting material) and actually useful things like – “how many papers should I have read last week?”, “how far behind am I on looking at important datasets? How do I prioritise them?”

  • http://cameronneylon.net Cameron Neylon

    Bill, thanks, very good points where I haven't been clear. I want to support the “non-professional researcher” and motivate their contributions so I definitely shouldn't point to them as less trustworthy!

    Definitely agree that distributed is a good start, but I think there is a place for a trusted authority. That might be technically distributed and that could be a good solution, but I think the cultural issues around control by institutions, funders, and researchers will require some sort of (set of?) authorities that can say “yes we know this person exists and they really did do this other work”.

  • http://friendfeed.com/mfenner Martin Fenner

    I believe that there is now a critical mass of people and organizations that want unique identifiers for researchers. And there seems to be agreement that we want one single identifier, and not separate identifiers for high-energy physics, Brazilians, or authors publishing with Elsevier. We also want this unique identifier system to be working in two years, and it shouldn’t cost $2 billion. I would therefore design such a system with the bare minimum requirements, but make it extensible for later. Everything else will add cost and complexity and just delay the launch of such a system.

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/mfenner Martin Fenner

    A researcher identifier system requires a unique identifier, name, alternate names and maybe email and current institution. And should connect to other identifier systems for researchers, publications, research datasets, etc. The unique identifier should be provided by a single organization (and not be distributed), and that organization should not be a private company or government organization. The identifier should be verified (claimed) by the user, but should also allow verification from institutions, publishers and societies. Everything else is extra stuff that is not required and can be provided by third-party services, e.g. authentication, listings of publications or research grants, personal profiles (contact information, etc.), connections to other researchers, or tags for research interests.

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/chrisle Chris Leonard

    Cameron, you may be interested in some thoughts I had on this some time ago: http://www.slideshare.net/chrisle1972/unique-author-ids-presentation

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/cameronneylon Cameron Neylon

    I think I agree with most of those points but I think it is actually impossible to have both a single identifier that works across all researchers and additionally is provided by a single organisation. Might get close but it won’t be universal because the intersection of technical, legal, and political issues will make it unworkable. I’m not sure this matters as long as someone can set up other compatible systems at some point in the future. Maybe a good question is what level of coverage would be acceptable in a first pass working system?

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/chrisle Chris Leonard

    If we didn’t want a central system, something like sameas might have a role to play to link the various ‘unique’ author ids? http://sameas.org/about.php

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/mfenner Martin Fenner

    Chris, your slideshare is a very good description of the ORCID system. You were almost two years ahead of your time ;).

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/mfenner Martin Fenner

    Centralized vs. decentralized unique identifiers is a recurring topic in the author identifier discussions. I prefer the centralized approach, but Geoff Bilder can explain the reasons much better: http://blogs.nature.com/mfenner/2009/02/17/interview-with-geoffrey-bilder

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/mfenner Martin Fenner

    Cameron, the level of coverage required to make an author identifier system workable is a good question. I would think that the initial push to use author identifiers will come from a) journal submission systems (for authors and reviewers) and b) institutions that want to use author identifiers to showcase and evaluate their research output. Both use cases already work if only some journals and some institutions use the author identifier.

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/joergkurtwegner joergkurtwegner

    Cameron, right, I would not be willing to share my eMail with everybody, but foaf:mbox is encrypted as foaf:mbox=f5726af91fa315377a58c9ec92e2920890ec4d9f. Is this then still a concern?

    This comment was originally posted on FriendFeed

  • band

    Cameron, among people trust and authority are both socially constructed and frequently re-negotiated, so I think this problem is a juicy, socio-technical one. Using a tool like Shibboleth can move some of the identity verification work to other trusted institutions. But I don't see how to guarantee that, say, “band.myopenid.com” actually did a piece of work. Is the onus on me to prove I'm not lying? Hmmm ….

  • http://cameronneylon.net Cameron Neylon

    Oh I don't think we can guarantee it absolutely but within the research
    sphere there are trusted enough authorities (the journals themselves for
    instance) that we can rely on claims about authorship validated by them.
    Research is in a sense both a horrible place to do this because of the
    culture and technical issues but also good because we have some reasonable
    understanding about who trusts who about what.

    So to take your example – if a record label agrees that band.myopenid.com
    was a main contributor to the song you've just downloaded then that is good
    enough for us for most cases. If we're doing a fraud investigation then the
    level of proof would need to be much higher and you'd dig much deeper into
    those claims and the provenance record of the objects themselves (when was
    the track recorded, is there external evidence that band.myopenid.com was in
    the place where the track metadata says it was recorded).

  • http://twitter.com/cpikas Christina K. Pikas

    There's “semantic web” – which was a buzzword and is now passé … then there's providing semantic meaning to data (really doing the semantic web) which seems to slowly, slowly be getting some traction. I'm interested in also tracking different forms of contribution – having different relationships. personx <reviewed> object y, personx <authored> object y, personx <commented on> object y, object y <incorporates> objectz <authored by> personx and so on.

  • http://twitter.com/cpikas Christina K. Pikas

    authority and trust are very much interesting socio-technical issues – however, there are some interesting network and computational ways to help people decide. Once everything's a network, you can use social rank or Bonacich measures with some sort of scale (the guy might be an idiot, but he's authored 10 journal articles on the same subject so maybe read his comment, etc.)

  • http://friendfeed.com/mfenner Martin Fenner

    My thoughts for the workshop are here: http://blogs.nature.com/mfenner/2010/04/17/improving-the-conduct-of-science

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/cameronneylon Cameron Neylon

    Chris, thanks for that, very helpful. Joerg, technically if I now pass that string onto someone else I am violating our corporate policy (as I understand it – it may be that I’m missing something and I have some queries in). Basically it is personally identifiable therefore sensitive.

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/cameronneylon Cameron Neylon

    Martin, I think that question of coverage is a really good one. I’m less sure about how well it works if only some journals and institutions use them but it does point out that it will be easier to get critical mass in specific domains. What would be really useful is a set of use cases and an estimate of what critical mass is required to make them viable. (Haven’t had a chance to read your post yet – prob won’t get to that until this afternoon)

    This comment was originally posted on FriendFeed

  • http://friendfeed.com/mfenner Martin Fenner

    Author identifiers could be started with journal submission systems that publish a relatively large number of papers, e.g. PLoS ONE, PNAS or J Biol Chem. Most researchers don’t publish that often, so it will take some time until they see the benefits of using a standard author identifier.

    This comment was originally posted on FriendFeed