Metrics of use: How to align researcher incentives with outcomes

[Image: slices of carrot, via Wikipedia]

It has become reflexive in the Open Communities to talk about a need for “cultural change”. The obvious next step is to find strong and widely respected advocates of change, to evangelise to young researchers, and to hope that change follows. Inevitably this process is slow, perhaps so slow as to be ineffective. So beyond grassroots evangelism we move towards policy change as a top-down mechanism for driving improved behaviour. If funders demand that data be open, and that papers be accessible to the wider community, as a condition of funding, then this will happen. The NIH mandate and the work of the Wellcome Trust on Open Access show that this can work, and indeed that mandates in some form are necessary to raise compliance to acceptable levels.

But policy is a blunt instrument, and researchers, being who they are, don’t like to be pushed around. Passive-aggressive responses from researchers are relatively ineffectual in the peer-reviewed article space. A paper is a paper. If it’s under the right licence then things will probably be ok, and a specific licence is easy to mandate. Data, though, is a different kettle of fish. It is very easy to comply with a data availability mandate yet provide that data in a form which is totally useless. Indeed it is rather hard work to provide it in a form that is useful. Data, software, reagents, and materials are incredibly diverse, and it is difficult to write policy that is specific enough to be effective yet general enough to be useful. So beyond the policy-mandate stick, which will only ever secure a minimum level of compliance, how do we motivate researchers to put the effort into making their outputs available in a useful form? How do we encourage them to want to do the right thing? After all, what we want to enable is re-use.

We need more sophisticated motivators than blunt policy instruments, so we arrive at metrics: measuring the outputs of researchers. There has been a wonderful animation illustrating a Daniel Pink talk doing the rounds in the past week. It is well worth a look and important stuff, but I think a naive application of it to researchers’ motivations would miss two important aspects. Firstly, money is never “off the table” in research; we are always to some extent limited by resources. Secondly, the intrinsic motivators, the internal metrics that matter to researchers, are tightly tied to the metrics valued by their communities, and those metrics are in turn tightly tied to resource allocation. Most researchers value their papers, the places they are published, and the citations received as measures of their worth, because that is what their community values. This makes the system highly leveraged for rapid change, if and only if a research community starts to value a different set of metrics.

What might the metrics we would like to see look like? I would suggest that they should focus on what we want to see happen. We want return on the public investment, we want value for money, but above all we want to maximise the opportunity for research outputs to be used and to be useful. We want to optimise the usability and re-usability of research outputs and we want to encourage researchers to do that optimisation. Thus if our metrics are metrics of use we can drive behaviour in the right direction.

If we optimise for re-use then we automatically value access, and we automatically value the right licensing arrangements (or lack thereof). If we value and measure use then we optimise for the release of data in useful forms and for the release of open source research software. If we optimise for re-use, for discoverability, and for added value, then we can weigh the loss of access inherent in publishing in Nature or Science against the enhanced discoverability and editorial contribution, and put a real value on these aspects. We would stop arguing about whether tenure committees should value blogging and start asking how much those blogs were used by others to provide outreach, education, and research outcomes.

For this to work there would need to be mechanisms that automatically credit the use of a much wider range of outputs. We would need to cite software and data, to acknowledge the providers of the metadata that enabled our search terms to find the right thing, and to aggregate this information in a credible and transparent way. This is technically challenging, and technically interesting, but doable. Many of the pieces are in place, and many of the community norms around giving credit and appropriate citation are in place; we are just not sure how to apply them in many cases.
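
As a minimal sketch of what such credit aggregation might look like – the event stream, identifiers, and event types here are all hypothetical – the core operation is just transparent counting over typed “use” events:

```python
from collections import Counter

# Hypothetical stream of "use" events covering diverse outputs:
# formal citations, data re-use, software citation, and so on.
events = [
    {"output": "doi:10.xxxx/dataset-1", "type": "data-reuse"},
    {"output": "example.org/lab/toolkit", "type": "software-citation"},
    {"output": "doi:10.xxxx/paper-7", "type": "formal-citation"},
    {"output": "doi:10.xxxx/dataset-1", "type": "data-reuse"},
]

# Count every output and use type on the same footing, transparently.
usage = Counter((e["output"], e["type"]) for e in events)
for (output, kind), n in usage.most_common():
    print(f"{output}\t{kind}\t{n}")
```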

Equally this is a step back towards what the mother of all metrics, the Impact Factor, was originally about. The IF was intended as a way of measuring the use of journals through counting citations, as a means of helping librarians choose which journals to subscribe to. Article Level Metrics are in many ways the obvious return to this idea when we want to measure the outputs of specific researchers. The h-index, for all its weaknesses, is a measure of the re-use of outputs through formal citations. Influence and impact are already important motivators at the policy level. Measuring use is actually quite a natural way to proceed. If we can get it right it might also provide the motivation we want, aligning researcher interests with those of the wider community and optimising access to research for both researchers and the public.


Why the web of data needs to be social

[Image by cameronneylon via Flickr]

If you’ve been around either myself or Deepak Singh you will almost certainly have heard the Jeff Jonas/Jon Udell soundbite: ‘Data finds data. Then people find people’. Jonas is referring to data management frameworks and knowledge discovery and Udell is referring to the power of integrated data to bring people together.

At some level Jonas’ vision (see his chapter[pdf] in Beautiful Data) is what the semantic web ought to enable, the automated discovery of data or objects based on common patterns or characteristics. Thus far in practical terms we have signally failed to make this a reality, particularly for research data and objects.

Udell’s angle (or rather, my interpretation of his overall stance) is more linked to the social web – the discovery of common contexts through shared data frameworks. These contexts might be social groups, as in conventional social networks, a particular interest or passion, or – in the case of Jon’s championing of the iCalendar standard – a date and place, as demonstrated by the elmcity project supporting calendar curation and aggregation. Shared context enables the making of new connections, the creation of new links. But still mainly links between people.

It’s not the scientists who are social; it’s the data – Neil Saunders

The naïve analysis of the success of consumer social networks and of the weaknesses of science communication has led to efforts that almost precisely invert the Jonas/Udell concept. In the case of most of these “Facebooks for Scientists” the idea is that people find people, and then they connect with data through those people.

My belief is that this inversion has led to the almost complete failure of these networks to gain traction. Services that place the objects of research at the centre – the reference management and bookmarking services, and to some extent Twitter and Friendfeed – appear to see much more real scientific use, because they mediate the interactions that researchers are interested in: those between themselves and research objects. Friendfeed in particular seems to support this discovery pattern. Objects of interest are brought into your stream, which then leads to discovery of the person behind them. I often use Citeulike in this mode: I find a paper of interest and identify the tags other people have used for it and the papers that share those tags. If these seem promising, I might then look at a person’s library, but I get to that person through the shared context of the research object, the paper, and the tags around that object.
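
A rough sketch of that object-centric discovery path, with entirely invented bookmark data: from a paper, through shared tags, to the people behind related libraries.

```python
# Invented bookmark records, for illustration only.
bookmarks = [
    {"user": "alice", "paper": "paper-1", "tags": {"rna", "folding"}},
    {"user": "bob",   "paper": "paper-2", "tags": {"rna", "structure"}},
    {"user": "carol", "paper": "paper-1", "tags": {"folding"}},
]

def people_via_object(paper_id: str) -> set:
    """Find people through the object: shared tags, not friendship."""
    tags = set().union(*(b["tags"] for b in bookmarks if b["paper"] == paper_id))
    return {b["user"] for b in bookmarks
            if b["paper"] != paper_id and b["tags"] & tags}

print(people_via_object("paper-1"))  # -> {'bob'}, reached via the shared 'rna' tag
```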

Data, data everywhere, but not a lot of links – Simon Coles

A common complaint made of research data is that people don’t make it available. This is part of the problem, but an increasingly small part. It is now easy enough to put data up that many researchers are doing so: in the supplementary data of journal articles, on personal websites, or on community or consumer sites. From a linked data perspective we ought to be having a field day with this, even if it represents only a small proportion of the total. However, little of this data is easily discoverable and most of it is certainly not linked in any meaningful way.

A fundamental problem, one I feel I have been banging on about for years now, is the dearth of well-built tools for creating these links. Finally such tools are starting to appear, with Freebase Gridworks an early example. There is a good chance that it will become easier over time for people to create links as part of the process of making their own record. But the fundamental problems we always face, that this is hard work, and often unrewarded work, are limiting progress.
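
To make the idea concrete, here is a minimal sketch of the kind of reconciliation step a tool like Gridworks automates, using only Python’s standard library; the authority list and identifiers are invented for illustration.

```python
import difflib

# A tiny, invented authority list mapping names to canonical identifiers.
canonical = {
    "Escherichia coli": "taxon:562",
    "Saccharomyces cerevisiae": "taxon:4932",
}

def reconcile(name: str):
    """Fuzzily match a free-text name to a canonical identifier, if any."""
    match = difflib.get_close_matches(name, list(canonical), n=1, cutoff=0.8)
    return canonical[match[0]] if match else None

print(reconcile("Escherichia coli."))  # -> 'taxon:562' despite the stray full stop
```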

Data friends data…then knowledge becomes discoverable

Human interaction is unlikely to work at scale: we are going to need automated systems to wire the web of data together. The human process simply cannot keep up with the ongoing annotation and connection of data at the volumes being generated today. And we cannot afford not to wire it together if we want to optimise the opportunities for research to deliver useful outcomes.

When we think about social networks we always place people at their centre, but there is nothing to stop us replacing people with data or other research objects: software that wants to find data; data that wants to find complementary or supporting data, or the right software to convert or analyse it. Instead of Farmville or Mafia Wars, imagine useful tools that make these connections, negotiate content, and identify common context. As Paul Walk pointed out to me, this is very similar to the role envisioned for software agents in the 90s. In this view the human research users are the poorly connected users on the outskirts of the web.
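
A toy illustration of “data finds data”, with hypothetical records: two datasets connect through shared context, here a common sample identifier or overlapping descriptive terms, without a human in the loop.

```python
# Hypothetical dataset records.
datasets = [
    {"id": "ds-1", "sample": "strain-xyz", "terms": {"expression", "heat-shock"}},
    {"id": "ds-2", "sample": "strain-xyz", "terms": {"proteomics", "heat-shock"}},
    {"id": "ds-3", "sample": "strain-abc", "terms": {"imaging"}},
]

def complementary(record: dict) -> list:
    """Return datasets sharing a sample or descriptive terms with this one."""
    return [d["id"] for d in datasets
            if d["id"] != record["id"]
            and (d["sample"] == record["sample"] or d["terms"] & record["terms"])]

print(complementary(datasets[0]))  # -> ['ds-2']: linked by sample and shared terms
```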

The point is that the hard part of creating linked data is making the links, not publishing the data. The semantic web has always suffered from a chicken-and-egg problem: a lack of user-friendly ways to generate RDF, and few tools that could use that RDF in exciting ways even where it did exist. I still can’t do a useful search on which restaurants in Bath will be open next Sunday. The innards of all this should be hidden from the user; the making of connections needs to be automated as far as possible, and as natural as possible when the user has to be involved. As easy as hitting that “like” button, or right-clicking to add a citation.

We have learnt a lot about the principles of when and how social networks work. If we can apply those lessons to the construction of open data management and discovery frameworks then we may stand some chance of actually making some of the original vision of the web work.


Implementing the “Publication as Aggregation”

[Image: Google Wave Lab mockup, by cameronneylon via Flickr]

I wrote a few weeks back about the idea of re-imagining the formally published scientific paper as an aggregation of objects. The idea is that this provides continuity, by enabling such papers to be displayed in more or less the same way as they are now; enhanced functionality, for instance by embedding active versions of figures in their native form; and at the same time a route towards a linked data web for research.

Fundamentally the idea is that we publish fragments and then aggregate those fragments together. The mechanism of aggregation is an expanded version of the familiar paradigm of citation: the introduction of a paper will link to and cite other papers as usual, and these would be incorporated as links and citations within the presentation of the paper; but in addition the paper itself will cite the introduction. By default the introduction would be included in the view of the paper presented to the user, but equally the user might choose to see only the figures, only the conclusions, or only the citations. Figures would be citations to data, again available on the web, again with a default visualization that might be an embedded active graph or a static image.

I asserted that the tools for achieving this are more or less in place. Actually that is only half true. The tools for storing, displaying, and even to some extent archiving communications in this form do exist, at least in the form of examples.

An emerging standard for aggregated objects on the web is the Open Archives Initiative’s Object Reuse and Exchange specification (OAI-ORE). An OAI-ORE object is a generic description of the addresses of a series of things on the web and of how they relate to each other, which makes it the natural approach for representing the idea of a paper as an aggregation of pieces. The OAI-ORE object itself is RDF, with no concept of how it should be displayed or used; it is just a set of things, each labelled with its role in the overall object. In principle this makes it straightforward to display in any number of ways. A simple example would be converting OAI-ORE to the NLM-DTD XML format. The devil, as always, is in the detail, but this makes a good first-pass technical requirement for how the pieces of the OAI-ORE object are described: it must be straightforward to convert to NLM-DTD.
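
As a sketch of what this looks like in practice, an OAI-ORE resource map describing a paper-as-aggregation can be built with a standard RDF library such as rdflib; the URLs below are placeholders, not a proposed scheme.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCTERMS, RDF

ORE = Namespace("http://www.openarchives.org/ore/terms/")

# Hypothetical addresses for the paper and two of its fragments.
rem = URIRef("http://example.org/paper/1/resourcemap")
agg = URIRef("http://example.org/paper/1")
intro = URIRef("http://example.org/paper/1/introduction")
figure_data = URIRef("http://example.org/data/42")

g = Graph()
g.add((rem, RDF.type, ORE.ResourceMap))
g.add((rem, ORE.describes, agg))
g.add((agg, RDF.type, ORE.Aggregation))
g.add((agg, ORE.aggregates, intro))
g.add((agg, ORE.aggregates, figure_data))
# Label each piece with its role in the overall object.
g.add((intro, DCTERMS.description, Literal("introduction")))
g.add((figure_data, DCTERMS.description, Literal("source data for figure 1")))

print(g.serialize(format="turtle"))
```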

Once we have the collection of pieces, their relationships to each other, and the idea that we can choose to display some or all of them in any way we choose, a lot of the rest falls into place. Figures can be data objects with a default visualization method, and these visualizations can be embedded in the way that is now familiar for audio and video files. Equally, references to gene names, structures, and chemical entities could be treated the same way. Want the chemical name? Just click a button and the visualization tool will deliver it. Want the structure? Again, click the button, toggle the menu, or write a script to ask for it in that form if you are doing that kind of thing. We would need open standards for embedding objects – probably less Flash – but that’s a fairly minor issue.

There needs to be some communication between the citing object (the paper) and the cited object (the data, figure, text, or external reference); this could be built up from the TrackBack or Pingback protocols. There also needs to be default content negotiation: “I want this data, what can you give me? Graph? Table?…ok, I’ll take the graph.” That is just a RESTful API, something more or less standard for consumer web data services but badly missing on the research web. None of this is terribly difficult, and there are good tools out there to do it.
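
That negotiation is plain HTTP. A minimal sketch against a hypothetical data address, asking for the tabular form first and falling back to the default graph rendering:

```python
import requests

url = "http://example.org/data/42"  # hypothetical address of a data object

# The Accept header drives what form the server returns.
resp = requests.get(url, headers={"Accept": "text/csv"})
if resp.ok and resp.headers.get("Content-Type", "").startswith("text/csv"):
    table = resp.text
else:
    # Fall back to the default visualization, e.g. a rendered graph.
    resp = requests.get(url, headers={"Accept": "image/png"})
    graph_image = resp.content
```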

But I said the problem was only half solved. The other half is good authoring and publishing tools. All of the above assumes that these OAI-ORE objects exist or can easily be built, and that the pieces we want to aggregate are already on the web, ready to be pointed at and embedded. They are not. We have two fundamental problems. First, we have to get these things onto the web in a useful and re-useable form. Some of this can be done with existing data services such as ChemSpider, Genbank, PubChem, GEO, etc., but that is the easy end of the problem. The hard part is the heterogeneous mass of pieces of data, Excel spreadsheets, CSV files, XML and binaries, that make up the majority of the research outputs we generate.

Publication itself could be made easy, using automatic upload tools and lightweight data services that give these objects a place on the web. The criticism is often made that “just publishing” is not enough because there is no context. What is often missed is that the best way to provide context is for the person who generated the research object to link it into a larger record. The catch is that for this to be useful they have to publish the object to the web first; otherwise the record they create points at a local and inaccessible object. So we need tools that simply push the raw material up onto the web, probably in the short to medium term to secure servers, but ones where individual objects can be made public at some point.

So the other tools we need are for authoring these documents. These will look and behave like a word processor (or like a LaTeX editor for those who prefer that route) but with a clever reference manager and citation creator. Today our reference libraries contain only papers. But imagine that your library contained all of the data you have generated as well, and that the easiest way to write up your lab notebook was simply to right-click, include the reference, and select the visualization you want. All the details of default visualizations, of where the data really lives, of adding the record to the OAI-ORE root node, would be handled behind the scenes. You might need to mark up lumps of text to say what their role is, probably using some analogue of styles, similar to the way the Integrated Content Environment (ICE) system does.

This environment could be built for Word, for Open Office, or for LaTeX. One of the reasons I remain excited about Google Wave is that it should be relatively easy to prototype such an environment there: the hooks are already in place in a way they aren’t in traditional document authoring tools, because Wave is much more web native. There is, however, a chicken-and-egg problem, in that such an environment isn’t a whole lot of use without published objects to aggregate together and publication services to provide rich views of the final aggregated documents. It will be quite a lot of work to build all of these pieces, and it will take some time before the benefits become clear. But I think it is a direction worth pursuing, because it takes the best of what we already know works on the web and applies it to an evolutionary adaptation of a communication style that is already familiar. The revolution comes once the pieces are there for people to work with in new ways.


In defence of author-pays business models

[Image: Latest journal ranking in the biological sciences, by cameronneylon via Flickr]

There has been an awful lot written and said recently about author-pays business models for scholarly publishing, and much of it has focussed on PLoS ONE. Most recently Kent Anderson has written a piece on the Scholarly Kitchen that contains a number of fairly serious misconceptions about the processes of PLoS ONE. This is a shame, because it muddles the much more interesting question that was intended to be the focus of his piece. Nonetheless, here I want to give a robust defence of author-pays models and of PLoS ONE in particular. Hopefully I can deal with the more interesting question, how radical PLoS should or could be, in a later post.

A common charge levelled at journals funded by author payments is that they are pushed in the direction of being non-selective. The figure that PLoS ONE publishes around 70% of the papers it receives is often given as a demonstration of this. There are a range of reasons why this is nonsense. The first and simplest is that the evidence suggests that between 50% and 95% of papers rejected from journals are ultimately published elsewhere [1, 2 (pdf), 3, 4]. The cost of this trickle-down, a result of using subjective selection criteria of “importance”, is enormous in authors’ and referees’ time, and represents a significant opportunity cost in lost time. PLoS ONE seeks to remove this cost by simply asking “should this be published?” In the light of the figures above, 70% seems a reasonable proportion of papers that are “basically ok but might need some work”.
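
A back-of-envelope illustration of the scale of that waste; every number here is an assumption for the sake of the arithmetic, not a measurement.

```python
# Assumed figures, for illustration only.
acceptance_rate = 0.3        # a selective journal accepting ~30% of submissions
referees_per_round = 2.5
hours_per_review = 5

# If rejected papers simply resubmit elsewhere at similar odds,
# the expected number of submission cycles is geometric: 1/p.
expected_submissions = 1 / acceptance_rate
referee_hours = expected_submissions * referees_per_round * hours_per_review
print(f"~{expected_submissions:.1f} submissions and ~{referee_hours:.0f} "
      f"referee-hours per eventually published paper")
```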

The second presumption is that the peer review process is somehow “light touch”. This is perhaps the result of some mis-messaging early in the history of PLoS ONE, but it is absolute nonsense. As both an academic editor and an author I would argue that the peer review process is as rigorous as I have experienced at any other journal (and I do mean any other journal).

As an author I have two papers published in PLoS ONE, both went through at least one round of revision, and one was initially rejected. As an editor I have seen two papers withdrawn after the initial round of peer review, presumably not because the authors felt that the required changes represented a “light touch”. I have rejected one and have never accepted a paper without revision. Every paper I have edited has had at least one external peer reviewer and I try to get at least two. Several papers have gone through more than one cycle of revision with one going through four. Figures provided by Pete Binfield (comment from Pete about 20 comments in) suggest that this kind of proportion is about average for PLoS ONE Academic Editors. The difference between PLoS ONE and other journals is that I look for what is publishable in a submission and work with the authors to bring that out rather than taking delight in rejecting some arbitrary proportion of submissions and imagining that this equates to a quality filter. I see my role as providing a service.

The more insidious claim made is that there is a link between this supposed light touch review and the author pays models; that there is pressure on those who make the publication decision to publish as much as possible. Let me put this as simply as possible. The decision whether to publish is mine as an Academic Editor and mine alone. I have never so much as discussed my decision on a paper with the professional staff at PLoS and I have never received any payment whatsoever from PLoS (with the possible exception of two lunches and one night’s accommodation for a PLoS meeting I attended – and I missed the drinks reception…). If I ever perceived pressure to accept or was offered inducements to accept papers I would resign immediately and publicly as an AE.

That an author-pays model has the potential to create a conflict of interest is clear. That is why, within reputable publishers, structures are put in place to reduce that risk as far as possible, divorcing the financial side from editorial decision making and creating Chinese walls between editorial and financial staff. The suggestion that my editorial decisions are influenced by the fact that the authors will pay is, to be frank, offensive, calling into serious question my professional integrity and that of the other AEs. It is also a slightly strange suggestion. I have no financial stake in PLoS: if it were to go under tomorrow it would make no difference to my take-home pay or my finances. I would be disappointed, but not poorer.

Another point that is rarely raised is that the author-pays model is much more widely used than people generally admit. Page charges and colour charges in many disciplines are of the same order as Open Access publication charges. The Journal of Biological Chemistry has been charging page rates for years while increasing publication volume. Author fees of one sort or another are very common right across the biological and medical sciences literature. And this is not new. Bill Hooker’s analysis (here and here) of these hidden charges bears reading.

But the core of the argument for author payments is that the market for scholarly publishing is badly broken. Until the pain of the costs of publication is directly felt by those choosing where to (try to) publish, we will never change the system. The market is also the right place to have this argument out: it is value for money that we should be optimising. Let me illustrate with an example. I have heard figures of around £25,000 given as the author charge that would be required to sustain Cell, Nature, or Science as Open Access, APC-supported journals. This is usually followed by a statement to the effect of “so they can’t possibly go OA because authors would never pay that much”.

Let’s unpack that statement.

If authors were forced to choose between the cost of publishing in these top journals and putting that money back into their research, they would choose the latter. If the customer actually had to pay the true costs of publishing in these journals, they wouldn’t; if journals believed that authors would see the real cost as good value for money, many of them would have made the switch years ago. Subscription charges as a business model have allowed an appallingly wasteful situation to continue unchecked: authors can pretend that there is no difference in cost between venues, and they accept that premium offerings are value for money because they never have to pay for them. Make them choose between publishing in a “top” journal and publishing in a “quality” journal while gaining another few months of postdoc time, and the equation changes radically. Maybe £25k is good value for money. But it would be interesting to find out how many people think so.
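
To make that trade-off concrete; the monthly postdoc cost below is an assumption, purely for illustration.

```python
apc = 25_000                    # the quoted author charge for a top-tier OA journal
postdoc_cost_per_month = 4_500  # assumed fully-loaded monthly cost of a postdoc

months = apc / postdoc_cost_per_month
print(f"A £{apc:,} APC is roughly {months:.1f} months of postdoc time")
```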

We need a market where the true costs are a factor in the choices of where, or indeed whether, to formally publish scholarly work. Today, we do not have that market and there is little to no pressure to bring down publisher costs. That is why we need to move towards an author pays system.


Engage or become irrelevant

[Image: Crowd being turned back at Coliseum (LOC), by The Library of Congress via Flickr]

Friday and Saturday last week I had the privilege of attending the first Sage Congress. Hopefully this will be the first in a series of posts that cover that meeting because there is simply so much to think about and so much to just get on and do.

This is not a post about public engagement work by scientists. It is not about going to schools and giving talks. It is not about engaging with the mainstream media to present your work to the great unwashed. It is about engaging with the people who will be driving your research agenda within ten years, about how the way researchers connect with society will change over the next decade whether they like it or not. The aim of Sage Bionetworks, the wider Sage Commons, and its constituent projects is nothing less than to change the pace at which medical research operates. One aim put forward seriously as a twelve-month goal in one of our breakouts was to document three use cases in which information from the Sage Commons had made a difference to a patient. The scientific details are perhaps less important than the delivery plan: an open platform for laboratory and clinical data, linked to detailed models that explain that data, and ultimately to tools for clinical staff and laboratory scientists to use and, crucially, to contribute back where appropriate.

As you might expect, the meeting included scientists, technologists, policy people, funders, and publishers. It also included a significant number of patient advocates, and by the end of the meeting, for me at least, they were at the core of the project. This might not be surprising if it were just a matter of motivation for getting things done. Josh Sommer‘s enormously powerful talk was pitched perfectly to spur the group to action; I cannot do it justice, but will link to the video when it is available. But that was only half the story. The other half came when these same patient advocates got up at the synthesis session at the end of the meeting to say they had formed their own workstream. Their aim? To get Stephen on Oprah. Again publicity and information for “the public”, support perhaps, and help with fundraising. But to focus on that is still to miss the point.

A second-hand conversation was related to me in which a major agency representative had said “we will never make data public”. I have sympathy for this view: such agencies need to protect their standards, and this includes absolute adherence to privacy policies and validated ethical procedures. But contrast that with Anne Wojcicki’s talk on how 23andMe gets enormous response rates on questionnaires containing deeply personal questions where the aggregate information will be made public. Contrast it with Rob Epstein of Medco on cold-calling patients to ask whether they would be willing to contribute to rapid testing programmes to see whether genotyping can reduce hospitalizations caused by warfarin. And contrast it with Josh Sommer’s work with the Chordoma Foundation, Gilles Frydman‘s with ACOR and the Society for Participatory Medicine, or the many other examples at the congress; services like Patients Like Me where patients want to push data out, both because they get valuable information back for themselves and because they want to make a difference. We are rapidly moving towards a world where networks of patients might refuse to sign up for trials that don’t commit to making the data publicly available.

People like me tend to advocate getting funders to push for policy change, because they hold the purse strings and are best placed to push change through. One thing we often forget is that they are simply intermediaries. They are not the real funders, and they don’t provide the only form of funding. Increasingly they don’t hold the real power either. In clinical research the patients involved fund your work directly as well as indirectly through their taxes or charitable donations. They are perhaps the biggest funders of medical research, donating their time and hard-won information about their state of health. They are also the most effective advocates of that research. The engagement group at the congress didn’t stand up and say “we want to help”; they stood up to say “you need us to succeed in your aims”.

What projects like GalaxyZoo show us is that when you effectively enable an engaged portion of the wider community to contribute to your research, you can increase the pace by orders of magnitude. “The public” is not some homogeneous group of barbarians at the gate of our ivory towers. They are a diverse group, many of them interested in what researchers do, many passionately interested in some specific thing for a wide range of reasons. In a world where the web enables access and communication, and enables those with common interests to find each other, people who are passionately interested in what you are doing will be increasingly unimpressed if no avenues are available for them to follow and contribute. And funders, including those ultimate funders, will be increasingly unimpressed if you don’t effectively tap into that resource.

The need to actively engage with, not at, the wider community as active contributors is shifting the balance of power in research, probably irrevocably. I think that is probably a good thing.


Draft White Paper – Researcher identifiers

[Image: National Science Foundation (NSF) logo, via Wikipedia]

On April 26 I am attending a joint meeting of the NSF and EuroHORCS (European Heads of Research Councils) on “Changing the Conduct of Science in the Information Age”. I have been asked to submit a one page white paper in advance of the meeting and have been struggling a bit with this. This is stage one, a draft document relating to researcher identifiers. I’m not happy with it but reckon that other people out there may well be able to help where I am struggling. I may write a second one on metrics or at least a brief literature collection. Any and all comments welcome.

Summary

Citation lies at the core of research practice, recognizing both the contributions that others have made to the development of a specific piece of work and the links between related pieces of knowledge. The technology of the web makes it technically feasible to radically improve the precision, accuracy, and completeness of these links. Such improvements are crucial to the successful implementation of any system that purports to measure research outputs or quality.

Additionally, the web offers the promise of extended and distributed projects involving diverse parties, many of whom may not be professional researchers. Such contributions nonetheless deserve the same level of recognition as comparable contributions from professional researchers. Providing an open system in which people can contribute to research efforts and receive credit raises significant social issues of control and validation of identities.

Such open and federated systems are exposed to potential failure through lack of technical expertise on the part of users, particularly where a person loses an identity or creates a new one. This is in many ways already the case where we use institutional email addresses as proxies for researcher identity. Is D.C.Neylon@####.ac.uk (a no-longer-functioning email address) the same person as Cameron.Neylon@####.ac.uk? It is technically feasible to build systems in which the identity token itself is widely available and compatible with the wider consumer web, while centralised and trusted authorities provide validation services to confirm specific claims about that identity. Such a broker or clearing house would play a similar role for identities to the one CrossRef plays for scholarly articles via the DOI.

General points

  • By adding the concept of a semantic-web-ready researcher identifier, i.e. an identifier that provides a URL endpoint uniquely representing a specific researcher, the existing technical capacity of the semantic web stack can be used to provide a linked data representation of contributions to published research objects that are themselves accessible at URL endpoints (see the sketch after this list). Such a representation could readily be expanded beyond authorship to funder contributions.
  • In this view, crediting a researcher as a contributor to a specific object published on the web is simply a specific form of citation or linking.
  • The authoring tools to support the linking and publishing of research objects in this form do not currently exist in a widely useable form.
  • Semantic web technology provides an extensible means of adding and recognising diverse forms of contribution.
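
A minimal sketch of the first point, using rdflib with placeholder identifier URLs and standard vocabularies (FOAF, Dublin Core); everything here is illustrative, not a proposed schema.

```python
from rdflib import Graph, URIRef
from rdflib.namespace import DCTERMS, FOAF, RDF

# Hypothetical identifier URLs; any real scheme would work the same way.
person = URIRef("http://people.example.org/id/cameron-neylon")
paper = URIRef("http://dx.doi.org/10.xxxx/example")
funder = URIRef("http://funders.example.org/id/research-council")

g = Graph()
g.add((person, RDF.type, FOAF.Person))
g.add((paper, DCTERMS.creator, person))      # "Person X authored Paper Y"
g.add((paper, DCTERMS.contributor, funder))  # contributions beyond authorship

print(g.serialize(format="turtle"))
```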

Authorisation, validation, and control

  • Access to such identifiers must be under the control of the researcher and not limited to those with institutional affiliations. Any person must be able to obtain and control a unique researcher identifier that refers to them.
  • Authorisation and validation of claimed rights of access or connections to specific institutions can be technically handled separately from the provision of identifiers.

Technical and legal issues

  • OpenID and OAuth are developing internet standards that provide technical means to distribute the availability of identifiers and to separate issues of authentication from those of identification. They are current leaders for federated identity and authorisation solutions on the consumer web.
  • OpenID and OAuth do not currently provide the levels of security required in several jurisdictions for personal or sensitive information (e.g. the UK Data Protection Act). Such federated systems may fall foul of jurisdictions with strong generic privacy requirements, e.g. Canada.
  • To interoperate with the wider web and enable a wider notion of citation as a declaration of a piece of knowledge (“Person X authored Paper Y”), identities must resolve on the web, in the sense of being a clickable hyperlink that takes a human or machine reader to a page containing information representing that person (see the sketch after this list).
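
A sketch of what “resolve on the web” means operationally: the same identifier URL should dereference for both human and machine readers. The URL in the usage note is hypothetical.

```python
import requests

def resolvable(identifier_url: str) -> dict:
    """Check whether an identifier dereferences for humans (HTML) and machines (RDF)."""
    results = {}
    for accept in ("text/html", "application/rdf+xml"):
        r = requests.get(identifier_url, headers={"Accept": accept},
                         allow_redirects=True, timeout=10)
        results[accept] = (r.status_code == 200)
    return results

# e.g. resolvable("http://people.example.org/id/cameron-neylon")
```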

Social issues

  • There are profound social issues of trust in the maintenance of such identifiers, especially for non-professional researchers in the longer term.
  • A centralised trusted authority (or authorities) that validates specific claims about identity (a “CrossRef for people”) might provide a trusted broker for identity transactions in the research space that solves many of these trust problems.
  • Issues around trust and jurisdiction as well as scope and control are likely to limit and fragment any effort to coordinate, federate, or integrate differing identity solutions in the research space. Therefore interoperability of any developed system with the wider web must be a prime consideration.

Conclusions

Identity, unique identifiers, authorisation of access, and validation of claims are issues that need to be solved before any transparent and believable metrics system can be reliably implemented. In the current world, ill-considered, non-transparent, and irreproducible metrics systems will almost inevitably lead to legal claims. At the same time there is a massive opportunity for wider involvement in research, for which a much more diverse range of people’s contributions will need recognition.

A system in which recognition and citation take the form of a link to a specified address on the web that represents a person would not only make it much easier to confirm unambiguously who is being credited, but would also provide the opportunity to leverage an existing stack of tools and services to aggregate and organize information relating to identity. This is in fact a specific example of a wider view of addressable research objects on the web that can be part of a web of linked data. In this view a person is simply another object that can have specified relationships (links) to other objects.

Partial technical solutions in the form of OAuth and OpenID solve some subset of these problems. These systems are not currently secure to a level compatible with handling the transfer of sensitive data, but they can interoperate with more secure transfer systems. They provide a federated and open system that enables any person to obtain and assert an identity and to control the appearance of that identity. Severe social issues around trust and persistence exist for this kind of system; these may be addressed through trusted centralized repositories that can act as reliable brokers.

Given the expected issues with uptake of any single system, interoperability with competitive or complementary offerings is crucial.


The future of research communication is aggregation

[Image: Paper as aggregation, by cameronneylon via Flickr]

“In the future everyone will be a journal editor for 15 minutes” – apologies to Andy Warhol

Suddenly it seems everyone wants to re-imagine scientific communication. From the ACS symposium a few weeks back to a PLoS Forum, via interesting conversations with a range of publishers, funders, and scientists, a lot of people are thinking much more seriously about how to make scientific communication more effective, more appropriate to the 21st century, and, above all, about how to take more advantage of the power of the web.

For me, the “paper” of the future has to encompass much more than the narrative descriptions of processed results we have today. It needs to support a much more diverse range of publication types: data, software, processes, protocols, and ideas. It needs to provide a rich and interactive means of diving into the detail where the user is interested and skimming over the surface where they are not. It needs to provide re-visualisation and streaming under the user’s control, and crucially it needs to provide the ability to repackage the content for new purposes: education, public engagement, even mainstream media reporting.

I’ve got a lot of mileage recently out of thinking about how to organise data and records by ignoring the actual formats and thinking more about what the objects I’m dealing with are, what they represent, and what I want to do with them. So what do we get if we apply this thinking to the scholarly published article?

For me, a paper is an aggregation of objects. It contains text, divided into sections, often with references to other pieces of work. Some of these references are internal, to figures and tables, which are representations of data in some form or another. The paper world of journals has led us to think of these as images, but a much better mental model for figures on the web is the embedded object, perhaps a visualisation from a service like Many Eyes, Swivel, or Tableau Public. Why is this better? Because it maps more effectively onto what we want to do with a figure. We want to use it to absorb the data it represents, and to do this we might want to zoom, pan, re-colour, or re-draw the data. But if we do this we want to know that we are using the same underlying data, so the data needs a home, an address somewhere on the web, perhaps with the journal or perhaps somewhere else entirely, that we can refer to with confidence.

If that data has an individual identity it can in turn refer back to the process used to generate it, perhaps in an online notebook or record, perhaps pointing to a workflow or software process based on another website. Maybe when I read the paper I want that included, maybe when you read it you don’t – it is a personal choice, but one that should be easy to make. Indeed, it is a choice that would be easy to make with today’s flexible web frameworks if the underlying pieces were available and represented in the right way.

The authors of the paper can also be included by reference to a unique identifier. Perhaps the authors of the different segments are different; this is no problem, as each piece can refer to the people who generated it. Funders and other supporting players might be included by reference as well. Again this solves a real problem of today: different players are interested in how people contributed to a piece of work, not just who wrote the paper. A reference to a person, where the link shows what their contribution was, can provide this much more detailed information. Finally, the overall aggregation of pieces that is brought together and published also has a unique identifier, often in the form of the familiar DOI.

This view of the paper is interesting to me for two reasons. The first is that it natively supports a wide range of publication or communication types, including data papers, process papers, protocols, ideas, and proposals. If we think of publication as the act of bringing a set of things together and providing them with a coherent identity, then a publication can be many things with many possible uses. In a sense this is doing what a traditional paper should do, bringing all the relevant information together in a single place, as opposed to what papers usually do, ticking a set of boxes about what a paper is supposed to look like. “Is this publishable?” is an almost meaningless question on the web: of course it is. “Is it a paper?” is the question we are actually asking. By applying the principles of what the paper should be doing, as opposed to the straitjacket of a paginated, print-based document, we get much more flexibility.

The second aspect I find exciting revolves around the idea of citation as both internal and external references describing the relationships between these individual objects. If the whole aggregation has an address on the web via a DOI or a URL, and if its relationships both to the objects that make it up and to other things on the web are made clear in machine-readable citations, then we have the beginnings of a machine-readable scientific web of knowledge. If we take this view of objects and aggregates that cite each other, and we provide details of what the citations mean (this was used in that, this process created that output, this paper is cited as an input to that one), then we are building the semantic web as a byproduct of what we want to do anyway. Instead of scaring people with angle brackets we are using a paradigm that researchers understand and respect, citation, to build up meaningful links between packages of knowledge. We need authoring tools that help us build and aggregate these objects, and tools that make forming these citations easy and natural by using existing ideas around linking and referencing; but if we can build those, we get the semantic web for science as a free side product, while also making it easier for humans to find the details they are looking for.
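
As a sketch of such machine-readable, typed citations (the addresses are placeholders), existing vocabularies such as the Citation Typing Ontology (CiTO) already express “uses data from” and “builds on” style relationships:

```python
from rdflib import Graph, Namespace, URIRef

CITO = Namespace("http://purl.org/spar/cito/")  # Citation Typing Ontology

paper = URIRef("http://example.org/paper/1")    # hypothetical addresses
earlier = URIRef("http://example.org/paper/0")
dataset = URIRef("http://example.org/data/42")

g = Graph()
g.add((paper, CITO.obtainsBackgroundFrom, earlier))  # cited as input/background
g.add((paper, CITO.usesDataFrom, dataset))           # "this was used in that"

print(g.serialize(format="turtle"))
```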

Finally this view blows apart the monolithic role of the publisher and creates an implicit marketplace where anybody can offer aggregations that they have created to potential customers. This might range from a high school student putting their science library project on the web through to a large scale commercial publisher that provides a strong brand identity, quality filtering, and added value through their infrastructure or services. And everything in between. It would mean that large scale publishers would have to compete directly with the small scale on a value-for-money basis and that new types of communication could be rapidly prototyped and deployed.

There are a whole series of technical questions wrapped up in this view, in particular if we are aggregating things that are on the web, how did they get there in the first place, and what authoring tools will we need to pull them together. I’ll try to start on that in a follow-up post.


A letter to my MP

For those not in the UK this will probably be a little parochial. Don Foster is my local MP in Bath. The Digital Economy Bill, currently going through the “wash-up” process triggered by yesterday’s announcement of a general election, has drawn extensive criticism from most of the British technology community. Last night an unprecedented number of people followed its second reading on the BBC and via Twitter. As this is explicitly political, please read the disclaimer on this one.

Dear Don Foster,

I am writing firstly to commend you for your attendance at the Digital Economy Bill second reading last night. I was one of thousands, perhaps tens of thousands, of people watching the reading unfold on Twitter. By now some MPs and party strategists are perhaps digesting what happened, but I wished to pick out a few things that seemed especially relevant in the context of a general election.

This was the first real exposure of many of those watching to the internal functioning of the house: a large community of highly engaged people motivated to watch, listen, or follow blow-by-blow descriptions of exactly how the debate proceeded. The almost universal reaction was one of abject horror.

Representative democracy bases its existence on the assumption that the full community cannot be effectively involved in informed and considered criticism of proposed bills, and that it is therefore of value to place some buffer between raw, probably ill-informed public opinion and actual decision making. This presumes that MPs, particularly party spokespersons, take the time to become expert on the matters of the bills they represent. By contrast, what we saw last night was a minute-by-minute dissection by well-informed people outside parliament of what, with a small number of honourable exceptions, totally uninformed people within parliament were saying.

The placement of copyright infringement alongside theft (Afriye, Timms, Wishart) displays a fundamental lack of understanding of the UK legal system, particularly the distinction between civil and criminal law and between property and monopoly rights. These are not things well understood by the public, but they are things the public has a right to expect parliamentarians to educate themselves about, as they go to the heart of what the bill is about. These points were dissected and rebutted instantly online, only to be repeated uncritically in the house.

The idea that the bill has any chance at all of reducing illegal filesharing by 70% is laughable, as is the idea that “technical measures” can protect public WiFi against unfair takedown notices. Finally, the notion that the “creative industries” are suffering, when they have taken record profits and their own research shows that illegal file sharers are their biggest customers, needs to be put to parliament.

But the UK’s real creative industry was out on Twitter last night: the people whose livelihoods depend on a free and working internet, who work as sole traders or in small companies; the people who will create the media of the 21st century; the people who will bring the UK out of recession. They were out in force, and while we disagree passionately about the details of copyright and intellectual property rights and how they should best be applied, there was one united voice in the wish that the Digital Economy Bill in its current form be buried.

Particular horror was reserved online for those MPs who stated clearly that the process of the bill’s progress was unacceptable. Something so important has had so little scrutiny, and something so controversial has been placed in the wash-up process. Member after member stood up to say the bill and its progress were flawed, dangerous, and “appalling”, but that they would nonetheless “reluctantly” support it.

Finally, I would note that while you were present yourself, the absence of other Liberal Democrats from the house was widely remarked upon. This is a natural constituency for your party; indeed, Bath has a vibrant technology community, as you are no doubt aware. I hope your party strategists have seen the damage that was done last night, and I hope they draw the logical conclusion. If the Liberal Democrats turn out in force tonight and bury this bill at the third reading, it will make a difference to your electoral results. If you want a hung parliament, this is the way to get it.

Yours sincerely,

Cameron Neylon

p.s. I will be posting this letter publicly on my blog at http://cameronneylon.net. Please feel free to reply or comment there. I hope you will give me permission to publish any reply you make in a similar form.


The personal and the institutional

[Image: Twittering and microblogging not permitted, by cameronneylon via Flickr]

A number of things recently have led me to reflect on the nature of interactions between social media, research organisations, and the wider community. An awful lot has been written about the effective use of social media by organisations and the risks involved in trusting staff and members of an organisation to engage productively and positively with a wider audience. Above all there seems to be a real focus on the potential for people to embarrass the organisation. Relatively little attention is paid to the ability of the organisation to embarrass its staff, but that is perhaps a subject for another post.

In the area of academic research this takes on a whole new hue due to the presence of a strong principle and community expectation of free speech: “academic freedom”. No one really knows what academic freedom is. It is one of those things that people can’t define but will be very clear about when it has been taken away. In general terms it is the expectation that a tenured academic has earned the right to speak their opinion, regardless of how controversial. We can accept there are some bounds on this, of ethics, taste, and legality – racism would generally be regarded as unacceptable – while noting that the boundary between what is socially unacceptable and what is a validly held and supported academic opinion is both elastic and almost impossible to define. Try expressing the opinion, for example, that there might be a biological basis to the difference between men’s and women’s average scores on a specific maths test. These grey areas, how the academy (or academies) censor themselves, are interesting but aren’t directly relevant to this post. Here I am more interested in how institutions censor their staff.

Organisations always seek to control the messages they release to the wider community. The first priority of any organisation or institution is its own survival. This is not necessarily a bad thing – presumably the institution exists because it is (or at least was) the most effective way of delivering a specific mission, and if it ceases to exist, that mission can’t be delivered. Controlling the message is a means of controlling others’ reactions and hence the future. Research institutions have always struggled with this: the corporate centre sends out a message of clear vision, high standards, and continuous positive development, while the academics mutter in the privacy of their own coffee room about creeping bureaucracy, lack of resources, and falling standards.

There is fault on both sides here. Research administration and support only rarely put the needs and resources of academics at their centre. Time and time again the layers of bureaucracy mean that what may or may not have been a good idea gets buried in a new set of unconnected paperwork, that more administration is required, taking resources away from frontline activities, and that target setting results in target meeting, but at the cost of what was important in the first place. There is usually a fundamental lack of understanding of what researchers do and what motivates them.

On the other side, academics are arrogant and self-absorbed, rarely interested in contributing to the solution of larger problems. They fail to understand, or take any interest in, the corporate obligations of the organisations that support them, and will only rarely cooperate and compromise to find solutions to problems. Worse, academics build social and reward structures that encourage this kind of behaviour: promoting individual achievement rather than that of teams, penalising people for accepting compromises, and rarely rewarding the key positive contribution of effective communication and problem solving between the academic side and administration.

What the first decade of the social web has taught us is that organisations that effectively harness the goodwill of their staff or members using social media tools do well. Organisations that use Twitter or Facebook effectively enable and encourage their staff to take the shared organisational values out to the wider public. Enable your staff to take responsibility and respond rapidly to issues, make it easy to identify the right person to engage with a specific issue, and admit (and fix) mistakes early and often: that is the advice you can get from any social media consultant. Bring the right expert attention to bear on a problem and solve it collaboratively, whether it’s internal or with a customer. This is another variation on Michael Nielsen’s writing on markets in expert attention: the organisations that build effective internal markets and apply the added value to improving their offering will win.

This approach is antithetical to traditional command-and-control management structures. It implies fluidity and a lack of direct control over people’s time. It also requires slack in the system, something that doesn’t sit well with efficiency drives. In its extreme form it removes the need for the organisation to formally exist at all, allowing free agents to interact fluidly in a market for their time. What it does do is map very well onto a rather traditional view of how the academy is “managed”. Academics provide a limited resource, their time, and apply it to a large extent according to what they think is important. Management structures are in practice fairly flat (and used to be much more so) and interactions are driven more by interests and personal whim than by widely accepted corporate objectives. Research organisations, and perhaps by extension the commercial interests that interact most directly with them, should be ideally suited to harnessing the power of the social web, first to solve their internal problems and second to interact more effectively with their customers and stakeholders.

Why doesn’t this happen? For a variety of reasons, some of them the usual suspects: a lack of adoption of new tools by academics, appalling IT procurement procedures and poor standards of software development, a simple lack of time to develop new approaches, and a real lack of appreciation of the value that diverse contributions can bring to a successful department and organisation. The biggest reason, though, I suspect, is a lack of goodwill between administrations and academics. Academics will not adopt any tool en masse across a department, let alone an organisation, because they are naturally suspicious of the agenda and competence of those choosing the tools; and the diversity of tools they choose on their own means that none gains critical mass within the organisation – few academic institutions had a useful global calendar system until very recently. Administrations don’t trust the herd of cats that make up their academic staff to engage productively with the problems they face, and see the need for a technical solution with a critical mass of users, and therefore a central decision.

The problems of both diversity and lack of critical mass are a solid indication that the social web has some way to mature – these conversations should occur effectively across different tools and frameworks – and uptake at research institutions should (although it may seem paradoxical) be expected to be much slower than in more top-down, managed organisations, or at least organisations with a shared focus. But it strikes me that the institutions that get this right, and they won’t be the traditional top institutions, will very rapidly accrue a serious advantage, both in terms of freeing up staff time to focus on core activities and in releasing real monetary resource to support those activities. If the social side works, then the resource will also go to the right place. Watch for academic institutions trying to bring strong social media experience into senior management. It will be a very interesting story to follow.


“Friendfeeds for Science” pt II – Design ideas for a research focussed aggregator

Who likes me on friendfeed?
Image by cameronneylon via Flickr

This post, while only 48 hours old, is somewhat outdated by these two Friendfeed discussions. It was written independently of those discussions, so it seemed worth putting it out in its original form rather than spending too much time rewriting.

I wrote recently about Sciencefeed, a Friendfeed-like system aimed at scientists, and was fairly critical. I also promised to write about what I thought a “Friendfeed for Researchers” should look like. To do this we need to think about what Friendfeed, and other services including Twitter, Facebook, and Posterous, are used for and what else they could do.

Friendfeed is an aggregator that enables, as I have written before, an “object-centric” mode of interaction around shared items. As Alan Cann has pointed out, this is not the only thing it does: it also enables the person-centric interactions that I see as more typical of Facebook and Twitter. Enabling both is important, as is the realisation that all of these systems need to interoperate effectively with each other, something which is still evolving. But core to the development of something that works for researchers is that standard research objects, and particularly papers, need to be first class objects: author lists, one click to full text, one click to bookmark to my library.

Functionality addition 1: Treat research objects as first class citizens with special attention; start with journal papers and support for Citeulike/Zotero/Mendeley etc.
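
As a very rough sketch of what “first class” might mean in practice (an illustration, not a design), a paper could carry its own structured metadata so that author lists, the link to full text, and bookmarking hooks are native rather than scraped from a title string. All the names here – ResearchObject, bookmark_to_library, and so on – are hypothetical:

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Person:
    name: str
    user_id: Optional[str] = None  # None until (and unless) the person joins the service

@dataclass
class ResearchObject:
    """A paper treated as a first class citizen, not just a URL with a title."""
    title: str
    authors: List[Person]  # the full author list, displayed natively
    doi: str               # resolves in one click to the full text
    full_text_url: str
    submitted_by: str      # the user who brought the object into the service

    def bookmark_to_library(self, user_id: str, service: str) -> dict:
        # One-click push to a Citeulike/Zotero/Mendeley style library;
        # a real implementation would call each service's own API.
        return {"user": user_id, "service": service, "doi": self.doi}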

On top of this, Friendfeed is a community, or rather several interlinked communities, each with its own traditions, standards, and expectations, supported to a greater or lesser extent by the functionality of rooms, search, hiding, and administration found within Friendfeed. Any new service needs to understand and support these expectations.

Friendfeed also doesn’t do some things. It is not terribly effective as a bookmarking tool, nor very good as a tool for identifying and mining objects or information that is more than a few days old, although paradoxically it has served quite well as a means of archiving tweets and exposing them to search engines. The idea of a tool that surfaces objects to Google is an interesting one, and one we could take advantage of. Granularity of sharing is also limited: what if I want slidesets to be public but tweets to be a private feed? Or to collect different feeds under different headings for different communities – public, domain-specific, and only for the interested specialist?
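
Granular sharing could be as simple as a per-feed visibility setting rather than one account-wide switch. A minimal sketch, with all the configuration names invented for illustration:

# Hypothetical per-feed sharing settings: each incoming feed gets its own
# audience instead of one global public/private switch for the whole account.
sharing_config = {
    "slideshare": {"visibility": "public"},
    "twitter":    {"visibility": "private"},
    "blog":       {"visibility": "group", "groups": ["domain-specialists"]},
}

def visible_to(feed: str, viewer_groups: set) -> bool:
    """Decide whether a viewer sees items from a given feed."""
    setting = sharing_config.get(feed, {"visibility": "private"})
    if setting["visibility"] == "public":
        return True
    if setting["visibility"] == "group":
        return bool(viewer_groups & set(setting["groups"]))
    return False  # private feeds stay with their owner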

Finally, Friendfeed doesn’t have a very sophisticated karma system. While likes and comments will keep bringing specific objects (and by extension the people who have brought them in) into your attention stream, there is none of the filtering power enabled by tools like StackOverflow. Whether such a thing is something we would want is an interesting question, but it has the potential to enable much more sophisticated filtering and curation of content. StackOverflow itself has an interesting limitation as well: there is only one rank order of answers; I can’t choose to privilege the upmods of one specific curator over another. I certainly can’t choose to order my stream based on a person’s upmods but not their downmods.
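
To make that limitation concrete, here is a toy sketch of the kind of filtering Friendfeed doesn’t offer: ordering my stream by the upmods of curators I have chosen to trust, while ignoring their downmods entirely. The weighting scheme is invented purely for illustration:

def stream_score(item_votes, trusted_curators, weights=None):
    """Score one item from its list of (curator, vote) pairs, where vote is
    +1 (upmod) or -1 (downmod). Only upmods from curators I trust count,
    so nobody's downmods can bury an item in my stream."""
    weights = weights or {}
    score = 0.0
    for curator, vote in item_votes:
        if curator in trusted_curators and vote > 0:
            score += weights.get(curator, 1.0)  # optionally privilege one curator
    return score

# Order a stream by my own choice of curators rather than one global rank:
# items.sort(key=lambda i: stream_score(i.votes, {"alice", "bob"}), reverse=True)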

A user on Friendfeed plays three distinct roles: content author, content curator, and content consumer. Different people will emphasise different roles, from the pure broadcaster to the pure reader who never interacts. The real added value comes from the curation role, and in particular from enabling granular filtering based on your choice of curators. Curation comes in the form of choosing to push content to Friendfeed from outside services, from “likes”, and from commenting. Commenting is both curation and authoring, providing context as well as new information or opinion. Supporting and validating this activity will be important: whatever choice is made around “liking” or StackOverflow-style up- and down-modding needs to apply to comments as well as objects.

Functionality addition 2: Enable rating of comments and by extension, the people making them
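
One way to guarantee this is to hang the voting machinery off a shared base, so that a comment is simply another rateable thing that happens to have an author. A sketch, with hypothetical names:

class Rateable:
    """Anything that can accumulate likes or up/down-mods: papers, posts,
    and crucially the comments made on them."""
    def __init__(self):
        self.votes = []  # list of (user_id, vote) pairs, vote is +1 or -1

    def rate(self, user_id, vote):
        self.votes.append((user_id, vote))

class Comment(Rateable):
    def __init__(self, author_id, text):
        super().__init__()
        self.author_id = author_id  # commenting is authoring as well as curation
        self.text = text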

If reputation gathering is to be useful in driving filtering functionality, as I have suggested, we will need good ways of separating content authoring from curation. One thing that really annoys me is seeing an interesting title and a friendly avatar on Friendfeed and clicking through to find something written by someone else. Not because I don’t want to read something written by someone else, but because my decision to click through was based on assumptions about who the author was. We need to support a strong culture of citation and attribution in research. A Friendfeed for research will need to clearly mark the distinction between who has brought an object into the service, who has curated it, and who authored it. All of these roles should be valued, but they should be measured separately.

Functionality addition 3: Clearly designate authors and curators of objects brought into the stream. Possibly enable these activities to be rated separately?
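
In data terms this is mostly a matter of recording the roles separately on each object and crediting them to separate reputation ledgers. Continuing the sketch above, with the function and variable names again invented:

from collections import defaultdict

# Separate ledgers so that authoring and curating are measured apart.
author_reputation = defaultdict(int)
curator_reputation = defaultdict(int)

def person_key(person):
    # Non-users can accumulate reputation too, keyed on whatever stable
    # identifier is available (ideally a unique author identifier).
    return person.user_id or person.name

def credit_upvote(obj):
    """Credit a single up-vote on an object to the right ledgers:
    the authors wrote it; the submitter surfaced and curated it."""
    for person in obj.authors:
        author_reputation[person_key(person)] += 1
    curator_reputation[obj.submitted_by] += 1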

If we recognise a role of author, distinct from the user’s curation activity, we can also enable the rating of people and objects that don’t belong to users. This would allow researchers who are not users to build up reputation within the system, and it has the potential to solve the “ghost town” phenomenon that plagues most science social networking sites. A new user could claim the author role for objects that were originally brought in by someone else. This would immediately connect them with other people who have commented on their work, and provide them with a reputation that can be further built upon by taking on curation activities.

This is a sensitive area, holding information on people without their knowledge, but it is something already done across indexing services, aggregation services, and chat rooms. The use of karma in this context would need to be very carefully thought out, and whether it would be made available within or outside the system would be an important question to tackle.

Functionality addition 4: Collect reputation and comment information for authors who are not users to enable them to rapidly connect with relevant content if they choose to join.
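
Claiming could amount to little more than re-keying previously accumulated authorship records to the new account, which immediately surfaces the existing conversations around that person’s work. A sketch under the same invented data model, assuming each object also carries its list of comments:

def claim_authorship(new_user_id, claimed_key, objects):
    """When a new user shows they are the author behind `claimed_key`,
    attach their account to every matching object and carry over the
    reputation accumulated while they were still a non-user."""
    connected = set()
    for obj in objects:
        for person in obj.authors:
            if person_key(person) == claimed_key:
                person.user_id = new_user_id
                # Connect the new user to everyone already discussing their work
                connected.update(c.author_id for c in obj.comments)
    author_reputation[new_user_id] += author_reputation.pop(claimed_key, 0)
    return connected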

Finally there is the question of interacting with this content and filtering it through the rating systems that have been created. The UI issues here are formidable, but there is a clear need to enable different views: a streaming view, more static views of content a user has collected over long periods, and search. There is probably enough in those issues for another whole post.

Summary: Overall, for me the key to building a service that takes inspiration from Friendfeed but delivers more functionality for researchers, while not alienating a wider potential user base, is to build a tool that enables and supports curation, rating, and granular filtering of content. Authorship is key, as are quantitative measures of value and personal relevance that will enable users to build their own view of the content they are interested in, to collect it for themselves, and to continue to curate it, either on their own or in collaboration with others.
