Eduserv Symposium 2009 – Evolution or revolution: The future of identity and access management for research

I am speaking at the Eduserv Symposium in London in late May on the subject of the importance of identity systems for advancing the open research agenda.

From the announcement:

The Eduserv Symposium 2009 will be held on Thursday 21st May 2009 at the Royal College of Physicians, London. More details about the event are available from: http://www.eduserv.org.uk/events/esym09

This symposium will be of value to people with an interest in the impact that the social Web is having on research practice and scholarly communication and the resulting implications for identity and access management. Attendees will gain an insight into the way the research landscape is evolving and will be better informed when making future decisions about policy or practice in this area.

Confirmed speakers include: James Farnhill, Joint Information Systems Committee; Nate Klingenstein, Internet2; Cameron Neylon, Science and Technology Facilities Council; Mike Roch, University of Reading; David Smith, CABI; John Watt, National e-Science Centre (Glasgow).

I’ve just written the abstract and title for my talk and will, in time-honoured fashion, be preparing the talk “just in time”, so I will try to make it as up to the minute as possible. Any comments or suggestions are welcome, and the slides will be available on Slideshare as soon as I have finalized them (probably just after I give the talk…)

Oh, you’re that “Cameron Neylon”: Why effective identity management is critical to the development of open research

There is a growing community developing around the need to make the outputs of research available more efficiently and more effectively. This ranges from efforts to improve the quality of data presentation in published peer reviewed papers through to efforts where the full record of research is made available online, as it is recorded. A major fear as more material goes online in different forms is that people will not receive credit for their contribution. The recognition of researchers’ contributions has always focussed on easily measurable quantities. As the diversity of measurable contributions increases there is a growing need to aggregate the contributions of a specific researcher together in a reliable and authoritative way. The key to changing researcher behaviour lies in creating a reward structure that acknowledges their contribution and allows them to be effectively cited. Effective mechanisms for uniquely identifying researchers are therefore at the heart of constructing reward systems that support an approach to research that fully exploits the communication technologies available to us today.

Capturing the record of research process – Part II

So in the last post I got all abstract about what the record of process might require and what it might look like. In this post I want to describe a concrete implementation that could be built with existing tools. What I want to think about is the architecture that is required to capture all of this information and what it might look like.

The example I am going to use is very simple. We will take some data and do a curve fit to it. We start with a data file, which we assume we can reference with a URI, and load it up into our package. That’s all, keep it simple. What I hope to start working on in Part III is to build a toy package that would do that and maybe fit some data to a model. I am going to assume that we are using some sort of package that utilizes a command line because that is the most natural way of thinking about generating a log file, but there is no reason why a similar framework can’t be applied to something using a GUI.

Our first job is to get our data. This data will naturally be available via a URI, properly annotated and described. In loading the data we will declare it to be of a specific type, in this case something that can be represented as two columns of numbers. So we have created an internal object that contains our data. Assuming we are running some sort of automatic logging program on our command line our logfile will now look something like:
> start data_analysis
...loading data_analysis
...Version 0.1
...Package at: http://mycodeversioningsystem.org/myuserid/data_analysis/0.1
...Date is: 01/01/01
...Local environment is: Mac OS10.5
...Machine: localhost
...Directory: /usr/myuserid/Documents/Data/some_experiment
> data = load_data(URI)
...connecting to URI
...found data
...created two column data object "data"
...pushed "data" to http://myrepository.org/myuserid/new_object_id
..."data" aliased to http://myrepository.org/myuserid/new_object_id

Those last couple of lines are important because we want all of our intermediates to be accessible via a URI on the open web. The load_data routine will include the pushing of the newly created object in some usable form to an external repository. Existing services that could provide this functionality include a blog or wiki with an API, a code repository like GitHub, GoogleCode, or SourceForge, an institutional or disciplinary repository, or a service like MyExperiment.org. The key thing is that the repository must then expose the data set in a form which can be readily extracted by the data analysis tool being used. The tool then uses that publicly exposed form (or an internal representation of the same object for offline work).
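As a very rough illustration, a load_data routine along these lines might look like the following Python sketch. The log format mirrors the example above, but the push_to_repository helper, the file names, and the repository URI are all invented placeholders rather than any real service API:

import urllib.request

LOGFILE = "analysis.log"

def log(message):
    # append a human readable line to the running log file
    with open(LOGFILE, "a") as f:
        f.write("..." + message + "\n")

def push_to_repository(data):
    # stand-in for a call to a real repository API (blog, GitHub, institutional repository, ...);
    # here we just write the object locally and return a made-up URI for it
    with open("data_object.txt", "w") as f:
        for x, y in data:
            f.write(f"{x}\t{y}\n")
    return "http://myrepository.org/myuserid/new_object_id"

def load_data(uri):
    log("connecting to " + uri)
    raw = urllib.request.urlopen(uri).read().decode("utf-8")
    log("found data")
    # interpret the file as two columns of numbers
    data = [tuple(float(v) for v in line.split()) for line in raw.splitlines() if line.strip()]
    log('created two column data object "data"')
    new_uri = push_to_repository(data)
    log('pushed "data" to ' + new_uri)
    log('"data" aliased to ' + new_uri)
    return data, new_uri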

At the same time a script file is being created that, if run within the correct version of data_analysis, should generate the same results.
# Script Record
# Package: http://mycodeversioningsystem.org/myuserid/data_analysis/0.1
# User: myuserid
# Date: 01/01/01
# System: Mac OS 10.5
data = load_data(URI)

The script might well include some system scripting that would attempt to check whether the correct environment (e.g. Python) for the tool is available, and to download and start up the tool itself if the script is directly executed from a GUI or command line environment. The script does not care what the new URI created for the data object was because when it is re-run it will create a new one. The script should run independently of any previous execution of the same workflow.
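For example, a preamble for the generated script might, as a minimal sketch, check for the right environment and fetch the tool before replaying the commands. The package name, URL, and install mechanism here are assumptions for illustration rather than anything the toy package actually does:

# hypothetical preamble for the generated script: check the environment,
# fetch the right version of the tool if it is missing, then replay the commands
import importlib
import subprocess
import sys

PACKAGE = "data_analysis"
PACKAGE_URL = "http://mycodeversioningsystem.org/myuserid/data_analysis/0.1"  # from the script header

if sys.version_info < (3,):
    sys.exit("this script assumes Python 3 or later")

try:
    data_analysis = importlib.import_module(PACKAGE)
except ImportError:
    # exactly how the tool is fetched would depend on how it is distributed;
    # installing from the versioned URL with pip is one plausible option
    subprocess.check_call([sys.executable, "-m", "pip", "install", PACKAGE_URL])
    data_analysis = importlib.import_module(PACKAGE)

data, new_uri = data_analysis.load_data("URI")  # the original command, re-run from scratch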

Finally there is the graph. What we have done so far is to take one data object and convert it to a new object which is a version of the original. That is then placed online to generate an accessible URI. We want our graph to assert that http://myrepository.org/myuserid/new_object_id is a version of URI (excuse my probably malformed RDF).

<data_analysis:data_object
    rdf:about= "http://myrepository.org/myuserid/new_object_id">
  <data_analysis:data_type>two_column_data</data_analysis:data_type>
  <data_analysis:generated>
    <data_analysis:generated_from rdf:resource="URI"/>
    <data_analysis:generated_by_command>load_data</data_analysis:generated_by_command>
    <data_analysis:generated_by_version rdf:resource="http://mycodeversioningsystem.org/myuserid/data_analysis/0.1"/>
    <data_analysis:generated_in_system>Mac OS 10.5</data_analysis:generated_in_system>
    <data_analysis:generated_by rdf:resource="http://myuserid.name"/>
    <data_analysis:generated_on_date dc:date="01/01/01"/>
  </data_analysis:generated>
</data_analysis:data_object>

Now this is obviously a toy example. It is relatively trivial to set up the data analysis package so as to write out these three different types of descriptive files. Each time a step is taken, that step is then described and appended to each of the three descriptions. Things will get more complicated if a process requires multiple inputs or generates multiple outputs, but this is only really a question of setting up a vocabulary that makes reasonable sense. In principle multiple steps can be collapsed by combining a script file and the RDF as follows:

<data_analysis:generated_by_command
    rdf:resource="http://myrepository/myuserid/location_of_script"/>

I don’t know anything much about theoretical computer science but it seems to me that any data analysis package that works through a stepwise process running previously defined commands could be described in this way. And given that this is how computer programs run, this suggests that any data analysis process could be logged this way. The package obviously has to be implemented to write out the files, but in many cases this may not even be too hard. Building it in at the beginning is obviously better. The hard part is building vocabularies that make sense locally and are specific enough but are appropriately wired into wider and more general vocabularies. It is obvious that the reference to data_analysis:data_type = “two_column_data” above should probably point to some external vocabulary that describes generic data formats and their representations (in this case probably a Python pickled two column array). It is less obvious where that should be, or whether something appropriate already exists.
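To make the idea of writing out the graph concrete, something like the fragment above could be generated programmatically as each command runs. The following is a minimal sketch using the rdflib Python library; the vocabulary namespace and the property names are simply lifted from the toy example above, not from any existing ontology:

from rdflib import Graph, Literal, Namespace, URIRef

DA = Namespace("http://example.org/data_analysis/vocab#")  # hypothetical local vocabulary
DC = Namespace("http://purl.org/dc/elements/1.1/")

g = Graph()
g.bind("data_analysis", DA)
g.bind("dc", DC)

def record_step(output_uri, input_uri, command, version_uri, system, user_uri, date):
    # append one step of the process to the growing graph
    out = URIRef(output_uri)
    g.add((out, DA.data_type, Literal("two_column_data")))
    g.add((out, DA.generated_from, URIRef(input_uri)))
    g.add((out, DA.generated_by_command, Literal(command)))
    g.add((out, DA.generated_by_version, URIRef(version_uri)))
    g.add((out, DA.generated_in_system, Literal(system)))
    g.add((out, DA.generated_by, URIRef(user_uri)))
    g.add((out, DC.date, Literal(date)))

record_step("http://myrepository.org/myuserid/new_object_id",
            "http://example.org/original/data",
            "load_data",
            "http://mycodeversioningsystem.org/myuserid/data_analysis/0.1",
            "Mac OS 10.5",
            "http://myuserid.name",
            "01/01/01")

g.serialize(destination="process_graph.rdf", format="xml")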

This then provides a clear set of descriptive files that can be used to characterise a data analysis process. The log file provides a record of exactly what happened, is reasonably human readable, and can be hacked using regular expressions if desired. There is no reason in principle why this couldn’t be in the form of an XML file with a style sheet appropriate for human readability. The script file provides the concept of what happened as well as the instructions for repeating the process. It could usefully be compared to a plan, which would look very similar but might have informative differences. The graph is a record of the relationships between the objects that were generated. It is machine readable and can additionally be used to automate the reproduction of the process, but at heart it is a record of what happened.

The graph is immensely powerful because it can be ultimately used to parse across multiple sets of data generated by different versions of the package and even completely different packages used by different people (provided the vocabularies have some common reference). It enables the comparison of analyses carried out in parallel by different people.

But what is most powerful about the idea of an RDF-based graph file of the process is that it can be automated and completely hidden from the user. The file may be presented to the user in some pleasant and readable form but they need never know they are generating RDF. The process of wiring the data web up, and the following process of wiring up the web of things in experimental science, will rest on having the connections captured from and not created by the user. This approach seems to provide a way towards making that happen.

What does this tell us about what a data analysis tool should look like? Well, ideally it will be open source, but at a minimum there must be a set of versions that can be referenced. Ideally these versions would be available on an appropriate code repository configured to enable an automatic download. They must provide, at a minimum, a log file, and preferably both script and graph versions of this log (in principle the script can be derived from either of the other two, which can in turn be derived from each other; the log and graph can’t be derived from the script). The local vocabulary must be available online and should preferably be well wired into the wider data web. The majority of this should be trivial to implement for most command line driven tools and not terribly difficult for GUI driven tools. The more complicated aspects lie in the pushing out of intermediate objects and the finalized logs onto appropriate online repositories.

A range of currently available services could play these roles, from code repositories such as SourceForge and GitHub, through to the Internet Archive, and data and process repositories such as MyExperiment.org and Talis Connected Commons, or to locally provided repositories. Many of these have sensible APIs and/or REST interfaces that should make this relatively easy. For new analysis tools this shouldn’t be terribly difficult to implement. Implementing it in existing tools could be more challenging but not impossible. It’s a question of will rather than severe technical barriers as far as I can see. I am going to start trying to implement some of this in a toy data fitting package in Python, which will be hosted at GitHub, as soon as I get some specific feedback on just how bad that RDF is…

Capturing the record of research process – Part I

When it comes to getting data up on the web, I am actually a great optimist. I think things are moving in the right direction and, with high profile people like Tim Berners-Lee making the case, with the meme of “Linked Data” spreading, and with a steadily improving set of tools and interfaces that make all the complexities of RDF, OWL, OAI-ORE, and other standards disappear for the average user, there is a real sense that this might come together. It will take some time; certainly years, possibly a decade, but we are well along the road.

I am less sanguine about our ability to record the processes that lead to that data or the process by which raw data is turned into information. Process remains a real recording problem. Part of the problem is just that scientists are rather poor at recording process in most cases, often assuming that the data file generated is enough. Part of the problem is the lack of tools to do the recording effectively, which is the reason why we are interested in developing electronic laboratory recording systems.

Part of it is that the people who have been most effective at getting data onto the web tend to be informaticians rather than experimentalists. This has led in a number of cases to very sophisticated systems for describing data, and for describing experiments, that are a bit hazy about what it is they are actually describing. I mentioned this when I talked about Peter Murray-Rust and Egon Willighagen’s efforts at putting UsefulChem experiments into Chemical Markup Language. While we could describe a set of results, we got awfully confused because we weren’t sure whether we were describing a specific experiment, the average of several experiments, or a description of what we thought was happening in the experiments. This is not restricted to CML. Similar problems arise with many of the MIBBI guidelines, where the process of generating the data gets wrapped up with the data file. None of these things are bad, it is just that we may need some extra thought about distinguishing between classes and instances.

In the world of linked data, process can be thought of either as a link between objects, or as an object in its own right. The distinction is important. A link between, say, a material and data files points to a vocabulary entry (e.g. PCR, HPLC analysis, NMR analysis). That is, the link creates a pointer to the class of a process, not to a specific example of carrying out that process. If the process is an object in its own right then it refers reasonably clearly to a specific instance of that process (it links this material, to that data file, on this date) but it breaks the links between the inputs and outputs. While RDF can handle this kind of break by following the links through, it creates the risk of a rapidly expanding vocabulary to hold all the potential relationships to different processes. There are other problems with this view. What is the object that represents the process? Is it a log of what happened? Or is it the instructions to transform inputs into outputs? Both are important for reproducibility but they may be quite different things. They may even in some cases be incompatible, if for some reason the specific experiment went off piste (power cut, technical issues, or simply something that wasn’t recorded at the time, but turned out to be important later).

The problem is we need to think clearly about what it is we mean when we point at a process. Is it merely a nexus of connections, things coming out and things going in? Is it the log of what happened? Or is it the set of instructions that lets us convert these inputs into those outputs? People who have thought about this from a vocabulary or informatics perspective tend to worry mostly about categorization. Once you’ve figured out what the name of this process is and where it falls in the hierarchy you are done. People who work on minimal description frameworks start from the presumption that we understand what the process is (a microarray experiment, electrophoresis, etc.). Everything within that is then internally consistent, but it is challenging in many cases to link out into other descriptions of other processes. Again the distinctions between what was planned to be done and what was actually done, what instructions you would give to someone else if they wanted to do it, and how you would do it yourself if you did it again, have a tendency to be somewhat muddled.

Basically process is complicated, and multifaceted, with different requirements depending on your perspective. The description of a process will be made up of several different objects. There is a collection of inputs and a collection of outputs. There is a pointer to a classification, which may in itself place requirements on the process. The specific example will inherit requirements from the abstract class (any PCR has two primers, a specific PCR has two specific primers). Finally we need the log of what happened, as well as the “script” that would enable you to do it again. On top of this for a complex process you would need a clear machine-readable explanation of how everything is related. Ideally a set of RDF statements that explain what the role of everything is, and links the process into the wider web of data and things. Essentially the RDF graph of the process that you would use in an idealized machine reading scenario.
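Pulling those pieces together, the composite description might be sketched as something like the following. This is purely illustrative Python; the URIs, the fields, and the idea of bundling them into a single record are assumptions about what such a description would need, not an existing schema:

from dataclasses import dataclass
from typing import List

@dataclass
class ProcessRecord:
    # an illustrative container for the pieces a process description seems to need
    inputs: List[str]        # URIs of the materials / data that went in
    outputs: List[str]       # URIs of the things that came out
    process_class: str       # pointer into a vocabulary, e.g. a URI standing for "PCR"
    log: str                 # URI of the log of what actually happened
    script: str              # URI of the instructions for doing it again
    graph: str               # URI of the RDF graph relating all of the above

example = ProcessRecord(
    inputs=["http://example.org/material/template_dna", "http://example.org/material/primers"],
    outputs=["http://example.org/data/gel_image_42"],
    process_class="http://example.org/vocab/PCR",
    log="http://example.org/records/pcr_42/log",
    script="http://example.org/records/pcr_42/script",
    graph="http://example.org/records/pcr_42/graph.rdf",
)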

Diagrammatic representation of process

This is all very abstract at the moment but much of it is obvious and the rest I will try to tease out by example in the next post. At the core is the point that process is a complicated thing and different people will want different things from it. This means that it is necessarily going to be a composite object. The log is something that we would always want for any process (but often do not have at the moment). The “script”, the instructions to repeat the process, whether it be experimental or analytical, is also desirable and should be accessible either by stripping the correct commands out of the log, or by adapting a generic set of instructions for that type of process. The new thing is the concept of the graph. This requires two things: firstly, that every relevant object has an identity that can be referenced; secondly, that we capture the context of each object and what we do to it at each stage of the process. It is building systems that capture this as we go that will be the challenge, and what I want to try and give an example of in the next post.

Do high art and science mix?

On Friday night we went down to London to see the final night of the English National Opera’s run of Doctor Atomic, the new opera from John Adams with a libretto by Peter Sellars. The opera focuses on the days and hours leading up to the first test firing of an atomic bomb at the Trinity site on 15 July 1945. The doctor of the title is J Robert Oppenheimer, who guided the Manhattan Project to its successful conclusion, and the drama turns around him as the linchpin connecting scientists, engineers, the military and politicians. The title recalls the science fiction novels of the 50s and 60s, but the key reference is to Thomas Mann’s much earlier doctor, Faust.

Sellars made the intriguing choice of drawing the libretto from archival documents, memoirs, recollections and interviews. The libretto places the words of the main protagonists in their own mouths; words that are alternately moving, disturbing, and dry and technical. These literal texts are supplemented with poetry drawn from Oppenheimer’s own interests. John Donne’s Holy Sonnet XIV, apparently the source of the name of the test site, Baudelaire, and the Bhagavad Gita (which, I learnt from the programme, Oppenheimer taught himself Sanskrit to read) all play a role. The libretto places the scientists in their historical context, giving them the human motivations, misgivings, and terrors, as well as the exultation of success and the technical genius that led to that success. The question is, does this presentation of science and scientists in the opera hall work?

The evening starts well, with a periodic table projected onto a sheer curtain. It appears to be drawn from a textbook circa 1945, with gaps beneath cesium, below manganese, and between neodymium and samarium. The choice is clever, providing a puzzle for people with all levels of scientific experience: what is it; what do the symbols mean; why are there gaps; what goes into the gaps? It also leads into the first chorus, “We believed, that matter can neither be created nor destroyed, but only altered in form…”. Unfortunately at this point things began to fall apart, the chorus and orchestra appearing to struggle with the formidable rhythmic challenges that Adams had set them. The majority of the first half felt largely shapeless, moments of rhythmic and dramatic excitement lost within large tracts of expository and largely atonal semi-recitative.

Gerald Finley and Sasha Cooke were both in magnificent voice throughout but were given limited opportunities for lyrical singing. The trademark rhythmic drive of Adams’ previous works seems broken to pieces throughout the first act; the chorus is used sparingly and the orchestra rarely feels like it is moving beyond a source of sound effects. Act 1 is redeemed, however, by a powerful closing scene in which Oppenheimer, in the shadow of the Trinity device, sings a lyrical and despairing setting of the Donne, “Batter my heart, three person’d God”, punctuated with a driving orchestral interlude reminiscent of the more familiar Adams of Klinghoffer and Nixon in China. The effect is shattering, leaving the audience reflective in the gap between acts, but it is also dramatically puzzling. Up until now Oppenheimer has shown few doubts; now suddenly he seems to be despairing.

The second half started more confidently, with both chorus and orchestra appearing to settle. It is better organised and touches the heights of Adams’ previous work, but opportunities are missed. The chorus and orchestra again seem underutilised, and more generally the potential for effective musical treatment is not fully realised. The centre of the second act revolves around the night before the test, and the fears of those involved; fear that it won’t work, that the stormy weather will lead to the device detonating, the fears of General Groves that the scientists will run amok, and the largely unspoken fear of the consequences of the now inevitable use of the bomb in warfare. During the extended build to the climax and end of the opera the leads stand spread across the front of the stage, singing in turn, but not together, expressing their separate fears and beliefs but not providing the dramatic synthesis that would link them all together. Leo Szilard’s petition to Truman is sung in the first half by Robert Wilson as part of the introduction but not used, as it might have been, as a dramatic chorus in the second half.

But the core dramatic failure lies in the choice of the test itself as the end and climax of the opera. In today’s world of CGI effects and surround sound it is simply not feasible to credibly treat an atomic explosion on a theatre stage or in an orchestral treatment. The tension is built extremely effectively towards the climax but then falls flat as the whole cast stand or crouch on stage, waiting for something to happen, the orchestra simulating the countdown through regular clock-like noises all going at different rates. The flash that signals the explosion itself is simply a large array of spots trained on the backdrop. The curtain drops and a recording of a woman asking for water in Japanese is played. Moving, yes; dramatically satisfying, no.

We do not see the consequences of the test. We don’t see the undoubted exultation of scientists achieving what they set out to do. Nor do we see their growing realisation of the consequences. We are shown in short form the human consequences of the use of the weapon, but not the way this affects the people who built it. Ultimately, as an exploration of the inner world of Oppenheimer, Teller, and the other scientists, it feels as though it falls short. Given the inability to represent the explosion itself, my personal choice would have been to let the action flow on, to the party and celebration that followed, allowing the cast to drift away, gradually sobering, and leaving Oppenheimer finally on stage to reprise the Donne.

The opera is moving because we know the consequences of what followed. I found it dramatically unsatisfying, both for structural reasons and because it did not take the full opportunity to explore the complexities of the motivations of the scientists themselves. There are two revealing moments. In the first half, having heard the impassioned Wilson reading Leo Szilard’s petition, Oppenheimer asks Wilson how he is feeling. The answer, “pretty excited actually”, captures the tension at the heart of the scientists beautifully: the excitement and adventure of the science, and the fears and disquiet over the consequences. In the second act, Edward Teller comes on stage to announce that Fermi is taking bets as to whether the test will ignite the atmosphere. He goes on at some length about how his calculations suggested this would happen, Oppenheimer telling him to shut up and saying that Bethe had checked the figures. Teller continues to grandstand but then agrees magnanimously that he has now revised his figures and the world will not come to an end. Humour or arrogance? Or the heady combination of both that is not an unusual characteristic amongst scientists.

These flashes of insight into the leads’ minds never seem to be fully explored, merely occasionally exposed. Later in Act 2 Teller teases Oppenheimer and the other scientists for placing bets on explosive yields for the device much lower than they have calculated. We never find out who was right. The playful, perhaps childish, nature of the scientists is exposed but their response after the test is not dissected. What was the response of those betting when they realised they now knew exactly what the yield of the device to be dropped on Hiroshima would be? Were the bets paid out or abandoned in disgust? The opera draws on the Faust legend for inspiration, but the drama of Faust turns on the inner conflict of the scientist and the personal consequences of the choices made, things we don’t see enough of here.

Finally there is the question of the text itself. The chorus is restricted largely to technical statements. Richard Morrison, reviewing in the Times (I don’t seem to be able to find the review in question online), said that Adams struggled to bring poetry to some of these texts. It seems to me, though, that there is poetry here for the right composer. I was slightly disappointed that Adams was not that composer. Don’t let me put you off seeing this. It is not one of the great operas but it is very good and in places very powerful. It has flaws and weaknesses, but if you want to spend your life waiting for perfection you are going to need to be very patient. This is a good night at the opera and a moving drama. We just need to keep waiting for the first successful opera on a scientific subject.

We believed that
“Matter can be neither
created nor destroyed
but only altered in form.”

We believed that
“Energy can be neither
created nor destroyed
but only altered in form.”

But now we know that
energy may become matter,
and now we know that
matter may become energy
and thus be altered in form.
[….]

We surround the plutonium core
from thirty two points
spaced equally around its surface,
the thirty-two points
are the centers of the
twenty triangular faces
of an icosahedron
interwoven with the
twelve pentagonal faces
of a dodecahedron.
We squeeze the sphere.
Bring the atoms closer.
Til the subcritical mass
goes supercritical.

We disturb the stable nucleus.

From the Chorus Text, Act I Scene I, Dr Atomic, Libretto arranged by Peter Sellars from a variety of texts, copyright Boosey and Hawkes

Doc Bushwell has also reviewed Dr Atomic

Fantasy Science Funding: How do we get peer review of grant proposals to scale?

This post is both a follow-up to last week’s post on the costs of peer review and a response to Duncan Hull’s post of nine or so months ago proposing a game of “Fantasy Science Funding”. The game requires you to describe how you would distribute the funding of the BBSRC if you were a benign (or not so benign) dictator. The post and the discussion should be read bearing in mind my standard disclaimer.

Peer review is in crisis. Anyone who tells you otherwise either has their head in the sand or is trying to sell you something. Volumes are increasing, quality of review is decreasing, and finding scientists willing to take on refereeing is increasingly the major problem for those who commission it. This is a problem for peer reviewed publication, but the problems for the reviewing of funding applications are far worse.

For grant review, the problems that are already evident in scholarly publishing, fundamentally the increasing volume, are exacerbated by the fact that success rates for grants are falling and that successful grants are increasingly in the hands of a smaller number of people in a smaller number of places. Regardless of whether you agree with this process of concentrating grant funding, it creates a very significant perception problem. If the perception of your referees is that they have no chance of getting funding, why on earth should they referee?

Is this really happening? Well, in the UK chemistry community last year there was an outcry when two EPSRC grant rounds in a row had success rates of 10% or lower. Bear in mind this was the success rate of grants that made it to panel, i.e. it is an upper bound, assuming there weren’t any grants removed at an earlier stage. As you can imagine there was significant hand wringing and a lot of jumping up and down, but what struck me was two statements I heard made. The first, from someone who had sat on one of the panels, was that it “raises the question of whether it is worth our time to attend panel meetings”. The second was the suggestion that the chemistry community could threaten to unilaterally withdraw from EPSRC peer review. These sentiments are now being repeated on UK mailing lists in response to EPSRC’s most recent changes to grant submission guidelines. Whether serious or not, credible or not, this shows that the compact of community contribution to the review process is perilously close to breaking down.

The research council response to this is to attempt to reduce the number of grant proposals, generally by threatening to block those who have a record of serial rejection. This will fail. With success rates as low as they are, and with successful grants concentrated in the hands of the few, most academics are serial failures. The only way departments can increase income is by increasing the volume and quality of grant applications. With little effective control over quality the focus will necessarily be on increasing volume. The only way research councils will control this is either by making applications a direct cost to departments, or by reducing the need of academics to apply.

The cost of refereeing is enormous and largely hidden. But it pales into insignificance compared to the cost of applying for grants. Low success rates make the application process an immense waste of departmental resources. The approximate average cost of running a UK academic for a year is £100,000. If you assume that each academic writes one grant per year and that this takes around two weeks’ full time work, that amounts to ~£4k per academic per year. If there are 100,000 academics in the UK this is £400M, which with a 20% success rate means that £320M is lost in the UK each year. Even if those assumptions are generous, £100M a year is a reasonable conservative ballpark figure.
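As a back-of-envelope check, the same sums in Python; every figure here is an assumption carried over from the paragraph above, and the figures quoted in the text are rounded:

academic_cost_per_year = 100_000        # approximate full cost of a UK academic (GBP)
time_per_grant = 2 / 52                 # roughly two weeks of full time work
cost_per_grant = academic_cost_per_year * time_per_grant   # ~3,850, rounded to ~4k above

academics = 100_000
success_rate = 0.20

total_writing_cost = cost_per_grant * academics            # ~385M, rounded to ~400M above
wasted = total_writing_cost * (1 - success_rate)           # ~310M spent on proposals that fail
print(f"~£{total_writing_cost/1e6:.0f}M spent writing grants, ~£{wasted/1e6:.0f}M of that on failures")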

In more direct terms this means that academics, who are selected for their ability to do research, are being taken away from what they are good at to play a game which they will on average lose four times out of five. It would be a much more effective use of government funding to have those people actually doing research.

So, this being a game of Fantasy Funding, how would I spend the money? Well, rather than discuss my biases about what science is important, which are probably not very interesting, it is perhaps more useful to think about how the system might be changed to reduce these wastages. And there is a simple, if somewhat radical, way of doing this.

Cut the budget in two and distribute half of it directly to academics on a pro-rata basis.

By letting researchers focus on getting on with research you will reduce their need for funding and reduce the burden. By setting the bar naturally higher for funding research you still maintain the perception that everyone is in with a chance and reduce the risk of referee drop-out due to disenchantment with the process. More importantly you enable innovative research by allowing it to keep ticking over, and in particular you enable a new type of peer review.

If you look at the amounts of money involved, say a few hundred million pounds for BBSRC, and divide that up amongst all bioscience academics, you end up with figures of £5-20K per academic per year. Not enough to hire a postdoc, just about enough to run a PhD student (at least at UK rates). But what if you put that together with the money from a few other academics? If you can convince your peers that you have an interesting and fun idea then they can pool funds together. Or perhaps share a technician between two groups so that you don’t lose the entire group memory every time a student leaves? Effective collaboration will lead to a win on all sides.

If these arguments sound familiar it is because they are not so different to the notion of 20% time, best known as a Google policy of having all staff spend some time on personal projects. By supporting low level innovation and enabling small scale judging of ideas and pooling of resources it is possible to enable bottom up innovation of precisely the kind that is stifled by top down peer review.

No doubt there would be many unintended consequences, and probably a lot of wastage, but in amongst that I wouldn’t bet against the occasional brilliant innovation which is virtually impossible in the current climate.

What is clear is that doing nothing is not an option. Look at that EPSRC statement again. People with a long term success rate below 25% will be blocked… I just checked my success rate over the past ten years (about 15% by number of grants, 70% by value, but that is dominated by one large grant). The current success rate at chemistry panel is around 15%. And that is skewed towards a limited number of people and places.

The system of peer review relies absolutely on the community’s agreement to contribute and on some level of faith in the outcome. It relies absolutely on trust. That trust is perilously close to a breakdown.

Why good intentions are not enough to get negative results published

There is a set of memes that seem to be popping up with increasing regularity in the last few weeks. The first is that more of the outputs of scientific research need to be published. Sometimes this means the publication of negative results; other times it might mean that a community doesn’t feel it has an outlet for its particular research field. The traditional response is “we need a journal for this”. Over the years there have been many attempts to create a “Journal of Negative Results”. There is a Journal of Negative Results – Ecology and Evolutionary Biology (two papers in 2008), a Journal of Negative Results in Biomedicine (four papers in 2009; actually looks pretty active), a Journal of Interesting Negative Results in Natural Language (one paper), and a Journal of Negative Results in Speech and Audio Sciences, which appears to be defunct.

The idea is that there is a huge backlog of papers detailing negative results that people are gagging to get out if only there were somewhere to publish them. Unfortunately there are several problems with this. The first is that actually writing a paper is hard work. Most academics I know do not have the problem of not having anything to publish; they have the problem of getting around to writing the papers, sorting out the details, making sure that everything is in good shape. This leads to the second problem, that getting a negative result to a standard worthy of publication is much harder than for a positive result. You only need to make that compound, get that crystal, clone that gene, or get the microarray to work once and you’ve got the data to analyse for publication. To show that something doesn’t work you need to repeat it several times, make sure your statistics are in order, and establish your working conditions. Partly this is a problem with the standards we apply to recording our research; designing experiments so that negative results are well established is not high on many scientists’ priorities. But partly it is the nature of the beast. Negative results need to be much more tightly bounded to be useful.

Finally, even if you can get the papers, who is going to read them? And more importantly, who is going to cite them? Because if no-one cites them then the standing of your journal is not going to be very high. Will people pay to have papers published there? Will you be able to get editors? Will people referee for you? Will people pay for subscriptions? Clearly this journal will be difficult to fund and keep running. And this is where the second meme comes in, one which still gets surprising traction: that “publishing on the web is free”. Now we know this isn’t the case, but there is a slightly more sophisticated approach which is “we will be able to manage with volunteers”. After all, with a couple of dedicated editors donating their time, peer review being done for free, and authors taking on the formatting role, the costs can be kept manageable, surely? Some journals do survive on this business model, but it requires real dedication and drive, usually on the part of one person. The unfortunate truth is that putting in a lot of your spare time to support a journal which is not regarded as high impact (however that is measured) is not very attractive.

For this reason, in my view, these types of journals need much more work put into the business model than a conventional specialist journal does. To have any credibility in the long term you need a business model that works for the long term. I am afraid that “I think this is really important” is not a business model, no matter how good your intentions. A lot of the standing of a journal is tied up with authors’ views of whether it will still be there in ten years’ time. If that isn’t convincing they won’t submit; if they don’t submit you have no impact, and in the long term there is a downward spiral until you have no journal.

The fundamental problem is that the “we need a journal” approach is stuck in the printed page paradigm. To get negative results published we need to reduce the barriers to publication to far below where they currently are, while at the same time applying either a pre- or post-publication filter. Rich Apodaca, writing on Zusammen last week, talked about micropublication in chemistry: the idea of reducing the smallest publishable unit by providing routes to submit smaller packages of knowledge or data to some sort of archive. This is technically possible today; services like ChemSpider, NMRShiftDB, and others make it possible to submit small pieces of information to a central archive. More generally the web makes it possible to publish whatever we want, in whatever form we want, but hopefully semantic web tools will enable us to do this in an increasingly more useful form in the near future.
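To give a flavour of what micropublication might look like in practice, here is a minimal sketch of a small, self-describing data package being pushed to an archive over HTTP. The endpoint, the field names, and the identifier are all invented for illustration; none of the services mentioned above necessarily expose an interface like this:

import json
import urllib.request

record = {
    "title": "13C NMR shifts for compound 17",           # a made-up micropublication
    "creator": "http://myuserid.name",                    # hypothetical researcher identifier
    "data": {"shifts_ppm": [14.1, 22.7, 31.9, 171.2]},
    "license": "CC0",
}

request = urllib.request.Request(
    "http://archive.example.org/api/deposit",             # hypothetical archive endpoint
    data=json.dumps(record).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
response = urllib.request.urlopen(request)                 # would return the URI of the new record
print(response.read())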

Fundamentally my personal belief is that the vast majority of “negative results” and other journals that are trying to expand the set of publishable work will not succeed. This is precisely because they are pushing the limits of the “publish through journal” approach by setting up a journal. To succeed these efforts need to embrace the nature of the web, to act as a web-native resource, and not as a printed journal that happens to be viewed in a browser. This does two things, it reduces the barrier to authors submitting work, making the project more likely to be successful, and it can also reduce costs. It doesn’t in itself provide a business model, nor does it provide quality assurance, but it can provide a much richer set of options for developing both of these that are appropriate to the web. Routes towards quality assurance are well established, but suffer from the ongoing problem of getting researchers involved in the process, a subject for another post. Micropublication might work through micropayments, the whole lab book might be hosted for a fee with a number of “publications” bundled in, research funders may pay for services directly, or more interestingly the archive may be able to sell services built over the top of the data, truly adding value to the data.

But the key is low barriers for authors and a robust business model that can operate even if the service is perceived as being low impact. Without these you are creating a lot of work for yourself, and probably a lot of grief. Nothing comes free, and if there isn’t income, that cost will be your time.

The API for the physical world is coming – or – how a night out in Bath ended with chickens and sword dancing

Last night I made it for the first time to one of the BathCamp evening events organised by Mike Ellis in Bath. The original BathCamp was a BarCamp held in late summer last year that I couldn’t get to, but Mike has been organising a couple of evenings since. I thought I’d go along because it sounded like fun, because I knew a few of the people there would be good value, and because I didn’t know any of the others.

Last night there were two talks, the first was from Dale Lane, talking about home electricity use monitoring, with a focus on what can be done with the data. He showed monitoring, mashups, competitions for who could use the least electricity, and widgets to make your phone beep when consumption jumped; the latter is apparently really irritating as your phone goes off every time the fridge goes on. Irritating, but makes you think. Dale also touched on the potential privacy issues, a very large grey area which is of definite interest to me; who owns this data, and who has the right to use it?

Dale’s talk was followed by one from Ben Tomlinson, about the Arduino platform. Now maybe everyone knows about this already but it was new to me. Essentially Arduino is a cheap open source platform for building programmable interactive devices. The core is a £30 board with enough processing power and memory to run reasonably sensible programs. There is already quite a big code base which you can download and then flash onto the boards. The boards themselves take some basic analogue and digital inputs. But then there are all the other bits and bobs you can get, both generic parts with standard interfaces and parts built to fit the boards; literally plug and play, including motion sensors, motor actuators, ethernet, wireless, even web servers!

Now these were developed almost as toys, or for creatives, but what struck me was that this is a platform for building simple, cheap scientific equipment: things to do basic measurement jobs and then stream out the results. One thing we’ve learnt over the years is that the best way to make well recorded and reliable measurements of something is to take humans out of the equation as far as possible. This platform basically makes it possible to knock up sensors, communication devices, and responsive systems for less than the price of a decent mp3 player. It is a commoditized instrument platform, and one that comes with open source tools to play with.

Both Dale and Ben mentioned a site I hadn’t come across before, called Pachube (I think it is supposed to be a contraction of “patch” and “YouTube”, but I’m not sure; no-one seemed to know how to pronounce it) which provides a simple platform for hooking up location-centric streamed data. Dale is streaming his electricity consumption data in; someone is streaming the flow of the Colorado River. What could we do with lab sensor data? Or even experimental data? What happens when you can knock up an instrument to do a measurement in the morning and then just leave it recording for six months? What if that instrument then starts interacting with the world?
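As a sketch of how little glue code that kind of lab sensor streaming might take, the following assumes an Arduino printing one reading per line over USB serial, the pyserial and requests Python libraries, and a made-up feed endpoint standing in for a Pachube-style service:

import time
import requests
import serial

FEED_URL = "http://feeds.example.org/my-lab-sensor"    # placeholder, not a real Pachube feed
port = serial.Serial("/dev/ttyUSB0", 9600, timeout=2)  # wherever the Arduino shows up

while True:
    line = port.readline().decode("ascii", errors="ignore").strip()
    if line:
        try:
            value = float(line)                         # the Arduino sketch prints one number per line
        except ValueError:
            continue                                    # ignore anything that isn't a reading
        requests.post(FEED_URL, json={"timestamp": time.time(), "value": value})
    time.sleep(10)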

All of this has been possible for some time if you have the time and the technical chops to put it together. What is changing is two things. The first is the ease of being able to plug and play with these systems: download a widget, plug in a sensor, connect to the local wireless. As this becomes accessible, people start to play; it was no accident that Ben’s central examples were hacked toy chickens. The second key issue is that as these become commoditized the prices drop until, as Dale mentioned, electricity companies are giving away the meters, and the Arduino kit can be put together for less than £50. Not only will this let us do new science, it will let us involve the whole world in doing that science.

When you abstract away the complexities and details of dealing with a system into a good API you enable much more than the techies to start wiring things up. When artists, children, scientists, politicians, school teachers, and journalists feel comfortable with putting these things together amazing things start to happen. These tools are very close to being the API to link the physical and online worlds.

The sword dancing? Hard to explain but probably best to see it for yourself…

What is the cost of peer review? Can we afford (not to have) high impact journals?

Late last year the Research Information Network held a workshop in London to launch a report and, in many ways more importantly, a detailed economic model of the scholarly publishing industry. The model aims to capture the diversity of the scholarly publishing industry and to isolate costs and approaches, enabling the user to ask questions such as “what is the consequence of moving to a 95% author pays model” as well as simply asking how much money is going in and where it ends up. I’ve been meaning to write about this for ages but a couple of things in the last week have prompted me to get on and do it.

The first of these was an announcement by email [can’t find a copy online at the moment] by the EPSRC, the UK’s main funder of physical sciences and engineering. While the requirement for a two page economic impact statement for each grant proposal got more headlines, what struck me as much more important were two other policy changes. The first was that, unless specifically invited, rejected proposals cannot be resubmitted. This may seem strange, particularly to US researchers, where a process of refinement and resubmission, perhaps multiple times, is standard, but the BBSRC (UK biological sciences funder) has had a similar policy for some years. The second, frankly somewhat scary, change is that some proportion of researchers who have a history of rejection will be barred from applying altogether. What is the reason for these changes? Fundamentally, the burden of carrying out peer review on all of the submitted proposals is becoming too great.

The second thing was that, for the first time, I have been involved in refereeing a paper for a Nature Publishing Group journal. Now I like to think, like I guess everyone else does, that I do a reasonable job of paper refereeing. I wrote perhaps one and a half sides of A4 describing what I thought was important about the paper and making some specific criticisms and suggestions for changes. The paper went around the loop and on the second revision I saw what the other referees had written; pages upon pages of closely argued and detailed points. Now the other referees were much more critical of the paper but nonetheless this supported a suspicion that I have had for some time, that refereeing at some high impact journals is qualitatively different to what the majority of us receive, and probably deliver; an often form driven exercise with a couple of lines of comments and complaints. This level of quality peer review takes an awful lot of time and it costs money; money that is coming from somewhere. Nonetheless it provides better feedback for authors and no doubt means the end product is better than it would otherwise have been.

The final factor was a blog post from Molecular Philosophy discussing why the author felt Open Access publishers are, if not doomed to failure, then facing a very challenging road ahead. The centre of the argument, as I understand it, focused on the costs of high impact journals, particularly the costs of selection, refinement, and preparation for print. Broadly speaking I think it is generally accepted that a volume model of OA publication, such as that practiced by PLoS ONE and BMC, can be profitable. I think it is also generally accepted that a profitable business model for high impact OA publication has yet to be convincingly demonstrated. The question I would like to ask, though, is different. The Molecular Philosophy post skips the zeroth order question: can we afford high impact publications?

Returning to the RIN funded study and model of scholarly publishing, some very interesting points came out [see Daniel Hull’s presentation for most of the data here]. The first of these, which in retrospect is obvious but important, is that the vast majority of the costs of producing a paper are incurred in doing the research it describes (£116G worldwide). The second biggest contributor? Researchers reading the papers (£34G worldwide). Only about 14% of the costs of the total life cycle are actually taken up with costs directly attributable to publication. But that is the 14% we are interested in, so how does it divide up?

The “Scholarly Communication Process”, as everything in the middle is termed in the model, is divided up into actual publication/distribution costs (£6.4G), access provision costs (providing libraries and internet access, £2.1G) and the costs of researchers looking for articles (£16.4G). Yes, the biggest cost is the time you spend trying to find those papers. Arguably that is a sunk cost in as much as, once you’ve decided to do research, searching for information is a given, but it does make the point that more efficient searching has the potential to save a lot of money. In any case it is a non-cash cost in terms of journal subscriptions or author charges.

So to find the real costs of publication per se we need to look inside that £6.4G. Of the costs of actually publishing the articles the biggest single cost is peer review, weighing in at around £1.9G globally, just ahead of fixed “first copy” publication costs of £1.8G. So roughly 29% of the total costs incurred in publication and distribution of scholarly articles arises from the cost of peer review.
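For what it's worth, the shares are easy to check against the rounded figures quoted above (the 29% in the text presumably comes from the report's unrounded numbers):

# publication/distribution slice of the RIN model, figures in £ billions as quoted above
publication_and_distribution = 6.4
peer_review = 1.9
first_copy = 1.8

print(f"peer review share: {peer_review / publication_and_distribution:.0%}")  # ~30% with rounded inputs
print(f"first copy share:  {first_copy / publication_and_distribution:.0%}")   # ~28%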

There are lots of other interesting points in the reports and models (the UK is a net exporter of peer review, but the UK publishes more articles than would be expected based on its subscription expenditure), but the most interesting aspect of the model is its ability to model changes in the publishing landscape. The first scenario presented is one in which publication moves to being 90% electronic. This actually leads to a fairly modest decrease in costs, with a total saving of a little under £1G (less than 1%). Modeling a move to a 90% author pays model (assuming 90% electronic only) leads to very little change overall, but interestingly that depends significantly on the cost of the systems put in place to make author payments. If these are expensive and bureaucratic then the costs can rise, as many small payments are more expensive than a few big ones. But overall the costs shouldn’t need to change much, meaning that if mechanisms can be put in place to move the money around, the business models should ultimately be able to make sense. None of this however helps in figuring out how to manage a transition from one system to another, when for all useful purposes costs are likely to double in the short term as systems are duplicated.

The most interesting scenario, though, was the third: what happens as research expands? A 2.5% real increase year on year for ten years was modeled. This may seem profligate in today’s economic situation, but with many countries explicitly spending stimulus money on research, or already engaged in large scale increases of structural research funding, it may not be far off. This results in 28% more articles, 11% more journals, a 12% increase in subscription costs (assuming of course that only the real cost increases are passed on) and a 25% increase in the costs of peer review (£531M on a base of £1.8G).

I started this post talking about proposal refereeing. The increased cost of refereeing proposals as the volume of science increases would be added on top of that for journals; I think it is safe to say that the increase in cost would be of the same order. The refereeing system is already struggling under the burden. Funding bodies are creating new, and arguably totally unfair, rules to try and reduce the burden; journals are struggling to find referees for papers. Increases in the volume of science, whether they come from increased funding in the western world or from growing, increasingly technology driven, economies could easily increase that burden by 20-30% in the next ten years. I am sceptical that the system, as it currently exists, can cope, and I am sceptical that peer review, in its current form, is affordable in the medium to long term.

So, bearing in mind Paulo’s admonishment that I need to offer solutions as well as problems, what can we do about this? We need to find a way of doing peer review effectively, but it needs to be more efficient. Equally, if there are areas where we can save money we should be doing that. Remember that £16.4G just to find the papers to read? I believe in post-publication peer review because it reduces the costs and time wasted in bringing work to community view and because it makes the filtering and quality assurance of that published work continuous and ongoing. But in the current context it also offers significant cost savings. A significant proportion of published papers are never cited. To me it follows from this that there is no point in peer reviewing them. Indeed citation is an act of post-publication peer review in its own right, and it has recently been shown that Google PageRank type algorithms do a pretty good job of identifying important papers without any human involvement at all (beyond the act of citation). Of course, for PageRank mechanisms to work well, the citation and its full context are needed, making OA a prerequisite.
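For anyone who hasn't played with this, running a PageRank-style calculation over a citation graph is only a few lines with the networkx Python library; the papers and citation links below are obviously invented:

import networkx as nx

citations = nx.DiGraph()
# an edge A -> B means "paper A cites paper B"
citations.add_edges_from([
    ("paper_A", "paper_C"),
    ("paper_B", "paper_C"),
    ("paper_D", "paper_C"),
    ("paper_C", "paper_E"),
    ("paper_D", "paper_E"),
])

scores = nx.pagerank(citations)
for paper, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(paper, round(score, 3))   # heavily cited papers float to the top with no human input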

If refereeing can be restricted to those papers that are worth the effort then it should be possible to reduce the burden significantly. But what does this mean for high impact journals? The whole point of high impact journals is that they are hard to get into, which is why both the editorial and peer review costs are so high for them. Many people make the case that they are crucial for helping to filter out the important papers (remember that £16.4G again). In turn I would argue that they reduce value by making the process of deciding what is “important” a closed shop, taking that decision away, to a certain extent, from the community where I feel it belongs. But at the end of the day it is a purely economic argument. What is the overall cost of running, supporting through peer review, and paying for, either by subscription or via author charges, a journal at the very top level? What are the benefits gained in terms of filtering, and how do they compare to other filtering systems? Do the benefits justify the costs?

If we believe that there are better filtering systems possible, then they need to be built, and the cost benefit analysis done. The opportunity is coming soon to offer different, and more efficient, approaches as the burden becomes too much to handle. We either have to bear the cost or find better solutions.

[This has got far too long already – and I don’t have any simple answers in terms of refereeing grant proposals but will try to put some ideas in another post which is long overdue in response to a promise to Duncan Hull]

Contributor IDs – an attempt to aggregate and integrate

Following on from my post last month about using OpenID as a way of identifying individual researchers, Chris Rusbridge made the sensible request that when conversations go spreading themselves around the web it would be good if they could be summarised and aggregated back together. Here I am going to make an attempt to do that – but I won’t claim that this is a completely unbiased account. I will try to point to as much of the conversation as possible, but if I miss things out or misrepresent something please correct me in the comments or the usual places.

The majority of the conversation around my post occurred on friendfeed, at the item here, but also see commentary around Jan Aert’s post (and friendfeed item) and Bjoern Bremb’s summary post. Other commentary included posts from Andy Powell (Eduserv), Chris Leonard (PhysMathCentral), Euan, Amanda Hill of the Names project, and Paul Walk (UKOLN). There was also a related article in Times Higher Education discussing the article (Bourne and Fink) in PLoS Comp Biol that kicked a lot of this off [Ed – Duncan Hull also pointed out there is a parallel discussion about the ethics of IDs that I haven’t kept up with – see the commentary at the PLoS Comp Biol paper for examples]. David Bradley also pointed out to me a post he wrote some time ago which touches on some of the same issues, although from a different angle. Pierre set up a page on OpenWetWare to aggregate material into, and Martin Fenner has a collected set of bookmarks with the tag authorid at Connotea.

The first point, which seems to be one of broad agreement, is that there is a clear need for some form of unique identifier for researchers. This is not necessarily as obvious as it might seem: with many of these proposals there is significant push-back from communities who don’t see any point in the effort involved. I haven’t seen any evidence of that in this discussion, which leads me to believe that there is broad support for the idea from researchers, informaticians, publishers, funders, and research managers. There is also strong agreement that any system that works will have to be credible and trustworthy to researchers as well as other users, and have a solid and sustainable business model. Many technically minded people pointed out that building something was easy – getting people to sign up to use it was the hard bit.

Equally, and here I am reading between the lines somewhat, any workable system would have to be well designed and easy to use for researchers. There was much backwards and forwards along the lines of “RDF is too hard”, “you can’t expect people to generate FOAF”, and “OpenID has too many technical problems for widespread uptake”. At the same time, people thinking about what the back end would have to look like to stand any chance of providing an integrated system felt that FOAF, RDF, OAuth, and OpenID would have to provide a big part of the gubbins. The message for me was that the way the user interface(s) are presented has to be got right. There are small models of aspects of this which show that easy interfaces can be built to capture sophisticated data, but getting it right at scale will be a big challenge.
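
To make the gubbins a little less abstract, here is a rough sketch (in Python, using rdflib; the names, URIs, and OpenID are all invented) of the kind of FOAF/RDF record that could sit behind a researcher identifier – the point being that no researcher should ever have to see or write this by hand:

```python
# A rough sketch of the kind of FOAF/RDF record that might sit behind
# a researcher identifier. All names and URIs here are invented.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

FOAF = Namespace("http://xmlns.com/foaf/0.1/")
EX = Namespace("http://example.org/researchers/")  # hypothetical identifier scheme

g = Graph()
me = EX["researcher-12345"]                         # hypothetical unique identifier
g.add((me, RDF.type, FOAF.Person))
g.add((me, FOAF.name, Literal("A. Researcher")))
g.add((me, FOAF.openid, URIRef("https://openid.example.org/a-researcher")))
g.add((me, FOAF.made, URIRef("https://dx.doi.org/10.1234/example.doi")))

print(g.serialize(format="turtle"))
```

The triples stay under the hood; all the user should ever see is a simple question and a button.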

Where there is less agreement is on the details, both technical and organisational, of how best to go about creating a useful set of unique identifiers. There was some to-and-fro as to whether CrossRef was the right organisation to manage such a system. Partly this turned on concern over centralised versus distributed systems, and partly on issues of scope and trust. Nonetheless the majority view appeared to be that CrossRef would be the right place to start, and CrossRef do seem to have plans in this direction (from Geoffry Bilder – see this Friendfeed item).

There was also a lot of discussion around identity tokens versus authorisation. Overall the view seemed to be that these can productively be kept separate. One of the things that appealed to me in the first instance was that OpenIDs could be used either as tokens (just a unique code used as an identifier) or as a login mechanism; the machinery is already in place to make that work. Nonetheless it was generally accepted, I think, that the first important step is an identifier. Login mechanisms are not necessarily required, or even wanted, at the moment.

The discussion as to whether OpenID is a good mechanism seemed in the end to go around in circles. Many people brought up technical problems they had had getting OpenIDs to work, and there are ongoing problems both with the underlying services that support and build on the standard and with the quality of some of the services that provide OpenIDs. This was at the core of my original proposal to build a specialist provider with an interface and functionality that worked for researchers. As Bjoern pointed out, I should of course be applying my own five criteria for successful web services (go to the last slide) to this proposal. Key questions: 1) can it offer something compelling? Well no, not unless someone, somewhere requires you to have this thing. 2) can you pre-populate? Well yes, and maybe that is the key… (see later). In the end, as with the concern over other “informatics-jock” terms and approaches, the important thing is that all of the technical side is invisible to end users.

Another important discussion that, again, didn’t really come to a conclusion was who would hand out these identifiers, and when? Here there seemed to be two different perspectives: some wanted the identifiers to be completely separated from institutional associations, at least at first order, while others seemed concerned that access to identifiers should be controlled via institutions. I definitely belong in the first camp: I would argue that you just give them to everyone who requests them. The problem then comes with duplication – what if someone accidentally (or deliberately) ends up with two or more identities? At one level I don’t see that it matters to anyone except the person concerned (I’d certainly be trying to avoid having my publication record cut in half), but at the very least you would need a good interface for merging records when it was required. My personal belief is that it is more important to allow people to contribute than to protect the ground. I know others disagree, and that somewhere we will need to find a middle path.
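
For what it’s worth, the merge itself is the easy bit – the hard part is the interface and deciding whether two records really do belong to the same person. A crude sketch (with a record structure invented purely for illustration, nothing to do with any existing service):

```python
# A crude sketch of folding two accidentally-duplicated researcher records
# into one. The record structure is invented purely for illustration.
from dataclasses import dataclass, field

@dataclass
class ResearcherRecord:
    identifiers: set = field(default_factory=set)  # e.g. OpenIDs, publisher IDs
    works: set = field(default_factory=set)        # e.g. DOIs of claimed papers

def merge(primary: ResearcherRecord, duplicate: ResearcherRecord) -> ResearcherRecord:
    """Fold the duplicate identity into the primary one, keeping every claim."""
    return ResearcherRecord(
        identifiers=primary.identifiers | duplicate.identifiers,
        works=primary.works | duplicate.works,
    )

a = ResearcherRecord({"openid:a-researcher"}, {"10.1234/aaa"})
b = ResearcherRecord({"openid:a-researcher-2"}, {"10.1234/bbb"})
print(merge(a, b))
```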

One thing that was helpful was that we seemed to do a pretty good job of getting various projects in this space aggregated together (and possibly made more aware of each other). Among these are ResearcherID, a commercial offering that has been running for a while now, and the Names project, a collaboration of Mimas and the British Library funded by JISC. ClaimID is an OpenID provider that some people use and that provides some of the flexible “home page” functionality (see Maxine Clark’s for instance) that drove my original ideas. PublicationsList.org provides an online homepage but does what ClaimID doesn’t, offering a PubMed search that makes it easier (as long as your papers are in PubMed) to populate that home page with your papers – though not easier to include datasets, blogs, or wikis (see here for my attempts to include a blog post on my page). There are probably a number of others, so feel free to point out what I’ve missed!

So finally, where does this leave us? With a clear need for something to be done, a few organisations identified as the best placed to take it forward, and a lot of discussion still required about the background technicalities. If you’re still reading this far down the page then you’re obviously someone who cares about this, so I’ll give my thoughts – feel free to disagree!

  1. We need an identity token, not an authorisation mechanism. Authorisation is easily broken and is technically hard to implement across a wide range of legacy platforms. If it is possible to build in the option for authorisation in the future then that is great, but it is not the current priority.
  2. The backend gubbins will probably be distributed RDF. There is identity information all over the place which needs to be aggregated together. This isn’t likely to change, so a centralised database, to my mind, will not be able to cope. RDF is built to deal with exactly these kinds of problems, and it also allows multiple potential identity tokens to be pulled together to say that they represent one person (see the sketch after this list).
  3. This means that user interfaces will be crucial. The simpler the better, but the backend, with words like FOAF and RDF, needs to be effectively invisible to the user. Very simple interfaces asking “are you the person who wrote this paper?” are going to win; complex signup procedures are not.
  4. Publishers and funders will have to lead. The end view of what is being discussed here is very like a personal home page for researchers, but instead of being a page sitting on a server it is a dynamic document pulled together from material all over the web. Researchers are, for the most part, not going to be interested in having another home page that they have to look after. Publishers in particular understand the value of unique identifiers (and will get the most value out of them in the short term), so with the most to gain and the most direct interest they are best placed to lead, probably through organisations like CrossRef that aggregate things of interest across the industry. Funders will come along as they see the benefits of monitoring research outputs; forward-looking ones will probably come along straight away, while others will lag behind. The main point is that pre-populating and then letting researchers come along and prune and correct is going to be more productive than waiting for ten million researchers to sign up to a new service.
  5. The really big question is whether there is value in doing this specially for researchers. This is not a problem unique to research, and it is one for which a variety of messy and disparate solutions are starting to arise. Maybe the best option is to sit back and wait to see what happens. I often say that generic services are a better bet than ones built specially for researchers, because the community size isn’t there and there usually isn’t sufficient need for the added functionality. My feeling is that for identity there is a special need, and that if we capture the whole research community it will be big enough to support a viable service. There is a specific need for following and aggregating the work of individual people that I don’t think is general, and it is different from the authentication issues involved in finance. So I think in this case it is worth building specialist services.
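
On point 2, the kind of aggregation I have in mind is exactly what RDF was built for. A minimal sketch (again Python and rdflib, with hypothetical URIs throughout) of pulling several identity tokens together and asserting that they all refer to the same person:

```python
# A minimal sketch of asserting that several identity tokens refer to one
# person. All URIs are hypothetical; the point is that owl:sameAs lets
# records scattered across different services be pulled together later.
from rdflib import Graph, URIRef
from rdflib.namespace import OWL

g = Graph()
contributor_id = URIRef("http://example.crossref.org/contributor/abc123")  # hypothetical
openid         = URIRef("https://openid.example.org/a-researcher")         # hypothetical
staff_page     = URIRef("http://example-university.ac.uk/staff/4567")      # hypothetical

g.add((contributor_id, OWL.sameAs, openid))
g.add((contributor_id, OWL.sameAs, staff_page))

print(g.serialize(format="turtle"))
```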

The best hope, I think, lies in individual publishers starting to disambiguate authors across their existing corpora; many have already put a lot of effort into this. In turn, perhaps through CrossRef, it should be possible to agree an arbitrary identifier for each individual author. If this is exposed as a service it then becomes possible to start linking the information up. People can and will do so, and services will start to grow around that. Once this exists, some of the ideas around recognising referees and other efforts will start to flow.