Open Research: The personal, the social, and the political

Next Tuesday I’m giving a talk at the Institute for Science Ethics and Innovation in Manchester. This is a departure for me in terms of talk subjects, inasmuch as it is much more to do with policy and politics. I have struggled quite a bit with it, so this is an effort to work it out on “paper”. Warning: it’s rather long. The title of the talk is “Open Research: What can we do? What should we do? And is there any point?”

I’d like to start by explaining where I’m coming from. This involves explaining a bit about me. I live in Bath. I work at the Rutherford Appleton Laboratory, which is near Didcot. I work for STFC, but this talk is a personal view, so you shouldn’t take any of these views as representing STFC policy. Bath and Didcot are around 60 miles apart, so each morning I get up pretty early, get on a train, and then get on a bus that takes me to work. I work on developing methodology to study complex biological structures. We have a particular interest in trying to improve methods for looking at proteins that live in biological membranes and at protein-nucleic acid complexes. I have also done work on protein labelling that lets us make cool stuff and pretty pictures. This work involves an interesting mixture of small-scale lab work and work on big instruments at large, often multi-national, facilities. It also involves far too much travelling.

A good question to ask at this point is “Why?” Why do I do these things? Why does the government fund me to do them? Actually it’s not so much why the government funds them as why the public does. Why does the taxpayer support our work? Even that’s not really the right question, because there is no public. We are the public. We are the taxpayer. So why do we as a community support science and research? Historically science was carried out by people sufficiently wealthy to fund it themselves, or in a small number of cases by people who could find wealthy patrons. After the Second World War there was a political and social consensus that science needed to be supported, and that consensus has sustained research funding more or less to the present day. But with the war receding in public memory we seem to have retained the need to frame the argument for research funding in terms of conflict or threat: the War on Cancer, the threat of climate change. Worse, we seem to have come to believe our own propaganda, that the only way to justify public research funding is that it will cure this, or save us from that. And the reality is that in most cases we will probably not deliver on this.

These are big issues and I don’t really have answers to a lot of them, but it seems to me that they are important questions to think about. So here are some of my ideas about how to tackle them from a variety of perspectives. First, the personal.

A personal perspective on why and how I do research

My belief is that we have to start with being honest with ourselves, personally, about why and how we do research. This sounds like some sort of self-help mantra, I know, but let me explain what I mean. My personal aim is to maximise my positive impact on the world, either through my own work or through enabling the work of others. I didn’t come at this from first principles; it has evolved. I also understand that I am personally motivated by recognition and reward, and that I am strongly, perhaps too strongly, motivated by others’ opinions of me. My understanding of my own skills and limitations means that I largely focus my research work on methodology development and enabling others. I can potentially have a bigger impact by building systems and capabilities that help others do their research than I can by doing that research myself. I am lucky enough to work in an organization that values that kind of contribution to the research effort.

Because I want my work to be used as widely as possible, I make as much of it as possible freely available. Again I am lucky that I live at a time when the internet makes this kind of publishing possible. We have services that enable us to easily publish ideas, data, media, and process, and I can push a wide variety of objects onto the web for people to use if they so wish. Even better than that, I can work on developing tools and systems that help other people to do this effectively. If I can have a bigger impact by enabling other people’s research, then I can multiply that again by helping other people to share that research. But here we start to run into problems. Publishing is easy. But sharing is not so easy. I can push to the web, but is anyone listening? And if they are, can they understand what I am saying?

A social perspective (and the technical issues that go with it)

If I want my publishing to be useful I need to make it available to people in a way they can make use of. We know that the value of a network increases much faster than linearly as it grows. If I want to maximise my impact, I have to make connections and maximise the ability of other people to make connections. Indeed, Merton made the case for this in scientific research twenty years ago:

I propose the seeming paradox that in science, private property is established by having its substance freely given to others who might want to make use of it.

This is now a social problem, but a social problem with a distinct technical edge to it. Actually we have two related problems: the issue of how I make my work available in a useful form, and the separate but related issue of how I persuade others to make their work available for others to use.

The key to making my work useful is interoperability. This is at root a technical issue, but at a purely technical level it is one that has been solved. We can share through agreed data formats and vocabularies. The challenges we face in actually making it happen are less technical problems than social ones, but I will defer those for the moment. We also need legal interoperability. Science Commons, amongst others, has focused very hard on this question and I don’t want to discuss it in detail here, except to say that I agree with the position that Science Commons takes: if you want to maximise the ability of others to re-use your work then you must make it available with liberal licences that do not limit fields of use or the choice of licence on derivative works. This means CC-BY, BSD, etc., but if you want to be sure then your best choice is explicit dedication to the public domain.

But technical and legal interoperability are just subsets of what I think is more important: process interoperability. If the objects we publish are to be useful then they must be able to fit into the processes that researchers actually use. As we move to the question of persuading others to share and build the network this becomes even more important. We are asking people to change the way they do things, to raise their standards perhaps. So we need to make sure that this is as easy as possible and fits into their existing workflows. The problem with understanding how to achieve technical and legal interoperability is that the temptation is to impose it, and I am as guilty of this as anyone. What I’d like to do is use a story from our work to illustrate an approach that I think can help us to make this easier.

Making life easier by capturing process as it happens: Objects first, structure later

Our own work on web-based laboratory recording systems, which really originates in the group of Jeremy Frey at Southampton, came out of earlier work on a fully semantic, RDF-backed system for recording synthetic chemistry. In contrast we took an almost completely unstructured approach to recording work in a molecular biology laboratory, not because we were clever or knew it would work out, but because it was a contrast to what had gone before. The LaBLog is based on a blog framework and allows the user to put in completely free text, completely arbitrary file attachments, and to organize things in whichever way they like. Obviously a recipe for chaos.

And it was, to start with, as we found our way around, but we went through several stages of re-organization and interface design over a period of about 18 months. The key realization we made was that while a lot of what we were doing was difficult to structure in advance, there were elements within it, specific processes and specific types of material, that were consistently repeated, even stereotyped, and that structuring these gave big benefits. We developed a template system that made producing these repeated processes and materials much easier. These templates depended on how we organized our posts and the metadata that described them, and the metadata in turn was driven by the need for the templates to be effective. A virtuous circle developed around the positive reinforcement that the templates and associated metadata provided. More surprisingly, the structure that evolved out of this in many cases mapped well onto existing ontologies. In the specific cases where it didn’t, we could see that the problem arose either from the ontology itself, or from the fact that our work simply wasn’t well mapped by that ontology. But the structure arose spontaneously out of a considered attempt to make the user/designer’s life easier, and was then mapped onto the external vocabularies.

I don’t want to suggest that our particular implementation is perfect. It is far from it, with gaping holes in the usability and in our ability to actually exploit the structure that has developed. But I think the general point is useful. For the average scientist to be willing to publish more of their research, that process has to be made easy and it has to recognise the inherently unstructured nature of most research. We need to apply structured descriptions where they make the user’s life easier but allow unstructured or semi-structured representations elsewhere. And we need to build tools that make it easy to take those unstructured or semi-structured records and mould them into a specific structured narrative as part of a reporting process that the researcher has to do anyway: writing a report, writing a paper. These things need to be done anyway, and if we could build tools so that the easiest way to write the report or paper is to bring elements of the original record together and push those onto the web in agreed formats, through easy-to-use filters and aggregators, then we will have taken an enormous leap forward.

Once you’ve insinuated these systems into the researcher’s process then we can start talking about making that process better. But until then technical and legal interoperability are not enough – we need to interoperate with existing processes as well. If we could achieve this then much more research material would flow online, connections would be formed around those materials, and the network would build.

And finally – the political

This is all very well. With good tools and good process I can make it easier for people to use what I publish and I can make it easier for others to publish. This is great but it won’t make others want to publish. I believe that more rapid publication of research is a good thing. But if we are to have a rational discussion about whether this is true we need to have agreed goals. And that moves the discussion into the political sphere.

I asked earlier why it is that we do science as a society, and why we fund it. As a research community I feel we have no coherent answer to these questions. I also talked about being honest with ourselves. We should be honest with other researchers about what motivates us, why we choose to do what we do, and how we choose to divide limited resources. And as recipients of taxpayers’ money we need to be clear with government and the wider community about what we can achieve. We also have an obligation to optimize the use of the money we spend, and to optimize the effective use of the outputs derived from that money.

We need at core a much more sophisticated conversation with the wider community about the benefits that research brings: to the economy, to health, to the environment, to education. And we need a much more rational conversation within the research community as to how those different forms of impact are, and should be, tensioned against each other. We need, in short, a complete overhaul, if not a replacement, of the post-war consensus on public funding of research. My fear is that without this the current funding squeeze will turn into a long-term decline, and that without some serious self-examination the current self-indulgent bleating of the research community is unlikely to increase popular support for public research funding.

There are no simple answers to this, but it seems clear to me that at a minimum we need to be demonstrating that we are serious about maximising the efficiency with which we spend public money. That means making sure that research outputs can be re-used, that wheels don’t need to be re-invented, and that innovation flows easily from the academic lab into the commercial arena. And it means distinguishing between the effective use of public money to address market failures and the subsidising of UK companies that are failing to make effective investments in research and development.

The capital generated by science is in ideas, capability, and people. You maximise the effective use of capital by making it easy to move, by reducing barriers to trade. In science we can achieve this by maximising the ability to transfer research outputs. If we are to be taken seriously as guardians of public money, and to be seen as worthy of that responsibility, our systems need to make ideas, data, methodology, and materials flow easily. That means making our data, our process, and our materials freely available and interoperable. That means open research.

We need a much greater engagement with the wider community on how science works and what science can do. The web provides an immense opportunity to engage the public in active research, as demonstrated by efforts as diverse as Galaxy Zoo, with 250,000 contributors and millions of galaxy classifications, and the Open Dinosaur Project, with people reading online papers and adding measurements of thigh bones to an online spreadsheet. Without the publicly available Sloan Digital Sky Survey, without access to the paleontology papers, and without the tools to put the collected data online and share them, these people, this “public”, would be far less engaged. That means open research.

And finally we need to turn the tools of our research on ourselves. We need to critically analyse our own systems and processes for distributing resources, for communicating results, and for apportioning credit. We need to judge them against the value for money they offer to the taxpayer and where they are found wanting we need to adjust. In the modern networked world we need to do this in a transparent and honest manner. That means open research.

But even if we agree these things are necessary, or a general good, they are just policy. We already have policies which are largely ignored. Even when obliged to by journal publication policies or funder conditions, researchers avoid, obfuscate, and block attempts to gain access to data, materials, and methodology. Researchers are humans too, with the same needs to get ahead and to be recognized as anyone else. We need to find a way to map those personal needs, and those personal goals, onto the community’s need for more openness in research. As with the tooling, we need to “bake in” the openness to our processes to make it the easiest way to get ahead. Policy can help with cultural change, but we need an environment in which open research is the simplest and easiest approach to take. This is interoperability again, but in this case the policy and process have to interoperate with the real world. Something that is often a bit of a problem.

So in conclusion…

I started with a title I’ve barely touched on. But I hope that with some of the ideas I’ve explored we are in a position to answer the questions I posed. What can we do in terms of Open Research? The web makes it technically possible for us to share data, process, and records in real time. It makes it easier for us to share materials, though I haven’t really touched on that. We have the technical ability to make that data useful through shared data formats and vocabularies. Many of the details are technically and socially challenging, but we can share pretty much anything we choose to on a wide variety of timeframes.

What should we do? We should make that choice easier through the development of tools and interfaces that recognize that it is usually humans doing and recording the research, and that exploit the ability of machines to structure that record when machines are doing the work. These tools need to exploit structure where it is appropriate and allow freedom where it is not. We need tools to help us map our records onto structures as we decide how we want to present them. Most importantly we need to develop structures of resource distribution, communication, and recognition that encourage openness by making it the easiest approach to take. Encouragement may be all that’s required. The lesson from the web is that once network effects take hold they can take care of the rest.

But is there any point? Is all of this worth the effort? My answer, of course, is an unequivocal yes. More open research will be more effective, more efficient, and provide better value for the taxpayer’s money. But more importantly I believe it is the only credible way to negotiate a new consensus on the public funding of research. We need an honest conversation with government and the wider community about why research is valuable, what the outcomes are, and how they contribute to our society. We can’t do that if the majority cannot even see those outcomes. The wider community is more sophisticated than we give it credit for. And in many ways the research community is less sophisticated than we think. We are all “the public”. If we don’t trust the public to understand why and how we do research, if we don’t trust ourselves to communicate the excitement and importance of our work effectively, then I don’t see why we deserve to be trusted to spend that money.

Nature Communications: A breakthrough for open access?

A great deal of excitement, but relatively little detailed information, has so far followed the announcement by Nature Publishing Group of a new online-only journal with an author-pays open access option. NPG have managed and run a number of open access (although see caveats below) and hybrid journals, as well as online-only journals, for a while now. What is different about Nature Communications is that it will be the first clearly Nature-branded journal that falls into either of these categories.

This is significant because it brings the Nature brand into the mix. Stephen Inchcoombe, executive director of NPG, in email correspondence quoted in The Scientist, notes the increasing uptake of open-access options and the willingness of funders to pay processing charges for publication as major reasons for NPG to provide a wider range of options.

In the NPG press release David Hoole, head of content licensing for NPG, says:

“Developments in publishing and web technologies, coupled with increasing commitment by research funders to cover the costs of open access, mean the time is right for a journal that offers editorial excellence and real choice for authors.”

The reference to “editorial excellence” and the use of the Nature brand are crucial here and what makes this announcement significant. The question is whether NPG can deliver something novel and successful.

The journal will be called Nature Communications. “Communications” is a moniker usually reserved for “rapid publication” journals. At the same time the Nature brand is all about exclusivity, painstaking peer review, and editorial work. Can these two be reconciled successfully and, perhaps most importantly, how much will it cost? In the article in The Scientist a timeframe of 28 days from submission to publication is mentioned, but only as a minimum period. Four weeks is fast, but not super-fast for an online-only journal.

But speed is not the only criterion. Reasonably fast and with a Nature brand may well be good enough for many, particularly those who have come out of the triage process at Nature itself. So what of that branding – where is the new journal pitched? The press release is a little equivocal on this:

Nature Communications will publish research papers in all areas of the biological, chemical and physical sciences, encouraging papers that provide a multidisciplinary approach. The research will be of the highest quality, without necessarily having the scientific reach of papers published in Nature and the Nature research journals, and as such will represent advances of significant interest to specialists within each field.

So more specific – less general interest, but still “the highest quality”. This is interesting because there is an argument that this could easily cannibalise the “Nature baby” journals. Why wait for Nature Biotech or Nature Physics when you can get your paper out faster in Nature Communications? Or, on the other hand, might it be out-competed by the other Nature journals – if the selection criteria are more or less the same, highest quality but not of general interest, why would you go for a new journal over the old favourites? Particularly if you are the kind of person who feels uncomfortable with online-only journals.

If the issue is the selectivity difference between the old and the new Nature journals, then the peer review process can perhaps offer us clues. Again, there are some interesting but not entirely clear statements in the press release:

A team of independent editors, supported by an external editorial advisory panel, will make rapid and fair publication decisions based on peer review, with all the rigour expected of a Nature-branded journal.

This sounds a little like the PLoS ONE model – a large editorial board with the intention of spreading the load of peer review so as to speed it up. With the use of the term “peer review” it is to be presumed that this means external peer review by referees with no formal connection to NPG; I would have thought that NPG are very unlikely to dilute their brand by utilising editorial peer review of any sort. Given that the slow point of the process is getting a response back from peer reviewers, whether they are reviewing for Nature or for PLoS ONE, it’s not clear to me how this can be sped up, or indeed even changed from the traditional process, without risking a perception of a quality drop. This is going to be a very tough balance to find.

So finally, does this mean that NPG are serious about Open Access? NPG have been running OA and online-only journals (although see the caveat below about the licence) for a while now and appear to be serious about increasing this offering. They will have looked very seriously at the numbers before making a decision on this, and my reading is that those numbers are saying that they need to have a serious offering. This is a hybrid, and it will be easy to make accusations that, along with other fairly unsuccessful hybrid offerings, it is being set up to fail.

I doubt this is the case personally, but nor do I believe that the OA option will necessarily get the strong support it will need to thrive. The critical question will be pricing. If this is pitched at the level of other hybrid options, too high to be worth what is being offered in terms of access, then it will appear to have been set up to fail. Yet NPG can justifiably charge a premium if they are providing real editorial value. Indeed they have to. NPG has in the past said that they would have to charge enormous processing charges to authors to recover the costs of peer review. So they can’t offer something relatively cheap yet claim the peer review is to the same standards. The price is absolutely critical to credibility. I would guess something around £2500 or US$4000: higher than PLoS Biology/Medicine but lower than other hybrid offerings.

So then the question becomes value for money. Is the OA offering up to scratch? Again the press release is not as enlightening as one would wish:

Authors who choose the open-access option will be able to license their work under a Creative Commons license, including the option to allow derivative works.

So does that mean it will be a non-commercial licence? In which case it is not Open Access under the BBB declarations (most explicitly in the Budapest Declaration). This would be consistent with the existing author rights that NPG allows and their current “Open Access” journal licences, but in my opinion would be a mistake. If there is any chance of the accusation that this isn’t “real OA” sticking, then NPG will make a rod for their own back. And I really can’t see it making the slightest difference to their cost recovery. Equally, why only the “option” to allow derivative works? The BBB declarations are unequivocal that derivative works are at the core of Open Access. From a tactical perspective it would be much simpler and easier for them to go for straight CC-BY. It will get support (or at least neutralize opposition) from even the hardline OA community, and it doesn’t leave NPG open to any criticism of muddying the waters. The fact that such a journal is being released shows that NPG gets the growing importance of Open Access publication. This paragraph, in its current form, suggests that the organization as a whole hasn’t internalised the messages about why. There are people within NPG who get this through and through, but this paragraph suggests to me that that understanding has not spread far enough within the organisation to make this journal a success. The lack of mention of a specific licence is a red rag, and an entirely unnecessary one.

So in summary the outlook is positive. The efforts of the OA movement are having an impact at the highest levels amongst traditional publishers. Whether you view this as a positive or a negative response, it is a success in my view that NPG feels a response is necessary. But the devil is in the details. Critical to both the journal’s success and the success of this initiative as a public relations exercise will be the pricing, the licence, and the acceptance of the journal by the OA movement. The press release is not as promising on these issues as might be hoped. But it is early days yet, and no doubt there will be more information to come as the journal gets closer to going live.

There is a Nature Network Forum for discussions of Nature Communications which will be a good place to see new information as it comes out.

Show us the data now damnit! Excuses are running out.

A very interesting paper from Caroline Savage and Andrew Vickers was published in PLoS ONE last week detailing an empirical study of data sharing by PLoS journal authors. The results themselves, that one out of ten corresponding authors provided data, are not particularly surprising, mirroring as they do previous studies, both formal [pdf] and informal (also from Vickers; I assume this is a different data set), of data sharing.

Nor are the reasons why data was not shared particularly new. Two authors couldn’t be tracked down at all. Several did not reply, and the remainder came up with the usual excuses: “too hard”, “need more information”, “university policy forbids it”. The numbers in the study are small, and it is a shame it wasn’t possible to do a wider study that might have teased out discipline, gender, and age differences in attitude. Such a study really ought to be done, but it isn’t clear to me how to do it effectively, properly, or indeed ethically. The reason small numbers were chosen was both to focus on PLoS authors, who might be expected to have more open attitudes, and to make the request to the authors, that the data was to be used in a Masters educational project, plausible.

So while helpful, the paper itself doesn’t provide much that is new. What will be interesting will be to see how PLoS responds. These authors are clearly violating stated PLoS policy on data sharing (see e.g. the PLoS ONE policy). The papers should arguably be publicly pulled from the journals. Most journals have similar policies on data sharing, and most have no corporate interest in actually enforcing them. I am unaware of any cases where a paper has been retracted due to the authors’ unwillingness to share (if there are examples I’d love to know about them!). [Ed: Hilary Spencer from NPG pointed us in the direction of some case studies in a presentation from Philip Campbell.]

Is it fair that a small group be used as a scapegoat? Is it really necessary to go for the nuclear option and pull the papers? As was said in a Friendfeed discussion thread on the paper: “IME [In my experience] researchers are reeeeeeeally good at calling bluffs. I think there’s no other way”. I can’t see any other way of raising the profile of this issue. Should PLoS take the risk of being seen as hardline on this, risking the consequences of people not sending papers there because of the need to reveal data?

The PLoS offering has always been about quality: high-profile journals delivering important papers and, at PLoS ONE, critical analysis of the quality of the methodology. The perceived value of that quality is compromised by authors who do not make data available. My personal view is that PLoS would win by taking a hard line and the moral high ground. Your paper might be important enough to get into Journal X, but is the data of sufficient quality to make it into PLoS ONE? Other journals would be forced to follow – at least those that take quality seriously.

There will always be cases where data cannot or should not be made available. But these should be carefully delineated exceptions, not the rule. If you can’t be bothered putting your data into a shape worthy of publication, then the conclusions you have based on that data are worthless. You should not be allowed to publish. End of. We are running out of excuses. The time to make the data available is now. If it isn’t backed by the data then it shouldn’t be published.

Update: It is clear from this editorial blog post from the PLoS Medicine editors that PLoS do not in fact know which papers are involved. As was pointed out by Steve Koch in the Friendfeed discussion, there is an irony that Savage and Vickers have not, in a sense, provided their own raw data, i.e. the emails and names of correspondents. However, I would accept that to do so would be an unethical breach of presumed privacy, as the correspondents might reasonably have expected these were private emails, and to publish names would effectively be entrapment. Life is never straightforward, and this is precisely the kind of grey area we need more explicit guidance on.

Savage CJ, Vickers AJ (2009) Empirical Study of Data Sharing by Authors Publishing in PLoS Journals. PLoS ONE 4(9): e7078. doi:10.1371/journal.pone.0007078

Full disclosure: I am an academic editor for PLoS ONE and have raised the issue of insisting on supporting data for all charts and graphs in PLoS ONE papers in the editors’ forum. There is also a recent paper with my name on it in which the words “data not shown” appear. If anyone wants that data I will make sure they get it, and as soon as Nature enables article commenting we’ll try to get something up there. The usual excuses apply, and don’t really cut the mustard.

A question of trust

I have long been sceptical of the costs and value delivered by our traditional methods of peer review. This is really on two fronts: firstly, the costs, where they have been estimated, are extremely high, representing a multi-billion dollar subsidy by governments of the scholarly publishing industry. Secondly, the value that is delivered through peer review, the critical analysis of claims and informed opinion on the quality of the experiments, is largely lost. At best it is wrapped up in the final version of the paper. At worst it is simply completely lost to the final end user. A part of this, which the more I think about it the more I find bizarre, is that the whole process is carried on under a shroud of secrecy. This means that, as an end user, I do not know who the peer reviewers are, and do not necessarily know what process has been followed or even the basis of the editorial decision to publish. As a result I have no means of assessing the quality of peer review for any given journal, let alone any specific paper.

Those of us who see this as a problem have a responsibility to provide credible and workable alternatives to traditional peer review. So far, despite many ideas, we haven’t, to be honest, had very much success. Post-publication commenting, open peer review, and Digg-like voting mechanisms have been explored, but have yet to have any large success in scholarly publishing. PLoS is leading the charge on presenting article-level metrics for all of its papers, but these remain papers that have also been through a traditional peer review process. Very little that is both radical, with respect to the decision and means of publishing, and successful in getting traction amongst scientists has been seen as yet.

Out on the real web it has taken non-academics to demonstrate the truly radical when it comes to publication. Whatever you may think of the accuracy of Wikipedia in your specific area, and I know it has some weaknesses in several of mine, it is the first location that most people find, and the first location that most people look in, when searching for factual information on the web. Roderic Page put up some interesting statistics when he looked this week at the top Google hits for over 5,000 mammal names. Wikipedia took the top spot 48% of the time and was in the top 10 in virtually every case (97%). If you want to place factual information on the web, Wikipedia should be your first port of call. Anything else is largely a waste of your time and effort. This doesn’t, incidentally, mean that other sources are not worthwhile or don’t have a place, but people need to work with the assumption that the first landing point will be Wikipedia.

“But”, I hear you say, “how do we know whether we can trust a given Wikipedia article, or specific statements in it?”

The traditional answer has been to say that you need to look in the logs, check the discussion page, and click back through the profiles of the people who made specific edits. However this is inaccessible to many people, simply because they do not know how to process the information. Very few universities have an “Effective Use of Wikipedia 101” course, mostly because very few people would be able to teach it.

So I was very interested in an article on Mashable about marking up and colouring Wikipedia text according to its “trustworthiness”. Andrew Su kindly pointed me in the direction of the group doing the work and their papers and presentations. The system they are using, which can be added to any MediaWiki installation, measures two things: how long a specific piece of text has stayed in situ, and who either edited it or left it in place. People who write long-lasting edits get higher status, and this in turn promotes the text that they have “approved” by editing around but not changing.
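
The mechanics are easier to see with a toy example. The sketch below is purely illustrative, my own simplification rather than the actual algorithm the group uses: editors accumulate “karma” when their words survive later revisions, and the trust of a passage reflects the karma of the editors who have edited around it without changing it.

```python
# A toy illustration of the idea described above, not the actual algorithm
# used by the group doing this work: editors earn "karma" when their text
# survives later revisions, and a passage is trusted in proportion to the
# karma of the editors who have left it in place.

def update_karma(karma, author, surviving_words, total_words):
    """Reward an author when words from their earlier edit survive a later revision."""
    survival_fraction = surviving_words / float(total_words) if total_words else 0.0
    karma[author] = karma.get(author, 0.0) + survival_fraction
    return karma

def text_trust(karma, reviewing_editors):
    """Trust of a passage grows with the karma of the editors who edited around it."""
    if not reviewing_editors:
        return 0.0
    return sum(karma.get(e, 0.0) for e in reviewing_editors) / len(reviewing_editors)

# Example: an edit by "alice" largely survives two later revisions
karma = {}
karma = update_karma(karma, "alice", surviving_words=120, total_words=150)
karma = update_karma(karma, "alice", surviving_words=118, total_words=150)
print(text_trust(karma, reviewing_editors=["alice", "bob"]))  # bob has no karma yet
```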

This to me is very exciting because it provides extra value and information for both users and editors without requiring anyone to do any more work than install a plugin. The editors and writers simply continue working as they have. The user can access an immediate view of the trustworthiness of the article with a high level of granularity, essentially at the level of single statements. And most importantly the editor gets a metric, a number that is consistently calculated across all editors, that they can put on a CV. Editors are peer reviewers: they are doing review, on a constantly evolving and dynamic article that can both change in response to the outside world and also be continuously improved. Not only does the Wikipedia process capture most of the valuable aspects of traditional peer review, it jettisons many of the problems. But without some sort of reward it was always going to be difficult to get professional scientists to be active editors. Trust metrics could provide that reward.

Now there are many questions to ask about the calculation of this “karma” metric. Should it be subject-biased, so that we know highly ranked editors have relevant expertise, or should it be general, so as to discourage highly ranked editors from modifying text that is outside their expertise? What should the mathematics behind it be? It will clearly take time for such metrics to be respected as a scholarly contribution, but equally I can see the ground shifting very rapidly towards a situation where a lack of engagement, a lack of interest in contributing to the publicly accessible store of knowledge, is seen as a serious negative on a CV. However this particular initiative pans out, it is to me one of the first and most natural replacements for peer review that could be effective within dynamic documents, solving most of the central problems without requiring significant additional work.

I look forward to the day when I see CVs with a Wikipedia Karma Rank on them. If you happen to be applying for a job with me in the future, consider it a worthwhile thing to include.

An open letter to Lord Mandelson

Lord Mandelson is the UK minister for Business, Innovation and Skills, which includes the digital infrastructure remit. He recently announced that a version of the “three strikes” approach to combating illegal filesharing, with the sanction being removal of internet access, would be applied in the UK. This is a copy of a letter I have sent to Lord Mandelson via the wonderful site www.writetothem.com, which provides an easy way to write to UK parliamentarians. If you have an interest in the issue I suggest you do the same.

Lord Mandelson

House of Lords

Palace of Westminster

4 September 2009

Dear Lord Mandelson

I am writing to protest the decision taken by yourself to impose a “three strikes” approach to online rights and monopoly violations with an ultimate sanction requiring service providers to remove internet access. I am not a UK citizen but have lived in the UK for ten years and regard it as my home. I have a direct interest in the use of new technologies for communication, particularly in scientific research, and a vested interest in the long term competitiveness of the UK and its ability to support continued innovation in this area.

Your decision is wrong. Not because copyright violation should be allowed or respected, and not because the mainstream content industry should be ashamed that it makes money. It is wrong because it will stifle the development of new forms of creativity and the development of entirely new industries. As an advocate of Open Access scientific publication and copyright reform I am critical of the current system of rights and monopolies, but I work hard to respect the rights of content producers. And it is very hard work. Even for someone with some expertise in copyright and licensing, doing this right requires time and effort. When I write, or prepare presentations, I spend significant amounts of time identifying work I can re-use, checking that licences are compatible, and making sure I license my own derivative work in a way that respects the rights of those people whose work I have built on.

New forms of creativity are developing that re-use and re-purpose existing content, but in fact this is not new at all. Re-use and re-purposing in culture has a grand tradition, from Homer, via Don Quixote, to Romeo and Juliet, from Brahms’ Haydn Variations to Hendrix’s version of the Star Spangled Banner. In my own field, all science and technology is derivative: it builds constantly on the work of others. But the internet makes new forms of re-use possible. New types of value creation are also made possible. Re-use of images, video, and text, as well as ideas and data, is enabling the development of new forms of business, new types of innovation, in ways that are very challenging to predict. Your proposal will stifle this innovation by creating an environment of fear around re-use and by privileging certain classes and types of content and producer over the generators of new and innovative products. Those who do not care will ignore and circumvent the rules by technical means. And those who are exploring new types of derivative work, new types of innovative content, will be discouraged by the atmosphere of fear and uncertainty created by your policy.

Nonetheless it is important that the rights of content producers are respected. The key is finding the right balance between the needs of existing industries and individuals involved in the creation of new content and new industries. I would suggest that the key to any protection mechanism is parity. Large and traditional content producers, if given additional rights over those currently provided by law, must also respect equivalent rights for the small and new media producer.

This can be simply achieved by providing a similar three strikes mechanism for traditional media. Thus if a television broadcaster uses, without appropriate attribution or licensing, video, images, or text taken by an individual, then they should have their broadcast licence revoked. Similarly, if print media use text from bloggers or Wikipedia without appropriate licensing or attribution, then the rights holders should be able to revoke their paper supply. Paper suppliers to the print media would be required to implement systems to enable online authors to register complaints and would be responsible for imposing these sanctions.

Clearly such a system is farcical, creating a nightmare of bureaucracy and heavy handed sanctions that stifle experimentation and economic activity. Yet it is analogous to what you have proposed. Only you are imposing this to protect a mature set of industries with no real long term growth potential while stifling the potential of a whole new class of industries and innovation with massive growth potential over the next few decades.

Your proposal is wrong for purely economic reasons. It is wrong because it will stifle a major opportunity for economic growth right at the point where we need it most. And it is wrong because, as a government, your role is not to legislate to protect business models but to regulate in a way that balances the risks of damage in one sector against the potential for encouraging new sectors to develop. I respectfully suggest that you have got that balance wrong. I disagreed with much in Lord Carter’s report, but perhaps the best measure of its balance was the equally vociferous criticism it received from both sides of the debate. This to me suggests that it forms a productive basis on which to move forward.

Yours sincerely

Cameron Neylon

Writing a Wave Robot – Some thoughts on good practice for Research Robots

ChemSpidey lives! Even in the face of Karen James’ heavy irony I am still amazed that someone like me, with very little programming experience, was able to pull together something that actually worked effectively in a live demo. As long as you’re not actively scared of trying to put things together, it is becoming relatively straightforward to build tools that do useful things. Building ChemSpidey relied heavily on existing services and other people’s code, but pulling that together was a relatively straightforward process. The biggest problems were fixing the strange and in most cases undocumented behaviour of some of the pieces I used. So what is ChemSpidey?

ChemSpidey is a Wave robot that can be found at chemspidey@appspot.com. The code repository is available at Github and you should feel free to re-use it in any way you see fit, although I wouldn’t really recommend it at the moment; it isn’t exactly the highest quality. One of the first applications I see for Wave is making it easy to author (semi-)semantic documents which link objects within the document to records on the web. In chemistry it would be helpful to link the names of compounds through to records about those compounds in the relevant databases.

If ChemSpidey is added to a wave it watches for text of the form “chem[ChemicalName{;weight {m}g}]”, where the curly-bracketed parts are optional. When a blip is submitted by hitting the “done” button, ChemSpidey searches through the blip looking for this text and, if it finds it, strips out the name and sends it to the ChemSpider SimpleSearch service. ChemSpider returns a list of database ids and the robot currently just pulls the top one off the list and adds the text “ChemicalName (csid:####)” to the wave, where the id is linked back to ChemSpider. If there is a weight present it asks the ChemSpider MassSpec API for the nominal molecular weight, calculates the number of moles, and inserts that. You can see video of it working here (look along the timeline for the ChemSpidey tag).
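
To make the text handling concrete, here is a minimal sketch of the kind of parsing and mole calculation described above. It is illustrative only: the regular expression, the function names, and the tiny lookup table are stand-ins of my own, and the real robot calls the ChemSpider SimpleSearch and MassSpec services rather than a local dictionary.

```python
import re

# Minimal sketch of the text handling described above, assuming a hypothetical
# lookup function in place of the ChemSpider SimpleSearch and MassSpec calls.
CHEM_PATTERN = re.compile(r"chem\[([^;\]]+)(?:;\s*weight\s+([\d.]+)\s*(m?)g)?\]")

def lookup_csid_and_mass(name):
    """Stand-in for the ChemSpider services: return (csid, molecular weight) or (None, None)."""
    dummy_db = {"sodium chloride": (5044, 58.44)}  # values for illustration only
    return dummy_db.get(name.lower(), (None, None))

def annotate(text):
    """Replace chem[...] markup with 'Name (csid:####)' plus a mole count if a weight is given."""
    def replace(match):
        name = match.group(1).strip()
        weight, milli = match.group(2), match.group(3)
        csid, mol_weight = lookup_csid_and_mass(name)
        if csid is None:
            return name  # unrecognised compound: leave just the name
        result = "%s (csid:%s)" % (name, csid)
        if weight and mol_weight:
            grams = float(weight) / 1000.0 if milli else float(weight)
            result += ", %.2f mmol" % (grams / mol_weight * 1000.0)
        return result
    return CHEM_PATTERN.sub(replace, text)

print(annotate("Dissolve chem[sodium chloride; weight 500 mg] in 10 ml water"))
```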

[Image: ChemSpidey]

What have I learned? Well, some stuff that is probably obvious to anyone who is a proper developer: use the current version of the API, and note that Google AppEngine pushes strings around as unicode, which broke my tests because I had developed things using standard Python strings. But I think it might be useful to start drawing some more general lessons about how best to design robots for research, so to kick off the discussion here are my thoughts, many of which came out of discussions with Ian Mulvany as we prepared for last week’s demo.

  1. Always add a Welcome Blip when the Robot is added to a wave. This makes the user confident that something has happened, lets you notify users if a new version has been released, which might change the way the robot works, and lets you provide some short instructions. It’s good to include a version number here as well.
  2. Have some help available. Ian’s Janey robot responds to the request (janey help) in a blip with an extended help blip explaining context. Blips are easily deleted later if the user wants to get rid of them and putting these in separate blips keeps them out of the main document.
  3. Where you modify text, leave an annotation. I’ve only just started to play with annotations, but it seems immensely useful to at least attempt to leave a trace of what you’ve done that makes it easy for your own Robot, other robots, or just human users to see who did what. I would suggest leaving annotations that identify the robot, include any text that was parsed, and ideally provide some domain information. We need to discuss how to set up some namespaces for this.
  4. Try to isolate the “science handling” from the “wave handling” (see the sketch after this list). ChemSpidey mixes up a lot of things into one Python script. Looking back at it now, it makes much more sense to isolate the interaction with the wave from the routines that parse text or do mole calculations. This means both that the different layers of code become easier for others to re-use, and that if Wave doesn’t turn out to be the one system to rule them all we can still re-use the code. I am no architecture expert and it would be good to get some clues from good ones about how best to separate things out.
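
As a rough illustration of that last point, the structure might look something like the sketch below. The function names are hypothetical, not the actual ChemSpidey layout, and the Wave event plumbing itself is deliberately not reproduced; the point is simply that the chemistry lives in plain functions that know nothing about blips.

```python
# Illustrative sketch of separating "science handling" from "wave handling".
# Names are hypothetical and the Wave-specific event API is not reproduced here.

def moles_from_mass(mass_in_grams, molecular_weight):
    """Pure chemistry: no knowledge of Wave at all, trivially unit-testable."""
    return mass_in_grams / molecular_weight

def process_text(text):
    """Domain layer: find chemical markup and annotate it (annotate() as in the earlier sketch)."""
    return annotate(text)

def handle_submitted_blip(blip_text):
    """Thin robot-facing layer: everything Wave-specific stays out here, so the
    functions above can be re-used even if Wave is not the platform that wins."""
    return process_text(blip_text)
```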

These are just some initial thoughts from a very novice Python programmer. My code satisfies essentially none of these suggestions, but I will make a concerted attempt to improve on that. What I really want to do is kick the conversation on from where we are at the moment, which is basically playing around, to how we design an architecture that allows rapid development of useful and powerful functionality.

Some (probably not original) thoughts about originality

A number of things have prompted me to think about what makes a piece of writing “original” in a web-based world where we might draft things in the open, get informal public peer review, where un-refereed conference posters can be published online, and where pre-print servers of submitted versions of papers are increasingly widely used. I’m in the process of correcting an invited paper that derives mostly from a set of blog posts, and I had to revise another piece because it was too much like a blog post, but what got me thinking most was a discussion on the PLoS ONE Academic Editors forum about the originality requirements for PLoS ONE.

In particular the question arose of papers that have been previously peer reviewed and published, but in journals that are not indexed or that very few people have access to. Many of us have one or two papers in journals that are essentially inaccessible: local society journals, or just journals that were never online and never widely enough distributed for anyone to find. I have a paper in Complex Systems (volume 17, issue 4, since you ask) that is not indexed in Pubmed, is only available in a preprint archive, and has effectively no citations, probably because it isn’t in an index and no-one has ever found it. But it describes a nice piece of work that we went through hell to publish because we hoped someone might find it useful.

Now everyone agreed, and this is what the PLoS ONE submission policy says quite clearly, that such a paper cannot be submitted for publication. This is essentially a restatement of the Ingelfinger Rule. But being the contrary person I am, I started wondering why. For a commercial publisher with a subscription business model it is clear that you don’t want to take on content that you can’t exert a copyright over, but for a non-profit with a mission to bring science to a wider audience does this really make sense? If the science is currently inaccessible, is of appropriate quality for a given journal, and the authors are willing to pay the costs to bring it to a wider public, why is this not allowed?

The reason usually given is that if something is “already published” then we don’t need another version. But if something is effectively inaccessible, is that really true? Are preprints, conference proceedings, even privately circulated copies, not “already published”? There is also still a strong sense that there needs to be a “version of record”, and that there is a potential for confusion with different versions. There is a need for better practice in the citation of different versions of work, but this is a problem we already have, and a version in an obscure place is unlikely to cause confusion. Another reason given is that refereeing is a scarce resource that needs to be protected. This points to our failure to publish and re-use referees’ reports within the current system, to actually realise the value that we (claim to) ascribe to them. But again, if the author is willing to pay for this, why should they not be allowed to?

However, in my view, at the core of the rejection of “republication” is an objection to the idea that people might manage to get double credit for a single publication. In a world where the numbers matter, people do game the system to maximise the number of papers they have. Credit where credit’s due is a good principle, and people feel, rightly, uneasy with people getting more credit for the same work published in the same form. I think there are three answers to this: one social, one technical, and one…well, let’s just call it heretical.

Firstly, placing two versions of a manuscript on the same CV is simply bad practice. Physicists don’t list both the ArXiv and journal versions of papers on their publication lists. In most disciplines, where conference papers are not peer reviewed, they are listed separately from formally published peer reviewed papers in CVs. We have strong social norms around “double counting”. These differ from discipline to discipline as to whether work presented at conferences can be published as a journal paper, whether pre-prints are widely accepted, and how control needs to be exerted over media releases, but while there may be differences over what constitutes “the same paper”, there are strong social norms that you only publish the same thing once. These social norms are at the root of the objection to re-publication.

Secondly, the technical identification of duplicate available versions, either deliberately by the authors to avoid potential confusion, or in an investigative role to identify potential misconduct, is now trivial. A quick search can rapidly identify duplicate versions of papers. I note parenthetically that it would be even easier with a fully open access corpus, but where there is either misconduct or the potential for confusion, tools like Turnitin and Google will sort it out for you pretty quickly.

Finally though, for me the strongest answer to the concern over “double credit” is that this is a deep indication we have the whole issue backwards. Are we really more concerned about someone having an extra paper on their CV than we are about getting the science into the hands of as many people as possible? This seems to me a strong indication that we value the role of the paper as a virtual notch on the bedpost over its role in communicating results. We should never forget that STM publishing is a multibillion dollar industry supported primarily through public subsidy. There are cheaper ways to provide people with CV points if that is all we care about.

This is a place where the author (or funder) pays model really comes into its own. If an author feels strongly enough that a paper will reach a wider audience in a new journal, if they feel strongly enough that it will benefit from that journal’s peer review process, and they are prepared to pay a fee for that publication, why should they be prevented from doing so? If that publication does bring the science to a wider audience, is not a public service publisher discharging their mission through that publication?

Now I’m not going to recommend this as a change in policy to PLoS. It’s far too radical and would probably raise more problems in terms of public relations than it would solve in terms of science communication. But I do want to question the motivations that lie at the bottom of this traditional prohibition. As I have said before and will probably say over and over (and over) again. We are spending public money here. We need to be clear about what it is we are buying, whether it is primarily for career measurement or communication, and whether we are getting the best possible value for money. If we don’t ask the question, then in my view we don’t deserve the funding.

The Future of the Paper…does it have one? (and the answer is yes!)

A session entitled “The Future of the Paper” at Science Online London 2009 had a panel made up of an interesting set of people: Lee-Ann Coleman from the British Library, Katharine Barnes, the editor of Nature Protocols, Theo Bloom from PLoS, and Enrico Balli of SISSA Medialab.

The panelists rehearsed many of the issues and problems that have been discussed before, and I won’t re-hash them here. My feeling was that the panelists didn’t offer a radical enough view of the possibilities, but there was an interesting discussion around what a paper is for and where it is going. My own thinking on this has recently been revolving around the importance of a narrative as a human route into the data. It might be argued that if the whole scientific enterprise could be made machine readable then we wouldn’t need papers. Lee-Ann argued, and I agree, that the paper as the human readable version will retain an important place. Our scientific model building exploits our particular skill as story tellers, something computers remain extremely poor at.

But this is becoming an increasingly small part of the overall record itself. For a growing band of scientists the paper is only a means of citing a dataset or an idea. We need to widen the idea of what the literature is and what it is made up of. To do this we need to make all of these objects stable and citeable. As Phil Lord pointed out, this isn’t enough, because you also have to make those objects and their citations “count” for career credit. My personal view is that the market in talent will actually drive the adoption of wider metrics that are essentially variations of PageRank, because other metrics will become increasingly useless, and the market will become increasingly efficient as geographical location becomes gradually less important. But I’m almost certainly over-optimistic about how effective this will be.

Where I thought the panel didn’t go far enough was in questioning the form of the paper as an object within a journal. Essentially each presentation became “and because there wasn’t a journal for this kind of thing we created/will create a new one”. To me the problem isn’t the paper. As I said above the idea of a narrative document is a useful and important one. The problem is that we keep thinking in terms of journals, as though a pair of covers around a set of paper documents has any relevance in the modern world.

The journal used to play an important role in publication. The publisher still has an important role, but we need to step outside the notion of the journal and present different types of content and objects in the way that best suits each set of objects. The journal as brand may still have a role to play, although I think that is increasingly going to matter only at the very top of the market. The idea of the journal is both constraining our thinking about how best to publish different types of research object and distorting the way we do and communicate science. Data publication should be optimized for access to and discoverability of data; software publication should make the software available and usable. Neither is particularly helped by putting “papers” in “journals”. Both are helped by creating stable, appropriate publication mechanisms, with appropriate review mechanisms, making the objects citeable and making them valued. When our response to needing to publish something stops being “well, we’d better create a journal for that”, we might just have made it into the 21st century.

But the paper remains the way we tell stories about and around our science. And if we dumb humans are going to keep doing science, then it will continue to be an important part of the way we go about it.

Reflecting on a Wave: The Demo at Science Online London 2009

Yesterday, along with Chris Thorpe and Ian Mulvany, I was involved in what I imagine might be the first of a series of demos of Wave as it could apply to scientists and researchers more generally. You can see the backup video I made in case we had no network on Viddler. I’ve not really done a live demo like that before, so it was a bit difficult to tell how it was going from the inside, but although much of the tweetage was apparently underwhelmed, the direct feedback afterwards was very positive and perceptive.

I think we struggled to get across an idea of what Wave is, which confused a significant proportion of the audience, particularly those who weren’t already aware of it or who didn’t have a pre-conceived idea of what it might do for them. My impression was that those in the audience who were technically oriented were excited by what they saw. If I were to do a demo again I would focus more on telling a story about writing a paper, to really give people a context for what is going on. One problem with Wave is that it is easy to end up with a document littered with chat blips, and I think this confused an audience more used to thinking about documents.

The other problem is perhaps that a bunch of things “just working” is underwhelming when people are used to the idea of powerful applications that they do their work in. Developers get the idea that this is all happening and working in a generic environment, not a special purpose-built one, and that is exciting. Users just expect things to work or they’re not interested. Especially scientists. And it would be fair to say that the robots we demonstrated, mostly the work of a few hours or a few days, aren’t incredibly impressive on the surface. In addition, when it is working at its best, the success of Wave is that it can make things look easy, if not yet polished. Because it looks easy, people assume it is, and so it’s not worth getting excited about. The point is not that it is possible to automatically mark up text, pull in data, and then process it. It is that you can do this effectively in your email inbox with unrelated tools that are pretty easy to build, or at least adapt. But we also clearly need some flashier demos for scientists.

Ian pulled off a great coup in my view by linking up the output of one Robot to a visualization provided by another. Ian has written a robot called Janey which talks to the Journal/Author Name Estimator service. It can either suggest what journal to send a paper to based on the abstract or suggest other articles of interest. Ian had configured the Robot the night before so it could also get the co-authorship graph for a set of papers and put that into a new blip in the form of a list of edges (or co-authorships).

The clever bit was that Ian had found another Robot, written by someone entirely different, that visualizes connection graphs. Ian pointed the blip that Janey was writing to at one that the Graph robot was watching, and the automatically pulled data was automatically visualized [see a screencast here]. Two Robots written by different people for different purposes can easily be hooked up together and just work. I’m not even sure whether Ian had had a chance to test it prior to the demo…but it looked easy, and why wouldn’t people expect two data processing tools to work seamlessly together? I mean, it should just work.
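The pattern itself is easy to sketch outside Wave. Below is a hypothetical, framework-free Python mock-up of the hand-off: one “robot” writes a co-authorship edge list into a shared blip (modelled here as a plain dict), and a second robot, knowing nothing about the first, picks the edges up and renders them. The names are mine, not the real Wave Robots API and not Ian’s actual code; the point is only why the chaining can “just work”.

```python
# Hypothetical sketch of two independent "robots" cooperating through a shared blip.
# The blip is modelled as a plain dict; neither robot knows the other exists.

def janey_like_robot(blip, papers):
    """Write a co-authorship edge list into the blip being watched."""
    edges = []
    for paper in papers:
        authors = paper["authors"]
        # every pair of co-authors on a paper becomes one edge
        for i, a in enumerate(authors):
            for b in authors[i + 1:]:
                edges.append((a, b))
    blip["edges"] = edges

def graph_robot(blip):
    """Watch for an edge list in the blip and 'visualize' it (here: just print)."""
    for a, b in blip.get("edges", []):
        print(a, "--", b)

# Wiring: point the second robot at the same blip the first one writes to.
blip = {}
papers = [{"title": "An example paper", "authors": ["Neylon", "Mulvany", "Thorpe"]}]
janey_like_robot(blip, papers)
graph_robot(blip)
```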

The idea of a Wave as a data processing workflow was implicit in what I have written previously, but Ian’s demo, and a later conversation with Alan Cann, really sharpened it up in my mind. Alan was asking about different visual representations of a wave. The current client essentially uses the visual metaphor of an email system. One of the points that came out of the demo for me is that it will probably be necessary to write specific clients that make sense for specific tasks. Alan asked about the idea of a Yahoo Pipes type of interface. This suggests a different way of thinking about Wave: instead of a set of text or media elements, it becomes a way to wire up Robots, automated connections to web services. Essentially, with a set of Robots and an appropriate visual client you could build a visual programming engine, a web service interface, or indeed a visual workflow editing environment.
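Reduced to its bones, that “pipes” view of a wave is just function composition: each robot takes the document, transforms it, and hands it on. A toy sketch, with entirely made-up stages standing in for real robots:

```python
# Toy "pipes" view of a wave: each robot is a function from document to document,
# and a workflow is just an ordered list of robots applied to the same document.
# The stages below are invented stand-ins, not real Wave robots.

def markup_chemicals(doc):
    # stand-in for a robot that recognises and tags chemical names
    return doc.replace("aspirin", "[chem]aspirin[/chem]")

def pull_in_data(doc):
    # stand-in for a robot that fetches data for anything that has been tagged
    return doc + "\n(data table for tagged compounds pulled in here)"

def pipeline(doc, robots):
    for robot in robots:
        doc = robot(doc)
    return doc

print(pipeline("We dissolved aspirin in ethanol.", [markup_chemicals, pull_in_data]))
```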

The Wave client has to walk a very fine line between presenting a view of the Wave that the target user can understand and work with, and the risk of constraining the user’s thinking about what can be done. The amazing thing about Wave as a framework is that these things are not only doable but often very easy. The challenge is actually thinking laterally enough to even ask the question in the first place. The great thing about a public demo is that the challenges you get from the audience make you look at things in different ways.

Allyson Lister blogged the session, there was a FriendFeed discussion, and there should be video available at some point.

Where is the best place in the Research Stack for the human API?

There was an interesting conversation yesterday on Twitter with Evgeniy Meyke of EarthCape, prompted in part by my last post. We started talking about what a Friendfeed replacement might look like and how it might integrate more directly with scientific data. Is it possible to build something general, or will it always need to be domain specific? Might that in fact be an advantage? Evgeniy asked:

@CameronNeylon do you think that “something new” could be more vertically oriented rather then for “research community” in general?

His thinking being, as I understand it, that getting at domain-specific underlying data is always likely to take local knowledge. As he said in his next tweet:

@CameronNeylon It might be that the broader the coverage the shallower is integration with underlining research data, unless api is good

This led me to thinking about integration layers between data and people, and recalled something I said in jest to someone some time ago:

“If you’re using a human as your API then you need to work on your user interface.”

Thinking about the way Friendfeed works, there is a real sense in which the system talks to a wide range of automated APIs, but at the core there is a human layer that first selects feeds of interest and then, when presented with other feeds, selects specific items from them. What Friendfeed does very well, in some senses, is provide a flexible API between feeds and the human brain. But Evgeniy made the point that this “works only 4 ‘discussion based’ collaboration (as in FF), not 4 e.g. collab. taxonomic research that needs specific data inegration with taxonomic databases”.

Following from this was an interesting conversation [Webcite Archived Version] about how we might best integrate the “human API” for some imaginary “Science Stream” with domain-specific machine APIs that work at the data level. In a sense this is the core problem of scientific informatics. How do you optimise the ability of machines to abstract and use data and meaning, while at the same time fully exploiting the ability of the human scientist to contribute their own unique skills: pattern recognition, insight, lateral thinking? And how do you keep these in step with each other so both are optimally utilised? Thinking in computational terms about the human as a layer in the system, with its own APIs, could be a useful way to design systems.
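One loose way to picture the human as a layer with its own API is to write the stack down as functions and leave one of them for a person to implement. This is only a thought experiment; the layer names, fields and the “judgement” stand-in below are all invented for illustration.

```python
# Thought experiment: the research stack as layers, one of which is a human.
# Layer names, fields and the example items are invented for illustration only.

def feed_api(sources):
    """Machine layer: merge many feeds into one stream of items."""
    return [item for source in sources for item in source]

def human_api(stream, judgement):
    """Human layer: the person is the filter; `judgement` stands in for their expertise."""
    return [item for item in stream if judgement(item)]

stream = feed_api([
    [{"title": "New SAXS dataset on membrane proteins", "tags": ["data"]}],
    [{"title": "Conference photos", "tags": ["misc"]}],
])
picked = human_api(stream, lambda item: "data" in item["tags"])
print(picked)
```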

Friendfeed, in this view, is a peer-to-peer system for pushing curated and annotated data streams. It mediates interactions with the underlying stream but also with other known and unknown users. Friendfeed seems to get three things very right: 1) optimising the interaction with the incoming data stream; 2) facilitating the curation and republication of data into a new stream for consumption by others, creating a virtuous feedback loop in fact; and 3) facilitating discovery of new peers. Friendfeed is, in effect, a BitTorrent for sharing conversational objects.

This conversational layer, a research discourse layer if you like, is at the very top of the stack, keeping the humans at a high, abstracted level of conversation, where we are probably still at our best. And my guess is that something rather like Friendfeed is pretty good at being the next layer down, the API to feeds of interesting items. But Evgeniy’s question was more about the bottom of the stack, where the data is being generated and needs to be turned into a useful and meaningful feed, ready to be consumed. The devil is always in the details, and vertical integration is likely to help here. So what do these vertical segments look like?

In some domains these might be lab notebooks, in some they might be specific databases, or they might be a mixture of both and of other things. At the coal face it is likely to be difficult to find a way of describing the detail that is both generic enough to be comprehensible and detailed enough to be useful. The needs of the data generator are likely to be very different from those of a generic data consumer. But if there is a curation layer, perhaps human or machine mediated, that partly abstracts this, then we may be on the way to generating the generic feeds that will finally be consumed at the top layer. This curation layer would enable semantic markup, ideally automatically, would require domain-specific tooling to translate from the specific to the generic, and would provide a publishing mechanism. In short, it sounds (again) quite a bit like Wave. Actually it might just as easily be Chem4Word or any other domain-specific semantic authoring tool, or just a translation engine that takes in detailed domain-specific information and correlates it with a wider vocabulary.
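As a concrete, if entirely invented, illustration of that curation layer: a small translator that takes a domain-specific lab record and re-expresses it against a generic vocabulary, ready to be published as a feed item. The field names and the vocabulary mapping are hypothetical, not drawn from any existing standard.

```python
# Hypothetical curation-layer translator: domain-specific record in, generic feed entry out.
# Field names and the "generic vocabulary" mapping are invented for illustration.

DOMAIN_TO_GENERIC = {
    "sample_id":  "subject",
    "beamline":   "instrument",
    "exposure_s": "duration",
}

def curate(record, notebook_url):
    entry = {"source": notebook_url, "type": "measurement"}
    for domain_key, generic_key in DOMAIN_TO_GENERIC.items():
        if domain_key in record:
            entry[generic_key] = record[domain_key]
    # anything the vocabulary doesn't cover is kept, but flagged as domain specific
    entry["domain_specific"] = {k: v for k, v in record.items()
                                if k not in DOMAIN_TO_GENERIC}
    return entry

lab_record = {"sample_id": "MP-17", "beamline": "I22", "exposure_s": 30, "buffer": "PBS"}
print(curate(lab_record, "http://example.org/notebook/2009-08-22"))
```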

One of the things that appeals to me about Wave, and Chem4Word, is that they can (or at least have the potential to) hide the complexities of the semantics within a straightforward and comprehensible authoring environment. Wave can be integrated into domain-specific systems via purpose-built Robots, making it highly extensible. Both are capable of “speaking web” and generating feeds that can be consumed and processed in other places and by other services. At the bottom layer we can chew the problem off one piece at a time, including human processing where it is appropriate and avoiding it where we can.

The middleware is, of course, as always, the problem. The middleware is agreed and standardised vocabularies and data formats. While in the past I have thought this near intractable, it now seems as though many of the pieces are falling into place. There is still a great need for standardisation, and perhaps a need for more meta-standards, but a lot of this does seem to be on the way. I’m still not convinced that we have a useful vocabulary for actually describing experiments, but enough smart people disagree with me that I’m going to shut up on that one until I’ve found the time to have a closer look at the various things out there in more detail.

These are half-baked thoughts, but I think the question of where we optimally place the human in the system is a useful one. It also hasn’t escaped my notice that I’m talking about something very similar to the architecture that Simon Coles of Amphora Research Systems always puts up in his presentations on Electronic Lab Notebooks. Fundamentally, that’s because the same core drivers are there.