Why good intentions are not enough to get negative results published

There is a set of memes that seems to be popping up with increasing regularity in the last few weeks. The first is that more of the outputs of scientific research need to be published. Sometimes this means the publication of negative results; at other times it means that a community doesn’t feel it has an outlet for its particular research field. The traditional response is “we need a journal for this”. Over the years there have been many attempts to create a “Journal of Negative Results”. There is a Journal of Negative Results – Ecology and Evolutionary Biology (two papers in 2008), a Journal of Negative Results in Biomedicine (four papers in 2009, and actually looking pretty active), a Journal of Interesting Negative Results in Natural Language (one paper), and a Journal of Negative Results in Speech and Audio Sciences, which appears to be defunct.

The idea is that there is a huge backlog of papers detailing negative results that people are gagging to get out, if only there were somewhere to publish them. Unfortunately there are several problems with this. The first is that actually writing a paper is hard work. Most academics I know do not have the problem of having nothing to publish; they have the problem of getting around to writing the papers, sorting out the details, and making sure that everything is in good shape. This leads to the second problem: getting a negative result to a standard worthy of publication is much harder than doing the same for a positive result. You only need to make that compound, get that crystal, clone that gene, or get the microarray to work once, and you’ve got the data to analyse for publication. To show that something doesn’t work you need to repeat it several times, make sure your statistics are in order, and establish your working conditions. Partly this is a problem with the standards we apply to recording our research; designing experiments so that negative results are well established is not high on many scientists’ priorities. But partly it is the nature of the beast. Negative results need to be much more tightly bounded to be useful.

Finally, even if you can get the papers, who is going to read them? And more importantly, who is going to cite them? Because if no-one cites them, the standing of your journal is not going to be very high. Will people pay to have papers published there? Will you be able to get editors? Will people referee for you? Will people pay for subscriptions? Clearly this journal will be difficult to fund and keep running. And this is where the second meme comes in, one which still gets surprising traction: that “publishing on the web is free”. Now we know this isn’t the case, but there is a slightly more sophisticated version, which is “we will be able to manage with volunteers”. After all, with a couple of dedicated editors donating their time, peer review being done for free, and authors taking on the formatting role, surely the costs can be kept manageable? Some journals do survive on this business model, but it requires real dedication and drive, usually on the part of one person. The unfortunate truth is that putting a lot of your spare time into supporting a journal which is not regarded as high impact (however that is measured) is not very attractive.

For this reason, in my view, these types of journals need much more work put into the business model than a conventional specialist journal does. To have any credibility in the long term you need a business model that works for the long term. I am afraid that “I think this is really important” is not a business model, no matter how good your intentions. A lot of the standing of a journal is tied up with the author’s view of whether it will still be there in ten years’ time. If that isn’t convincing, they won’t submit; if they don’t submit, you have no impact; and in the long term there is a downward spiral until you have no journal.

The fundamental problem is that the “we need a journal” approach is stuck in the printed page paradigm. To get negative results published we need to lower the barriers to publication well below where they currently are, while at the same time applying either a pre- or post-publication filter. Rich Apodaca, writing on Zusammen last week, talked about micropublication in chemistry: the idea of reducing the smallest publishable unit by providing routes to submit smaller packages of knowledge or data to some sort of archive. This is technically possible today; services like ChemSpider, NMRShiftDB, and others make it possible to submit small pieces of information to a central archive. More generally, the web makes it possible to publish whatever we want, in whatever form we want, and hopefully semantic web tools will enable us to do this in an increasingly useful form in the near future.
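
As a rough illustration of what the author’s side of a micropublication route could look like, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the archive URL, the field names, and the idea that the service accepts a bare JSON record over HTTP. Real services such as ChemSpider or NMRShiftDB have their own deposition interfaces.

    import requests

    # A hypothetical "smallest publishable unit": one measurement with just
    # enough metadata to be findable and attributable.
    record = {
        "title": "13C NMR shifts for compound 4b",
        "authors": ["J. Bloggs"],
        "data": {"solvent": "CDCl3", "shifts_ppm": [170.2, 136.5, 128.9, 52.1]},
        "licence": "public-domain",
    }

    # The endpoint is invented; substitute whatever deposition API the
    # archive actually exposes.
    resp = requests.post("https://archive.example.org/api/depositions",
                         json=record, timeout=30)
    resp.raise_for_status()
    print("Deposited as record", resp.json().get("id"))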

Fundamentally, my personal belief is that the vast majority of “negative results” journals, and other journals trying to expand the set of publishable work, will not succeed. This is precisely because they are pushing the limits of the “publish through a journal” approach by setting up a journal. To succeed these efforts need to embrace the nature of the web, to act as a web-native resource, and not as a printed journal that happens to be viewed in a browser. This does two things: it reduces the barrier to authors submitting work, making the project more likely to succeed, and it can also reduce costs. It doesn’t in itself provide a business model, nor does it provide quality assurance, but it can provide a much richer set of options for developing both of these in ways that are appropriate to the web. Routes towards quality assurance are well established, but suffer from the ongoing problem of getting researchers involved in the process, a subject for another post. Micropublication might work through micropayments; the whole lab book might be hosted for a fee with a number of “publications” bundled in; research funders might pay for services directly; or, more interestingly, the archive might be able to sell services built on top of the data, truly adding value to it.

But the key is low barriers for authors and a robust business model that can operate even if the service is perceived as being low impact. Without these you are creating a lot of work for yourself, and probably a lot of grief. Nothing comes free, and if there isn’t income, that cost will be your time.

The API for the physical world is coming – or – how a night out in Bath ended with chickens and sword dancing

Last night I made it for the first time to one of the BathCamp evening events organised by Mike Ellis in Bath. The original BathCamp was a BarCamp held in late summer last year that I couldn’t get to, but Mike has been organising a couple of evening events since. I thought I’d go along because it sounded like fun, because I knew a few of the people would be good value, and because I didn’t know any of the others.

Last night there were two talks. The first was from Dale Lane, talking about home electricity use monitoring, with a focus on what can be done with the data. He showed monitoring, mashups, competitions for who could use the least electricity, and widgets to make your phone beep when consumption jumped; the latter is apparently really irritating, as your phone goes off every time the fridge comes on. Irritating, but it makes you think. Dale also touched on the potential privacy issues, a very large grey area which is of definite interest to me: who owns this data, and who has the right to use it?

Dale’s talk was followed by one from Ben Tomlinson, about the Arduino platform. Now maybe everyone knows about this already, but it was new to me. Essentially Arduino is a cheap open source platform for building programmable interactive devices. The core is a £30 board with enough processing power and memory to run reasonably sensible programs. There is already quite a big code base which you can download and then flash onto the boards. The boards themselves take some basic analogue and digital inputs. But then there are all the other bits and bobs you can get, both generic components with standard interfaces and add-ons built to fit, literally plug and play: motion sensors, motor actuators, ethernet, wireless, even web servers!

Now these were developed almost as toys, or for creatives, but what struck me was that this is a platform for building simple, cheap scientific equipment: things to do basic measurement jobs and then stream out the results. One thing we’ve learnt over the years is that the best way to make well recorded and reliable measurements of something is to take humans out of the equation as far as possible. This platform basically makes it possible to knock up sensors, communication devices, and responsive systems for less than the price of a decent mp3 player. It is a commoditized instrument platform, and one that comes with open source tools to play with.
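
To make the “commoditized instrument” idea concrete, here is a minimal sketch of the software half of such a device, in Python using the pyserial library: the board prints one reading per line over USB serial, and the script timestamps and logs whatever arrives. The device path, baud rate, and output format are assumptions that depend entirely on the actual board and sketch.

    import csv
    import time

    import serial  # pyserial

    # Assumed: the board prints one numeric reading per line at 9600 baud.
    port = serial.Serial("/dev/ttyUSB0", 9600, timeout=2)

    with open("sensor_log.csv", "a", newline="") as logfile:
        writer = csv.writer(logfile)
        while True:
            line = port.readline().decode("ascii", errors="ignore").strip()
            if line:
                # Record the reading against a timestamp so the instrument
                # can be left running unattended for months.
                writer.writerow([time.strftime("%Y-%m-%dT%H:%M:%S"), line])
                logfile.flush()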

Both Dale and Ben mentioned a site I hadn’t come across before, called Pachube (I think it is supposed to be a contraction of “patch” and “YouTube”, but I’m not sure; no-one seemed to know how to pronounce it), which provides a simple platform for hooking up location-centric streamed data. Dale is streaming his electricity consumption data in, and someone is streaming the flow of the Colorado River. What could we do with lab sensor data? Or even experimental data? What happens when you can knock up an instrument to do a measurement in the morning and then just leave it recording for six months? What if that instrument then starts interacting with the world?
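
Pushing a reading to a Pachube-style feed is not much more work than logging it locally. The sketch below is indicative only: the endpoint, payload format, and authentication header are placeholders standing in for whatever a real feed service documents, not Pachube’s actual API.

    import time

    import requests

    # Placeholders: a real feed service defines its own URL scheme,
    # payload format, and API key header.
    FEED_URL = "https://feeds.example.org/api/feeds/1234/datapoints"
    API_KEY = "my-secret-key"

    def push_reading(value):
        resp = requests.post(
            FEED_URL,
            json={"at": time.strftime("%Y-%m-%dT%H:%M:%S"), "value": value},
            headers={"X-Api-Key": API_KEY},
            timeout=10,
        )
        resp.raise_for_status()

    push_reading(21.7)  # e.g. a temperature reading from a lab sensor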

All of this has been possible for some time if you have the time and the technical chops to put it together. What is changing is two things. The first is the ease of being able to plug and play with these systems: download a widget, plug in a sensor, connect to the local wireless. As this becomes accessible, people start to play. It was no accident that Ben’s central examples were hacked toy chickens. The second key issue is that as these become commoditized the prices drop until, as Dale mentioned, electricity companies are giving away the meters, and the Arduino kit can be put together for less than £50. Not only will this let us do new science, it will let us involve the whole world in doing that science.

When you abstract away the complexities and details of dealing with a system into a good API, you enable far more people than just the techies to start wiring things up. When artists, children, scientists, politicians, school teachers, and journalists feel comfortable putting these things together, amazing things start to happen. These tools are very close to being the API that links the physical and online worlds.

The sword dancing? Hard to explain but probably best to see it for yourself…

What is the cost of peer review? Can we afford (not to have) high impact journals?

Late last year the Research Information Network held a workshop in London to launch a report and, in many ways more importantly, a detailed economic model of the scholarly publishing industry. The model aims to capture the diversity of the scholarly publishing industry and to isolate costs and approaches, enabling the user to ask questions such as “what is the consequence of moving to a 95% author-pays model?” as well as simply how much money is going in and where it ends up. I’ve been meaning to write about this for ages, but a couple of things in the last week have prompted me to get on and do it.

The first of these was an announcement by email [can’t find a copy online at the moment] from the EPSRC, the UK’s main funder of physical sciences and engineering. While the requirement for a two page economic impact statement for each grant proposal got more headlines, what struck me as much more important were two other policy changes. The first was that, unless specifically invited, rejected proposals cannot be resubmitted. This may seem strange, particularly to US researchers, for whom a process of refinement and resubmission, perhaps multiple times, is standard, but the BBSRC (UK biological sciences funder) has had a similar policy for some years. The second, frankly somewhat scary, change is that some proportion of researchers with a history of rejection will be barred from applying altogether. What is the reason for these changes? Fundamentally, the burden of carrying out peer review on all of the submitted proposals is becoming too great.

The second thing was that, for the first time, I have been involved in refereeing a paper for a Nature Publishing Group journal. Now I like to think, as I guess everyone else does, that I do a reasonable job of paper refereeing. I wrote perhaps one and a half sides of A4 describing what I thought was important about the paper and making some specific criticisms and suggestions for changes. The paper went around the loop, and on the second revision I saw what the other referees had written: pages upon pages of closely argued and detailed points. Now the other referees were much more critical of the paper, but nonetheless this supported a suspicion I have had for some time: that refereeing at some high impact journals is qualitatively different from what the majority of us receive, and probably deliver, which is an often form-driven exercise with a couple of lines of comments and complaints. This level of quality in peer review takes an awful lot of time and it costs money; money that is coming from somewhere. Nonetheless it provides better feedback for authors and no doubt means the end product is better than it would otherwise have been.

The final factor was a blog post from Molecular Philosophy discussing why the author felt that Open Access publishers, if not doomed to failure, at least face a very challenging road ahead. The centre of the argument, as I understand it, focused on the costs of high impact journals, particularly the costs of selection, refinement, and preparation for print. Broadly speaking, I think it is generally accepted that a volume model of OA publication, such as that practiced by PLoS ONE and BMC, can be profitable. I think it is also generally accepted that a profitable business model for high impact OA publication has yet to be convincingly demonstrated. The question I would like to ask, though, is different. The Molecular Philosophy post skips the zeroth order question: can we afford high impact publications?

Returning to the RIN-funded study and model of scholarly publishing, some very interesting points came out [see Daniel Hull’s presentation for most of the data here]. The first of these, which in retrospect is obvious but important, is that the vast majority of the costs of producing a paper are incurred in doing the research it describes (£116G worldwide). The second biggest contributor? Researchers reading the papers (£34G worldwide). Only about 14% of the costs of the total life cycle are directly attributable to publication. But that is the 14% we are interested in, so how does it divide up?

The “Scholarly Communication Process”, as everything in the middle is termed in the model, is divided up into actual publication/distribution costs (£6.4G), access provision costs (providing libraries and internet access, £2.1G), and the costs of researchers looking for articles (£16.4G). Yes, the biggest cost is the time you spend trying to find those papers. Arguably that is a sunk cost, inasmuch as once you’ve decided to do research, searching for information is a given, but it does make the point that more efficient searching has the potential to save a lot of money. In any case it is a non-cash cost in terms of journal subscriptions or author charges.

So to find the real costs of publication per se we need to look inside that £6.4G. Of the costs of actually publishing the articles, the biggest single cost is peer review, weighing in at around £1.9G globally, just ahead of fixed “first copy” publication costs of £1.8G. So 29% of the total costs incurred in the publication and distribution of scholarly articles arises from the cost of peer review.

There are lots of other interesting points in the reports and models (the UK is a net exporter of peer review, but the UK publishes more articles than would be expected based on its subscription expenditure), but the most interesting aspect of the model is its ability to model changes in the publishing landscape. The first scenario presented is one in which publication moves to being 90% electronic. This actually leads to a fairly modest decrease in costs, with a total saving of a little under £1G (less than 1%). Modeling a move to a 90% author-pays model (assuming 90% electronic only) leads to very little change overall, but interestingly that depends significantly on the cost of the systems put in place to make author payments. If these are expensive and bureaucratic then costs can rise, as many small payments are more expensive to handle than a few big ones. But overall the costs shouldn’t need to change much, meaning that if mechanisms can be put in place to move the money around, the business models should ultimately be able to make sense. None of this, however, helps in figuring out how to manage a transition from one system to another, when for all useful purposes costs are likely to double in the short term as systems are duplicated.

The most interesting scenario, though, was the third: what happens as research expands? A 2.5% real increase year on year for ten years was modeled. This may seem profligate in today’s economic situation, but with many countries explicitly spending stimulus money on research, or already engaged in large scale increases of structural research funding, it may not be far off. This results in 28% more articles, 11% more journals, a 12% increase in subscription costs (assuming of course that only the real cost increases are passed on), and a 25% increase in the costs of peer review (£531M on a base of £1.8G).
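
As a quick check on the compounding, 2.5% real growth sustained for ten years gives

    (1 + 0.025)^{10} \approx 1.28

which is where the 28% increase in article volume comes from; the other figures then follow from the model’s assumptions about how journals, subscriptions, and peer review scale with article numbers.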

I started this post talking about proposal refereeing. The increased cost of refereeing proposals as the volume of science increases would be added on top of that for journals. I think it is safe to say that the increase in cost would be of the same order. The refereeing system is already struggling under the burden. Funding bodies are creating new, and arguably totally unfair, rules to try to reduce the burden, and journals are struggling to find referees for papers. Increases in the volume of science, whether they come from increased funding in the western world or from growing, increasingly technology driven, economies could easily increase that burden by 20-30% in the next ten years. I am sceptical that the system, as it currently exists, can cope, and I am sceptical that peer review, in its current form, is affordable in the medium to long term.

So, bearing in mind Paulo’s admonishment that I need to offer solutions as well as problems, what can we do about this? We need to find a way of doing peer review effectively, but it needs to be more efficient. Equally, if there are areas where we can save money we should be doing that. Remember that £16.4G just to find the papers to read? I believe in post-publication peer review because it reduces the costs and time wasted in bringing work to community view, and because it makes the filtering and quality assurance of that published work continuous and ongoing. But in the current context it also offers significant cost savings. A significant proportion of published papers are never cited. To me it follows that there is no point in peer reviewing them. Indeed, citation is an act of post-publication peer review in its own right, and it has recently been shown that Google PageRank-type algorithms do a pretty good job of identifying important papers without any human involvement at all (beyond the act of citation). Of course, for PageRank mechanisms to work well the citation and its full context are needed, making OA a prerequisite.
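
As a sketch of how little machinery this kind of filtering needs, the snippet below (Python, using the networkx library) runs PageRank over a toy citation graph. The papers and citations are made up; the point is simply that cited papers accumulate score from the papers that cite them, with no human judgement beyond the original acts of citation.

    import networkx as nx

    # Toy citation graph: an edge A -> B means "paper A cites paper B".
    citations = [
        ("paperA", "paperC"),
        ("paperB", "paperC"),
        ("paperC", "paperD"),
        ("paperA", "paperD"),
    ]
    graph = nx.DiGraph(citations)

    # PageRank treats each citation as a weighted vote: papers cited by
    # well-cited papers rank highest.
    ranks = nx.pagerank(graph, alpha=0.85)
    for paper, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
        print(paper, round(score, 3))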

If refereeing can be restricted to those papers that are worth the effort, then it should be possible to reduce the burden significantly. But what does this mean for high impact journals? The whole point of high impact journals is that they are hard to get into. This is why both the editorial staff and peer review costs are so high for them. Many people make the case that they are crucial for helping to pick out the important papers (remember that £16.4G again). In turn I would argue that they reduce value by making the process of deciding what is “important” a closed shop, taking that decision away, to a certain extent, from the community where I feel it belongs. But at the end of the day it is a purely economic argument. What is the overall cost of running, supporting through peer review, and paying for, either by subscription or via author charges, a journal at the very top level? What are the benefits gained in terms of filtering, and how do they compare to other filtering systems? Do the benefits justify the costs?

If we believe that there are better filtering systems possible, then they need to be built, and the cost benefit analysis done. The opportunity is coming soon to offer different, and more efficient, approaches as the burden becomes too much to handle. We either have to bear the cost or find better solutions.

[This has got far too long already – and I don’t have any simple answers in terms of refereeing grant proposals but will try to put some ideas in another post which is long overdue in response to a promise to Duncan Hull]

Contributor IDs – an attempt to aggregate and integrate

Following on from my post last month about using OpenID as a way of identifying individual researchers, Chris Rusbridge made the sensible request that when conversations go spreading themselves around the web it would be good if they could be summarised and aggregated back together. Here I am going to make an attempt to do that – but I won’t claim that this is a completely unbiased account. I will try to point to as much of the conversation as possible, but if I miss things out or misrepresent something please correct me in the comments or the usual places.

The majority of the conversation around my post occurred on FriendFeed, at the item here, but also see commentary around Jan Aert’s post (and FriendFeed item) and Bjoern Bremb’s summary post. Other commentary included posts from Andy Powell (Eduserv), Chris Leonard (PhysMathCentral), Euan, Amanda Hill of the Names project, and Paul Walk (UKOLN). There was also a related article in Times Higher Education discussing the article (Bourne and Fink) in PLoS Comp Biol that kicked a lot of this off [Ed – Duncan Hull also pointed out there is a parallel discussion about the ethics of IDs that I haven’t kept up with – see the commentary at the PLoS Comp Biol paper for examples]. David Bradley also pointed out to me a post he wrote some time ago which touches on some of the same issues, although from a different angle. Pierre set up a page on OpenWetWare to aggregate material, and Martin Fenner has a collected set of bookmarks with the tag authorid at Connotea.

The first point, which seems to be one of broad agreement, is that there is a clear need for some form of unique identifier for researchers. This is not necessarily as obvious as it might seem. With many of these proposals there is significant push back from communities who don’t see any point in the effort involved. I haven’t seen any evidence of that with this discussion, which leads me to believe that there is broad support for the idea from researchers, informaticians, publishers, funders, and research managers. There is also strong agreement that any system that works will have to be credible and trustworthy to researchers as well as other users, and have a solid and sustainable business model. Many technically minded people pointed out that building something was easy – getting people to sign up to use it was the hard bit.

Equally, and here I am reading between the lines somewhat, any workable system would have to be well designed and easy for researchers to use. There was much back and forth about how “RDF is too hard”, “you can’t expect people to generate FOAF”, and “OpenID has too many technical problems for widespread uptake”. Equally, people thinking about what the back end would have to look like to even stand a chance of providing an integrated system that works felt that FOAF, RDF, OAuth, and OpenID would have to provide a big part of the gubbins. The message for me was that the way the user interface(s) are presented has to be got right. There are small models of aspects of this that show that easy interfaces can be built to capture sophisticated data, but getting it right at scale will be a big challenge.

Where there is less agreement is on the details, both technical and organisational, of how best to go about creating a useful set of unique identifiers. There was some to-and-fro as to whether CrossRef was the right organisation to manage such a system. Partly this turned on concern over centralised versus distributed systems, and partly on issues of scope and trust. Nonetheless the majority view appeared to be that CrossRef would be the right place to start, and CrossRef do seem to have plans in this direction (from Geoffrey Bilder; see this FriendFeed item).

There was also a lot of discussion around identity tokens versus authorisation. Overall the view seemed to be that these can productively be kept separate. One of the things that appealed to me in the first instance was that OpenIDs could be used either as tokens (just a unique code that is used as an identifier) or as a login mechanism. The machinery is already in place to make that work. Nonetheless it was generally accepted, I think, that the first important step is an identifier. Login mechanisms are not necessarily required, or even wanted, at the moment.

The discussion as to whether OpenID is a good mechanism seemed in the end to go around in circles. Many people brought up technical problems they had getting OpenIDs to work, and there are ongoing problems both with the underlying services that support and build on the standard and with the quality of some of the services that provide OpenIDs. This was at the core of my original proposal to build a specialist provider that had an interface and functionality that worked for researchers. As Bjoern pointed out, I should of course be applying my own five criteria for successful web services (go to the last slide) to this proposal. Key questions: 1) Can it offer something compelling? Well no, not unless someone, somewhere requires you to have this thing. 2) Can you pre-populate? Well yes, and maybe that is the key (see later). In the end, as with the concern over other “informatics-jock” terms and approaches, the important thing is that all of the technical side is invisible to end users.

Another important discussion that, again, didn’t really come to a conclusion was who would hand out these identifiers, and when? Here there seemed to be two different perspectives: those who wanted the identifiers to be completely separate from institutional associations, at least at first order, and those who seemed concerned that access to identifiers be controlled via institutions. I definitely belong in the first camp. I would argue that you just give them to everyone who requests them. The problem then comes with duplication: what if someone accidentally (or deliberately) ends up with two or more identities? At one level I don’t see that it matters to anyone except the person concerned (I’d certainly be trying to avoid having my publication record cut in half). But at the very least you would need a good interface for merging records when required. My personal belief is that it is more important to allow people to contribute than to protect the ground. I know others disagree, and that somewhere we will need to find a middle path.

One thing that was helpful was that we seemed to do a pretty good job of getting various projects in this space aggregated together (and possibly more aware of each other). Among these are ResearcherID, a commercial offering that has been running for a while now; the Names project, a collaboration of Mimas and the British Library funded by JISC; ClaimID, an OpenID provider that some people use which provides some of the flexible “home page” functionality (see Maxine Clark’s for instance) that drove my original ideas; and PublicationsList.org, which provides an online homepage and does what ClaimID doesn’t, offering a PubMed search that makes it easier (as long as your papers are in PubMed) to populate that home page with your papers (but not easier to include datasets, blogs, or wikis – see here for my attempts to include a blog post on my page). There are probably a number of others; feel free to point out what I’ve missed!

So finally, where does this leave us? With a clear need for something to be done, with a few organisations identified as the best ones to take it forward, and with a lot of discussion still required about the background technicalities. If you’re still reading this far down the page then you’re obviously someone who cares about this. So I’ll give my thoughts; feel free to disagree!

  1. We need an identity token, not an authorisation mechanism. Authorisation can easily get broken and is technically hard to implement across a wide range of legacy platforms. If it is possible to build in the option of authorisation in the future then that is great, but it is not the current priority.
  2. The backend gubbins will probably be distributed RDF. There is identity information all over the place which needs to be aggregated together. This isn’t likely to change, so a centralised database, to my mind, will not be able to cope. RDF is built to deal with these kinds of problems, and it also allows multiple potential identity tokens to be pulled together to say they represent one person (see the sketch after this list).
  3. This means that user interfaces will be crucial. The simpler the better, but the backend, with words like FOAF and RDF, needs to be effectively invisible to the user. Very simple interfaces asking “are you the person who wrote this paper?” are going to win; complex signup procedures are not.
  4. Publishers and funders will have to lead. The end view of what is being discussed here is very like a personal home page for researchers. But instead of being a home page on a server it is a dynamic document pulled together from stuff all over the web. Researchers are not, for the most part, going to be interested in having another home page that they have to look after. Publishers in particular understand the value of unique identifiers (and will get the most value out of them in the short term), so with the most to gain and the most direct interest they are best placed to lead, probably through organisations like CrossRef that aggregate things of interest across the industry. Funders will come along as they see the benefits of monitoring research outputs; forward looking ones will probably come along straight away, others will lag behind. The main point is that pre-populating and then letting researchers come along and prune and correct is going to be more productive than waiting for ten million researchers to sign up to a new service.
  5. The really big question is whether there is value in doing this specially for researchers. This is not a problem unique to research, and it is one for which a variety of messy and disparate solutions are starting to arise. Maybe the best option is to sit back and wait to see what happens. I often say that in most cases generic services are a better bet than specially built ones for researchers, because the community size isn’t there and there simply isn’t a sufficient need for added functionality. My feeling is that for identity there is a special need, and that if we capture the whole research community it will be big enough to support a viable service. There is a specific need for following and aggregating the work of people that I don’t think is general, and it is different from the authentication issues involved in finance. So I think in this case it is worth building specialist services.
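
To make point 2 slightly less abstract, here is a minimal sketch (Python, using the rdflib library) of how several identity tokens and a publication might be tied to one person in RDF. Every URI in it is invented for illustration; the vocabularies (FOAF, Dublin Core terms, owl:sameAs) are the standard ones people were discussing.

    from rdflib import Graph, Literal, Namespace, URIRef
    from rdflib.namespace import OWL, RDF

    FOAF = Namespace("http://xmlns.com/foaf/0.1/")
    DCT = Namespace("http://purl.org/dc/terms/")

    g = Graph()

    # All identifiers below are made up for illustration.
    person = URIRef("http://example.org/people/jbloggs")
    openid = URIRef("https://jbloggs.openid-provider.example/")
    contributor_id = URIRef("http://contributor-registry.example/id/0000-1234")
    paper = URIRef("http://dx.doi.org/10.xxxx/example-doi")

    g.add((person, RDF.type, FOAF.Person))
    g.add((person, FOAF.name, Literal("Josephine Bloggs")))
    g.add((person, FOAF.openid, openid))
    # Assert that two tokens refer to the same researcher.
    g.add((person, OWL.sameAs, contributor_id))
    # Link a published object back to its author.
    g.add((paper, DCT.creator, person))

    print(g.serialize(format="turtle"))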

The best hope, I think, lies in individual publishers starting to disambiguate authors across their existing corpus. Many have already put a lot of effort into this. In turn, perhaps through CrossRef, it should be possible to agree an arbitrary identifier for each individual author. If this is exposed as a service it is then possible to start linking the information up. People can and will do that, and services will start to grow around it. Once this exists, some of the ideas around recognising referees and other efforts will start to flow.

The growth of linked up data in chemistry – and good community projects

It’s been an interesting week or so in the online chemistry world. Following on from my musings about data services, and the preparation I was doing for a talk the week before last, I asked Tony Williams whether it was possible to embed spectra from ChemSpider on a generic web page in the same way that you would embed a YouTube video, Flickr picture, or Slideshare presentation. The idea is that if there are services out on the cloud that make it easier to put rich material in your own online presence, by hosting it somewhere that understands your data type, then we have a chance of pulling all of these disparate online presences together.

Tony went on to release two features. The first enables you to embed a molecule, which Jean-Claude has demonstrated over on the ONS Challenge Wiki. Essentially, by cutting and pasting a little bit of text from ChemSpider into Wikispaces you get a nicely drawn image of the molecule, and the machinery is in place to enable good machine readability of the displayed page (by embedding chemical identifiers within the code) as well as the aggregation of web-based information about the molecule back at ChemSpider.

The second feature was the one I had asked about: the embedding of spectra. Again this is really useful because it means that as an experimentalist you can host spectra on a service that gets what they are, but you can also incorporate them in a nice way back into your lab book, online report, or whatever it is you are doing. This has already enabled Andy Lang and Jean-Claude to build a very cool game, initially in Second Life but now also on the web. Using the spectral and chemical information from ChemSpider, the player is presented with a spectrum and three molecules; if they select the correct molecule they get some points, and if they get it wrong they lose some. As Tony has pointed out, this is also a way of crowdsourcing the curation process – if the majority of people disagree with the “correct” assignment then maybe the spectrum needs a second look. Chemistry Captchas, anyone?

The other event this week has been the effort by Mitch over at the Chemistry Blog to set up an online resource for named reactions by crowdsourcing contributions and ultimately turning them into a book. Mitch deserves plaudits for this because he’s gone and done something rather than just talked about it, and we need more people like that. Some of us have criticised the details (also see comments at the original post) of how he is going about it, but from my perspective this is definitely criticism motivated by the hope that he will succeed, and by the sense that by making some changes early on there is the chance to get much more out of the contributions that he receives.

In particular, Egon asked whether it would be better to use Wikipedia as the platform for aggregating the named reactions, a point I agree with. The problem that people see with Wikipedia is largely one of image. People are concerned about inaccurate editing, and about the sometimes combative approach of senior editors who are not necessarily expert in the area. Part of the answer is to just get in there and do it – particularly in chemistry there are a range of people working hard to try to get stuff cleaned up. Lots of work has gone into the chemical boxes, and named reactions would be an obvious thing to move on to. Nonetheless it may not work for some people, and to a certain extent, as long as the material that is generated can be aggregated back to Wikipedia, I’m not really fussed.

The bigger concern for us “chemoinformatics jocks” (I can’t help but feel that categorising me as a foo-informatics anything is a little off beam, but never mind (-;) was with the example pages Mitch put up, where there was very little linking of data back to other resources. So there was no way, for instance, to know that this page was even about a specific class of chemicals. The schemes were shown as plain images, making it very hard for any information aggregation service to do anything useful. Essentially the pages didn’t make full use of the power of the web to connect information.

Mitch in turn has taken the criticism in a positive fashion and has thrown down the gauntlet, effectively asking: “well, if you want this marked up, where are the tools to make it easy, and the instructions in plain English to show how to do it?”. He also asks, if named reactions aren’t the best place to start, then what would be a good collaborative project? Fair questions, and I would hope the ChemSpider services start to point in the right direction. Instead of drawing an image of a molecule and pasting it on a web page, use the service and connect to the molecule itself; this connects the data up, and gives you a nice picture for free. It’s not perfect. The ideal situation would be a little chemical drawing palette: you draw the molecule you want, it goes to the data service of choice, finds the correct molecule (meaning the user doesn’t need to know what SMILES, InChIs, or whatever are), and then brings back whatever you want: image, data, vendors, price. This would be a really powerful demonstration of the power of linked data, and it probably could be pulled together from existing services.
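
A rough sketch of the “bring back whatever you want” step, in Python: given a name (which is what a drawing palette would ultimately generate an identifier from), ask a public resolver for machine-readable identifiers. I have used the URL pattern of the NCI/CADD Chemical Identifier Resolver as I remember it; treat the exact paths as an assumption to be checked against the service’s documentation, and note that ChemSpider’s own web services would do a similar job.

    import requests

    RESOLVER = "https://cactus.nci.nih.gov/chemical/structure"

    def resolve(name):
        """Look up a molecule by name and return common identifiers."""
        smiles = requests.get(f"{RESOLVER}/{name}/smiles", timeout=10).text.strip()
        inchi = requests.get(f"{RESOLVER}/{name}/stdinchi", timeout=10).text.strip()
        return {"name": name, "smiles": smiles, "inchi": inchi}

    # The same pattern can return images, names, or other representations,
    # which is what makes "draw it and get everything back" feasible.
    print(resolve("caffeine"))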

But what about the second question? What would be a good project? Well, this is where that second ChemSpider feature comes in. What about flooding ChemSpider with high quality electronic copies of NMR spectra? Not the molecules that your supervisor will kill you for releasing, but all those molecules you’ve published, all those hiding in undergraduate chemistry practicals. Grab the electronic files, let’s find a way of converting them all to JCAMP online, and get ’em up on ChemSpider as Open Data.

What are the services we really want for recording science online?

There is an interesting meta-discussion going on in a variety of places at the moment which touches very strongly on my post and talk (slides, screencast) from last week about “web native” lab notebooks. Over at Depth First, Rich Apodaca has a post with the following little gem of a soundbite:

Could it be that both Open Access and Electronic Laboratory Notebooks are examples of telephone-like capabilities being used to make a better telegraph?

Web-Centric Science offers a set of features orthogonal to those of paper-centric science. Creating the new system in the image of the old one misses the point, and the opportunity, entirely.

Meanwhile a discussion on The Realm of Organic Synthesis blog was sparked off by a post about the need for a Wikipedia-inspired chemistry resource (thanks again to Rich and Tony Williams for pointing out the post and discussion respectively). The initial idea here was something along the lines of:

I envision a hybrid of Doug Taber’s Organic Chemistry Portal, Wikipedia and a condensed version of SciFinder.  I’ll gladly contribute!  How do we get the ball rolling?

This in turn has led to a discussion of whether ChemSpider and ChemMantis partly fill this role already. The key point being made here is the problem of actually finding and aggregating the relevant information. Tony Williams makes the point in the comments that ChemSpider is not about being a central repository in the way that J proposes in the original TROS blog post, but that if there are resources out there they can be aggregated into ChemSpider. There are two problems here: capturing the knowledge in one “place”, and then aggregating it.

Finally, there is an ongoing discussion in the margins of a post at RealClimate. The argument here is over the level of “working” that should be shown when doing analyses of data, something very close to my heart. In this case both the data and the MatLab routines used to process the data have been made available. What I believe is missing is the detailed record, or log, of how those routines were used to process the data. The argument rages over the value of providing this, the amount of work involved, and whether it could actually have a chilling effect on people doing independent validation of the results. In this case there is also the political issue of providing more material for antagonistic climate change sceptics to pore over and look for minor mistakes that they will then trumpet as holes. It will come as no surprise that I think the benefits of making the log available outweigh the problems, but we need the tools to do it. This is beautifully summed up in one comment by Tobias Martin at number 64:

So there are really two main questions: if this [making the full record available – CN] is hard work for the scientist, for heaven’s sake, why is it hard work? (And the corollary: how are you confident that your results are correct?)

And the answer – it is hard work because we still think in terms of a paper notebook paradigm, which isn’t well matched to the data analysis being done within MatLab. When people actually do data analysis using computational systems they very rarely keep a systematic log of the process. It is actually a rather difficult thing to do – even though in principle the system could (and in some cases does) keep that entire record for you.
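
In principle that record is cheap to keep. As a minimal sketch (in Python rather than MatLab, and with invented file conventions), a decorator can log every analysis step, its parameters, and a hash of its input files to an append-only file, so the “how” travels with the data without the analyst doing anything extra:

    import functools
    import hashlib
    import json
    import time

    LOGFILE = "analysis_log.jsonl"

    def logged(step):
        """Record each analysis step (name, arguments, input-file hashes,
        timestamp) before running it."""
        @functools.wraps(step)
        def wrapper(*args, **kwargs):
            entry = {
                "step": step.__name__,
                "args": [repr(a) for a in args],
                "kwargs": {k: repr(v) for k, v in kwargs.items()},
                "time": time.strftime("%Y-%m-%dT%H:%M:%S"),
                # Hash arguments that look like data files so the exact
                # inputs can be identified later.
                "input_hashes": {
                    a: hashlib.sha1(open(a, "rb").read()).hexdigest()
                    for a in args
                    if isinstance(a, str) and a.endswith(".dat")
                },
            }
            with open(LOGFILE, "a") as log:
                log.write(json.dumps(entry) + "\n")
            return step(*args, **kwargs)
        return wrapper

    @logged
    def smooth(datafile, window=5):
        # The actual analysis would go here.
        pass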

My point is that if we completely re-imagine the shape and functionality of the laboratory record, in the way Rich and I, and others, have suggested, and if the tools are built to capture what happens and then provide that to the outside world in a useful form (when the researcher chooses), then not only will this record exist but it will provide the detailed “clickstream” records that Richard Akerman refers to in answer to a Twitter proposal from Michael Barton:

Michael Barton: Website idea: Rank scientific articles by relevance to your research; get uptakes on them via citations and pubmed “related articles”

Richard Akerman: This is a problem of data, not of technology. Amazon has millions of people with a clear clickstream through a website. We’ve got people with PDFs on their desktops.

Exchange “data files” and “MatLab scripts” for PDFs and you have a statement of the same problem that the guys at RealClimate face. Yes, it is there somewhere, but it is a pain in the backside to get it out and give it to someone.

If that “clickstream”, “file use stream”, and “relationship stream” were automatically captured, then we would get closer to the thing that I think many of us are yearning for (and have been for some time): the Amazon recommendation tool for science.
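
What would that tool actually need? Not much more than the streams themselves. Here is a minimal sketch in Python of the “readers who read this also read” step, run over an invented clickstream:

    from collections import defaultdict
    from itertools import combinations

    # Invented clickstream: which papers each researcher opened in a session.
    sessions = [
        ["paperA", "paperB", "paperC"],
        ["paperA", "paperC"],
        ["paperB", "paperD"],
    ]

    # Count how often each pair of papers is read together.
    cooccurrence = defaultdict(int)
    for session in sessions:
        for a, b in combinations(sorted(set(session)), 2):
            cooccurrence[(a, b)] += 1

    def recommend(paper, n=3):
        """Rank other papers by how often they co-occur with this one."""
        scores = defaultdict(int)
        for (a, b), count in cooccurrence.items():
            if paper == a:
                scores[b] += count
            elif paper == b:
                scores[a] += count
        return sorted(scores, key=scores.get, reverse=True)[:n]

    print(recommend("paperA"))  # -> ['paperC', 'paperB']

The hard part, as Richard’s point makes clear, is not the arithmetic but having the streams to feed it.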

Licenses and protocols for Open Science – the debate continues

This is an important discussion that has been going on in disparate places, but at the moment is happening primarily on the Open Science mailing list maintained by the OKF (see here for an archive of the relevant thread). To try to keep things together, and because Yishay Mor asked, I thought I would try to summarize the current state of the debate.

The key aim here is to find a form of practice that will enhance data availability, and protect it into the future.

There is general agreement that there is a need for some sort of declaration associated with making data available. Clarity is important, and the minimum here would be a clear statement of intention. Where there is disagreement is over what form this should take. Rufus Pollock started by giving the reasons why this should be a formal license. Rufus believes that a license provides certainty and clarity in a way that a protocol, statement of principles, or expression of community standards cannot. I, along with Bill Hooker and John Wilbanks [links are to posts on the mailing list], expressed a concern that the use of legal language, and the notion of “ownership” of this by lawyers rather than scientists, would have profoundly negative results. Andy Powell points out that this did not seem to occur either in the Open Source movement or with much of the open content community. But I believe he also hits the nail on the head with the possible reason:

I suppose the difference is that software space was already burdened with heavily protective licences and that the introduction of open licences was perceived as a step in the right direction, at least by those who like that kind of thing.         

Scientific data has a history of being assumed to be in the public domain (see the lack of any license at the PDB or GenBank or most other databases), so there isn’t the same sense of pushing back against an existing strong IP or licensing regime. However, I think there is broad agreement that this protocol or statement would look a lot like a license and would aim to have the legal effect of at least providing clarity over the rights of users to copy, re-purpose, and fork the objects in question.

Michael Nielsen and John Wilbanks have expressed a concern about the potential for license proliferation and incompatibility. Michael cites the example of the Apache, Mozilla, and GPL2 licenses. This feeds into the issue of the acceptability, or desirability, of share-alike provisions, which is an area of significant division. Heather Morrison raises the issue of dealing with commercial entities who may take data and use technical means to effectively take it out of the public domain, citing the takeover of OAIster by OCLC as a potential example.

This is a real area of contention, I think, because some of us (including me) would see this in quite a positive light (data being used effectively in a commercial setting is better than it not being used at all), as long as the data is still both legally and technically in the public domain. Indeed, this is at the core of the power of a public domain declaration. The issue of finding the resources that support the preservation of research objects in the (accessible) public domain is a separate one, but in my view, if we don’t embrace the idea that money can and should be made off data placed in the public domain, then we are going to be in big trouble sooner or later, because the money will simply run out.

On the flip side of the argument is a strong tradition of arguing that viral licensing and share-alike provisions protect the rights and personal investment of individuals and small players against larger commercial entities. Many of the people who support open data belong to this tradition, often for very good historical reasons. I personally don’t disagree with the argument on a logical level, but I think for scientific data we need to provide clear paths for commercial exploitation, because using science to do useful things costs a lot of money. If you want people to invest in using the outputs of publicly funded research, you need to provide them with the certainty that they can legitimately use that data within their current business practice. I think it is also clear that those of us who take this line need to come up with a clear and convincing way of expressing this argument, because it is at the centre of the objection to “protection” via licenses and share-alike provisions.

Finally Yishay brings us back to the main point. Something to keep focussed on:

I may be off the mark, but I would argue that there’s a general principle to consider here. I hold that any data collected by public money should be made freely available to the public, for any use that contributes to the public good. Strikes me as a no-brainer, but of course – we have a long way to go. If we accept this principle, the licensing follows.         

Obviously I don’t agree with the last sentence – I would say that dedication to the public domain follows – but the principle, I think, is something we can agree we are aiming for.

Best practice for data availability – the debate starts…well over there really

The issue of licensing arrangements and best practice for making data available has been brewing for some time but has just recently come to a head. John Wilbanks and Science Commons have a reasonably well established line that they have been developing for some time. Michael Nielsen has a recent blog post and Rufus Pollock, of the Open Knowledge Foundation, has also just synthesised his thoughts in response into a blog essay. I highly recommend reading John’s article on licensing at Nature Precedings, Michael’s blog post, and Rufus’ essay before proceeding. Another important document is the discussion of the license that Victoria Stodden is working to develop. Actually if you’ve read them go and read them again anyway – it will refresh the argument.

To crudely summarize, Rufus makes a cogent argument for the use of explicit licenses applied to collections of data, and feels that share-alike provisions, in licenses or otherwise, do not cause major problems and that the benefit that arises from enforcing re-use outweighs any problems. John’s position is that it is far better for standards to be applied through social pressure (“community norms”) than through licensing arrangements. He also believes that share-alike provisions are bad because they break interoperability between different types of objects and domains. One point that I think is very important, and (I think) is a point of agreement, is that some form of license, or at least a dedication to the public domain, will be crucial to developing best practice. Even if the final outcome of the debate is that everything will go in the public domain, it should be part of best practice to make that explicit.

Broadly speaking I belong to John’s camp, but I don’t want to argue that case with this post. What is important, in my view, is that the debate takes place and that we are clear about what the aims of that debate are. What is it we are trying to achieve in the process of coming to (hopefully) some consensus on what best practice should look like?

It is important to remember that anyone can assert a license (or lack thereof) on any object that they (assert they) own or have rights over. We will never be able to impose a specific protocol on all researchers and all funders. Therefore what we are looking for is not the perfect arrangement but a balance between what is desired, what can be practically achieved, and what is politically feasible. We do need a coherent consensus view that can be presented to research communities and research funders. That is why the debate is important. We also need something that works and is extensible into the future, where it will stand up to the development of new types of research, new types of data, new ways of making that data available, and perhaps new types of researchers altogether.

I think we agree that the minimal aim is to enable, encourage, and protect into the future the ability to re-use and re-purpose the publicly published products of publicly funded research. Arguments about personal or commercial work are much harder and much more subtle. Restricting the argument to publicly funded researchers makes it possible to open a discussion with a defined number of funders who have a public service and public engagement agenda. It also makes the moral arguments much clearer.

In focussing on research that is being made public we short-circuit the contentious issue of timing. The right, or the responsibility, to commercially exploit research outputs, and the limitations this can place on data availability, is a complex and difficult area, and one in which agreement is unlikely any time soon. I would also avoid the word “Open”. This is becoming a badly overloaded term with both political and emotional overtones, positive and negative. Focussing on what should happen after the decision has been made to go public reduces the argument to “what is best practice for making research outputs available”. The question of when to make them available can then be kept separate. The key question for the current debate is not when but how.

So what I believe the debate should be about is the establishment, if possible, of a consensus protocol or standard or license for enabling and ensuring the availability of the research outputs associated with publicly published, publicly funded research. Alongside this is the question of establishing mechanisms for researchers to implement, and be supported in observing, these standards, as well as mechanisms for “enforcement”. These might be trademarks, community standards, or legal or contractual approaches, as well as systems and software to make all of this work, including trackbacks, citation aggregators, and effective data repositories. In addition we need to consider the public relations issue of selling such standards to disparate research funders and research communities.

Perhaps a good starting point would be to pinpoint the issues where there is general agreement and map around those. If we agree some central principles then we can take an empirical approach to the mechanisms. We’re scientists after all aren’t we?

Euan Adie asks for help characterising PLoS comments

Euan Adie has asked for some help with further analysis of the comments made on PLoS ONE articles. He is doing this via crowdsourcing, through a specially written app at appspot, to get people to characterize all the comments in PLoS ONE. Euan is very good at putting these kinds of things together, and again this shows the power of FriendFeed as a way of getting the message out: dividing the job up into bite-sized chunks so people can help even with a little bit of time, providing the right tools, and getting them into the hands of people who care enough to dedicate a little time. If anything counts as Science 2.0 then this must be pretty close.

Third party data repositories – can we/should we trust them?

This is a case of a comment that got so long (and so late) that it probably merited its own post. David Crotty and Paul (Ling-Fung Tang) note some important caveats in comments on my last post about the idea of the “web native” lab notebook. I probably went a bit strong in that post with the idea of pushing content onto outside specialist services, in my effort to explain the logic of the lab notebook as a feed. David notes an important point about any third party service (do read the whole comment at the post):

Wouldn’t such an approach either:
1) require a lab to make a heavy investment in online infrastructure and support personnel, or
2) rely very heavily on outside service providers for access and retention of one’s own data? […]

Any system that is going to see mass acceptance is going to have to give the user a great deal of control, and also provide complete and redundant levels of back-up of all content. If you’ve got data scattered all over a variety of services, and one goes down or out of business, does that mean having to revise all of those other services when/if the files are recovered?

This is a very wide problem that I’ve also seen in the context of the UK web community that supports higher education (see for example Brian Kelly’s risk assessment for use of third party web services). Is it smart, or even safe, to use third party services? The general question divides into two parts: is the service more or less reliable than your own hard drive or locally provided server capacity (technical reliability, or uptime); and how likely is the service to remain viable in the long term (business/social model reliability)? Flickr probably has higher availability than your local institutional IT services, but there is no guarantee that it will still be there tomorrow. This is why data portability is very important. If you can’t get your data out, don’t put it in there in the first place.

In the context of my previous post these data services could be local: they could be provided by the local institution or a funder, or they could even be a hard disk in the lab. People are free to make those choices and to find the balance of reliability, cost, and maintenance that suits them best. My suspicion is that after a degree of consolidation we will start to see institutions offering local data repositories, as well as services on the cloud that can provide more specialised and exciting functionality. Ideally these could all talk to each other so that multiple copies are held across the various services.

David says:

I would worry about putting something as valuable as my own data into the “cloud” […]

I’d rather rely on an internally controlled system and not have to worry about the business model of Flickr or whether Google was going to pull the plug on a tool I regularly use. Perhaps the level to think on is that of a university, or company–could you set up a system for all labs within an institution that’s controlled (and heavily backed up) by that institution? Preferably something standardized to allow interaction between institutions.

Then again, given the experiences I’ve had with university IT departments, this might not be such a good approach after all.

Which I think encapsulates a lot of the debate. I actually have greater faith in Flickr keeping my pictures safe than in my own hard disk. And more faith in both than in institutional repository systems that don’t currently provide good data functionality and that I don’t understand. But I wouldn’t trust any of them in isolation. The best situation is to have everything everywhere, using interchange standards to keep copies in different places: specialised services out on the cloud to provide functionality (not every institution will want to provide a visualisation service for XAFS data); IRs providing backup archival and server space for anything that doesn’t fit elsewhere; and ultimately still probably local hard disks for a lot of the short to medium term storage. My view is that the institution has the responsibility of aggregating, making available, and archiving the work of its staff, but I personally see this role as more harvester than service provider.
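
As a sketch of what “harvester” could mean in practice (Python, with an invented export endpoint and record format standing in for whatever a real service actually offers), an institution could simply pull a verbatim copy of each researcher’s records on a schedule:

    import json
    import pathlib

    import requests

    MIRROR = pathlib.Path("institutional_mirror")
    MIRROR.mkdir(exist_ok=True)

    # Invented for illustration; a real harvester would speak whatever export
    # interface the service provides (OAI-PMH, Atom feeds, a REST API, ...).
    EXPORT_URL = "https://data-service.example.org/api/users/jbloggs/records"

    def harvest(api_key):
        resp = requests.get(EXPORT_URL, params={"key": api_key}, timeout=30)
        resp.raise_for_status()
        for record in resp.json():
            # Keep a verbatim local copy keyed by the record's own identifier,
            # so the institution holds an archive even if the service dies.
            target = MIRROR / f"{record['id']}.json"
            target.write_text(json.dumps(record, indent=2))

    harvest("my-api-key")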

All of which will turn on the question of business models. If the data stores are local, what is the business model for archiving? If they are institutional, how much faith do you have that the institution won’t close them down? And if they are commercial or non-profit third parties, or even directly government-funded services, do the economics make sense in the long term? We need a shift in science funding if we want to archive and manage data in the longer term. And as with any market, some services will rise and some will die. The money has to come from somewhere, and ultimately that will always be the research funders. Until there is a stronger call from them for data preservation, and the resources to back it up, I don’t think we will see much interesting development. Some funders are pushing fairly hard in this direction, so it will be interesting to see what develops. A lot will turn on who has the responsibility for ensuring data availability and sharing. The researcher? The institution? The funder?

In the end you get what you pay for. Always worth remembering that sometimes even things that are free at point of use aren’t worth the price you pay for them.