Hoist by my own petard: How to reduce your impact with restrictive licences


I was greatly honoured to be asked to speak at the symposium held on Monday to recognize Peter Murray-Rust’s contribution to scholarly communication. The lineup was spectacular, the talks insightful and probing, and the discussion serious, but no longer trapped in the naive yes/no debates about openness and machine readability: it had moved on into detail, edge cases, problems, and issues.

For my own talk I wanted to do something different from what I’ve been doing in recent talks. Following the example of Deepak Singh, John Wilbanks, and others I’ve developed what seems to be a pretty effective way of doing an advocacy talk: lots of slides, big images, and few words, going by at a fast rate. Recently I did 118 slides in 20 minutes. The talk for Peter’s symposium required something different, so I eschewed slides and just spoke for 40 minutes, wanting to explore the issues deeply rather than skate over the surface in the way the rapid-fire approach tends to do.

The talk was, I think, reasonably well received and provoked some interesting (and heated) discussion. I’ve put the draft text I was working from up on an Etherpad. However, due to my own stupidity, the talk was neither livestreamed nor recorded. In a discussion leading up to the talk I was asked whether I wanted to put up a pretty picture as a backdrop, and I thought it would be good to put up the licensing slide that I use in all of my talks to show that livestreaming, tweeting, and the like are fine, and to encourage people to do them. The trouble is that I navigated to the Slideshare deck that has that slide and just hit full screen without thinking. What the audience therefore saw was the first slide, which looks like this.

A restrictive talk licence prohibiting live streaming, tweeting, etc.

I simply didn’t notice, as I was looking the other way. The response to this was both instructive and interesting. As soon as the people running the (amazingly effective, given the resources they had) livestream and recording saw the slide, they shut everything down. In a sense this is really positive: it shows that people respect the requests of the speaker by default.

Across the audience people didn’t tweet, and indeed in a couple of cases deleted photographs that they had taken. Again, the respect for the request people thought I was making was solid. Even in an audience full of radicals and open geeks no-one questioned it. I’m slightly gobsmacked, in fact, that no-one shouted at me to ask what the hell I thought I was doing. Some thought I was being ironic, which I have to say would have been too clever by half. But again it shows that, if you ask, people do for the most part respect the request.

Given that the talk was about research impact, and how open approaches will enable it, it is rather ironic that by inadvertently using the wrong slide I probably significantly reduced the impact of the talk. There is no video that I can upload, no opportunity for others to see the talk. Several people whose opinion I value, and who I know were watching online, didn’t get to see it, and the tweetstream that I might have hoped would be full of discussion, disagreement, and alternative perspectives was basically dead. I effectively made my own point, reducing what I’d hoped might kick off a wider discussion to a dead talk that exists only in a static document and the memories of the limited number of people who were in the room.

The message is pretty clear. If you want to reduce the effectiveness and impact of the work you’re doing, if you want to limit the people you can reach, then use restrictive terms. If you want your work to reach people and to maximise the chance it has to make a difference, make it clear and easy for people to understand that they are encouraged to copy, share, and cite your work. Be open. Make a difference.


Some notes on Open Access Week


Open Access Week kicks off for the fourth time tomorrow with events across the globe. I was honoured to be asked to contribute to the SPARC video that will be released tomorrow. The following is a transcription of my notes – not quite what I said, but similar. The video was released at 9:00am US Eastern Time on Monday 18 October.

It has been a great year for Open Access. Open Access publishers are steaming ahead, OA mandates are spreading and growing, and the quality and breadth of repositories is improving across institutions, disciplines, and nations. There have been problems and controversies as well, many involving shady publishers seeking to take advantage of the Open Access brand, but even this, in its way, is a measure of success.

Beyond traditional publication we’ve also seen great strides made in the publication of a wider diversity of research outputs. Open Access to data, to software, and to materials is moving up the agenda. There have been real successes. The Alzheimer’s Disease Network showed what can change when sharing becomes a part of the process. Governments and pharmaceutical companies are releasing data. Publicly funded researchers are falling behind by comparison!

For me, although these big stories are important, and impressive, it is the little wins that matter. The thousands or millions of people who didn’t have to wait to read a paper, who didn’t need to write an email to get a dataset, who didn’t needlessly repeat an experiment known not to work. Every time a few minutes, a few hours, a few weeks, months, or years is saved, we deliver more for the people who pay for this research. These small wins are the hardest to measure, and the hardest to explain, but they make up the bulk of the advantage that open approaches bring.

But perhaps the most important shift this year is something more subtle. Each morning I listen to the radio news, and every now and then there is a science story. These stories are increasingly prefaced with “…the research, published in the journal of…”, and increasingly that journal is Open Access. A long-running excuse for not referring the wider community to the original literature has been its inaccessibility. That excuse is gradually disappearing. But more importantly, there is now a whole range of research outcomes that people, where they are interested, where they care enough to dig deeper, can inform themselves about. Research that people can use to reach their own conclusions about their health, the environment, technology, or society.

I find it difficult to see this as anything but a good thing, but nonetheless we need to recognize that it brings challenges. Challenges of explaining clearly, challenges in presenting the balance of evidence in a useful form, but above all challenges of how to effectively engage those members of the public who are interested in the details of the research. The web has radically changed the expectations of those who seek and interact with information. Broadcast is no longer enough. People expect to be able to talk back.

The last ten years of the Open Access movement have been about making it possible for people to touch, read, and interact with the outputs of research. Perhaps the challenge for the next ten years is to ask how we can create access opportunities to the research itself. This won’t be easy, but then nothing that is worthwhile ever is.

Open Access Week 2010 from SPARC on Vimeo.


Free…as in the British Museum


Richard Stallman and Richard Grant, two people who I wouldn’t ever have expected to group together except based on their first name, have recently published articles that have made me think about what we mean when we talk about “Open” stuff. In many ways this is a return right to the beginning of this blog, which started with a post in which I tried to define my terms as I understood them at the time.

In Stallman’s piece he argues that “open” as in “open source” is misleading because it sounds limiting. It makes it sound as though the only thing that matters is having access to the source code. He dismisses the various careful definitions of open as special pleading: definitions that only the few are aware of, and that will confuse most others when used. He is of course right; no matter how carefully we define “open”, it is such a commonly used word, and so open to interpretation itself, that there will always be ambiguity.

Many efforts have been made in various communities to find new and more precise terms, “gratis” and “libre”, “green” vs “gold”, but these never stick, partly because the word “open” captures the imagination in a way more precise terms do not, and partly because those terms capture the issues that divide us rather than those that unite us.

So Stallman has a point but he then goes on to argue that “free” does not suffer from the same issues because it does capture an important aspect of Free Software. I can’t agree here because it seems clear to me we have exactly the same confusions. “Free as in beer”, “free as in free speech” capture exactly the same types of confusion, and indeed exactly the same kind of issues as all the various subdefinitions of open. But worse than that it implies these things are in fact free, that they don’t actually cost anything to produce.

In Richard Grant’s post he argues against the idea that the Faculty of 1000, a site that provides expert assessment of research papers by a hand-picked group of academics, “should be open access”. His argument is largely pragmatic: running the service costs money, and that money needs to be recovered in some way or there would be no service. Now we can argue that there might be more efficient and cheaper ways of providing the service, but it is never going to be free. The production of the scholarly literature is likewise never going to be free. Archival, storage, people keeping the system running, even just the electricity: these all cost money, and that money has to come from somewhere.

It may surprise overseas readers, but access to many British museums is free to anyone. The British Museum, the National Portrait Gallery, and others are all free to enter. That they are not “free” in terms of cost is obvious. This access is subsidised by the taxpayer. The original collection of the British Museum was in fact donated to the British people, but in taking that collection on, the government was accepting a liability, one that continues to run into millions of pounds a year just to stop the collection from falling apart, let alone enhancing, displaying, or researching it.

The decision to make these museums openly accessible is in part ideological, but it can also be framed as a pragmatic decision. Given the enormous monetary investment there is a large value in subsidising free access to maximise the social benefits that universal access can provide. Charging for access would almost certainly increase income, or at least decrease costs, but there would be significant opportunity cost in terms of social return on investment by barring access.

Those of us who argue for Open Access to the scholarly literature, or for Open Data, Process, Materials, or whatever, need to be careful that we don’t pretend this comes free. We also need to educate ourselves more about the costs. Writing costs money; peer review costs money; editing, formatting, running the web servers, and providing archival services cost money. And it costs money whether it is done by publishers operating subscription or author-pays business models, or by institutional or domain repositories. We can argue for Open Access approaches on economic efficiency grounds, and we can argue for them based on maximizing social return on investment: essentially, that for a small additional investment, over and above the very large existing investment in research, significant potential social benefits will arise.

Open Access scholarly literature is free like the British Museum or a national monument like the Lincoln Memorial is free. We should strive to bring costs down as far as we can. We should defend the added value of investing in providing free access to view and use content. But we should never pretend that those costs don’t exist.


The BMC 10th Anniversary Celebrations and Open Data Prize


Last Thursday night I was privileged to be invited to the 10th anniversary celebrations for BioMed Central and to help announce and give the first BMC Open Data Prize. Peter Murray-Rust has written about the night and the contribution of Vitek Tracz to the Open Access movement. Here I want to focus on the prize we gave, the rationale behind it, and the (difficult!) process we went through to select a winner.

Prizes motivate behaviour in researchers. There is no question that being able to put a prize down on your CV is a useful thing. I have long felt, originally following a suggestion from Jeremiah Faith, that a prize for Open Research would be a valuable motivator and publicity aid to support those who are making an effort. I was very happy therefore to be asked to help judge the prize, supported by Microsoft, to be awarded at the BMC celebration for the paper in a BMC journal that was an outstanding example of Open Data. Iain Hrynaszkiewicz and Matt Cockerill from BMC and Lee Dirks from Microsoft Research, along with myself, Rufus Pollock, John Wilbanks, and Peter Murray-Rust, tried to select a shortlist and a prize winner from a very strong list of contenders.

Early on we decided to focus on papers that made data available rather than on software frameworks or approaches that supported data availability. We really wanted to focus attention on conventional scientists in traditional disciplines who were going beyond the basic requirements. This meant in turn that a whole range of very important contributions from developers, policy experts, and others were left out. Particularly notable examples were “Taxonomic information exchange and copyright: the Plazi approach” and “The FANTOM web resource: from mammalian transcriptional landscape to its dynamic regulation“.

This still left a wide field of papers making significant amounts of data available. To cut the list down at this point we looked at the licences (or lack thereof) under which resources were being made available. Anything that wasn’t broadly speaking “open” was rejected: this included code that wasn’t open source, data that was only available via a login, and data carrying non-commercial terms. None of the data provided was explicitly placed in the public domain, as recommended by Science Commons and the Panton Principles, but a reasonable amount was made available in an accessible form with no restrictions beyond a request for citation. This is an area where we expect best practice to improve, and we see the prize as a way to achieve that. In future, to be considered, any external resource will ideally have to be compliant with all of the Science Commons Protocols, the Open Knowledge Definition, and the Panton Principles. This means an explicit dedication of data to the public domain via PDDL or ccZero.

Much of the data that we looked at was provided in the form of Excel files. This is not ideal, but in terms of accessibility it’s actually not so bad. While many of us might prefer XML, RDF, or at any rate CSV files, the bottom line is that most Excel files can be opened with freely available open source software, which means the data is accessible to anyone. Note that “most”, though. It is very easy to create Excel files that make data very hard to extract. Column headings are crucial (and were missing or difficult to understand in many cases), and merging and formatting cells is an absolute disaster. I don’t want to point to examples, but a plea to those who are trying to make data available: if you must use Excel, just put in column headings and row headings. No merging, no formatting, no graphs. And ideally export it as CSV as well. It isn’t as pretty, but useful data isn’t about being pretty. The figures and tables in your paper are for the human readers; for supplementary data to be useful it needs to be in a form that computers can easily access.
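The plea above can be illustrated with a short sketch in plain Python (standard library only; the file name, column names, and values are purely hypothetical): one clearly labelled header row, one record per row, no merged cells, no formatting, and a CSV export that any spreadsheet, script, or database can read directly.

```python
import csv

# Hypothetical supplementary dataset: a header row followed by plain records.
rows = [
    ["sample_id", "concentration_mM", "absorbance_280nm"],
    ["S1", 0.5, 0.132],
    ["S2", 1.0, 0.261],
    ["S3", 2.0, 0.518],
]

# Export as CSV: no merging, no formatting, no graphs -- just the data.
with open("supplementary_data.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

# Anyone (or any program) can now read it back without special software.
with open("supplementary_data.csv", newline="") as f:
    header, *records = list(csv.reader(f))

print(header)  # ['sample_id', 'concentration_mM', 'absorbance_280nm']
```

The same table pasted into Excel with merged cells and embedded charts would carry identical information for a human reader, but would be far harder for a computer to extract.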

We finally reduced our shortlist to only about ten papers where we felt people had gone above and beyond the average. “Large-scale insertional mutagenesis of a coleopteran stored grain pest, the red flour beetle Tribolium castaneum, identifies embryonic lethal mutations and enhancer traps” received particular plaudits for making not just data but the actual beetles available. “Assessment of methods and analysis of outcomes for comprehensive optimization of nucleofection” and “An Open Access Database of Genome-wide Association Results” were both well received as efforts to make a comprehensive data resource available.

In the end though we were required to pick just one winner. The winning paper got everyone’s attention right from the beginning as it came from an area of science not necessarily known for widespread data publication. It simply provided all of the pieces of information, almost without comment, in the form of clearly set out tables. They are in Excel and there are some issues with formatting and presentation, multiple sheets, inconsistent tabulation. It would have been nice to see more of the analysis code used as well. But what appealed most was that the data were simply provided above and beyond what appeared in the main figures as a natural part of the presentation and that the data were in a form that could be used beyond the specific study. So it was a great pleasure to present the prize to Yoosuk Lee on behalf of the authors of “Ecological and genetic relationships of the Forest-M form among chromosomal and molecular forms of the malaria vector Anopheles gambiae sensu stricto“.

Many challenges remain: making this data discoverable, and improving the licensing and accessibility all round. Given that it is early days, we were impressed by the range of scientists making an effort to make data available. Next year we hope to be much stricter on the requirements, and we also hope to see many more nominations. In a sense, for me, the message of the evening was that the debate on Open Access publishing is over; it’s only a question of where the balance ends up. Our challenge for the future is to move on and solve the problems of making data, process, and materials more available and accessible so as to drive more science.


The Panton Principles: Finding agreement on the public domain for published scientific data

I had the great pleasure and privilege of announcing the launch of the Panton Principles at the Science Commons Symposium – Pacific Northwest on Saturday. The launch of the Panton Principles, many months after they were first suggested, is largely down to the work of Jonathan Gray. This was one of several projects that I haven’t been able to follow through on properly, and I want to acknowledge the effort that Jonathan has put into making it happen. I thought it might be helpful to describe where the Principles came from, what they are intended to do, and, perhaps just as importantly, what they are not.

The Panton Principles aim to articulate a view of what best practice should be with respect to data publication for science. They arose out of an ongoing conversation between myself, Peter Murray-Rust, and Rufus Pollock. Rufus founded the Open Knowledge Foundation, an organisation that seeks to promote and support open culture, open source, and open science, with the emphasis on the open. The OKF position on licences has always been that share-alike provisions are an acceptable limitation on complete freedom to re-use content. I have always taken the Science Commons position that share-alike provisions, particularly on data, have the potential to make it difficult or impossible to get multiple datasets or systems to interoperate. In another post I will explore this disagreement, which really amounts to a different perspective on the balance of the risks and consequences of theft versus things not being used or useful. Peter, in turn, is particularly concerned about the practicalities, really wanting a straightforward set of rules to be baked right into publication mechanisms.

The Principles came out of a discussion in the Panton Arms, a pub near the Chemistry Department of Cambridge University, after I had given a talk in the Unilever Centre for Molecular Informatics. We were having our usual argument, each trying to win the others over, when we turned instead to what we could agree on: what sort of statement could we make that would capture the best parts of all our positions, with a focus on science and data? We focussed further by trying to draw out one specific issue. Not the issue of when people should share results, or the details of how, but the mechanisms that should be used to enable re-use. The Principles are intended to focus on what happens when a decision has been made to publish data, and where we assume that the wish is for that data to be effectively re-used.

Where we found agreement was that for science, and for scientific data, and particularly science funded by public investment, the public domain was the best approach and the one we would all recommend. We brought John Wilbanks in both to bring the views of Creative Commons and to help craft the words. It also made a good excuse to return to the pub. We couldn’t agree on everything – we will never agree on everything – but the form of words chosen – that placing data explicitly, irrevocably, and legally in the public domain satisfies both the Open Knowledge Definition and the Science Commons Principles for Open Data – was something that we could all personally sign up to.

The end result is something that I have no doubt is imperfect. We have borrowed inspiration from the Budapest Declaration, but there are three B’s. Perhaps it will take three P’s to capture all the aspects that we need. I’m certainly up for some meetings in Pisa or Portland, Pittsburgh or Prague (less convinced about Perth but if it works for anyone else it would make my mother happy). For me it captures something that we agree on – a way forwards towards making the best possible practice a common and practical reality. It is something I can sign up to and I hope you will consider doing so as well.

Above all, it is a start.


It wasn’t supposed to be this way…

I’ve avoided writing about the Climate Research Unit emails leak for a number of reasons. Firstly, it is clearly a sensitive subject, with personal ramifications for some and, for many others, simply a very highly charged issue. Probably more importantly, I haven’t had the time or energy to look into the documents myself. I haven’t, as it were, examined the raw data for myself, only other people’s interpretations. So I’ll try to stick to a very general issue here.

There appear to be broadly two responses from the research community to this saga. One is to close ranks and, to a certain extent, say “nothing was done wrong here”. This is, at some level, the tack taken by the Nature editorial of 3 December, which was headed up with “Stolen e-mails have revealed no scientific conspiracy…”. The other response is that the scandal has exposed the shambolic way that we deal with collecting, archiving, and making available both data and analysis in science, as well as the endemic issues around the hoarding of data by those who have collected it.

At one level I belong strongly in the latter camp, but I also appreciate the dismay that must be felt by those who have looked at, and understand, what the emails actually contain, and their complete inability to communicate this into the howling winds of what seems to a large extent to be a media beat-up. I have long felt that the research community would one day be shocked by the public response when, for whatever reason, the media decided to make a story about the appalling data sharing practices of publicly funded academic researchers like myself. If I’d thought about it more deeply I should have realised that this would most likely be around climate data.

Today the Times reports on its front page that the UK Meteorological Office is to review 160 years of climate data and has asked a range of contributing organisations to allow it to make the data public. The details are hazy, but if the Met Office really is going to make the data public, this is a massive shift. I might be expected to be happy about this, but I’m actually profoundly depressed. While it might in the longer term lead to more strongly worded and enforced policies, it will also lead to data sharing being forever associated with “making the public happy”. My hope has always been that the sharing of the research record would come about because people started to see the benefits, because they could see the possibilities in partnership with the wider community, and because it made their research more effective. Not because the tabloids told us we should.

Collecting the best climate data and doing the best possible analysis on it is not an option. If we get this wrong and don’t act effectively then with some probability that is significantly above zero our world ends. The opportunity is there to make this the biggest, most important, and most effective research project ever undertaken. To actively involve the wider community in measurement. To get an army of open source coders to re-write, audit, and re-factor the analysis software. Even to involve the (positively engaged) sceptics, to use their interest and ability to look for holes and issues. Whether politicians will act on data is not the issue that the research community can or should address; what we need to be clear on is that we provide the best data, the best analysis, and an honest view of the uncertainties. Along with the ability of anyone to critically analyse the basis for those conclusions.

There is a clear and obvious problem with this path. One of the very few credible objections to open research that I have come across is that by making material available you open your inbox to a vast community of people who will just waste your time. The people who can’t be bothered to read the background literature or learn to use the tools; the ones who just want the right answer. This is nowhere more the case than it is with climate research and it forms the basis for the most reasonable explanation of why the CRU (and every other repository of climate data as far as I am aware) have not made more data or analysis software directly available.

There are no simple answers here, and my concern is that in a kneejerk response to suddenly make things available no-one will think to put in place the social and technical infrastructure that we need to support positive engagement, and to protect active researchers, both professional and amateur from time-wasters. Interestingly I think this infrastructure might look very similar to that which we need to build to effectively share the research we do, and effectively discover the relevant work of others. Infrastructure is never sexy, particularly in the middle of a crisis. But there is one thing in the practice of research that we forget at our peril. Any given researcher needs to earn the right to be taken seriously. No-one ever earns the right to shut people up. Picking out the objection that happens to be important is something we have to at least attempt to build into our systems.

Open Data, Open Source, Open Process: Open Research

There has been a lot of recent discussion about the relative importance of Open Source and Open Data (Friendfeed, Egon Willighagen, Ian Davis). I don’t fancy recapitulating the whole argument but following a discussion on Twitter with Glyn Moody this morning [1, 2, 3, 4, 5, 6, 7, 8] I think there is a way of looking at this with a slightly different perspective. But first a short digression.

I attended a workshop late last year on Open Science run by the Open Knowledge Foundation. I spent a significant part of the time arguing with Rufus Pollock about data licences, an argument that is still going on. One of Rufus’ challenges to me was to commit to working towards using only Open Source software. His argument was that there weren’t really any excuses any more: Open Office could do the job of MS Office, Python with SciPy was up to the same level as MatLab, and anything specialist needed to be written anyway, so it should be open source from the off.

I took this to heart and I have tried, I really have tried. I needed a new computer and, although I got a Mac (I’m not really ready for Linux yet), I loaded it up with Open Office, I haven’t yet put my favourite data analysis package on the computer (Igor, if you must know), and I have been working in Python to try to get some stuff up to speed. But I have to ask whether this is the best use of my time. As is often the case with my arguments, this is a return on investment question. I am paid by the taxpayer to do a job. At what point does the extra effort I am putting into learning to use, or in some cases fight with, new tools cost more than the benefit that is gained by making my outputs freely available?

Sometimes the problems are imposed from outside. I spent a good part of yesterday battling with an appalling, password protected, macroed-to-the-eyeballs Excel document that was the required format for me to fill in a form for an application. The file crashed Open Office and only barely functioned in Mac Excel at all. Yet it was required, in that format, before I could complete the application. Sometimes the software is just not up to scratch. Open Office Writer is fine, but the presentation and spreadsheet modules are, to be honest, a bit ropey compared to the commercial competitors. And with a Mac I now have Keynote, which is just so vastly superior that I have transferred to it wholesale. And sometimes it is just a question of time. Is it really worth me learning Python to do data analysis that I could knock out in Igor in a tenth of the time?

In this case the answer is probably yes, because it means I can do more with it. There is the potential to build something that logs the process the way I want it to, and the potential to convert it to run as a web service. I could do these things with other OSS projects as well, in a way that I can’t with a closed product. And even better, because there is a big open community, I can ask for help when I run into problems.

It is easy to lose sight of the fact that for most researchers software is a means to an end. For the Open Researcher what is important is the ability to reproduce results, to criticize, and to examine. Ideally this would include every step of the process, including the software. But for most issues you don’t need, or even want, to be replicating the work right down to the metal. You wouldn’t, after all, expect a researcher to be forced to run their software on an open source computer with an open source chipset. You aren’t necessarily worried about what operating system they are running. What you are worried about is whether it is possible to read their data files and reproduce their analysis. If I take this just one step further, it doesn’t matter if the analysis is done in MatLab or Excel, as long as the files are readable in Open Office and the analysis is described in sufficient detail that it can be reproduced or re-implemented.

Let’s be clear about this: it would be better if the analysis were done in an OSS environment. If you work in an OSS environment you can also save yourself time and effort in describing the process, and others have a much better chance of identifying the sources of problems. It is not good enough to just generate an Excel file; you have to generate an Excel file that is readable by other software (and here I am looking at the increasing number of instrument manufacturers providing software that generates so-called Excel files that often aren’t even readable in Excel). In many cases it might be easier to work with OSS so as to make it easier to generate an appropriate file. But there is another important point: if OSS generates a file type that is undocumented or, worse, obfuscated, then that is also unacceptable.

Open Data is crucial to Open Research. If we don’t have the data we have nothing to discuss. Open Process is crucial to Open Research. If we don’t understand how something has been produced, or we can’t reproduce it, then it is worthless. Open Source is not necessary, but, if it is done properly, it can come close to being sufficient to satisfy the other two requirements. However it can’t do that without Open Standards supporting it for documenting both file types and the software that uses them.

The point that came out of the conversation with Glyn Moody for me was that it may be more productive to focus on our ability to re-implement rather than to simply replicate. Re-implementability, while an awful word, is closer to what we mean by replication in the experimental world anyway. Open Source is probably the best way to achieve this in the long term, and in a perfect world the software and support would be there to make it possible, but until we get there, for many researchers, it is a better use of their time, and of the taxpayer’s money that pays for that time, to do that line fitting in Excel. And the damage is minimal as long as the source data and the parameters for the fit are made public. If we push forward on all three fronts, Open Data, Open Process, and Open Source, then I think we will get there eventually, because it is a more effective way of doing research; but in the meantime, sometimes, in the bigger picture, I think a shortcut should be acceptable.
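The “minimal damage” scenario can be sketched in code: given published source data and the reported fit parameters (all values here are invented), anyone can re-implement the line fit in an open environment and check it against the reported numbers, without ever needing the original Excel file:

```python
# A sketch of re-implementing a published line fit from open data.
# The data values and "reported" parameters are invented for illustration.

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    c = mean_y - m * mean_x
    return m, c

# "Published" source data and the fit parameters reported alongside it.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.1, 2.9, 5.2, 6.8, 9.1]
reported_m, reported_c = 2.0, 1.0

m, c = fit_line(xs, ys)
# The re-implementation should agree with the reported values to within
# a sensible tolerance; if it doesn't, there is something to discuss.
assert abs(m - reported_m) < 0.1 and abs(c - reported_c) < 0.2
```

This is the sense in which the Excel shortcut is recoverable: the analysis is trivial to re-implement once the inputs and the claimed outputs are public.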

The growth of linked up data in chemistry – and good community projects

It’s been an interesting week or so in the Chemistry online world. Following on from my musings about data services and the preparation I was doing for a talk the week before last I asked Tony Williams whether it was possible to embed spectra from ChemSpider on a generic web page in the same way that you would embed a YouTube video, Flickr picture, or Slideshare presentation. The idea is that if there are services out on the cloud that make it easier to put some rich material in your own online presence by hosting it somewhere that understands about your data type, then we have a chance of pulling all of these disparate online presences together.

Tony went on to release two features. The first enables you to embed a molecule, which Jean-Claude has demonstrated over on the ONS Challenge Wiki. Essentially, by cutting and pasting a little bit of text from ChemSpider into Wikispaces you get a nicely drawn image of the molecule, and the machinery is in place to enable good machine readability of the displayed page (by embedding chemical identifiers within the code) as well as enabling the aggregation of web-based information about the molecule back at ChemSpider.

The second feature was the one I had asked about: the embedding of spectra. Again this is really useful because it means that as an experimentalist you can host spectra on a service that gets what they are, but you can also incorporate them in a nice way back into your lab book, online report, or whatever it is you are doing. This has already enabled Andy Lang and Jean-Claude to build a very cool game, initially in Second Life but now also on the web. Using the spectral and chemical information from ChemSpider, the player is presented with a spectrum and three molecules; if they select the correct molecule they get some points, and if they get it wrong they lose some. As Tony has pointed out, this is also a way of crowdsourcing the curation process – if the majority of players disagree with the “correct” assignment then maybe the spectrum needs a second look. Chemistry CAPTCHAs, anyone?

The other event this week has been the effort by Mitch over at the Chemistry Blog to set up an online resource for named reactions by crowdsourcing contributions and ultimately turning it into a book. Mitch deserves plaudits for this because he has gone and done something rather than just talked about it, and we need more people like that. Some of us have criticised the details of how he is going about it (also see comments at the original post), but from my perspective this is definitely criticism motivated by the hope that he will succeed, and that by making some changes early on there is the chance to get much more out of the contributions that he receives.

In particular, Egon asked whether it would be better to use Wikipedia as the platform for aggregating the named reactions, a point I agree with. The problem people see with Wikipedia is largely one of image. People are concerned about inaccurate editing, and about the sometimes combative approach of senior editors who are not necessarily expert in the area. Part of the answer is to just get in there and do it – particularly in chemistry there is a range of people working hard to get stuff cleaned up. Lots of work has gone into the Chemical boxes, and named reactions would be an obvious thing to move on to. Nonetheless it may not work for some people, and to a certain extent, as long as the material that is generated can be aggregated back to Wikipedia, I’m not really fussed.

The bigger concern for us “chemoinformatics jocks” (I can’t help but feel that categorising me as a foo-informatics anything is a little off beam, but never mind (-;) was with the example pages Mitch put up, where there was very little linking of data back to other resources. There was no way, for instance, to know that this page was even about a specific class of chemicals. The schemes were shown as plain images, making it very hard for any information aggregation service to do anything useful. Essentially the pages didn’t make full use of the power of the web to connect information.

Mitch in turn has taken the criticism in a positive fashion and has thrown down the gauntlet, effectively asking: “well, if you want this marked up, where are the tools to make it easy, and the instructions in plain English to show how to do it?”. He also asks, if named reactions aren’t the best place to start, then what would be a good collaborative project? Fair questions, and I would hope the ChemSpider services start to point in the right direction. Instead of drawing an image of a molecule and pasting it on a web page, use the service and connect to the molecule itself; this connects the data up, and gives you a nice picture for free. It’s not perfect. The ideal situation would be a little chemical drawing palette: you draw the molecule you want, it goes to the data service of your choice, finds the correct molecule (meaning the user doesn’t need to know what SMILES, InChIs, or whatever are), and then brings back whatever you want – image, data, vendors, price. This would be a really powerful demonstration of the power of linked data, and it probably could be pulled together from existing services.
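To make the idea concrete, here is a toy sketch of that “draw, resolve, embed” flow. The resolver table and the image URL pattern are invented stand-ins for a real chemical data service; nothing here describes a real API, and no network calls are made:

```python
# Toy sketch of the "draw -> resolve -> embed" flow. The RESOLVER table
# and the image URL pattern are hypothetical stand-ins for a real
# chemical data service; nothing here is a real API.

RESOLVER = {
    # SMILES string -> service record (invented identifiers)
    "CCO": {"name": "ethanol", "id": "682"},
    "c1ccccc1": {"name": "benzene", "id": "236"},
}

def embed_molecule(smiles):
    """Return an embeddable HTML snippet for a molecule looked up by structure."""
    record = RESOLVER.get(smiles)
    if record is None:
        raise KeyError(f"structure not found: {smiles}")
    # The user never has to see the identifier: the service supplies both
    # the machine-readable ID (kept in the markup for aggregators) and a
    # human-friendly image.
    img_url = f"https://data-service.example/images/{record['id']}.png"
    return (f'<img src="{img_url}" alt="{record["name"]}" '
            f'data-molecule-id="{record["id"]}">')

print(embed_molecule("CCO"))
```

The important property is that the picture and the machine-readable identifier travel together, so an aggregation service reading the page can link it back to the molecule itself.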

But what about the second question? What would be a good project? Well, this is where that second ChemSpider feature comes in. What about flooding ChemSpider with high-quality electronic copies of NMR spectra? Not the molecules that your supervisor will kill you for releasing, but all those molecules you’ve published, and all those hiding in undergraduate chemistry practicals. Grab the electronic files, let’s find a way of converting them all to JCAMP online, and get ’em up on ChemSpider as Open Data.
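For anyone wondering what the conversion target looks like, JCAMP-DX is just labelled plain text, which is exactly why it suits open exchange. Here is a skeletal sketch of a writer; a real converter would handle compressed data forms and far richer metadata, and the spectrum values below are invented:

```python
# Skeletal sketch of the JCAMP-DX shape: "##LABEL=value" records followed
# by the data table. Real converters handle compressed (ASDF) data forms
# and much richer metadata; the spectrum values below are invented.

def to_jcamp(title, x_units, y_units, points):
    """Serialise (x, y) pairs as a minimal uncompressed JCAMP-DX file."""
    lines = [
        f"##TITLE={title}",
        "##JCAMP-DX=4.24",
        "##DATA TYPE=NMR SPECTRUM",
        f"##XUNITS={x_units}",
        f"##YUNITS={y_units}",
        f"##NPOINTS={len(points)}",
        "##XYPOINTS=(XY..XY)",
    ]
    lines += [f"{x}, {y}" for x, y in points]
    lines.append("##END=")
    return "\n".join(lines)

spectrum = to_jcamp("demo spectrum", "HZ", "ARBITRARY UNITS",
                    [(0.0, 1.2), (0.5, 3.4), (1.0, 2.1)])
print(spectrum.splitlines()[0])  # ##TITLE=demo spectrum
```

Because the format is line-oriented text, batch conversion and web-based tooling of the kind suggested above are entirely feasible.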

The people you meet on the train…

Yesterday on the train I had a most remarkable experience of synchronicity. I had been at the RIN workshop on the costs of scholarly publishing (more on that later) in London and was heading off to Oxford for a group dinner. On the train I was looking for a seat with a desk and took one up opposite a guy with a slightly battered-looking Mac laptop. As I pulled out my new MacBook (13”, 2.4 GHz, 4 GB memory, since you ask) he leaned across to have a good look, as you do, and we struck up a conversation. He asked what I did and I talked a little about being a scientist and my role at work. He was a consultant who worked on systems integration.
At some stage he made a throwaway comment about the fact that he had been going back to learn, or re-learn, some fairly advanced statistics, and that he had had a lot of trouble getting access to some academic papers; he certainly didn’t want to pay for them, but had managed to find free versions of what he wanted online. I managed to keep my mouth somewhat shut at this point, except to say I had been at a workshop looking at these issues. However, it gets better, much better. He was looking into quantitative risk issues, and this led into a discussion about how science, and particularly medicine, reporting in the media doesn’t provide links back to the original research (which is generally not accessible anyway) and how, what is worse, the original data is usually not available (and this was all unprompted by me, honestly!). To paraphrase his comment: “the trouble with science is that I can’t get at the numbers behind the headlines; what is the sample size, how was the trial run…” Well, at this point all thought of getting any work done went out the window and we had a great discussion about data availability and the challenges of recording it in the right form (his systems integration work includes efforts to deal with mining of large, badly organised data sets), drifted into identity management and trust networks, and had a great deal of fun.
What do I take from this? That there is a demand for this kind of information and data from an educated and knowledgeable public. One of the questions he asked was whether, as a scientist, I ever see much in the way of demand from the public. My response was that, aside from pushing the case for taxpayer access to taxpayer-funded research myself, I hadn’t seen much evidence of real demand. His argument was that there is a huge nascent demand from people who haven’t yet thought about their need to get into the detail of news stories that affect them. People want the detail; they just have no idea how to go about getting it. Spread the idea that access to that detail is a right, and we will see the demand for access to the outputs of research grow rapidly. The idea that “no-one out there is interested or competent to understand the details” is simply not true. The more respect we have for the people who fund our research, the better, frankly.

A personal view of open science – Part III – Social issues

The third installment of the paper (first part, second part) where I discuss social issues around practicing more Open Science.

Scientists are inherently rather conservative in their adoption of new approaches and tools. A conservative approach has served the community well in the process of sifting ideas and claims; it is well summarised by the aphorism ‘extraordinary claims require extraordinary evidence’. New methodologies and tools often struggle to be accepted until the evidence of their superiority is overwhelming. It is therefore unreasonable to expect the rapid adoption of new web-based tools, and even more unreasonable to expect scientists to change their overall approach to their research en masse. The experience of adoption of new Open Access journals is a good example of this.

Recent studies have shown that scientists are, in principle, in favour of publishing in Open Access journals, yet show a marked reluctance to publish in such journals in practice [1]. The most obvious reason for this is the perceived cost. Because most Open Access publishers currently operating charge a publication fee, and until recently such charges were not allowable costs for many research funders, it can be challenging for researchers to obtain the necessary funds. Although most OA publishers will waive these charges, there is, anecdotally, a marked reluctance to ask for a waiver. Other reasons for not submitting papers to OA journals include the perception that most OA journals are low impact, and a lack of OA journals in specific fields. Finally, simple inertia can be a factor where the traditional publication outlets for a specific field are well defined and publishing outside the set of ‘standard’ journals runs the risk of the work simply not being seen by peers. As there is no perception of a reward for publishing in Open Access journals, and a perception of significant risk, uptake remains relatively small.

Making data available faces similar challenges, but here they are more profound. At least when publishing in an Open Access journal the output can be counted as a paper. Because there is no culture of citing primary data, but rather of citing the papers in which it is reported, there is no reward for making data available. If careers are measured in papers published, then making data available does not contribute to career development. Data availability to date has generally been driven by strong community norms, usually backed up by journal submission requirements. Again, this links data publication to paper publication without necessarily encouraging the release of data that is not explicitly linked to a peer-reviewed paper. The large-scale DNA sequencing and astronomy facilities stand out as cases where data is automatically made available as it is taken. In both cases this policy is driven largely by the funders, or facility providers, who are in a position to make release a condition of funding the data collection. This is not, however, a policy that has been adopted by other facilities such as synchrotrons, neutron sources, or high-power photon sources.

In other fields, where data is more heterogeneous and particularly where competition to publish is fierce, the idea of data availability raises many fears. The primary one is of being ‘scooped’, or data theft, where others publish a paper before the data collector has had the chance to fully analyse the data. This is partly answered by robust data citation standards, but these do not prevent another group publishing an analysis more quickly, potentially damaging the career or graduation prospects of the data collector. A principle of ‘first right to publish’ is often suggested. Other approaches include timed embargoes on re-use or release. All of these have advantages and disadvantages which depend to a large extent on how well behaved the members of a specific field are. Another significant concern is that the release of substandard, non-peer-reviewed, or simply inaccurate data into the public domain will lead to further problems of media hype and public misunderstanding. This must be balanced against the potential public good of having relevant research data available.

The community, or more accurately communities, are in general waiting for evidence of benefits before adopting either Open Access publication or open data policies. This actually provides the opportunity for individuals and groups to take a first-mover advantage. While it remains controversial [3, 4], there is some evidence that publication in Open Access journals leads to higher citation counts for papers [5, 6], and that papers for which the supporting data is available receive more citations [7]. This advantage is likely to be at its greatest early in the adoption curve and will clearly disappear if these approaches become widespread. There are therefore clear advantages to be had in rapidly adopting more open approaches to research, which can be balanced against the risks described above.

Measuring success in the application of open approaches, and particularly quantifying success relative to traditional approaches, is a challenge, as is demonstrated by the continuing controversy over the citation advantage of Open Access articles. Pointing to examples of success, however, is relatively straightforward. In fact Open Science has a clear public relations advantage, as the examples are out in the open for anyone to see. This exposure can be both good and bad, but it makes publicising best practice easy. In many ways the biggest successes of open practice are the ones we miss because they are right in front of us: the global collections of freely accessible data in biological databases such as the Protein Data Bank, NCBI, and many others that have driven the massive advances in the biological sciences over the past 20 years. The ability to analyse, and consider the implications of, genome-scale DNA sequence data as it is being generated is now taken for granted.

In the physical sciences, the arXiv has long stood as an example to other disciplines of how the research literature can be made available in an effective and rapid manner, and the availability of astronomical data from efforts such as the Sloan Digital Sky Survey makes projects that combine public outreach with the crowdsourcing of data analysis, such as Galaxy Zoo, possible. There is likely to be a massive expansion in the availability of environmental and ecological data globally as the potential to combine millions of data gatherers holding mobile phones with sophisticated data aggregation and manipulation tools is realised.

Closer to the bleeding edge of radical sharing there have been fewer high-profile successes, a reflection both of the limited time these approaches have been pursued and of the limited financial and personnel resources that have been available. Nonetheless there are examples. Garrett Lisi’s high-profile preprint on the arXiv, An exceptionally simple theory of everything [8], is supported by a comprehensive online notebook at http://deferentialgeometry.org that contains all the arguments as well as the background detail and definitions that support the paper. The announcement by Jean-Claude Bradley of the successful identification of several compounds with activity against malaria [9] is an example where the whole research process was carried out in the open: from the decision on what the research target should be, through the design and in silico testing of a library of chemicals, to the synthesis and testing of those compounds. For every step of this process the data is available online, and several of the collaborators who made the study possible made contact after finding that material online. The potential for a coordinated global synthesis and screening effort is currently being investigated.

There are both benefits and risks associated with open practice in research, and often the discussion with researchers focusses on the disadvantages and risks. In an inherently conservative pursuit it is perfectly valid to ask whether changes of this type and magnitude offer any benefits given the potential risks they pose. These are not concerns that should be dismissed or ridiculed, but ones that should be taken seriously and considered. Radical change never comes without casualties, and while some concerns may be misplaced, or overblown, there are many with real potential consequences. In a competitive field people will necessarily make diverse decisions about the best way forward for them. What is important is providing them with the best information possible to help them balance the risks and benefits of whatever approach they choose to take.

The fourth and final part of this paper can be found here.

  1. Warlick S E, Vaughan K T. Factors influencing publication choice: why faculty choose open access. Biomedical Digital Libraries. 2007;4:1-12.
  2. Bentley D R. Genomic Sequence Information Should Be Released Immediately and Freely. Science. 1996;274(October):533-534.
  3. Piwowar H A, Day R S, Fridsma D B. Sharing Detailed Research Data Is Associated with Increased Citation Rate. PLoS ONE. 2007;1(3):e308.
  4. Davis P M, Lewenstein B V, Simon D H, Booth J G, Connolly M J. Open access publishing, article downloads, and citations: randomised controlled trial. BMJ. 2008;337(October):a568.
  5. Rapid responses to Davis et al., http://www.bmj.com/cgi/eletters/337/jul31_1/a568
  6. Eysenbach G. Citation Advantage of Open Access Articles. PLoS Biology. 2006;4(5):e157.
  7. Hajjem C, Harnad S, Gingras Y. Ten-Year Cross-Disciplinary Comparison of the Growth of Open Access and How it Increases Research Citation Impact. IEEE Data Engineering Bulletin. 2005;28(4):39-47. http://eprints.ecs.soton.ac.uk/12906/
  8. Lisi G, An exceptionally simple theory of everything, arXiv:0711.0770v1 [hep-th], November 2007.
  9. Bradley J C, We have antimalarial activity!, UsefulChem Blog, http://usefulchem.blogspot.com/2008/01/we-have-anti-malarial-activity.html, January 25 2008.