Open Data, Open Source, Open Process: Open Research
There has been a lot of recent discussion about the relative importance of Open Source and Open Data (Friendfeed, Egon Willighagen, Ian Davis). I don’t fancy recapitulating the whole argument but following a discussion on Twitter with Glyn Moody this morning [1, 2, 3, 4, 5, 6, 7, 8] I think there is a way of looking at this with a slightly different perspective. But first a short digression.
I attended a workshop late last year on Open Science run by the Open Knowledge Foundation. I spent a significant part of the time arguing with Rufus Pollock about data licences, an argument that is still going on. One of Rufus’ challenges to me was to commit to working towards using only Open Source software. His argument was that there wasn’t really any excuses any more. Open Office could do the job of MS Office, Python with SciPy was up to the same level as MatLab, and anything specialist needed to be written anyway so should be open source from the off.
I took this to heart and I have tried, I really have tried. I needed a new computer and, although I got a Mac (not really ready for Linux yet), I loaded it up with Open Office, I haven’t yet put my favourite data analysis package on the computer (Igor if you must know), and have been working in Python to try to get some stuff up to speed. But I have to ask whether this is the best use of my time. As is often the case with my arguments this is a return on investment question. I am paid by the taxpayer to do a job. At what point is the extra effort I am putting into learning to use, or in some cases fight with, new tools cost more than the benefit that is gained, by making my outputs freely available?
Sometimes the problems are imposed from outside. I spent a good part of yesterday battling with an appalling, password protected, macroed-to-the-eyeballs Excel document that was the required format for me to fill in a form for an application. The file crashed Open Office and only barely functioned in Mac Excel at all. Yet it was required, in that format, before I could complete the application. Sometimes the software is just not up to scratch. Open Office Writer is fine, but the presentation and spreadsheet modules are, to be honest, a bit ropey compared to the commercial competitors. And with a Mac I now have Keynote which is just so vastly superior that I have now transferred wholesale to that. And sometimes it is just a question of time. Is it really worth me learning Python to do data analysis that I could knock in Igor in a tenth of the time?
In this case the answer is, probably yes. Because it means I can do more with it. There is the potential to build something that logs process the way I want to , the potential to convert it to run as a web service. I could do these things with other OSS projects as well in a way that I can’t with a closed product. And even better because there is a big open community I can ask for help when I run into problems.
It is easy to lose sight of the fact that for most researchers software is a means to an end. For the Open Researcher what is important is the ability to reproduce results, to criticize and to examine. Ideally this would include every step of the process, including the software. But for most issues you don’t need, or even want, to be replicating the work right down to the metal. You wouldn’t after all expect a researcher to be forced to run their software on an open source computer, with an open source chipset. You aren’t necessarily worried what operating system they are running. What you are worried about is whether it is possible read their data files and reproduce their analysis. If I take this just one step further, it doesn’t matter if the analysis is done in MatLab or Excel, as long as the files are readable in Open Office and the analysis is described in sufficient detail that it can be reproduced or re-implemented.
Lets be clear about this: it would be better if the analysis were done in an OSS environment. If you have the option to work in an OSS environment you can also save yourself time and effort in describing the process and others have a much better chances of identifying the sources of problems. It is not good enough to just generate an Excel file, you have to generate an Excel file that is readable by other software (and here I am looking at the increasing number of instrument manufacturers providing software that generates so called Excel files that often aren’t even readable in Excel). In many cases it might be easier to work with OSS so as to make it easier to generate an appropriate file. But there is another important point; if OSS generates a file type that is undocumented or worse, obfuscated, then that is also unacceptable.
Open Data is crucial to Open Research. If we don’t have the data we have nothing to discuss. Open Process is crucial to Open Research. If we don’t understand how something has been produced, or we can’t reproduce it, then it is worthless. Open Source is not necessary, but, if it is done properly, it can come close to being sufficient to satisfy the other two requirements. However it can’t do that without Open Standards supporting it for documenting both file types and the software that uses them.
The point that came out of the conversation with Glyn Moody for me was that it may be more productive to focus on our ability to re-implement rather than to simply replicate. Re-implementability, while an awful word, is closer to what we mean by replication in the experimental world anyway. Open Source is probably the best way to do this in the long term, and in a perfect world the software and support would be there to make this possible, but until we get there, for many researchers, it is a better use of their time, and the taxpayer’s money that pays for that time, to do that line fitting in Excel. And the damage is minimal as long as source data and parameters for the fit are made public. If we push forward on all three fronts, Open Data, Open Process, and Open Source then I think we will get there eventually because it is a more effective way of doing research, but in the meantime, sometimes, in the bigger picture, I think a shortcut should be acceptable.