Open Research Computation: An ordinary journal with extraordinary aims.

I spend a lot of my time arguing that many of the problems in the research community are caused by journals. We have too many, they are an ineffective means of communicating the important bits of research, and as a filter they are inefficient and misleading. Today I am very happy to be publicly launching the call for papers for a new journal. How do I reconcile these two statements?

Computation lies at the heart of all modern research, whether it is the massive scale of LHC data analysis or the use of Excel to graph a small data set. From the hundreds of thousands of web users who contribute to Galaxy Zoo to the solitary chemist reprocessing an NMR spectrum, we rely absolutely on billions of lines of code that we never think to look at. Some of this code is in massive commercial applications used by hundreds of millions of people, well beyond the research community. Some of it is a few lines of shell script or Perl that will only ever be used by the one person who wrote it. At both extremes we rely on the code.

We also rely on the people who write, develop, design, test, and deploy this code. In the context of many research communities the rewards for focusing on software development, for becoming the domain expert, are limited. And the cost, in terms of time and resources, of building software of the highest quality, using the best of modern development techniques, is not repaid in ways that advance a researcher’s career. The bottom line is that researchers need papers to advance, and they need papers in journals that are highly regarded and (say it softly) have respectable impact factors. I don’t like it. Many others don’t like it. But that is the reality on the ground today, and we do younger researchers in particular a disservice if we pretend it is not the case.

Open Research Computation is a journal that seeks to directly address the issues that computational researchers face. It is, at its heart, a conventional peer reviewed journal dedicated to papers that discuss specific pieces of software or services. A few journals now exist in this space that either publish software articles or have a focus on software. Where ORC will differ is in its intense focus on the standards to which software is developed, the reproducibility of the results it generates, and the accessibility of the software to analysis, critique, and re-use.

The submission criteria for ORC Software Articles are stringent. The source code must be available in an appropriate public repository under an OSI-compliant license. Running code, in the form of executables or an instance of a service, must be made available. Documentation of the code will be expected to a very high standard, consistent with best practice in the language and research domain, and it must cover all public methods and classes. Similarly, code testing must be in place, covering by default 100% of the code. Finally, all the claims, use cases, and figures in the paper must have test data associated with them, with examples of both the input data and the expected outputs.
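To make that concrete, here is a minimal, hedged sketch (not an official ORC template) of what a documented public function, with tests that ship the paper’s example input and expected output, might look like in Python:

    def molarity(moles, volume_l):
        """Return the concentration in mol/L of a dissolved sample.

        Public functions carry docstrings; the tests below exercise the
        function with the same example input and expected output that a
        figure in the paper would cite.
        """
        if volume_l <= 0:
            raise ValueError("volume must be positive")
        return moles / volume_l

    def test_molarity():
        # Example data published alongside the paper: input and expected output.
        assert molarity(0.5, 2.0) == 0.25

    def test_molarity_rejects_bad_volume():
        # Exercising the error branch keeps test coverage at 100%.
        try:
            molarity(0.5, 0)
        except ValueError:
            pass
        else:
            assert False, "expected ValueError"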

The primary consideration for publication in ORC is that your code must be capable of being used, re-purposed, understood, and efficiently built on. Your work must be reproducible. In short, we expect the computational work published in ORC to deliver at the level that is expected of experimental research.

In research we build on the work of those who have gone before. Computational research has always had the potential to deliver on these goals to a degree that experimental work will always struggle to match, yet to date it has not reliably delivered on that promise. The aim of ORC is to make this promise a reality by providing a venue where computational development work of the highest quality can be shared and celebrated; a venue that will stand for the highest standards in research computation, and where developers, whether they see themselves more as software engineers or as researchers who code, will be proud to publish descriptions of their work.

These are ambitious goals, and getting the technical details right will be challenging. We have assembled an outstanding editorial board, but we are all human, and we don’t expect to get it all right first time. We will be doing our testing and development out in the open as we develop the journal, and we will welcome comments, ideas, and criticisms sent to editorial@openresearchcomputation.com. If you feel your work doesn’t quite fit the guidelines as I’ve described them above, get in touch and we will work with you to get it there. Our aim, at the end of the day, is to help the research developer build better software and apply better development practice. We can also learn from your experiences, so wider-ranging review and proposal papers are also welcome.

In the end I was persuaded to start yet another journal only because there was an opportunity to do something extraordinary within that framework: an opportunity to make a real difference to the recognition and quality of research computation. In the way it conducts peer review, manages papers, and makes them available, Open Research Computation will be a very ordinary journal. We aim for its impact to be anything but.

Other related posts:

Jan Aerts: Open Research Computation: A new journal from BioMedCentral


Connecting the dots – the well-posed question and code as a liability

Just a brief thought prompted by two, partly related, things streaming past my nose. Firstly, Michael Nielsen discussed the views of Aristotle and Sunstein on collective intelligence. The thing that caught my attention was the idea that deliberation can make group functioning worse, leading to a collective decision that is muddled rather than identifying the best answer presented by members of the community. The exception to this is well-posed questions, where deliberation can help. In science we are familiar with the idea that getting the question right (correct design of experiment, well-organised theory) can be more important than the answer.

The second item was a blog post entitled “Data is good, code is a liability” from Greg Linden, shared by Deepak Singh. Greg discussed a talk given by Peter Norvig which focusses on the idea that it is better to get a good-sized dataset and use very sparing code to get at an answer, rather than attempt to get at the answer de novo via complex code. Quoting from the post:

In one of several examples, Peter put up a slide showing an excerpt for a rule-based spelling corrector. The snippet of code, that was just part of a much larger program, contained a nearly impossible to understand let alone verify set of case and if statements that represented rules for spelling correction in English. He then put up a slide containing a few line Python program for a statistical spelling correction program that, given a large data file of documents, learns the likelihood of seeing words and corrects misspellings to their most likely alternative. This version, he said, not only has the benefit of being simple, but also easily can be used in different languages.
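For concreteness, here is a minimal sketch in the spirit of the corrector Norvig described (this is not his actual code, and the corpus file name is illustrative): count word frequencies in a large text file, then correct a misspelling to the most frequent known word within one edit.

    import re
    from collections import Counter

    # Word frequencies learned from a large plain-text corpus
    # (the file name is illustrative).
    WORDS = Counter(re.findall(r"[a-z]+", open("corpus.txt").read().lower()))

    def edits1(word):
        """All strings one edit (delete, transpose, replace, insert) away."""
        letters = "abcdefghijklmnopqrstuvwxyz"
        splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
        deletes = [L + R[1:] for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
        replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
        inserts = [L + c + R for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def known(words):
        """Keep only the candidates that appear in the corpus."""
        return {w for w in words if w in WORDS}

    def correct(word):
        """Return the most frequent known word, preferring the word itself."""
        candidates = known([word]) or known(edits1(word)) or {word}
        return max(candidates, key=lambda w: WORDS[w])

The data does the work here: the same few lines learn any language for which a corpus can be found, which is exactly the portability Norvig highlighted.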

What struck me was the connection between being able to write a short, readable snippet of code and the “well-posed question”. The dataset provides the collective intelligence. So is it possible to propose the following?

“A well-posed question is one which, given an appropriate dataset, can be answered by easily prepared and comprehensible code”

This could also be turned on its head as “a good programming environment is one in which well-posed questions can be readily converted to programs”. But it also raises an important point about how the structure of datasets relates to the questions you want to ask. The challenge in recording data is to structure it in such a way that the widest possible set of questions can be asked of it. Data models all presuppose the kinds of questions that will be asked, and any sufficiently general data model will be inefficient for most specific types of query.

Rajarshi Guha and Pierre Lindenbaum have been busy preparing different datastores for the solubility data being generated as part of the Open Notebook Science Challenge announced by Jean-Claude Bradley (more on this later). Rajarshi’s form-based input has an SQL backend, while Pierre has been working to extract the information as RDF. The point is not that one approach is better than the other, but that we need both, and possibly many more formats – and ideally we need to interconvert between them on the fly, as the sketch below illustrates. A well-posed question can easily founder on an inappropriately structured dataset (this is actually just a rephrasing of the Saunders Principle). It is by enabling easy conversion between different formats that we might approach a situation where the aphorism I have suggested could become true.
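As a toy illustration of that interconversion (the field names and the record are invented for the example, not Rajarshi’s or Pierre’s actual schemas), the same measurement can round-trip between a flat, SQL-style record and RDF-style subject–predicate–object triples:

    # A flat record, as a form-based SQL table might store it.
    # The compound and value are placeholders, not real challenge data.
    record = {
        "compound": "example-compound",
        "solvent": "methanol",
        "solubility_M": 0.76,
    }

    def to_triples(rec, subject="measurement:1"):
        """Flatten a record into (subject, predicate, object) triples."""
        return [(subject, pred, obj) for pred, obj in rec.items()]

    def from_triples(triples):
        """Rebuild a flat record from triples that share one subject."""
        return {pred: obj for _subj, pred, obj in triples}

    assert from_triples(to_triples(record)) == record

The triple form suits open-ended queries across heterogeneous subjects, while the flat form is far more efficient for the fixed queries a form anticipates, which is the trade-off described above.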