The software code that is written to support and manage research sits at a critical intersection of our developing practice of shared, reproducible, and re-usable research in the 21st century. Code is amongst the easiest things to usefully share, being made up of easily transferable bits and bytes while also, critically, carrying its context with it in a way that digital data doesn't do. Code at its best is highly reproducible: it comes with the tools to determine what is required to run it (makefiles, documentation of dependencies) and when run should (ideally) generate the same results from the same data. Where there is a risk that it might not, good code will provide tests of one sort or another that you can run to make sure that things are ok before proceeding. Testing, along with good documentation, is what ensures that code is re-usable, that others can take it and efficiently build on it to create new tools, and new research.
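To make that concrete, here is a minimal sketch of the kind of check I mean, written as a pytest test (the module and function names are invented for illustration; a real project would have its own):

    # test_analysis.py -- a hypothetical example of a check you can run before proceeding:
    # a fixed, known input should always produce the same, known output.
    import pytest
    from analysis import normalise_counts  # assumed module under test

    def test_normalise_counts():
        result = normalise_counts([2, 3, 5])
        # The normalised values should match the known answer and sum to one.
        assert result == pytest.approx([0.2, 0.3, 0.5])
        assert sum(result) == pytest.approx(1.0)

Run under pytest this either passes quietly or tells the next user exactly where the code no longer behaves as documented, which is the point.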
The outside perspective, as I have written before, is that software does all of this better than experimental research. The truth is that there are frameworks which make it possible for software to do a very good job on these things, but doing that good job takes work; work that is generally not done. Most software for research is not shared, is not well documented, generates results that are not easily reproducible, and does not support re-use and repurposing through testing and documentation. Indeed, much like most experimental research. So how do we realise the potential of software to act as an exemplar for the rest of our research practice?
Nick Barnes of the Climate Code Foundation developed the Science Code Manifesto, a statement of how things ought to be (I was very happy to contribute and be a founding signatory) and while for many this may not go far enough (it doesn't explicitly require open source licensing) it is intended as a practical set of steps that might be adopted by communities today. It has already garnered hundreds of endorsers and I'd encourage you to sign up if you want to show your support. The Science Code Manifesto builds on many years of work by Victoria Stodden in identifying the key issues and bringing them to wider awareness with both researchers and funders, as well as the work of John Cook, Jon Claerbout, and Patrick Vandewalle at ReproducibleResearch.net.
If the manifesto and the others' work are actions that aim (broadly) to set out the principles and to understand where we need to go, then Open Research Computation is intended as a practical step embedded in today's practice. Researchers need the credit provided by conventional papers, so if we can link papers in a journal that garners significant prestige with high standards in the design and application of the software that is described, we can link the existing incentives to our desired practice. This is a high wire act. How far do we push those standards out in front of where most of the community is? We explicitly want ORC to be a high profile journal featuring high quality software, for acceptance to be a mark of quality that the community will respect. At the same time we can't ask for the impossible. If we set standards so high that no-one can meet them then we won't have any papers. And with no papers we can't start the process of changing practice. Equally, allow too much in and we won't create a journal with a buzz about it. That quality mark has to be respected as meaning something by the community.
I’ll be blunt. We haven’t had the number of submissions I’d hoped for. Lots of support, lots of enquiries, but relatively few of them turning into actual submissions. The submissions we do have I’m very happy with. When we launched the call for submissions I took a pretty hard line on the issue of testing. I said that, as a default, we’d expect 100% test coverage. In retrospect that sent a message that many people felt they couldn’t deliver on. Now what I meant by that was that when testing fell below that standard (as it would in almost all cases) there would need to be an explanation of what the strategy for testing was, how it was tackled, and how it could support people re-using the code. The language in the author submission guidelines has been softened a bit to try and make that clearer.
What I've been doing in practice is asking reviewers and editors to comment on how the testing framework provided can support others re-using the code. Are the tests provided adequate to help someone take the code, make sure they've got it working, and then, as they build on it, give them confidence they haven't broken anything? For me this is the critical question: does the testing and documentation make the code re-usable by others, either directly in its current form or as they build on it? Along the way we've been asking whether submissions provide documentation and testing consistent with best practice. But that always raises the question of what best practice is. Am I asking the right questions? And where should we ultimately set that bar?
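The mechanical part of that question is at least easy to measure. As a sketch (driving the coverage.py and pytest libraries from a short script; the package name is a placeholder), something like this reports which lines the test suite never touches:

    # run_coverage.py -- hypothetical sketch: measure how much of a package the
    # test suite actually exercises, as raw material for describing a testing strategy.
    import coverage
    import pytest

    cov = coverage.Coverage(source=["mypackage"])  # "mypackage" is a placeholder
    cov.start()
    pytest.main(["tests/"])        # run the project's test suite
    cov.stop()
    cov.save()
    cov.report(show_missing=True)  # lines that were never executed are the gaps to explain

The number that comes out matters much less than the explanation that goes with it; that explanation of strategy is what we are actually asking reviewers to look for.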
Changing practice is tough, getting the balance right is hard. But the key question for me is how do we set that balance right? And how do we turn the aims of ORC, to act as a lever to change the way that research is done, into practice?
The Research Data Management movement is moving on apace. Tools are working and adoption is growing. Policy development is starting to back up the use of those tools and there are some big ambitious goals set out for the next few years. But has the RDM movement taken the vision of data intensive research to its heart? Does the collection, sharing, and analysis of data about research data management meet our own standards? And is policy development based on and assessed against that data? Can we be credible if it is not?
Watching the discussion on research data management over the past few years has been an exciting experience. The tools, which have been possible for some years, now show real promise as the somewhat rough and ready products of initial development are used and tested.
Practice is gradually changing, if unevenly across different disciplines, but there is a growing awareness of data and that it might be considered important. And all of this is being driven increasingly by the development of policies on data availability, data management, and data archiving that stress the importance of data as a core output of public research.
The vision of the potential of a data rich research environment is what is driving this change. It is not important whether individual researchers, or even whole communities, get how fundamental a change the capacity to share and re-use data really is. The change is driven by two forces fundamentally external to the community.
The first is political, the top down view from government that publicly funded research needs to gain from the benefits they see in data rich commerce. A handful of people really understand how data works at these scales but these people have the ear of government.
The second force is one of competition. In the short term adopting new practices, developing new ways of doing research, is a risk. In the longer term, those who adopt more effective and efficient approaches will simply out-compete those who do not or can not. This is already starting to happen in disciplines rich in shared data, and the signs are there that other disciplines are approaching a tipping point.
Data intensive research enables new types of questions to be asked, and it allows us to answer questions that were previously difficult or impossible to get reliable answers to. Questions about weak effects, small correlations, and complex interactions. The kind of questions that bedevil strategic decision-making and evidence based policy.
So naturally you’d expect that the policy development in this area, being driven by people excited by the vision of data intensive research, would have deeply embedded data gathering, model building, and analysis of how research data is being collected, made available, and re-used.
I don’t mean opinion surveys, or dipstick tests, or case studies. These are important but they’re not the way data intensive research works. They don’t scale, they don’t integrate, and they can’t provide the insight into the weak effects in complex systems that are needed to support decision making about policy.
Data intensive research is about tracking everything, logging every interaction, going through download logs, finding every mention of a specific thing wherever on the web it might be.
It's about capturing large amounts of weakly structured data and figuring out how to structure it in a way that supports answering the question of interest. And it's about letting the data guide you to the answers it suggests, rather than looking within it for what we "know" should be in there.
What I don't see when I look at RDM policy development is the detailed analysis of download logs, the usage data, the click-throughs on websites. Where are the analyses of the IP ranges of users, the automated reporting systems, and above all, when new policy directions are set, where is the guidance on data collection and on assessing performance against those policies?
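To be concrete about the level of analysis I mean, the starting point is no more sophisticated than this kind of sketch: pulling download counts and unique client IPs per dataset out of a repository's web server access log (the log format and paths here are invented; any real repository will differ):

    # count_downloads.py -- hypothetical sketch: summarise a web server access log
    # into downloads and unique client IPs per dataset, the raw material for asking
    # whether a data policy is actually changing behaviour.
    import re
    from collections import defaultdict

    # Assumed "combined"-style log line: client IP ... "GET /datasets/<id>/download ..."
    LINE = re.compile(r'^(\S+) \S+ \S+ \[.*?\] "GET /datasets/([^/ ]+)/download')

    downloads = defaultdict(int)
    unique_ips = defaultdict(set)

    with open("access.log") as log:
        for line in log:
            match = LINE.match(line)
            if match:
                ip, dataset = match.groups()
                downloads[dataset] += 1
                unique_ips[dataset].add(ip)

    for dataset in sorted(downloads, key=downloads.get, reverse=True):
        print(dataset, downloads[dataset], "downloads from", len(unique_ips[dataset]), "addresses")

The interesting and difficult work starts once data like this is aggregated across repositories and fed back into the policy process.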
Without this, the RDM community is arguably doing exactly the same things that we complain about in researcher communities. Not taking a data driven view of what we are doing.
I know this is hard. I know it involves changing systems, testing things in new ways, collecting data in ways we are not used to. Even imposing disciplinary approaches that are well outside the comfort zone of those involved.
I also know there are pockets of excellent practice and significant efforts to gather and integrate information. But they are pockets. And these are exactly the things that funders and RDM professionals and institutions are asking of researchers. They are the right things to be asking for, and we’re making real progress towards realizing the vision of what is possible with data intensive research.
But just imagine if we could support policy development with that same level of information. At a pragmatic and political level it makes a strong statement when we "eat our own dogfood". And there is no better way to understand which systems and approaches are working and not working than by using them ourselves.
Michael Nielsen's talk at Science Online was a real eye opener for many of us who have been advocating for change in research practice. He framed the whole challenge of change as an example of a well known problem, that of collective action. How do societies manage big changes when those changes often represent a disadvantage to many individuals, at least in the short term? We can all see the global advantages of change, but individually acting on them doesn't make sense.
Michael placed this in the context of other changes, such as countries changing which side of the road they drive on, or the development of trade unions, changes that have been studied in some depth by political economists and similar academic disciplines. The message of these studies is that change usually occurs in two phases. First local communities adopt a practice (or at least adopt a view that they want things changed, in the case of which side of the road to drive on) and then these communities discover each other and "agglomerate", or in the language of physical chemistry there are points of nucleation which grow to some critical level and then the whole system undergoes a phase change, crystallising into a new form.
These two phases are driven by different sets of motivations and incentives. At a small scale processes are community driven, people know each other, and those interactions can drive and support local actions, expectations, and peer pressure. At a large scale the incentives need to be different and global. Often top-down policy changes (as in the side of the road) play a significant role here, but equally market effects and competition can also fall into place in a way that drives adoption of new tools or changes in behaviour. Think about the way new research techniques get adopted: first they are used by small communities, single labs, with perhaps a slow rate of spread to other groups. For a long time it's hard for the new approach to get traction, but suddenly at some point either enough people are using it that it's just the way things are done, or conversely those who are using it are moving ahead so fast that everyone else has to pile in just to keep up. It took nearly a decade for PCR, for instance, to gain widespread acceptance as a technique in molecular biology, but when it did it went very rapidly from being something people were a little unsure of to being the only way to get things done.
So what does this tell us about advocating for, or designing for, change? Michael's main point was that narrow scope is a feature, not a bug, when you are in that first phase. Working with small scale use cases, within communities, is the way to get started. Build for those communities and they will become your best advocates, but don't try to push the rate of growth, let it happen at the right rate (whatever that might be – and I don't really know how to tell, to be honest). But we also need to build in the grounding for the second phase.
The way these changes generally occur is through an accidental process of accretion and agglomeration. The phase change crystallises out around those pockets of new practice. But, to stretch the physical chemistry analogy, it doesn't necessarily crystallise in the form one would design for. We have an advantage though: if we design in advance to enable that crystallisation then we can prepare communities and tooling for when it happens, and we can design in the features that will get us closer to the optimum we are looking for.
What does this mean in practice? It means that when we develop tools and approaches it is more important for our community to have standards than it is for there to be an effort on any particular tool or approach. The language we use, that will be adopted by communities we are working with, should be consistent, so that when those communities meet they can communicate. The technical infrastructure we use should be shared, and we need interoperable standards to ensure that those connections can be made. Again, interchange and interoperability are more important than any single effort, any single project.
If we really believe in the value of change then we need to get these things together before we push them too hard into the diverse set of research communities where we want them to take root. We really need to get interoperability, standards, and language sorted out before the hammer of policy comes down and forces us into some sort of local minimum. In fact, it sounds rather like we have a collective action problem of our own. So what are we going to do about that?
While there has been a lot of talk about data repositories and data publication there remains a real lack of good tools that are truly attractive to research scientists and also provide a route to more general and effective data sharing. Peter Murray-Rust has recently discussed the deficiencies of the traditional institutional repository as a research data repository in some depth [1, 2, 3, 4].
Data publication approaches and tools are appearing, including Dryad, Figshare, BuzzData, and more traditional venues such as GigaScience from BioMedCentral, but these are all formal mechanisms that involve significant additional work alongside an active decision to "publish the data". The research repository of choice remains a haphazard file store and the data sharing mechanism of choice remains email. How do we bridge this gap?
One of the problems with many efforts in this space is how they are conceived and sold to the user. "Making it easy to put your data on the web" and "helping others to find your data" solve problems that most researchers don't think they have. Most researchers don't want to share at all, preferring to retain as much of an advantage through secrecy as possible. Those who do see a value in sharing are for the most part highly skeptical that the vast majority of research data can be used outside the lab in which it was generated. The small remainder who see a value in wider research data sharing are painfully aware of how much work it is to make that data useful.
A successful data repository system will start by solving a different problem, a problem that all researchers recognize they have, and will then nudge the users into doing the additional work of recording (or allowing the capture of) the metadata that could make that data useful to other researchers. Finally it will quietly encourage them to make the data accessible to other researchers. Both the nudge and the encouragement will come from offering back to the user immediate benefits in the form of automated processing, derived data products, or other incentives.
But first the problem to solve. The problem that researchers recognize they have, in many cases prompted by funder data sharing plan requirements, is to properly store and back up their data. A further problem most PIs realize they have is getting access to the data of their group in a useful form. So the initial sales pitch for a data repository is going to be local and secure backup, and sharing within the group. This has to be dead simple and ideally automated.
Such a repository will capture as much data as possible at source as it is generated. Just grabbing the file and storing it with the minimal contextual data of who (which directory was it saved in), when (acquired from the initial file datestamp), and what (where has it come from), backing it up and exposing it to the research group (or subsets of it) via a simple web service. It will almost certainly involve some form of Dropbox-style system which synchronises a user's own data across their own devices. Here is an immediate benefit: I don't have to get the pen drive out of my pocket if I'm confident my new data will be on my laptop when I get back to it.
It will allow for simple configuration on each instrument that sets up a target directory and registers a filetype so that the system can recognize what instrument, or computational process, a file came from (the what). The who can be more complex, but a combination of designated directories (where a user has their own directory of data on a specific instrument), login info, and, where required, low-irritation request or claiming systems can be built. The when is easy. The sell here is in two parts: directory syncing across computers means less mucking around with USB keys, and the backup makes everyone feel better: the researcher in the lab, the PI, and the funder.
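As a minimal sketch of what that capture step might look like, assuming a simple polling client watching configured instrument directories (all paths, instrument names, and the upload step are placeholders, not an existing system):

    # capture_client.py -- hypothetical sketch of a lab client that watches configured
    # instrument directories and records the minimal who/when/what metadata for each
    # new file before handing it to the group repository for backup.
    import time
    from datetime import datetime
    from pathlib import Path

    # Local configuration: which directory belongs to which instrument and user.
    WATCHED = {
        Path("/data/uv-vis/alice"): {"instrument": "UV-Vis-1", "who": "alice"},
        Path("/data/hplc/shared"): {"instrument": "HPLC-2", "who": None},  # claimed later
    }

    def record(path, config):
        # "When" comes from the file timestamp, "what" from the configured instrument,
        # "who" from the directory (or is left for a later, low-irritation claim step).
        metadata = {
            "file": str(path),
            "when": datetime.fromtimestamp(path.stat().st_mtime).isoformat(),
            "what": config["instrument"],
            "who": config["who"],
        }
        print("would back up to group repository:", metadata)  # placeholder for the real upload

    seen = set()
    while True:
        for directory, config in WATCHED.items():
            if not directory.is_dir():
                continue
            for path in directory.glob("*"):
                if path.is_file() and path not in seen:
                    seen.add(path)
                    record(path, config)
        time.sleep(30)  # simple polling; a real client would use filesystem events or a sync service

Everything interesting (authentication, the actual backup, the web service) is waved away here; the point is only how little metadata needs to be captured explicitly when the configuration itself carries the who and the what.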
So far so good, and there are in fact examples of systems like this that exist in one form or another or are being developed, including DataStage within the DataFlow project from David Shotton's group at Oxford, the Clarion project (Peter Murray-Rust's group at Cambridge), and Smart Research Frameworks (which I have an involvement with) led by Jeremy Frey at Southampton. I'm sure there are dozens of other systems or locally developed tools that do similar things, and these are a good starting position.
The question is how do you take systems like this and push them to the next level. How do you capture, or encourage the user to provide, enough metadata to actually make the stored data more widely useful? Particularly when they don't have any real interest in sharing or data publication? I think there is significant potential in offering downstream processing of the data.
If This Then That (IFTTT) is a startup that has got quite a bit of attention over the past few weeks as it has come into public beta. The concept is very simple. For a defined set of services there are specific triggers (posting a tweet, favouriting a YouTube video) that can be used to set off another action at another service (send an email, bookmark the URL of the tweet or the video). What if we could offer data processing steps to the user? If the processing steps happen automatically but require a bit more metadata will that provide the incentive to get that data in?
This concept may sound a lot like the functionality provided by workflow engines but there is a difference. Workflow systems are generally very difficult for the general user to set up. This is mostly because they solve a general problem, that of putting any object into any suitable process. IFTTT offers something much simpler, a small set of common actions on common objects, that solves the 80/20 problem. Workflows are hard because they can do anything with any object, and that flexibility comes at a price, because it is difficult to know whether a given CSV file is from a UV-Vis instrument, a small-angle x-ray experiment, or a simulated data set.
But locally, within a research group, there is a more limited set of data objects. With a local (or localized) repository it is possible to imagine plugins that do common single steps on common files. And because the configuration is local there is much less metadata required. But in turn that configuration provides metadata. If a particular filetype from a directory is configured for automated calculation of A280 for protein concentrations then we know that those data files are UV-Vis spectra. What is more, once we know that we can offer an automated protein concentration calculator. This will only work if the system knows what protein you are measuring, an incentive to identify the sample when you do the measurement.
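To make that concrete, a toy version of such a plugin might look something like this (the file layout, column format, and extinction coefficient table are invented for illustration; the calculation itself is just the Beer-Lambert law, concentration = absorbance / (extinction coefficient × path length)):

    # a280_plugin.py -- hypothetical plugin: given a UV-Vis spectrum captured from a
    # registered directory, estimate protein concentration from the absorbance at 280 nm.
    import csv

    # Extinction coefficients at 280 nm in mL/(mg*cm); in practice this is exactly the
    # metadata the user is nudged to supply (which protein is being measured).
    EXTINCTION = {"lysozyme": 2.64, "BSA": 0.66}

    def a280_concentration(spectrum_csv, protein, path_length_cm=1.0):
        """Return concentration in mg/mL from a two-column wavelength,absorbance CSV."""
        with open(spectrum_csv) as f:
            rows = [(float(wavelength), float(absorbance)) for wavelength, absorbance in csv.reader(f)]
        # Take the absorbance at the wavelength closest to 280 nm.
        _, a280 = min(rows, key=lambda row: abs(row[0] - 280.0))
        # Beer-Lambert: concentration = absorbance / (extinction coefficient * path length)
        return a280 / (EXTINCTION[protein] * path_length_cm)

    # Example: a file the capture client has already tagged as coming from the UV-Vis instrument.
    # print(a280_concentration("spectrum_001.csv", "lysozyme"))

The useful trick is that the plugin cannot run until it knows which protein is in the cuvette, which is precisely the nudge towards providing sample metadata described above.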
The architecture of such a system would be relatively straightforward. A web service provides the cloud based backup and configuration for captured data files and users. Clients that sit on group users' personal computers as well as on instruments grab their configuration information from the central repository. They might simply monitor specified directories, or they might pop up with a defined set of questions to capture additional metadata. Users register the instruments that they want to "follow" and when a new data file is generated with their name on it, it is also synchronized back to their registered devices.
The web service provides a plugin architecture where appropriate plugins for the group can be added from some sort of online marketplace. Plugins that process data to generate additional metadata (e.g. by parsing a log file) can add that to the record of the data file. Those that generate new data will need to be configured as to where that data should go and who should have it synchronised. The plugins will also generate a record of what they did, providing an audit and provenance trail. Finally, plugins can provide notification back to the users, via email, the web service, or a desktop client, of queued processes that need more information to proceed. The user can mute these but equally the encouragement is there to provide a little more info.
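As a rough sketch of the shape such a plugin contract might take (everything here is illustrative, not an existing API), the essentials are a way to declare what metadata is still missing, a processing step, and a provenance record:

    # plugin_api.py -- hypothetical plugin interface for the repository web service:
    # each plugin transforms a captured file and returns both its outputs and an
    # audit/provenance record of what was done.
    from datetime import datetime, timezone

    class Plugin:
        name = "base"
        version = "0.0"

        def required_metadata(self, record):
            # Return the missing fields the user must supply before this plugin can
            # run; this drives the "please tell us a bit more" notifications.
            return []

        def run(self, record):
            # Process the captured file and return a list of output file records.
            raise NotImplementedError

        def provenance(self, record, outputs):
            # Audit trail entry attached to both the input and the output records.
            return {
                "plugin": self.name,
                "version": self.version,
                "input": record["file"],
                "outputs": [out["file"] for out in outputs],
                "ran_at": datetime.now(timezone.utc).isoformat(),
            }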
Finally, the question of sharing and publication. For individual data file sharing sites like Figshare, plugins might enable straightforward direct submission of files placed into a specific directory. For collections of data, such as those supported by Dryad, there will need to be a way to group files together, but again this could be as simple as creating a directory. Even if files are copied and pasted or removed from their "proper directories" the system stands a reasonable chance of recognizing files it has already seen and inferring their provenance.
By making it easy to share, and easier to satisfy data sharing requests by pushing data to a public space (while still retaining some ability to see how it is being used and by whom), we provide the substrate that is required to build better plugins, more functionality, and above all better discovery tools, and in turn those tools will start to develop. As the tools and functionality develop, the value gained by sharing will rise, creating a virtuous circle encouraging good data management practice, good metadata provision, and good sharing.
This path starts with things we can build today, and in some cases things that already exist. It becomes more speculative as it goes forward. There are issues with file syncing and security. Things will get messy. The plugin architecture is nothing more than hand waving at the moment and success will require a whole ecosystem of repositories and tools for operating on them. But it is a way forward that looks plausible to me. One that solves the problems researchers have today and guides them towards a tomorrow where best practice is a bit closer to common practice.
Science Online London ran late last week and into the weekend and I was very pleased to be asked to run a panel, broadly speaking focused on evaluation and incentives. Now I had thought that the panel went pretty well but I’d be fibbing if I said I wasn’t a bit disappointed. Not disappointed with the panel members or what they said. Yes it was a bit more general than I had hoped and there were things that I wished we’d covered but the substance was good from my perspective. My disappointment was with the response from the audience, really on two slightly different points.
The first was the lack of response to what I thought were some of the most exciting things I’ve heard in a long time from major stakeholders. I’ll come back to that later. But a bigger disappointment was that people didn’t seem to connect the dots to their own needs and experiences.
Science Online, in both its London and North Carolina forms, has for me always been a meeting where the conversation proceeds at a more sophisticated level than usual. So I pitched the plan of the session at where I thought the level should be. Yes, we needed to talk about the challenges and surface the usual problems: non-traditional research outputs, and online outputs in particular, don't get the kind of credit that papers do; institutions struggle to give credit for work that doesn't fit in a pigeonhole; funders seem to reward only the conventional and traditional; and people outside the ivory tower struggle to get either recognition or funding. These are known challenges, the question is how to tackle them.
The step beyond this is the hard one. It is easy to say that incentives need to change. But incentives don't drop from heaven. Incentives are created within communities and they become meaningful when they are linked to the interests of stakeholders with resources. So the discussion wasn't really about impact, or funding, or to show that nothing can be done by amateurs. The discussion was about the needs of institutions and funders and how they can be served by what is being generated by the online community. It was also about the constraints they face in acting. But fundamentally you had major players on the stage saying "this is the kind of thing we need to get the ball rolling".
Make no mistake, this is tough. Everyone is constrained and resources are tight but at the centre of the discussion were the key pointers to how to cut through the knot. The head of strategy at a major research university stated that universities want to play a more diverse role, want to create more diverse scholarly outputs, and want to engage with the wider community in new ways. That smart institutions will be looking to diversify. The head of evaluation at a major UK funder said that funders really want to know about non-traditional outputs and how they were having a wider impact. That these outputs are amongst the best things they can talk about to government. That they will be crucial to make the case to sustain science funding.
Those statements are amongst the most direct and exciting I have heard in some years of advocacy in this space. The opportunity is there, if you're willing to put the effort in to communicate and to shape what you are doing to match their needs. As Michael Nielsen said in his morning keynote, this is a collective action problem. That means finding what unites the needs of those doing with the needs of those with resources. It means compromise, and it means focusing on the achievable, but the point of the discussion was to identify what might be achievable.
So mostly I was disappointed that the excitement I felt wasn't mirrored in the audience. The discussion about incentives has to move on. Saying that "institutions should do X" or "funders should do Y" gets us nowhere. Understanding what we can do together with funders and institutions and other communities to take the online agenda forward, and understanding what the constraints are, is where we need to go. The discussion showed that both institutions and funders know that they need what the community of online scientists can do. They don't know how to go about it, and they don't even know very much about what we are doing, but they want to know. And when they do know they can advise and help and they can bring resources to bear. Maybe not all the resources you would like, and maybe not for all the things you would like, but resources nonetheless.
With a lot of things it is easy to get too immersed in the detail of these issues and to forget that people are looking in from the outside without the same context. I guess the fact that I pulled out as the main message what might have seemed to the audience to be just asides is indicative of that. But I really want to get that message out because I think it is critical if the community of online scientists wants to be the mainstream. And I think it should be.
The bottom line is that smart funders and smart institutions value what is going on online. They want to support it, they want to be seen to support it, but they're not always sure how to go about it and how to judge its quality. But they want to know more. That's where you come in and that's why the session was relevant. Lars Fischer had it absolutely right: "I think the biggest and most consequential incentive for scientists is (informal) recognition by peers." You know, we know, who is doing the good stuff and what is valuable. Take that conversation to the funders and the institutions, explain to them what's good and why, and tell the story of what the value is. Put it in your CV, demand that promotion panels take account of it, whichever side of the table you are on. Show that you make an impact in language that they understand. They want to know. They may not always be able to act – funding is an issue – but they want to and they need your help. In many ways they need our help more than we need theirs. And if that isn't an incentive then I don't know what is.
1. What ethical and legal principles should govern access to research results and data? How can ethics and law assist in simultaneously protecting and promoting both public and private interests?
There are broadly two principles that govern the ethics of access to research results and data. Firstly, there is the simple position that publicly funded research should by default be accessible to the public (with certain limited exceptions, see below). Secondly, claims based on research that impinge on public policy, health, safety, or the environment should be supported by public access to the underlying data. See more detail in the answer to Q2.
2 a) How should principles apply to publicly-funded research conducted in the public interest?
By default, research outputs from publicly funded research should be made publicly accessible and re-usable in as timely a manner as possible. In an ideal world the default would be immediate release; however this is not a practically achievable goal in the near future. Cultural barriers and community inertia prevent the exploitation of technological tools that demonstrably have the potential to enable research to move faster and more effectively. Research communication mechanisms are currently shackled to the requirements of the research community to monitor career progression and are not optimised for effective communication.
In the near term it is practical to move towards an expectation that research outputs that support published research should be accessible and re-usable. Reasonable exceptions to this include data that is personally identifiable, that may place cultural or environmental heritage at risk, that places researchers at risk, or that might affect the integrity of ongoing data collection. The key point is that while there are reasonable exceptions to the principle of public access to public research outputs, these are exceptions and not the general rule.
What is not reasonable is to withhold or limit the re-use of data, materials, or other research outputs from public research for the purpose of personal advancement, including the "squeezing out of a few more papers". If these outputs can be more effectively exploited elsewhere then this is a more efficient use of public resources to further our public research agenda. The community has placed the importance of our own career advancement ahead of the public interest in achieving outcomes from public research for far too long.
What is also politically naive is to believe or even to create the perception that it is acceptable to withhold data on the basis that “the public won’t understand” or “it might be misused”. The web has radically changed the economics of information transfer but it has perhaps more importantly changed the public perception on access to data. The wider community is rightly suspicious of any situation where public information is withheld. This applies equally to publicly funded research as it does to government data.
2 b) How should principles apply to privately-funded research involving data collected about or from individuals and/or organisations (e.g. clinical trials)?
Increasingly public advocacy groups are becoming involved in contributing to a range of research activities including patient advocacy groups supporting clinical trials, environmental advocacy groups supporting data collection, as well as a wider public involvement in, for instance, citizen science projects.
In the case where individuals or organisations are contributing to research they have a right for that contribution to be recognised and a right to participate on their own terms (or to choose not to participate where those terms are unacceptable).
Organised groups (particularly patient groups) are of growing importance to a range of research. Researchers should expect to negotiate with such groups as to the ultimate publication of data. Such groups should have the ability to demand greater public release and to waive rights to privacy. Equally contributors have a right to expect a default right to privacy where personally identifiable information is involved.
Privacy trumps the expectation of data release, and what counts as personally identifiable information is a vexed question which as a society we are still working through. Researchers will need to explore these issues with participants and work to ensure that data generated can be anonymised in a way that enables the released data to effectively support the claims made from it. This is a challenging area which requires significant further technical, policy, and ethics work.
2 c) How should principles apply to research that is entirely privately-funded but with possible public implications?
It is clear that publicly funded research is a public good. By contrast privately funded research is properly a private good and the decision to release or not release research outputs lies with the funder.
It is worth noting that much of the privately funded research in UK universities is significantly subsidised through the provision of public infrastructure and this should be taken into consideration when defining publicly and privately funded research. Here I consider research that is 100% privately funded.
Where claims are made on the basis of privately funded research (e.g. of environmental impact or the efficacy of health treatments) then such claims SHOULD be fully supported by provision of the underlying evidence and data if they are to be credible. Where such claims are intended to influence public policy such evidence and data MUST be made available. That is, evidence based public policy must be supported by the publication of the full evidence regardless of the source of that evidence. Claims made to influence public policy that are not supported by provision of evidence must be discounted for the purposes of making public policy.
2 d) How should principles apply to research or communication of data that involves the promotion of the public interest but which might have implications for the privacy interests of citizens?
See above: the right to privacy trumps any requirement to release raw data. Nonetheless research should be structured and appropriate consent obtained to ensure that claims made on the basis of the research can be supported by an adequate, publicly accessible, evidence base.
3. What activities are currently under way that could improve the sharing and communication of scientific information?
A wide variety of technical initiatives are underway to enable the wider collection, capture, archival and distribution of research outputs including narrative, data, materials, and other elements of the research process. It is technically possible for us today to immediately publish the entire research record if we so choose. Such an extreme approach is resource intensive, challenging, and probably not ultimately a sensible use of resources. However it is clear that more complete and rapid sharing has the potential to increase the effectiveness and efficiency of research.
The challenges in exploiting these opportunities are fundamentally cultural. The research community is focussed almost entirely on assessment through the extremely narrow lens of publication of extended narratives in high profile peer reviewed journals. This cultural bias must be at least partially reversed before we can realise the opportunities that technology affords us. This involves advocacy work, policy development, the addressing of incentives for researchers and above all the slow and arduous process of returning the research culture to one which takes responsibility for the return on the public investment, including economic, health, social, education, and research returns and one that takes responsibility for effective communication of research outputs.
4. How do/should new media, including the blogosphere, change how scientists conduct and communicate their research?
New media (not really new any more and increasingly part of the mainstream) democratise access to communications and increase the pace of communication. This is not entirely a good thing, and en masse the quality of the discourse is not always high. High quality depends on the good will, expertise, and experience of those taking part. There is a vast quantity of high quality, rapid response discourse that occurs around research on the web today, even if it occurs in many places. The most effective means of determining whether a recent high profile communication stands up to criticism is to turn to discussion on blogs and news sites, not to wait months for a possible technical criticism to appear in a journal. In many ways this is nothing new; it is a return to the traditional approaches of communication seen at the birth of the Royal Society itself, of direct and immediate communication between researchers by the most efficient means possible: letters in the 17th century and the web today.
Alongside the potential for more effective communication of researchers with each other there is also an enormous potential for more effective engagement with the wider community, not merely through “news and views” pieces but through active conversation, and indeed active contributions from outside the academy. A group of computer consultants are working to contribute their expertise in software development to improving legacy climate science software. This is a real contribution to the research effort. Equally the right question at the right time may come from an unexpected source but lead to new insights. We need to be open to this.
At the same time there is a technical deficiency in the current web and that is the management of the sheer quantity of potential connections that can be made. Our most valuable resource in research is expert attention. This attention may come from inside or outside the academy but it is a resource that needs to be efficiently directed to where it can have the most impact. This will include the necessary development of mechanisms that assist in choosing which potential contacts and information to follow up. These are currently in their infancy. Their development is in any case a necessity to deal with the explosion of traditional information sources.
5. What additional challenges are there in making data usable by scientists in the same field, scientists in other fields, ‘citizen scientists’ and the general public?
Effective sharing of data, and indeed of most research outputs, remains a significant challenge. The problem is two-fold: first, ensuring sufficient contextual information that an expert can understand the potential uses of the research output; secondly, placing that contextual information in a narrative that is understandable to the widest possible range of users. These are both significant challenges that are being tackled by a large number of skilled people. Progress is being made but a great deal of work remains in developing the tools, techniques, and processes that will enable the cost effective sharing of research outputs.
A key point, however, is that in a world where publication is extremely cheap, simply releasing whatever outputs exist in their current form can still have a positive effect. Firstly, where the cost of release is effectively zero, even if there is only a small chance of those data being discovered and re-used this will still lead to positive outcomes in aggregate. Secondly, the presence of this underexploited resource of released, but insufficiently marked up and contextualised, data will drive the development of real systems that will make them more useful.
6 a) What might be the benefits of more widespread sharing of data for the productivity and efficiency of scientific research?
Fundamentally more efficient, more effective, and more publicly engaging research. Less repetition and needless rediscovery of negative results and ideally more effective replication and critiquing of positive results are enabled by more widespread data sharing. As noted above another important outcome is that even suboptimal sharing will help to drive the development of tools that will help to optimise the effective release of data.
6 b) What might be the benefits of more widespread sharing of data for new sorts of science?
The widespread sharing of data has historically always led to entirely new forms of science. The modern science of crystallography is based largely on the availability of crystal structures; bioinformatics would simply not exist without GenBank, the PDB, and other biological databases; and the astronomy of today would be unrecognizable to someone whose career ended prior to the availability of the Sloan Digital Sky Survey. Citizen science projects of the type of Galaxy Zoo, Foldit and many others are inconceivable without the data to support them. Extrapolating from this evidence provides an exciting view of the possibilities. Indeed one which it would be negligent not to exploit.
6 c) What might be the benefits of more widespread sharing of data for public policy?
Policy making that is supported by more effective evidence is something that appeals to most scientists. Of course public policy making is never that simple. Nonetheless it is hard to see how a more effective and comprehensive evidence base could fail to support better evidence based policy making. Indeed it is to be hoped that a wide evidence base, and the contradictions it will necessarily contain, could lead to a more sophisticated understanding of the scope and critique of evidence sources.
6 d) What might be the benefits of more widespread sharing of data for other social benefits?
The potential for wider public involvement in science is a major potential benefit. As noted above, a deeper understanding of how to treat and parse evidence and data throughout society can only be positive.
6 e) What might be the benefits of more widespread sharing of data for innovation and economic growth?
Every study of the release of government data has shown that it leads to a net economic benefit. This is true even when such data has traditionally been charged for. The national economy benefits to a much greater extent than any potential loss of revenue. While this is not necessarily a sufficient incentive for private investors to release data, in the case of public investment the object is to maximise national ROI. Therefore release in a fully open form is the rational economic approach.
The cost to SMEs of lack of access to publicly funded research outputs is well established. Improved access will remove barriers that currently stifle innovation and economic growth.
6 f) What might be the benefits of more widespread sharing of data for public trust in the processes of science?
There is both a negative and a positive side to this question. On the positive greater transparency, more potential for direct involvement, and a greater understanding of the process by which research proceeds will lead to greater public confidence. On the negative, doing nothing is simply not an option. Recent events have shown not so much that the public has lost confidence in science and scientists but that there is deep shock at the lack of transparency and the lack of availability of data.
If the research community does not wish to be perceived in the same way as MPs and other recent targets of public derision then we need to move rapidly to improve the degree of transparency and accessibility of the outputs of public research.
7. How should concerns about privacy, security and intellectual property be balanced against the proposed benefits of openness?
There is little evidence that the protection of IP supports a net increase in the return on the public investment in research. While there may be cases where it is locally optimal to pursue IP protection to exploit research outputs and maximise ROI, this is not generally the case. The presumption that everything should be patented is both draining resources and stifling British research. There should always be an avenue for taking this route to exploitation, but there should be a presumption of open communication of research outputs, and the need for IP protection should be justified on a case by case basis. It should be unacceptable for the pursuit of IP protection to damage the communication and downstream exploitation of research.
Privacy issues and concerns around the personal security of researchers have been discussed above. National security issues will in many cases fall under a justifiable exception to the presumption of openness although it is clear that this needs care and probably oversight to retain public confidence.
8. What should be expected and/or required of scientists (in companies, universities or elsewhere), research funders, regulators, scientific publishers, research institutions, international organisations and other bodies?
British research could benefit from a statement of values, something that has the cultural significance of the Haldane principle (although perhaps better understood) or the Hippocratic oath. A shared cultural statement that captures a commitment to efficiently discharging the public trust invested in us, to open processes as a default, and to specific approaches where appropriate would act as a strong centre around which policy and tools could be developed. Leadership is crucial here in setting values and embedding these within our culture. Organisations such as the Royal Society have an important role to play.
Researchers and the research community need to take these responsibilities on ourselves in a serious and considered manner. Funders and regulators need to provide a policy framework, and where appropriate community sanctions for transgression of important principles. Research institutions are for the most part tied into current incentive systems that are tightly coupled to funding arrangements and have limited freedom of movement. Nonetheless a serious consideration of the ROI of technology transfer arrangements and of how non-traditional outputs, including data, contribute to the work of the institution and its standing are required. In the current economic climate successful institutions will diversify in their approach. Those that do not are unlikely to survive in their current form.
Other comments
This is not the first time that the research community has faced this issue. Indeed it is not even the first time the Royal Society has played a central role. Several hundred years ago it was a challenge to persuade researchers to share information at all. Results were hidden. Sharing was partial, only within tight circles, and usually limited in scope. The precursors of the Royal Society played a key role in persuading the community that effective sharing of their research outputs would improve research. Many of the same concerns were raised; concerns about the misuse of those outputs, concerns about others stealing ideas, concerns about personal prestige and the embarrassment potential of getting things wrong.
The development of journals and the development of a values system that demanded that results be made public took time, it took leadership, and with the technology of the day the best possible system was developed over an extended period. With a new technology now available we face the same issues and challenges. It is to be hoped that we tackle those challenges and opportunities with the same sense of purpose.
I developed an interest in research evaluation as an advocate of open research process. It is clear that researchers are not going to change themselves, so someone is going to have to change them, and it is funders who wield the biggest stick. The only question, I thought, was how to persuade them to use it.
Of course it’s not that simple. It turns out that funders are highly constrained as well. They can lead from the front but not too far out in front if they want to retain the confidence of their community. And the actual decision making processes remain dominated by senior researchers. Successful senior researchers with little interest in rocking the boat too much.
The thing you realize as you dig deeper into this is that the key lies in finding motivations that work across the interests of different stakeholders. The challenge lies in finding the shared objectives: what is it that unites researchers and funders, as well as government and the wider community? So what can we find that is shared?
I'd like to suggest that one answer to that is Impact. The research community as a whole has a stake in convincing government that research funding is well invested. Government also has a stake in understanding how to maximize the return on its investment. Researchers do want to make a difference, even if that difference is a long way off. You need a scattergun approach to get the big results, but that means supporting a diverse range of research in the knowledge that some of it will go nowhere but some of it will pay off.
Impact has a bad name but if we step aside from the gut reactions and look at what we actually want out of research then we start to see a need to raise some challenging questions. What is research for? What is its role in our society really? What outcomes would we like to see from it, and over what timeframes? What would we want to evaluate those outcomes against? Economic impact yes, as well as social, health, policy, and environmental impact. This is called the ‘triple bottom line’ in Australia. But alongside these there is also research impact.
All these have something in common. Re-use. What we mean by impact is re-use. Re-use in industry, re-use in public health and education, re-use in policy development and enactment, and re-use in research.
And this frame brings some interesting possibilities. We can measure some types of re-use. Citation, retweets, re-use of data or materials, or methods or software. We can think about gathering evidence of other types of re-use, and of improving the systems that acknowledge re-use. If we can expand the culture of citation and linking to new objects and new forms of re-use, particularly for objects on the web, where there is some good low hanging fruit, then we can gather a much stronger and more comprehensive evidence base to support all sorts of decision making.
There are also problems and challenges. The same ones that any social metrics bring. Concentration and community effects, the Matthew effect of the rich getting richer. We need to understand these feedback effects much better and I am very glad there are significant projects addressing this.
But there is also something more compelling for me in this view. It lets us reframe the debate around basic research. The argument goes that we need basic research to support future breakthroughs. We know neither what we will need nor where it will come from. But we know that it's very hard to predict – that's why we support curiosity driven research as an important part of the portfolio of projects. Yet the dissemination of this investment in the future is amongst the weakest in our research portfolio. At best a few papers are released, then hidden in journals that most of the world has no access to, and in many cases without the data or other products being either indexed or even made available. And this lack of effective dissemination is often because the work is perceived as low, or perhaps better, slow impact.
We may not be able to demonstrate or to measure significant re-use of the outputs of this research for many years. But what we can do is focus on optimizing the capacity, the potential, for future exploitation. Where we can’t demonstrate re-use and impact we should demand that researchers demonstrate that they have optimized their outputs to enable future re-use and impact.
And this brings me full circle. My belief is that the way to ensure the best opportunities for downstream re-use, over all timeframes, is that the research outputs are open, in the Budapest Declaration sense. But we don’t have to take my word for it, we can gather evidence. Making everything naively open will not always be the best answer, but we need to understand where that is and how best to deal with it. We need to gather evidence of re-use over time to understand how to optimize our outputs to maximize their impact.
But if we choose to value re-use, to value the downstream impact that our research has, or could have, then we can make this debate not about politics or ideology but about how best to take the public investment in research and invest it for the outcomes that we need as a society.
Quite some months ago an article in Cancer Biology & Therapy by Scott Kern of Johns Hopkins kicked up an almighty online stink. The article, entitled "Where's the passion?", bemoaned the lack of hard core dedication amongst the younger researchers that the author saw around him…starting with:
It is Sunday afternoon on a sunny, spring day.
I’m walking the halls—all of them—in a modern $59 million building dedicated to cancer research. A half hour ago, I completed a stroll through another, identical building. You see, I’m doing a survey. And the two buildings are largely empty.
The point being that if they really cared, those young researchers would be there day in, day out, working their hearts out to get to the key finding. At one level this is risible: expecting everyone to work 24×7 is not a good or efficient way to get results. Furthermore you have to wonder why these younger researchers have "lost their passion". Why doesn't the environment create that naturally? What messages are the tenured staff sending through their actions? But I'd be dishonest if I said there wasn't a twinge of sympathy on my part as well. Anyone who's run a group has had that thought at the back of their mind: "if only they'd work harder/smarter/longer we'd be that much further ahead…".
But all of that has been covered by others. What jumped out of the piece for me at the time were some other passages, ones that really got me angry.
When the mothers of the Mothers March collected dimes, they KNEW that teams, at that minute, were performing difficult, even dangerous, research in the supported labs. Modern cancer advocates walk for a cure down the city streets on Saturday mornings across the land. They can comfortably know that, uh…let’s see here…, some of their donations might receive similar passion. Anyway, the effort should be up to full force by 10 a.m. or so the following Monday.
[…]
During the survey period, off-site laypersons offer comments on my observations. "Don't the people with families have a right to a career in cancer research also?" I choose not to answer. How would I? Do the patients have a duty to provide this "right", perhaps by entering suspended animation?
Now these are all worthy sentiments. We'd all like to see faster development of cures and I've no doubt that the people out there pounding the streets are driven to do all they can to see those cures advance. But is the real problem here whether the postdocs are in the lab on a Sunday afternoon, or are there other things we could do to speed this up? Maybe there are other parts of the research enterprise that could be made more efficient… like, I don't know, making the results of research widely available and ensuring that others are in the best position possible to build on them?
It would be easy to pick on Kern’s record on publishing open access papers. Has he made all the efforts that would enable patients and doctors to make the best decisions they can on the basis of his research? His lab generates cell lines that can support further research. Are those freely available for others to use and build on? But to pick on Kern personally is to completely miss the point.
No, the problem is that this is systemic. Researchers across the board seem to have no interest whatsoever in looking closely at how we might deliver outcomes faster. No-one is prepared to think about how the system could be improved so as to deliver more, because everyone is too focussed on climbing the greasy pole: writing the next big paper and landing the next big grant. What is worse is that it is precisely in those areas where there is most public effort to raise money, where there is a desperate need, that attitudes towards making research outputs available are at their worst.
What made me absolutely incandescent about this piece was a small piece of data that some of us have known about for a while but has only just been published. Heather Piwowar, who has done a lot of work on how and where people share, took a close look at the sharing of microarray data: what kinds of things are correlated with data sharing? The paper bears close reading (full disclosure: I was the academic editor for PLoS ONE on this paper) but one thing has stood out for me as shocking since the first time I heard Heather discuss it: microarray data linked to studies of cancer is systematically less shared.
This is not an isolated case. Across the board there are serious questions to be asked about why it seems so difficult to get the data from studies that relate to cancer. I don't want to speculate on the reasons because whatever they are, they are unacceptable. I know I've recommended this video of Josh Sommer speaking many times before, but watch it again. Then read Heather's paper. And then decide what you think we need to do about it. Because this cannot go on.
Peter Murray-Rust has sparked off another round in the discussion of the value that publishers bring to the scholarly communication game and told a particular story of woe and pain inflicted by the incumbent publishers. On the day he posted that, I had my own experience of just how inefficient and ineffective our communication systems are, wasting the better part of a day trying to find some information. I thought it might be fun to encourage people to post their own stories of problems and frustrations with access to the literature and the downstream issues that creates, so here is mine.
I am by no means a skilled organic chemist, but I've done a bit of synthesis in my time and I certainly know enough to be able to read synthetic chemistry papers and decide whether a particular synthesis is accessible. So on this particular day I was interested in deciding whether it was easy or difficult to make deuterated mono-olein. This molecule can be made by connecting glycerol to oleic acid. Glycerol is cheap and I should have in my hands some deuterated oleic acid in the next month or so. The chemistry for connecting acids to alcohols is straightforward – I've even done it myself – but this is a slightly special case. Firstly, the standard methods tend to be wasteful of the acid, which in my case is the expensive bit. Secondly, glycerol has three alcohol groups. I only want to modify one, leaving the other two unchanged, so it is important to find a method that gives me mostly what I want and only a little of what I don't.
So the question for me is: is there a high yielding reaction that will give me mostly what I want, while wasting as little as possible of the oleic acid? And if there is a good technique, is it accessible given the equipment I have in the lab? Simple question. A quick trip to Google Scholar turned up reams of likely looking papers, not one of which I had full text access to. The abstracts are nearly useless in this case because I need to know details of yields and methodology, so I had several hundred papers and no means of figuring out which might be worth an inter-library loan. I spent hours trying to parse the abstracts to figure out which were the most promising and in the end I broke… I asked someone to email me a couple of pdfs because I knew they had access. Bear in mind that what I wanted to do was spend a quick 30 minutes or so deciding whether this was worth pursuing in detail. What it took was about three hours, which at the full economic cost of my time comes to about £250. That's about £200 of UK taxpayers' money down the toilet because, on the site of the UK's premier physical and biological research facilities, I don't have access to those papers. Yes, I could have asked someone else to look but that would have taken up their time.
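For what it's worth, the rough arithmetic behind those figures looks something like this. It is only a back-of-the-envelope sketch using the approximate numbers above; the implied hourly rate is not taken from any real costing exercise.

```python
# Back-of-the-envelope sketch of the cost figures quoted above.
# All numbers are the rough ones from the text, not a real accounting exercise.
hours_spent = 3.0        # time actually spent searching
hours_intended = 0.5     # the quick check originally planned
total_cost_gbp = 250.0   # approximate full economic cost of three hours of my time

hourly_rate = total_cost_gbp / hours_spent                  # ~£83 per hour (implied)
wasted_cost = (hours_spent - hours_intended) * hourly_rate  # ~£200 of avoidable cost

print(f"Implied rate: ~£{hourly_rate:.0f}/hour, avoidable cost: ~£{wasted_cost:.0f}")
```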
But you know what's really infuriating? I shouldn't have needed to look at the papers at all for my initial search. What I should have been able to do was ask the question:
Show me all syntheses of mono-olein ranked first by purity of the product and secondly by the yield with respect to oleic acid.
There should be a database where I can get this information. In fact there is, but we can't afford access to the ACS's information services here. These are incredibly expensive because it used to be necessary for this information to be culled from papers by hand. Today that's no longer necessary: it could be done cheaply and rapidly. In fact I've seen it done by tools developed in Peter's group that get around ~95% accuracy and ~80% recall over the synthetic organic chemistry literature. Those are hit rates that would have solved my problem easily and effectively.
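To make the shape of that query concrete, here is a minimal sketch assuming a hypothetical set of machine-extracted synthesis records. The field names and example values are invented for illustration; this is neither the ACS service nor the tools from Peter's group.

```python
# Hypothetical records of the kind a text-mining pipeline might extract from
# the synthesis literature. All field names and values here are illustrative.
syntheses = [
    {"doi": "10.xxxx/a", "product": "mono-olein", "purity_pct": 92, "yield_vs_oleic_acid_pct": 71},
    {"doi": "10.xxxx/b", "product": "mono-olein", "purity_pct": 98, "yield_vs_oleic_acid_pct": 55},
    {"doi": "10.xxxx/c", "product": "mono-olein", "purity_pct": 98, "yield_vs_oleic_acid_pct": 80},
]

# "All syntheses of mono-olein, ranked first by purity of the product and
#  secondly by yield with respect to oleic acid" becomes a filter and a sort.
ranked = sorted(
    (s for s in syntheses if s["product"] == "mono-olein"),
    key=lambda s: (s["purity_pct"], s["yield_vs_oleic_acid_pct"]),
    reverse=True,
)

for s in ranked:
    print(s["doi"], s["purity_pct"], s["yield_vs_oleic_acid_pct"])
```

With open access to the full text, and the right to mine it, keeping a record set like that up to date is exactly the sort of job the existing extraction tools could do.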
Unfortunately, despite the fact that those tools exist, that they could be deployed easily and cheaply, and that they could save researchers vast amounts of time, research is being held back by a lack of access to the literature and, where there is access, by contracts that prevent us from collating, aggregating, and analysing our own work. The public pays for the research to be done, the public pays for researchers to be able to read it, and in most cases the public has to pay again should they want to read it themselves. But what is most infuriating is the way the public pays yet again when I and a million other scientists waste our time, the public's time, because the tools that exist and work cannot be deployed.
How many researchers in the UK or worldwide are losing hours or even days every week because of these inefficiencies? How many new tools or techniques are never developed because they can't legally be deployed? And how many hundreds of millions of dollars of public money does that add up to?
Michael Nielsen is a good friend as well as being an inspiration to many of us in the Open Science community. I've been privileged to watch, and in a small way to contribute to, the development of his arguments over the years, and I found the distillation of those years of effort into the talk he recently gave at TEDxWaterloo entirely successful. Here is a widely accessible and entertaining talk that really pins down the arguments, the history, the successes and the failures of recent efforts to open up science practice.
Professional scientific credit is the central issue
I've been involved in many discussions around why the potential of opening up research practice hasn't led to wider adoption of these approaches. The answer is simple and, as Michael says very clearly in the opening section of the talk, the problem is that innovative approaches to doing science are not going to be adopted while those that use them don't get conventional scientific credit. I therefore have to admit to being somewhat nonplussed by GrrlScientist's assessment of the talk that "Dr Nielsen has missed — he certainly has not emphasised — the most obvious reason why the Open Science movement will not work: credit."
For me, the entire talk is about credit. He frames the discussion of why the Qwiki wasn't a huge success, compared to the Polymath project, in terms of the production of conventional papers, and he discusses the transition from Galileo's anagrams to the development of the scientific journal in terms of ensuring priority and credit. Finally he explicitly asks the non-scientist members of the audience to do something that speaks even more closely to the issue of credit: to ask their scientist friends and family what they are doing to make their results more widely available. Remember this talk is aimed at a wider audience, the TEDxWaterloo attendees and the larger audience for the video online (nearly 6,000 when I wrote this post). What happens when taxpayers start asking their friends, their family, and their legislative representatives how scientific results are being made available? You'd better believe that this has an effect on the credit economy.
Do we just need the celebrities to back us?
Grrl suggests that the answer to pushing the agenda forward is to enlist Nobelists to drive projects in the same way that Tim Gowers pushed the Polymath project. While I can see the logic, and there is certainly value in moral support from successful scientists, we already have a lot of this. Sulston, Varmus, Michael and Jon Eisen, and indeed Michael Nielsen himself, to name just a few, are already pushing this agenda. But moral support and single projects are not enough. What we need to do is hack the underlying credit economy: provide proper citations for data and software, and exploit the obsession with impact factors.
The key to success in my view is a pincer movement. First, showing that more (if not always completely) open approaches can outcompete closed approaches on traditional assessment measures, something demonstrated successfully by Galaxy Zoo, the Alzheimer's Disease Neuroimaging Initiative, and the Polymath projects. Secondly, changing assessment policy and culture itself, both explicitly, by changing the measures by which researchers are ranked, and implicitly, by raising the public expectation that research should be open.
The pendulum is swinging and we’re pushing it just about every which-way we can
I guess what really gets my back up is that Grrl sets off with the statement that "Open Science will never work" but then goes on to put her finger on exactly the point where we can push to make it work. Professional and public credit is absolutely at the centre of the challenge. Michael's talk is part of a concerted, even quite carefully coordinated, campaign to tackle this issue at a wide range of levels. Michael's tour with this talk, funded by the Open Society Institute, seeks to raise awareness. My recent focus on research assessment (and a project also funded by OSI) is tackling the same problem from another angle. It is not entirely a coincidence that I'm writing this in a hotel room in Washington DC, and it is not at all accidental that I'm very interested in progress towards widely accepted researcher identifiers. The development of Open Research Computation is a deliberate attempt to build a journal that exploits the nature of journal rankings to make software development more highly valued.
All of these are part of a push to hack, reconfigure, and re-assess the outputs and outcomes that researchers get credit for, and the outputs and outcomes that are valued by tenure committees and grant panels. And from where I stand we're making enough progress that Grrl's argument seems a bit tired and outdated. I'm seeing enough examples of people getting credit and reward for being open, and simply doing and enabling better science as a result, that I'm confident the pendulum is swinging. Would I advise a young scientist that being open will lead to certain glory? No, it's far from certain, but you need to distinguish yourself from the crowd one way or another and this is one way to do it. It's still high risk, but show me something in a research career that is low risk and I'll show you something that isn't worth doing.
What can you do?
If you believe that a move towards more open research practice is a good thing, then what can you do to make it happen? Well, follow what Michael says: give credit to those who share, and explicitly acknowledge the support and ideas you get from others. Ask researchers how they go about ensuring that their research is widely available and, above all, used. The thing is, in the end changing the credit economy itself isn't enough; we actually have to change the culture that underlies that economy. This is hard, but it is done by embedding the issues and assumptions in the everyday discourse about research. "How useable are your research outputs really?" is the question that gets to the heart of the problem. "How easily can people access, re-use, and improve on your research? And how open are you to getting the benefit of other people's contributions?" are the questions that I hope will become embedded in the assumptions around how we do research. You can make that happen by asking them.