Who’s in the Club? New frames for understanding knowledge sharing

The following is a version of the text I spoke from at the STEPS 2015 Conference, Resource Politics, at a session on Open Science organised by Valeria Arza, where I spoke along with Ross Mounce and Cindy Regalado. This version is modified slightly in response to comments from the audience.

There aren’t too many privileged categories I don’t fall into. White, male, middle class, middle aged, home owner. Perhaps the only claim I could make in the UK context is not having a connection with Oxbridge. The only language I speak is English and I’ve never lived in a non-English-speaking country; never lived outside of Australia or England, in fact. What do I have to say about developing countries? Or transitional, or emerging, or peripheral… all problematic terms rooted in a developed-world, western narrative.

I hope it is more than just a hand-wringing liberal response. I suspect all of us do genuinely base our work on a conviction that we can make a difference for good. In this context we collectively believe that the critical tradition of scholarship developed in Western Europe can bring benefits to disadvantaged groups. And, after all, the resources to actually take action are in our hands, or within our gift to influence. Something must be done. We can do something. Therefore we must do it.

Obviously this is an old critique and one that has shaped decisions about how to act. We seek to move beyond charity and paternalistic intervention to offering frameworks and taking consultative approaches. To requiring a deeper understanding of context. In my own work I’ve tried to focus on offering ideas about processes and implementation, not on what should be done. But my ideas are of course still trapped inside the frameworks I work within.

Central to those frameworks for me is “Open”. I’m not going to seek to explain in detail what I mean by “open”. Nor am I going to provide a critical analysis of the issues it raises in a development context, or its dependence on western liberal democratic, or neoliberal, or even libertarian values. Others are better placed to do that. What I do want to propose is that “Open” in the sense that I mean it is a culture, and it is a culture deeply rooted in its particular (north) western historical context.

If you accept that Open is a culture then our traditional thinking would be that it is the product of a particular community. Or rather communities. Again the history is complex but we can identify a complex of groups, clubs if you like, that have grown up with their own motivations and agendas but have sufficient alignment that we can collect them together under the label of “Open”. “Open Source”, “Open Data”, “Open Access”, “Open Science”, but also perhaps “Open Government”, transparency and others. Often these groups are identified with a charismatic individual.

John Hartley and Jason Potts, in their book Cultural Science, propose a shift in our usual way of thinking about these groups and their cultures that is both subtle and, to my mind, radical. We would usually think of individuals coming together to form groups in a common interest (often framed as a political-economic analysis of the way the collective resources of the group combine to achieve action). The individuals in the group and their values combine to define the culture of the group.

Hartley and Potts invert this. Their claim is that it is culture that creates groups. This inversion, whether you take it at face value as a real description of causation or simply as a useful way to reframe the analysis, has an important consequence. It focuses the unit of analysis onto the group rather than the individual. Rather than asking how individual behaviour leads to the consequences of groups interacting, we ask how cultures do or do not align, reinforce or cancel out.

In the session at the Resource Politics Conference on Tuesday on Assessment of Assessments we heard how governance reflects the structural moves allowed within an institution and how the framing of a problem reflects (or creates) these structures. Martin Mahony spoke of certain framings as “colonising spaces” which I would in turn appropriate as an example of how adaptive cultural elements can be spread through their re-creation or co-creation by groups.

In any case, take your pick: a new model of how culture, groups, society, and the sociotechnical institutions they co-create actually work and evolve, or a different framing that lets us tackle interesting problems from a new perspective. Either way, is it helpful? And what does it have to do with “development” or “sustainability” or “open” for that matter?

I think it’s useful (and relevant) because it lets us take a new view on the status of knowledge created in different contexts and it provokes some new ways of asking what we should do with the resources that we have.

First it gives us a licence to say that some forms of knowledge are simply incompatible, growing as they do out of different cultures. But crucially it requires us to accept that this runs in both directions: there are forms of knowledge from other cultures that are inaccessible to us, but our knowledge is likewise inaccessible to others. It also suggests that some forms of compatibility may be defined through absence, exclusion or antagonism.

An anecdote: A few years ago I was working on a program focussing on Scholarly Communication in Sub-Saharan Africa. One striking finding was the way the communities of scholars, in disciplines that traditionally don’t communicate across groups, were actively engaging on social media platforms. Researchers from Russia, Iran, Iraq, Pakistan, India, Madagascar, Zimbabwe, Brazil, and Chile were all discussing the details of the problems they were facing in synthetic chemistry, a discipline that in my world is almost legendary for its obsessive individualism and lack of sharing. They shared the language of chemistry, and of English as the lingua franca, but they shared that with the absent centre of this geographical circle. Their shared culture was one of exclusion from the North Western centre of their discipline.

And yet, our culture, that of western scholarship was still dominant. I was struck yesterday in the “Assessment of Assessment” session focussing as it did on questions of transparency, engagement, and above all framing, that the framing of the session itself was not interrogated. Why, in an area focussed on building an inclusive consensus, is the mode of communication one of individual experts at the centre (as I am here on the podium) with questions to be asked, when allowed, from the periphery (you in the audience)?

Worse than that, the system these chemists were using, ResearchGate, is a western commercial infrastructure built from the classic Silicon Valley mindset, seeking to monetise, and in many ways to colonise, the interactions of these scholars who define themselves precisely through opposition to much of that culture. Is it possible to build a system that would help this group communicate within the scope of their culture but that doesn’t impose the assumptions of our western culture? What infrastructures might be built that would achieve this and how would they be designed?

For Hartley and Potts the group, as the unit of analysis, is defined by shared culture: cultures support groups, which in turn support the dynamic co- and re-creation of that culture. So another way of approaching this is to view these groups through the lens of the economics of clubs. What makes a club viable and sustainable? What goods does it use to achieve this? This group economics focus is interesting to me because it challenges many of our assumptions about the politics and economics of “Open”.

Rather than adopt a language of nationalisation of private goods: you journal publisher must give up your private property (articles) and transfer them to the public, you researcher must share your data with the world; we ask a different question – what is the club giving up and what are they gaining in return? Knowledge in this model is not a public good, but rather a club good – there is always some exclusion – that we are seeking to make more public through sharing. The political/economic (or advocacy) challenge is how to create an environment that tips the balance for clubs towards knowledge sharing.

These two questions – how can we support peripheral communities to co-create their own cultures without imposing ours and how we might change the economics of knowledge systems to favour investment in sharing – lead for me to an interesting suggestion and a paradox. What enabling infrastructures can be built and how can we make them as neutral and inclusive as possible while simultaneously embracing that anything built with western resources will be framed by our own cultures?

My stance on this is a re-statement of the concept from Zittrain, Benkler, Shirky and others that networks at scale can deliver new kinds of value. That the infrastructures we seek to build can tip the balance towards club investment in sharing if they provide mechanisms for clubs to gain access to networks. This is an architectural principle: that we can take a step up (or down, if you prefer), identifying the common aspects of functionality required. It is also not new.

The new step is to adopt a principle of cultural engagement in governance, a means of – in the language of this conference – aligning the institutions that provide infrastructures, and their governance and structures, with the maximum possible number (and not power, not centrality) of cultures. The criterion we use is one of maximising the number of productive interactions between cultures through the platforms we provide.

And this is what brings us back to Open, to what for me is the core of the philosophy, value system, or culture of Open Practice. Not that sharing outwards to the public is the target in itself, but that it is through sharing that we create new opportunities for interaction. It is being open to contributions, to productive interactions, that in my old world view creates value, and in this new framing promotes the “clashes of culture” that create new knowledge.

Response to the RFI on Public Access to Research Communications

Have you written your response to the OSTP RFIs yet? If not, why not? This is amongst the best opportunities in years to directly tell the U.S. government how important Open Access to scientific publications is and how to start moving to a much more data-centric research process. You’d better believe that the forces of stasis, inertia, and vested interests are getting their responses in. They need to be answered.

I’ve written mine on public access and you can read and comment on it here. I will submit it tomorrow just in front of the deadline but in the meantime any comments are welcome. It expands on and discusses many of the same issues, specifically on re-configuring the debate on access away from IP and towards services, that have been in my recent posts on the Research Works Act.

Science Commons Symposium – Redmond 20th February

One of the great things about being invited to speak that people don’t often emphasise is that it gives you space and time to hear other people speak. And sometimes someone puts together a programme that means you just have to shift the rest of the world around to make sure you can get there. Lisa Green and Hope Leman have put together the biggest concentration of speakers in the Open Science space that I think I have ever seen for the Science Commons Symposium – Pacific Northwest to be held on the Microsoft Campus in Redmond on 20 February. If you are in the Seattle area and have an interest in the future of science, whether pro- or anti- the “open” movement, or just want to hear some great talks you should be there. If you can’t be there then watch out for the video stream.

Along with me you’ll get Jean-Claude Bradley, Antony Williams, Peter Murray-Rust, Heather Joseph, Stephen Friend, Peter Binfield, and John Wilbanks. Everything from policy to publication, software development to bench work, and from capturing the work of a single researcher to the challenges of placing several hundred million dollars’ worth of drug discovery data into the public domain. All with a focus on how we make more science available and generate more, and more innovative, research. Not to be missed, in person or online – and if that sounds too much like self promotion then feel free to miss the first talk… ;-)

Open Research: The personal, the social, and the political

Next Tuesday I’m giving a talk at the Institute for Science Ethics and Innovation in Manchester. This is a departure for me in terms of talk subjects, in as much as it is much more to do with policy and politics. I have struggled quite a bit with it so this is an effort to work it out on “paper”. Warning, it’s rather long. The title of the talk is “Open Research: What can we do? What should we do? And is there any point?”

I’d like to start by explaining where I’m coming from. This involves explaining a bit about me. I live in Bath. I work at the Rutherford Appleton Laboratory, which is near Didcot. I work for STFC but this talk is a personal view so you shouldn’t take any of these views as representing STFC policy. Bath and Didcot are around 60 miles apart so each morning I get up pretty early, I get on a train, then I get on a bus which gets me to work. I work on developing methodology to study complex biological structures. We have a particular interest in trying to improve methods for looking at proteins that live in biological membranes and protein-nucleic acid complexes. I also have done work on protein labelling that lets us make cool stuff and pretty pictures. This work involves an interesting mixture of small scale lab work, work at large facilities on big instruments, often multi-national facilities. It also involves far too much travelling.

A good question to ask at this point is “Why?” Why do I do these things? Why does the government fund me to do them? Actually it’s not so much why the government funds them as why the public does. Why does the taxpayer support our work? Even that’s not really the right question because there is no public. We are the public. We are the taxpayer. So why do we as a community support science and research? Historically science was carried out by people sufficiently wealthy to fund it themselves, or in a small number of cases by people who could find wealthy patrons. After the second world war there was a political and social consensus that science needed to be supported, and that consensus has supported research funding more or less to the present day. But with the war receding in public memory we seem to have retained the need to frame the argument for research funding in terms of conflict or threat. The War on Cancer, the threat of climate change. Worse, we seem to have come to believe our own propaganda, that the only way to justify public research funding is that it will cure this, or save us from that. And the reality is that in most cases we will probably not deliver on this.

These are big issues and I don’t really have answers to a lot of them, but it seems to me that they are important questions to think about. So here are some of my ideas about how to tackle them from a variety of perspectives. First the personal.

A personal perspective on why and how I do research

My belief is we have to start with being honest with ourselves, personally, about why and how we do research. This sounds like some sort of self-help mantra I know but let me explain what I mean. My personal aim is to maximise my positive impact on the world, either through my own work or through enabling the work of others. I didn’t come at this from first principles but it has evolved. I also understand I am personally motivated by recognition and reward and that I am strongly, perhaps too strongly, motivated by others’ opinions of me. My understanding of my own skills and limitations means that I largely focus my research work on methodology development and enabling others. I can potentially have a bigger impact by building systems and capabilities that help others do their research than I can by doing that research myself. I am lucky enough to work in an organization that values that kind of contribution to the research effort.

Because I want my work to be used as far as is possible I make as much as possible of it freely available. Again I am lucky that I live now when the internet makes this kind of publishing possible. We have services that enable us to easily publish ideas, data, media, and process, and I can push a wide variety of objects onto the web for people to use if they so wish. Even better than that, I can work on developing tools and systems that help other people to do this effectively. If I can have a bigger impact by enabling other people’s research then I can multiply that again by helping other people to share that research. But here we start to run into problems. Publishing is easy. But sharing is not so easy. I can push to the web, but is anyone listening? And if they are, can they understand what I am saying?

A social perspective (and the technical issues that go with it)

If I want my publishing to be useful I need to make it available to people in a way they can make use of. We know that networks increase in value as they grow much more than linearly. If I want to maximise my impact, I have to make connections and maximise the ability of other people to make connections. Indeed Merton made the case for this in scientific research 20 years ago.

I propose the seeming paradox that in science, private property is established by having its substance freely given to others who might want to make use of it.
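
The claim above, that networks increase in value much faster than they grow, can be made concrete with a back-of-the-envelope sketch. This is only an illustration (Metcalfe-style counting of potential pairwise connections, not a measure of realised value):

```python
# Toy illustration: with n participants, the number of potential pairwise
# connections (and hence potential productive interactions) grows as
# n*(n-1)/2 -- quadratically, not linearly, in the size of the network.

def potential_connections(n: int) -> int:
    """Number of distinct pairs among n participants."""
    return n * (n - 1) // 2

for n in (10, 100, 1000):
    print(n, potential_connections(n))  # 45, 4950, 499500
```

A tenfold growth in participants yields roughly a hundredfold growth in possible connections, which is the arithmetic behind wanting to maximise others’ ability to connect to my work.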

This is now a social problem but a social problem with a distinct technical edge to it.  Actually we have two related problems. The issue of how I make my work available in a useful form and the separate but related issue of how I persuade others to make their work available for others to use.

The key to making my work useful is interoperability. This is at root a technical issue, but at a purely technical level it is one that has been solved. We can share through agreed data formats and vocabularies. The challenges we face in actually making it happen are less technical problems than social ones, but I will defer those for the moment. We also need legal interoperability. Science Commons amongst others has focused very hard on this question and I don’t want to discuss it in detail here, except to say that I agree with the position that Science Commons takes: that if you want to maximise the ability of others to re-use your work then you must make it available with liberal licences that do not limit fields of use or the choice of licence on derivative works. This means CC-BY, BSD etc., but if you want to be sure then your best choice is explicit dedication to the public domain.
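
As a sketch of what legal interoperability looks like downstream (the record layout and field names here are hypothetical, not any real metadata standard), attaching an explicit machine-readable licence URI to each published object lets aggregators filter automatically for material they may freely re-use:

```python
# Hypothetical dataset records carrying machine-readable licence URIs.
# The record structure is illustrative only; the licence URLs are the
# canonical Creative Commons deed addresses.
LIBERAL_LICENCES = {
    "https://creativecommons.org/licenses/by/4.0/",       # CC-BY
    "https://creativecommons.org/publicdomain/zero/1.0/", # CC0 public domain
}

records = [
    {"title": "Membrane protein spectra",
     "licence": "https://creativecommons.org/publicdomain/zero/1.0/"},
    {"title": "Labelling protocol",
     "licence": "https://creativecommons.org/licenses/by-nc/4.0/"},  # NC: field-of-use limit
]

# A downstream tool can now select, without human judgement, only the
# objects whose licences permit unrestricted re-use.
reusable = [r["title"] for r in records if r["licence"] in LIBERAL_LICENCES]
print(reusable)
```

The point is that the non-commercial clause, however well-intentioned, silently drops the second record out of every automated re-use pipeline.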

But technical and legal interoperability are just subsets of what I think is more important: process interoperability. If the objects we publish are to be useful then they must be able to fit into the processes that researchers actually use. As we move to the question of persuading others to share and build the network this becomes even more important. We are asking people to change the way they do things, to raise their standards perhaps. So we need to make sure that this is as easy as possible and fits into their existing workflows. The problem with understanding how to achieve technical and legal interoperability is that the temptation is to impose it, and I am as guilty of this as anyone. What I’d like to do is use a story from our work to illustrate an approach that I think can help us to make this easier.

Making life easier by capturing process as it happens: Objects first, structure later

Our own work on web based laboratory recording systems, which really originated in the group of Jeremy Frey at Southampton, came out of earlier work on a fully semantic RDF-backed system for recording synthetic chemistry. In contrast we took an almost completely unstructured approach to recording work in a molecular biology laboratory, not because we were clever or knew it would work out, but because it was a contrast to what had gone before. The LaBLog is based on a blog framework and allows the user to put in completely free text, completely arbitrary file attachments, and to organize things in whichever way they like. Obviously a recipe for chaos.

And it was to start with, as we found our way around, but we went through several stages of re-organization and interface design over a period of about 18 months. The key realization we made was that while a lot of what we were doing was difficult to structure in advance, there were elements within it (specific processes, specific types of material) that were consistently repeated, even stereotyped, and that structuring these gave big benefits. We developed a template system that made producing these repeated processes and materials much easier. These templates depended on how we organized our posts, and the metadata that described them, and the metadata in turn was driven by the need for the templates to be effective. A virtuous circle developed around the positive reinforcement that the templates and associated metadata provided. More surprisingly, the structure that evolved out of this in many cases mapped well onto existing ontologies. In specific cases where it didn’t, we could see that either the problem arose from the ontology itself, or the fact that our work simply wasn’t well mapped by that ontology. But the structure arose spontaneously out of a considered attempt to make the user/designer’s life easier. And was then mapped onto the external vocabularies.
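
To illustrate the template idea (a hypothetical sketch only, not the actual LaBLog implementation), a stereotyped procedure can be captured once as a fill-in template, and the values used to fill it double as the metadata that drives later aggregation:

```python
import string

# Hypothetical template for one stereotyped procedure. The procedure name
# and fields are invented for illustration.
PCR_TEMPLATE = string.Template(
    "Ran PCR on sample $sample with primers $primers for $cycles cycles."
)

def make_post(sample: str, primers: str, cycles: int) -> dict:
    """Fill the template and attach the metadata that the template implies."""
    return {
        "body": PCR_TEMPLATE.substitute(
            sample=sample, primers=primers, cycles=cycles),
        # The same values become queryable metadata: the virtuous circle
        # between templates and metadata described above.
        "metadata": {"procedure": "PCR", "sample": sample},
    }

post = make_post("S1", "F3/R3", 30)
print(post["body"])
```

Free-text posts remain possible alongside this; structure is only applied where it saves the user effort, which is the point of the story.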

I don’t want to suggest that our particular implementation is perfect. It is far from it, with gaping holes in the usability and our ability to actually exploit the structure that has developed. But I think the general point is useful. For the average scientist to be willing to publish more of their research, that process has to be made easy and it has to recognise the inherently unstructured nature of most research. We need to apply structured descriptions where they make the user’s life easier but allow unstructured or semi-structured representations elsewhere. But we need to build tools that make it easy to take those unstructured or semi-structured records and mold them into a specific structured narrative as part of a reporting process that the researcher has to do anyway. Writing a report, writing a paper. These things need to be done anyway, and if we could build tools so that the easiest way to write the report or paper is to bring elements of the original record together and push those onto the web in agreed formats, through easy to use filters and aggregators, then we will have taken an enormous leap forward.

Once you’ve insinuated these systems into the researcher’s process then we can start talking about making that process better. But until then technical and legal interoperability are not enough – we need to interoperate with existing processes as well. If we could achieve this then much more research material would flow online, connections would be formed around those materials, and the network would build.

And finally – the political

This is all very well. With good tools and good process I can make it easier for people to use what I publish and I can make it easier for others to publish. This is great but it won’t make others want to publish. I believe that more rapid publication of research is a good thing. But if we are to have a rational discussion about whether this is true we need to have agreed goals. And that moves the discussion into the political sphere.

I asked earlier why it is that we do science as a society, why we fund it. As a research community I feel we have no coherent answer to these questions.  I also talked about being honest to ourselves. We should be honest with other researchers about what motivates us, why we choose to do what we do, and how we choose to divide limited resources. And as recipients of taxpayers money we need to be clear with government and the wider community about what we can achieve. We also have an obligation to optimize the use of the money we spend. And to optimize the effective use of the outputs derived from that money.

We need at core a much more sophisticated conversation with the wider community about the benefits that research brings: to the economy, to health, to the environment, to education. And we need a much more rational conversation within the research community as to how those different forms of impact are and should be tensioned against each other. We need, in short, a complete overhaul if not a replacement of the post-war consensus on public funding of research. My fear is that without this the current funding squeeze will turn into a long term decline. And that without some serious self-examination the current self-indulgent bleating of the research community is unlikely to increase popular support for public research funding.

There are no simple answers to this but it seems clear to me that at a minimum we need to be demonstrating that we are serious about maximising the efficiency with which we spend public money. That means making sure that research outputs can be re-used, that wheels don’t need to be re-invented, and that innovation flows easily from the academic lab into the commercial arena. And it means distinguishing between the effective use of public money to address market failures and subsidising UK companies that are failing to make effective investments in research and development.

The capital generated by science is in ideas, capability, and people. You maximise the effective use of capital by making it easy to move, by reducing barriers to trade. In science we can achieve this by maximising the ability to transfer research outputs. If we are to be taken seriously as guardians of public money, and to be seen as worthy of that responsibility, our systems need to make ideas, data, methodology, and materials flow easily. That means making our data, our process, and our materials freely available and interoperable. That means open research.

We need a much greater engagement with the wider community on how science works and what science can do. The web provides an immense opportunity to engage the public in active research as demonstrated by efforts as diverse as Galaxy Zoo with 250,000 contributors and millions of galaxy classifications and the Open Dinosaur Project with people reading online papers and adding the measurements of thigh bones to an online spreadsheet. Without the publicly available Sloan Digital Sky Survey, without access to the paleontology papers, and without the tools to put the collected data online and share them these people, this “public”, would be far less engaged. That means open research.

And finally we need to turn the tools of our research on ourselves. We need to critically analyse our own systems and processes for distributing resources, for communicating results, and for apportioning credit. We need to judge them against the value for money they offer to the taxpayer and where they are found wanting we need to adjust. In the modern networked world we need to do this in a transparent and honest manner. That means open research.

But even if we agree these things are necessary, or a general good, they are just policy. We already have policies which are largely ignored. Even when obliged to by journal publication policies or funder conditions, researchers avoid, obfuscate, and block attempts to gain access to data, materials, and methodology. Researchers are humans too, with the same needs to get ahead and to be recognized as anyone else. We need to find a way to map those personal needs, and those personal goals, onto the community’s need for more openness in research. As with the tooling, we need to “bake in” the openness to our processes to make it the easiest way to get ahead. Policy can help with cultural change but we need an environment in which open research is the simplest and easiest approach to take. This is interoperability again, but in this case the policy and process has to interoperate with the real world. Something that is often a bit of a problem.

So in conclusion…

I started with a title I’ve barely touched on. But I hope with some of the ideas I’ve explored we are in a position to answer the questions I posed. What can we do in terms of Open Research? The web makes it technically possible for us to share data, process, and records in real time. It makes it easier for us to share materials, though I haven’t really touched on that. We have the technical ability to make that data useful through shared data formats and vocabularies. Many of the details are technically and socially challenging but we can share pretty much anything we choose to on a wide variety of timeframes.

What should we do? We should make that choice easier through the development of tools and interfaces that recognize that it is usually humans doing and recording the research, and that exploit the ability of machines to structure that record as the work is being done. These tools need to exploit structure where it is appropriate and allow freedom where it is not. We need tools to help us map our records onto structures as we decide how we want to present them. Most importantly we need to develop structures of resource distribution, communication, and recognition that encourage openness by making it the easiest approach to take. Encouragement may be all that’s required. The lesson from the web is that once network effects take hold they can take care of the rest.

But is there any point? Is all of this worth the effort? My answer, of course, is an unequivocal yes. More open research will be more effective, more efficient, and provide better value for the taxpayer’s money. But more importantly I believe it is the only credible way to negotiate a new consensus on the public funding of research. We need an honest conversation with government and the wider community about why research is valuable, what the outcomes are, and how they contribute to our society. We can’t do that if the majority cannot even see those outcomes. The wider community is more sophisticated than we give it credit for. And in many ways the research community is less sophisticated than we think. We are all “the public”. If we don’t trust the public to understand why and how we do research, if we don’t trust ourselves to communicate the excitement and importance of our work effectively, then I don’t see why we deserve to be trusted to spend that money.

Replication, reproduction, confirmation. What is the optimal mix?

Issues surrounding the relationship of Open Research and replication seem to be the meme of the week. Abhishek Tiwari provided notes on a debate describing concerns about how open research could damage replication, and Sabine Hossenfelder explored the same issue in a blog post. The concern fundamentally is that by providing more of the details of our research we may actually be damaging the research effort by reducing the motivation to reproduce published findings or worse, as Sabine suggests, encouraging group think and a lack of creative questioning.

I have to admit that even at a naive level I find this argument peculiar. There is no question that in aiming to reproduce or confirm experimental findings it may be helpful to carry out that process in isolation, or with some portion of the available information withheld. This can obviously increase the quality and power of the confirmation, making it more general. Indeed the question of how and when to do this most effectively is very interesting and bears some thought. The optimization of these decisions in specific cases will be an important part of improving research quality. What I find peculiar is the apparent belief in many quarters (but not necessarily Sabine, who I think has a much more sophisticated view) that this optimization is best encouraged by not bothering to make information available. We can always choose not to access information if it is available, but if it is not, we cannot choose to look at it. Indeed, to allow the optimization of the confirmation process it is crucial that we could have access to the information if we so decided.

But I think there is a deeper problem than the optimization issue. I think that the argument also involves two category errors. Firstly we need to distinguish between different types of confirmation. There is pure mechanical replication, perhaps just to improve statistical power or to re-use a technique for a different experiment. In this case you want as much detail as possible about how the original process was carried out because there is no point in changing the details. The whole point of the exercise is to keep things as similar as possible. I would suggest the use of the term “reproduction” to mean a slightly different case. Here the process or experiment “looks the same” and is intended to produce the same result, but the details of the process are not tightly controlled. The purpose of the exercise is to determine how robust the result or output is to modified conditions. Here withholding, or not using, some information could be very useful. Finally there is the process of actually doing something quite different with the intention of testing an idea or a claim from a different direction, with an entirely different experiment or process. I would refer to this as “confirmation”. The concerns of those arguing against providing detailed information lie primarily with confirmation, but the data and process sharing we are talking about relates more to replication and reproduction. The main efficiency gains lie in simply re-using shared processes to get down a scientific path more rapidly, rather than in situations where the process itself is the subject of the scientific investigation.

The second category error is somewhat related, in as much as the concerns around “group-think” refer to claims and ideas, whereas the objects whose sharing we are trying to encourage when we talk about open research are more likely to be tools and data. Again, it seems peculiar to argue that the process of thinking independently about research claims is aided by reducing the amount of data available. There is a more subtle argument that Sabine is making, and possibly Harry Collins would make a similar one, that the expression of tools and data may be inseparable from the ideas that drove their creation and collection. I would still argue however that it is better to actively choose to omit information from creative or critical thinking rather than be forced to work in the dark. I agree that we may need to think carefully about how we can effectively do this and I think that would be an interesting discussion to have with people like Harry.

But the argument that we shouldn’t share because it makes life “too easy” seems dangerous to me. Taking that argument to its extreme we should remove the methods section from papers altogether. In many cases it feels like we already have and I have to say that in day to day research that certainly doesn’t feel helpful.

Sabine also makes a good point, one that Michael Nielsen has also made from time to time, that these discussions are very focussed on experimental and specifically hypothesis driven research. It bears some thinking about but I don’t really know enough about theoretical research to have anything useful to add. But it is the reason that some of the language in this post may seem a bit tortured.

Talking to the next generation – NESTA Crucible Workshop

Yesterday I was privileged to be invited to give a talk at the NESTA Crucible Workshop being held in Lancaster. You can find the slides on slideshare. NESTA, the National Endowment for Science, Technology, and the Arts, is an interesting organization funded via a UK government endowment to support innovation and enterprise, and more particularly the generation of a more innovative and entrepreneurial culture in the UK. Among the programmes it runs in pursuit of this is the Crucible programme, where a small group of young researchers, generally looking for or just in their first permanent or independent positions, attend a series of workshops to get them thinking broadly about the role of their research in the wider world and to help them build new networks for support and collaboration.

My job was to talk about “Science in Society” or “Open Science”. My main theme was the question of how we justify taxpayer expenditure on research; to me this implies an obligation to maximise the efficiency of how we do our research. Research is worth doing, but we need to think hard about how and what we do. Not surprisingly I focussed on the potential of using web based tools and open approaches to make things happen more cheaply, quickly, and effectively. To reduce waste and try to maximise the amount of research output for the money spent.

Also not surprisingly there was significant pushback – much of it where you would expect. Concerns over data theft, over how “non-traditional” contributions might appear (or not) on a CV, and over the costs in time were all mentioned. However what surprised me most was the pushback against the idea of putting material on the open web versus traditional journal formats. There was a real sense that the group had a respect for the authority of the printed, versus online, word, which really caught me out. I often use a gotcha moment in talks to try and illustrate how our knowledge framework is changed by the web. It goes “how many people have opened a physical book for information in the last five years?”, followed by “and how many haven’t used Google in the last 24 hours?”. This is shamelessly stolen from Jamie Boyle incidentally.

Usually you get three or four sheepish hands going up admitting a personal love of real physical books. Generally it is around 5-10% of the audience, and this has been pretty consistent amongst mid-career scientists in both academia and industry, and people in publishing. In this audience about 75% put their hands up. Some of these were specialist “tool” books of mathematical forms and algorithmic recipes, many were specialist texts, and many referred to the use of undergraduate textbooks. Interestingly they also brought up an issue that I’ve never had an audience bring up before: how do you find a good route into a new subject area that you know little about, but that you can trust?

My suspicion is that this difference comes from three places. Firstly, these researchers were already biased towards being less discipline bound by the fact that they’d applied for the workshop. They were therefore more likely to be discipline hoppers, jumping into new fields where they had little experience and needed a route in. Secondly, they were at a stage of their career where they were starting to teach, again possibly slightly outside their core expertise, and therefore looking for good, reliable material to base their teaching on. Finally, though, there was a strong sense of respect for the authority of the printed word. The printing of the German Wikipedia was brought up as evidence that printed matter was, at least perceived to be, more trustworthy. Writing this now I am reminded of the recent discussion on the hold that the PDF has over the imagination of researchers. There is a real sense that print remains authoritative in a way that online material is not. Even though the journal may never be printed, the PDF provides the impression that it could or should be. I would guess also that the group were young enough to be slightly less cynical about authority in general.

Food for thought, but it was certainly a lively discussion. We actually had to be dragged off to lunch because it went way over time (and not I hope just because I had too many slides!). Thanks to all involved in the workshop for such an interesting discussion and thanks also to the twitter people who replied to my request for 140 character messages. They made a great way of structuring the talk.

Open Data, Open Source, Open Process: Open Research

There has been a lot of recent discussion about the relative importance of Open Source and Open Data (Friendfeed, Egon Willighagen, Ian Davis). I don’t fancy recapitulating the whole argument but following a discussion on Twitter with Glyn Moody this morning [1, 2, 3, 4, 5, 6, 7, 8] I think there is a way of looking at this with a slightly different perspective. But first a short digression.

I attended a workshop late last year on Open Science run by the Open Knowledge Foundation. I spent a significant part of the time arguing with Rufus Pollock about data licences, an argument that is still going on. One of Rufus’ challenges to me was to commit to working towards using only Open Source software. His argument was that there weren’t really any excuses any more. Open Office could do the job of MS Office, Python with SciPy was up to the same level as MatLab, and anything specialist needed to be written anyway so should be open source from the off.

I took this to heart and I have tried, I really have tried. I needed a new computer and, although I got a Mac (not really ready for Linux yet), I loaded it up with Open Office, I haven’t yet put my favourite data analysis package on the computer (Igor if you must know), and have been working in Python to try to get some stuff up to speed. But I have to ask whether this is the best use of my time. As is often the case with my arguments this is a return on investment question. I am paid by the taxpayer to do a job. At what point does the extra effort I am putting into learning to use, or in some cases fight with, new tools cost more than the benefit gained by making my outputs freely available?

Sometimes the problems are imposed from outside. I spent a good part of yesterday battling with an appalling, password protected, macroed-to-the-eyeballs Excel document that was the required format for me to fill in a form for an application. The file crashed Open Office and only barely functioned in Mac Excel at all. Yet it was required, in that format, before I could complete the application. Sometimes the software is just not up to scratch. Open Office Writer is fine, but the presentation and spreadsheet modules are, to be honest, a bit ropey compared to the commercial competitors. And with a Mac I now have Keynote, which is just so vastly superior that I have now transferred wholesale to that. And sometimes it is just a question of time. Is it really worth me learning Python to do data analysis that I could knock out in Igor in a tenth of the time?

In this case the answer is, probably yes. Because it means I can do more with it. There is the potential to build something that logs the process the way I want, the potential to convert it to run as a web service. I could do these things with other OSS projects as well in a way that I can’t with a closed product. And even better, because there is a big open community I can ask for help when I run into problems.
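To make the point concrete, the kind of analysis at stake here is often nothing more exotic than fitting a line to some data. A minimal sketch of what that looks like in Python with SciPy, using made-up example numbers (the data and parameter names are purely illustrative, not from any real experiment):

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical example data: a noisy, roughly linear series of measurements
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([0.1, 2.1, 3.9, 6.2, 7.8, 10.1])

def line(x, slope, intercept):
    """Straight-line model to fit against the data."""
    return slope * x + intercept

# Fit the model; reporting the fitted parameters alongside the raw data
# is what lets someone else reproduce or re-implement the analysis
params, covariance = curve_fit(line, x, y)
slope, intercept = params
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Because the whole pipeline is a short script in an open language, it can be logged, version controlled, or wrapped as a web service in a way a spreadsheet fit cannot.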

It is easy to lose sight of the fact that for most researchers software is a means to an end. For the Open Researcher what is important is the ability to reproduce results, to criticize and to examine. Ideally this would include every step of the process, including the software. But for most issues you don’t need, or even want, to be replicating the work right down to the metal. You wouldn’t, after all, expect a researcher to be forced to run their software on an open source computer, with an open source chipset. You aren’t necessarily worried what operating system they are running. What you are worried about is whether it is possible to read their data files and reproduce their analysis. If I take this just one step further, it doesn’t matter if the analysis is done in MatLab or Excel, as long as the files are readable in Open Office and the analysis is described in sufficient detail that it can be reproduced or re-implemented.

Let’s be clear about this: it would be better if the analysis were done in an OSS environment. If you have the option to work in an OSS environment you can also save yourself time and effort in describing the process, and others have a much better chance of identifying the sources of problems. It is not good enough to just generate an Excel file, you have to generate an Excel file that is readable by other software (and here I am looking at the increasing number of instrument manufacturers providing software that generates so called Excel files that often aren’t even readable in Excel). In many cases it might be easier to work with OSS so as to make it easier to generate an appropriate file. But there is another important point; if OSS generates a file type that is undocumented or worse, obfuscated, then that is also unacceptable.
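In practice, “readable by other software” often just means preferring a plain, documented format over an opaque binary one. A hedged sketch of what that looks like, writing hypothetical measurements to CSV with the column names and units stated in the header (the filename and data are made up for illustration):

```python
import csv

# Hypothetical measurements; column names carry the units so the
# file is self-describing without any accompanying software
rows = [
    ("time_s", "absorbance_au"),
    (0.0, 0.012),
    (30.0, 0.145),
    (60.0, 0.283),
]

# Plain CSV opens in Excel, Open Office, and any scripting language,
# unlike an undocumented or obfuscated vendor format
with open("measurements.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)
```

The point is not the specific format but that anyone, with any toolchain, can get the numbers back out.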

Open Data is crucial to Open Research. If we don’t have the data we have nothing to discuss. Open Process is crucial to Open Research. If we don’t understand how something has been produced, or we can’t reproduce it, then it is worthless. Open Source is not necessary, but, if it is done properly, it can come close to being sufficient to satisfy the other two requirements. However it can’t do that without Open Standards supporting it for documenting both file types and the software that uses them.

The point that came out of the conversation with Glyn Moody for me was that it may be more productive to focus on our ability to re-implement rather than to simply replicate. Re-implementability, while an awful word, is closer to what we mean by replication in the experimental world anyway. Open Source is probably the best way to do this in the long term, and in a perfect world the software and support would be there to make this possible, but until we get there, for many researchers, it is a better use of their time, and the taxpayer’s money that pays for that time, to do that line fitting in Excel. And the damage is minimal as long as source data and parameters for the fit are made public. If we push forward on all three fronts, Open Data, Open Process, and Open Source then I think we will get there eventually because it is a more effective way of doing research, but in the meantime, sometimes, in the bigger picture, I think a shortcut should be acceptable.

Licenses and protocols for Open Science – the debate continues

This is an important discussion that has been going on in disparate places, but primarily at the moment is on the Open Science mailing list maintained by the OKF (see here for an archive of the relevant thread). To try and keep things together and because Yishay Mor asked, I thought I would try to summarize the current state of the debate.

The key aim here is to find a form of practice that will enhance data availability, and protect it into the future.

There is general agreement that there is a need for some sort of declaration associated with making data available. Clarity is important, and the minimum here would be a clear statement of intention. Where there is disagreement is over what form this should take. Rufus Pollock started by giving the reasons why this should be a formal license. Rufus believes that a license provides certainty and clarity in a way that a protocol, statement of principles, or expression of community standards cannot. I, along with Bill Hooker and John Wilbanks [links are to posts on mailing list], expressed a concern that actually the use of legal language, and the notion of “ownership” of this by lawyers rather than scientists, would have profound negative results. Andy Powell points out that this did not seem to occur either in the Open Source movement or with much of the open content community. But I believe he also hits the nail on the head with the possible reason:

I suppose the difference is that software space was already burdened with heavily protective licences and that the introduction of open licences was perceived as a step in the right direction, at least by those who like that kind of thing.         

Scientific data has a history of being assumed to be in the public domain (see the lack of any license at PDB or Genbank or most other databases), so there isn’t the same sense of pushing back from an existing strong IP or licensing regime. However I think there is broad agreement that this protocol or statement would look a lot like a license and would aim to have the legal effect of at least providing clarity over the rights of users to copy, re-purpose, and fork the objects in question.

Michael Nielsen and John Wilbanks have expressed a concern about the potential for license proliferation and incompatibility. Michael cites the example of Apache, Mozilla, and GPL2 licenses. This feeds into the issue of the acceptability, or desirability of share-alike provisions which is an area of significant division. Heather Morrison raises the issue of dealing with commercial entities who may take data and use technical means to effectively take it out of the public domain, citing the takeover of OAIster by OCLC as a potential example.

This is a real area of contention I think because some of us (including me) would see this in quite a positive light (data being used effectively in a commercial setting is better than it not being used at all) as long as the data is still both legally and technically in the public domain. Indeed this is at the core of the power of a public domain declaration. The issue of finding the resources that support the preservation of research objects in the (accessible) public domain is a separate one but in my view if we don’t embrace the idea that money can and should be made off data placed in the public domain then we are going to be in big trouble sooner or later because the money will simply run out.

On the flip side of the argument is a strong tradition of arguing that viral licensing and share alike provisions protect the rights and personal investment of individuals and small players against larger commercial entities. Many of the people who support open data belong to this tradition, often for very good historical reasons. I personally don’t disagree with the argument on a logical level, but I think for scientific data we need to provide clear paths for commercial exploitation, because using science to do useful things costs a lot of money. If you want people to invest in using the outputs of publicly funded research, you need to provide them with the certainty that they can legitimately use that data within their current business practice. I think it is also clear that those of us who take this line need to come up with a clear and convincing way of expressing this argument, because it is at the centre of the objection to “protection” via licenses and share alike provisions.

Finally Yishay brings us back to the main point. Something to keep focussed on:

I may be off the mark, but I would argue that there’s a general principle to consider here. I hold that any data collected by public money should be made freely available to the public, for any use that contributes to the public good. Strikes me as a no-brainer, but of course – we have a long way to go. If we accept this principle, the licensing follows.         

Obviously I don’t agree with the last sentence – I would say that dedication to the public domain follows – but the principle I think is something we can agree that we are aiming for.

Very final countdown to Science Online 09

I should be putting something together for the actual sessions I am notionally involved in helping to run, but this being a very interactive meeting perhaps it is better to leave things to the very last minute. Currently I am at a hotel at LAX awaiting an early flight tomorrow morning. Daily temperatures in the LA area have been running around 25-30 C for the past few days, but we’ve been threatened with the potential for well below zero in Chapel Hill. Nonetheless the programme and the people will more than make up for it, I have no doubt. I got to participate in a bit of the meeting last year via streaming video and that was pretty good but a little limited – not least because I couldn’t really afford to stay up all night, unlike some people who were far more dedicated.

This year I am involved in three sessions (one on Blog Networks, one on Open Notebook Science, and one on Social Networks for Scientists – yes those three are back to back…) and we will be aiming to be video casting, live blogging, posting slides, images, and comments; the whole deal. If you’ve got opinions then leave them at the various wiki pages (via the programme) or bring them along to the sessions. We are definitely looking for lively discussion. Two of these are being organised with the inimitable Deepak Singh who I am very much looking forward to finally meeting in person – along with many others I feel I know quite well but have never met – and others I have met and look forward to catching up with including Jean-Claude who has instigated the Open Notebook session.

With luck I will get to the dinner tomorrow night so hope to see some people there. Otherwise I hope to see many in person or online over the weekend. Thanks to Bora, Anton, and David for superb organisation (and not a little pestering to make sure I decided to come!)