What exactly is infrastructure? Seeing the leopard’s spots

Black leopard from Out of Africa Wildlife Park in Arizona (Wikipedia)

Cite as: What exactly is infrastructure? Seeing the leopard’s spots. Geoffrey Bilder, Jennifer Lin, Cameron Neylon. figshare. http://dx.doi.org/10.6084/m9.figshare.1520432

We ducked a fundamental question raised by our proposal for infrastructure principles: “what exactly counts as infrastructure?” This question matters. If our claim is that infrastructures should operate according to a set of principles, we need to be able to identify the thing of which we speak. Call the leopard a leopard. Of course this is not a straightforward question, which is part of the reason we left it untouched in the introductory post. We believe that any definition must entail a much broader discussion from the community. But we wanted to kick this off with a discussion of an important part of the infrastructure puzzle that we think is often missed.

In our conversations with scholars and others in the research ecosystem, people frequently speak of “infrastructures” when what they mean are services on top of deeper infrastructures. Or indeed infrastructures that sit on deeper infrastructures. Most people think of the web as an essential piece of infrastructure, and it is the platform that makes much of what we are talking about possible. But the web is built on deeper layers of infrastructure: the internet, MAC addresses, IP, and TCP/IP. Things that many readers will never even have heard of because they have disappeared from view. Similarly in academia, a researcher will point to “CERN” or “Genbank” or “Pubmed” or “The Perseus Project” when asked about critical infrastructure. The physicists at CERN have long since taken for granted the plumbing, electricity, roads, tracks, airports etc. that make CERN possible.

All these examples involve services operating a layer above the ones we have been considering. Infrastructure is not commonly seen. That lower-level infrastructure has become invisible, just as the underlying network protocols that make Genbank, PubMed and the Perseus Project possible have long since become invisible and are taken for granted. To put a prosaic point on it, what is not commonly seen is also not commonly noticed. And that makes it even more important for us to focus on getting these deeper layers of infrastructure right.

If doing research entails the search for new discoveries, those involved are more inclined to focus on what is different about their research. Every sub-community within academia tends to think at a level of abstraction that is typically one layer above the truly essential – and shared – infrastructure. We hear physicists, chemists, biologists, humanists at meetings and conferences assume that the problems that they are trying to solve in online scholarly communication are specific to their particular discipline. They say “we need to uniquely identify antibodies” or “we need storage for astronomy data” or “we need to know which journals are open access” or “how many times has this article been downloaded”. Then they build the thing that they (think they) need.

This then leads to another layer of invisibility – the infrastructures that we were concerned with in the Principles for Open Scholarly Infrastructure are about what is the same across disciplines, not what is different. It is precisely the fact that these common needs are boring that means they start to disappear from view, in some cases before they even get built. For us it is almost a law: people tend to identify infrastructure one layer too high. We want to refocus attention on the layer below, the one that is disappearing from view. It turns out that a black leopard’s spots can be seen – once they’re viewed under infrared light.

So where does that leave us? What we hear in these conversations across disciplines are not the things that are different (i.e., what is special about antibodies or astronomy data or journal classifications or usage counts) but what is in common across all of these problems. This class of common problems needs shared solutions. “We need identifiers” and “we need storage” and “we need to assign metadata” and “we need to record relationships”. These infrastructures are ones that will allow us to identify objects of interest (e.g., identify new kinds of research objects), store resources where more specialised storage doesn’t already exist (e.g., validate a data analysis pipeline), and record metadata and relationships between resources, objects and ideas (e.g., describing the relationships between funders and datasets). For example, the Directory of Open Access Journals provides identifiers for Open Access journals, claims about such journals, and relationships with other resources and objects (such as article level metadata, like Crossref DOIs and Creative Commons license URLs).

But what has generally happened in the past is that each group re-invents the wheel for its own particular niche. Specialist resources build a whole stack of tools rather than layering the one specific piece that they need on an existing set of infrastructures. There is an important counter-example: the ability to easily cross-reference articles and datasets as well as connect these to the people who created them. This is made possible by Crossref and Datacite DOIs with ORCID IDs. ORCID is an infrastructure that provides identifiers for people as well as metadata and claims about relationships between people and other resources (e.g., articles, funders, and institutions) which are in turn described by identifiers from other infrastructures (Crossref, FundRef, ISNI). The need to identify objects is something that we have recognised as common across the research enterprise. And the common infrastructures that have been built are amongst the most powerful that we have at our disposal. Yet most of us don’t even notice that we are using them.

Infrastructures for identification, storage, metadata and relationships enable scholarship. We need to extend the base platform of identifiers into those new spaces, beyond identification to include storage and references. If we can harness benefits on the same scale as those that have arisen from the provision of identifiers like Crossref DOIs, then the building of new services that are specific to given disciplines will become so much easier. In particular, we need to address the gap in providing a way to describe relationships between objects and resources in general. This base layer may be “boring” and it may be invisible to most researchers. But that’s the way it should be. That’s what makes it infrastructure.

It isn’t what is immediately visible on the surface that makes a leopard a leopard, otherwise the black leopard wouldn’t be one; it is what is buried beneath.


Open is a state of mind

English: William Henry Fox Talbot’s ‘The Open Door’ (Photo credit: Wikipedia)

“Open source” is not a verb

Nathan Yergler via John Wilbanks

I often return to the question of what “Open” means and why it matters. Indeed the very first blog post I wrote focussed on questions of definition. Sometimes I return to it because people disagree with my perspective. Sometimes because someone approaches similar questions in a new or interesting way. But mostly I return to it because of the constant struggle to get across the mindset that it encompasses.

Most recently I addressed the question of what “Open” is about in an online talk I gave for the Futurium Program of the European Commission (video is available). In this I tried to get beyond the definitions of Open Source, Open Data, Open Knowledge, and Open Access to the motivation behind them, something which is both non-obvious and conceptually difficult. All of these various definitions focus on mechanisms – on the means by which you make things open – but not on the motivations behind that. As a result they can often seem arbitrary and rules-focussed, and do become subject to the kind of religious wars that result from disagreements over the application of rules.

In the talk I tried to move beyond that, to describe the motivation and the mind set behind taking an open approach, and to explain why this is so tightly coupled to the rise of the internet in general and the web in particular. Being open as opposed to making open resources (or making resources open) is about embracing a particular form of humility. For the creator it is about embracing the idea that – despite knowing more about what you have done than any other person – the use and application of your work is something that you cannot predict. Similarly for someone working on a project being open is understanding that – despite the fact you know more about the project than anyone else – crucial contributions and insights could come from unknown sources. At one level this is just a numbers game: given enough people it is likely that someone, somewhere, can use your work, or contribute to it in unexpected ways. As a numbers game it is rather depressing on two fronts. First, it feels as though someone out there must be cleverer than you. Second, it doesn’t help because you’ll never find them.

Most of our social behaviour and thinking feels as though it is built around small communities. People prefer to be a (relatively) big fish in a small pond; scholars even take pride in knowing the “six people who care about and understand my work”; the “not invented here” syndrome arises from the assumption that no-one outside the immediate group could possibly understand the intricacies of the local context enough to contribute. It is assumed to be better to build up tools that work locally than to put effort into building a shared community toolset. Above all the effort involved in listening for, and working to understand, outside contributions is assumed to be wasted. There is no point “listening to the public” because they will “just waste my precious time”. We work on the assumption that, even if we accept the idea that there are people out there who could use our work or could help, we can never reach them. That there is no value in expending effort to even try. And we do this for a very good reason: for the majority of people, for the majority of history, it was true.

For most people, for most of history, it was only possible to reach and communicate with small numbers of people. And that means in turn that for most kinds of work, those networks were simply not big enough to connect the creator with the unexpected user, the unexpected helper with the project. The rise of the printing press, and then telegraph, radio, and television changed the odds, but only the very small number of people who had access to these broadcast technologies could ever reach larger numbers. And even they didn’t really have the tools that would let them listen back. What is different today is the scale of the communication network that binds us together. By connecting millions and then billions together, the probability that people who can help each other can be connected has risen to the point that, for many types of problem, they actually are.

That gap between “can” and “are”, the gap between the idea that there is a connection with someone, somewhere, that could be valuable, and actually making the connection, is the practical question that underlies the idea of “open”. How do we make resources discoverable and re-usable so that they can find those unexpected applications? How do we design projects so that outside experts can both discover them and contribute? Many of these movements have focussed on the mechanisms of maximising access, the legal and technical means to maximise re-usability. These are important; they are a necessary but not sufficient condition for making those connections. Making resources open enables re-use, enhances discoverability, and by making things more discoverable and more usable, has the potential to enhance both discovery and usability further. But beyond merely making resources open we also need to be open.

Being open goes in two directions. First we need to be open to unexpected uses. The Open Source community was first to this principle by rejecting the idea that it is appropriate to limit who can use a resource. The principle here is that by being open to any use you maximise the potential for use. Placing limitations always has the potential to block unexpected uses. But the broader open source community has also gone further by exploring and developing mechanisms that support the ability of anyone to contribute to projects. This is why Yergler says “open source” is not a verb. You can license code, you can make it “open”, but that does not create an Open Source Project. You may have a project to create open source code, an “Open-source project”, but that is not necessarily a project that is open, an “Open source-project”. Open Source is not about licensing alone, but about public repositories, version control, documentation, and the creation of viable communities. You don’t just throw the code over the fence and expect a project to magically form around it; you invest in and support community creation with the aim of creating a sustainable project. Successful open source projects put community building and outreach, both reaching contributors and encouraging them, at their centre. The licensing is just an enabler.

In the world of Open Scholarship, and I would include both Open Access and Open Educational Resources in this, we are a long way behind. There are technical and historical reasons for this but I want to suggest that a big part of the issue is one of community. It is in large part about a certain level of arrogance. An assumption that others, outside our small circle of professional peers, cannot possibly either use our work or contribute to it. There is a comfort in this arrogance, because it means we are special, that we uniquely deserve the largesse of the public purse to support our work because others cannot contribute. It means we do not need to worry about access because the small group of people who understand our work “already have access”. Perhaps more importantly it encourages the consideration of fears about what might go wrong with sharing over a balanced assessment of the risks of sharing versus the risks of not sharing: the risks of not finding contributors, of wasting time, of repeating what others already know will fail, or of simply never reaching the audience who can use our work.

It also leads to religious debates about licenses, as though a license were the point or copyright were really a core issue. Licenses are just tools, a way of enabling people to use and re-use content. But the license isn’t what matters. What matters is embracing the idea that someone, somewhere can use your work, that someone, somewhere can contribute back, and adopting the practices and tools that make it as easy as possible for that to happen. And that if we do this collectively, the common resource will benefit us all. This isn’t just true of code, or data, or literature, or science. But the potential for creating critical mass, for achieving these benefits, is vastly greater with digital objects on a global network.

All the core definitions of “open”, from the Open Source Definition, to the Budapest (and Berlin and Bethesda) Declarations on Open Access, to the Open Knowledge Definition, have a common element at their heart – that an open resource is one that any person can use for any purpose. This might be good in itself, but that’s not the real point. The point is that it embraces the humility of not knowing. It says, I will not restrict uses because that damages the potential of my work to reach others who might use it. And in doing this I provide the opportunity for unexpected contributions. With Open Access we’ve only really started to address the first part, but if we embrace the mind set of being open then both follow naturally.


Through a PRISM darkly

I don’t really want to add anything more to what has been said in many places (and has been rounded up well by Bora Zivkovic on Blog Around the Clock, see also Peter Suber for the definitive critique, also updates here and here). However there is a public relations issue here for the open science movement in general that I think hasn’t come up yet.

PRISM is an organisation with a specific message designed by PR people which is essentially that ‘Mandating Open Access for government funded science undermines the traditional model of peer review’. We know this is demonstrably false in respect of both Open Access scientific journals and more generally of making papers from other journals available after a certain delay. It is however conceivable, for someone with a particularly twisted mindset, to construe the actions of some members of the ‘Open Science Community’ as being intended to undermine peer review. We think of providing raw data online or using blogs, Wikis, pre-print archives or whatever other means to discuss science as an exciting way to supplement the peer reviewed literature. PRISM, and other like-minded groups, will attempt to link Open Access and Open Science together so as to represent an attempt by ‘those people’ to undermine peer review.

What is important is control of the language. PRISM has focussed on the term ‘Open Access’. We must draw a sharp distinction between Open Access and ‘Open Science’ (or ‘Open Research’, which may be a better term). The key is that while those of us who believe in Open Research are largely in favour of Open Access literature, publishing in the Open Access literature does not imply any commitment to Open Research. Indeed it doesn’t even imply a commitment to providing the raw data that supports a publication. It is purely and simply a commitment to provide specific peer reviewed research literature in a freely accessible form which can be freely re-used and re-mixed.

We need some simple messages of our own. Here are some suggestions:

‘Open Access literature provides public access to publicly funded research’

‘Publicly supported research should be reported in publicly accessible literature’

‘How many times should a citizen have to pay to see a report on research supported by their tax dollars?’

‘Open Access literature improves the quality of peer review’

The emphasis here is on ‘public’ and ‘literature’ rather than ‘government’ and ‘results’ or ‘science’.

I think there is also a need for some definitions that the ‘Open Research Community’ feels able to sign up to. Jean-Claude Bradley and Bertalan Mesko are running a session in Second Life on Nature Island next Tuesday (1600 UTC) which will include a discussion of definitions (see here for details and again the link to Bill Hooker’s good discussion of terminology). I probably won’t be able to attend but would encourage people to participate in whatever form possible so as to take this forward.

Open (adjective)

Open [oh-puhn ] (adjective) not closed…having no means of closing or barring…relatively free of obstruction…without restrictions as to who may participate…undecided; unsettled… (from Dictionary.com)

There is a great deal of confusion out there as to what ‘Open’ means, especially in science. The definitions above seem particularly apposite: ‘…relatively free of obstruction…’. Certainly undecided or unsettled seems appropriate in some cases. The claims of a journal to be ‘Open Access’ can set off a barrage of comment in the blogosphere. Whether this makes any difference to the journal is unclear but definitions are clearly important. If my aim here is to talk about Open Science then it is sensible to be clear what I mean.

So the following stand as definitions until they need to be changed:

Open Access (of journals, data, or anything else really): Means freely available and accessible to use, re-use, re-distribute, re-mix subject only to a requirement to attribute the work. Essentially as described in the Berlin and Bethesda declarations. Well summarised by Chris Surridge on his blog at PLoS ONE.

Freely accessible: On the web, indexed by search engines, in a useable format, with no requirement to pay for access and no exclusion of any potential users (except perhaps for antisocial behaviour).

Open Notebook Science: This is Jean-Claude Bradley’s term which I think encompasses much of what I am interested in doing and has been pretty clearly defined (see here and here). To summarise, this means that every experiment that is done and every piece of data that is collected is placed online in a freely accessible repository in a timely manner. I would add to this something which I don’t think is explicit in previous definitions but I think is implicit in the way his group works and makes its data available. That is that there must be space for interaction, comments, and questions from the outside world.

Open Science is really too woolly a term to mean anything much but it encompasses the movement that is working towards more of the above throughout the science community. It’s a good phrase: it captures the imagination, is evocative, and memorable. It’s just too big to be pinned down. But it’s a big set of ideas, so let’s see where it leads us.