What exactly is infrastructure? Seeing the leopard’s spots
We ducked a fundamental question raised by our proposal for infrastructure principles: “what exactly counts as infrastructure?” This question matters. If our claim is that infrastructures should operate according to a set of principles, we need to be able to identify the thing of which we speak. Call the leopard a leopard. Of course this is not a straightforward question, which is part of the reason we left it untouched in the introductory post. We believe that any definition must entail a much broader discussion from the community. But we wanted to kick this off with a discussion of an important part of the infrastructure puzzle that we think is often missed.
In our conversations with scholars and others in the research ecosystem, people frequently speak of “infrastructures” when what they mean are services on top of deeper infrastructures. Or indeed infrastructures that sit on deeper infrastructures. Most people think of the web as an essential piece of infrastructure, and it is the platform that makes much of what we are talking about possible. But the web is built on deeper layers of infrastructure: the internet, MAC addresses, IP, and TCP. These are things that many readers will never even have heard of because they have disappeared from view. Similarly in academia, a researcher will point to “CERN” or “GenBank” or “PubMed” or “The Perseus Project” when asked about critical infrastructure. The physicists at CERN have long since taken for granted the plumbing, electricity, roads, tracks, airports etc. that make CERN possible.
All these examples involve services operating a layer above the ones we have been considering. Infrastructure is not commonly seen. That lower-level infrastructure has become invisible, just as the underlying network protocols that make GenBank, PubMed and the Perseus Project possible have long since become invisible and taken for granted. To put a prosaic point on it, what is not commonly seen is also not commonly noticed. And that makes it even more important for us to focus on getting these deeper layers of infrastructure right.
If doing research entails the search for new discoveries, those involved are more inclined to focus on what is different about their research. Every sub-community within academia tends to think at a level of abstraction that is typically one layer above the truly essential – and shared – infrastructure. We hear physicists, chemists, biologists, humanists at meetings and conferences assume that the problems that they are trying to solve in online scholarly communication are specific to their particular discipline. They say “we need to uniquely identify antibodies” or “we need storage for astronomy data” or “we need to know which journals are open access” or “how many times has this article been downloaded”. Then they build the thing that they (think they) need.
This then leads to another layer of invisibility – the infrastructures that we were concerned with in the Principles for Open Scholarly Infrastructure are about what is the same across disciplines, not what is different. It is precisely the fact that these common needs are boring that means they start to disappear from view, in some cases before they even get built. For us it is almost a law: people tend to identify infrastructure one layer too high. We want to refocus attention on the layer below, the one that is disappearing from view. It turns out that a black leopard’s spots can be seen – once they’re viewed under infrared light.
So where does that leave us? What we hear in these conversations across disciplines are not the things that are different (i.e., what is special about antibodies or astronomy data or journal classifications or usage counts) but what is in common across all of these problems. This class of common problems needs shared solutions. “We need identifiers” and “we need storage” and “we need to assign metadata” and “we need to record relationships”. These infrastructures are ones that will allow us to identify objects of interest (e.g., identify new kinds of research objects), store resources where more specialised storage doesn’t already exist (e.g., validate a data analysis pipeline), and record metadata and relationships between resources, objects and ideas (e.g., describing the relationships between funders and datasets). For example, the Directory of Open Access Journals provides identifiers for Open Access journals, claims about such journals, and relationships with other resources and objects (such as article-level metadata, like Crossref DOIs and Creative Commons license URLs).
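To make the idea of a shared layer a little more concrete, here is a minimal sketch in Python of what identifiers, metadata and relationships look like when they are handled once for every kind of object rather than separately per discipline. Everything in it is an assumption made for illustration: the `Registry` class, its method names, the `example` prefix and the placeholder records are all hypothetical, and it is not meant to describe how DOAJ, Crossref or any other real service works.

```python
from dataclasses import dataclass, field


@dataclass
class Registry:
    """Hypothetical shared layer: identifiers, metadata and relationships."""

    prefix: str
    _metadata: dict = field(init=False, default_factory=dict)
    _relations: list = field(init=False, default_factory=list)

    def mint(self, metadata: dict) -> str:
        """Assign a new identifier to any kind of object and store its metadata."""
        identifier = f"{self.prefix}:{len(self._metadata) + 1}"
        self._metadata[identifier] = dict(metadata)
        return identifier

    def describe(self, identifier: str) -> dict:
        """Return a copy of the metadata recorded for an identifier."""
        return dict(self._metadata[identifier])

    def relate(self, subject: str, predicate: str, obj: str) -> None:
        """Record a relationship between two objects that are already identified."""
        for identifier in (subject, obj):
            if identifier not in self._metadata:
                raise KeyError(f"unknown identifier: {identifier}")
        self._relations.append((subject, predicate, obj))

    def related(self, subject: str, predicate: str) -> list:
        """List the objects linked from a subject by a given predicate."""
        return [o for s, p, o in self._relations if s == subject and p == predicate]


# The same three operations serve very different disciplines.
registry = Registry(prefix="example")
antibody = registry.mint({"type": "antibody", "label": "placeholder antibody"})
dataset = registry.mint({"type": "dataset", "label": "placeholder astronomy data"})
funder = registry.mint({"type": "funder", "label": "placeholder funder"})
registry.relate(funder, "funded", dataset)
```

Nothing in the sketch knows what an antibody or an astronomy dataset is. The discipline-specific part lives entirely in the metadata that each community supplies, which is the layering argued for here.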
But what has generally happened in the past is that each group re-invents the wheel for its own particular niche. Specialist resources build a whole stack of tools rather than layering the one specific piece that they need onto an existing set of infrastructures. There is an important counter-example: the ability to easily cross-reference articles and datasets as well as connect these to the people who created them. This is made possible by Crossref and DataCite DOIs with ORCID IDs. ORCID is an infrastructure that provides identifiers for people as well as metadata and claims about relationships between people and other resources (e.g., articles, funders, and institutions) which are in turn described by identifiers from other infrastructures (Crossref, FundRef, ISNI). The need to identify objects is something that we have recognised as common across the research enterprise. And the common infrastructures that have been built are amongst the most powerful that we have at our disposal. Yet most of us don’t even notice that we are using them.
Infrastructures for identification, storage, metadata and relationships enable scholarship. We need to extend the base platform of identifiers into new spaces, going beyond identification to include storage and references. If we can realise benefits on the same scale as those that have arisen from the provision of identifiers like Crossref DOIs, then building new services that are specific to given disciplines will become so much easier. In particular, we need to address the gap in providing a way to describe relationships between objects and resources in general. This base layer may be “boring” and it may be invisible to most researchers. But that’s the way it should be. That’s what makes it infrastructure.
It isn’t what is immediately visible on the surface that makes a leopard a leopard; otherwise the black leopard wouldn’t be one. It is what is buried beneath.