Where are the pipes? Building Foundational Infrastructures for Future Services

Cite as “Bilder G, Lin J, Neylon C (2016) Where are the pipes? Building Foundational Infrastructures for Future Services, retrieved [date], https://cameronneylon.net/blog/where-are-the-pipes-building-foundational-infrastructures-for-future-services/”
You probably don’t think too much about where all the services to your residence run. They stay out of view until something goes wrong. But how do we maintain them unless they are identified? An entire utilities industry, which must search for utility infrastructure, depends on this knowledge. There’s even an annual competition, a rodeo no less, to crown the best infrastructure locators in the land, rewarding those who excel at re-discovering where lost pipes and conduits run.

It’s all too easy to forget where the infrastructure lies when it’s running well. And then all too expensive if you have to call in someone to find it again. Given that preservation and record keeping lie at the heart of research communications, we’d like to think we could do a better job. Good community governance makes it clear who is responsible for remembering where the pipes run as well as keeping them “up to code”. Turns out that keeping the lights on and the taps running involves the greater We (i.e., all of us).

Almost a year ago, we proposed a set of principles for open scholarly infrastructure based on discussions over the past decade or so. Our intention was to generate conversation amongst funders, infrastructure players, tool builders, all those who identify as actors in the scholarly ecosystem. We also sought test cases for these principles, real world examples that might serve as reference models and/or refine the set of principles.

One common question emerged from the ensuing discussions in person and online, and we realized we had ducked a fundamental question: “what exactly is infrastructure?” In our conversations with scholars and others in the research ecosystem, people frequently speak of “infrastructures” to reference services or tools that support research and scholarly communications. The term is gaining currency as more scholars and developers are turning their attention to a) the need for improved research tools and platforms that can be openly shared and adopted and b) the sense that some solutions are interoperable and more efficiently implemented across communities (for example, see this post from Bill Mills).

It is exciting for us that the principles might have a broader application and we are more than happy to talk with groups and organisations that are interested in how the principles might apply in these settings. However, our interest was always in going deeper. The kinds of “infrastructures” – we would probably say “services” – that Bill is talking about rely on deeper layers. Our follow-up blog post was aimed at addressing the question, but it ended with a koan:

“It isn’t what is immediately visible on the surface that makes a leopard a leopard, otherwise the black leopard wouldn’t be, it is what is buried beneath.”

Does making the invisible visible require an infrared camera trap or infrastructure rodeo competitions? Thankfully, no. But we do continue to see the need to shine a light on those deeper underlying layers to reveal the leopard’s spots. This is the place where we feel the most attention is needed, precisely because these layers are invisible and largely ignored. We have started to use the term “foundational infrastructure” to distinguish these deeper layers, which for us need to be supported by organisations with strong principles.

The important layers applicable to foundational infrastructure seem to be:

  • Storage: places to put stuff that is generated by research, including possibly physical stuff
  • Identifiers: means of uniquely identifying each of the – sufficiently important – objects the research process created
  • Metadata: information about each of these objects
  • Assertions: relationships between the objects, structured as assertions that link identifiers
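As a rough illustration of how these four layers fit together, here is a minimal in-memory sketch. Everything in it is hypothetical and for illustration only: the `Registry` and `Assertion` names, the `obj:N` identifier format and the `cites` predicate are assumptions, not the design of any real infrastructure.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Assertion:
    """A relationship between two identified objects (hypothetical shape)."""
    subject_id: str
    predicate: str
    object_id: str


@dataclass
class Registry:
    """Toy in-memory stand-in for the four layers; not a real service."""
    storage: dict = field(default_factory=dict)     # identifier -> stored content
    metadata: dict = field(default_factory=dict)    # identifier -> descriptive fields
    assertions: list = field(default_factory=list)  # links between identifiers
    _counter: int = 0

    def deposit(self, content, **meta):
        """Store content, mint an identifier for it, and record its metadata."""
        self._counter += 1
        identifier = f"obj:{self._counter}"  # made-up identifier scheme
        self.storage[identifier] = content
        self.metadata[identifier] = dict(meta)
        return identifier

    def assert_link(self, subject_id, predicate, object_id):
        """Record an assertion, but only between identifiers that exist."""
        for identifier in (subject_id, object_id):
            if identifier not in self.storage:
                raise KeyError(f"unknown identifier: {identifier}")
        assertion = Assertion(subject_id, predicate, object_id)
        self.assertions.append(assertion)
        return assertion

    def related(self, identifier):
        """Return every assertion that mentions the given identifier."""
        return [a for a in self.assertions
                if identifier in (a.subject_id, a.object_id)]


registry = Registry()
dataset = registry.deposit(b"raw,data\n1,2\n", title="Example dataset")
article = registry.deposit("Article text", title="Example article")
registry.assert_link(article, "cites", dataset)
```

The point of the sketch is that assertions are only as useful as the identifiers they link: the relationship layer depends on the layers beneath it, which is why those layers need to be shared and trusted.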

These requirements nicely spell out SIMA, a term that in geology refers to the layer of crust that sits below both the ocean and continental crusts, satisfying both the idea of a foundational layer and also one that is global. Examples of organizations that might fall under this rubric include:

  • ISNI
  • ORCID
  • arXiv
  • CHORUS
  • DSpace
  • Worldwide Protein Data Bank (PDB)

This is just a beginning list and some might object to the inclusion of one or more on the list. Certainly many will object to some that are missing and certainly none would fully qualify as compliant with the principles. Some have a disciplinary focus, some are at least perceived to be controlled by or serving specific stakeholder interests rather than the community as a whole. That “perceived to be” can be as important as real issues of control. If an infrastructure is truly foundational it needs to be trusted by the whole community. That trust is the foundation in a very real sense, the core on which the SIMA infrastructures should sit.

We originally avoided a list of names as we didn’t want to give the impression of criticising specific organisations against a set of principles we still think of as being at a draft stage. Now we name these examples because we’d like to elevate the conversation on the importance of these foundational infrastructures and the organisations that support them. Some examples may never fit the community principles. Some might with changes in governance or focus. The latter are of particular interest: what would those changes look like and how can we identify those infrastructures which are truly foundational? Institutional support for these organisations in the long run is a critical community discussion.

At the moment, numerous cross-stakeholder initiatives for new services are being developed, and in many cases being hampered by the lack of shared, reliable, and trusted foundations of SIMA infrastructures. Where these infrastructures do exist, initiatives take hold and thrive. Where they are patchy, we struggle. At the same time these issues are starting to be recognised more widely, for instance in the technology world where Nadia Eghbal recently left a venture capital firm to investigate important projects invisible to VCs. She identified Open Source Infrastructure as a class of foundational projects “which tech simply cannot do without” but which are generally without any paths to funding.

Identifying foundational infrastructure – an institutional construct – is neither an art nor a science. There’s no point in drawing lines in the sand for its own sake. This question is more a means to a greater end if we take a holistic point of view: how do we create a healthy, robust ecosystem of players that support and enable scholarly communications? We know all of these layers are necessary, each playing a different role and serving distinct communities. They each have different paths to being set up and sustained. It is precisely the fact that these common needs are boring that means they start to disappear from view, in some cases before they even get built. Understanding the distinctions between these two layers will help us better support both of them.

The other option is to forget where the pipes are, and to have to call in someone to find them again.

Over, and over, and over again…

Credits:

  • cowboy hat by Lloyd Humphreys from the Noun Project, used here under a CC-BY license
  • tower by retinaicon from the Noun Project, used here under a CC-BY license

The authors are writing in a personal capacity. None of the above should be taken as the view or position of any of our respective employers or other organisations.

What exactly is infrastructure? Seeing the leopard’s spots

Black leopard from Out of Africa Wildlife Park in Arizona (Wikipedia)

Cite as: What exactly is infrastructure? Seeing the leopard’s spots. Geoffrey Bilder, Jennifer Lin, Cameron Neylon. figshare. http://dx.doi.org/10.6084/m9.figshare.1520432

We ducked a fundamental question raised by our proposal for infrastructure principles: “what exactly counts as infrastructure?” This question matters. If our claim is that infrastructures should operate according to a set of principles, we need to be able to identify the thing of which we speak. Call the leopard a leopard. Of course this is not a straightforward question, which is part of the reason for leaving it untouched in the introductory post. We believe that any definition must entail a much broader discussion from the community. But we wanted to kick this off with a discussion of an important part of the infrastructure puzzle that we think is often missed.

In our conversations with scholars and others in the research ecosystem, people frequently speak of “infrastructures” when what they mean are services on top of deeper infrastructures. Or indeed infrastructures that sit on deeper infrastructures. Most people think of the web as an essential piece of infrastructure and it is the platform that makes much of what we are talking about possible. But the web is built on deeper layers of infrastructure: the internet, MAC addresses, IP, and TCP/IP. Things that many readers will never even have heard of because they have disappeared from view. Similarly in academia, a researcher will point to “CERN” or “Genbank” or “PubMed” or “The Perseus Project” when asked about critical infrastructure. The physicists at CERN have long since taken the plumbing, electricity, roads, tracks, airports etc. that make CERN possible for granted.

All these examples involve services operating a layer above the ones we have been considering. Infrastructure is not commonly seen. That lower-level infrastructure has become invisible, just as the underlying network protocols that make Genbank, PubMed and the Perseus Project possible have also long since become invisible and taken for granted. To put a prosaic point on it, what is not commonly seen is also not commonly noticed. And that makes it even more important for us to focus on getting these deeper layers of infrastructure right.

If doing research entails the search for new discoveries, those involved are more inclined to focus on what is different about their research. Every sub-community within academia tends to think at a level of abstraction that is typically one layer above the truly essential – and shared – infrastructure. We hear physicists, chemists, biologists, humanists at meetings and conferences assume that the problems that they are trying to solve in online scholarly communication are specific to their particular discipline. They say “we need to uniquely identify antibodies” or “we need storage for astronomy data” or “we need to know which journals are open access” or “how many times has this article been downloaded”. Then they build the thing that they (think they) need.

This then leads to another layer of invisibility – the infrastructures that we were concerned with in the Principles for Open Scholarly Infrastructure are about what is the same across disciplines, not what is different. It is precisely the fact that these common needs are boring that means they start to disappear from view, in some cases before they even get built. For us it is almost a law: people tend to identify infrastructure one layer too high. We want to refocus attention on the layer below, the one that is disappearing from view. It turns out that a black leopard’s spots can be seen – once they’re viewed under infrared light.

So where does that leave us? What we hear in these conversations across disciplines are not the things that are different (i.e., what is special about antibodies or astronomy data or journal classifications or usage counts) but what is in common across all of these problems. This class of common problems needs shared solutions. “We need identifiers” and “we need storage” and “we need to assign metadata” and “we need to record relationships”. These infrastructures are ones that will allow us to identify objects of interest (ex: identify new kinds of research objects), store resources where more specialised storage doesn’t already exist (ex: validate a data analysis pipeline), and record metadata and relationships between resources, objects and ideas (ex: describing the relationships between funders and datasets). For example, the Directory of Open Access Journals provides identifiers for Open Access journals, claims about such journals, and relationships with other resources and objects (such as article level metadata, like Crossref DOIs and Creative Commons license URLs).

But what has generally happened in the past is that each group re-invents the wheel for its own particular niche. Specialist resources build a whole stack of tools rather than layering the one specific piece that they need on an existing set of infrastructures. There is an important counter-example: the ability to easily cross-reference articles and datasets as well as connect these to the people who created them. This is made possible by Crossref and DataCite DOIs with ORCID IDs. ORCID is an infrastructure that provides identifiers for people as well as metadata and claims about relationships between people and other resources (e.g., articles, funders, and institutions) which are in turn described by identifiers from other infrastructures (Crossref, FundRef, ISNI). The need to identify objects is something that we have recognised as common across the research enterprise. And the common infrastructures that have been built are amongst the most powerful that we have at our disposal. Yet most of us don’t even notice that we are using them.

Infrastructures for identification, storage, metadata and relationships enable scholarship. We need to extend the base platform of identifiers into those new spaces, beyond identification to include storage and references. If we can harness the benefits on the same scale that have arisen from the provision of identifiers like Crossref DOIs then the building of new services that are specific to given disciplines will become so much easier. In particular, we need to address the gap in providing a way to describe relationships between objects and resources in general. This base layer may be “boring” and it may be invisible to the view of most researchers. But that’s the way it should be. That’s what makes it infrastructure.

It isn’t what is immediately visible on the surface that makes a leopard a leopard, otherwise the black leopard wouldn’t be, it is what is buried beneath.


Community Support for ORCID – Who’s next to the plate?

Geoff Bilder, Jennifer Lin, Cameron Neylon

The announcement of a $3M grant from the Helmsley Trust to ORCID is a cause for celebration. For many of us who have been involved with ORCID, whether at the centre or the edges, the road to sustainability has been a long one, but with this grant (alongside some other recent successes) the funding is in place to take the organization to where it needs to be as a viable membership organization providing critical community services.

When we wrote the Infrastructure Principles we published some weeks back, ORCID was at the centre of our thinking, both as one of the best examples of good governance practice and as an infrastructure that needs sustaining. To be frank it has been disappointing, if perhaps not surprising, how long it has taken for key stakeholders to step up to the plate to support its development. Publishers get a lot of stick when it comes to demanding money, but when it comes to community initiatives it is generally publishers that put up the initial funding. This has definitely been the case with ORCID, with funders and institutions falling visibly behind, apparently assuming others will get things moving.

This is a common pattern, and not restricted to scholarly communications. Collective Action Problems are hard to solve, particularly when communities are diverse and have interests that are not entirely aligned. Core to solving collective action problems is creating trust. Our aim with the infrastructure principles was very much to raise the issue of trust and what makes a trustworthy organization to a greater prominence.

Developing trust amongst our diverse communities requires that we create trustworthy institutions. We have a choice. We can create those institutions in a way that embodies our values, the values that we attempted to articulate, or we can leave it to others. Those others will have other values, and other motives, and cannot be relied upon to align with our communities’ interests.

Google Scholar is a classic example. Everybody loves Google Scholar, and it’s a great set of tools. But it has not acted in a way that serves the communities’ broader needs. It does not have an API to allow the data to be re-used elsewhere. We cannot rely on it to continue in its current form.

Google Scholar exists fundamentally so that researchers will help Google organize the world’s research information for Google’s benefit. ORCID exists so that the research community can organize the world’s research information for our community’s benefit. One answers to its shareholders, the other to the community, and as a result ORCID needs the support of our community for its sustainability. As the saying goes, when you don’t pay for the product, you are the product.

ORCID is a pivotal case. Will our communities choose to work together to build sustainable infrastructures that serve our needs and answer to us? Or will we repeat the mistakes of the past and leave that to other players whose interests do not align with our own? If we can’t collectively bring ORCID to a place where it is sustainable, supported by the whole community, then what hope is there for more specialist, but no less necessary, infrastructures?

The Helmsley Trust deserves enormous credit for stepping up to the plate with this grant funding, as do the funders (including publishers) and early joining members who have gone before. But as a community we need to do more than provide time-limited grants. We need to take the collective responsibility to fund key infrastructure on an ongoing basis. And that means that other funders and institutions, alongside publishers, need to play their part.

Principles for Open Scholarly Infrastructures

Cite as “Bilder G, Lin J, Neylon C (2015) Principles for Open Scholarly Infrastructure-v1, retrieved [date], http://dx.doi.org/10.6084/m9.figshare.1314859”

UPDATE: This is the original blogpost from 2015 that introduced the Principles. You also have the option to cite or reference the Principles themselves as: Bilder G, Lin J, Neylon C (2020), The Principles of Open Scholarly Infrastructure, retrieved [date], https://doi.org/10.24343/C34W2H

infrastructure /ˈɪnfɹəˌstɹʌkt͡ʃɚ/ (noun) – the basic physical and organizational structures and facilities (e.g. buildings, roads, power supplies) needed for the operation of a society or enterprise. – New Oxford American Dictionary

Everything we have gained by opening content and data will be under threat if we allow the enclosure of scholarly infrastructures. We propose a set of principles by which Open Infrastructures to support the research community could be run and sustained. – Geoffrey Bilder, Jennifer Lin, Cameron Neylon

Over the past decade, we have made real progress to further ensure the availability of data that supports research claims. This work is far from complete. We believe that data about the research process itself deserves exactly the same level of respect and care. The scholarly community does not own or control most of this information. For example, we could have built or taken on the infrastructure to collect bibliographic data and citations but that task was left to private enterprise. Similarly, today the metadata generated in scholarly online discussions are increasingly held by private enterprises. They do not answer to any community board. They have no obligations to continue to provide services at their current rates, particularly when that rate is zero.

We do not contest the strengths of private enterprise: innovation and customer focus. There is a lot of exciting innovation in this space, much of it coming from private, for-profit interests, or public-private partnerships. Even publicly funded projects are under substantial pressures to show revenue opportunities. We believe we risk repeating the mistakes of the past, where a lack of community engagement led to a lack of community control, and the locking up of community resources. In particular our view is that the underlying data that is generated by the actions of the research community should be a community resource – supporting informed decision making for the community as well as providing a base for private enterprise to provide value added services.

What should a shared infrastructure look like? Infrastructure at its best is invisible. We tend to only notice it when it fails. If successful, it is stable and sustainable. Above all, it is trusted and relied on by the broad community it serves. Trust must run strongly across each of the following areas: running the infrastructure (governance), funding it (sustainability), and preserving community ownership of it (insurance). In this spirit, we have drafted a set of design principles we think could support the creation of successful shared infrastructures.

Governance

If an infrastructure is successful and becomes critical to the community, we need to ensure it is not co-opted by particular interest groups. Similarly, we need to ensure that any organisation does not confuse serving itself with serving its stakeholders. How do we ensure that the system is run “humbly”, that it recognises it doesn’t have a right to exist beyond the support it provides for the community and that it plans accordingly? How do we ensure that the system remains responsive to the changing needs of the community?

  • Coverage across the research enterprise – it is increasingly clear that research transcends disciplines, geography, institutions and stakeholders. The infrastructure that supports it needs to do the same.
  • Stakeholder Governed – a board-governed organisation drawn from the stakeholder community builds more confidence that the organisation will take decisions driven by community consensus and consideration of different interests.
  • Non-discriminatory membership – we see the best option as an ‘opt-in’ approach with a principle of non-discrimination where any stakeholder group may express an interest and should be welcome. The process of representation in day to day governance must also be inclusive with governance that reflects the demographics of the membership.
  • Transparent operations – achieving trust in the selection of representatives to governance groups will be best achieved through transparent processes and operations in general (within the constraints of privacy laws).
  • Cannot lobby – the community, not infrastructure organizations, should collectively drive regulatory change. An infrastructure organisation’s role is to provide a base for others to work on and should depend on its community to support the creation of a legislative environment that affects it.
  • Living will – a powerful way to create trust is to publicly describe a plan addressing the condition under which an organisation would be wound down, how this would happen, and how any ongoing assets could be archived and preserved when passed to a successor organisation. Any such organisation would need to honour this same set of principles.
  • Formal incentives to fulfil mission & wind-down – infrastructures exist for a specific purpose and that purpose can be radically simplified or even rendered unnecessary by technological or social change. If it is possible the organisation (and staff) should have direct incentives to deliver on the mission and wind down.

Sustainability

Financial sustainability is a key element of creating trust. ‘Trust’ often elides multiple elements: intentions, resources and checks and balances. An organisation that is both well meaning and has the right expertise will still not be trusted if it does not have sustainable resources to execute its mission. How do we ensure that an organisation has the resources to meet its obligations?

  • Time-limited funds are used only for time-limited activities – day to day operations should be supported by day to day sustainable revenue sources. Grant dependency for funding operations makes them fragile and more easily distracted from building core infrastructure.
  • Goal to generate surplus – organisations which define sustainability based merely on recovering costs are brittle and stagnant. It is not enough to merely survive; an organisation has to be able to adapt and change. To weather economic, social and technological volatility, organisations need financial resources beyond immediate operating costs.
  • Goal to create contingency fund to support operations for 12 months – a high priority should be generating a contingency fund that can support a complete, orderly wind down (12 months in most cases). This fund should be separate from those allocated to covering operating risk and investment in development.
  • Mission-consistent revenue generation – potential revenue sources should be considered for consistency with the organisational mission and not run counter to the aims of the organisation. For instance…
  • Revenue based on services, not data – data related to the running of the research enterprise should be a community property. Appropriate revenue sources might include value-added services, consulting, API Service Level Agreements or membership fees.

Insurance

Even with the best possible governance structures, critical infrastructure can still be co-opted by a subset of stakeholders or simply drift away from the needs of the community. Long term trust requires the community to believe it retains control.

Here we can learn from Open Source practices. To ensure that the community can take control if necessary, the infrastructure must be ‘forkable’. The community could replicate the entire system if the organisation loses the support of stakeholders, despite all established checks and balances. Each crucial part then must be legally and technically capable of replication, including software systems and data.

Forking carries a high cost, and in practice this would always remain challenging. But the ability of the community to recreate the infrastructure will create confidence in the system. The possibility of forking prompts all players to work well together, spurring a virtuous cycle. Acts that reduce the feasibility of forking then are strong signals that concerns should be raised.

The following principles should ensure that, as a whole, the organisation in extremis is forkable:

  • Open source – All software required to run the infrastructure should be available under an open source license. This does not include other software that may be involved with running the organisation.
  • Open data (within constraints of privacy laws) – For an infrastructure to be forked it will be necessary to replicate all relevant data. The CC0 waiver is best practice in making data legally available. Privacy and data protection laws will limit the extent to which this is possible.
  • Available data (within constraints of privacy laws) – It is not enough that the data be made ‘open’ if there is not a practical way to actually obtain it. Underlying data should be made easily available via periodic data dumps.
  • Patent non-assertion – The organisation should commit to a patent non-assertion covenant. The organisation may obtain patents to protect its own operations, but not use them to prevent the community from replicating the infrastructure.

Implementation

Principles are all very well but it all boils down to how they are implemented. What would an organisation actually look like if run on these principles? Currently, the most obvious business model is a board-governed, not-for-profit membership organisation, but other models should be explored. The process by which a governing group is established and refreshed would need careful consideration and community engagement. As would appropriate revenue models and options for implementing a living will.

Many of the consequences of these principles are obvious. One which is less obvious is that the need for forkability implies centralization of control. We often reflexively argue for federation in situations like this because a single centralised point of failure is dangerous. But in our experience federation begets centralisation. The web is federated, yet a small number of companies (e.g., Google, Facebook, Amazon) control discoverability; the published literature is federated yet two organisations control the citation graph (Thomson Reuters and Elsevier via Scopus). In these cases, federation did not prevent centralisation and control. And historically, this has occurred outside of stewardship to the community. For example, Google Scholar is a widely used infrastructure service with no responsibility to the community. Its revenue model and sustainability are opaque.

Centralization can be hugely advantageous though – a single point of failure can also mean there is a single point for repair. If we tackle the question of trust head on instead of using federation as a way to avoid the question of who can be trusted, we should not need to federate for merely political reasons. We will be able to build accountable and trusted organisations that manage this centralization responsibly.

Is there any existing infrastructure organisation that satisfies our principles? ORCID probably comes the closest, which is not a surprise as our conversation and these principles had their genesis in the community concerns and discussions that led to its creation. The ORCID principles represented the first attempt to address the issue of community trust; our conversations since have developed them to include additional issues. Other instructive examples that provide direction include Wikimedia Foundation and CERN.

Ultimately the question we are trying to resolve is how do we build organizations that communities trust and rely on to deliver critical infrastructures. Too often in the past we have used technical approaches, such as federation, to combat the fear that a system can be co-opted or controlled by unaccountable parties. Instead we need to consider how the community can create accountable and trustworthy organisations. Trust is built on three pillars: good governance (and therefore good intentions), capacity and resources (sustainability), and believable insurance mechanisms for when something goes wrong. These principles are an attempt to set out how these three pillars can be consistently addressed.

The challenge of course lies in implementation. We have not addressed the question of how the community can determine when a service has become important enough to be regarded as infrastructure nor how to transition such a service to community governance. If we can answer that question the community must take the responsibility to make that decision. We therefore solicit your critique and comments on this draft list of principles. We hope to provoke discussion across the scholarly ecosystem from researchers to publishers, funders, research institutions and technology providers and will follow up with a further series of posts where we explore these principles in more detail.

The authors are writing in a personal capacity. None of the above should be taken as the view or position of any of our respective employers or other organisations.