What exactly is infrastructure? Seeing the leopard’s spots

Black leopard from Out of Africa Wildlife Park in Arizona (Wikipedia)

Cite as: What exactly is infrastructure? Seeing the leopard’s spots. Geoffrey Bilder, Jennifer Lin, Cameron Neylon. figshare. http://dx.doi.org/10.6084/m9.figshare.1520432

We ducked a fundamental question raised by our proposal for infrastructure principles: “what exactly counts as infrastructure?” This question matters. If our claim is that infrastructures should operate according to a set of principles, we need to be able to identify the thing of which we speak. Call a leopard a leopard. Of course this is not a straightforward question, which is part of the reason for leaving it untouched in the introductory post. We believe that any definition must involve a much broader discussion with the community. But we wanted to kick this off with a discussion of an important part of the infrastructure puzzle that we think is often missed.

In our conversations with scholars and others in the research ecosystem, people frequently speak of “infrastructures” when what they mean are services on top of deeper infrastructures. Or indeed infrastructures that sit on deeper infrastructures. Most people think of the web as an essential piece of infrastructure, and it is the platform that makes much of what we are talking about possible. But the web is built on deeper layers of infrastructure: the internet itself, MAC addresses, IP, and TCP. These are things that many readers will never even have heard of because they have disappeared from view. Similarly in academia, a researcher will point to “CERN” or “Genbank” or “Pubmed” or “The Perseus Project” when asked about critical infrastructure. The physicists at CERN have long since taken for granted the plumbing, electricity, roads, tracks, airports etc. that make CERN possible.

All these examples involve services operating a layer above the ones we have been considering. Infrastructure is, by its nature, not commonly seen. That lower-level infrastructure has become invisible, just as the underlying network protocols that make Genbank, PubMed and the Perseus Project possible have long since become invisible and taken for granted. To put a prosaic point on it: what is not commonly seen is also not commonly noticed. And that makes it even more important for us to focus on getting these deeper layers of infrastructure right.

If doing research entails the search for new discoveries, those involved are more inclined to focus on what is different about their research. Every sub-community within academia tends to think at a level of abstraction that is typically one layer above the truly essential – and shared – infrastructure. We hear physicists, chemists, biologists, and humanists at meetings and conferences assume that the problems they are trying to solve in online scholarly communication are specific to their particular discipline. They say “we need to uniquely identify antibodies” or “we need storage for astronomy data” or “we need to know which journals are open access” or “we need to know how many times this article has been downloaded”. Then they build the thing that they (think they) need.

This then leads to another layer of invisibility – the infrastructures that we were concerned with in the Principles for Open Scholarly Infrastructure are about what is the same across disciplines, not what is different. It is precisely the fact that these common needs are boring that means they start to disappear from view, in some cases before they even get built. For us it is almost a law: people tend to identify infrastructure one layer too high. We want to refocus attention on the layer below, the one that is disappearing from view. It turns out that a black leopard’s spots can be seen – once they’re viewed under infrared light.

So where does that leave us? What we hear in these conversations across disciplines is not what is different (i.e., what is special about antibodies or astronomy data or journal classifications or usage counts) but what is in common across all of these problems. This class of common problems needs shared solutions: “we need identifiers”, “we need storage”, “we need to assign metadata”, and “we need to record relationships”. These infrastructures are ones that will allow us to identify objects of interest (e.g., new kinds of research objects), store resources where more specialised storage doesn’t already exist (e.g., to validate a data analysis pipeline), and record metadata and relationships between resources, objects and ideas (e.g., describing the relationships between funders and datasets). For example, the Directory of Open Access Journals provides identifiers for Open Access journals, claims about such journals, and relationships with other resources and objects (such as article-level metadata, like Crossref DOIs and Creative Commons license URLs).

But what has generally happened in the past is that each group re-invents the wheel for its own particular niche. Specialist resources build a whole stack of tools rather than layering the one specific piece they need on an existing set of infrastructures. There is an important counter-example: the ability to easily cross-reference articles and datasets, as well as connect these to the people who created them. This is made possible by Crossref and DataCite DOIs together with ORCID IDs. ORCID is an infrastructure that provides identifiers for people, as well as metadata and claims about relationships between people and other resources (e.g., articles, funders, and institutions), which are in turn described by identifiers from other infrastructures (Crossref, FundRef, ISNI). The need to identify objects is something we have recognised as common across the research enterprise. And the common infrastructures that have been built are amongst the most powerful we have at our disposal. Yet most of us don’t even notice that we are using them.
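
To make this concrete, here is a minimal sketch of what consuming that shared identifier infrastructure looks like in practice. DOI content negotiation, a service Crossref and DataCite both support, turns a bare identifier into machine-readable metadata, including relationships such as authors and licenses. The DOI is the one from the citation at the top of this post; error handling is omitted.

    import requests

    # Resolve a DOI to machine-readable metadata via content negotiation.
    doi = "10.6084/m9.figshare.1520432"
    response = requests.get(
        "https://doi.org/" + doi,
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    metadata = response.json()
    print(metadata.get("title"))
    print([a.get("family") for a in metadata.get("author", [])])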

Infrastructures for identification, storage, metadata and relationships enable scholarship. We need to extend the base platform of identifiers into these new spaces, beyond identification to include storage and references. If we can harness benefits on the same scale as those that have arisen from the provision of identifiers like Crossref DOIs, then building new services specific to given disciplines will become much easier. In particular, we need to address the gap in providing a way to describe relationships between objects and resources in general. This base layer may be “boring” and it may be invisible to most researchers. But that’s the way it should be. That’s what makes it infrastructure.

It isn’t what is immediately visible on the surface that makes a leopard a leopard – otherwise the black leopard wouldn’t be one – it is what is buried beneath.


Writing a Wave Robot – Some thoughts on good practice for Research Robots

ChemSpidey lives! Even in the face of Karen James’ heavy irony I am still amazed that someone like me, with very little programming experience, was able to pull together something that actually worked effectively in a live demo. As long as you’re not actively scared of trying to put things together, it is becoming relatively straightforward to build tools that do useful things. Building ChemSpidey relied heavily on existing services and other people’s code, but pulling that together was a relatively straightforward process. The biggest problems lay in working around the strange, and in most cases undocumented, behaviour of some of the pieces I used. So what is ChemSpidey?

ChemSpidey is a Wave robot that can be found at chemspidey@appspot.com. The code repository is available on Github and you should feel free to re-use it in any way you see fit, although I wouldn’t really recommend it at the moment; it isn’t exactly the highest quality code. One of the first applications I see for Wave is to make it easy to author (semi-)semantic documents which link objects within the document to records on the web. In chemistry it would be helpful to link the names of compounds through to records about those compounds in the relevant databases.

If ChemSpidey is added to a wave it watches for text of the form “chem[ChemicalName{;weight {m}g}]”, where the curly-bracketed parts are optional. When a blip is submitted by hitting the “done” button, ChemSpidey searches through the blip looking for this text and, if it finds it, strips out the name and sends it to the ChemSpider SimpleSearch service. ChemSpider returns a list of database ids and the robot currently just pulls the top one off the list and adds the text ChemicalName (csid:####) to the wave, where the id is linked back to ChemSpider. If there is a weight present it asks the ChemSpider MassSpec API for the nominal molecular weight, calculates the number of moles, and inserts that. You can see video of it working here (look along the timeline for the ChemSpidey tag).
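
For the curious, the pattern matching and mole arithmetic amount to something like the sketch below. This is an illustration of the idea rather than the actual ChemSpidey code: the regular expression and function names are my own invention, and the calls to ChemSpider are left out.

    import re

    # Match text like "chem[caffeine]" or "chem[caffeine; 97.1 mg]",
    # where the weight (in g or mg) is optional.
    CHEM_PATTERN = re.compile(r"chem\[([^;\]]+)(?:;\s*([\d.]+)\s*(m?)g)?\]")

    def parse_blip(text):
        """Return (name, weight_in_mg) tuples found in a blip's text."""
        results = []
        for match in CHEM_PATTERN.finditer(text):
            name = match.group(1).strip()
            weight = None
            if match.group(2):
                weight = float(match.group(2))
                if match.group(3) != "m":  # grams rather than milligrams
                    weight *= 1000
            results.append((name, weight))
        return results

    def millimoles(weight_mg, molecular_weight):
        """Weight in mg divided by molecular weight in g/mol gives mmol."""
        return weight_mg / molecular_weight

    print(parse_blip("Dissolve chem[caffeine; 97.1 mg] in water"))
    # [('caffeine', 97.1)]; caffeine's molecular weight is ~194.2 g/mol
    print(millimoles(97.1, 194.2))  # ~0.5 mmol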


What have I learned? Well, some stuff that is probably obvious to anyone who is a proper developer. Use the current version of the API. Google AppEngine pushes strings around as unicode, which broke my tests because I had developed things using standard Python strings. But I think it might be useful to start drawing some more general lessons about how best to design robots for research, so to kick off the discussion here are my thoughts, many of which came out of discussions with Ian Mulvany as we prepared for last week’s demo.
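
For anyone who hasn’t hit it before, the gotcha looks something like this (a Python 2 illustration, not the actual test code):

    # -*- coding: utf-8 -*-
    # Python 2: App Engine delivers text as unicode objects, while tests
    # written against byte strings assume str. The values compare equal
    # for ASCII text, but type checks and some library calls break.
    text = u"caffeine"            # what App Engine hands the robot
    print(text == "caffeine")     # True: the values match...
    print(isinstance(text, str))  # False: it's unicode, not str
    print(type(text))             # <type 'unicode'>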

  1. Always add a Welcome Blip when the Robot is added to a wave. This makes the user confident that something has happened, lets you notify users if a new version has been released (which might change the way the robot works), and lets you provide some short instructions. It’s good to include a version number here as well (see the sketch after this list).
  2. Have some help available. Ian’s Janey robot responds to the request (janey help) in a blip with an extended help blip explaining the context. Blips are easily deleted later if the user wants to get rid of them, and putting these in separate blips keeps them out of the main document.
  3. Where you modify text, leave an annotation. I’ve only just started to play with annotations, but it seems immensely useful to at least attempt to leave a trace of what you’ve done that makes it easy for your own Robot, other Robots, or just human users to see who did what. I would suggest leaving annotations that identify the robot, include any text that was parsed, and ideally provide some domain information. We need to discuss how to set up some namespaces for this.
  4. Try to isolate the “science handling” from the “wave handling”. ChemSpidey mixes up a lot of things into one Python script. Looking back at it now, it makes much more sense to isolate the interaction with the wave from the routines that parse text or do mole calculations. This means both that the different levels of code become easier for others to re-use, and that if Wave doesn’t turn out to be the one system to rule them all, we can still re-use the code. I am no architecture expert and it would be good to get some clues from some good ones about how best to separate things out.
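
To illustrate the first suggestion, here is roughly what a minimal welcome blip looks like with the waveapi Python client library. This is a sketch reconstructed from memory of the early v1 robot API, so treat the call names as approximate; the icon and profile URLs are placeholders.

    from waveapi import events
    from waveapi import robot

    VERSION = '0.3'

    def OnSelfAdded(properties, context):
        """Post a welcome blip, with a version number, when added to a wave."""
        wavelet = context.GetRootWavelet()
        wavelet.CreateBlip().GetDocument().SetText(
            "ChemSpidey v%s: write chem[Name; 100 mg] and hit Done." % VERSION)

    if __name__ == '__main__':
        chemspidey = robot.Robot(
            'chemspidey',
            image_url='http://chemspidey.appspot.com/icon.png',  # placeholder
            version=VERSION,
            profile_url='http://chemspidey.appspot.com/')        # placeholder
        chemspidey.RegisterHandler(events.WAVELET_SELF_ADDED, OnSelfAdded)
        chemspidey.Run()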

These are just some initial thoughts from a very novice Python programmer. My code satisfies essentially none of these suggestions but I will make a concerted attempt to improve on that. What I really want to do is kick the conversation on from where we are at the moment, which is basically playing around, into how we design an architecture that allows rapid development of useful and powerful functionality.

Where is the best place in the Research Stack for the human API?

Interesting conversation yesterday on Twitter with Evgeniy Meyke of EarthCape, prompted in part by my last post. We started talking about what a Friendfeed replacement might look like and how it might integrate more directly with scientific data. Is it possible to build something general, or will it always need to be domain specific? Might this in fact be an advantage? Evgeniy asked:

@CameronNeylon do you think that “something new” could be more vertically oriented rather then for “research community” in general?

His thinking being, as I understand it, that getting at domain-specific underlying data is always likely to take local knowledge. As he said in his next tweet:

@CameronNeylon It might be that the broader the coverage the shallower is integration with underlining research data, unless api is good

This led me to thinking about integration layers between data and people, and recalled something that I said in jest to someone some time ago:

“If you’re using a human as your API then you need to work on your user interface.”

Thinking about the way Friendfeed works, there is a real sense in which the system talks to a wide range of automated APIs, but at the core there is a human layer that firstly selects feeds of interest and then, when presented with other feeds, selects specific items from them. What Friendfeed does very well in some senses is provide a flexible API between feeds and the human brain. But Evgeniy made the point that this “works only 4 ‘discussion based’ collaboration (as in FF), not 4 e.g. collab. taxonomic research that needs specific data inegration with taxonomic databases”.

Following from this was an interesting conversation [Webcite Archived Version] about how we might best integrate the “human API” for some imaginary “Science Stream” with domain-specific machine APIs that work at the data level. In a sense this is the core problem of scientific informatics: how do you optimise the ability of machines to abstract and use data and meaning, while at the same time fully exploiting the ability of the human scientist to contribute their own unique skills of pattern recognition, insight, and lateral thinking? And how do you keep these in step with each other so that both are optimally utilised? Thinking in computational terms about the human as a layer in the system, with its own APIs, could be a useful way to design systems.

Friendfeed in this view is a peer-to-peer system for pushing curated and annotated data streams. It mediates interactions with the underlying stream but also with other known and unknown users. Friendfeed seems to get three things very right: 1) optimising the interaction with the incoming data stream; 2) facilitating the curation and republication of data into a new stream for consumption by others, creating a virtuous feedback loop in fact; and 3) facilitating discovery of new peers. Friendfeed is actually a BitTorrent for sharing conversational objects.

This conversational layer, a research discourse layer if you like, is at the very top of the stack, keeping the humans at a high, abstracted level of conversation, where we are probably still at our best. And my guess is that something rather like Friendfeed is pretty good at being the next layer down, the API to feeds of interesting items. But Evgeniy’s question was more about the bottom of the stack, where the data is being generated and needs to be turned into a useful and meaningful feed, ready to be consumed. The devil is always in the details, and vertical integration is likely to help here. So what do these vertical segments look like?

In some domains these might be lab notebooks, in some they might be specific databases, or they might be a mixture of both and of other things. At the coal face it is likely to be difficult to describe the detail in a way that is both generic enough to be comprehensible and detailed enough to be useful. The needs of the data generator are likely to be very different to those of a generic data consumer. But if there is a curation layer, perhaps human or machine mediated, that partly abstracts this, then we may be on the way to generating the generic feeds that will be finally consumed at the top layer. This curation layer would enable semantic markup, ideally automatically, would require domain-specific tooling to translate from the specific to the generic, and would provide a publishing mechanism. In short it sounds (again) quite a bit like Wave. Actually it might just as easily be Chem4Word or any other domain-specific semantic authoring tool, or just a translation engine that takes in detailed domain-specific info and correlates it with a wider vocabulary.
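
As a toy sketch of what that curation layer might do, the job is essentially translation: take a domain-specific record and republish it as a generic feed item. Every field name and the notebook URL below are invented for illustration.

    # Toy curation layer: map a (hypothetical) lab-notebook entry onto a
    # generic feed item that an upstream aggregator could consume.
    def curate(record):
        return {
            "id": record["sample_id"],
            "title": record["experiment"],
            "creator": record["researcher"],
            "tags": ["chemistry", record["technique"]],
            "link": "http://notebook.example.org/" + record["sample_id"],
        }

    entry = {
        "sample_id": "S-1042",
        "experiment": "Buffer exchange of lysozyme",
        "researcher": "cn",
        "technique": "dialysis",
    }
    print(curate(entry))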

One of the things that appeals to me about Wave, and Chem4Word, is that they can (or at least have the potential to) hide the complexities of the semantics within a straightforward and comprehensible authoring environment. Wave can be integrated into domain-specific systems via purpose-built Robots, making it highly extensible. Both are capable of “speaking web” and generating feeds that can be consumed and processed in other places and by other services. At the bottom layer we can chew the problem off one piece at a time, including human processing where it is appropriate and avoiding it where we can.

The middleware is, of course, as always, the problem. The middleware is agreed and standardised vocabularies and data formats. While in the past I have thought this near intractable, it actually seems as though many of the pieces are falling into place. There is still a great need for standardisation, and perhaps a need for more meta-standards, but it seems like a lot of this is in fact on the way. I’m still not convinced that we have a useful vocabulary for actually describing experiments, but enough smart people disagree with me that I’m going to shut up on that one until I’ve found the time to have a closer look at the various things out there in more detail.

These are half-baked thoughts, but I think the question of where we optimally place the human in the system is a useful one. It also hasn’t escaped my notice that I’m talking about something very similar to the architecture that Simon Coles of Amphora Research Systems always puts up in his presentations on Electronic Lab Notebooks. Fundamentally, that is because the same core drivers are there.