Github for science? Shouldn’t we perhaps build TCP/IP first?

[Image: mind map of TCP/IP, via Wikipedia]

It’s one of those throwaway lines, “Before we can talk about a github for science we really need to sort out a TCP/IP for science”, that’s geeky, sharp, a bit needly and goes down a treat on Twitter. But there is a serious point behind it. And it’s not intended to be dismissive of the ideas that are swirling around about scholarly communication at the moment either. So it seems worth exploring in a bit more detail.

The line is stolen almost wholesale from John Wilbanks, who used it (I think) in the talk he gave at a Science Commons meetup in Redmond a few years back. At the time I think we were awash in “Facebooks for Science”, so that was the target, but the sentiment holds. As was once the case with Facebook, and now is for Github, or Wikipedia, or StackOverflow, the possibilities opened up by these new services and technologies to support a much more efficient and effective research process look amazing. And they are. But you’ve got to be a little careful about taking the analogy too far.

If you look at what these services provide, particularly those that are focused on coding, they deliver commentary and documentation, nearly always in the form of text about code – which is also basically text. The web is very good at transferring text, and code, and data. The stack that delivers this is built on a set of standards, with each layer building on the layer beneath it. StackOverflow and Github are built on a set of services that in turn sit on top of web standards like HTTP, which in turn are built on network standards like TCP/IP that control the actual transfer of bits and bytes.
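To make that layering concrete, here’s a minimal Python sketch that speaks HTTP by hand over a raw TCP socket. Nothing here is specific to any particular service; it just shows that the text the web moves around rides directly on a TCP/IP byte stream (example.com stands in for any plain-HTTP host):

```python
# A minimal sketch of the layering described above: speaking HTTP by hand
# over a raw TCP socket. An HTTP request is itself just structured text,
# carried on the byte stream that TCP/IP provides.
import socket

HOST = "example.com"

# TCP/IP layer: open a byte stream to port 80
with socket.create_connection((HOST, 80)) as sock:
    # HTTP layer: the request is just text with a rigid shape
    request = f"GET / HTTP/1.1\r\nHost: {HOST}\r\nConnection: close\r\n\r\n"
    sock.sendall(request.encode("ascii"))

    # Read until the server closes the connection
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:
            break
        chunks.append(data)

response = b"".join(chunks)
print(response.split(b"\r\n")[0].decode())  # e.g. "HTTP/1.1 200 OK" - more text
```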

The fundamental stuff of these coding sites and Wikipedia is text, and text is really well supported by the stack of web technologies. Open Source approaches to software development didn’t just develop because of the web, they developed the web, so it’s not surprising that they fit well together. They grew up together and nurtured each other. But the bottom line is that the stack is optimized to transfer the grains of material, text and code, that make up the core of these services.

When we look at research and dig down to the granular level, we can see that it isn’t just made up of text. Sure, most research could be represented as text, but we don’t have the standardized forms to do this. We don’t have standard granules of research that we can transfer from place to place, because it’s complicated to transfer the stuff of research. I picked on TCP/IP specifically because it is the transfer protocol that supports moving bits and bytes from one place to another. What we need are protocols that support moving the substance of a piece of my research from one place to another.

Work on Research Objects [see also this paper], intended to be self-contained but usable pieces of research, is a step in this direction, as is the developing set of workflow tools that will ultimately allow us to describe and share the process by which we’ve transformed at least some parts of the research process into others. Laboratory recording systems will help us to capture and workflow-ify records of the physical parts of the research process. But until we can agree how to transfer these in a standardized fashion, I think it is premature to talk about Githubs for research.
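To make the idea concrete, here’s a purely hypothetical sketch – not the actual Research Object specification, and every identifier and field name is invented – of what a self-contained granule of research might carry: the artefacts themselves plus enough provenance and context to make them usable:

```python
# A hypothetical sketch (not the actual Research Object specification) of a
# self-contained "granule" of research: the artefacts plus the context and
# provenance that make them usable by someone else. All names are invented.
import json

research_object = {
    "id": "urn:example:ro/2012/assay-run-17",  # hypothetical identifier
    "title": "Fluorescence assay, run 17",
    "created_by": "researcher@example.org",
    "artefacts": [
        {"path": "data/plate_readings.csv", "media_type": "text/csv"},
        {"path": "code/analysis.py", "media_type": "text/x-python"},
    ],
    "provenance": [
        {"step": "measure", "instrument": "plate reader",
         "produces": "data/plate_readings.csv"},
        {"step": "analyse", "uses": "code/analysis.py",
         "consumes": "data/plate_readings.csv"},
    ],
    "licence": "CC0-1.0",
}

print(json.dumps(research_object, indent=2))
```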

Now there is a flip side to this, which is that where there are such services that do support the transfer of pieces of the research process, we absolutely should be experimenting with them. But in most cases the type-case itself will do the job. Github is great for sharing research code, and some people are doing terrific things with data there as well. But if it does the job for those kinds of things, why do we need one just for researchers? The scale that the consumer web brings, and the exposure to a much bigger community, is a powerful counter-argument to building things ‘just for researchers’. To justify a service focused on a small community you need to have very strong engagement or very specific needs. By the time a mainstream service has mindshare and researchers are using it, your chances of pulling them away to a new service just for them are very small.

So yes, we should be inspired by the possibilities that these new services open up, and we should absolutely build and experiment. But while we are at it, can we also focus on the lower levels of the stack? They aren’t as sexy and they probably won’t make anyone rich, but we’ve got to get serious about the underlying mechanisms that will transfer our research in comprehensible packages from one place to another.

We have to think carefully about capturing the context of research and presenting that to the next user. Github works in large part because the people using it know how to use code, can recognize specific languages, and know how to drive it. It’s actually pretty poor for the user who just wants to get something done: we’ve had to build up another set of services at different levels (the Python Package Index, tools for making and distributing executables) that help provide the context required for different types of user. This is going to be much, much harder for all the different types of use we might want to put research to.

But if we can get this right – if we can standardize transfer protocols and build the context of the research into those ‘packets’ in a way that lets people use them – then what we have seen on the wider web will happen naturally. As we build the stack up, the services that seem so hard to build at the moment will become as easy as throwing up a blog, downloading a rubygem, or firing up a machine instance is today. If we can achieve that then we’ll have much more than a github for research, we’ll have a whole web for research.
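As a rough illustration of what building the context into the packets might look like from a client’s point of view, here’s a hedged sketch. The resolver URL and manifest format are invented for illustration; agreeing the real standards is exactly the open question:

```python
# A hedged sketch of the "transfer protocol" idea: resolve a research-object
# identifier to a manifest over plain HTTP, and let the context that travels
# with the packet tell the client what it has. The resolver service and the
# manifest format here are hypothetical.
import json
import urllib.parse
import urllib.request

RESOLVER = "https://resolver.example.org/"  # hypothetical resolver service

def fetch_research_object(identifier: str) -> dict:
    """Fetch the manifest for a research object; purely illustrative."""
    url = RESOLVER + urllib.parse.quote(identifier, safe="")
    with urllib.request.urlopen(url) as response:
        manifest = json.load(response)
    # Because the context is packaged with the object, a generic client can
    # decide what it is able to do with each artefact without guessing.
    for artefact in manifest.get("artefacts", []):
        print(artefact["media_type"], "->", artefact["path"])
    return manifest
```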

There’s nothing new here that wasn’t written some time ago by John Wilbanks and others but it seemed worth repeating. In particular I recommend these posts [1, 2] from John.

12 Replies to “Github for science? Shouldn’t we perhaps build TCP/IP first?”

  1. Cameron – I took your meaning in the tweet to be the second, sarcastic one; I didn’t realize you were actually somewhat serious. But TCP is probably the wrong level. I recently skimmed through several thousand of the most recent IETF RFCs and there are a number of proposals that looked to me to be possibly useful for science communication – capturing “context” more completely than is typical with URLs. But the web is becoming quite ossified at the lower layers and you see considerable pushback against any non-URL, non-HTTP proposal for something new. If it can be done with URLs, the argument goes, why do you need something else? Web browsers are so ubiquitous and the protocol so widely implemented that the huge “installed base” is overwhelming.

    That’s what’s happened with DOIs and the handle system (an alternative to DNS).

  2. I should say that I’m not proposing we actually go in and build a new protocol at the network layer below HTTP. It’s pretty clear that whatever we build will sit over HTTP and almost certainly involve existing web tools. But my point is that we need to do some pretty fundamental low-level engineering of standards that would sit at a level equivalent to TCP/IP relative to anything that really acted as a “Github”-level entity. What we need are good protocols for identifying and transferring pieces of research over the web. I’m sure that these will have URLs and use some components of the SemWeb stack. Then we can build a social layer on top of those.

  3. Ah well – on that I think you’re right. But what we should seriously discuss is a “SemWeb” for scientists, because as it currently stands I believe the structure of the semantic web is NOT actually very conducive to the needs of scientific communication (or perhaps many other practical uses, given the very slow uptake of the SemWeb 10 years on). I wrote about this last year here:

    http://arthur.shumwaysmith.com/life/content/where_the_semantic_web_goes_wrong_thoughts_on_web_30

    and let me quote myself: “In my view the semantic web vision is actually backwards: trust, provenance, and context need to be at the foundation of meaning, not tacked on at the top layer for “intelligent agents” to try to figure out. A given statement is meaningless out of context. Recognition of the existence of multiple contexts allows us to handle mutually inconsistent statements without human rationality completely breaking down, so “intelligent agents” will need to do this too. RDF and the “giant global graph” tries to define a universal context, but I believe such a project can never succeed (outside of, possibly, mathematics) because the real world is just too messy.”

  4. Agreed, and that was part of the discussion that would be good to provoke. Those are some significant issues you are raising and there hasn’t always been a good match from SemWeb thinking to research as we do it in practice. On the other hand there is an awful lot of potential there that we clearly want to get at as well – so what’s the overall architecture and how do we best approach getting it right?

  5. I find myself wondering …

    What if we just use github itself?  Couldn’t that be our “publishing” platform?

  6. That’s kind of my point with all of these “X for science” things. You need a compelling reason not to just use X. I’ve thought about doing stuff through Github, and others have done things from time to time. It works: it’s not perfect, but it’s good enough for a lot of stuff.

  7. As someone who is posting their daily research to github, I recognize what @CameronNeylon is saying about the broader lack of standards to move research around, even though it works well for code.

    I think this misses the bigger issue, though: I feel the social challenges are far greater than the technical ones. Wired’s github piece already alludes to this – obviously github didn’t invent git; it solved a primarily social problem instead. I believe many researchers are sharing exactly as much as they want to. Incentivize them to share more, and collectively we’ll discover much better solutions to the technical barriers those of us trying to share already bang our heads against. Maybe?

  8. I can’t argue with that. At some level the standards problem I sketch out *is* the social problem. So TCP/IP and HTTP are solutions to social problems – how do we agree to send packets across a network and how do we request them in a standardised form. They are implemented as technical standards but the social problems were the ones solved by development and critically adoption of those standards. If we could agree a standard then we would also have solved the social problem for research – the problem of course is to get enough people to the point where they recognise there is a problem. And you’re dead right – the incentives are core to that.
