Github for science? Shouldn’t we perhaps build TCP/IP first?

Mind map of TCP/IP (image via Wikipedia)

It’s one of those throwaway lines, “Before we can talk about a github for science we really need to sort out a TCP/IP for science”, that’s geeky, sharp, a bit needling and goes down a treat on Twitter. But there is a serious point behind it. And it’s not intended to be dismissive of the ideas that are swirling around about scholarly communication at the moment either. So it seems worth exploring in a bit more detail.

The line is stolen almost wholesale from John Wilbanks who used it (I think) in the talk he gave at a Science Commons meetup in Redmond a few years back. At the time I think we were awash in “Facebooks for Science” so that was the target but the sentiment holds. As once was the case with Facebook and now is for Github, or Wikipedia, or StackOverflow, the possibilities opened up by these new services and technologies to support a much more efficient and effective research process look amazing. And they are. But you’ve got to be a little careful about taking the analogy too far.

If you look at what these services provide, particularly those that are focused on coding, they deliver commentary and documentation, nearly always in the form of text about code, which is also basically text. The web is very good at transferring text, and code, and data. The stack that delivers this is built on a set of standards, with each layer building on the layer beneath it. StackOverflow and Github are built on a set of services that in turn sit on top of web standards like HTTP, which in turn are built on network standards like TCP/IP that control the actual transfer of bits and bytes.
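To make the layering concrete, here is a minimal sketch in Python. It is an illustration only: the handler, the function names and the served text are all made up for this example, and the server runs on the local machine. One side serves a snippet of text over HTTP; the other side asks for it by opening a bare TCP connection and writing the HTTP request by hand, so you can see that the "web" layer is just agreed-upon text riding on the transport underneath.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class TextHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: serves one small piece of 'code as text'."""

    def do_GET(self):
        body = b"print('hello, research')\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Keep the demo quiet; the default implementation logs to stderr.
        pass


def fetch_over_raw_tcp(host, port, path="/"):
    """Speak HTTP by hand over a plain TCP socket and return the raw reply."""
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n".encode("ascii")
    with socket.create_connection((host, port), timeout=5) as conn:
        conn.sendall(request)
        chunks = []
        while True:
            data = conn.recv(4096)
            if not data:  # the server closes the connection when it is done
                break
            chunks.append(data)
    return b"".join(chunks)


def demo():
    # Port 0 asks the operating system for any free local port.
    server = HTTPServer(("127.0.0.1", 0), TextHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        raw = fetch_over_raw_tcp("127.0.0.1", server.server_port)
    finally:
        server.shutdown()
        server.server_close()
    # HTTP separates headers from body with a blank line.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n")[0].decode("ascii")
    return status_line, body.decode("utf-8")


if __name__ == "__main__":
    status, text = demo()
    print(status)
    print(text, end="")
```

Nothing in the TCP part knows or cares that the bytes are an HTTP request, and nothing in the HTTP part knows that the body is code. That separation is what lets services be stacked on top of each other.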

The fundamental stuff of these coding sites and Wikipedia is text, and text is really well supported by the stack of web technologies. Open Source approaches to software development didn’t just develop because of the web, they developed the web, so it’s not surprising that they fit well together. They grew up together and nurtured each other. But the bottom line is that the stack is optimized to transfer the grains of material, text and code, that make up the core of these services.

When we look at research we can see that when we dig down to the granular level it isn’t just made up of text. Sure, most research could be represented as text but we don’t have the standardized forms to do this. We don’t have standard granules of research that we can transfer from place to place. This is because it’s complicated to transfer the stuff of research. I picked on TCP/IP specifically because it is the transfer protocol that supports moving bits and bytes from one place to another. What we need are protocols that support moving the substance of a piece of my research from one place to another.

Work on Research Objects [see also this paper], intended to be self-contained but usable pieces of research, is a step in this direction, as is the developing set of workflow tools that will ultimately allow us to describe and share the process by which we’ve transformed at least some parts of the research process into others. Laboratory recording systems will help us to capture and workflow-ify records of the physical parts of the research process. But until we can agree how to transfer these in a standardized fashion I think it is premature to talk about Githubs for research.

Now there is a flip side to this, which is that where there are such services that do support the transfer of pieces of the research process we absolutely should be experimenting with them. But in most cases the type-case itself, the original mainstream service, will do the job. Github is great for sharing research code and some people are doing terrific things with data there as well. But if it does the job for those kinds of things why do we need one for researchers? The scale that the consumer web brings, and the exposure to a much bigger community, is a powerful counter argument to building things ‘just for researchers’. To justify a service focused on a small community you need to have very strong engagement or very specific needs. By the time that a mainstream service has mindshare and researchers are using it, your chances of pulling them away to a new service just for them are very small.

So yes, we should be inspired by the possibilities that these new services open up, and we should absolutely build and experiment, but while we are at it can we also focus on the lower levels of the stack? They aren’t as sexy and they probably won’t make anyone rich, but we’ve got to get serious about the underlying mechanisms that will transfer our research in comprehensible packages from one place to another.

We have to think carefully about capturing the context of research and presenting that to the next user. Github works in large part because the people using it know how to use code, can recognize specific languages, and know how to drive it. It’s actually pretty poor for the user who just wants to do something. We’ve had to build up another set of services at different levels, the Python Package Index, tools for making and distributing executables, that help provide the context required for different types of user. This is going to be much, much harder, for all the different types of use we might want to put research to.

But if we can get this right, if we can standardize transfer protocols and build into those ‘packets’ the context of the research that lets people use it, then what we have seen on the wider web will happen naturally. As we build the stack up, these services that seem so hard to build at the moment will become as easy as throwing up a blog, downloading a rubygem, or firing up a machine instance is today. If we can achieve that then we’ll have much more than a github for research, we’ll have a whole web for research.
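As a purely hypothetical sketch of what such a ‘packet’ might look like, the Python below wraps a piece of research (here just some bytes) together with its context and a checksum, so that whoever receives it can both verify it and understand what it is. None of this is an existing standard: the function names `pack` and `unpack`, the envelope fields and the example context keys are all assumptions made up for illustration.

```python
import base64
import hashlib
import json


def pack(payload: bytes, context: dict) -> str:
    """Wrap a payload with its context and a checksum (hypothetical format)."""
    envelope = {
        "context": context,
        "sha256": hashlib.sha256(payload).hexdigest(),
        # Base64 lets arbitrary bytes travel inside a text (JSON) packet.
        "payload": base64.b64encode(payload).decode("ascii"),
    }
    return json.dumps(envelope, sort_keys=True)


def unpack(packet: str):
    """Return (payload, context), refusing packets that were altered in transit."""
    envelope = json.loads(packet)
    payload = base64.b64decode(envelope["payload"])
    if hashlib.sha256(payload).hexdigest() != envelope["sha256"]:
        raise ValueError("payload does not match its checksum")
    return payload, envelope["context"]


if __name__ == "__main__":
    # Example context keys are invented; a real standard would have to agree on them.
    packet = pack(
        b"time_s,absorbance\n0,0.12\n60,0.34\n",
        {"kind": "table", "format": "text/csv", "units": {"time_s": "seconds"}},
    )
    data, context = unpack(packet)
    print(context["format"])
    print(data.decode("utf-8"), end="")
```

The code is the easy part. The hard part, and the point of this post, is agreeing on what goes in the context so that the next user, human or machine, can actually do something with the payload.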

There’s nothing new here that wasn’t written some time ago by John Wilbanks and others but it seemed worth repeating. In particular I recommend these posts [1, 2] from John.

12 Replies to “Github for science? Shouldn’t we perhaps build TCP/IP first?”

Cameron, I took your meaning in the tweet to be the second, sarcastic one; I didn’t think you were actually somewhat serious. But TCP is probably the wrong level. I recently skimmed through several thousand of the most recent IETF RFCs and there are a number of proposals that looked to me to be possibly useful for science communication, capturing “context” more completely than is typical with URLs. But the web is becoming quite ossified at the lower layers and you see considerable pushback against any non-URL, non-HTTP proposal for something new. If it can be done with URLs, the argument goes, why do you need something else? Web browsers are so ubiquitous and the protocol so widely implemented that the huge “installed base” is overwhelming.

That’s what’s happened with DOIs and the handle system (an alternative to DNS).

Cameron Neylon says:

February 22, 2012 at 12:02 pm

I should say that I’m not proposing that we actually go in and build a new protocol at the network layer below HTTP. It’s pretty clear that whatever we build will sit over HTTP and almost certainly involve existing web tools. But my point is that we need to do some pretty fundamental low level engineering of standards that would sit at some level equivalent to TCP/IP with reference to something that really acted as a “Github” level entity. What we need are good protocols for identifying and transferring pieces of research over the web. I’m sure that these will have URLs and use some components of the SemWeb stack. Then we can build a social layer on top of those.
1. Arthur Smith says:
  
  February 22, 2012 at 1:03 pm
  
  Ah well, on that I think you’re right, but what we should seriously discuss is a “SemWeb” for scientists, because as it currently stands I believe the structure of the semantic web is NOT actually very conducive to the needs of scientific communication (or perhaps many other practical uses, given the very slow uptake of SemWeb 10 years on). I wrote about this last year here:
  
  http://arthur.shumwaysmith.com/life/content/where_the_semantic_web_goes_wrong_thoughts_on_web_30
  
  and let me quote myself: “In my view the semantic web vision is actually backwards: trust,
  provenance, and context need to be at the foundation of meaning, not
  tacked on at the top layer for “intelligent agents” to try to figure
  out. A given statement is meaningless out of context. Recognition of the
  existence of multiple contexts allows us to handle mutually
  inconsistent statements without human rationality completely breaking
  down, so “intelligent agents” will need to do this too. RDF and the
  “giant global graph” tries to define a universal context, but I believe
  such a project can never succeed (outside of, possibly, mathematics)
  because the real world is just too messy.”
  1. Cameron Neylon says:
    
    February 22, 2012 at 8:43 pm
    
    Agreed, and that was part of the discussion that would be good to provoke. Those are some significant issues you are raising and there hasn’t always been a good match from SemWeb thinking to research as we do it in practice. On the other hand there is an awful lot of potential there that we clearly want to get at as well – so what’s the overall architecture and how do we best approach getting it right?

BitTorrent for science?

Carl Boettiger says:

February 29, 2012 at 2:05 am

Perhaps you’ve already seen biotorrents? Paper: http://www.plosone.org/article/info:doi/10.1371/journal.pone.0010071 Site: http://www.biotorrents.net/about.php

(Though I gather @CameronNeylon is not so concerned about transferring the actual bits as doing so in an intelligible way).

Pingback: Software Carpentry » Granules of Research

I find myself wondering …

What if we just use github itself? Couldn’t that be our “publishing” platform?

Cameron Neylon says:

February 28, 2012 at 4:31 pm

That’s kind of my point with all of these “x for science” things. You need a compelling reason to not just use X. I’ve thought about doing stuff through github and others have done things from time to time. It works. It’s not perfect but it’s good enough for a lot of stuff.

As someone who is posting their daily research to github, I recognize what @CameronNeylon is saying about the broader lack of standards to move research around, even though it works well for code.

I think this misses the bigger issue though: I feel the social challenges are far greater than the technical ones. Wired’s github piece already alludes to this. Obviously github didn’t invent git; it solved a primarily social problem instead. I believe many researchers are sharing exactly as much as they want to. Incentivize them to share more, and collectively we’ll discover much better solutions to the technical barriers those of us trying to share already bang our heads against. Maybe?

Cameron Neylon says:

February 29, 2012 at 7:45 am

I can’t argue with that. At some level the standards problem I sketch out *is* the social problem. So TCP/IP and HTTP are solutions to social problems – how do we agree to send packets across a network and how do we request them in a standardised form. They are implemented as technical standards but the social problems were the ones solved by development and critically adoption of those standards. If we could agree a standard then we would also have solved the social problem for research – the problem of course is to get enough people to the point where they recognise there is a problem. And you’re dead right – the incentives are core to that.

Pingback: Transactions in research on the web: research objects but also the synthesis and transfer of ideas / conceets / hpotheses | Confessions of a marine chemist
