How do we build the science data commons? A proposal for a SciFoo session
I realised the other day that I haven’t written an excitable blog post about getting an invitation to SciFoo! The reason for this is that I got overexcited over on FriendFeed instead and haven’t really had time to get my head together to write something here. But in this post I want to propose a session and think through what its focus and scope might be.
I am a passionate advocate of two things that I think are intimately related. First, I believe strongly in the need for, and the benefits that will arise from, building, using, and enabling the effective search and processing of a scientific data commons. I [1,2] and others (including John Wilbanks, Deepak Singh, and Plausible Accuracy) have written on this quite a lot recently. Second, I believe strongly in the need for effective, useable, and generic tools to record science as it happens and to process that record so that others can use it effectively. To me these two things are intimately related. By providing the tools that enable the record to be created and integrating them with the systems that will store and process the data commons we can enable scientists to record their work better, communicate it better, and make it available as a matter of course to other scientists (not necessarily immediately I should add, but when they are comfortable with it).
At last year’s SciFoo Chris DiBona ran a session called ‘Give us the Data’ [Ed. As noted below Jon Trowbridge led the session with Chris DiBona acting as 'session chair']. Now I wasn’t there but Deepak Singh wrote the session up in a blog post which is my main inspiration. The idea there seemed to be that people would send in big data sets and Google could slurp these up for processing. Now the world has moved on from last year and I believe we can propose something much more ambitious with the tools we have available. We can build the systems, or at least the prototypes of systems, that will be so useful for recording and sharing data that people will want to use them. Google AppEngine, Amazon Web Services, Freebase, CouchDB, Twine, and the growing efforts in actually making controlled vocabularies useful and useable, as well as a growing web of repositories and services, make this look like an achievable target in a way that it wasn’t twelve months ago. Once we start building these repositories of data then there will be a huge incentive both for scientists and big players such as Google, Microsoft, Yahoo, IBM, and Amazon to look at how to effectively process, re-process, and present this data. That will be exciting, but first we need to build the tools and systems that will enable us to put the data there.
What does this ‘system’ look like? I’m going to suggest a possible model, based on my non-technical understanding of how these things work. I will probably get the details wrong (and please comment where I do). Let’s start at the storage end. We are only interested in open repositories, where it is explicitly stated that data is in the public domain and therefore freely re-useable. Some of these repositories exist for specific domains: ChemSpider, NMRShiftDB, and crystallography databases in the chemistry domain; GenBank, PDB, and many others in the biosciences domain. To be honest, for most large datasets repositories already exist. The problem is not ‘big science’, the problem is ‘small science’ or ‘long tail’ science. The vast quantity of data and description that is generated by small numbers of people in isolated labs. Huge amounts of data languishing on millions of laptops in non-standard formats. All that stuff that no-one can really be bothered dealing with.
The storage itself is not really a problem. It can go on Amazon S3, Google servers, or into the various Institutional Repositories that Universities are rapidly deploying internationally. In fact, it should go into all three and a few other places besides. Storage is cheap, redoing the experiment is not. The digital curation people are also working on the format problem with standards like ORE providing a means of wrapping collections of data up and describing the format internally. Combine this with an open repository of data formats and the problem of storing and understanding data goes away, or rather it is still a large technical problem but one that is reasonably soluble with sufficient effort. One thing that is needed is standards for describing data formats; both standards in terms of the description and quality standards for what are the minimal requirements for describing data (MiSDaFD anyone?). A lot of work is required in the area of controlled vocabularies and ontologies, particularly in science domains where they haven’t yet got much traction. These don’t need to be perfect, they just need to be descriptive. But once again, how to go about doing this is reasonably clear (devil is in the details I know).
The fundamental problem is how to capture the data and associated metadata, and do the wrapping, in a way that is easy and convenient enough that scientists will actually do it. It will come as no surprise to the regular readers of this blog that I would advocate the development of tools to enable scientists to do this effectively at the stage at which they already record their experiments: when they are done. Not only is this the point where the scientist already has the task of planning, executing, and recording their experiment, it is also the point at which embedding good metadata and recording practice will be of the most benefit to them.
So we need to build a lab book system, or a virtual research environment if you prefer, because this will encompass much more than the traditional paper notebook. It will also encompass much more than the current crop of electronic lab notebooks. I don’t want to go into the details of what I think this should look like here; that’s the subject for a post in its own right. But to give an outline, this system will consist of a feed aggregator (not unlike FriendFeed) where the user can subscribe to various data streams: some from people, some from instruments or sensors, and some that they have generated themselves, such as pictures on Flickr or comments on Twitter. The lab book component itself will look more or less like a word processor, or a blog, or a wiki, depending on what the user wants. In this system the user will generate documents (or pages); some will describe samples, some procedures, and some ideas. Critically this authoring system will have an open plugin architecture supported by an open source community that enables the user to grab extra functionality automatically.
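To make the feed aggregator idea a little more concrete, here is a minimal sketch in Python of merging several data streams into a single timeline. Everything in it is an assumption made for illustration: the `FeedItem` record, its fields, and the `aggregate` function are hypothetical names, and a real aggregator would also have to fetch and parse the feeds.

```python
import heapq
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Iterator


@dataclass(frozen=True)
class FeedItem:
    """Hypothetical record for one entry in a data stream."""
    source: str          # a person, an instrument, or the user themselves
    timestamp: datetime  # when the item was produced
    summary: str         # short human readable description


def aggregate(*streams: Iterable[FeedItem]) -> Iterator[FeedItem]:
    """Merge any number of streams into one timeline, oldest item first."""
    # Sort each stream on its own first, then merge the sorted streams.
    ordered = [sorted(stream, key=lambda item: item.timestamp) for stream in streams]
    return heapq.merge(*ordered, key=lambda item: item.timestamp)


# Illustrative streams: one from an instrument, one from people.
instrument = [
    FeedItem("plate-reader", datetime(2008, 7, 1, 10, 30), "absorbance run finished"),
]
people = [
    FeedItem("me", datetime(2008, 7, 1, 11, 15), "uploaded gel photo"),
    FeedItem("colleague", datetime(2008, 7, 1, 9, 0), "posted a new protocol"),
]
timeline = list(aggregate(instrument, people))
```

The point of the sketch is only that items from people and from instruments can sit in the same stream once they share a small common shape.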
A simple plugin might recognise a DNA sequence, or an InChI code, and provide a functional graphical display, complete with automatically generated links back to the appropriate databases. Another plugin might provide access to a range of ontologies or controlled vocabularies which enable autocompletion of typed text with the appropriate name, or that easily, or even automatically, generate the PDB/SBML/MiBBI compliant datafiles required for deposition. Journal submission templates could also be plugins or gadgets that assist in pulling all the relevant information together for submission and wrapping it up in an appropriate format. Semantic authoring at the journal submission stage is never going to work. Building it in early on and sucking it through when required just might. Adding semantic information could be as easy as (but probably should be no easier than) formatting text. Select, right click, add semantics from drop down menu.
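As a rough illustration of what such a recogniser plugin could look like, here is a minimal Python sketch. The registry, the `plugin` decorator, and the `annotate` function are hypothetical names invented for this example, and the two patterns are deliberately crude assumptions: a run of 20 or more A, C, G, or T characters counts as DNA, and anything starting with `InChI=1/` or `InChI=1S/` counts as an InChI. A real plugin would validate its matches and attach the database links.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical plugin registry: plugin name -> recogniser function.
Match = Tuple[int, int, str]
PLUGINS: Dict[str, Callable[[str], List[Match]]] = {}


def plugin(name: str):
    """Decorator that registers a recogniser under the given name."""
    def register(func):
        PLUGINS[name] = func
        return func
    return register


# Assumption: 20 or more unbroken A/C/G/T characters is a DNA sequence.
DNA_PATTERN = re.compile(r"\b[ACGT]{20,}\b")
# Assumption: an InChI runs from its "InChI=1/" or "InChI=1S/" prefix to the next space.
INCHI_PATTERN = re.compile(r"InChI=1S?/\S+")


@plugin("dna")
def find_dna(text: str) -> List[Match]:
    return [(m.start(), m.end(), m.group()) for m in DNA_PATTERN.finditer(text)]


@plugin("inchi")
def find_inchi(text: str) -> List[Match]:
    return [(m.start(), m.end(), m.group()) for m in INCHI_PATTERN.finditer(text)]


def annotate(text: str) -> List[dict]:
    """Run every registered plugin over the text and return hits in document order."""
    hits = []
    for name, recognise in PLUGINS.items():
        for start, end, value in recognise(text):
            hits.append({"plugin": name, "start": start, "end": end, "value": value})
    return sorted(hits, key=lambda hit: hit["start"])


note = ("Primer ATGCGTACGTTAGCCTAGGCTA was used with ethanol "
        "InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3 as solvent.")
hits = annotate(note)
```

Adding a new recogniser is then just a matter of writing one more decorated function, which is roughly the kind of low barrier an open plugin community would need.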
This environment will have a proper versioning system, web integration to push documents into the wider world, and a WYSIWYG interface. It will expose human readable and machine readable versions of the files, and will almost certainly use an underlying XML document format. It will probably expose microformats, RDF, and free text tags.
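A minimal sketch of exposing the same entry in a human readable and a machine readable form might look like the following Python. The entry fields and the XML element names (`entry`, `title`, `body`, `tags`, `tag`) are assumptions invented for this example and not a proposed schema; only the standard library is used.

```python
import xml.etree.ElementTree as ET


def to_xml(entry: dict) -> str:
    """Machine readable version: a small XML document (hypothetical element names)."""
    root = ET.Element("entry", {"version": str(entry["version"])})
    ET.SubElement(root, "title").text = entry["title"]
    ET.SubElement(root, "body").text = entry["body"]
    tags = ET.SubElement(root, "tags")
    for tag in entry["tags"]:
        ET.SubElement(tags, "tag").text = tag
    return ET.tostring(root, encoding="unicode")


def to_text(entry: dict) -> str:
    """Human readable version of the same entry."""
    return (
        f"{entry['title']} (version {entry['version']})\n"
        f"{entry['body']}\n"
        f"Tags: {', '.join(entry['tags'])}"
    )


# Illustrative entry; the content is made up for the example.
entry = {
    "title": "PCR of sample 12",
    "version": 3,
    "body": "Ran 30 cycles with the standard primer pair.",
    "tags": ["pcr", "sample-12"],
}
xml_version = to_xml(entry)
text_version = to_text(entry)
```

Both versions come from the same underlying record, so the free text tags a person reads are the same ones a program can pull out of the XML.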
I think there are essentially two routes towards this kind of system. One is to build into existing word processor and document handling ecosystems. Essentially building a plugin architecture for Word or OpenOffice. This would have the advantage of a system that is already widely used and familiar to scientists. It has the disadvantage (at least in the Microsoft case) that it would be built on top of a closed source product and be reliant on the continued existence and support of a large company. The alternative approach is to bring a completely open source product through. This will require much more work, particularly on the user interface side, but may be easier to handle in terms of open source plugin architecture. It could be entirely web based. Google AppEngine provides a very interesting environment for building such a thing. CouchDB may be a good way of maintaining the underlying data. Google Gadgets may well provide a lot of the desired plugin architecture, and widgets of various types, JavaScript, bookmarklets, and all the various tools on the web may provide a lot of the rest.
The purpose of this system is ultimately to encourage people to place research results in the public domain in a way that makes them useful. For it to get traction it has to be a no-brainer for people to adopt as a central part of the way they run their laboratory. If it doesn’t make people’s lives easier, a lot easier, it’s not going to get any traction. The user interface will be crucial, reliability will be crucial, and the availability of added functionality which is currently simply unavailable will be crucial.
Agreeing an architecture is likely to be a challenge but an open source project is probably a more effective way of leveraging effort from a wide community. There are many open standards available that can be used here and the key is getting them all working together. Ideally the choice of authoring environment won’t matter because it can still push documents to the web in the same format(s), still use the same plugins, still interface with the databases. Building this will cost money, quite a lot of money, but the flip side is that it gets the data and the procedures into the public domain. As I heard Christoph Steinbeck say yesterday, we can either get on and do this, or wait for the singularity. I think we’ve got a while to wait for that, so I think it’s time to get on with the build. Who wants to be involved?












