freebase – Science in the Open

Capture and Connect: automated data capture — Image by cameronneylon via Flickr

Jon Eisen asked a question on Friendfeed last week that sparked a really interesting discussion of what an electronic research record should look like. The conversation is worth a look as it illustrates different perspectives and views on what is important. In particular I thought Jon’s statement of what he wanted was very interesting:

I want a system where people record EVERYTHING they are doing in their research with links to all data, analyses, output, etc [CN – my italics]. And I want access to it from anywhere. And I want to be able to search it intelligently. Dropbox won’t cut it.

This is interesting to me because it maps onto my own desires: simple systems that make it very easy to capture digital research objects as they are created, and easy-to-use tools that make it straightforward to connect these objects up. This is in many ways the complement of the Research Communication as Aggregation idea that I described previously. By collecting all the pieces and connecting them up correctly we create a Research Record as Aggregation, making it easy to wrap pieces of this up and connect them to communications. It also provides a route towards bridging the divide between research objects that are born digital and those that are physical objects that need to be represented by digital records.

OK. So much for handwaving – what about building something? What about building something this weekend at ScienceHackDay? My idea is that we can use three pieces that have recently come together to build a demonstrator of how such a system might work. Firstly the DropBox API is now available (and I have a developer key). DropBox is a great tool that delivers on the promise of doing one thing well. It sits on your computer and synchronises directories with the cloud and any other device you put it on. It just works. This makes it a very useful entry point for the capture of digital research objects. So Step One:

Build a web service on the DropBox API that lets users (or instruments) subscribe, captures new digital objects as they appear, and exposes them as a feed of resources.
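As a rough illustration of the capture half of Step One, here is a minimal sketch that treats a locally synchronised folder as the entry point and emits one feed entry per new or changed file. It deliberately does not call the DropBox API; it only assumes the folder is already being synced. The function name `scan_for_new_objects` and the entry fields are hypothetical, chosen for this example.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def scan_for_new_objects(watched_dir, seen):
    """Return feed entries for files not yet recorded in `seen`.

    `seen` is a set of (relative_path, sha256) pairs and is updated in place,
    so calling this repeatedly only reports new or modified files.
    """
    root = Path(watched_dir)
    entries = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        rel = path.relative_to(root).as_posix()
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        key = (rel, digest)
        if key in seen:
            continue
        seen.add(key)
        entries.append({
            "id": digest,                     # content hash as a stable identifier
            "path": rel,                      # e.g. "alice/run1.csv"
            "captured": datetime.now(timezone.utc).isoformat(),
        })
    return entries
```

A real service would run this on a timer (or react to change notifications) and publish the entries as Atom or JSON; that part is left out here.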

This will enable us to capture and surface research objects with users simply dropping files into directories on local computers. Using DropBox means these can be natively synchronised across multiple user computers, which is nice. But then we need to connect these objects up, ideally in an automatic way. To do this we need a robust and general way of describing relationships between them. As part of the OREChem project, a collaboration between Cambridge, Southampton, Indiana, Penn State and Cornell Universities and PubChem, supported by Microsoft, Mark Borkum has developed an ontology that describes experiments (unfortunately there is nothing available on the web as yet – but I am promised there will be soon!). Nothing so new there; it has been done before. What is new here is that the OREChem vocabulary describes both plans and instances of carrying out those plans. It is very simple, essentially describing each part of a process as a “stage” which takes in inputs and emits outputs. The detailed description of these inputs and outputs is left for other vocabularies. The plan and the record can have a one-to-one correspondence but don’t need to. It is possible to ask whether a record satisfies a plan and, conversely, to infer from evidence that a plan has been carried out that all the required inputs must have existed at some point.
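Since the OREChem vocabulary is not yet published, the following is only a sketch of the plan/record idea as described here, not the ontology itself. The class names `StagePlan` and `StageRecord` and the functions `satisfies` and `missing_inputs` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StagePlan:
    """What a stage is expected to take in and emit, by kind of object."""
    name: str
    inputs: frozenset
    outputs: frozenset


@dataclass
class StageRecord:
    """What was actually observed: kind of object -> identifier."""
    plan_name: str
    inputs: dict = field(default_factory=dict)
    outputs: dict = field(default_factory=dict)


def satisfies(record, plan):
    """True if the record names the plan and covers all its inputs and outputs."""
    return (record.plan_name == plan.name
            and plan.inputs.issubset(record.inputs)
            and plan.outputs.issubset(record.outputs))


def missing_inputs(record, plan):
    """Inputs the plan requires that must have existed but are not yet recorded."""
    return sorted(plan.inputs - set(record.inputs))
```

For example, a record of a UV-Vis stage that has an output file and an instrument but no sample does not satisfy its plan, and `missing_inputs` reports `["sample"]` as the thing that must have existed.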

Why does this matter? It matters because for a particular experiment we can describe a plan. For instance a UV-Vis spectrophotometer measurement requires a sample and a specific instrument, and emits a digital file, usually in a specific format. If our web service above knows that a particular DropBox account is associated with a UV-Vis instrument and it sees a new file of the right type, it knows that the plan of a UV-Vis measurement must have been carried out. It also knows which instrument was used (based on the DropBox account) and might know who made the measurement (based on the specific folder the file appeared in). The web service is therefore able to infer that there must exist (or have existed) a sample. Knowing this it can attempt to discover a record of this sample from known resources, the public web, or even by emailing the user, asking them for it, and then creating a record for them.
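The inference described above can be sketched in a few lines. Everything named here is an assumption made for illustration: the `INSTRUMENTS` registry, the account name, the `.csv` suffix, and the convention that the first folder in the path identifies the person.

```python
from pathlib import PurePosixPath

# Hypothetical registry: which account belongs to which instrument and plan.
INSTRUMENTS = {
    "uvvis-account": {
        "instrument": "UV-Vis spectrophotometer",
        "plan": "uv-vis measurement",
        "file_suffix": ".csv",
        "requires": ["sample"],
    },
}


def infer_record(account, relative_path):
    """Infer a draft experiment record from the arrival of a file.

    Returns None if the account is unknown or the file type does not match.
    """
    config = INSTRUMENTS.get(account)
    path = PurePosixPath(relative_path)
    if config is None or path.suffix.lower() != config["file_suffix"]:
        return None
    # Assumed convention: the top-level folder is named after the operator.
    operator = path.parts[0] if len(path.parts) > 1 else None
    return {
        "plan": config["plan"],
        "instrument": config["instrument"],
        "operator": operator,
        "output": relative_path,
        # Inputs that must have existed but still need to be found.
        "unresolved_inputs": list(config["requires"]),
    }
```

The `unresolved_inputs` list is the hook for the next step: something has to go looking for the sample record, or ask the person who ran the measurement.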

A quick and dirty way of building a data model and linking it to objects on the web is to use Freebase and the Freebase API. This also has the advantage that we can leverage Freebase Gridworks to add records from spreadsheets (e.g. sample lists) into the same data model. So Step Two:

Implement the OREChem experiment ontology in Freebase. Describe a small set of plans as examples of particular experimental procedures.

And then Step Three:

Expand the web service built in Step One to annotate the captured digital research objects in Freebase and connect them to plans. Attempt to build in automatic discovery of inferred resources from known and unknown sources, with a fallback that asks the user directly.
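The "discover automatically, otherwise ask" behaviour in Step Three amounts to a chain of lookups with a human as the last resort. A minimal sketch, in which `resolve`, the resolver callables and `ask_user` are all hypothetical names, and no real Freebase or email call is made:

```python
def resolve(kind, context, resolvers, ask_user):
    """Try each resolver in order; fall back to asking the user.

    Each resolver is a callable (kind, context) -> identifier or None.
    Returns (identifier, how), where how is "discovered" or "asked".
    """
    for resolver in resolvers:
        found = resolver(kind, context)
        if found is not None:
            return found, "discovered"
    return ask_user(kind, context), "asked"
```

Ordering the resolvers from cheapest and most trusted (a local sample list) to least (a web search) keeps the user from being bothered unless nothing else works.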

Freebase and DropBox may not be the best way to do this but both provide a documented API that could enable something to be lashed up quickly. I’m equally happy to be told that SugarSync, Open Calais, or Talis Connected Commons might be better ways to do this, especially if someone will be at ScienceHackDay with expertise in them. Demonstrating something like this could be extremely valuable as it would actually leverage semantic web technology to do something useful for researchers, linking their data into a wider web, while not actually bothering them with the details of angle brackets.

Dropbox API Lets You Add Cloud Storage to Your Apps (webmonkey.com)
Why the web of data needs to be social (cameronneylon.net)
Preview: Freebase Gridworks (simonwillison.net)
Update: CML, Chem4Word, CheTA, OREChem, Lensfield (wwmm.ch.cam.ac.uk)

Image by cameronneylon via Flickr

If you’ve been around either myself or Deepak Singh you will almost certainly have heard the Jeff Jonas/Jon Udell soundbite: ‘Data finds data. Then people find people’. Jonas is referring to data management frameworks and knowledge discovery and Udell is referring to the power of integrated data to bring people together.

At some level Jonas’ vision (see his chapter [pdf] in Beautiful Data) is what the semantic web ought to enable: the automated discovery of data or objects based on common patterns or characteristics. Thus far in practical terms we have signally failed to make this a reality, particularly for research data and objects.

Udell’s angle (or rather, my interpretation of his overall stance) is more linked to the social web – the discovery of common contexts through shared data frameworks. These contexts might be social groups, as in conventional social networks, a particular interest or passion, or – in the case of Jon’s championing of the iCalendar standard – a date and place, as demonstrated by the elmcity project supporting calendar curation and aggregation. Shared context enables the making of new connections, the creation of new links. But still mainly links between people.

It’s not the scientists who are social; it’s the data – Neil Saunders

The naïve analysis of the success of consumer social networks and the weaknesses of science communication has led to efforts that almost precisely invert the Jonas/Udell concept. In the case of most of these “Facebooks for Scientists” the idea is that people find people, and then they connect with data through those people.

My belief is that it is this approach that has led to the almost complete failure of these networks to gain traction. Services that place the research object at the centre (the reference management and bookmarking services, and to some extent Twitter and Friendfeed) appear to gain much more real scientific use because they mediate the interactions that researchers are interested in, those between themselves and research objects. Friendfeed in particular seems to support this discovery pattern. Objects of interest are brought into your stream, which then leads to discovery of the person behind them. I often use Citeulike in this mode. I find a paper of interest, identify the tags other people have used for it and the papers that share those tags. If these seem promising, I then might look at the library of the person, but I get to that person through the shared context of the research object, the paper, and the tags around that object.
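That object-first discovery pattern (paper, then tags, then papers sharing those tags, and only then the person) is simple enough to sketch. This is an illustration over a plain list of `(user, paper, tag)` tuples, not Citeulike's actual data model or API; the function names are hypothetical.

```python
from collections import Counter


def related_papers(paper, bookmarks):
    """Rank other papers by how many bookmarks share a tag with `paper`.

    `bookmarks` is a list of (user, paper, tag) tuples.
    """
    tags = {tag for _, p, tag in bookmarks if p == paper}
    scores = Counter()
    for _, p, tag in bookmarks:
        if p != paper and tag in tags:
            scores[p] += 1
    return scores.most_common()


def people_behind(paper, bookmarks):
    """Users who bookmarked `paper`, reached via the object rather than directly."""
    return sorted({user for user, p, _ in bookmarks if p == paper})
```

The person only appears at the last step, after the paper and its tags have established the shared context.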

Data, data everywhere, but not a lot of links – Simon Coles

A common complaint made of research data is that people don’t make it available. This is part of the problem but increasingly it is a smaller part. It is easy enough to put data up that many researchers are doing so, in supplementary data of journal articles, on personal websites, or on community or consumer sites. From a linked data perspective we ought to be having a field day with this, even if it represents only a small proportion of the total. However little of this data is easily discoverable and most of it is certainly not linked in any meaningful way.

A fundamental problem that I feel like I’ve been banging on about for years now is the dearth of well-built tools for creating these links. Finally these tools are starting to appear, with Freebase Gridworks being an early example. There is a good chance that it will become easier over time for people to create links as part of the process of making their own record. But the fundamental problems we always face, that this is hard work, and often unrewarded work, are limiting progress.

Data friends data… then knowledge becomes discoverable

Human interaction is unlikely to work at scale. We are going to need automated systems to wire the web of data together. The human process simply cannot keep up with the ongoing annotation and connection of data at the volumes that are being generated today. And we can’t afford not to automate if we want to optimize the opportunities of research to deliver useful outcomes.

When we think about social networks we always place people at their centre. But there is nothing to stop us replacing people with data or other research objects. Software that wants to find data, data that wants to find complementary or supportive data, or wants to find the right software to convert or analyze it. Instead of Farmville or Mafia Wars imagine useful tools that make these connections, negotiate content, and identify common context. As pointed out to me by Paul Walk this is very similar to what was envisioned in the 90s as the role of software agents. In this view the human research users are the poorly connected users on the outskirts of the web.

The point is that the hard part of creating linked data is making the links, not publishing the data. The semantic web has always suffered from the chicken and egg problem of a lack of user-friendly ways to generate RDF and few tools that could really use that RDF in exciting ways even if it did exist. I still can’t do a useful search on which restaurants in Bath will be open next Sunday. The reality is that the innards of this should be hidden from the user, the making of connections needs to be automated as far as possible, and as natural as possible when the user has to be involved. As easy as hitting that “like” button, or right clicking and adding a citation.

We have learnt a lot about the principles of when and how social networks work. If we can apply those lessons to the construction of open data management and discovery frameworks then we may stand some chance of actually making some of the original vision of the web work.

Facebook and the Open Graph: good for Linked Data? (blogs.talis.com)
Freebase Gridworks: A power tool for data scrubbers « Jon Udell (faganm.com)


Capturing and connecting research objects: A pitch for @sciencehackday

Related articles by Zemanta
