Implementing the “Publication as Aggregation”

Google Wave Lab mockup
Image by cameronneylon via Flickr

I wrote a few weeks back about the idea of re-imagining the formally published scientific paper as an aggregation of objects. The idea behind this is that it provides both continuity, through enabling the display of such papers in more or less the same way as we do currently, enhancing functionality by, for instance, embedding active versions of figures in a native form, and at the same time providing a route towards a linked data web for research.

Fundamentally the idea is that we publish fragments, and then aggregate these fragments together. The mechanism of aggregation is supposed to be an expanded version of the familiar paradigm of citation: the introduction of a paper will link to and cite other papers as usual, and these would be incorporated as links and citations within the presentation of the paper. But in addition the paper will cite the introduction. By default the introduction would be included in the view of the paper presented to the user, but equally the user might choose to only look at figures, only conclusions, or only the citations. Figures would be citations to data, again available on the web, again with a default visualization that might be an embedded active graph or a static image.

I asserted that the tools for achieving this are more or less in place. Actually that is only half true. The tools for storing, displaying, and even to some extent archiving communications in this form do exist, at least in the form of examples.

An emerging standard for aggregated objects on the web is the Open Archives Initiative – Object Re-use and Exchange (OAI-ORE). The OAI-ORE object is a generic description of the address of a series of things on the web and how they relate to each other. It is the natural approach for representing the idea of a paper as an aggregation of pieces. The OAI-ORE object itself is RDF with no concept of how it should be displayed or used. It is just a set of things, each labeled with their role in the overall object. In principle at least makes it straightforward to display it in any number of ways. A simple example would be converting OAI-ORE to the NLM-DTD XML format. The devil as always, is in the detail here but this makes a good first pass technical requirement for the details of how the pieces of the OAI-ORE object are described; it must be straightforward to convert to NLM-DTD.

Once we have both the collection of pieces, their relationship to each other, and the idea that we can choose to display some or all of these pieces in any way we choose then a lot of the rest falls into place. Figures can be data objects, which have a default visualization method. These visualizations can be embedded in the way which is now familiar with audio and video files. But equally references to gene names, structures, and chemical entities could be treated the same way. Want the chemical name? Just click a button, the visualization tool will deliver that. Want the structure? Again, just click the button, toggle the menu, or write the script to ask for it in that form if you are doing that kind of thing. We would need more open standards for embedding objects; probably less Flash, and more open standards but that’s a fairly minor issue.

There needs to be some communication between the citing object (the paper) and the cited object (the data, figure, text, external reference). This could be built up from the TrackBack or Pingback protocols. There also needs to be default content negotiation: “I want this data, what can you give me? Graph? Table?…ok I’ll take the graph…” That’s just a RESTful API, something which is more or less standard for the consumer web data services but which is badly missing on the research web. None of this is actually terribly difficult and there are good tools out there to do it.

But I said that we only had half solved. The other side of the problem is good authoring and publishing tools. All of the above assumes that these OAI-ORE objects exist or can be easily built, and that the pieces we want to aggregate are already on the web, ready to be pointed at and embedded. They are not. We have two fundamental problems. First we have to get these things onto the web in a useful and re-useable form. Some of this can be done with existing data services such as ChemSpider, Genbank, PubChem, GEO, etc. but that is the easy end of the problem. The hard bit is the heterogenous mass of pieces of data, Excel spreadsheets, CSV files, XML and binaries, that make up the majority of the research outputs we generate.

Publication could be made easy, using automatic upload tools and lightweight data services that provide a place for them on the web. The criticism is often made that “just publishing” is not enough because there is no context. What is often missed is that the best way to provide context is for the person who generated the research object to link it in to a larger record. The catch is, for that to be useful they have to publish it to the web first, otherwise the record they create points at a local and inaccessible object. So we need tools that simply push the raw material up onto the web, probably in the short to medium term to secure servers, but ones where the individual objects can be made public at some point.

So the other tools we need are for authoring these documents. These will look and behave like a Word Processor (or like a LaTeX document for those who prefer that route) but with a clever reference manager and citation creator. Today our reference libraries only contain papers. But imagine that your library contained all of the data you’ve generated as well and that the easiest way to write up your lab notebook was to simply right click and include the reference, select the visualization that you want and you’re done. All the details of default visualizations, of where the data really is, of adding the record to the OAI-ORE root node, all of this is done for you behind the scenes. You might need to select lumps of text to say what their role is, probably using some analogue of styles, similar to the way that the Integrated Content Environment (ICE) system does.

This environment could be built for Word, for Open Office, for LaTeX. One of the reasons I remain excited about Google Wave is that it should be relatively easy to prototype such an environment, the hooks are already there in a way that they aren’t in traditional document authoring tools because Wave is much more web native. It will however take quite a lot of work. There is also a chicken and egg problem in that such an environment isn’t a whole lot of use without the published objects to aggregate together, and the publication services to provide rich views of the final aggregated documents.  It will be quite a lot of work to build all of these pieces, and it will take some time before the benefits become clear. But I think it is a direction that is worth pursuing because it takes the best of what we already know works on the web and applies it to an evolutionary adaption of the communication style that is already familiar. The revolution comes once the pieces are there for people to work with in new ways.

Reblog this post [with Zemanta]

4 Replies to “Implementing the “Publication as Aggregation””

  1. Have you looked at Utopia Documents for an idea of how this could look. They offer a lot of this kind of functionality in the here and now.

  2. I guess I struggled a bit with Utopia because it is in the context of a pdf which just seems completely the wrong direction to me. Seems to me that display isn't a problem really, the web can do it all already, and I didn't really understand why they didn't expose all of that information more on the html version. For me it's the authoring side that needs the work. On the other hand the continued preference for PDF does seem to be pushing things in the direction of laid out content. See under iPads and magazines for instance.

  3. Hi Cameron, great blog! The ideas you propose in your blog are similar as I have seen being developed in the Netherlands by various institutions and supported by SURFfoundation. The joint effort of several Dutch universities have created a lot of knowledge and tools for creating so called “enhanced publications” which use OAI-ORE to describe the semantic relations between scientific web-resources.
    read more: http://bit.ly/au3HkR

    Yes, There are author tools for these aggregations! Many are domain specific, but one tool is generic that came out of the tender projects. see: http://escape.utwente.nl/ It is public so try and test yourself. If things go well there might be pilots running for creating aggregations during the publication process for open access journals, transferring the supplementary material to long term preservation archives and preserving the link interity by making use of persistent identifiers.

    Work that needs to be done:
    Regarding visualisations; we distinguish two kind of visualisations. 1. the visualisation of the data (like IBM's Many-Eyes, or Google's Gapminder) and 2. the visualisation of the aggregation/the relations interconnecting scientific web-resources. These visualisations could possibly be generically available in the near future using Rich Internet Applications, in order to embed in webpages. We are preparing to work in cooperation with several universities and institutions for semantic web and media design.

    Regarding OAI-ORE Datamodeling: in order to create visualisations for generic use, like HTML, the RDF must be modelled according to specific rules. Also here several universities are working together to create a more strict standard, at least for generic use by the Dutch universities. Apart from these core set of rules there is room for extensions that describe the specific domain-related stuff. This then can easily be added in the RDF and used by domain specific visualisations.

    Citing: for citing trusted scientific content that scales, millions of persistent identifiers can be used in the urn:nbn namespace, for publications, dataset, or even cells in a table. These id’s are minted by the national libraries who create a trusted information space, guaranteeing technical quality and persistence of the content. Using resolvers, persistent identifiers and a strongly committed preservation policy, the content can be independent for future (re-)location, duplication and future technology. These Persistent Identifiers make it ideal for Enhanced Publications to refer to scientific webcontent, creating trust and reassurance in future web-presence and availability of the digital content.

    Link-back citation: scientific web-resources are themselves unaware they are used as part of an aggregation. The aggregation, created in an authoring or storage environment, should notify the resource that is has been used as part of the aggregation. This mechanism can be automated using the XML-RPC protocol for Ping-back, (less spam sensitive then track-back) which is also used in modern blog systems. The web-resource keeps track of the aggregations it is part of creating a ORE resource-map itself, with all the URI’s of the aggregations it is has been told to be a part of. Making use of ping-back, automatic citation discovery on resource level will occur.

    Please do let me know what you think of these developments.

Comments are closed.