Implementing the “Publication as Aggregation”

[Image: Google Wave Lab mockup, by cameronneylon via Flickr]

I wrote a few weeks back about the idea of re-imagining the formally published scientific paper as an aggregation of objects. The idea is that this provides continuity, by enabling such papers to be displayed in much the same way as they are now; enhanced functionality, for instance by embedding active versions of figures in their native form; and, at the same time, a route towards a linked data web for research.

Fundamentally the idea is that we publish fragments, and then aggregate these fragments together. The mechanism of aggregation is an expanded version of the familiar paradigm of citation: the introduction of a paper would link to and cite other papers as usual, and these would be incorporated as links and citations within the presentation of the paper. But in addition the paper would cite the introduction itself. By default the introduction would be included in the view of the paper presented to the user, but equally the user might choose to look only at figures, only at conclusions, or only at the citations. Figures would be citations to data, again available on the web, again with a default visualization that might be an embedded active graph or a static image.
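The cite-and-filter model above can be sketched in a few lines. To be clear, everything here is invented for illustration: the class names, the role labels, and the example URIs are not part of any standard.

```python
from dataclasses import dataclass, field

# Hypothetical sketch: a "paper" as an aggregation of addressable
# fragments, each cited by its web address and labeled with a role.

@dataclass
class Fragment:
    uri: str   # where the fragment lives on the web
    role: str  # e.g. "introduction", "figure", "conclusion", "citation"

@dataclass
class Paper:
    uri: str
    fragments: list = field(default_factory=list)

    def cite(self, fragment: Fragment) -> None:
        """Aggregate a fragment by citing its web address."""
        self.fragments.append(fragment)

    def view(self, roles=None):
        """Return the fragments a reader chose to see.

        By default every cited fragment is shown; passing a set of
        roles filters the view, e.g. figures only."""
        if roles is None:
            return list(self.fragments)
        return [f for f in self.fragments if f.role in roles]

paper = Paper("http://example.org/paper/1")
paper.cite(Fragment("http://example.org/paper/1/intro", "introduction"))
paper.cite(Fragment("http://example.org/data/42", "figure"))
paper.cite(Fragment("http://example.org/paper/7", "citation"))

figures_only = paper.view(roles={"figure"})
```

The point of the sketch is that "show me the whole paper" and "show me just the figures" are the same operation with different filters; the paper itself is only a list of citations.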

I asserted that the tools for achieving this are more or less in place. Actually that is only half true. The tools for storing, displaying, and even to some extent archiving communications in this form do exist, at least in the form of examples.

An emerging standard for aggregated objects on the web is Open Archives Initiative Object Reuse and Exchange (OAI-ORE). An OAI-ORE object is a generic description of the addresses of a series of things on the web and how they relate to each other. It is the natural approach for representing the idea of a paper as an aggregation of pieces. The OAI-ORE object itself is RDF, with no concept of how it should be displayed or used. It is just a set of things, each labeled with its role in the overall object. In principle at least, this makes it straightforward to display in any number of ways. A simple example would be converting OAI-ORE to the NLM-DTD XML format. The devil, as always, is in the detail, but this makes a good first-pass technical requirement for how the pieces of the OAI-ORE object are described: it must be straightforward to convert to NLM-DTD.
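To make the shape of this concrete, here is a minimal, hand-rolled sketch of an ORE resource map serialised as Turtle. The `ore:` terms (ResourceMap, Aggregation, describes, aggregates) come from the ORE vocabulary itself; the example URIs are invented, and a real implementation would use a proper RDF library rather than string formatting.

```python
# Namespaces: ORE is the real Object Reuse and Exchange vocabulary;
# dcterms:type is used here (an illustrative choice, not mandated by
# ORE) to record each piece's role so a converter to NLM-DTD knows
# what it is looking at.
ORE = "http://www.openarchives.org/ore/terms/"
DC = "http://purl.org/dc/terms/"

def resource_map(map_uri, aggregation_uri, parts):
    """Serialise an aggregation and its parts as Turtle text.

    parts is a list of (uri, role) pairs."""
    lines = [
        f"@prefix ore: <{ORE}> .",
        f"@prefix dcterms: <{DC}> .",
        "",
        f"<{map_uri}> a ore:ResourceMap ;",
        f"    ore:describes <{aggregation_uri}> .",
        "",
        f"<{aggregation_uri}> a ore:Aggregation ;",
    ]
    lines += [f"    ore:aggregates <{uri}> ;" for uri, _ in parts]
    lines[-1] = lines[-1].rstrip(" ;") + " ."  # close the last statement
    for uri, role in parts:
        lines.append(f'<{uri}> dcterms:type "{role}" .')
    return "\n".join(lines)

ttl = resource_map(
    "http://example.org/rem/1",
    "http://example.org/paper/1",
    [("http://example.org/paper/1/intro", "introduction"),
     ("http://example.org/data/42", "figure")],
)
```

Because the output is plain RDF with typed pieces, a display layer (or an NLM-DTD converter) can walk the `ore:aggregates` links and decide per role how to render each piece.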

Once we have the collection of pieces, their relationship to each other, and the idea that we can choose to display some or all of these pieces in any way we choose, then a lot of the rest falls into place. Figures can be data objects, each with a default visualization method. These visualizations can be embedded in the way that is now familiar for audio and video files. But equally, references to gene names, structures, and chemical entities could be treated the same way. Want the chemical name? Just click a button and the visualization tool will deliver it. Want the structure? Again, click the button, toggle the menu, or write a script to ask for it in that form if you are doing that kind of thing. We would need more open standards for embedding objects (probably less Flash), but that's a fairly minor issue.

There needs to be some communication between the citing object (the paper) and the cited object (the data, figure, text, or external reference). This could be built up from the TrackBack or Pingback protocols. There also needs to be default content negotiation: “I want this data, what can you give me? Graph? Table?…ok I’ll take the graph…” That’s just a RESTful API, something which is more or less standard for consumer web data services but badly missing on the research web. None of this is terribly difficult, and there are good tools out there to do it.
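The “what can you give me?” exchange is ordinary HTTP content negotiation via the Accept header. A sketch, with an invented data registry and invented media-type choices standing in for a real data service:

```python
# Hypothetical registry: one data object, three representations.
AVAILABLE = {
    "http://example.org/data/42": {
        "image/svg+xml": "<svg>...</svg>",            # default graph view
        "text/csv": "x,y\n1,2\n2,4\n",                # raw table
        "application/json": '{"x": [1, 2], "y": [2, 4]}',
    }
}

def negotiate(data_uri, accept):
    """Pick the first representation the client accepts.

    accept is the raw Accept header, e.g. "text/csv, image/svg+xml";
    */* falls back to the service's default (the graph)."""
    reps = AVAILABLE[data_uri]
    for wanted in [a.split(";")[0].strip() for a in accept.split(",")]:
        if wanted == "*/*":
            return "image/svg+xml", reps["image/svg+xml"]
        if wanted in reps:
            return wanted, reps[wanted]
    return None, None  # a real service would answer 406 Not Acceptable

ctype, body = negotiate("http://example.org/data/42", "text/csv")
```

A browser embedding the figure would ask for the SVG; a script re-analysing the data would ask for CSV or JSON; both are the same citation resolved through the same address.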

But I said we had only half solved the problem. The other half is good authoring and publishing tools. All of the above assumes that these OAI-ORE objects exist or can be easily built, and that the pieces we want to aggregate are already on the web, ready to be pointed at and embedded. They are not. We have two fundamental problems. First, we have to get these things onto the web in a useful and re-usable form. Some of this can be done with existing data services such as ChemSpider, GenBank, PubChem, GEO, etc., but that is the easy end of the problem. The hard bit is the heterogeneous mass of data, the Excel spreadsheets, CSV files, XML, and binaries, that makes up the majority of the research outputs we generate.

Publication could be made easy, using automatic upload tools and lightweight data services that provide a place for the material on the web. The criticism is often made that “just publishing” is not enough because there is no context. What is often missed is that the best way to provide context is for the person who generated the research object to link it into a larger record. The catch is that, for this to be useful, they have to publish the object to the web first; otherwise the record they create points at a local and inaccessible object. So we need tools that simply push the raw material up onto the web, probably in the short to medium term to secure servers, but ones where the individual objects can be made public at some point.
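The shape of such a lightweight “just push it up” service might look like this sketch: every raw file gets a stable web address immediately, but stays private until the owner flips it public. The service, its URL scheme, and the content-hash addressing are all assumptions of mine, not a description of any existing system.

```python
import hashlib

class DataService:
    """Hypothetical lightweight data service: upload now, expose later."""
    BASE = "https://data.example.org/"

    def __init__(self):
        self._store = {}  # url -> [raw bytes, public?]

    def push(self, raw: bytes) -> str:
        """Upload raw bytes; the content hash becomes the address,
        so re-uploading the same file yields the same stable URL."""
        url = self.BASE + hashlib.sha256(raw).hexdigest()[:16]
        self._store.setdefault(url, [raw, False])  # private by default
        return url

    def make_public(self, url: str) -> None:
        self._store[url][1] = True

    def fetch(self, url: str) -> bytes:
        raw, public = self._store[url]
        if not public:
            raise PermissionError("object not yet public")
        return raw

svc = DataService()
url = svc.push(b"x,y\n1,2\n")
```

The key property is that the lab-notebook record can cite `url` from the moment of upload; making the object public later does not change the address the record points at.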

So the other tools we need are for authoring these documents. These will look and behave like a word processor (or a LaTeX editor, for those who prefer that route), but with a clever reference manager and citation creator. Today our reference libraries contain only papers. But imagine that your library contained all of the data you’ve generated as well, and that the easiest way to write up your lab notebook was simply to right-click, include the reference, and select the visualization you want. All the details of default visualizations, of where the data really is, of adding the record to the OAI-ORE root node, would be handled for you behind the scenes. You might need to mark up lumps of text to say what their role is, probably using some analogue of styles, much as the Integrated Content Environment (ICE) system does.
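The “right-click and include” behaviour might reduce to something like the sketch below: one call that works for papers and data alike, embedding data with its default visualisation and recording everything cited into the document’s aggregation. The library entries, the embed markup, and the function name are all invented for illustration.

```python
# Hypothetical reference library holding both papers and data objects.
LIBRARY = {
    "smith2009": {"kind": "paper", "uri": "https://doi.org/10.1000/xyz"},
    "kinetics-run-3": {"kind": "data", "uri": "http://example.org/data/42",
                       "default_view": "graph"},
}

def insert_reference(key, aggregated=None):
    """Return the markup for a reference and record its URI in the
    document's aggregation (the future OAI-ORE root node)."""
    entry = LIBRARY[key]
    if aggregated is not None:
        aggregated.append(entry["uri"])
    if entry["kind"] == "data":
        # Data gets embedded with its default visualisation.
        return f'<embed src="{entry["uri"]}" view="{entry["default_view"]}"/>'
    # Papers get an ordinary citation link.
    return f'[cite: {entry["uri"]}]'

agg = []
fig = insert_reference("kinetics-run-3", agg)
ref = insert_reference("smith2009", agg)
```

The author never thinks about where the data lives or how the aggregation is maintained; citing a dataset and citing a paper feel like the same gesture.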

This environment could be built for Word, for OpenOffice, or for LaTeX. One of the reasons I remain excited about Google Wave is that it should be relatively easy to prototype such an environment there: the hooks are already present in a way they aren’t in traditional document authoring tools, because Wave is much more web-native. It will, however, take quite a lot of work. There is also a chicken-and-egg problem, in that such an environment isn’t a whole lot of use without the published objects to aggregate together, and the publication services to provide rich views of the final aggregated documents. Building all of these pieces will take time, and it will take longer still before the benefits become clear. But I think it is a direction worth pursuing, because it takes the best of what we already know works on the web and applies it to an evolutionary adaptation of a communication style that is already familiar. The revolution comes once the pieces are there for people to work with in new ways.


The future of research communication is aggregation

[Image: Paper as aggregation, by cameronneylon via Flickr]

“In the future everyone will be a journal editor for 15 minutes” – apologies to Andy Warhol

Suddenly it seems everyone wants to re-imagine scientific communication. From the ACS symposium a few weeks back to a PLoS Forum, via interesting conversations with a range of publishers, funders and scientists, it seems a lot of people are thinking much more seriously about how to make scientific communication more effective, more appropriate to the 21st century and above all, to take more advantage of the power of the web.

For me, the “paper” of the future has to encompass much more than the narrative descriptions of processed results we have today. It needs to support a much more diverse range of publication types: data, software, processes, protocols, and ideas. It also needs to provide a rich and interactive means of diving into the detail where the user is interested and skimming over the surface where they are not. It needs to provide re-visualisation and streaming under the user’s control and, crucially, the ability to repackage the content for new purposes: education, public engagement, even mainstream media reporting.

I’ve got a lot of mileage recently out of thinking about how to organise data and records by ignoring the actual formats and thinking more about what the objects I’m dealing with are, what they represent, and what I want to do with them. So what do we get if we apply this thinking to the scholarly published article?

For me, a paper is an aggregation of objects. It contains text, divided into sections, often with references to other pieces of work. Some of these references are internal, to figures and tables, which are representations of data in some form or another. The paper world of journals has led us to think of these as images, but a much better mental model for figures on the web is an embedded object, perhaps a visualisation from a service like Many Eyes, Swivel, or Tableau Public. Why is this better? Because it maps more effectively onto what we want to do with the figure. We want to use it to absorb the data it represents, and to do this we might want to zoom, pan, re-colour, or re-draw the data. But we want to know that if we do this we are still using the same underlying data, so the data needs a home, an address somewhere on the web, perhaps with the journal, or perhaps somewhere else entirely, that we can refer to with confidence.

If that data has an individual identity it can in turn refer back to the process used to generate it, perhaps in an online notebook or record, perhaps pointing to a workflow or software process based on another website. Maybe when I read the paper I want that included, maybe when you read it you don’t – it is a personal choice, but one that should be easy to make. Indeed, it is a choice that would be easy to make with today’s flexible web frameworks if the underlying pieces were available and represented in the right way.

The authors of the paper can also be included as references to unique identifiers. Perhaps the authors of the different segments are different; this is no problem, as each piece can refer to the people who generated it. Funders and other supporting players might be included by reference. Again this solves a real problem of today: different players are interested in how people contributed to a piece of work, not just who wrote the paper. A reference to a person, where the link shows what their contribution was, can provide this much more detailed information. Finally, the overall aggregation of pieces that is brought together and published also has a unique identifier, often in the form of the familiar DOI.
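Contribution-by-reference can be modelled as plain (fragment, person, role) statements, which can then be queried from either direction. The identifier URIs and role labels below are invented placeholders for whatever unique-identifier scheme is actually used.

```python
# Hypothetical contribution statements: each fragment points at the
# identifiers of the people who made it, with a stated role.
contributions = [
    ("http://example.org/paper/1/intro", "https://id.example.org/alice", "wrote"),
    ("http://example.org/data/42",       "https://id.example.org/bob",   "collected"),
    ("http://example.org/data/42",       "https://id.example.org/alice", "analysed"),
]

def contributions_of(person):
    """What did this person actually do, across all fragments?"""
    return [(frag, role) for frag, who, role in contributions if who == person]

def credits_for(fragment):
    """Who contributed to this one piece, and how?"""
    return [(who, role) for frag, who, role in contributions if frag == fragment]

alice = contributions_of("https://id.example.org/alice")
data_credits = credits_for("http://example.org/data/42")
```

The same statements answer the funder’s question (“what did this person contribute?”) and the reader’s question (“who is behind this figure?”) without either one being privileged in the data model.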

This view of the paper is interesting to me for two reasons. The first is that it natively supports a wide range of publication or communication types, including data papers, process papers, protocols, ideas, and proposals. If we think of publication as the act of bringing a set of things together and providing them with a coherent identity, then that publication can be many things with many possible uses. In a sense this is doing what a traditional paper should do, bringing all the relevant information into a single set of pages that can be found together, as opposed to what papers usually do: tick a set of boxes about what a paper is supposed to look like. “Is this publishable?” is an almost meaningless question on the web. Of course it is. “Is it a paper?” is the question we are actually asking. By applying the principles of what the paper should be doing, as opposed to the straitjacket of a paginated, print-based document, we get much more flexibility.

The second aspect I find exciting revolves around the idea of citation as both internal and external references describing the relationships between these individual objects. If the whole aggregation has an address on the web via a DOI or a URL, and if its relationships, both to the objects that make it up and to other things available on the web, are made clear in a machine-readable citation, then we have the beginnings of a machine-readable scientific web of knowledge. If we take this view of objects and aggregates that cite each other, and we provide details of what the citations mean (this was used in that, this process created that output, this paper is cited as an input to that one), then we are building the semantic web as a byproduct of what we want to do anyway. Instead of scaring people with angle brackets, we are using a paradigm that researchers understand and respect, citation, to build up meaningful links between packages of knowledge. We need authoring tools that help us build and aggregate these objects together, and tools that make forming these citations easy and natural by using the existing ideas around linking and referencing. If we can build those, we get the semantic web for science as a free side product, while also making it easier for humans to find the details they’re looking for.
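Typed citations of this kind are just triples, and even a trivial triple store supports the queries that matter: “what used this data?” and “where did this data come from?”. The relation names below are loosely modelled on citation-typing vocabularies such as CiTO, but the specific labels and URIs here are invented.

```python
# Hypothetical typed-citation triples: (citing thing, meaning, cited thing).
links = [
    ("http://example.org/paper/1", "usesDataFrom",   "http://example.org/data/42"),
    ("http://example.org/data/42", "wasGeneratedBy", "http://example.org/protocol/7"),
    ("http://example.org/paper/1", "citesAsRelated", "https://doi.org/10.1000/xyz"),
]

def cited_by_meaning(triples, relation):
    """All (citing, cited) pairs whose citation carries a given meaning."""
    return [(s, o) for s, rel, o in triples if rel == relation]

def provenance(triples, uri):
    """Walk 'wasGeneratedBy' links back from an object to its process."""
    chain = [uri]
    while True:
        nxt = [o for s, rel, o in triples
               if s == chain[-1] and rel == "wasGeneratedBy"]
        if not nxt:
            return chain
        chain.append(nxt[0])

data_uses = cited_by_meaning(links, "usesDataFrom")
trail = provenance(links, "http://example.org/data/42")
```

No angle brackets are exposed to the author; the semantics fall out of choices the author was already making when deciding why to cite something.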

Finally this view blows apart the monolithic role of the publisher and creates an implicit marketplace where anybody can offer aggregations that they have created to potential customers. This might range from a high school student putting their science library project on the web through to a large scale commercial publisher that provides a strong brand identity, quality filtering, and added value through their infrastructure or services. And everything in between. It would mean that large scale publishers would have to compete directly with the small scale on a value-for-money basis and that new types of communication could be rapidly prototyped and deployed.

There are a whole series of technical questions wrapped up in this view, in particular if we are aggregating things that are on the web, how did they get there in the first place, and what authoring tools will we need to pull them together. I’ll try to start on that in a follow-up post.
