Talking to the next generation – NESTA Crucible Workshop

Yesterday I was privileged to be invited to give a talk at the NESTA Crucible Workshop being held in Lancaster. You can find the slides on Slideshare. NESTA, the National Endowment for Science, Technology and the Arts, is an interesting organization funded via a UK government endowment to support innovation and enterprise, and more particularly the generation of a more innovative and entrepreneurial culture in the UK. Among the programmes it runs in pursuit of this is the Crucible programme, in which a small group of young researchers, generally looking for or just in their first permanent or independent positions, attend a series of workshops to get them thinking broadly about the role of their research in the wider world and to help them build new networks for support and collaboration.

My job was to talk about “Science in Society” or “Open Science”. My main theme was the question of how we justify taxpayer expenditure on research, and my argument that this implies an obligation to maximise the efficiency of how we do that research. Research is worth doing, but we need to think hard about how and what we do. Not surprisingly, I focussed on the potential of using web based tools and open approaches to make things happen cheaper, quicker, and more effectively; to reduce waste and maximise the amount of research output for the money spent.

Also not surprisingly there was significant pushback – much of it where you would expect. Concerns over data theft, over how “non-traditional” contributions might appear (or not) on a CV, and over the costs in time were all mentioned. What surprised me most, however, was the pushback against the idea of putting material on the open web rather than in traditional journal formats. There was a real sense that the group had a respect for the authority of the printed, versus the online, word, which really caught me out. I often use a gotcha moment in talks to try and illustrate how our knowledge framework is changed by the web. It goes: “How many people have opened a physical book for information in the last five years?”, followed by “And how many haven’t used Google in the last 24 hours?”. This is shamelessly stolen from Jamie Boyle, incidentally.

Usually you get three or four sheepish hands going up, admitting a personal love of real physical books. Generally it is around 5-10% of the audience, and this has been pretty consistent amongst mid-career scientists in both academia and industry, and people in publishing. In this audience about 75% put their hands up. Some of the books mentioned were specialist “tool” books – mathematical forms, algorithmic recipes – others were specialist texts, and many people referred to undergraduate textbooks. Interestingly they also brought up an issue that I’ve never had an audience raise before: how do you find a route into a new subject area that you know little about, from a source you can trust?

My suspicion is that this difference comes from three places. Firstly, these researchers were already biased towards being less discipline bound by the fact that they’d applied for the workshop. They were therefore more likely to be discipline hoppers, jumping into new fields where they had little experience and needed a route in. Secondly, they were at a stage of their career where they were starting to teach, again possibly slightly outside their core expertise, and were therefore looking for good, reliable material to base their teaching on. Finally, though, there was a strong sense of respect for the authority of the printed word. The printing of the German Wikipedia was brought up as evidence that printed matter was, or at least was perceived to be, more trustworthy. Writing this now I am reminded of the recent discussion on the hold that the PDF has over the imagination of researchers. There is a real sense that print remains authoritative in a way that online material is not. Even though the journal may never be printed, the PDF provides the impression that it could or should be. I would also guess that the group were young enough to be slightly less cynical about authority in general.

Food for thought, but it was certainly a lively discussion. We actually had to be dragged off to lunch because it went way over time (and not, I hope, just because I had too many slides!). Thanks to all involved in the workshop for such an interesting discussion, and thanks also to the Twitter people who replied to my request for 140 character messages. They provided a great way of structuring the talk.

Pub-sub/syndication patterns and post publication peer review

I think it is fair to say that even those of us most enamored of post-publication peer review would agree that its effectiveness remains to be demonstrated in a convincing fashion. Broadly speaking there are two reasons for this. The first is the problem of social norms for commenting. As in, there aren’t any. I think it was Michael Nielsen who referred to the “Kabuki dance of scientific discourse”. It is entirely allowed to stab another member of the research community in the back, or indeed the front, but there are specific ways and forums in which it is acceptable to do so. No-one quite knows what the appropriate rules are for commenting in online fora, as described most recently by Steve Koch.

My feeling is that this is a problem that will gradually go away as we evolve norms of behaviour in specific research communities. The current “rules” took decades to build up. It should not be surprising if it takes a few years or more to sort out an adapted set for online interactions. The bigger problem is the one that is usually surfaced as “I don’t have any time for this kind of thing”. This in turn can be translated as, “I don’t get any reward for this”. Whether that reward is a token for putting on your CV, actual cash, useful information coming back to you, or just the warm feeling that someone else found your comments useful, rewards are important for motivating people (and researchers).

One of the things that links these two together is a sense of loss of control over the comment. Commenting on journal websites is just that: commenting on the journal’s website. The comment author has “given up” a piece of value, which is often not even citeable, and has also lost control over what happens to it. If you change your mind, even if the site allows you to delete the comment, you have no way of checking whether it is still in the system somewhere.

In a sense, when the Web 2.0 world was built, it got things nearly precisely wrong for personal content. For me Jon Udell has written most clearly about this when he talks about the publish-subscribe pattern for successful frameworks. In essence, I publish my content and you choose to subscribe to it. This works well for me, the blogger, at this site, but it is not so great for the commenter, who has to leave their comment to my tender mercies on my site. It would be better if the commenter could publish their comment and I could syndicate it back to my blog. The current arrangement creates all sorts of problems: it is challenging to aggregate your own comments together, and you have to rely on the functionality of specific sites to help you follow responses to your comments. Jon wrote about this better than I can in his blog post.
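
Just to make the syndication direction concrete, here is a toy sketch of a blog pulling back comments that commenters publish on feeds they control. Everything here is invented for illustration – the feed URLs and the convention that a comment links back to the post it discusses – so treat it as a sketch of the pattern, not of any existing service.

```python
# Toy sketch of comment syndication: commenters publish their own feeds,
# the blog subscribes and pulls back only entries that point at its posts.
# Feed URLs and the link-back convention are invented for illustration.
import feedparser

COMMENTER_FEEDS = [
    "http://example.org/alice/comments.atom",
    "http://example.net/bob/comments.atom",
]

def syndicated_comments(post_url):
    """Collect comments about post_url from feeds the commenters themselves control."""
    comments = []
    for feed_url in COMMENTER_FEEDS:
        feed = feedparser.parse(feed_url)
        for entry in feed.entries:
            # Assumed convention: the comment's text links back to the post it discusses.
            if post_url in entry.get("summary", ""):
                comments.append({
                    "author": feed.feed.get("title", feed_url),
                    "text": entry.get("summary", ""),
                    "source": entry.get("link", feed_url),  # canonical copy stays with its author
                })
    return comments

if __name__ == "__main__":
    for c in syndicated_comments("http://example.com/blog/my-post"):
        print(c["author"], "->", c["source"])
```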

So a big part of the problem could be solved if people streamed their own content. This isn’t going to happen quickly in the general sense of everyone having a web server of their own – it remains too difficult for even moderately skilled people to be bothered doing this. Services will no doubt appear in the future, but current broadcast services like Twitter offer a partial solution (it’s “my” Twitter account; I can at least pretend to myself that I can delete all of it). The idea of using something like the Twitter service at microrevie.ws, as suggested by Daniel Mietchen this week, can go a long way towards solving the problem. This takes a structured tweet of the form @hreview {Object};{your review}, followed optionally by a number of asterisks for a star rating. This doesn’t work brilliantly for papers because of the length of references to the paper, even with shortened DOIs, the need for sometimes lengthy reviews, and the shortness of tweets. Additionally the Twitter account is not automatically associated with a unique research contributor ID. However the principle of the author of the review controlling their own content, while at the same time making links between themselves and that content in a linked open data kind of way, is extremely powerful.
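
A rough sketch of how a structured tweet in that form might be parsed (the regular expression simply follows the pattern described above; the real microrevie.ws conventions may differ):

```python
import re

# "@hreview {Object};{your review}" optionally followed by asterisks for a star rating.
HREVIEW = re.compile(r"@hreview\s+(?P<object>[^;]+);(?P<review>[^*]*?)\s*(?P<stars>\**)\s*$")

def parse_hreview(tweet):
    """Return the reviewed object, review text, and star rating, or None if no match."""
    match = HREVIEW.match(tweet.strip())
    if not match:
        return None
    return {
        "object": match.group("object").strip(),
        "review": match.group("review").strip(),
        "rating": len(match.group("stars")) or None,
    }

print(parse_hreview("@hreview doi:10.1234/example;"
                    "Methods are clear, conclusions a little overstated ***"))
```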

Imagine a world in which your email outbox or local document store is also a webserver (via any one of an emerging set of tools like Wave, DropBox, or Opera Unite). You can choose who to share your review with and change that over time. If you choose to make it public, the journal or the authors can give you some form of credit. It is interesting to think that author-side charges could perhaps be reduced for valuable reviews. This wouldn’t work in a naive way, with $10 per review, because people would churn out large numbers of rubbish reviews, but if those reviews are out on the linked data web then their impact can be measured by their page rank and the authors rewarded accordingly.
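
Purely as a toy illustration of that last idea (the graph, the identifiers, and the links are all invented), the relative impact of reviews sitting on the linked data web could be estimated with something like PageRank:

```python
import networkx as nx

# Invented toy graph: papers cite papers, reviews link to the papers they discuss,
# and later papers or posts link to reviews they found useful.
G = nx.DiGraph()
G.add_edge("review:alice/1", "paper:A")        # Alice's review discusses paper A
G.add_edge("paper:B", "paper:A")               # paper B cites paper A
G.add_edge("paper:B", "review:alice/1")        # paper B credits Alice's review
G.add_edge("blog:carol/post-12", "review:alice/1")
G.add_edge("review:dave/7", "paper:B")

scores = nx.pagerank(G)  # crude stand-in for "impact measured by page rank"
for node, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{node:20s} {score:.3f}")
```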

Rewards and control linked together might provide a way of solving the problem – or at least of solving it faster than we are at the moment.

Why the Digital Britain report is a missed opportunity

A few days ago the UK Government report on the future of Britain’s digital infrastructure, co-ordinated by Lord Carter, was released. I haven’t had time to read the whole report, I haven’t even really had time to skim it completely. But two things really leapt out at me.

On page four:

“If, as expected, the volume of digital content will increase 10x to 100x over the next 3 to 5 years then we are on the verge of a big bang in the communications industry that will provide the UK with enormous economic and industrial opportunities”

And on page 18:

“Already today around 7.5% of total UK music album purchases are digital and a smaller but rapidly increasing percentage of film and television consumption is streamed online or downloaded…User-generated and social content will be very significant but should not be the main or only content” – this was brought to my attention by Brian Kelly.

The first extract is, to me, symptomatic of a serious, even catastrophic, lack of ambition and understanding of how the web is changing. If the UK’s digital content only increases by 10-100 fold over the next three years then we will be living in a country lagging behind those that will be experiencing huge economic benefits from getting the web right for their citizens.

But that is just a lack of understanding at its core. The Government’s lack of appreciation for how fast this content is growing isn’t really an issue, because the Government isn’t an effective content producer online. It would be great if it were, pushing out data and making things happen, but they will probably catch up one day, when forced to by events. What is disturbing to me is that second passage. “User-generated and social content should not be the main or only content”? It probably already is the main content on the open web, at least by volume, and the volume and traffic rates of user-generated content are rising exponentially. But putting that aside, the report appears to be saying that the content generated by British citizens basically is not, and will not be, “good enough”; that it has no real value. Lord Carter hasn’t just said that he doesn’t believe that enough useful content could be produced by “non-professionals”, but that it shouldn’t be produced.

The Digital Britain Unconferences were a brilliant demonstration of how the web can enable democracy by bringing interested people together to debate and respond to specific issues. Rapid, high quality, and grass roots, they showed the future of how governments could actually interact effectively with their citizens. The potential for economic benefit from the web is not in broadcast, is not in professional production, but is in many-to-many communication and sharing. Selling a few more videos will not get us out of this recession. Letting millions of people add a small amount of value, or have more efficient interactions, could. This report fails to reflect that opportunity. It is a failure of understanding and a failure of imagination. The only saving grace is that, aside from the need for physical infrastructure, the Government is becoming increasingly irrelevant to the debate anyway. The world will move on, and the web will enable it, faster or slower than we expect, and in ways that will be surprising. It will just go that much slower in the UK.

Google Wave in Research – the slightly more sober view – Part I – Papers

I, and many others, have spent the last week thinking about Wave and I have to say that I am getting more, rather than less, excited about the possibilities it represents. All of the below will have to remain speculation for the moment, but I wanted to walk through two use cases and identify how the concept of a collaborative automated document will have an impact. In this post I will start with the drafting and publication of a paper because it is an easier step to think about. In the next post I will move on to the use of Wave as a laboratory recording tool.

Drafting and publishing a paper via Wave

I start drafting the text of a new paper. As I do this I add the Creative Commons robot as a participant. The robot will ask what license I wish to use and then provide a stamp, linked back to the license terms. When a new participant adds text or material to the document, they will be asked whether they are happy with the license, and their agreement will be registered within a private blip within the Wave controlled by the robot (probably called CC-bly, pronounced see-see-bly). The robot may also register the document with a central repository of open content. A second robot could notify the authors’ respective institutional repositories, creating a negative-click repository in, well, one click. More seriously this would allow the IR to track, and if appropriate modify, the document as well as harvest its content and metadata automatically.
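
A schematic sketch of the licence-agreement logic such a robot might implement is below. Plain Python classes stand in for the actual Wave robot API, and every name here is invented for illustration; it is a sketch of the behaviour, not working Wave code.

```python
# Schematic only: plain Python stands in for the Wave robot API; all names invented.
CC_LICENSES = {
    "CC-BY": "http://creativecommons.org/licenses/by/3.0/",
    "CC0": "http://creativecommons.org/publicdomain/zero/1.0/",
}

class CCblyRobot:
    def __init__(self, license_key="CC-BY"):
        self.license_url = CC_LICENSES[license_key]
        self.agreements = {}  # participant -> agreed? (would live in a private blip)

    def on_participant_added(self, participant):
        """Ask each new contributor whether they accept the document licence."""
        print(f"{participant}: this document is licensed under {self.license_url}. Agree? [y/n]")
        self.agreements[participant] = None  # recorded when they reply

    def on_reply(self, participant, answer):
        self.agreements[participant] = answer.strip().lower().startswith("y")

    def repository_record(self, document_id):
        """Stub for notifying a repository (or IR) that an openly licensed document exists."""
        return {
            "document": document_id,
            "license": self.license_url,
            "contributors": [p for p, agreed in self.agreements.items() if agreed],
        }

bot = CCblyRobot("CC-BY")
bot.on_participant_added("researcher@example.org")
bot.on_reply("researcher@example.org", "yes")
print(bot.repository_record("wave:example!w+paper-draft"))
```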

I invite a series of authors to contribute to the paper and we start to write. Naturally the inline commenting and collaborative authoring tools get a good workout, and it is possible to watch the evolution of specific sections with the playback tool. The authors are geographically distributed, but we can organize scheduled hacking sessions with inline chat to work on sections of the paper. As we start to add references the Reference Formatter gets added (not sure whether this is a Robot or a Gadget, but it is almost certainly called “Reffy”). The formatter automatically recognizes text of the form (Smythe and Hoofback 1876), searches the Citeulike libraries of the authors for the appropriate reference, adds an inline citation, and places a formatted reference in a separate Wavelet to keep it protected from random edits. Chunks of text can be collected from reports or theses in other Waves, and the tracking system notes where they have come from, maintaining the history of the whole document and its sources and checking licenses for compatibility. Terminology checkers, based on the existing Spelly extension (although at the moment this works on the internal rather than the external API – Google say they are working to fix that), can be run over the document to check for incorrect or ambiguous use of terms, or to identify gene names, structures etc. and automatically format them and link them to the reference database.
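
The pattern matching at the heart of “Reffy” is not exotic; a rough sketch (the citation library below is a made-up stand-in for an author’s Citeulike export):

```python
import re

# Matches in-text citations like "(Smythe and Hoofback 1876)" or "(Smythe 1876)".
CITATION = re.compile(r"\(([A-Z][\w-]+(?:\s+(?:and|&)\s+[A-Z][\w-]+)?)[,\s]+(\d{4})\)")

# Made-up stand-in for the authors' Citeulike libraries.
LIBRARY = {
    ("Smythe and Hoofback", "1876"): {
        "title": "On the dynamics of an imaginary system",
        "journal": "J. Imag. Res.",
        "doi": "10.0000/example.1876.001",
    },
}

def format_references(text):
    """Find author-year citations in the text and return formatted references."""
    references = []
    for match in CITATION.finditer(text):
        key = (match.group(1), match.group(2))
        entry = LIBRARY.get(key)
        if entry:
            references.append(f"{key[0]} ({key[1]}). {entry['title']}. "
                              f"{entry['journal']}. doi:{entry['doi']}")
        else:
            references.append(f"UNRESOLVED: {key[0]} {key[1]}")
    return references

print(format_references("As first shown (Smythe and Hoofback 1876), the effect is small."))
```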

It is time to add some data and charts to the paper. The actual source data are held in an online spreadsheet. A chart/graphing widget is added to the document and formats the data into a default graph, which the user can then modify as they wish. The link back to the live data is of course maintained. Ideally this will trigger the CC-bly robot to query the user as to whether they wish to dedicate the data to the public domain (therefore satisfying both the Science Commons Data Protocol and the Open Knowledge Definition – see how smoothly I got that in?). When the user says yes (being a right-thinking person) the data is marked with the chosen waiver/dedication and CKAN is notified and a record created of the new dataset.
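
The graphing side is already routine outside Wave; a minimal sketch of what such a gadget would do behind the scenes (the CSV URL and column names are placeholders for the live spreadsheet):

```python
import csv
import io
import urllib.request

import matplotlib.pyplot as plt

DATA_URL = "http://example.org/experiment-42.csv"  # placeholder for the live spreadsheet export

def plot_live_data(url=DATA_URL):
    """Fetch the current data and draw a default chart; the link to the source is kept."""
    with urllib.request.urlopen(url) as response:
        rows = list(csv.DictReader(io.TextIOWrapper(response, encoding="utf-8")))
    x = [float(row["time"]) for row in rows]      # assumed column names
    y = [float(row["signal"]) for row in rows]
    plt.plot(x, y, marker="o")
    plt.xlabel("time")
    plt.ylabel("signal")
    plt.title(f"Live data from {url}")
    plt.savefig("figure1.png")

if __name__ == "__main__":
    plot_live_data()
```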

The paper is cleaned up – informal comments can be easily obtained by adding colleagues to the Wave. Submission is as simple as adding a new participant, the journal robot (PLoSsy obviously) to the Wave. The journal is running its own Wave server so referees can be given anonymous accounts on that system if they choose. Review can happen directly within the document with a conversation between authors, reviewers, and editors. You don’t need to wait for some system to aggregate a set of comments and send them in one hit and you can deal with issues directly in conversation with the people who raise them. In addition the contribution of editors and referees to the final document is explicitly tracked. Because the journal runs its own server, not only can the referees and editors have private conversations that the authors don’t see, those conversations need never leave the journal server and are as secure as they can reasonably be expected to be.

Once accepted, the paper is published simply by adding a new participant. What would traditionally happen at this point is that a completely new typeset version would be created, breaking the link with everything that has gone before. This could be done by creating a new Wave with just the finalized version visible and all comments stripped out. What would be far more exciting would be for a formatted version to be created which retained the entire history. A major objection to publishing referees’ comments is that they refer to the unpublished version. Here the reader can see the comments in context and come to their own conclusions. Before publishing, any inline data will need to be harvested and placed in a reliable repository along with any other additional information. Supplementary information can simply be hidden under “folds” within the document rather than buried in separate documents.

The published document is then a living thing. The canonical “as published” version is clearly marked, but the functionality for comments, updates, or complete revisions is built in. The modular XML nature of the Wave means that there is a natural means of citing a specific portion of the document. In the future citations to a specific point in a paper could be marked, again via a widget or robot, to provide a back link to the citing source. Harvesters can traverse this graph of links in both directions, easily wiring up the published data graph.

Based on the currently published information none of the above is even particularly difficult to implement. Much of it will require some careful study of how the workflows operate in practice, and there will likely be issues of collisions and complications, but most of the above is simply based on the functionality demonstrated at the Wave launch. The real challenge will lie in integration with existing publishing and document management systems, and in the subtle social implications of changing the way that authors, referees, editors, and readers interact with the document. Should readers be allowed to comment directly in the Wave or should that be in a separate Wavelet? Will referees want to be anonymous and will authors be happy to see the history made public?

Much will depend on how reliable and how responsive the technology really is, as well as how easy it is to build the functionality described above. But the bottom line is that this is the result of about four days’ occasional idle thinking about what can be done. When we really start building and realizing what we can do, that is when the revolution will start.

Part II is here.

What would you say to Elsevier?

In a week or so’s time I have been invited to speak as part of a forward planning exercise at Elsevier. To some this may seem like an opportunity to go in for an all-guns-blazing OA rant, or perhaps to plant some incendiary device, but I see it more as an opportunity to nudge, perhaps cajole, a big player in the area of scholarly publishing in the right direction. After all, if we are right about the efficiency gains for authors and readers that will be created by Open Access publication, and we are right about the way that web based systems utterly change the rules of scholarly communication, then even an organization the size of Elsevier has to adapt or wither away. Persuading them to move in the right direction because it is in their own interests would be an effective way of speeding up the process of positive change.

My plan is to focus less on the arguments for making more research output Open Access and more on what happens as a greater proportion of those outputs become freely available, something that I see as increasingly inevitable. Where that proportion may finally end up is anyone’s guess, but it is going to be much bigger than it is now. What will authors and funders want and need from their publication infrastructure, and what are the business opportunities that arise from those needs? For me these fall into four main themes:

  • Tracking via aggregation. Funders and institutions increasingly want to track the outputs of their research investment. Providing tools and functionality that enable them to automatically aggregate and slice and dice these outputs is a big business opportunity. The data themselves will be free, but providing them in the form people need, rapidly and effectively, will add value that they will be prepared to pay for.
  • Speed to publish as a market differentiator. Authors will want their content out, available, and being acted on fast. Speed to publication is potentially the biggest remaining area for competition between journals. This is important because there will almost certainly be fewer journals, with greater “quality” or “brand” differentiation. There is a plausible future in which there are only two journals, Nature and PLoS ONE.
  • Data publication, serving, and archival. There may be fewer journals, but there will be a much greater diversity of materials being published through a larger number of mechanisms. There are massive opportunities in providing high quality infrastructure and services to funders and institutions to aggregate, publish, and archive the full set of research outputs. I intend to draw heavily on Dorothea Salo‘s wonderful slideset on data publication for this part.
  • Social search. Literature searching is the main area where there are plausible efficiency gains to be made in the current scholarly publications cycle. According to the Research Information Network‘s model of costs, search accounts for a very significant proportion of the non-research costs of publishing. Building the personal networks (Bill Hooker‘s Distributed Wetware Online Information Filter [down in the comments], or DWOIF) that make this feasible may well be the new research skill of the 21st century. Tools that make this work effectively are going to be very popular. What will they look like?

But what have I missed? What (constructive!) ideas and thoughts would you want to place in the minds of the people thinking about where to take one of the world’s largest scholarly publication companies and its online information and collaboration infrastructure?

Full disclosure: Part of the reason for writing this post is to disclose publicly that I am doing this gig. Elsevier are covering my travel and accommodation costs but are not paying any fee.

OMG! This changes EVERYTHING! – or – Yet Another Wave of Adulation

Yes, I’m afraid it’s yet another over-the-top response to yesterday’s big announcement of Google Wave, the latest paradigm-shifting, gob-smackingly brilliant piece of technology (or PR, depending on your viewpoint) out of Google. My interest, however, is pretty specific: how can we leverage it to help us capture, communicate, and publish research? And my opinion is that this is absolutely game changing – it makes a whole series of problems simply go away, and potentially provides a route to solving many of the problems that I was struggling to see how to manage.

Firstly, let’s look at the grab bag of generic issues that I’ve been thinking about. Most recently I wrote about how I thought “real time” wasn’t the big deal; the big deal was giving users back control over the timeframe in which streams come to them. I had some vague ideas about how this might look, but Wave has working code. When the people you are in conversation with are online and looking at the same wave they will see modifications in real time. If they are not in the same document they will see the comments or changes later, but can also “re-play” the changes. A lot of thought has clearly gone into the default views based on when and how a person first comes into contact with a document.

Another issue that has frustrated me is the divide between wikis and blogs. Wikis generally have better editing functionality, but blogs have workable RSS feeds; wikis have more plugins, but blogs map better onto the diary style of a lab notebook. None of these were ever fundamental philosophical differences, just historical differences of implementation and developer priorities. Wave makes most of these differences irrelevant by creating a collaborative document framework that easily incorporates much of the best of all of these tools within a high quality rich text and media authoring platform. Bringing in content looks relatively easy, and pushing content out in different forms also seems to be pretty straightforward. Streams, feeds, and other outputs, if not native, look to be easily generated either directly or by passing material to other services. The Waves themselves are XML, which should enable straightforward parsing and tweaking with existing tools as well.

One thing I haven’t written much about, but have been thinking about, is the process of converting lab records into reports and on into papers. While there wasn’t much on display about complex documents, a lot of just plain nice functionality – drag and drop links, options for incorporating and embedding content – was at least touched on. Looking a little closer into the documentation there seems to be quite a strong provenance model, built on a code-repository-style framework for handling document versioning and forking. All good steps in the right direction, and with open APIs and multitouch as standard on the horizon there will no doubt be excellent visual organization and authoring tools along very soon now. For those worried about security and control, a 30 second mention in the keynote basically made it clear that they have it sorted. Private messages (documents? mecuments?) need never leave your local server.

Finally, the big issue for me has for some time been bridging the gap between unstructured capture of streams of events and making it easy to convert those into structured descriptions of the interpretation of experiments. The audience was clearly wowed by the demonstration of inline real-time contextual spell checking and translation. My first thought was: I want to see that real-time engine attached to an ontology browser or DBpedia, automatically generating links back to the URIs for concepts and objects. What really struck me most was the possibility of using Waves, with a few additional tools, to provide authoring tools that help us to build the semantic web, the web of data, and the web of things.
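
As a flavour of what that might involve, here is a sketch only: it assumes the SPARQLWrapper library and the public DBpedia endpoint, and a real inline annotator would obviously need to be far cleverer about recognising terms and disambiguating them.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

def dbpedia_uri(term, endpoint="http://dbpedia.org/sparql"):
    """Look up a DBpedia resource whose English label matches the term and return its URI."""
    sparql = SPARQLWrapper(endpoint)
    sparql.setQuery("""
        SELECT ?resource WHERE {
            ?resource rdfs:label "%s"@en .
        } LIMIT 1
    """ % term)
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    bindings = results["results"]["bindings"]
    return bindings[0]["resource"]["value"] if bindings else None

# An inline annotator would call something like this on each recognised term
# and insert the returned URI as a link while you type.
print(dbpedia_uri("Lysozyme"))
```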

For me, the central challenges for a laboratory recording system are capturing objects, whether digital or physical, as they are created, and then serving those objects back to the user as they need them to describe the connections between them. As we connect up these objects we will create the semantic web. As we build structured knowledge against those records we will build a machine-parseable record of what happened that will help us to plan for the future. As I understand it each wave, and indeed each part of a wave, can be a URL endpoint; an object on the semantic web. If they aren’t already, it will be easy to make them so. As much as anything, it is the web-native collaborative authoring tool, making embedding and pass-by-reference the default approach rather than cut and paste, that will make the difference. Google don’t necessarily do semantic web, but they do links and they do embedding, and they’ve provided a framework that should make it easy to add meaning to the links. Google just blew the door off the ELN market, and they probably didn’t even notice.

Those of us interested in web-based and electronic recording and communication of science have spent a lot of the last few years trying to describe how we need to glue the existing tools together: mailing lists, wikis, blogs, documents, databases, papers. The framework was never right, so a lot of attention was focused on moving things backwards and forwards, on how to connect one thing to another. That problem, as far as I can see, has now ceased to exist. The challenge now is in building the right plugins and making sure the architecture is compatible with existing tools. But fundamentally the framework seems to be there. It seems like it’s time to build.

A more sober reflection will probably follow in a few days ;-)

It’s not easy being clear…

There has been some debate going backwards and forwards over the past few weeks about licensing, people’s expectations, and the extent to which researchers can be expected to understand, or want to understand, the details of legal terms, licensing, and other technical minutiae. It is reasonable for scientific researchers not to wish to get into the details. One of the real successes of Creative Commons has been to provide a relatively small set of reasonably clear terms that enable people to express their wishes about what others can do with their work. But even here there is the potential for significant confusion, as demonstrated by the work that CC is doing on the perception of what “non commercial” means.

The end result of this is two-fold. Firstly, people are genuinely confused about what to do and as a result they give up. In giving up there is often an unspoken assumption that “people will understand what I want/mean”. Two examples yesterday illustrated exactly how misguided this can be, and showed the importance of being clear about, and thinking about, what you want people to do with your content and information.

The first was pointed out by Paulo Nuin, who linked to a post on The Matrix Cookbook, a blog and PDF containing much useful information on matrix transforms. The post complained that Amazon were selling a Kindle version of the PDF, apparently without asking permission or even bothering to inform the authors. So far, so big corporation. But digging a little deeper I went to the front page of the site and found this interesting “license”:

“License? No, there is no license. It is provided as a knowledge sharing project for anyone to use. But if you use it in an academic or research like context, we would love to be cited appropriately.”

Now I would interpret this as meaning that the authors had intended to place the work in the public domain. They clearly felt that while educational and research re-use was fine, commercial use was not. I would guess that someone at Amazon read the statement “there is no license” and felt that it was free to re-use. It seems odd that they wouldn’t email the authors to notify them, but if it were public domain there is no requirement to. Rude, yes. Theft? Well, it depends on your perspective. Going back today, the authors have made a significant change to the “license”:

It is provided as a knowledge sharing project for anyone to use. But if you use it in an academic or research like context, we would love to be cited appropriately. And NO, you are not allowed to make money on it by reselling The Matrix Cookbook in any form or shape.

Had the authors made the content CC-BY-NC then their intentions would have been much clearer. My personal belief is that an NC license would be counter-productive (meaning the work couldn’t be used for teaching at a fee charging college or for research funded by a commercial sponsor for instance) but the point of the CC licenses is to give people these choices. What is important is that people make those choices and make them clear.

The second example related to identity. As part of an ongoing discussion on online commenting, genereg, a FriendFeed user, linked to their blog, which included their real name. Mr Gunn, the nickname used by Dr William Gunn online, wrote a blog post in which he referred to genereg’s contribution by linking to their blog from their real name [subsequently removed on request]. I probably would have done the same, wanting to ascribe the contribution clearly to the “real person” so they get credit for it. Genereg objected to this, feeling that as their real name wasn’t directly in that conversational context it was inappropriate to use it.

So in my view, “Genereg” was a nickname that someone was happy to have connected with their real name, while in their view this was inappropriate. No-one is right or wrong here; we are evolving the rules of conduct more or less as we go and frankly, identity is a mess. But this wasn’t clear to me or to Mr Gunn. I am often uncomfortable with trying to tell whether a specific person who has linked two apparently separate identities is happy with that link being public, has linked the two by mistake, or just regards one as an alias. And you can’t ask in a public forum, can you?

What links these, and this week’s other fracas, is confusion over people’s expectations. The best way to avoid this is to be as clear as you possibly can. Don’t assume that everyone thinks the same way that you do. And definitely don’t assume that what is obvious to you is obvious to everyone else. When it comes to content, make a clear statement of your expectations and wishes, preferably using a widely recognized and understood license. If you’re reading this at OWW you should be seeing my nice shiny new cc0 waiver in the right hand navbar (I haven’t figured out how to get it into the RSS feed yet). Most of my slidesets at Slideshare are CC-BY-SA. I’d prefer them to be CC-BY, but most include images with CC-BY-SA licenses which I (try to make sure I) respect. Overall I try to make the work I generate as widely re-usable as possible and aim to make that as clear as possible.
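
For what it’s worth, one widely used convention for feeds is the creativeCommons RSS module; a rough sketch (assuming an RSS 2.0 file on disk, and not tied to any particular blog platform) of stamping a license statement into each item:

```python
import xml.etree.ElementTree as ET

CC_NS = "http://backend.userland.com/creativeCommonsRssModule"
ET.register_namespace("creativeCommons", CC_NS)

def add_license(rss_path, license_url, out_path):
    """Add a creativeCommons:license element to every item in an RSS 2.0 feed."""
    tree = ET.parse(rss_path)
    for item in tree.getroot().iter("item"):
        element = ET.SubElement(item, f"{{{CC_NS}}}license")
        element.text = license_url
    tree.write(out_path, xml_declaration=True, encoding="utf-8")

add_license("feed.xml",
            "http://creativecommons.org/publicdomain/zero/1.0/",
            "feed-with-license.xml")
```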

There are no such tools to make clear statements about how you wish your identity to be treated (and perhaps there should be). But a plain English statement on the appropriate profile page might be useful: “I blog under a pseudonym because…and I don’t want my identity revealed”…“Bunnykins is the FriendFeed handle of Professor Serious Person”. Consider whether what you are doing is sending mixed messages or is potentially confusing. Personally I like to keep things simple, so I just use my real name or variants of it. But that is clearly not for everyone.

Above all, try to express clearly what you expect and wish to happen. Don’t expect others necessarily to understand where you’re coming from. It is very easy for one person’s polite and helpful to be another person’s deeply offensive. When you put something online, think about how you want people to use it, think about how you don’t want people to use it (and remember you may need to balance the allowing of one against the restricting of the other) and make those as clear as you possibly can, where possible using a statement or license that is widely recognized and has had some legal attention at some point like the CC licenses, cc0 waiver, or the PDDL. Clarity helps everyone. If we get this wrong we may end up with a web full of things we can’t use.

And before anyone else gets in to tell me: yes, I’ve made plenty of unjustified, and plain wrong, assumptions about other people’s views before. Pot. Kettle. Black. Welcome to being human.

A breakthrough on data licensing for public science?

I spent two days this week visiting Peter Murray-Rust and others at the Unilever Centre for Molecular Informatics at Cambridge. There was a lot of useful discussion and I learned an awful lot that requires more thinking and will no doubt result in further posts. In this one I want to relay a conversation we had over lunch with Peter, Jim Downing, Nico Adams, Nick Day and Rufus Pollock that seemed extremely productive. It should be noted that what follows is my recollection, so it may not be entirely accurate and shouldn’t necessarily be taken to represent other people’s views.

The appropriate way to license published scientific data is an argument that has now been rolling on for some time. Broadly speaking the argument has divided into two camps. First are those who believe in the value of the share-alike or copyleft provisions of the GPL and similar licenses. Many of these people come from an Open Source Software or Open Content background. The primary concern of this group is to spread the message and use of Open Content and to prevent “freeloaders” from being able to use Open material without contributing back to the open community. A presumption in this view is that a license is a good, or at least acceptable, way of achieving both these goals. Also included here are those who think it is important to allow people the freedom to address their concerns through copyleft approaches. I think it is fair to characterize Rufus as falling into this latter group.

On the other side are those, including myself, who are concerned more centrally with enabling re-use and re-purposing of data as far as is possible. Most of us are scientists of one sort or another and not programmers per se. We don’t tend to be concerned about freeloading (or in some cases welcome it as effective re-use). Another common characteristic is that we have been prevented from making our own content as free as we would like due to copyleft provisions. I prefer to make all my content CC-BY (or cc0 where possible). I am frequently limited in my ability to do this by the wish to incorporate CC-BY-SA or GFDL material. We are deeply worried by the potential for licensing to make it harder to re-use and re-mix disparate sets of data and content into new digital objects. There is a sense amongst this group that “data is different” to other types of content, particularly in its diversity of types and re-uses. More generally there is the concern that anything that “smells of lawyers”, like something called a “license”, will have scientists running screaming in the opposite direction as they try to avoid any contact with their local administration and legal teams.

What I think was productive about the discussion on Tuesday is that we focused on what we could agree on with the aim of seeing whether it was possible to find a common position statement on the limited area of best practice for the publication of data that arises from public science. I believe such a statement is important because there is a window of opportunity to influence funder positions. Many funders are adopting data sharing policies but most refer to “following best practice” and that best practice is thin on the ground in most areas. With funders wielding the ultimate potential stick there is great potential to bootstrap good practice by providing clear guidance and tools to make it easy for researchers to deliver on their obligations. Funders in turn will likely adopt this best practice as policy if it is widely accepted by their research communities.

So we agreed on the following (I think – anyone should feel free to correct me of course!):

  1. A simple statement is required along the lines of “best practice in data publishing is to apply protocol X”. Not a broad selection of licenses with different effects, not a complex statement about what the options are, but “best practice is X”.
  2. The purpose of publishing public scientific data and collections of data, whether in the form of a paper, a patent, data publication, or deposition to a database, is to enable re-use and re-purposing of that data. Non-commercial terms prevent this in an unpredictable and unhelpful way. Share-alike and copyleft provisions have the potential to do the same under some circumstances.
  3. The scientific research community is governed by strong community norms, particularly with respect to attribution. If we could successfully expand these to include share-alike approaches as a community expectation, that would obviate many of the concerns that people attempt to address via licensing.
  4. Explicit statements of the status of data are required and we need effective technical and legal infrastructure to make this easy for researchers.

So in aggregate I think we agreed a statement similar to the following:

“Where a decision has been taken to publish data deriving from public science research, best practice to enable the re-use and re-purposing of that data is to place it explicitly in the public domain via {one of a small set of protocols, e.g. cc0 or PDDL}.”

The advantage of this statement is that it focuses purely on what should be done once a decision to publish has been made, leaving the issue of what should be published to a separate policy statement. This also sidesteps issues of which data should not be made public. It focuses on data generated by public science, narrowing the field to the space in which there is a moral obligation to make such data available to the public that fund it. By describing this as best practice it also allows deviations that may, for whatever reason, be justified by specific people in specific circumstances. Ultimately the community, referees, and funders will be the judge of those justifications. The BBSRC data sharing policy states for instance:

BBSRC expects research data generated as a result of BBSRC support to be made available…no later than the release through publication…in-line with established best practice  in the field [CN – my emphasis]…

The key point for me that came out of the discussion is perhaps that we can’t and won’t agree on a general solution for data but that we can articulate best practice in specific domains. I think we have agreed that for the specific domain of published data from public science there is a way forward. If this is the case then it is a very useful step forward.

“Real Time”: The next big thing or a pointer to a much more interesting problem?

There has been a lot written and said recently about the “real time” web, most recently in an interview with Paul Buchheit on ReadWriteWeb. The premise is that if items and conversations are carried on in “real time” then they are more efficient and more engaging. The counter argument has been that they become more trivial; that by dropping the barrier to involvement to near zero, the internal editorial process that forces each user to think a little about what they are saying is lost, generating a stream of drivel. I have to admit upfront that I really don’t get the excitement. It isn’t clear to me that the difference between a five or ten second refresh rate and a 30 second one is significant.

In one sense I am all for getting a more complete record onto the web, at least if there is some probability of it being archived. After all, this is what we are trying to do with the laboratory recording effort: create as complete a record on the web as possible. But at some point there is always going to be an editorial process. In a blog it takes some effort to write a post and publish it, creating a barrier which imposes some editorial filter. Even on Twitter the 140 character limit forces people to be succinct and often means a pithy statement gets refined before hitting return. In an IM or chat window you will think before hitting return (hopefully!). Would true “real time” mean watching as someone typed, or would it have to be a full brain dump as it happened? I’m not sure I want either of these; if I want real time conversation I will pick up the phone.

But while everyone is focussed on “real time” I think it is starting to reveal a more interesting problem, one I’ve been thinking about for quite a while but have been unable to get a grip on. All of these services have different intrinsic timeframes. One of the things I dislike about the new FriendFeed interface is the “real time” nature of it. What I liked previously was that it had a slower intrinsic time than, say, Twitter or instant messaging, but a faster intrinsic timescale than a blog or email. On Twitter/IM conversations are fast, seconds to minutes, occasionally hours. On FriendFeed they tend to run from minutes to hours, with some continuing on for days, all threaded and all kept together. Conversations in blog comments run over hours to days, email over days, newspapers over weeks, and the academic literature over months and years.

Different people are comfortable with interacting with streams running at these different rates. Twitter is too much for some, as is FriendFeed, or online content at all. Many don’t have time to check blog comments, but perhaps are happy to read the posts once a day. But these people probably appreciate that the higher rate data is there. Maybe they come across an interesting blog post referring to a comment and want to check the comment, maybe the comment refers to a conversation on Twitter and they can search to find that. Maybe they find a newspaper article that leads to a wiki page and on to a pithy quote from an IM service. This type of digging is enabled by good linking practice. And it is enabled by a type of social filtering where the user views the stream at a speed which is compatible with their own needs.

The tools and social structures are now well developed for the kind of social filtering where a user outsources that function to other people, whether they are on FriendFeed, or are bloggers or traditional dead-tree journalists. What I am less sure about is the tooling for controlling the rate of the stream that I am taking in. Deepak wrote an interesting post recently on social network filtering, with the premise that you need to build a network that you trust to bring important material to your attention. My response to this is that there is a fundamental problem: at the moment, you can’t independently control both the spread of the net you set and the speed at which information comes in. If you want to cover a lot of areas you need to follow a lot of people, and this means the stream is faster.

Fundamentally, as the conversation has got faster and faster, no-one seems to be developing tools that enable us to slow it down. Filtering tools, such as those built into Twitter clients, help. One of the things I do like about the new FriendFeed interface is the search facility that allows you to set filters displaying only those items with a certain number of “likes” or comments. But what I haven’t seen are tools that are really focussed on controlling the rate of a stream, that work to help you optimize your network to provide both spread and rate. And I haven’t seen much thought go into tools or social practices that enable you to bump an item from one stream to a slower stream to come back to later. Delicious is the obvious tool here, bookmarking objects for later attention, but how many people actually go back to their bookmarks on a regular basis and check over them?
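
To make “controlling the rate” a little more concrete, here is a toy sketch (not a description of any existing tool) of throttling a merged stream to a budget and an engagement threshold that the reader chooses:

```python
from dataclasses import dataclass

@dataclass
class Item:
    title: str
    likes: int
    comments: int
    source: str

def slow_stream(items, max_items=5, min_engagement=2):
    """Throttle a fast stream: drop low-engagement items, rank the rest, cap the count."""
    kept = [item for item in items if item.likes + item.comments >= min_engagement]
    kept.sort(key=lambda item: item.likes + item.comments, reverse=True)
    return kept[:max_items]

incoming = [
    Item("New preprint on open notebooks", 7, 3, "friendfeed"),
    Item("Lunch photo", 0, 0, "twitter"),
    Item("Dataset released under cc0", 2, 1, "blog"),
]
for item in slow_stream(incoming, max_items=2):
    print(item.source, "->", item.title)
```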

David Allen probably best described the concept of a “Tickler File”, a file where you place items into a date-marked slot based on when you think you need to be reminded about them. The way some people regularly review their recent bookmarks and then blog the most interesting ones is an example of a process that achieves the same thing. I think this is probably a good model to think about: a tool, or set of practices, that parks items for a specified, and item- or class-specific, period of time and then pulls them back up and puts them in front of you. Or perhaps does it in a context dependent fashion, or both, picking the right moment in a specific time period to have an item pop up. Ideally it will also put items, or allow you to put them, back in front of your network for further consideration as well. We still want just the one inbox for everything. It is a question of having control over the intrinsic timeframes of the different streams coming into it, including streams that we set up for ourselves.
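
A tickler file in code form is almost trivial – the interesting part is the social practice around it rather than the implementation – but as a sketch of the behaviour:

```python
import heapq
from datetime import datetime, timedelta

class Tickler:
    """Park items for a chosen period, then surface them again once they fall due."""

    def __init__(self):
        self._heap = []  # (due date, item), ordered by due date

    def park(self, item, days):
        due = datetime.now() + timedelta(days=days)
        heapq.heappush(self._heap, (due, item))

    def due_items(self, now=None):
        now = now or datetime.now()
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap)[1])
        return due

tickler = Tickler()
tickler.park("Re-read that FriendFeed thread on data licensing", days=3)
tickler.park("Check replies to my comment on that PLoS ONE paper", days=1)
print(tickler.due_items(now=datetime.now() + timedelta(days=2)))
```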

As I said, I really haven’t got a good grip on this, but my main point is that I think Real Time is just a single instance of giving users access to one specific intrinsic timeframe. The much more interesting problem, and what I think will be one of the next big things, is the general issue of giving users temporal control within a service, particularly for enterprise applications.

Digital Britain Unconference Oxfordshire – Friday 1 May – RAL

On Friday (yes, that’s this Friday) a series of Unconferences that has been pulled together in response to the Digital Britain Report and Forum will kick off, with the first being held at the Rutherford Appleton Laboratory, near Didcot. The object of the meeting is to contribute to a coherent and succinct response to the current interim report and to try to get across to Government what a truly Digital Britain would look like. There is another unconference scheduled in Leeds and it is expected that more will follow.

If you are interested in attending the Oxfordshire meeting please register at the Eventbrite page. Registrations will close on Wednesday evening because I need to finalise the list for security the day before the meeting. I will send directions to registered attendees first thing on Thursday morning. In terms of the conduct of the unconference itself, please bear in mind the admonishment of Alan Patrick:

One request I’d make – the other organisers are too polite to say it, but I will – one of the things that the Digital Britain team has made clear is that they will want feedback that is “positive, concise, based in reality and sent in as soon as possible”. That “based in reality” bit (that mainly means economics) puts a responsibility on us all to ensure all of us as attendees are briefed and educated on the subject before attending the unconference – ie come prepared, no numpties please, as that will dilute the hard work of others.

For information see the links on the right hand side of the main unconference series website, or search on “Digital Britain”.