What would you say to Elsevier?

In a week or so’s time I have been invited to speak as part of a forward planning exercise at Elsevier. To some this may seem like an opportunity to go in for an all guns blazing OA rant, or perhaps to plant some incendiary device, but I see it more as an opportunity to nudge, perhaps cajole, a big player in the area of scholarly publishing in the right direction. After all, if we are right about the efficiency gains for authors and readers that will be created by Open Access publication, and we are right about the way that web based systems utterly change the rules of scholarly communication, then even an organization of the size of Elsevier has to adapt or wither away. Persuading them to move in the right direction because it is in their own interests would be an effective way of speeding up the process of positive change.

My plan is to focus less on the arguments for making more research output Open Access and more on what happens as a greater proportion of those outputs become freely available, something that I see as increasingly inevitable. Where that proportion will finally settle is anyone’s guess, but it is going to be much bigger than it is now. What will authors and funders want and need from their publication infrastructure, and what business opportunities arise from those needs? For me these fall into four main themes:

  • Tracking via aggregation. Funders and institutions increasingly want to track the outputs of their research investment. Providing tools and functionality that will enable them to automatically aggregate, slice, and dice these outputs is a big business opportunity. The data themselves will be free, but providing them rapidly and effectively, in the form that people need, will add value that they will be prepared to pay for.
  • Speed to publish as a market differentiator. Authors will want their content out, available, and being acted on fast. Speed to publication is potentially the biggest remaining area for competition between journals. This is important because there will almost certainly be fewer journals with greater “quality” or “brand” differentiation. There is a plausible future in which there are only two journals, Nature and PLoS ONE.
  • Data publication, serving, and archival. There may be fewer journals, but there will be much greater diversity of materials being published through a larger number of mechanisms. There are massive opportunities in providing high quality infrastructure and services to funders and institutions to aggregate, publish, and archive the full set of research outputs. I intend to draw heavily on Dorothea Salo’s wonderful slideset on data publication for this part.
  • Social search. Literature searching is the main area where there are plausible efficiency gains to be made in the current scholarly publications cycle. According to the Research Information Network’s model of costs, search accounts for a very significant proportion of the non-research costs of publishing. Building the personal networks (Bill Hooker’s Distributed Wetware Online Information Filter, or DWOIF [down in the comments]) that make this feasible may well be the new research skill of the 21st century. Tools that make this work effectively are going to be very popular. What will they look like?

But what have I missed? What (constructive!) ideas and thoughts would you want to place in the minds of the people thinking about where to take one of the world’s largest scholarly publication companies and its online information and collaboration infrastructure?

Full disclosure: Part of the reason for writing this post is to disclose publicly that I am doing this gig. Elsevier are covering my travel and accommodation costs but are not paying any fee.

OMG! This changes EVERYTHING! – or – Yet Another Wave of Adulation

Yes, I’m afraid it’s yet another over the top response to yesterday’s big announcement of Google Wave, the latest paradigm shifting, gob-smackingly brilliant piece of technology (or PR, depending on your viewpoint) out of Google. My interest, however, is pretty specific: how can we leverage it to help us capture, communicate, and publish research? And my opinion is that this is absolutely game changing – it makes a whole series of problems simply go away, and potentially provides a route to solving many of the problems that I was struggling to see how to manage.

Firstly, let’s look at the grab bag of generic issues that I’ve been thinking about. Most recently I wrote about how I thought “real time” wasn’t the big deal; the big deal was giving users back control over the timeframe in which streams come to them. I had some vague ideas about how this might look, but Wave has working code. When the people you are in conversation with are online and looking at the same wave they will see modifications in real time. If they are not in the same document they will see the comments or changes later, but can also “re-play” changes. And a lot of thought has clearly gone into the default views a person sees, based on when and how they first come into contact with a document.

Another issue that has frustrated me is the divide between wikis and blogs. Wikis generally have better editing functionality, but blogs have workable RSS feeds; wikis have more plugins, but blogs map better onto the diary style of a lab notebook. None of these were ever fundamental philosophical differences, just historical differences of implementation and developer priorities. Wave makes most of these differences irrelevant by creating a collaborative document framework that easily incorporates much of the best of all of these tools within a high quality rich text and media authoring platform. Bringing in content looks relatively easy, and pushing content out in different forms also seems to be pretty straightforward. Streams, feeds, and other outputs, if not native, look to be easily generated either directly or by passing material to other services. The Waves themselves are XML, which should enable straightforward parsing and tweaking with existing tools as well.
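Purely as a sketch of the kind of parsing and tweaking I have in mind, and assuming a hypothetical export format (the real Wave XML schema may look nothing like this), extracting a chronological record from a wave could be as simple as:

```python
import xml.etree.ElementTree as ET

# Hypothetical wave export; element and attribute names are invented for illustration.
wave_xml = """
<wave id="example-wave-1">
  <blip author="alice" timestamp="2009-05-29T10:00:00Z">Set up the reaction at room temperature.</blip>
  <blip author="bob" timestamp="2009-05-29T10:05:00Z">Suggest repeating at 37C as a control.</blip>
</wave>
"""

root = ET.fromstring(wave_xml)

# Walk the blips in document order and print a simple timeline of contributions.
for blip in root.findall("blip"):
    print(blip.get("timestamp"), blip.get("author"), "-", blip.text.strip())
```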

One thing I haven’t written much about, but have been thinking about, is the process of converting lab records into reports and on to papers. While there wasn’t much on display about complex documents, a lot of just plain nice functionality – drag and drop links, options for incorporating and embedding content – was at least touched on. Looking a little closer into the documentation there seems to be quite a strong provenance model, built on a code repository style framework for handling document versioning and forking. All good steps in the right direction, and with open APIs and multitouch as standard on the horizon there will no doubt be excellent visual organization and authoring tools along very soon now. For those worried about security and control, a 30 second mention in the keynote basically made it clear that they have it sorted. Private messages (documents? mecuments?) need never leave your local server.

Finally, the big issue for me has for some time been bridging the gap between unstructured capture of streams of events and making it easy to convert those into structured descriptions of the interpretation of experiments. The audience was clearly wowed by the demonstration of inline real time contextual spell checking and translation. My first thought was: I want to see that real-time engine attached to an ontology browser or DBpedia, automatically generating links back to the URIs for concepts and objects. What struck me most was the potential of Waves, with a few additional tools, to provide authoring tools that help us to build the semantic web, the web of data, and the web of things.
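To make that slightly more concrete, here is a toy sketch of the kind of contextual linker I am imagining. The term-to-URI table is hand-built purely for illustration; a real engine would query an ontology service or DBpedia itself rather than a hard-coded dictionary:

```python
import re

# Toy lookup table for illustration only; a real engine would consult an
# ontology service or DBpedia rather than a hard-coded dictionary.
CONCEPT_URIS = {
    "luciferase": "http://dbpedia.org/resource/Luciferase",
    "atp": "http://dbpedia.org/resource/Adenosine_triphosphate",
}

def annotate(text):
    """Wrap recognised terms in HTML links pointing at their concept URIs."""
    def link(match):
        word = match.group(0)
        uri = CONCEPT_URIS.get(word.lower())
        return f'<a href="{uri}">{word}</a>' if uri else word
    return re.sub(r"[A-Za-z]+", link, text)

print(annotate("Added ATP before measuring luciferase activity."))
```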

For me, the central challenges for a laboratory recording system are capturing objects, whether digital or physical, as they are created, and then serving those objects back to the user as they need them to describe the connections between them. As we connect up these objects we will create the semantic web. As we build structured knowledge against those records we will build a machine-parseable record of what happened that will help us to plan for the future. As I understand it each wave, and indeed each part of a wave, can be a URL endpoint: an object on the semantic web. If they aren’t already, it will be easy to make them that. As much as anything it is the web native collaborative authoring tool, making embedding and pass-by-reference the default approach rather than cut and paste, that will make the difference. Google don’t necessarily do semantic web, but they do links and they do embedding, and they’ve provided a framework that should make it easy to add meaning to the links. Google just blew the door off the ELN market, and they probably didn’t even notice.

Those of us interested in web-based and electronic recording and communication of science have spent a lot of the last few years trying to describe how we need to glue the existing tools together: mailing lists, wikis, blogs, documents, databases, papers. The framework was never right, so a lot of attention was focused on moving things backwards and forwards, on how to connect one thing to another. That problem, as far as I can see, has now ceased to exist. The challenge now is in building the right plugins and making sure the architecture is compatible with existing tools. But fundamentally the framework seems to be there. It seems like it’s time to build.

A more sober reflection will probably follow in a few days ;-)

It’s not easy being clear…

There has been some debate going backwards and forwards over the past few weeks about licensing, people’s expectations, and the extent to which researchers can be expected to understand, or want to understand, the details of legal terms, licensing and other technical minutiae. It is reasonable for scientific researchers not to wish to get into the details. One of the real successes of Creative Commons has been to provide a relatively small set of reasonably clear terms that enable people to express their wishes about what others can do with their work. But even here there is the potential for significant confusion, as demonstrated by the work that CC is doing on the perception of what “non commercial” means.

The end result of this is two-fold. Firstly, people are genuinely confused about what to do, and as a result they give up. In giving up there is often an unspoken assumption that “people will understand what I want/mean”. Two examples yesterday illustrated exactly how misguided this can be, and showed the importance of being clear, and thinking about, what you want people to do with your content and information.

The first was pointed out by Paulo Nuin, who linked to a post on The Matrix Cookbook, a blog and PDF containing much useful information on matrix transforms. The post complained that Amazon were selling a Kindle version of the PDF, apparently without asking permission or even bothering to inform the authors. So far, so big corporation. But digging a little deeper I went to the front page of the site and found this interesting “license”:

“License? No, there is no license. It is provided as a knowledge sharing project for anyone to use. But if you use it in an academic or research like context, we would love to be cited appropriately.”

Now I would interpret this as meaning that the authors had intended to place the work in the public domain. They clearly felt that while educational and research re-use was fine, commercial use was not. I would guess that someone at Amazon read the statement “there is no license” and felt that it was free to re-use. It seems odd that they wouldn’t email the authors to notify them, but if it were public domain there is no requirement to. Rude, yes. Theft? Well, it depends on your perspective. Going back today, the authors have made a significant change to the “license”:

It is provided as a knowledge sharing project for anyone to use. But if you use it in an academic or research like context, we would love to be cited appropriately. And NO, you are not allowed to make money on it by reselling The Matrix Cookbook in any form or shape.

Had the authors made the content CC-BY-NC then their intentions would have been much clearer. My personal belief is that an NC license would be counter-productive (meaning the work couldn’t be used for teaching at a fee-charging college or for research funded by a commercial sponsor, for instance), but the point of the CC licenses is to give people these choices. What is important is that people make those choices and make them clear.

The second example related to identity. As part of an ongoing discussion involving online commenting, genereg, a Friendfeed user, linked to their blog, which included their real name. Mr Gunn, the nickname used by Dr William Gunn online, wrote a blog post in which he referred to genereg’s contribution by linking to their blog from their real name [subsequently removed on request]. I probably would have done the same, wanting to ascribe the contribution clearly to the “real person” so they get credit for it. Genereg objected to this, feeling that as their real name wasn’t directly part of that conversational context it was inappropriate to use it.

So in my view “Genereg” was a nickname that someone was happy to have connected with their real name, while in their view this was inappropriate. No-one is right or wrong here; we are evolving the rules of conduct more or less as we go, and frankly, identity is a mess. But this wasn’t clear to me or to Mr Gunn. I am often uncomfortable trying to tell whether a specific person who has linked two apparently separate identities is happy with that link being public, has linked the two by mistake, or just regards one as an alias. And you can’t ask in a public forum, can you?

What links these, and this week’s other fracas, is confusion over people’s expectations. The best way to avoid this is to be as clear as you possibly can. Don’t assume that everyone thinks the same way that you do. And definitely don’t assume that what is obvious to you is obvious to everyone else. When it comes to content, make a clear statement of your expectations and wishes, preferably using a widely recognized and understood license. If you’re reading this at OWW you should be seeing my nice shiny new cc0 waiver in the right hand navbar (I haven’t figured out how to get it into the RSS feed yet). Most of my slidesets at Slideshare are CC-BY-SA. I’d prefer them to be CC-BY, but most include images with CC-BY-SA licenses which I try to make sure I respect. Overall I try to make the work I generate as widely re-usable as possible and aim to make that as clear as possible.

There are no such tools for making clear statements about how you wish your identity to be treated (and perhaps there should be). But a plain English statement on the appropriate profile page might be useful: “I blog under a pseudonym because…and I don’t want my identity revealed”, or “Bunnykins is the Friendfeed handle of Professor Serious Person”. Consider whether what you are doing is sending mixed messages or potentially confusing. Personally I like to keep things simple, so I just use my real name or variants of it. But that is clearly not for everyone.

Above all, try to express clearly what you expect and wish to happen. Don’t expect others necessarily to understand where you’re coming from. It is very easy for one person’s polite and helpful to be another person’s deeply offensive. When you put something online, think about how you want people to use it, think about how you don’t want people to use it (and remember you may need to balance allowing one against restricting the other), and make those as clear as you possibly can, where possible using a statement or license that is widely recognized and has had some legal attention at some point, like the CC licenses, the cc0 waiver, or the PDDL. Clarity helps everyone. If we get this wrong we may end up with a web full of things we can’t use.

And before anyone else gets in to tell me: yes, I’ve made plenty of unjustified, and plain wrong, assumptions about other people’s views myself. Pot. Kettle. Black. Welcome to being human.

A breakthrough on data licensing for public science?

I spent two days this week visiting Peter Murray-Rust and others at the Unilever Centre for Molecular Informatics at Cambridge. There was a lot of useful discussion and I learned an awful lot that requires more thinking and will no doubt result in further posts. In this one I want to relay a conversation we had over lunch with Peter, Jim Downing, Nico Adams, Nick Day and Rufus Pollock that seemed extremely productive. It should be noted that what follows is my recollection, so it may not be entirely accurate and shouldn’t necessarily be taken to represent other people’s views.

The appropriate way to license published scientific data is an argument that has now been rolling on for some time. Broadly speaking the argument has divided into two camps. First, there are those who believe in the value of the share-alike or copyleft provisions of the GPL and similar licenses. Many of these people come from an Open Source Software or Open Content background. The primary concern of this group is to spread the message and use of Open Content and to prevent “freeloaders” from being able to use Open material without contributing back to the open community. A presumption in this view is that a license is a good, or at least acceptable, way of achieving both these goals. Also included here are those who think it is important to allow people the freedom to address their concerns through copyleft approaches. I think it is fair to characterize Rufus as falling into this latter group.

On the other side are those, including myself, who are concerned more centrally with enabling re-use and re-purposing of data as far as is possible. Most of us are scientists of one sort or another and not programmers per se. We don’t tend to be concerned about freeloading (or in some cases welcome it as effective re-use). Another common characteristic is that we have been prevented from making our own content as free as we would like due to copyleft provisions. I prefer to make all my content CC-BY (or cc0 where possible), but I am frequently limited in my ability to do this by the wish to incorporate CC-BY-SA or GFDL material. We are deeply worried by the potential for licensing to make it harder to re-use and re-mix disparate sets of data and content into new digital objects. There is a sense amongst this group that “data is different” from other types of content, particularly in its diversity of types and re-uses. More generally there is the concern that anything that “smells of lawyers”, like something called a “license”, will have scientists running screaming in the opposite direction as they try to avoid any contact with their local administration and legal teams.

What I think was productive about the discussion on Tuesday is that we focused on what we could agree on, with the aim of seeing whether it was possible to find a common position statement on the limited area of best practice for the publication of data that arises from public science. I believe such a statement is important because there is a window of opportunity to influence funder positions. Many funders are adopting data sharing policies, but most refer to “following best practice” and that best practice is thin on the ground in most areas. With funders wielding the ultimate stick, there is a great opportunity to bootstrap good practice by providing clear guidance and tools that make it easy for researchers to deliver on their obligations. Funders in turn will likely adopt this best practice as policy if it is widely accepted by their research communities.

So we agreed on the following (I think – anyone should feel free to correct me of course!):

  1. A simple statement is required along the lines of “best practice in data publishing is to apply protocol X”. Not a broad selection of licenses with different effects, not a complex statement about what the options are, but “best practice is X”.
  2. The purpose of publishing public scientific data and collections of data, whether in the form of a paper, a patent, data publication, or deposition to a database, is to enable re-use and re-purposing of that data. Non-commercial terms prevent this in an unpredictable and unhelpful way. Share-alike and copyleft provisions have the potential to do the same under some circumstances.
  3. The scientific research community is governed by strong community norms, particularly with respect to attribution. If we could successfully expand these to include share-alike approaches as a community expectation that would obviate many concerns that people attempt to address via licensing.
  4. Explicit statements of the status of data are required and we need effective technical and legal infrastructure to make this easy for researchers.

So in aggregate I think we agreed a statement similar to the following:

“Where a decision has been taken to publish data deriving from public science research, best practice to enable the re-use and re-purposing of that data is to place it explicitly in the public domain via {one of a small set of protocols e.g. cc0 or PDDL}.”

The advantage of this statement is that it focuses purely on what should be done once a decision to publish has been made, leaving the issue of what should be published to a separate policy statement. This also sidesteps issues of which data should not be made public. It focuses on data generated by public science, narrowing the field to the space in which there is a moral obligation to make such data available to the public that fund it. By describing this as best practice it also allows deviations that may, for whatever reason, be justified by specific people in specific circumstances. Ultimately the community, referees, and funders will be the judge of those justifications. The BBSRC data sharing policy states for instance:

BBSRC expects research data generated as a result of BBSRC support to be made available…no later than the release through publication…in-line with established best practice in the field [CN – my emphasis]…

The key point for me that came out of the discussion is perhaps that we can’t and won’t agree on a general solution for data but that we can articulate best practice in specific domains. I think we have agreed that for the specific domain of published data from public science there is a way forward. If this is the case then it is a very useful step forward.

“Real Time”: The next big thing or a pointer to a much more interesting problem?

There has been a lot written and said recently about the “real time” web, most recently in an interview with Paul Buchheit on ReadWriteWeb. The premise is that if items and conversations are carried on in “real time” then they are more efficient and more engaging. The counter argument has been that they become more trivial: by dropping the barrier to involvement to near zero, the internal editorial process that forces each user to think a little about what they are saying is lost, generating a stream of drivel. I have to admit upfront that I really don’t get the excitement. It isn’t clear to me that the difference between a five or ten second refresh rate versus a 30 second one is significant.

In one sense I am all for getting a more complete record onto the web, at least if there is some probability of it being archived. After all, this is what we are trying to do with the laboratory recording effort: create as complete a record on the web as possible. But at some point there is always going to be an editorial process. In a blog it takes some effort to write a post and publish it, creating a barrier which imposes some editorial filter. Even on Twitter the 140 character limit forces people to be succinct and often means a pithy statement gets refined before hitting return. In an IM or chat window you will think before hitting return (hopefully!). Would true “real time” mean watching as someone typed, or would it have to be a full brain dump as it happened? I’m not sure I want either of these; if I want real time conversation I will pick up the phone.

But while everyone is focussed on “real time” I think it is starting to reveal a more interesting problem, one I’ve been thinking about for quite a while but have been unable to get a grip on. All of these services have different intrinsic timeframes. One of the things I dislike about the new FriendFeed interface is the “real time” nature of it. What I liked previously was that it had a slower intrinsic time than, say, Twitter or instant messaging, but a faster intrinsic timescale than a blog or email. On Twitter/IM conversations are fast, seconds to minutes, occasionally hours. On FriendFeed they tend to run from minutes to hours, with some continuing on for days, all threaded and all kept together. Conversations in blog comments run over hours to days, email over days, newspapers over weeks, and the academic literature over months and years.

Different people are comfortable with interacting with streams running at these different rates. Twitter is too much for some, as is FriendFeed, or online content at all. Many don’t have time to check blog comments, but perhaps are happy to read the posts once a day. But these people probably appreciate that the higher rate data is there. Maybe they come across an interesting blog post referring to a comment and want to check the comment, maybe the comment refers to a conversation on Twitter and they can search to find that. Maybe they find a newspaper article that leads to a wiki page and on to a pithy quote from an IM service. This type of digging is enabled by good linking practice. And it is enabled by a type of social filtering where the user views the stream at a speed which is compatible with their own needs.

The tools and social structures are now well developed for this kind of social filtering, where a user outsources that function to other people, whether they are on FriendFeed, or are bloggers or traditional dead-tree journalists. What I am less sure about is the tooling for controlling the rate of the stream that I am taking in. Deepak wrote an interesting post recently on social network filtering, with the premise that you need to build a network that you trust to bring important material to your attention. My response is that there is a fundamental problem: at the moment, you can’t independently control both the spread of the net you set and the speed at which information comes in. If you want to cover a lot of areas you need to follow a lot of people, and this means the stream is faster.

Fundamentally, as the conversation has got faster and faster, no-one seems to be developing tools that enable us to slow it down. Filtering tools such as those built into Twitter clients help. One of the things I do like about the new Friendfeed interface is the search facility that allows you to set filters displaying only those items with a certain number of “likes” or comments. But what I haven’t seen are tools that are really focussed on controlling the rate of a stream, that work to help you optimize your network to provide both spread and rate. And I haven’t seen much thought go into tools or social practices that enable you to bump an item from one stream to a slower stream to come back to later. Delicious is the obvious tool here, bookmarking objects for later attention, but how many people actually go back to their bookmarks on a regular basis and check over them?

David Allen probably best described the concept of a “Tickler File”: a file where you place items into a date-marked slot based on when you think you need to be reminded about them. The way some people regularly review their recent bookmarks and then blog the most interesting ones is an example of a process that achieves the same thing. I think this is probably a good model to think about: a tool, or set of practices, that parks items for a specified, and item- or class-specific, period of time and then pulls them back up and puts them in front of you. Or perhaps does it in a context dependent fashion, or both, picking the right moment in a specific time period to have it pop up. Ideally it will also put them, or allow you to put them, back in front of your network for further consideration as well. We still want just the one inbox for everything. It is a question of having control over the intrinsic timeframes of the different streams coming into it, including streams that we set up for ourselves.
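Purely as an illustration of the sort of tool I am imagining (none of this reflects any existing service), a tickler for stream items might look something like this minimal Python sketch:

```python
import heapq
from datetime import datetime, timedelta

class Tickler:
    """Park stream items and resurface them once their snooze period expires."""

    def __init__(self):
        self._queue = []  # min-heap of (due_time, item) pairs

    def park(self, item, days):
        """Put an item aside for an item- or class-specific number of days."""
        due = datetime.now() + timedelta(days=days)
        heapq.heappush(self._queue, (due, item))

    def due_items(self, now=None):
        """Pop and return everything whose snooze period has expired."""
        now = now or datetime.now()
        ready = []
        while self._queue and self._queue[0][0] <= now:
            ready.append(heapq.heappop(self._queue)[1])
        return ready

# Park a fast-stream item for a week; when it resurfaces it can be reviewed
# and, if still interesting, pushed back in front of your network.
tickler = Tickler()
tickler.park("Interesting FriendFeed thread on data licensing", days=7)
print(tickler.due_items(now=datetime.now() + timedelta(days=8)))
```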

As I said, I really haven’t got a good grip on this, but my main point is that I think Real Time is just a single instance of giving users access to one specific intrinsic timeframe. The much more interesting problem, and what I think will be one of the next big things is the general issue of giving users temporal control within a service, particularly for enterprise applications.

Digital Britain Unconference Oxfordshire – Friday 1 May – RAL

On Friday (yes, that’s this Friday) a series of Unconferences that has been pulled together in response to the Digital Britain Report and Forum will kick off, with one being held at the Rutherford Appleton Laboratory, near Didcot. The object of the meeting is to contribute to a coherent and succinct response to the current interim report and to try and get across to Government what a truly Digital Britain would look like. There is another unconference scheduled in Leeds and it is expected that more will follow.

If you are interested in attending the Oxfordshire meeting please register at the Eventbrite page. Registrations will close on Wednesday evening because I need to finalise the list for security the day before the meeting. I will send directions to registered attendees first thing on Thursday morning. In terms of the conduct of the unconference itself please bear in mind the admonishment of Alan Patrick:

One request I’d make – the other organisers are too polite to say it, but I will – one of the things that the Digital Britain team has made clear is that they will want feedback that is “positive, concise, based in reality and sent in as soon as possible”. That “based in reality” bit (that mainly means economics) puts a responsibility on us all to ensure all of us as attendees are briefed and educated on the subject before attending the unconference – ie come prepared, no numpties please, as that will dilute the hard work of others.

For information see the links on the right hand side of the main unconference series website, or search on “Digital Britain”.

Now that’s what I call social networking…

So there’s been a lot of antagonistic and cynical commentary about Web2.0 tools, particularly focused on Twitter but also encompassing Friendfeed and the whole range of tools that are of interest to me. Some of this is ill informed and some of it more thoughtful, but the overall tenor of the comments is that “this is all about chattering up the back, not paying attention, and making a disruption”, or at the very least that it is all trivial nonsense.

The counter argument for those of us who believe in these tools is that they offer a way of connecting with people, a means for the rapid and efficient organization of information, but above all, a way of connecting problems to the resources that can let us make things happen. The trouble has been that the best examples that we could point to were flashmobs, small scale conversations and collaborations, strangers meeting in a bar, the odd new connection made. But overall these are small things; indeed in most cases trivial things. Nothing that registers on the scale of “stuff that matters” to the powers that be.

That was two weeks ago. In the last couple of weeks I have seen a number of remarkable things happen and I wanted to talk about one of them here because I think it is instructive.

On Friday last week there was a meeting held in London to present and discuss the draft Digital Britain Report. This report, commissioned by the government, is intended to map out the needs of the UK in terms of digital infrastructure: physical, legal, and perhaps even social. The current tenor of the draft report is what you might expect, heavy on the need to put broadband everywhere to get content to people, and heavy on the need to protect big media from the rising tide of piracy. Actually it’s not all that bad, but many of the digerati felt that it misses important points about what happens when consumers are also content producers, and what that means for rights management as the asymmetry of production and consumption is broken but the asymmetry of power is not. Anyway, that’s not what’s important here.

What is important is that the sessions were webcast, a number of people were twittering from the physical audience, and a much larger number were watching and twittering from outside, aggregated around the hashtag #digitalbritain. There was reportage going on in real time from within the room and a wide-ranging conversation going on beyond its walls. In this day and age there is nothing particularly remarkable in that. It is still relatively unusual for the online audience to be bigger than the physical one for these kinds of events, but certainly not unheard of.

Nor was it remarkable when Kathryn Corrick tweeted the suggestion that an unconference should be organized to respond to the forum (actually it was Bill Thomson who was first with the suggestion, but I didn’t catch that one). People say “why don’t we do something?” all the time, usually in a bar. No, what was remarkable was what followed, as a group of relative strangers aggregated around an idea, developed and refined it, and then made it happen. One week later, on Friday evening, a website went live, with two scheduled events [1, 2] and at least two more to follow. There is an agreement with the people handling the Digital Britain report on the form an aggregated response should take. And there is the beginning of a plan as to how to aggregate the results of several meetings into that form. They want the response by 13 May.

Let’s rewind that. In a matter of hours a group of relative strangers, who met each other through something as intangible as a shared word, agreed on, and started to implement, a nationwide plan to gather the views of maybe a few hundred, perhaps a few thousand people, with the aim, and the expectation, of influencing government policy. Within a week there was a scalable framework for organizing the process of gathering the response (anyone can organize one of the meetings) and a process for pulling together a final report.

What made this possible? Essentially the range of low barrier communication, information, and aggregation tools that Web2.0 brings us.

  1. Twitter: without Twitter the conversation could never have happened. Friendfeed never got a look in because that wasn’t where this specific community was. But much more than just Twitter, the critical aspect was:
  2. The hashtag #digitalbritain: the hashtag became the central point of a conversation between people who didn’t know each other, weren’t following each other, and without that link would never have got in contact. As the conversation moved to discussing the idea of an unconference the hashtags morphed first to #digitalbritain #unconference (an intersection of ideas) and then to #dbuc09. In a sense it became serious when the hashtag was coined. The barrier to a group of sufficiently motivated people to identify each other was low.
  3. Online calendars: it took me only minutes to identify specific dates when we might hold a meeting at my workplace, because we have all of our rooms on an online calendar system. Had it been more complex I might not have bothered; as it was, it was easy to identify possible dates. The barrier to organization was low.
  4. Free and easy online services: a Yahoo Group was set up very early and used as a mailing list. WordPress.com provides a simple way of throwing up a website and giving specified people access to put up material. Eventbrite provides an easy way to manage numbers for the specific events. Sure, someone could have set these up for us on a private site, but the almost zero barrier of these services makes it easy for anyone to do this.
  5. Energy and community: these services lead to low barriers, not zero barriers. There still has to be the motivation to carry things through. In this case Kathryn provided the majority of the energy and others chipped in along the way. Higher barriers could have put a stop to the whole thing, or perhaps stopped it going national, but there needs to be some motivation to get over the barriers that do remain. What was key was that a small group of people had sufficient energy to carry it through.
  6. Flexible working hours: none of this would be possible if the people interested in attending such meetings couldn’t come at short notice. The ability of people either to arrange their own working schedule or to have the flexibility to take time out of work is crucial; otherwise no-one could come. Henry Gee had a marvelous riff on the economic benefits of flexible working just before the budget. The feasibility of our meetings is an example of the potential efficiency benefits that such flexibility could bring.

The common theme here is online services making it easy to aggregate the right people and the right information quickly, and to re-publish that information in a useful form. We will use similar services – blogs, wikis, online documents – to gather the outputs from these meetings and push them back into the policy making process. Will it make a big difference? Maybe not, but even in showing that this kind of response, this kind of community consultation, can be done effectively in a matter of days and weeks, I think we’re showing what a Digital Britain ought to be about.

What does this mean for science or research? I will come back to more research related examples over the next few weeks, but one key point was that this happened because there was a pretty large audience watching the webcast and communicating around it. As I and others have recently argued, in research the community sizes probably aren’t big enough in most cases for these sorts of network effects to kick in effectively. Building up community quantity and quality will be the main challenge of the next 6–12 months, but where the community exists and where the time is available we are starting to see rapid, agile, and bursty efforts in projects and particularly in preparing documents.

There is clearly a big challenge in taking this into the lab, but there is a good reason why, when I talk to my senior management about the resources I need, the keywords are “capacity” and “responsiveness”. Bursty work requires the capacity to be in place to resource it. In a lab this is difficult, but it is not impossible. It will probably require a reconfiguring of resource distribution to realize its potential. But if that potential can be demonstrated then the resources will almost certainly follow.

Use Cases for Provenance – eScience Institute – 20 April

On Monday I am speaking as part of a meeting on Use Cases for Provenance (Programme), which has a lot of interesting talks scheduled. I appear to be last. I am not sure whether that means I am the comedy closer or the pre-dinner entertainment. This may, however, be as a result of the title I chose:

In your worst nightmares: How experimental scientists are doing provenance for themselves

On the whole experimental scientists, particularly those working in traditional, small research groups, have little knowledge of, or interest in, the issues surrounding provenance and data curation. There is however an emerging and evolving community of practice developing the use of the tools and social conventions related to the broad set of web based resources that can be characterised as “Web 2.0”. This approach emphasises social, rather than technical, means of enforcing citation and attribution practice, as well as maintaining provenance. I will give examples of how this approach has been applied, and discuss the emerging social conventions of this community from the perspective of an insider.

The meeting will be webcast (link should be available from here) and my slides will with any luck be up at least a few minutes before my talk in the usual place.

Creating a research community monoculture – just when we need diversity

This post is a follow on from a random tweet that I sent a few weeks back in response to a query on Twitter from Lord Drayson, the UK’s Minister of State for Science and Innovation. I thought it might be an idea to expand from the 140 characters that I had to play with at the time, but it’s taken me a while to get to it. It builds on the ideas of a post from last year but is given a degree of urgency by the current changes in policy proposed by EPSRC.

Government money for research is limited, and comes from the pockets of taxpayers. It is incumbent on those of us who spend it to ensure that this investment generates maximum impact. Impact, for me, comes in two forms. Firstly there is straightforward (although not straightforward to measure) economic impact: increases in competitiveness, standard of living, development of business opportunities, social mobility, reductions in the burden of ill health and, hopefully, in environmental burden at some point in the future. The problem with economic impact is that it is almost impossible to measure in any meaningful way. The second area of impact is, at least on the surface, a little easier to track: research outputs delivered. How efficiently do we turn money into science? Scratch beneath the surface and you realise rapidly that measurement is a nightmare, but we can at least look at where there are inefficiencies, where money is being wasted and lost from the pipelines before it can be spent on research effort.

The approach that is being explicitly adopted in the UK is to concentrate research in “centres of excellence” and to “focus research on areas where the UK leads” and where “they are relevant to the UK’s needs”. At one level this sounds like motherhood and apple pie. It makes sense in terms of infrastructure investment to focus research funding both geographically and in specific subject areas. But at another level it has the potential to completely undermine the UK’s history of research excellence.

There is a fundamental problem with trying to maximise the economic impact of research, and it is one that any commercial expert, or indeed politician, should find obvious. Markets are good at picking winners; committees are very bad at it. Using committees of scientists, with little or no experience of commercialising research outputs, is likely to be an unmitigated disaster. There is no question that some research leads to commercial outcomes, but to the best of my knowledge there is no evidence that anyone has ever had any success in picking the right projects in advance. The simple fact is that the biggest form of economic impact from research is in providing and supporting the diverse and skilled workforce that supports a commercially responsive, high technology economy. To a very large extent it doesn’t actually matter what specific research you support as long as it is diverse. And you will probably generate just exactly the same amount of commercial outcomes by picking at random as you will by trying to pick winners.

The world, and the UK in particular, is facing severe challenges, both economic and environmental, for which there may be technological solutions. Indeed there is a real opportunity in the current economic climate to reboot the economy with low carbon technologies, and at the same time to rebuild the information economy in a way that takes advantage of the tools the web provides, and in turn to use this to improve outcomes in health and social welfare and to develop new environmentally friendly processes and materials. The UK has great potential to lead these developments precisely because it has a diverse research community and a diverse, highly trained research and technology workforce. We are well placed to solve today’s problems with tomorrow’s technology.

Now let us return to the current UK policy proposals. These are to concentrate research, to reduce diversity, and to focus on areas of UK strength. How will those strengths be identified? No doubt by committee. Will they be forward looking strengths? No, they will be what a bunch of old men, already selected by their conformance to a particular stereotype, i.e. the ones doing fundable research in fundable places, identify in a closed room. It is easy to identify the big challenges. It is not easy, perhaps not even possible, to identify the technological solutions that will eventually solve them. Not the currently most promising solutions, but the ones that will solve the problem five or ten years down the track.

As a thought experiment think back to what the UK’s research strengths and challenges were 20 years ago and imagine a world in which they were exclusively funded. It would be easy to argue that many of the UK’s current strengths simply wouldn’t even exist (web technology? biotechnology? polymer materials?). And that disciplines that have subsequently reduced in size or entirely disappeared would have been maintained at the cost of new innovation. Concentrating research in a few places, on a few subjects, will reduce diversity, leading to the loss of skills, and probably the loss of skilled people as researchers realise there is no future career for them in the UK. It will not provide the diverse and skilled workforce required to solve the problems we face today. Concentrating on current strengths, no matter how worthy, will lead to ossification and conservatism making UK research ultimately irrelevant on a world stage.

What we need more than ever now is a diverse and vibrant research community working on a wide range of problems, and to find better communication tools so as to efficiently connect unexpected solutions to problems in different areas. This is not the usual argument for “blue skies research”, whatever that may be. It is an argument for using market forces to do what they are best at (pick the winners from a range of possible technologies) and to use the smart people currently employed in research positions at government expense to actually do what they are good at: do research and train new researchers. It is an argument for critically looking at the expenditure of government money in a holistic way and seriously considering radical change where money is being wasted. I have estimated in the past that the annual cost of failed grant proposals to the UK government is somewhere between £100M and £500M, a large sum of money in anybody’s books. More rigorous economic analysis of a Canadian government funding scheme has shown that the cost of preparing and refereeing the proposals (around $CAN40k each) is more than the cost of giving every eligible applicant a support grant of $CAN30k. This is not just farcical, it is an offensive waste of taxpayers’ money.
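Purely to make the arithmetic behind that comparison explicit (the per-proposal figures are the rounded ones quoted above, and the number of applicants is an invented round number for illustration only):

```python
# Illustrative arithmetic only: per-proposal figures are the rounded ones quoted
# above; the number of eligible applicants is a made-up round number.
applicants = 10_000
review_cost_per_proposal = 40_000   # CAD: preparing and refereeing one proposal
baseline_grant = 30_000             # CAD: support grant per eligible applicant

print("Cost of running the competition:", applicants * review_cost_per_proposal)
print("Cost of simply funding everyone:", applicants * baseline_grant)
# The selection machinery costs more than just giving every applicant the grant.
```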

The funding and distribution of research money requires radical overhaul. I do not believe that simply providing more money is the solution. Frankly, we’ve had a lot more money; it makes life a little more comfortable if you are in the right places, but it has reduced the pressure to solve the underlying problems. We need responsive funding at a wide range of levels that enables both bursts of research, the kind of instant collaboration that we know can work, with little or no review, and large scale data gathering projects of strategic importance that need extensive and careful critical review before being approved. And we need mechanisms to tension these against each other. We need baseline funding to just let people get on with research, and we need access to larger sums where appropriate.

We need less bureaucracy, less direction from the top, and more direction from the sides, from the community, and not just necessarily the community of researchers. What we have at the moment are strategic initiatives announced by research councils that are around five years behind the leading edge, which distort and constrain real innovation. Now we have ministers proposing to identify the UK’s research strengths. No doubt these will be five to ten years out of date, and they will almost certainly stifle those pockets of excellence that will grow in strength over the next decade. No-one will ever agree on what tomorrow’s strengths will be. Much better would be to get on and find out.

Open Data, Open Source, Open Process: Open Research

There has been a lot of recent discussion about the relative importance of Open Source and Open Data (Friendfeed, Egon Willighagen, Ian Davis). I don’t fancy recapitulating the whole argument, but following a discussion on Twitter with Glyn Moody this morning [1, 2, 3, 4, 5, 6, 7, 8] I think there is a way of looking at this from a slightly different perspective. But first a short digression.

I attended a workshop late last year on Open Science run by the Open Knowledge Foundation. I spent a significant part of the time arguing with Rufus Pollock about data licences, an argument that is still going on. One of Rufus’ challenges to me was to commit to working towards using only Open Source software. His argument was that there weren’t really any excuses any more. Open Office could do the job of MS Office, Python with SciPy was up to the same level as MatLab, and anything specialist needed to be written anyway so should be open source from the off.

I took this to heart and I have tried, I really have tried. I needed a new computer and, although I got a Mac (not really ready for Linux yet), I loaded it up with Open Office, I haven’t yet put my favourite data analysis package on the computer (Igor if you must know), and I have been working in Python to try to get some stuff up to speed. But I have to ask whether this is the best use of my time. As is often the case with my arguments this is a return on investment question. I am paid by the taxpayer to do a job. At what point does the extra effort I am putting into learning to use, or in some cases fight with, new tools cost more than the benefit that is gained by making my outputs freely available?

Sometimes the problems are imposed from outside. I spent a good part of yesterday battling with an appalling, password protected, macroed-to-the-eyeballs Excel document that was the required format for me to fill in a form for an application. The file crashed Open Office and only barely functioned in Mac Excel at all. Yet it was required, in that format, before I could complete the application. Sometimes the software is just not up to scratch. Open Office Writer is fine, but the presentation and spreadsheet modules are, to be honest, a bit ropey compared to the commercial competitors. And with a Mac I now have Keynote, which is just so vastly superior that I have now transferred wholesale to that. And sometimes it is just a question of time. Is it really worth me learning Python to do data analysis that I could knock up in Igor in a tenth of the time?

In this case the answer is probably yes, because it means I can do more with it. There is the potential to build something that logs the process the way I want it to, and the potential to convert it to run as a web service. I could do these things with other OSS projects as well, in a way that I can’t with a closed product. And even better, because there is a big open community I can ask for help when I run into problems.

It is easy to lose sight of the fact that for most researchers software is a means to an end. For the Open Researcher what is important is the ability to reproduce results, to criticize and to examine. Ideally this would include every step of the process, including the software. But for most issues you don’t need, or even want, to be replicating the work right down to the metal. You wouldn’t, after all, expect a researcher to be forced to run their software on an open source computer, with an open source chipset. You aren’t necessarily worried what operating system they are running. What you are worried about is whether it is possible to read their data files and reproduce their analysis. If I take this just one step further, it doesn’t matter if the analysis is done in MatLab or Excel, as long as the files are readable in Open Office and the analysis is described in sufficient detail that it can be reproduced or re-implemented.

Let’s be clear about this: it would be better if the analysis were done in an OSS environment. If you have the option to work in an OSS environment you can also save yourself time and effort in describing the process, and others have a much better chance of identifying the sources of problems. It is not good enough to just generate an Excel file; you have to generate an Excel file that is readable by other software (and here I am looking at the increasing number of instrument manufacturers providing software that generates so-called Excel files that often aren’t even readable in Excel). In many cases it might be easier to work with OSS so as to make it easier to generate an appropriate file. But there is another important point: if OSS generates a file type that is undocumented or, worse, obfuscated, then that is also unacceptable.

Open Data is crucial to Open Research. If we don’t have the data we have nothing to discuss. Open Process is crucial to Open Research. If we don’t understand how something has been produced, or we can’t reproduce it, then it is worthless. Open Source is not necessary, but, if it is done properly, it can come close to being sufficient to satisfy the other two requirements. However it can’t do that without Open Standards supporting it for documenting both file types and the software that uses them.

The point that came out of the conversation with Glyn Moody for me was that it may be more productive to focus on our ability to re-implement rather than to simply replicate. Re-implementability, while an awful word, is closer to what we mean by replication in the experimental world anyway. Open Source is probably the best way to do this in the long term, and in a perfect world the software and support would be there to make this possible, but until we get there, for many researchers, it is a better use of their time, and the taxpayer’s money that pays for that time, to do that line fitting in Excel. And the damage is minimal as long as the source data and the parameters for the fit are made public. If we push forward on all three fronts, Open Data, Open Process, and Open Source, then I think we will get there eventually, because it is a more effective way of doing research, but in the meantime, sometimes, in the bigger picture, I think a shortcut should be acceptable.
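As a small illustration of that minimal-damage case, here is a sketch of what publishing the source data and the fit parameters might look like, whether the fit itself was done in Excel, Igor, or Python (the data values and file names are invented for the example):

```python
import csv
import numpy as np

# Toy data standing in for whatever was measured; in practice this would be the
# raw file deposited alongside the paper or notebook entry.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.1, 2.1, 3.9, 6.2, 8.0])

# Straight-line fit; the slope and intercept are the parameters that need publishing.
slope, intercept = np.polyfit(x, y, 1)

# Write the raw data and the fit parameters out as plain, openly readable CSV.
with open("raw_data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["x", "y"])
    writer.writerows(zip(x, y))

with open("fit_parameters.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["slope", "intercept"])
    writer.writerow([slope, intercept])

print(f"y = {slope:.3f} x + {intercept:.3f}")
```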