The Panton Principles: Finding agreement on the public domain for published scientific data

I had the great pleasure and privilege of announcing the launch of the Panton Principles at the Science Commons Symposium – Pacific Northwest on Saturday. The launch of the Panton Principles, many months after they were first suggested, is really largely down to the work of Jonathan Gray. This was one of several projects that I haven’t been able to follow through on properly, and I want to acknowledge the effort that Jonathan has put into making it happen. I thought it might be helpful to describe where they came from, what they are intended to do and, perhaps just as importantly, what they are not.

The Panton Principles aim to articulate a view of what best practice should be with respect to data publication for science. They arose out of an ongoing conversation between myself, Peter Murray-Rust, and Rufus Pollock. Rufus founded the Open Knowledge Foundation, an organisation that seeks to promote and support open culture, open source, and open science, with the emphasis on the open. The OKF position on licences has always been that share-alike provisions are an acceptable limitation to complete freedom to re-use content. I have always taken the Science Commons position that share-alike provisions, particularly on data, have the potential to make it difficult or impossible to get multiple datasets or systems to interoperate. In another post I will explore this disagreement, which really amounts to a different perspective on the balance of the risks and consequences of theft vs things not being used or useful. Peter in turn is particularly concerned about the practicalities, really wanting a straightforward set of rules to be baked right into publication mechanisms.

The Principles came out of a discussion in the Panton Arms, a pub near the Chemistry Department of Cambridge University, after I had given a talk in the Unilever Centre for Molecular Informatics. We were having our usual argument, trying to win the others over, when we actually turned to what we could agree on. What sort of statement could we make that would capture the best parts of both positions with a focus on science and data? We focussed further by trying to draw out one specific issue. Not the issue of when people should share results, or the details of how, but the mechanisms that should be used for re-use. The principles are intended to focus on what happens when a decision has been made to publish data and where we assume that the wish is for that data to be effectively re-used.

Where we found agreement was that for science, and for scientific data, and particularly science funded by public investment, the public domain was the best approach and that we would all recommend it. We brought John Wilbanks in both to bring the views of Creative Commons and to help craft the words. It also made a good excuse to return to the pub. We couldn’t agree on everything (we will never agree on everything), but the form of words chosen, that placing data explicitly, irrevocably, and legally in the public domain satisfies both the Open Knowledge Definition and the Science Commons Principles for Open Data, was something that we could all personally sign up to.

The end result is something that I have no doubt is imperfect. We have borrowed inspiration from the Budapest Declaration, but there are three B’s. Perhaps it will take three P’s to capture all the aspects that we need. I’m certainly up for some meetings in Pisa or Portland, Pittsburgh or Prague (less convinced about Perth but if it works for anyone else it would make my mother happy). For me it captures something that we agree on – a way forwards towards making the best possible practice a common and practical reality. It is something I can sign up to and I hope you will consider doing so as well.

Above all, it is a start.

Friendfeed for Research? First impressions of ScienceFeed

I have been saying for quite some time that I think Friendfeed offers a unique combination of functionality that seems to work well for scientists, researchers, and the people they want to (or should want to) have conversations with. For me the core of this functionality lies in two places: first that the system explicitly supports conversations that centre around objects. This is different to Twitter, which supports conversations but doesn’t centre them around the object; it is actually not trivial to find all the tweets about a given paper, for instance. Facebook now has similar functionality but it is much more often used to have pure conversation. Facebook is a tool mainly used for person to person interactions; it is user- or person-centric. Friendfeed, at least as it is used in my space, is object-centric, and this is the key aspect in which “social networks for science” need to differ from the consumer offerings in my opinion. This idea can trace a fairly direct lineage via Deepak Singh to the Jeff Jonas/Jon Udell concatenation of soundbites:

“Data finds data…then people find people”

The second key aspect of Friendfeed is that it gives the user a great deal of control over what they present to represent themselves. If we accept the idea that researchers want to interact with other researchers around research objects then it follows that the choice of objects you use to represent yourself is crucial to creating your online persona. I choose not to push Twitter into Friendfeed mainly because my tweets are directed at a somewhat different audience. I do choose to bring in video, slides, blog posts, papers, and other aspects of my work life. Others might choose to include Flickr but not YouTube. Flexibility is key because you are building an online presence. Most of the frustration I see with online social tools and their use by researchers centres around a lack of control over which content goes where and when.

So as an advocate of Friendfeed as a template for tools for scientists it is very interesting to see how that template might be applied to tools built with researchers in mind. ScienceFeed was launched yesterday by Ijad Madisch, the person behind ResearchGate. The first thing to say is that this is an out and out clone of Friendfeed, from the position of the buttons to the overall layout. It seems not to be built on the Tornado server that was open sourced by the Friendfeed team, so questions may hang over scalability and architecture, but that remains to be tested. The main UI difference from Friendfeed is that the influence of another 18 months of development of social infrastructure is evident in the use of OAuth to rapidly leverage existing networks and information on Friendfeed, Twitter, and Facebook. Although it still requires some profile setup, this is good to see. It falls short of the kind of true federation which we might hope to see in the future, but then so does everything else.

In terms of specific functionality for scientists the main addition is a specialised tool for adding content via a search of literature databases. This seems to be adapted from the ResearchGate tool for populating a profile’s publication list. A welcome addition, and certainly real tools for researchers must treat publications as first class objects. But not groundbreaking.

The real limitation of ScienceFeed is that it seems to miss the point of what Friendfeed is about. There is currently no mechanism for bringing in and aggregating diverse streams of content automatically. It is nice to be able to manually share items in my CiteULike library but this needs to happen automatically. My blog posts need to come in, as do my slideshows on SlideShare and my preprints on Nature Precedings or arXiv. Most of this information is accessible via RSS feeds, so import via RSS/Atom (and in the future real time protocols like XMPP) is an absolute requirement. Without this functionality, ScienceFeed is just a souped up microblogging service. And as was pointed out yesterday in one Friendfeed thread, we have a Twitter-like service for scientists. It’s called Twitter. With the functionality of automatic feed aggregation Friendfeed can become a presentation of yourself as a researcher on the web. An automated publication list that is always up to date and always contains your latest (public) thoughts, ideas, and content. In short, your web-native business card and CV all rolled into one.
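To make the aggregation point concrete, here is a rough sketch of what automatic import could look like. It is not how Friendfeed or ScienceFeed actually work; the function names (`parse_rss`, `aggregate`) and the dictionary layout are my own assumptions for illustration, and it only handles plain RSS 2.0 items that carry a `pubDate` with an explicit timezone offset.

```python
import xml.etree.ElementTree as ET
from email.utils import parsedate_to_datetime


def parse_rss(xml_text, source):
    """Extract title, link and date from each <item> of one RSS 2.0 feed.

    `source` is a hypothetical label (e.g. "blog", "slides") used to
    remember which stream an entry came from.
    """
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        pub_date = item.findtext("pubDate")
        if pub_date is None:
            # Skip undated items: they cannot be placed in the stream.
            continue
        entries.append({
            "source": source,
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            # Assumes an RFC 822 date with an offset such as +0000.
            "published": parsedate_to_datetime(pub_date),
        })
    return entries


def aggregate(feeds):
    """Merge several feeds (label -> RSS text) into one stream, newest first."""
    merged = []
    for source, xml_text in feeds.items():
        merged.extend(parse_rss(xml_text, source))
    merged.sort(key=lambda entry: entry["published"], reverse=True)
    return merged
```

Fetching the feeds on a schedule, handling Atom, and de-duplicating entries are all left out; the point is only that once each service exposes a feed, building the combined stream is a small amount of work.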

Finally there is the problem of the name. I was very careful at the top of this post to be inclusive in the scope of people who I think can benefit from Friendfeed. One of the great strengths of Friendfeed is that it has promoted conversations across boundaries that are traditionally very hard to bridge. The ongoing collision between the library and scientific communities on Friendfeed may rank one day as its most important achievement, at least in the research space. I wonder whether the conversations that have sparked there would have happened at all without the open scope that allowed communities to form without prejudice as to where they came from and then to find each other and mingle. There is nothing in ScienceFeed that precludes anyone from joining as far as I can see, but the name is potentially exclusionary, and I think unfortunate.

Overall I think ScienceFeed is a good discussion point, a foil to critical thinking, and potentially a valuable fallback position if Friendfeed does go under. It is a place where the wider research community could have a stronger voice about development direction and an opportunity to argue more effectively for business models that can provide confidence in a long term future. I think it currently falls far short of being a useful tool but there is the potential to use it as a spur to build something better. That might be ScienceFeed v2 or it might be an entirely different service. In a follow-up post I will make some suggestions about what such a service might look like but for now I’d be interested in what other people think.

Other Friendfeed threads are here and here and Techcrunch has also written up the launch.

Science Commons Symposium – Redmond 20th February

One of the great things about being invited to speak that people don’t often emphasise is that it gives you space and time to hear other people speak. And sometimes someone puts together a programme that means you just have to shift the rest of the world around to make sure you can get there. Lisa Green and Hope Leman have put together the biggest concentration of speakers in the Open Science space that I think I have ever seen for the Science Commons Symposium – Pacific Northwest to be held on the Microsoft Campus in Redmond on 20 February. If you are in the Seattle area and have an interest in the future of science, whether pro- or anti- the “open” movement, or just want to hear some great talks you should be there. If you can’t be there then watch out for the video stream.

Along with me you’ll get Jean-Claude Bradley, Antony Williams, Peter Murray-Rust, Heather Joseph, Stephen Friend, Peter Binfield, and John Wilbanks. Everything from policy to publication, software development to bench work, and from capturing the work of a single researcher to the challenges of placing several hundred millions dollars worth of drug discovery data into the public domain. All with a focus on how we make more science available and generate more and innovative. Not to be missed, in person or online – and if that sounds too much like self promotion then feel free to miss the first talk… ;-)

Peer review: What is it good for?

It hasn’t been a real good week for peer review. In the same week that the Lancet fully retracted the original Wakefield MMR article (while keeping the retraction behind a login screen; way to go there on public understanding of science), the mainstream media went to town on the report of 14 stem cell scientists writing an open letter making the claim that peer review in that area was being dominated by a small group of people blocking the publication of innovative work. I don’t have the information to actually comment on the substance of either issue but I do want to reflect on what this tells us about the state of peer review.

There remains much reverence for the traditional process of peer review. I may be over-interpreting the tenor of Andrew Morrison’s editorial in BioEssays but it seems to me that he is saying, as many others have over the years, “if we could just have the rigour of traditional peer review with the ease of publication of the web then all our problems would be solved”. Scientists worship at the altar of peer review, and I use that metaphor deliberately because it is rarely if ever questioned. Somehow the process of peer review is supposed to sprinkle some sort of magical dust over a text which makes it “scientific” or “worthy”, yet while we quibble over details of managing the process, or complain that we don’t get paid for it, rarely is the fundamental basis on which we decide whether science is formally published examined in detail.

There is a good reason for this. THE EMPEROR HAS NO CLOTHES! [sorry, had to get that off my chest]. The evidence that peer review as traditionally practiced is of any value at all is equivocal at best (Science 214, 881; 1981, J Clinical Epidemiology 50, 1189; 1998, Brain 123, 1954; 2000, Learned Publishing 22, 117; 2009). It’s not even really negative. That would at least be useful. There are a few studies that suggest peer review is somewhat better than rolling a die and a bunch that say it is much the same. Perhaps the best we might say is that it is at its best when dealing with narrow technical questions and at its worst when determining “importance”. Which for anyone who has tried to get published in a top journal or written a grant proposal ought to be deeply troubling. Professional editorial decisions may in fact be more reliable, something that Philip Campbell hints at in his response to questions about the open letter [BBC article]:

Our editors […] have always used their own judgement in what we publish. We have not infrequently overruled two or even three sceptical referees and published a paper.

But there is perhaps an even more important procedural issue around peer review. Whatever value it might have we largely throw away. Few journals make referees’ reports available, and virtually none track the changes made in response to referees’ comments, which would enable a reader to make their own judgement as to whether a paper was improved or made worse. Referees get no public credit for good work, and no public opprobrium for poor or even malicious work. And in most cases a paper rejected from one journal starts completely afresh when submitted to a new journal, the work of the previous referees simply thrown out of the window.

Much of the commentary around the open letter has suggested that the peer review process should be made public. But only for published papers. This goes nowhere near far enough. One of the key points where we lose value is in the transfer from one journal to another. The authors lose out because they’ve lost their priority date (in the worst case giving the malicious referees the chance to get their paper in first). The referees miss out because their work is rendered worthless. Even the journals are losing an opportunity to demonstrate the high standards they apply in terms of quality and rigor, and indeed the high expectations they have of their referees.

We never ask what the cost of not publishing a paper is or what the cost of delaying publication could be. Eric Weinstein provides the most sophisticated view of this that I have come across and I recommend watching his talk at Science in the 21st Century from a few years back. There is a direct cost to rejecting papers, both in the time of referees and the time of editors, as well as the time required for authors to reformat and resubmit. But the bigger problem is the opportunity cost: how much that might have been useful, or even important, is never published? And how much is research held back by delays in publication? How many follow-up studies are not done, how many leads are not followed up, and perhaps most importantly how many projects are not re-funded, or only funded once the carefully built up expertise in the form of research workers is lost?

Rejecting a paper is like gambling in a game where you can only win. There are no real downside risks for either editors or referees in rejecting papers. There are downsides, as described above, and those carry real costs, but those are never borne by the people who make or contribute to the decision. It’s as though it were a futures market where you can only lose if you go long, never if you go short on a stock. In Eric’s terminology those costs need to be carried: we need to require that referees and editors who “go short” on a paper or grant unwind their position if they get it wrong. This is the only way we can price the downside risks into the process. If we want open peer review, indeed if we want peer review in its traditional form, along with the caveats, costs and problems, then the most important advance would be to have it for unpublished papers.

Journals need to acknowledge the papers they’ve rejected, along with dates of submission. Ideally all referees’ reports should be made public, or at least re-usable by the authors. If full publication, of either the submitted form of the paper or the referees’ reports, is not acceptable then journals could publish a hash of the submitted document and reports against a local key, enabling the authors to demonstrate submission date and the provenance of referees’ reports as they take them to another journal.
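As a minimal sketch of the hash-against-a-local-key idea, assuming the journal holds a secret key and publishes only the resulting digest alongside the submission date: the function names, the message layout, and the choice of HMAC-SHA256 are my own assumptions for illustration, not a description of anything a journal currently does.

```python
import hashlib
import hmac


def submission_receipt(document: bytes, submitted_on: str, journal_key: bytes) -> str:
    """Return a hex digest binding the document bytes to a submission date.

    `journal_key` is the journal's private ("local") key; only the digest
    is published, so the content of the paper or report stays confidential.
    """
    message = submitted_on.encode("utf-8") + b"\n" + document
    return hmac.new(journal_key, message, hashlib.sha256).hexdigest()


def verify_receipt(document: bytes, submitted_on: str,
                   journal_key: bytes, published_digest: str) -> bool:
    """Check a document and claimed date against a previously published digest."""
    expected = submission_receipt(document, submitted_on, journal_key)
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, published_digest)
```

Because the key stays with the journal, it is the journal (or someone it trusts with the key) that confirms a claim when the authors present the same file and date elsewhere. A scheme that lets anyone verify without the key would need a plain public hash or a digital signature instead, which is a different trade-off.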

In my view referees need to be held accountable for the quality of their work. If we value this work we should also value and publicly laud good examples. And conversely poor work should be criticised. Any scientist has received reviews that are, if not malicious, then incompetent. And even if we struggle to admit it to others we can usually tell the difference between critical, but constructive (if sometimes brutal), and nonsense. Most of us would even admit that we don’t always do as good a job as we would like. After all, why should we work hard at it? No credit, no consequences, why would you bother? It might be argued that if you put poor work in you can’t expect good work back out when your own papers and grants get refereed. This again may be true, but only in the long run, and only if there are active and public pressures to raise quality. None of which I have seen.

Traditional peer review is hideously expensive. And currently there is little or no pressure on its contributors or managers to provide good value for money. It is also unsustainable at its current level. My solution to this is to radically cut the number of peer reviewed papers probably by 90-95% leaving the rest to be published as either pure data or pre-prints. But the whole industry is addicted to traditional peer reviewed publications, from the funders who can’t quite figure out how else to measure research outputs, to the researchers and their institutions who need them for promotion, to the publishers (both OA and toll access) and metrics providers who both feed the addiction and feed off it.

So that leaves those who hold the purse strings, the funders, with a responsibility to pursue a value for money agenda. A good place to start would be a serious critical analysis of the costs and benefits of peer review.

Addition after the fact: It was pointed out in the comments that there are other posts/papers I should have referred to where people have raised similar ideas and issues. In particular Martin Fenner’s post at Nature Network. The comments are particularly good as an expert analysis of the usefulness of the kind of “value for money” critique I have made. Also a paper on arXiv from Stefano Allesina. Feel free to mention others and I will add them here.

Everything I know about software design I learned from Greg Wilson – and so should your students

Which is not to say that I am any good at software engineering, good practice, or writing decent code. And you shouldn’t take Greg to task for some of the dodgy demos I’ve done over the past few months either. What he does need to take the credit for is enabling me to go from someone who knew nothing at all about software design, the management of software development or testing to being able to talk about these things, ask some of the right questions, and even begin to make some of my own judgements about code quality in an amazingly short period of time. From someone who didn’t know how to execute a Python script to someone who feels uncomfortable working with services where I can’t use a testing framework before deploying software.

This was possible through the online component of the training programme, called Software Carpentry, that Greg has been building, delivering and developing over the past decade. This isn’t a course in software engineering and it isn’t built for computer science undergraduates. It is a course focussed on taking scientists who have done a little bit of tinkering or scripting and giving them the tools, the literacy, and the knowledge to apply the best of the knowledge base of software engineering to building useful, high quality code that solves their problems.

Code and computational quality has never been a priority in science and there is a strong argument that we are currently paying, and will continue to pay, a heavy price for that unless we sort out the fundamentals of computational literacy and practices as these tools become ubiquitous across the whole spread of scientific disciplines. We teach people how to write up an experiment, but we don’t teach them how to document code. We teach people the importance of significant figures but many computational scientists have never even heard of version control. And we teach the importance of proper experimental controls but never provide the basic training in testing and validating software.

Greg is seeking support to enable him to update Software Carpentry to provide an online resource for the effective training of scientists in basic computational literacy. It won’t cost very much money; we’re talking a few hundred thousand dollars here. And the impact is potentially both important and large. If you care about the training of computational scientists (not computer scientists, but the people who need, or could benefit from, some coding, data management, or processing in their day to day scientific work) and you have money, then I encourage you to contribute. If you know people or organizations with money please encourage them to contribute. Like everything important, especially anything to do with education and preparing for the future, these things are tough to fund.

You can find Greg at his blog: http://pyre.third-bit.com

His description of what he wants to do and what he needs to do it is at: http://pyre.third-bit.com/blog/archives/3400.html

Why I am disappointed with Nature Communications

Towards the end of last year I wrote up some initial reactions to the announcement of Nature Communications and the communications team at NPG were kind enough to do a Q&A to look at some of the issues and concerns I raised. Specifically I was concerned about two things: the licence that would be used for the “Open Access” option, and the way the journal would be positioned in terms of “quality”, particularly as it related to the other NPG journals and the approach to peer review.

Unfortunately I have to say that I feel these have been fudged, and this is unfortunate because there was a real opportunity here to do something different and quite exciting.  I get the impression that that may even have been the original intention. But from my perspective what has resulted is a poor compromise between my hopes and commercial concerns.

At the centre of my problem is the use of a Creative Commons Attribution Non-commercial licence for the “Open Access” option. This doesn’t qualify under the BBB declarations on Open Access publication and it doesn’t qualify for the SPARC seal for Open Access. But does this really matter or is it just a side issue for a bunch of hard core zealots? After all if people can see it that’s a good start isn’t it? Well yes, it is a good start but non-commercial terms raise serious problems. Putting aside the fact that there is an argument that universities are commercial entities and therefore can’t legitimately use content with non-commercial licences the problem is that NC terms limit the ability of people to create new business models that re-use content and are capable of scaling.

We need these business models because the current model of scholarly publication is simply unaffordable. The argument is often made that if you are unsure whether you are allowed to use content then you can just ask, but this simply doesn’t scale. And let’s be clear about some of the things that NC means you’re not licensed for: using a paper for commercially funded research even within a university, using the content of a paper to support a grant application, using the paper to judge a patent application, using a paper to assess the viability of a business idea…the list goes on and on. Yes you can ask if you’re not sure, but asking each and every time does not scale. This is the central point of the BBB declarations. For scientific communication to scale it must allow the free movement and re-use of content.

Now if this were coming from any old toll access publisher I would just roll my eyes and move on, but NPG sets itself up to be judged by a higher standard. NPG is a privately held company, not beholden to shareholders. It is a company that states that it is committed to advancing scientific communication, not simply traditional publication. Non-commercial licences do not do this. From the Q&A:

Q: Would you accept that a CC-BY-NC(ND) licence does not qualify as Open Access under the terms of the Budapest and Bethesda Declarations because it limits the fields and types of re-use?

A: Yes, we do accept that. But we believe that we are offering authors and their funders the choices they require. Our licensing terms enable authors to comply with, or exceed, the public access mandates of all major funders.

NPG is offering the minimum that allows compliance. Not what will most effectively advance scientific communication. Again, I would expect this of a shareholder-controlled profit-driven toll access dead tree publisher but I am holding NPG to a higher standard. Even so there is a legitimate argument to be made that non-commercial licences are needed to make sure that NPG can continue to support these and other activities. This is why I asked in the Q&A whether NPG made significant money off re-licensing of content for commercial purposes. This is a discussion we could have on the substance – the balance between a commercial entity providing a valuable service and the necessary limitations we might accept as the price of ensuring the continued provision of that service. It is a value for money judgement. But not one we can make without a clear view of the costs and benefits.

So I’m calling NPG on this one. Make a case for why non-commercial licences are necessary or even beneficial, not why they are acceptable. They damage scientific communication, they create unnecessary confusion about rights, and more importantly they damage the development of new business models to support scientific communication. Explain why it is commercially necessary for the development of these new activities, or roll it back, and take a lead on driving the development of science communication forward. Don’t take the kind of small steps we expect from other, more traditional, publishers. Above all, let’s have that discussion. What is the price we would have to pay to change the licence terms?

Because I think it goes deeper. I think that NPG are actually limiting their potential income by focussing on the protection of their income from legacy forms of commercial re-use. They could make more money off this content by growing the pie than by protecting their piece of a specific income stream. It goes to the heart of a misunderstanding about how to effectively exploit content on the web. There is money to be made through re-packaging content for new purposes. The content is obviously key but the real value offering is the Nature brand. Which is much better protected as a trademark than through licensing. Others could re-package and sell on the content but they can never put the Nature brand on it.

By making the material available for commercial re-use NPG would help to expand a high value market for re-packaged content which they would be poised to dominate. Sure, if you’re a business you could print off your OA Nature articles and put them on the coffee table, but if you want to present them to investors you want that Nature logo and Nature packaging that you can only get from one place. And that NPG does damn well. NPG often makes the case that it adds value through selection, presentation, and aggregation. It is the editorial brand that is of value. Let’s see that demonstrated through monetization of the brand, rather than through unnecessarily restricting the re-use of the content, especially where authors are being charged $5000 to cover the editorial costs.

New Year – New me

Apologies for any weirdness in your feed readers. The following is the reason why, as I try to get things working properly again.

For the past two years on this blog I made some New Year’s resolutions, and last year I assessed my performance against the previous year’s aims. This year I will admit to simply being a bit depressed about how much I achieved in real terms and how effective I’ve been at getting ideas out and projects off the ground. This year I want to do more in terms of walking the walk, creating examples, or at least lashups of the things I think are important.

One thing that has been going around in my head for at least 12 months is the question of identity. How I control what I present, who I depend on, and in the world of a semantic web where I am represented by a URL what should actually be there when someone goes to that address. So the positive thing I did over the holiday break, rather than write a new set of resolutions was to start setting up my own presence on the web, to think about what I might want to put there and what it might look like.

This process is not as far along as I would like but it’s far enough along that this will be the last post at this address. OpenWetWare has been an amazing resource for me over the past several years and we will continue to use the wiki for laboratory information and I hope to work with the team in whatever way I can as the next generation of tools develops. OpenWetWare was also a safe place where I could learn about blogging without worrying about the mechanics, confident in the knowledge that Bill Flanagan was covering the backstops. Bill is the person who has kept things running through the various technical ups and downs and I’d particularly like to thank him for all his help.

However I have now learnt enough to be dangerous and want to try some more things out on my own. More than can be conveniently managed on a website that someone else has to look after. I will write a bit more about the idea and choices I’ve made in setting up the site soon but for the moment I just want to point you to the new site and offer you some choices about subscribing to different feeds.

If you are on the feedburner feed for the blog you should be automatically transferred over to the feed on the new site. If you’re reading in a feed reader you can check this by just clicking through to the item on my site. If you end up at a url starting https://cameronneylon.net/ then you are in the right place. If not, just change your reader to point at http://feeds.feedburner.com/ScienceInTheOpen.

This feed will include posts on things like papers and presentations as well as blog posts, so if you are already getting that content in another stream and prefer to just get the blog posts via RSS you should point your reader at http://feeds.feedburner.com/ScienceInTheOpen_blog. I can’t test this until I actually post something so just hold tight if it doesn’t work and I will try to get it working as soon as I can. The comments feed for all seven of you subscribed to it should keep working. All the posts are mirrored on the new site and will continue to be available at OpenWetWare.

Once again I’d like to thank all the people at OpenWetWare that got me going in the blogging game and hope to see you over at the new site as I figure out what it means to present yourself as a scientist on the web.

Google Wave: Ripple or Tsunami?

Big Wave Surfing in Tahiti at Teahupoo
Image by thelastminute via Flickr

A talk given at the Edinburgh University IT Futures meeting late in 2009. The talk discusses the strengths and weaknesses of Wave as a tool for research and provides some pointers on how to think about using it in an academic setting. The talk was recorded in a Wave with members of the audience taking notes around images of the slides which I had previously uploaded.

You will only be able to see the wave if you have a Wave preview account and are logged in. If you don’t have an account the slides are included below (or will be as soon as I can get slideshare to talk to me).

[wave id=”googlewave.com!w+-c2g1ggkA”]

What should social software for science look like?

Nat Torkington, picking up on my post over the weekend about the CRU emails, takes a slant which has helped me figure out how to write this post, which I was struggling with. He says:

[from my post...my concern is that in a kneejerk response to suddenly make things available no-one will think to put in place the social and technical infrastructure that we need to support positive engagement, and to protect active researchers, both professional and amateur from time-wasters.] Sounds like an open science call for social software, though I’m not convinced it’s that easy. Humans can’t distinguish revolutionaries from terrorists, it’s unclear why we think computers should be able to.

As I responded over at Radar, yes I am absolutely calling for social software for scientists, but I didn’t mean to say that we could expect it to help us find the visionaries amongst the simply wrong. But this raises a very helpful question. What is it that we would hope Social Software for Science would do? And is that realistic?

Over the past twelve months I seem to have got something of a reputation for being a grumpy old man about these things, because I am deeply sceptical of most of the offerings out there. Partly because most of these services don’t actually know what it is they are trying to do, or how it maps on to the success stories of the social web. So prompted by Nat I would like to propose a list of what effective Social Software for Science (SS4S) will do and what it can’t.

  1. SS4S will promote engagement with online scientific objects and through this encourage, and provide paths for, those with enthusiasm but insufficient expertise to gain enough expertise to contribute effectively (see e.g. Galaxy Zoo). This includes but is certainly not limited to collaborations between professional scientists. These are merely a special case of the general.
  2. SS4S will measure and reward positive contributions, including constructive criticism and disagreement (Stack Overflow vs YouTube comments). Ideally such measures will value quality of contribution rather than opinion, allowing disagreement to be both supported when required and resolved when appropriate.
  3. SS4S will provide single click through access to available online scientific objects and make it easy to bring references to those objects into the user’s personal space or stream (see e.g. Friendfeed “Like” button).
  4. SS4S should provide zero effort upload paths to make scientific objects available online while simultaneously assuring users that this upload and the objects are always under their control. This will mean in many cases that what is being pushed to the SS4S system is a reference not the object itself, but will sometimes be the object to provide ease of use. The distinction will ideally be invisible to the user in practice barring some initial setup (see e.g. use of Posterous as a marshalling yard).
  5. SS4S will make it easy for users to connect with other users and build networks based on a shared interest in specific research objects (Friendfeed again).
  6. SS4S will help the user exploit that network to collaboratively filter objects of interest to them and of importance to their work. These objects might be results, datasets, ideas, or people.
  7. SS4S will integrate with the user’s existing tools and workflow and enable them to gradually adopt more effective or efficient tools without requiring any severe breaks (see Mendeley/Citeulike/Zotero/Papers and DropBox).
  8. SS4S will work reliably and stably with high performance and low latency.
  9. SS4S will come to where the researcher is working both with respect to new software and also unusual locations and situations requiring mobile, location sensitive, and overlay technologies (Layar, Greasemonkey, voice/gesture recognition – the latter largely prompted by a conversation I had with Peter Murray-Rust some months ago).
  10. SS4S will be trusted and reliable with a strong community belief in its long term stability. No single organization holds or probably even can hold this trust so solutions will almost certainly need to be federated, open source, and supported by an active development community.
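
As a rough illustration of point 6, network-based filtering can be sketched in a few lines of Python. Everything here is hypothetical: the `likes` data layout, the function name `rank_by_network`, and the example identifiers are assumptions made for the sketch, not a description of how any existing service works. A real system would need to weight contributions far more carefully.

```python
from collections import Counter

def rank_by_network(likes, network, already_seen=frozenset()):
    """Rank objects by how many people in `network` have flagged them.

    likes: dict mapping a person to the set of object ids they flagged.
    network: the people the user follows.
    already_seen: object ids to leave out of the ranking.
    """
    counts = Counter()
    for person in network:
        for obj in likes.get(person, ()):
            if obj not in already_seen:
                counts[obj] += 1
    # Most-flagged first; ties broken by id so the order is stable.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

# Hypothetical example data: who flagged which research objects.
likes = {
    "alice": {"dataset:42", "paper:7"},
    "bob": {"dataset:42", "idea:3"},
    "carol": {"paper:7", "dataset:42"},
    "dave": {"paper:99"},  # not in the user's network, so ignored
}
ranked = rank_by_network(
    likes,
    network={"alice", "bob", "carol"},
    already_seen={"idea:3"},
)
print(ranked)
```

The point of the sketch is only that the filter is driven by the people the user has chosen to follow rather than by a global popularity count. The objects could equally be results, datasets, ideas, or people.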

What SS4S won’t do is recognize geniuses when they are out in the wilderness amongst a population of the just plain wrong. It won’t solve the cost problems of scientific publication and it won’t turn researchers into agreeable, supportive, and collaborative human beings. Some things are beyond even the power of Web 2.0.

I was originally intending to write this post from a largely negative perspective, ranting as I have in the past about how current services won’t work. I think now there is a much more positive approach. Let’s go out there and look at what has been done, what is being done, and how well it is working in this space. I’ve set up a project on my new wiki (don’t look too closely, I haven’t finished the decorating) and if you are interested in helping out with a survey of what’s out there I would appreciate the help. You should be able to log in with an OpenID as long as you provide an email address. Check out this Friendfeed thread for some context.

My belief is that we are near to a position where we could build a useful requirements document for such a beast, with references to what has worked and what hasn’t. We may not have the resources to build it and maybe the NIH projects currently funded will head in that direction. But what is valuable is to pull the knowledge together to figure out the most effective path forward.

It wasn’t supposed to be this way…

I’ve avoided writing about the Climate Research Unit emails leak for a number of reasons. Firstly it is clearly a sensitive issue with personal ramifications for some and for many others just a very highly charged issue. Probably more importantly I simply haven’t had the time or energy to look into the documents myself. I haven’t, as it were, examined the raw data for myself, only other people’s interpretations. So I’ll try to stick to a very general issue here.

There appear to be broadly two responses from the research community to this saga. One is to close ranks and to a certain extent say “nothing was done wrong here”. This is, at some level, the tack taken by the Nature Editorial of 3 December, which was headed up with “Stolen e-mails have revealed no scientific conspiracy…”. The other response is that the scandal has exposed the shambolic way that we deal with collecting, archiving, and making available both data and analysis in science, as well as the endemic issues around the hoarding of data by those who have collected it.

At one level I belong strongly in the latter camp, but I also appreciate the dismay that must be felt by those who have looked at and understand what the emails actually contain, and their complete inability to communicate this into the howling winds of what seems to a large extent a media beatup. I have long felt that the research community would one day be shocked by the public response when, for whatever reason, the media decided to make a story about the appalling data sharing practices of publicly funded academic researchers like myself. If I’d thought about it more deeply I should have realised that this would most likely be around climate data.

Today the Times reports on its front page that the UK Meteorological Office is to review 160 years of climate data and has asked a range of contributing organisations to allow it to make data public. The details of this are hazy but if the UK Met Office is really going to make the data public this is a massive shift. I might be expected to be happy about this but I’m actually profoundly depressed. While it might in the longer term lead to more strongly worded and enforced policies it will also lead to data sharing being forever associated with “making the public happy”. My hope has always been that the sharing of the research record would come about because people started to see the benefits, because they could see the possibilities in partnership with the wider community, and because it made their research more effective. Not because the tabloids told us we should.

Collecting the best climate data and doing the best possible analysis on it is not optional. If we get this wrong and don’t act effectively then with some probability that is significantly above zero our world ends. The opportunity is there to make this the biggest, most important, and most effective research project ever undertaken. To actively involve the wider community in measurement. To get an army of open source coders to re-write, audit, and re-factor the analysis software. Even to involve the (positively engaged) sceptics, to use their interest and ability to look for holes and issues. Whether politicians will act on data is not the issue that the research community can or should address; what we need to be clear on is that we provide the best data, the best analysis, and an honest view of the uncertainties, along with the ability for anyone to critically analyse the basis for those conclusions.

There is a clear and obvious problem with this path. One of the very few credible objections to open research that I have come across is that by making material available you open your inbox to a vast community of people who will just waste your time. The people who can’t be bothered to read the background literature or learn to use the tools; the ones who just want the right answer. This is nowhere more the case than it is with climate research and it forms the basis for the most reasonable explanation of why the CRU (and every other repository of climate data as far as I am aware) have not made more data or analysis software directly available.

There are no simple answers here, and my concern is that in a kneejerk response to suddenly make things available no-one will think to put in place the social and technical infrastructure that we need to support positive engagement, and to protect active researchers, both professional and amateur, from time-wasters. Interestingly I think this infrastructure might look very similar to that which we need to build to effectively share the research we do, and effectively discover the relevant work of others. Infrastructure is never sexy, particularly in the middle of a crisis. But there is one thing in the practice of research that we forget at our peril. Any given researcher needs to earn the right to be taken seriously. No-one ever earns the right to shut people up. Picking out the objection that happens to be important is something we have to at least attempt to build into our systems.