Reflections on the Open Science workshop at PSB09

In a few hours I will be giving a short presentation to the whole of the PSB conference on the workshop that we ran on Monday. We are still thinking through the details of what has come out of it, and hopefully the discussion will continue in any case, so this is a personal view. The slides for the presentation are available on Slideshare.

For me there were a few key points that came out. None of them are surprising, but they bear repeating:

  • Citation, and improving and expanding the way it is used, lies at the core of making sure that people get credit for the work they do and for making the widest range of useful contributions to the research community
  • Persistence of identity and persistence of objects (in general, the persistence of resources) are absolutely critical to making a wider citation culture work. We must know who generated something, and be able to point to it in the long term, to deliver on the potential of credit.
  • “If you build it they won’t come” – building a service, whether a technical or a social one, depends on a community that uses and adds value to those services. Build the service for the community and build the community for the service. Don’t solve the problems that you think people have – solve the ones that they tell you they have

The main point for me grew out of the panel session and was perhaps articulated best by Drew Endy: identify specific problems (not ideological issues) and make the process more efficient. Ideology may help to guide us, but it can also blind us to specific issues and hide the underlying reasons for specific successes and failures from view. We have a desperate need for both qualitative data (stories about successes and failures) and quantitative data (hard numbers on uptake and on the consequences of adopting specific practices).

Taking inspiration from Drew's keynote: we have an evolved system for doing research that was not designed to be easily understood or modified. We need to take an experimental approach to identifying and solving the specific problems that would let us increase the efficiency of the research process. Drew's point was that this should be a proper research discipline in its own right, with the funding and respect that goes with it. For the presentation I summarised this as follows:

Improving the research process is an area for (experimental) research that requires the same rigour, standards (and funding) as anything else that we do

Brief running report on the Open Science Workshop at PSB09

Just a very brief rundown of what happened at the workshop this morning and some central themes that came out of it. The slides from the talks are available on Slideshare and recorded video from most of the talks (unfortunately not Dave de Roure's or Phil Bourne's at the moment) is available on my Mogulus channel (http://www.mogulus.com/cameron_neylon – click on Video on Demand and select the PSB folder). The commentary from the conference is available in the PSB 2009 Friendfeed room.

For me there were three main themes that came through from the talks and the panel session. The first was one that has come up in many contexts, but most recently in Phil Bourne and Lynn Fink's perspectives article in PLoS Computational Biology: the need for persistent identity tokens to track people's contributions, and the need to re-think how citation works and what citations are used for.

The second theme was a need for more focus on specific issues, including domain-specific problems or barriers, where "greasing the wheels" could make a direct difference to people's ability to do their research – solving specific problems that are not necessarily directly associated with "openness" as an ideological movement. Similar ideas were raised in the discussion of tool and service development: the need to build the user into the process of service design and to solve the problems users actually have, rather than those the developer thinks they ought to be worrying about.

But probably the main theme that came through for me was the need to identify and measure real outcomes from adopting more open practice. This was the central theme of Heather's talk but also came up strongly in the panel session. We have little if any quantitative information on the benefits of open practice, and there are still relatively few complete examples of open research projects. More research and more aggregation of examples will help here, but there is a desperate need for numbers and details to help funders, policy makers, and researchers themselves to make informed choices about which approaches are worth adopting, and indeed which are not.

The session was a good conversation, with some great talks, and lots of people involved throughout. Even with a three hour slot we ran 30 minutes over and could have kept talking for quite a bit longer. We will keep posting material over the next few days so please continue the discussion over at Friendfeed and on the workshop website.

Final countdown to Open Science@PSB

As I noted in the last post, we are rapidly counting down towards the final few days before the Open Science Workshop at the Pacific Symposium on Biocomputing. I am flying out from Sydney to Hawaii this afternoon and may or may not have network connectivity in the days leading up to the meeting. So just some quick notes here on where you can find any final information if you are coming or if you want to follow online.

The workshop website is available at psb09openscience.wordpress.com and this is where information will be posted in the lead-up to the workshop, with links to presentations and any other material posted afterwards.

If you want to follow in something closer to real time then there is a Friendfeed room available at friendfeed.com/rooms/psb-2009 which will have breaking information and live blogging during the workshop and throughout the conference. I will be aiming to broadcast video of the workshop at www.mogulus.com/cameron_neylon but this will depend on how well the wireless is working on the day, and it will not be the highest priority. Updates on whether it is functioning or not will be in the Friendfeed room, and I will not be monitoring the chat room on the Mogulus feed. If there are technical issues please leave a message in the Friendfeed room and I will try to fix the problem, or at least say if I can't.

Otherwise I hope to see many of you at the workshop either in person or online!

New Year’s Resolutions 2009

[Image: Sydney Harbour Bridge NYE fireworks]

All good traditions require someone to make an arbitrary decision to do something again. Last year I threw up a few New Year's resolutions in the hours before NYE in the UK. Last night I was out on the shore of Sydney Harbour. I had the laptop – I thought about writing something – and then I thought, nah, I can just lie here and look at the pretty lights. However I did want to follow up on the successes and failures of last year's resolutions and maybe make a few more for this year.

So last year's resolutions were, roughly speaking: 1) to adopt the principles of the NIH Open Access mandate when choosing journals for publications, 2) to get more of the existing data within my group online and available, 3) to take the whole research group fully open notebook, 4) to mention Open Notebooks in every talk I gave, and 5) to attempt to get explicit funding for developing open notebook approaches.

So, successes: the research group at RAL is now (technically) working on an Open Notebook basis. This has taken a lot longer than we expected and the guys are still getting a feel for what that means, both in terms of how they record things and how they feel about it. I think it will improve over time, and it just reinforces the message that none of this is easy. I also made a point of talking about the Open Notebook approach in every talk I gave – mostly this was well received; often there was some scepticism, but the message is getting out there.

However we didn't do so well on picking journals – most of the papers I was on this year were driven by other people, or were directed requests for special issues, or both. The papers that I had in mind still haven't been written; some drafts exist, but they're definitely not finished. I also haven't done any real work on getting older data online – it has been enough work just trying to manage the stuff we already have.

Funding is a mixed bag – the network proposal that went in last New Year was rejected. A few proposals have gone in – more haven't gone in but exist in draft form – and a group of us came close to winning a tender to do some research into the uptake of Web 2.0 tools in science (more on that later, but Gavin Baker has written about it and our tender document itself is available). The success of the year was the funding that Jean-Claude Bradley obtained from Submeta (as well as support from Aldrich Chemicals and Nature Publishing Group) to support the Open Notebook Science Challenge. I can't take any credit for this, but I think it is a good sign that we may have more luck this coming year.

So for this year – there are some follow-ons, and some new ones:

  1. I will re-write the network application (and will be asking for help) and re-submit it to a UK funder
  2. I will clean up the “Personal View of Open Science” series of blog posts and see if I can get it published as a perspectives article in a high ranking journal
  3. I will get some of those damn papers finished – and decide which ones are never going to be written and give up on them. Papers I have full control over will go by first preference to Gold OA journals.
  4. I will pull together the pieces needed to take action on the ideas that came out of the Southampton Open Science workshop, specifically the idea of a letter, signed by a wide range of scientists and interested people, to a high ranking journal stating the importance of working towards published papers being fully supported by data and methodological detail that is fully available
  5. I will focus on doing fewer things and doing them better – or at least making sure the resources are available to do more of the things I take on…

I think five is enough things to be going on with. Hope you all have a happy new year, whenever it may start, and that it takes you further in the direction you want to go (whether you know what that is now or not) than you thought was possible.

p.s. I noticed in the comments to last year’s post a comment from one Shirley Wu suggesting the idea of running a session at the 2009 Pacific Symposium on Biocomputing – a proposal that resulted in the session we are holding in a few days (again more later on – we hope – streaming video, micro blogging etc). Just thinking about how much has changed in the way such an idea would be raised and explored in the last twelve months is food for thought.

The failure of online communication tools

Coming from me that may sound like a strange title, but while I am very positive about the potential for online tools to improve the way we communicate science, I sometimes despair at the irritating little barriers that constantly prevent us from starting to achieve what we might. Today I had a good example of that.

Currently I am in Sydney, a city where many old, and some not so old, friends live. I am a bit rushed for time so decided the best way to catch up was to propose a date, send out a broadcast message to all the relevant people, and then sort out the minor details of where and exactly when to meet up. Easy, right? After all, tools like Friendfeed and Facebook provide good broadcast functionality. Except of course, as many of these are old friends, they are not on Friendfeed. But that's ok, because many of them are on Facebook. Except some of them are not old friends, or are not people I have yet found on Facebook, but that's ok, they're on Friendfeed, so I just need to send two messages. Oh, except there are some people who aren't on Facebook, so I need to email them – but they don't all know each other so I shouldn't send their email addresses in the clear. That's ok, that's what bcc is for. Oh, but this email address is about five years old…is it still correct?

So I end up sending messages via three independent routes: one via Friendfeed, three via Facebook (one status message, one direct message, and another direct message to the person I had found but not yet friended), and one via email (some unfortunate people got all three – and it turns out they have to do their laundry anyway). It almost came down to trying some old mobile numbers to send out texts. Twitter (which I don't use very much) wouldn't have helped either. But that's not so bad – it only took me ten minutes to cut and paste and get them all sent. They seem to be getting through to people as well, which is good.

Except now I am getting back responses via email, via Facebook, and no doubt at some point via Friendfeed as well. All of which are inaccessible to me when I am out and about anyway, because I'm not prepared to pay the swingeing rates for roaming data.

What should happen is that I have a collection of people; I choose to send them a message, whether private or broadcast, and they choose how to receive that message and how to prioritise it. They then reply to me, and I see all their responses nicely aggregated because they all relate to my one query. As this query was time dependent I would have prioritised responses, so perhaps I would receive them by text or direct to my mobile in some other form. The point is that each person controls the way they receive information from different streams and is in control of the way they deal with it.
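
To make that concrete, here is a minimal sketch, in Python, of the kind of routing layer I mean. The `Contact` and `Broadcast` classes and the placeholder `send_*` functions are all invented for illustration – real services would sit behind them – but the shape is the point: the sender addresses people, each recipient's declared preference decides how the message reaches them, and every reply is aggregated against the original query.

```python
# Hypothetical sketch: per-recipient delivery preferences plus aggregated replies.
from dataclasses import dataclass, field


@dataclass
class Contact:
    name: str
    # Each person declares how they want to receive messages, in order of
    # preference, e.g. ["sms", "email"].
    preferred_channels: list = field(default_factory=list)


def send_sms(contact, text):      # placeholder for a real SMS gateway call
    print(f"SMS to {contact.name}: {text}")


def send_email(contact, text):    # placeholder for a real mail API call
    print(f"Email to {contact.name}: {text}")


SENDERS = {"sms": send_sms, "email": send_email}


@dataclass
class Broadcast:
    text: str
    replies: list = field(default_factory=list)  # replies collect here

    def send(self, contacts):
        for contact in contacts:
            # Deliver on the first channel the *recipient* has asked for,
            # not the one the sender happens to be logged into.
            for channel in contact.preferred_channels:
                if channel in SENDERS:
                    SENDERS[channel](contact, self.text)
                    break

    def reply(self, contact, text):
        # However the reply comes back, it is attached to the original query.
        self.replies.append((contact.name, text))


# One broadcast, per-recipient delivery, all responses in one place.
friends = [Contact("A", ["sms"]), Contact("B", ["email", "sms"])]
message = Broadcast("Drinks on Thursday evening?")
message.send(friends)
message.reply(friends[0], "Count me in")
print(message.replies)
```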

It's not just filter failure that creates the impression of information overload. The tools we are using, their incompatibility, and the cost of transferring items from one stream to another all contribute to the problem. The web is designed to be sticky because the web is designed to sell advertising. Every me-too site wants to hold onto its users and communities, and my community – the specific community that I want to meet up with for a drink – is split across multiple services. I don't have a solution to the business model problem – I just want services with proper APIs that let other people build services that get all of my streams into one place. I hope someone comes up with a business model – but I also have to accept that maybe I just need to pay for it.

The people you meet on the train…

Yesterday on the train I had a most remarkable experience of synchronicity. I had been at the RIN workshop on the costs of scholarly publishing (more on that later) in London and was heading off to Oxford for a group dinner. On the train I was looking for a seat with a desk and took one up opposite a guy with a slightly battered-looking Mac laptop. As I pulled out my new MacBook (13", 2.4 GHz, 4 GB of memory, since you ask) he leaned across to have a good look, as you do, and we struck up a conversation. He asked what I did and I talked a little about being a scientist and my role at work. He was a consultant who worked on systems integration.

At some stage he made a throwaway comment about the fact that he had been going back to learn, or re-learn, some fairly advanced statistics and that he had had a lot of trouble getting access to some academic papers – he certainly didn't want to pay for them, but had managed to find free versions of what he wanted online. I managed to keep my mouth somewhat shut at this point, except to say I had been at a workshop looking at these issues. However it gets better, much better. He was looking into quantitative risk issues and this led into a discussion about how science, and particularly medicine, reporting in the media doesn't provide links back to the original research (which is generally not accessible anyway) and that, what is worse, the original data is usually not available (and this was all unprompted by me, honestly!). To paraphrase his comment: "the trouble with science is that I can't get at the numbers behind the headlines; what is the sample size, how was the trial run…" Well at this point all thought of getting any work done went out the window and we had a great discussion about data availability and the challenges of recording it in the right form (his systems integration work includes efforts to deal with mining large, badly organised data sets), which drifted into identity management and trust networks and was a great deal of fun.

What do I take from this? That there is a demand for this kind of information and data from an educated and knowledgeable public. One of the questions he asked was whether, as a scientist, I ever see much in the way of demand from the public. My response was that, aside from pushing the case for taxpayer access to taxpayer-funded research myself, I hadn't seen much evidence of real demand. His argument was that there is a huge nascent demand from people who haven't yet thought about their need to get into the detail of news stories that affect them. People want the detail, they just have no idea how to go about getting it. Spread the idea that access to that detail is a right and we will see the demand for access to the outputs of research grow rapidly. The idea that "no-one out there is interested or competent to understand the details" is simply not true. The more respect we have for the people who fund our research, the better, frankly.

Recording the fiddly bits of experimental and data analysis work

We are in the slow process of gearing up within my group at RAL to adopt the Chemtools LaBLog system and, in the process, move properly to Open Notebook status. This has taken much longer than I had hoped, but there have been some interesting lessons along the way. Here I want to think a bit about a problem that has been troubling me for a while.

I haven't done a very good job of recording what I've been doing in the time I have spent in the lab over the past couple of months. Anyone who has been following along will have seen small bursts of apparently unrelated activity where nothing much ever seems to come to a conclusion. This has been divided mainly into a) a SANS experiment we did in early November which has now moved into a data analysis phase, b) some preliminary, and thus far fairly unconvincing, experiments attempting to use a very new laser tweezers setup at the Central Laser Facility to measure protein-DNA interactions at the single molecule level, and c) other random odds and sods that have come by. None of these have been very well recorded, for a variety of reasons.

Data analysis, particularly when it uses a variety of specialist software tools, is something I find very challenging to record. A common approach is to take some relatively raw data, run it through some software, and repeat while fiddling with parameters to get a feel for what is going on. Eventually the analysis is run "for real" and the finalised (at least for the moment) structure/number/graph is generated. The temptation is obviously just to formally record the last step, but while this might be ok as a minimum standard if only one person is involved, when more people are working through data sets it makes sense to try and keep track of exactly what has been done and which data has been partially processed in which ways. This helps us quickly track where we are with the process and also reduces the risk of duplicating effort.
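
To make that concrete, here is a rough sketch in Python of the minimum record I have in mind – not anything we actually run, and the tool name and file names are invented. Every trial pass over the data gets appended to a shared log with the input file, a hash of that file, the tool, and the parameters, so anyone in the group can see what has already been done to which data.

```python
# Hypothetical sketch: log every analysis run so partially processed data is traceable.
import hashlib
import json
import time
from pathlib import Path

LOG = Path("analysis_log.json")


def record_run(input_file, tool, params, output_file):
    """Append one analysis step to a shared, human-readable log."""
    entry = {
        "when": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "input": str(input_file),
        # Hash the input so we can tell later exactly which version of the
        # data this run was performed on.
        "input_sha1": hashlib.sha1(Path(input_file).read_bytes()).hexdigest(),
        "tool": tool,
        "params": params,
        "output": str(output_file),
    }
    log = json.loads(LOG.read_text()) if LOG.exists() else []
    log.append(entry)
    LOG.write_text(json.dumps(log, indent=2))
    return entry


# Example (file names and parameters are made up):
# record_run("run42_reduced.dat", "sans_model_fit",
#            {"model": "sphere", "radius_guess": 30}, "run42_fit_v3.dat")
```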

The laser tweezers experiment involves a lot of optimising of buffer conditions, bead loading levels, instrumental parameters, and whatnot. Essentially a lot of fiddling, rapid shifts from one thing to another, and not always being too sure exactly what is going on. We are still at the stage of getting a feel for things rather than stepping through a well ordered experiment. Again the recording tends to be haphazard as you try one thing and then another. We're not even absolutely sure what we should be recording for each "run", or indeed really what a "run" is yet.

The common theme here is "fiddling" and the difficulty of recording it efficiently, accurately, and usefully. What I would prefer to be doing is somehow capturing the important aspects of what we're doing as we do it. What is less clear is the best way to do that. In the case of data analysis we have a good model for how to do this well: good use of repositories, and of versioned scripts for handling data conversions, in the way that Michael Barton in particular has talked about, provides an example of good practice. Unfortunately it is good practice that is almost totally alien to experimental biochemists and is also not easily compatible with a lot of the software we use.

The ideal would be a workbench using a graphical representation of data analysis tools and data repositories that would automatically generate scripts and deposit these, along with versioned data files, into an appropriate repository. This would enable the "docking" of arbitrary web services, software packages, and whatever else, as well as connection to shared data stores. The purpose of the workbench would be to record what is done, although it might also provide some automation tools. In many ways this is what I think of when I look at workflow engines like Taverna and platforms for sharing workflows like MyExperiment.
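
Workflow engines do this properly; the toy sketch below (in Python, with invented file and tool names) is only meant to show the shape of the idea – a single wrapper through which every step is run, so that a replayable script accumulates as a side effect and each output is deposited as a timestamped copy rather than being overwritten by the next round of fiddling.

```python
# Hypothetical sketch: run each analysis step through one wrapper that records it.
import shutil
import subprocess
import time
from pathlib import Path

REPLAY = Path("replay.sh")   # accumulates a script that can re-run the analysis
STORE = Path("data_store")   # keeps timestamped copies of every output


def run_step(command, output_file):
    # Run the actual analysis step (a command-line tool in this sketch).
    subprocess.run(command, shell=True, check=True)
    # Record the command so the whole analysis can be replayed later.
    with REPLAY.open("a") as fh:
        fh.write(command + "\n")
    # Deposit a timestamped copy of the output so earlier versions survive
    # later rounds of parameter fiddling.
    STORE.mkdir(exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    shutil.copy(output_file, STORE / f"{stamp}_{Path(output_file).name}")


# Example (tool and file names are made up):
# run_step("convert_raw --smooth 5 raw.dat > smoothed.dat", "smoothed.dat")
```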

It's harder in the real world. Here the workbench is, well, the workbench, but the idea of recording everything along with contextual metadata is pretty similar. The challenge lies in recording enough different aspects of what is going on to capture the important stuff, without generating a huge quantity of data that can never be searched effectively. It is possible to record multiple video streams, audio, and screencasts of any control computers, but it would be almost impossible to find anything in those data streams.

A challenge that emerges over and over again in laboratory recording is that you always seem not to be recording the thing that you now really need to have. Yet if you record everything you still won't have it, because you won't be able to find it. Video, image, and audio search will one day make a huge difference here, but in the meantime I think we're just going to keep muddling on.

Quick update from International Digital Curation Conference

Just a quick note from the IDCC, given I was introduced as "one of those people who are probably blogging the conference". I spoke this morning, giving a talk on Radical Sharing – Transforming Science? A version of the slides is available on Slideshare. It seemed to go reasonably well and I got some positive comments. The highlight for me today was John Wilbanks speaking this evening – John always gives a great talk (slides will also be on his Slideshare at some point) and I invariably learn something. Today that was the importance of distinguishing between citation (which is a term from the scholarly community) and attribution (which is a term with specific legal meaning in copyright law). I had used the two interchangeably in my talk (no recording unfortunately); John made the point that it is important to distinguish the two practices, particularly the reasons that motivate them and the different enforcement frameworks.

Interesting talks this afternoon on costing for digital curation – not something I have spent a lot of time thinking about, but clearly something that is rather important. Also this morning there were talks on CARMEN and iPLANT, projects that are delivering infrastructure for sharing and re-using data. Tonight we are off to Edinburgh Castle for the dinner, which should be fun, and tomorrow I make an early getaway to get to more meetings.

Links as the source code of our thinking – Tim O’Reilly

I just wanted to point to a post that Tim O'Reilly wrote just before the US election a few weeks back. There was an interesting discussion about the rights and wrongs of him posting his political views, and of that being linked to from the O'Reilly Media front page. In amongst the abuse that you have come to expect in public political discussions there is some thought-provoking stuff. But what I wanted to point out, and hopefully revive a discussion of, is a point he makes right near the bottom.

[I have conflated two levels of comments here (Tim is quoting his own comment) – see the original post for the context]

“Thanks to everyone for wading in, especially those of you who are marshalling reasoned arguments and sharing actual sources and references, showing you’ve done your homework, and helping other people to see the data that helped to shape your point of view. We need a LOT more of that in this discussion, rather than slinging unsupported allegations back and forth.

Bringing this back to tech – showing the data behind your argument is a lot like open source. It’s a way of verifying the “code” that’s inside your head. If you can’t show us your code, it’s a lot harder to trust your results!”

Links as source code for your thinking: that’s a meme that should survive the particulars of this particular debate!

In a sense Tim is advocating the wholesale adoption of the very strong attribution culture we (like to think we) have in academic research. The importance of acknowledging your sources is clear, but it also has much more value than that. By tracing back the influences that have brought someone to a specific conclusion or belief, it is possible for other people to gain a much deeper insight into how those ideas evolved. Being able to parse the dependencies between ideas, data, samples, papers, and knowledge in an automatic, machine-readable way is the promise of the semantic web, but in the meantime just helping the poor old humans to trace back and understand where someone is coming from is very helpful.

More on “theft” and the problem of identity

Following my, hopefully getting towards three-quarters baked, post there have been more helpful comments and discussion both here and on Friendfeed. I wanted to pick out a specific issue that has come up in both places. On Friendfeed the discussion ran into the question of plagiarism more generally and why it is bad. Anders Norgaard made the point that plagiarism is bad regardless of whether or not it breaks rules, and a discussion of why that is so followed. I think the conclusion we came to is that plagiarism reduces value by making it more difficult to find the right person with the right expertise when you need something done. It reduces the value of the body of work in helping you find the person who can do the job that you need doing.

David Crotty, in a comment on the blog post, raises a point that I think probes the same issues:

Do you mind if I start a blog called "Science in the open" and pretend that my name is "Cameron Neylon" and then fill that blog with dreadful, hateful nonsense? After all, your name and your blog's name aren't limited physical resources, right? Does ownership extend to your online identity? Isn't using someone else's logo a misrepresentation of identity?

Now this is important for two reasons, firstly because it probes the extreme end of my argument that “objects that can be infinitely copied should not be treated as property” and also because it revolves around the issue of identity. Reliable identity lies at the core of building the trust networks that make social web tools work. Does that mean it is one area where the full weight of property based law should be brought to bear? So I think this is worth unpicking in detail.

So, let's start with the honest answer. If this happened I would be angry and upset. I would be likely to storm around the office/house a bit and possibly rant at people and objects that were unfortunate enough to cross my path. But after calming down a bit, I hope I would follow something like the following course.

  1. Write the person a polite note explaining that they seem to have both the same name and the same blog name as me, and that this is probably bad for both of us as there is the potential for confusion. Ambiguity is bad because it reduces trust in attribution. As I used these names first, I would ask them to consider changing. I would assume it was a simple coincidence, a mistake made in good faith.
  2. If they did not, I would dissociate myself publicly from their work, making a clear statement about where my work could be found. I would consider changing the name of my blog (after all, it is the feed that people follow – does anyone care that much what it is called?), but not my name.
  3. If it was clear that this was a case of deliberate misrepresentation, I would present the evidence that this was the case and request the help of the community to make that very publicly clear.

My case is that allowing the free re-use of my name and my blog name ought to add value on average. Indeed my experience thus far is that allowing people to use these names, to point to me and the work I have generated, has been net positive. I've never objected to people quoting me, using my name, reproducing blog posts, or whatever. Whether it's "fair use", or a copyright violation, or appropriately licensed re-use is irrelevant. It's all good because it brings more interested people to my blog and to me. One negative experience would probably not actually tip that balance. Several nasty ones might.

The key here is that the real resource is me. I am not infinitely replicable, no-one else can write my posts. The name is just a pointer. An important pointer and one which I will defend, in as much as I will try to make clear what I think and why I think it, as well as to be clear about who I am and why I say what I do. Someone who plagiarizes my work or reproduces it without attribution or someone who deliberately misrepresents what I write reduces the value of my work because they reduce the ability of people who are looking for someone with my expertise to use that work to find me.

But it is not the reproduction of work that is the problem here, it is the misrepresentation of its origin, either by an author falsely claiming it as theirs, or by someone mis-attributing someone else's work or views to me. The problem is not the act of copying but the act of lying. The problem with lying is that it reduces trust; the problem with reducing trust is that it reduces the value of the networks we use to find things that are useful and the people who have the expertise to make them. Identity is crucial to trust, and trust is what adds value to networks. Very few things reduce the value of web-based networks more effectively than lying about identity.

We will never build a perfect system that solves this problem. My belief though is that it will be more effective to build strong social and technical systems rather than to apply the rules of “ownership” to my name. Do I own my name? No idea. Will I defend my name and ask others to help me do that if someone attacks it? Yes. Will I use the best technical systems to try to be clear about who I am in all the places where I act? Well I could do better on this, but then a lot of us could really. I will work to build trust in my name, in my brand if you like, and if that trust is attacked I will defend it.

So where does this leave the story of Ricardo’s logo? Well the first point was the plagiarism of the image. This breaks the link between the image and the author which reduces its use to Ricardo. The lack of attribution means that people who think “what a cool logo” will not be able to find Ricardo to do them a cool logo of their own. But it is not the copying per se which does the damage but the plagiarism, the lack of attribution. Arguably, as the community leaps to Ricardo’s defence (and points out what a cool logo it is) he actually benefits from a raised profile across a wider community. I had seen a few examples of his work before but hadn’t realised how many he had done and how good they are. Ricardo pointed out in the original Friendfeed thread that the reason the image was copyright was that he was making a living at the time from design. It is not inconceivable he may be better placed to do that now than he was before the logo was misappropriated. That is for Ricardo to decide though, not me.

Does the use of the logo by a company selling hokum misrepresent Ricardo? Well, given they didn't attribute it to him, not directly. But let's imagine that the image was CC-BY and that the company did attribute it. Arguably Ricardo would not want to be associated with that, and that would be fair enough, but there wouldn't be anything he could do about it from a legal perspective. Because the image is actually copyright, all rights reserved, he can prevent these kinds of re-use. Or he can, at least in principle. He retains control in a way that CC-BY licenses do not allow. My argument is that legally defending this position would take much more money and energy than clearly and publicly distancing yourself from the re-use of the work, and probably wouldn't be much more effective. Furthermore, my argument is that the good that comes from allowing re-use outweighs the bad. The re-use of your work actually gives you a platform to distance yourself from that re-use if you so choose. Once that is made clear it is just more good publicity for you.

More importantly, if you believe, as I do, in the value of allowing re-use, then you cannot reasonably pick and choose which re-uses, and by whom, are appropriate. Consistency requires that you allow re-use you disagree with as well as re-use you approve of. I may not approve of a particular re-use, and it is perfectly reasonable to say so, but that gives me no right to prevent it. To mis-quote Hall channelling Voltaire: "I disagree with the way you have re-used my work, but I will defend your right to do so and the value you add by doing it" – and no, I will not defend it to the death. I don't take it that seriously…