PSB Session – Final call for support

If you’ve been following the discussion here and over at the One Big Lab Blog you will know that tomorrow is the deadline for submission of proposals for sessions at PSB. Shirley Wu has done a great job of putting together the proposal text, and the more support we can get from members of the community the better. Whether you can send us an email saying you would like to be there, will commit to being there, or (most preferred) can commit to submitting a presentation, it will all help. Don’t get too hung up on the idea that talks have to be ‘research’. They can include descriptions of tool development (including services), quantitative studies of the impact of open practices, or simply a description or review of a particular experience or process.

The proposal text is visible at: http://docs.google.com/Doc?id=dv4t5rx_33fpxx9pw5 and if you want editing access just holler or leave some comments here or at One Big Lab.

Connecting Open Notebook data to a peer reviewed paper

One thing that we have been thinking about a bit recently is how best to link elements of a peer reviewed paper back to an Open Notebook. This raises a number of issues, both technical and philosophical, about how and why we might do it. Our first motivation is to provide access to the raw data if people want it. The dream here is that by clicking on a graph you are taken straight through to the processed data, from which you can backtrack to the raw data itself. This is clearly some way off.

Other, simpler solutions are to provide a hyperlink back to the open notebook, or to an index page that describes the experiment, how it was done, and how the data was processed. Publishers are always going to have an issue with this because they can’t rely on the stability of external material. So the other solution is to package up a version of the notebook and provide it as supplementary information. This could still provide links back to the ‘real’ notebook but offers additional stability and also protects the data against disaster by duplicating it.

The problem with this is that many journals will only accept PDF. While we can process a notebook into a package wrapped up as a PDF, this has a lot of limitations, particularly when it comes to data scraping, which after all we want to encourage. An encouraging development was recently described on the BioMed Central blog, where they describe the capability of uploading a ‘mini website’ as supplementary information. This is great, as we can build a static version of our notebook with lots of lovely rich metadata built in. We can point out to the original notebook and point back in from the original notebook to the paper. I am supposed to be working on a paper at the moment and was considering where to send it. I hope we can give BMC Biotechnology, or perhaps BMC Research Notes, a go to test this out.
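
To make the ‘mini website’ route a little more concrete, here is a minimal sketch of how a static snapshot of a notebook might be assembled for upload as supplementary information. This is not from the original post: the notebook URLs, output layout, and banner text are hypothetical placeholders, and a real bundle would also need to pull in images and data files rather than just the HTML pages.

    # Minimal sketch: snapshot open notebook pages into a static 'mini website'
    # suitable for supplementary information. URLs and layout are hypothetical.
    from pathlib import Path
    from urllib.request import urlopen

    NOTEBOOK_PAGES = [
        "https://example-notebook.org/experiment-1",       # placeholder pages
        "https://example-notebook.org/experiment-1/data",
    ]
    OUTPUT_DIR = Path("supplementary_site")

    def snapshot(url: str, out_dir: Path) -> Path:
        """Save one notebook page locally, prepending a banner that links back
        to the live, evolving version of the page."""
        html = urlopen(url).read().decode("utf-8", errors="replace")
        banner = ('<p><em>Static snapshot for supplementary information; the live '
                  f'notebook page is at <a href="{url}">{url}</a>.</em></p>\n')
        name = url.rstrip("/").rsplit("/", 1)[-1] or "index"
        out_file = out_dir / f"{name}.html"
        out_file.write_text(banner + html, encoding="utf-8")
        return out_file

    if __name__ == "__main__":
        OUTPUT_DIR.mkdir(exist_ok=True)
        links = []
        for page in NOTEBOOK_PAGES:
            saved = snapshot(page, OUTPUT_DIR)
            links.append(f'<li><a href="{saved.name}">{page}</a></li>')
        # A simple index page acts as the entry point of the mini website
        (OUTPUT_DIR / "index.html").write_text(
            "<html><body><h1>Notebook snapshot</h1><ul>\n"
            + "\n".join(links) + "\n</ul></body></html>", encoding="utf-8")

The point of the exercise is that the snapshot stays readable and scrapable as plain HTML, while every page carries a link back to the live notebook so readers can follow the trail through to the raw data.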

Reflections from a parallel universe

On Wednesday and Thursday this week I was lucky to be able to attend a conference on Electronic Laboratory Notebooks run by an organisation called SMI. Lucky because the registration fee was £1500 and I got a free ticket. Clearly this was not a conference aimed at academics. It was a discussion of the capabilities and implications of Electronic Laboratory Notebooks used in industry, and primarily in big pharma.

For me it was very interesting to see these commercial packages. I am often asked how what we do compares to them, and I have always had to answer that I simply don’t know; I’ve never had the chance to look at one because they are way too expensive. Having now seen them I can say that they have very impressive user interfaces with lots of integrated tools and widgets. They are fundamentally built around specific disciplines, which allows them to be reasonably structured in their presentation and organisation. I think we would break them in our academic research setting, although it might take a while. More importantly, we wouldn’t be able to afford the customisation that it looks as though you need to get a product that does just what you want. Deployment costs of around £10,000 per person were being bandied around, with total contract costs clearly in the millions of dollars.

Coming out of various recent discussions I would say that the overall software design of these products is flawed going forward. The vendors are being paid a lot by companies who want things integrated into their own systems, so there is no motivation for them to develop open platforms with data portability and easy integration of web services. All of these systems run as thick clients against a central database. Going forward they need to move into web portals as a first step, before working towards a fully customisable interface with easily collectable widgets to enable end-user configured integration.

But these were far from the most interesting things at the meeting. We commonly assume that keeping, preserving, and indexing data is a clear good, and indeed many of the attendees were assuming the same thing. Then we got a talk on ‘Compliance and ELNs’ by Simon Coles of Amphora Research Systems. The talk can be found here. It included an example of just how bizarre the legal process for patent protection can make industrial practice. In preparing for a patent suit you will need to pay your lawyers to go through all the relevant data and paperwork. Indeed, if you lose you will probably pay for the opposition’s lawyers to go through all the relevant paperwork as well. These are not just lawyers, they are expensive lawyers. If you have a whole pile of raw data floating around, this is not just going to give the lawyers a field day finding something to pin you to the wall on, it is going to burn through money like nobody’s business. The simple conclusion: it is far cheaper to re-do the experiment than to risk the need for lawyers to go through raw data. Throw the raw data away as soon as you can afford to! Like I said, a parallel universe, where you think things are normal until they suddenly go sideways on you.

On a more positive note there were some interesting talks on big companies deploying ELNs. At some level we can look at this as a model of a community adopting open notebooks. At least within the company (in most cases) everyone can see everyone else’s notebook. A number of speakers mentioned that this had caused problems, and a couple said that it had been necessary to develop and promulgate standards of behaviour. This is interesting in the light of the recent controversy over the naming of a new dinosaur (see commentary at Blog around the Clock) and Shirley Wu’s post on One Big Lab. It reinforces the need for generally accepted standards of behaviour and their growing importance as data becomes more open.

The rules? The first two came from the talk, the rest are my suggestions. Basically they boil down to ‘Be Polite’.

  1. Always ask before using someone else’s data or results
  2. User beware: if you rely on someone else’s results, it’s your problem if it blows up in your face (especially if you didn’t ask them about it)
  3. If someone asks whether they can use your data or results, say yes. If you don’t want them to, give them a clear timeline on which they can, or specific reasons why you can’t release the data. Give clear warnings about any caveats or concerns
  4. If someone asks you not to use their results (whether or not they are helpful or reasonable about it) think very carefully about whether you should ignore their request. If having done this you still feel you are being reasonable in using them, then think again.
  5. Any data that has not been submitted for peer review after 18 months is fair game
  6. If you incorporate someone else’s data within a paper, discuss your results with them. Then include them as an author.
  7. Always, without fail and under all circumstances, acknowledge any source of information, and do so generously and without conditions.

OPEN Network proposal – referees comments are in

So we have received the referees’ comments on the network proposal and, after a bit of a delay, I have received permission to make them public. You can find a PDF of the referees’ comments here. I have started to draft a reply, which is published on Google Docs. I have given a number of people access, but if you are feeling left out and would like to contribute just drop me a line.

These are broadly pretty critical comments, and our chance of getting this funded is not looking at all good on this basis. For those who have not written grant proposals before, or not dealt with these types of criticisms, there is an object lesson here. Many of the criticisms relate to assumptions the referees have made about how a UK Network Proposal should be written and what it should do. It is always a good idea to identify precisely what the expectations are. In this case I simply didn’t have time to do this.

However, there are some good aspects to this. Many of the critical comments made by the referees are contradicted by other referees (too many meetings, not enough meetings). A couple arise from misunderstandings, or perhaps a lack of clarity in the proposal. The key thing is to answer the criticisms on how we expand the network, while at the same time explaining that until we have achieved this the network doesn’t really exist in its ideal form. Also, we are asking for relatively little money, so once all the big networks get their slice there may be relatively few proposals left small enough to pick up the scraps, as it were.

The reply is due back on Monday (UK time) and I will gratefully receive any assistance in getting this response honed to a fine point. I would also point you in the direction of Shirley Wu’s draft of the proposal for a PSB session which is due at the end of next week. We know this collaborative process can work and we also know it has weaknesses and disconnects. If we can use the good part to convince funding agencies that we need to sort out the weaknesses I think that would be a great step forward.

An abstract for the International Meeting on E-social Sciences

I have said before that I think we could benefit from the involvement of social scientists in understanding the possible cultural issues involved in the move towards more open practices. To this end we are submitting an abstract to the 4th International Meeting on e-social Science to present a ‘short paper’. I’ve put the abstract below; the deadline is next Monday (4th February). Let me know if you have any comments and/or would like to be included as an author on the paper. I am a bit pressed for time this week and Google services seem slow this morning, so I will probably stick to using comments from here rather than using Google Docs. Any/all comments welcome.

The Effect of Network Size and Connectivity on Open Notebook Approaches to Scientific Research: The view from the inside

Cameron Neylon with contributions from the Open Notebook Science Collective

A small but growing group of researchers in the physical and biological sciences are interested in developing and applying open approaches to their research practice. The logical extreme of this approach is ‘Open Notebook Science’, a term coined by Jean-Claude Bradley to refer to the practice of making the raw data from an experimental laboratory available as soon as practicable after it is generated. The promise of such open approaches is that loose coalitions of scientists can aggregate around specific problems according to interest, expertise, and resource availability, and that such an approach can allow significantly more rapid solutions to problems to be developed. Specific recent examples of such approaches include the aggregation of a group of significant size to rapidly (five days) prepare a full scale grant application; attempts, successful and unsuccessful, to identify collaborators able to provide specific experimental capabilities required to complete experimental work; and requests for experts to examine specific chemical datasets to identify potential errors. We will describe the experience of these different examples from the inside, as well as the tools and resources used, their usefulness and their limitations. The key observation is that successful application of these approaches requires a critical mass of interested scientists with sufficient time to provide a large enough pool of resources to solve the problem, and that the network be sufficiently well connected for requests to be routed to those best suited to help. In most cases the records of these efforts are fully publicly available and may provide useful data for social science research in this area.

Sharing is caring…and not sharing can be reprehensible

Sometimes you read things that just make you angry. I’m not sure I can add much to this eloquent article written by Andrew Vickers in the New York Times (via Neil Saunders and the 23andme blog).

Shirley Wu has recently written on the fears and issues of being scooped and whether this is field dependent or not. Her discussion, and the NYT article, seem to suggest that these fears are greatest in precisely those disciplines where sharing could lead to advances with direct implications for people, their survival, and their quality of life.

I, to be honest, have been getting more and more depressed about the fact that this keeps coming back as the focus of any discussion about Open Notebooks or Open Science. Why is the assumption that by sharing we are going to be cheated? Surely we should be debating the balance between benefits and risks, and how this compares to the balance of benefits and risks in not being open. Particularly when those risks relate to people’s chance of survival.

Picture this…

There has been a bit of discussion recently about identifying and promoting ‘wins’ for Open Science and Open Notebook Science. I was particularly struck by a comment made by Hemai Parthasarathy at the ScienceBlogging Meeting that she wasn’t aware of any really good examples that illustrate the power of open approaches. I think sometimes we miss the most powerful examples right under our noses because they are such a familiar part of the landscape that we have forgotten they are there. So let us imagine two alternate histories; I have to admit I am very ignorant of the actual history of these resources, but I am not sure that matters in making my point.

History the first…

In the second half of the twentieth century scientists developed methods for sequencing proteins and DNA. Not long after this, decades of hard work on developing methods for macromolecular structure determination started to bear fruit and the science of protein crystallography was born. There was a great feeling that, in understanding the molecular detail of biological systems, disease was a beatable problem; that it was simply a matter of understanding the systems to know how to treat any disease. Scientists, their funders, pharmaceutical companies, and publishers could see this was an important area for development, both in terms of the science and in terms of significant commercial potential.

There was huge excitement, and a wide range of proprietary databases containing this information proliferated. Later there came suggestions that the NIH and EMBL should fund public databases with mandated deposition of data, but a broad coalition of scientists, pharmaceutical companies, and publishers objected, saying that this would hamper their ability to exploit their research effort and would reduce their ability to turn research into new drugs. Besides, the publishers said, all the important information is in the papers… By the mid-noughties a small group of scientists calling themselves ‘bioinformaticians’ started to appear and began to look at the evolution of genetic sequences using those pieces of information they could legally scrape from the, now electronically available, published literature. One scientist was threatened with legal action for taking seven short DNA sequences from a published paper…

Imagine a world with no GenBank, no PDB, no SwissProt, and none of the culture, growing out of these, of publicly funded, freely available databases of biological information like Brenda, KEGG, and so on. Would we still be living in the 90s, the 80s, or even the 70s compared to where we have got to?

History the second…

In the second half of the twentieth century synthetic organic chemistry went through an enormous technical revolution. The availability of modern NMR and mass spectrometry radically changed the whole approach to synthesis. Previously the challenging problem had been figuring out what it was you had made; careful degradation, analysis, and induction were required to understand what a synthetic procedure had generated. NMR and MS made this part of the process much easier, shifting the problem to developing new synthetic methodology. Organic chemistry experienced a flowering as creative scientists flocked to develop new approaches that might bear their names if they were lucky.

There was tremendous excitement as people realised that virtually any molecule could be made, if only the methodology could be figured out. Diseases could be expected to fall as synthetic methodology was developed to match the advances in biological understanding. The new biological databases were providing huge quantities of information that could aid in the targeting of synthetic approaches. However, it was clear that quality control was critical and that sharing of quality control data was going to make a huge difference to the rate of advance. So many new compounds were being generated that it was impossible for anyone to check the quality and accuracy of characterisation data. So, in the early 80s, taking inspiration from the biological community, a coalition of scientists, publishers, government funders, and pharmaceutical companies developed public databases of chemical characterisation data with mandatory deposition policies for any published work. Agreed data formats were a problem, but relatively simple solutions were found fast enough to solve them.

The availability of this data kick-started the development of a ‘chemoinformatics’ community in the mid 80s, leading to sophisticated prediction tools that aided synthetic chemists in identifying and optimising new methodology. By 1990, large natural products were falling to the synthetic chemists with such regularity that new academics moved into developing radically different methodologies targeted at entirely new classes of molecules. New databases containing information on the activity of compounds as substrates, inhibitors, and activators (with mandatory deposition policies for published data) provided the underlying validation datasets that meant that by the mid 90s structure-based drug discovery was a solved problem. By the late 90s the available chemoinformatic tools made the development of tools for identifying test sets of small molecules to selectively target any biological process relatively straightforward.

Ok, possibly a little utopian, but my point is this. Imagine how far behind we would be without GenBank and the PDB, and without the culture of publicly available databases that these embedded in the biological sciences. And now imagine how much further ahead chemical biology, organic synthesis, and drug discovery might have been with an NMRBank, an Inhibitor Data Bank…

More on the PSB proposal

Shirley Wu has followed up on her original proposal to submit a session proposal for PSB. She asks a series of important questions about going forward on this, and I thought I would reply to them here to widen exposure.

I think it is worth going for a session and I am happy to lead the application, but there may well be better people (Jean-Claude, Antony Williams, Peter Murray-Rust, Egon Willighagen) to lead it, depending on focus. I think the important question to ask is whether we can generate enough research papers to justify a session. I believe we can and should, and I will commit to generating one if we go ahead, but I think we need at least another 3-4 to go ahead.

So, to answer Shirley’s questions:

1. What should be the focus of this session on Open Science? (first, frame it as a traditional PSB session, then perhaps as a “creative” session)
2. What kind of substantial/technical/research papers can be written about Open Science?
3. Who are the major players in the field? Who would the session chair invite to submit a paper?
4. Who is willing to help write/organize the actual proposal and session?

Given that it is a computing symposium I would say that it should focus on tools and standards and how they affect what we can do, or would like to do. This also gives us a chance to provide research-type papers describing such tools and standards and investigating their implementation. So we could write papers describing different implementations of Open Notebooks and critical analyses of the differences, the organisation of Open Data, standards for describing data, and social and cultural aspects of what is happening, etc.

People to invite to write papers include Jean-Claude Bradley, the OpenWetWare group, Egon, Peter MR, Deepak Singh (willing to write a review/scoping type paper?), Antony Williams (ChemSpider), the Simile group (www.simile.mit.edu), other repository and data archival groups, Nature Publishing/PLoS/PMC/UK-PMC to describe systems, Heather Piwowar to analyse what happens, and the social science groups that are becoming interested in what is going on.

Finally, as I say, I am willing to help, but as you can see time becomes a constraint for me and things have a habit of getting left to the last minute. If anyone else would like to step in to lead then I am more than happy to be a co-chair. If no one else is available I am happy to lead. I at least have the advantage that I can probably source the resources to get there!

I am going to tag this “Open Science PSB09” if that seems a good tag to aggregate around.

Open Science Session at PSB 2009?

Shirley Wu from Stanford left a comment on my New Year’s Resolutions post suggesting the possibility of a session on Open Science at the PSB meeting in Hawaii in 2009, which I wanted to bring to the front for people’s attention.

[…] Since you mentioned organizing an international meeting on the subject and publicizing open science, I’m curious what your thoughts (and anyone else’s who reads this!) would be on participating in a session on Open Science at the Pacific Symposium on Biocomputing at PSB. They don’t traditionally cover non-primary research/methods tracks, but they do pride themselves on being at the cutting edge of biology and biocomputing, so I am hoping they will be amenable to the idea. If there was support from, shall we say, the founders of this movement, I think it would help a great deal towards making it happen. […]

She also has a post on her new blog One Big Lab where she fleshes out the idea in a bit more detail and which is probably the best place to continue the discussion.

Hi Shirley! Great to have more people out there blogging and commenting. I am not sure whether I really qualify as a ‘founder of the movement’. I know things are moving fast, but I don’t think having been around for nine months or so makes me that venerable!

This sounds broadly like a good idea to me. I was considering trying to organise a meeting in the UK towards October-November this year, but the timelines are tight and really dependent on money coming through. I would be happy to push back to January 2009 in Hawaii if people felt this was a good idea; if the grant comes through we could use this as the first annual meeting. My only concern is that Hawaii probably increases average costs, as more people have to travel further and book accommodation than if it were in either Western Europe or on the East Coast of the US. The other issue is how and whether to focus such a session. I also don’t see a problem with having two meetings ~6 months apart. What do people think?