Policy and technology for e-science – A forum on open science policy

I’m in Barcelona at a satellite meeting of the EuroScience Open Forum organised by Science Commons and a number of their partners. Today is when most of the meeting takes place, with forums on ‘Open Access Today’, ‘Moving OA to the Scientific Enterprise: Data, materials, software’, ‘Open access in the knowledge network’, and ‘Open society, open science: Principle and lessons from OA’. There is also a keynote from Carlos Morais-Pires of the European Commission, and the lineup for the panels is very impressive.

Last night was an introduction and social kickoff as well. James Boyle (Duke Law School, chair of the board of directors of Creative Commons, founder of Science Commons) gave a wonderful talk (40 minutes, no slides, barely taking breath) whose central theme was the relationship between where we are today with open science and where international computer networks were in 1992. He likened making the case for open science today to that of people suggesting in 1992 that the networks would benefit from being made freely accessible, freely useable, and based on open standards. The fears that people have today, of good information being lost in a deluge of dross, of there being large quantities of nonsense, and nonsense from people with an agenda, can to a certain extent be balanced against the idea that, to put it crudely, Google works. As James put it (not quite a direct quote): ‘You need to reconcile two statements, both true. 1) 99% of all material on the web is incorrect, badly written, and partial. 2) You probably haven’t opened an encyclopedia as a reference in ten years.’

James gave two further examples. One was the availability of legal data in the US: despite the fact that none of it is copyrightable in the US, there are thriving businesses based on it. The second, which I found compelling for reasons that Peter Murray-Rust has described in some detail, concerns weather data, which in the US is free. In a recent attempt to get long-term weather data, a research effort was charged on the order of $1500, the cost of the DVDs needed to ship the data, for all existing US weather data. By comparison, a single German state wanted millions for theirs. The consequence was that the European data didn’t go into the modelling. James made the point that while the European return on investment for weather data was a respectable nine-fold, that for the US (where, remember, they are giving it away) was 32-fold. To me, though, the really compelling part of this argument is that if such data is not made available we run the risk of being underwater in twenty years with nothing to eat. This particular case is not about money; it is potentially about survival.

Finally – and you will not be surprised that this was the bit I liked most – he went on to issue a call to arms to get on and start building this thing that we might call the data commons. The time has come to actually sit down and take these things forward: to start solving the issues of reward structures, of identifying business models, and to build the tools and standards to make this happen. That, he said, was the job for today. I am looking forward to it.

I will attempt to do some updates via twitter/friendfeed (cameronneylon on both) but I don’t know how well that will work. I don’t have a roaming data tariff and the charges in Europe are a killer so it may be a bit sparse.

Defining error rates in the Illumina sequence: A useful and feasible open project?

Panorama image of the EBI (left) and Sulston Laboratories (right) of the Sanger Institute on the Genome campus in Cambridgeshire, England.

Regular readers will know I am a great believer in the potential of Web2.0 tools to enable the rapid aggregation of loose networks of collaborators to solve a particular problem, and in the possibilities of using this approach to do science better, faster, and more efficiently. The reason we haven’t had great successes thus far comes down fundamentally to the size of the network we have in place and the bias of that network’s expertise towards specific areas. There is a strong bioinformatics/IT bias among the people interested in these tools, and this plays out in a number of ways, from the people on Friendfeed to the relative frequency of commenting on PLoS Computational Biology versus PLoS ONE.

Putting these two together one obvious solution is to find a problem that is well suited to the people who are around, may be of interest to them, and is also quite useful to solve. I think I may have found such a problem.

The Illumina next-generation sequencing platform, originally developed by Solexa, is the newest of the systems to have reached the market. I spent a good part of today talking about how the analysis pipeline for this system could be improved. One thing that came out as an issue is that no-one seems to have published a detailed analysis of the types of errors generated experimentally by this system. Illumina have probably done this analysis in some form but have better things to do than write it up.

The Solexa system is based on sequencing by synthesis. A population of DNA molecules, all amplified from the same single molecule, is immobilised on a surface. A new strand of DNA is built up, one base at a time. In the Solexa system each base carries a different fluorescent marker plus a blocking reagent. After the base is added, and the colour read, the blocker is removed and the next base can be added. More details can be found on the genographia wiki. There are two major sources of error here. Firstly, for a proportion of the sample, the base is not added successfully; in the next round, that part of the sample then generates a readout for the previous base. Secondly, the blocker may fail, leading to the addition of two bases in one cycle, causing a similar problem but in reverse. As the cycles proceed, the ends of the DNA strands in the sample get increasingly out of phase, making it harder and harder to tell which is the correct signal.
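To make the phasing problem concrete, here is a minimal sketch (in Python, with purely illustrative error probabilities, not measured values) of how a cluster drifts out of phase as cycles proceed:

```python
# Sketch: how phasing/prephasing errors smear the signal over cycles.
# p_lag: probability a strand fails to incorporate a base this cycle (phasing)
# p_lead: probability the blocker fails and a strand adds two bases (prephasing)
# The numbers below are illustrative, not measured values.

def phase_distribution(n_cycles, p_lag=0.01, p_lead=0.005):
    """Return, for each cycle, the fraction of strands at each position.

    Each entry of the returned list maps position -> fraction of the
    cluster whose synthesis has reached that position after that cycle.
    """
    dist = {0: 1.0}  # all strands start in phase at position 0
    history = []
    for _ in range(n_cycles):
        new = {}
        for pos, frac in dist.items():
            new[pos] = new.get(pos, 0.0) + frac * p_lag            # no base added
            new[pos + 1] = new.get(pos + 1, 0.0) + frac * (1 - p_lag - p_lead)
            new[pos + 2] = new.get(pos + 2, 0.0) + frac * p_lead   # two bases added
        dist = new
        history.append(dist)
    return history

history = phase_distribution(35)
# Fraction of strands still perfectly in phase after each cycle:
in_phase = [h.get(i + 1, 0.0) for i, h in enumerate(history)]
print(in_phase[0], in_phase[-1])
```

Even with per-cycle error rates of 1% and 0.5%, only around 60% of strands remain perfectly in phase after 35 cycles, which is why later base calls get progressively noisier.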

These error rates probably depend both on the identity of the base being added and on the identity of the previous base. They may also depend on the number of cycles that have been carried out. There is also the possibility that the sample DNA contains errors introduced by the amplification process, though these are likely to be close to insignificant. However, no data on these error rates is available. Simple, you might think, to get some of the raw data and do the analysis: fit the sequence of raw intensity data to a model in which the parameters are error rates for each base.

Well, we know that the availability of data makes re-processing possible, and we further believe in the power of the social network. I know that a lot of you are good at this kind of analysis and might be interested in having a play with some of the raw data. It could also make a good paper – Nature Biotech/Nature Methods perhaps – and I am prepared to bet it would get an interesting editorial writeup on the process as well. I don’t really have the skills to do the work, but if others out there are interested then I am happy to coordinate. This could all be done in the wild, out in the open, and I think that would be a brilliant demonstration of the possibilities.

Oh, the data? We’ve got access to the raw and corrected spot intensities and the base calls from a single ‘tile’ of the phiX174 control lane for a run from the 1000 Genomes Project which can be found at http://sgenomics.org/phix174.tar.gz courtesy of Nava Whiteford from the Sanger Centre. If you’re interested in the final product you can see some of the final read data being produced here.

What I had in mind was taking the called sequence and aligning it onto phiX174 so we know the ‘true’ sequence, then using that sequence plus a model with error rates to parameterise those error rates. Perhaps there is a better way to approach the problem? There is a series of relatively simple error models that could be tried, and if the error rates can be defined then it will enable a really significant increase in both the quality and quantity of data that these machines can produce. I figure splitting the job up into a few small groups working on different models, putting the whole thing up on Google Code with a wiki there to coordinate and capture other issues as we go forward. Anybody up for it (and got the time)?
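As a sketch of the simplest version of this analysis, the following (a hypothetical helper, not part of any existing pipeline) tabulates miscall rates conditioned on the true base and its predecessor, given reads already aligned to the reference:

```python
from collections import defaultdict

def error_rates(alignments):
    """Tabulate miscall rates by (previous true base, true base).

    alignments: iterable of (called, truth) string pairs of equal length,
    already aligned to the phiX174 reference (gaps ignored for simplicity).
    """
    counts = defaultdict(lambda: [0, 0])  # (prev, base) -> [errors, total]
    for called, truth in alignments:
        for i in range(1, len(truth)):
            key = (truth[i - 1], truth[i])
            counts[key][1] += 1
            if called[i] != truth[i]:
                counts[key][0] += 1
    return {k: err / tot for k, (err, tot) in counts.items() if tot}

# Toy illustration with made-up reads; real input would come from the
# phiX174 control-lane data linked above.
aln = [("ACGTACGT", "ACGTACGT"), ("ACGAACGT", "ACGTACGT")]
rates = error_rates(aln)
print(rates[("G", "T")])
```

A cycle-dependence term could be added by also keying on position, and the same counting scheme extends to a likelihood model fitted to the raw intensities rather than the base calls.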

Avoid the pain and embarrassment – make all the raw data available

A story of two major retractions from a well known research group has been getting a lot of play over the last few days, with a News Feature (1) and Editorial (2) in the 15 May edition of Nature. The story turns on the claim that Homme Hellinga’s group was able to convert the E. coli ribose binding protein into a triose phosphate isomerase (TIM) using a computational design strategy. Two papers on the work appeared, one in Science (3) and one in J Mol Biol (4). However another group, having obtained plasmids for the designed enzymes, could not reproduce the claimed activity. After many months of work that group established that the supposed activity appeared to be that of the bacterium’s native TIM and not that of the designed enzyme. The papers were retracted, and Hellinga went on to accuse the graduate student who did the work of fabricating the results, a charge of which she was completely cleared.

Much of the heat the story is generating concerns the characters involved and possible misconduct by various players, but that’s not what I want to cover here. My concern is how much time, effort, and tears could have been saved if all the relevant raw data had been made available in the first place. Demonstrating a new enzymatic activity is very difficult work. It is absolutely critical to rigorously exclude the possibility of any contaminating activity, and in practice this is virtually impossible to guarantee. A negative control experiment is therefore very important. It appears that this control experiment was carried out, but possibly only once, against a background of significant variability in the results. All of this led to another group wasting on the order of twelve months trying to replicate these results. Well, not wasting, but correcting the record – arguably a very important activity, but one for which they will get little credit in any meaningful sense (an issue for another post, and one mentioned by Noam Harel in a comment at the News Feature online).

So what might have happened if the original raw data had been available? Would it have prevented the publication of the papers in the first place? It’s very hard to tell. The referees were apparently convinced by the quality of the data. But if this was ‘typical data’ (using the special scientific sense of typical, viz. ‘the best we’ve got’) and the referees had seen the raw data with its greater variability, then maybe they would have wanted to see more or better controls; perhaps not. Certainly, if the raw data had been available, the second group would have realised much sooner that something was wrong.

And this is a story we see over and over again. The selective publication of results without reference to the full set of data; a slight shortcut taken, or potential issues with the data that are not revealed to referees or to the readers of the paper; other groups spending months or years attempting to replicate results or simply to use a method described by another group. And in the meantime graduate students and postdocs get burnt on the pyre of scientific ‘progress’, discovering that something isn’t reproducible.

The Nature editorial is subtitled ‘Retracted papers require a thorough explanation of what went wrong in the experiments’. In my view this goes nowhere near far enough. There is no longer any excuse for not providing all the raw and processed data as part of the supplementary information for published papers. Even in the form of scanned lab book pages this could have made a big difference in this case, immediately indicating the degree of variability and the purity of the proteins. Many may say that this is too much effort, or that the data cannot be found. But if that is the case then serious questions need to be asked about the publication of the work. Publishers also need to play a role by providing more flexible and better-indexed facilities for supplementary information, and by making sure it is indexed by search engines.

Some of us would go much further than this, and believe that making the raw data immediately available is a better way to do science. Certainly in this case it might have reduced the pressure to rush to publish, and might have forced a more open and more thorough scrutiny of the underlying data. This kind of radical openness is perhaps not for everyone, but it should be less prone to gaffes of the sort described here. I know I can have more faith in the work of my group when I can put my fingers on the raw data and check through the detail. We are still going through the process of implementing this move to complete (or as complete as we can be) openness and it’s not easy. But it helps.

Science has moved on from the days when the paper could only contain what would fit on the printed pages. It has moved on from the days when an informal circle of contacts would tell you which group’s work was repeatable and which was not. The pressures are high and the potential for career disaster probably higher. In this world the reliability and completeness of the scientific record is crucial. Yes, there are technical difficulties in making it all available. Yes, it takes effort, and yes, it will involve more work and possibly fewer papers. But the only thing that can ultimately be relied on is the raw data (putting aside deliberate fraud). If the raw data doesn’t form a central part of the scientific record then perhaps we need to start asking whether the usefulness of that record in its current form is running out.

  1. Wenner, M. Nature 453, 271–275 (2008)
  2. Editorial. Nature 453, 258 (2008)
  3. Dwyer, M. A., Looger, L. L. & Hellinga, H. W. Science 304, 1967–1971 (2004)
  4. Allert, M., Dwyer, M. A. & Hellinga, H. W. J. Mol. Biol. 366, 945–953 (2007)

Protocols for Open Science

interior detail, stata center, MIT. just outside science commons offices.

One of the strong messages that came back from the workshop we held at the BioSysBio meeting was that protocols and standards of behaviour were something people would appreciate having available. There are many potential issues raised by the idea of a ‘charter’ or ‘protocol’ for open science, but these are definitely things worth talking about. I thought I would throw a few ideas out and see where they go. There are some potentially serious contradictions to be worked through. Continue reading “Protocols for Open Science”

The economic case for Open Science

I am thinking about how to present the case for Open Science, Open Notebook Science, and Open Data at Science in the 21st Century, the meeting being organised by Sabine Hossenfelder and Michael Nielsen at the Perimeter Institute for Theoretical Physics. I’ve put up a draft abstract, and as you might guess from this I wanted to make an economic case that the waste of resources, both human and monetary, is not sustainable for the future. Here I want to rehearse that argument a bit further, as well as explore the business case that could be presented to Google/the Gates Foundation as a package that would include the development of the Science Exchange ideas I blogged about last week. Continue reading “The economic case for Open Science”

Somewhat more complete report on BioSysBio workshop

The Queen's Tower, Imperial College. Image via Wikipedia.

This has taken me longer than expected to write up. Julius Lucks, John Cumbers, and I led a workshop on Open Science on Monday 21st at the BioSysBio meeting at Imperial College London. I had hoped to record a screencast, audio, and possibly video as well, but in the end the laptop I am working off couldn’t cope with running both the projector and Camtasia at the same time with reasonable response rates (it’s a long story, but in theory I get my ‘proper’ laptop back tomorrow, so hopefully better luck next time). We had somewhere between 25 and 35 people throughout most of the workshop and the feedback was all pretty positive. What I found particularly exciting was that, although the usual issues of scooping, attribution, and the general dishonesty of the scientific community were raised, they were raised only in passing, with much more of the discussion focussing on practical issues. Continue reading “Somewhat more complete report on BioSysBio workshop”

Science in the 21st Century

Perimeter Institute by hungryhungrypixels (picture found by Zemanta).

Sabine Hossenfelder and Michael Nielsen of the Perimeter Institute for Theoretical Physics are organising a conference called ‘Science in the 21st Century’, which was inspired in part by SciBarCamp. I am honoured, and not a little daunted, to have been asked to speak, considering the star-studded lineup of speakers including, well, lots of really interesting people – read the list. The meeting looks to be a really interesting mix of science, tools, and how these interact with people (and scientists). I’m looking forward to it. Continue reading “Science in the 21st Century”

Open Science in the Undergraduate Laboratory: Could this be the success story we’re looking for?

A whole series of things have converged in the last couple of days for me. First was Jean-Claude’s description of the work [1, 2] he and Brent Friesen of the Dominican University are doing putting the combi-Ugi project into an undergraduate laboratory setting. The students will make new compounds which will then be sent for testing as antimalarial agents by Phil Rosenthal at UCSF. This is a great story and a testament in particular to Brent’s work to make the laboratory practical more relevant and exciting for his students.

At the same time I get an email from Anna Croft, University of Bangor, Wales, after meeting up the previous day; Continue reading “Open Science in the Undergraduate Laboratory: Could this be the success story we’re looking for?”

Give me the feed tools and I can rule the world!

Two things last week gave me more cause to think a bit harder about the RSS feeds from our LaBLog and how we can use them. First, when I gave my talk at UKOLN I made a throwaway comment about search and aggregation. I was arguing that the real benefits of open practice will come when we can use other people’s filters and aggregation tools to easily access the science that we ought to be seeing. Google searching for a specific thing isn’t enough. We need an aggregated feed of the science we want or need to see, delivered automatically; i.e. we need systems that know what to look for even before the humans know it exists. I suggested the following as an initial target:

‘If I can automatically identify all the compounds recently made in Jean-Claude’s group and then see if anyone has used those compounds [or similar compounds] in inhibitor screens for drug targets then we will be on our way towards managing the information’

The idea here would be to take a ‘Molecules’ feed (such as the molecules blog at UsefulChem or molecules at Chemical Blogspace), extract the chemical identifiers (InChI, SMILES, CML or whatever), and then use these to search feeds from those people exposing experimental results from drug screening. You might think some combination of Yahoo! Pipes and Google Search ought to do it.

So I thought I’d give this a go. And I fell at the first hurdle. I could grab the feed from the UsefulChem molecules blog, but what I actually did was set up a test post in the Chemtools Sandpit Blog. Here I put the InChI of one of the compounds from UsefulChem that was recently tested as a falcipain-2 inhibitor. The InChI went in both as clear text and in the microformat approach suggested by Egon Willighagen. Pipes was perfectly capable of pulling the feed down and reducing it to only the posts that contained InChIs, but I couldn’t for the life of me figure out how to extract the InChI itself. Pipes doesn’t seem to see microformats. Another problem is that there is no obvious way of converting a Google Search (or Google Custom Search) into an RSS feed.
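For what it’s worth, the extraction step that defeated Pipes takes only a few lines of Python. This sketch uses a made-up two-item feed standing in for the UsefulChem molecules feed, and pulls InChIs out of the item descriptions with a regular expression:

```python
import re
import xml.etree.ElementTree as ET

# A made-up two-item RSS feed standing in for the real molecules feed.
RSS = """<rss version="2.0"><channel><title>molecules</title>
<item><title>compound 1</title>
<description>Tested as falcipain-2 inhibitor. InChI=1/CH4/h1H4</description></item>
<item><title>compound 2</title>
<description>No identifier given here.</description></item>
</channel></rss>"""

# Match anything that looks like an InChI string, up to whitespace or markup.
INCHI = re.compile(r"InChI=[^\s<\"]+")

def extract_inchis(rss_text):
    """Return all InChI strings found in the feed's item descriptions."""
    root = ET.fromstring(rss_text)
    found = []
    for item in root.iter("item"):
        desc = item.findtext("description") or ""
        found.extend(INCHI.findall(desc))
    return found

print(extract_inchis(RSS))
```

The same function could be pointed at a live feed by fetching the URL first; matching microformatted identifiers would need the description parsed as HTML rather than regexed, but the principle is the same.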

Now there may well be ways to do this, or perhaps other tools that do it better, but they aren’t immediately obvious to me. Would the availability of such tools help us take the Open Research agenda forward? Yes, definitely. I am not sure exactly how much or how fast, but without easy-to-use tools that are well presented and easily available, the case for making the information available is harder to make. What’s the point of having it on the cloud if you can’t customise your aggregation of it? To me this is the killer app: being able to identify, triage, and collate data as it happens with easily usable and automated tools. I want to see the stuff I need to see in my feed reader before I know it exists. It’s not that far away, but we ain’t there yet.

The other thing this brought home to me was the importance of feeds, and in particular of rich feeds. One of the problems with wikis is that they don’t in general provide an aggregated or user-configurable feed of the site as a whole, or of a namespace such as a single lab book. They also don’t readily provide a means of tagging or adding metadata. Neither wikis nor blogs provide immediately accessible tools for configuring multiple RSS feeds, at least not in the world of freely hosted systems. The Chemtools blogs each put out an RSS feed, but it doesn’t currently include all the metadata. The more I think about this, the more crucial I think it is.

To see why, I will use another example. One of the features people liked about our blog-based framework at the workshop last week was that they got a catalogue of various different items (chemicals, oligonucleotides, compound types) for free once the information was in the system and properly tagged. This is true, but you don’t get the full benefits of a database for searching, organisation, presentation, etc. We have been using DabbleDB to handle a database of lab materials, and one of our goals has been to update that database automatically. What I hadn’t realised before last week was the potential to use user-configured RSS feeds to set up multiple databases within DabbleDB, providing a more sophisticated laboratory stocks database.

DabbleDB can be set up to read RSS or other XML or JSON feeds to update itself, as Lucy Powers pointed out to me at the workshop. To update a database, all we need is a properly configured RSS feed. As long as our templates are stable, the rest of the process is reasonably straightforward, and we can generate databases of materials of all sorts, along with expiry dates, lot numbers, ID numbers, safety data, etc. The key to this is rich feeds that carry as much information as possible, and in particular as much of the information we have chosen to structure as possible. We don’t even need the feeds to be user-configurable within the system itself, as we can use Pipes to easily configure custom feeds.
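As an illustration of what such a rich feed might look like, this sketch builds an RSS item carrying a lot number and expiry date in a custom namespace. The element names and namespace URL are invented for the example, not an existing schema:

```python
import xml.etree.ElementTree as ET

# Invented namespace for the structured lab metadata; a real deployment
# would pick its own stable URL and element names.
LAB = "http://example.org/labns"
ET.register_namespace("lab", LAB)

def material_item(name, lot, expiry):
    """Build one RSS <item> carrying structured metadata alongside the title."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = name
    ET.SubElement(item, f"{{{LAB}}}lotNumber").text = lot
    ET.SubElement(item, f"{{{LAB}}}expiryDate").text = expiry
    return item

rss = ET.Element("rss", version="2.0")
channel = ET.SubElement(rss, "channel")
ET.SubElement(channel, "title").text = "Lab materials"
channel.append(material_item("Ampicillin", "LOT-0042", "2009-06-30"))

feed_xml = ET.tostring(rss, encoding="unicode")
print(feed_xml)
```

A consumer that understands the namespace gets clean columns for lot number and expiry date; anything else still sees a valid RSS feed and just shows the titles.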

We, or rather a noob like me, can do an awful lot with some of the tools already available and a bit of judicious pointing and clicking. When these systems get just a little bit better at extracting information (and when we get just a little bit better at putting information in, by making it part of the process) we are going to be doing lots of very exciting things. I am trying to keep my diary clear for the next couple of months…

Connecting Open Notebook data to a peer reviewed paper

One thing that we have been thinking about a bit recently is how best to link elements of a peer reviewed paper back to an Open Notebook. There are a number of issues that this raises, both technical and philosophical about how and why we might do this. Our first motivation is to provide access to the raw data if people want it. The dream here is that by clicking on a graph you are taken straight through to processed data which you can then backtrack through to get to the raw data itself. This is clearly some way off.

Other simple solutions are to provide a hyperlink back to the open notebook, or to an index page that describes the experiment, how it was done, and how the data was processed. Publishers are always going to have an issue with this because they can’t rely on the stability of external material. So the other solution is to package up a version of the notebook and provide it as supplementary information. This could still provide links back to the ‘real’ notebook but provides additional stability and also protects the data against disaster by duplicating it.

The problem with this is that many journals will only accept PDF. While we can process a notebook into a package wrapped up as a PDF, this has a lot of limitations, particularly when it comes to data scraping, which after all we want to encourage. An encouraging development was recently described on the BioMedCentral blog, where they describe the capability of uploading a ‘mini website’ as supplementary information. This is great, as we can build a static version of our notebook with lots of lovely rich metadata built in. We can point out to the original notebook, and point from the original notebook back to the paper. I am supposed to be working on a paper at the moment and was considering where to send it. I hope we can give BMC Biotechnology, or perhaps BMC Research Notes, a go to test this out.