Defining error rates in the Illumina sequence: A useful and feasible open project?

Regular readers will know I am a great believer in the potential of Web 2.0 tools to enable rapid aggregation of loose networks of collaborators to solve a particular problem, and in the possibilities of using this approach to do science better, faster, and more efficiently. The reason we haven’t had great successes with this so far comes down fundamentally to the size of the network we have in place and the bias in that network’s expertise towards specific areas. There is a strong bioinformatics/IT bias among the people interested in these tools, and this plays out in a number of ways, from the people on FriendFeed to the relative frequency of commenting on PLoS Computational Biology versus PLoS ONE.

Putting these two together, one obvious solution is to find a problem that is well suited to the people who are around, may be of interest to them, and is also quite useful to solve. I think I may have found such a problem.

The Illumina next-generation sequencing platform, originally developed by Solexa, is the latest kid on the block among the systems that have reached the market. I spent a good part of today talking about how the analysis pipeline for this system could be improved. One thing that came out as an issue is that no-one seems to have published a detailed analysis of the types of errors that this system generates experimentally. Illumina have probably done this analysis in some form but have better things to do than write it up.

The Solexa system is based on sequencing by synthesis. A population of DNA molecules, all amplified from the same single molecule, is immobilised on a surface. A new complementary strand is then synthesised on each molecule, one base at a time. In the Solexa system each base carries a different fluorescent marker plus a blocking reagent. After the base is added and the colour read, the blocker is removed and the next base can be added. More details can be found on the genographia wiki. There are two major sources of error here. Firstly, for a proportion of each sample, the base is not added successfully. This means that in the next round, that part of the sample may generate a readout for the previous base. Secondly, the blocker may fail, leading to the addition of two bases and causing a similar problem in reverse. As the cycles proceed, the ends of the DNA strands in the sample get increasingly out of phase, making it harder and harder to tell which is the correct signal.
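To make the dephasing argument concrete, here is a minimal sketch of how the in-phase fraction of a cluster decays over cycles. It assumes a deliberately simple model that is not taken from any Illumina documentation: in every cycle each strand independently fails to add a base with probability `p_lag`, adds two bases with probability `p_lead`, and otherwise adds exactly one. The function names and the default rates are hypothetical, chosen only for illustration.

```python
def phase_distribution(n_cycles, p_lag=0.01, p_lead=0.005):
    """Track how far strands drift out of phase, cycle by cycle.

    Returns a list with one dict per cycle, mapping
    offset = (bases incorporated) - (cycle number) to the fraction of
    strands at that offset. Offset 0 means "in phase".
    The per-cycle rates are illustrative assumptions, not measured values.
    """
    p_ok = 1.0 - p_lag - p_lead
    dist = {0: 1.0}  # before the first cycle every strand is in phase
    history = []
    for _ in range(n_cycles):
        new = {}
        for offset, frac in dist.items():
            # base not added: the strand falls one position behind
            new[offset - 1] = new.get(offset - 1, 0.0) + frac * p_lag
            # exactly one base added: the offset is unchanged
            new[offset] = new.get(offset, 0.0) + frac * p_ok
            # blocker failed, two bases added: one position ahead
            new[offset + 1] = new.get(offset + 1, 0.0) + frac * p_lead
        dist = new
        history.append(dist)
    return history


def in_phase_fraction(history):
    """Fraction of strands reporting the correct base at each cycle."""
    return [dist.get(0, 0.0) for dist in history]


if __name__ == "__main__":
    fractions = in_phase_fraction(phase_distribution(36))
    for cycle in (1, 12, 24, 36):
        print(cycle, round(fractions[cycle - 1], 3))
```

Even with per-cycle failure rates of around one percent, the in-phase fraction drops steadily with read length under this model, which is the effect described above.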

These error rates probably depend both on the identity of the base being added and on the identity of the previous base. They may also depend on the number of cycles that have been carried out. There is also the possibility that the sample DNA has errors in it due to the amplification process, though these are likely to be close to insignificant. However, there is no data available on these error rates. It should be simple, you might think, to get some of the raw data and do the analysis: fit the sequence of raw intensity data to a model in which the parameters are the error rates for each base.

Well, we know that the availability of data makes re-processing possible, and we further believe in the power of the social network. I know that a lot of you guys are good at this kind of analysis and might be interested in having a play with some of the raw data. It could also make a good paper, Nature Biotech or Nature Methods perhaps, and I am prepared to bet it would get an interesting editorial write-up on the process as well. I don’t really have the skills to do the work, but if others out there are interested then I am happy to coordinate. This could all be done in the wild, out in the open, and I think that would be a brilliant demonstration of the possibilities.

Oh, the data? We’ve got access to the raw and corrected spot intensities and the base calls from a single ‘tile’ of the phiX174 control lane for a run from the 1000 Genomes Project, which can be found at http://sgenomics.org/phix174.tar.gz courtesy of Nava Whiteford from the Sanger Centre. If you’re interested in the final product you can see some of the final read data being produced here.

What I had in mind was taking the called sequence and aligning it onto phiX174 so we know the ‘true’ sequence, then using that true sequence together with an error model to estimate the model’s error-rate parameters. Perhaps there is a better way to approach the problem? There are a series of relatively simple error models that could be tried, and if the error rates can be defined it will enable a really significant increase in both the quality and quantity of data that can be obtained from these machines. I figure on splitting the job up among a few small groups working on different models, and putting the whole thing up on Google Code with a wiki there to coordinate and capture other issues as we go forward. Anybody up for it (and got the time)?
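As a starting point for the simplest of those models, the empirical error rates can be tabulated directly once the reads are aligned. The sketch below assumes the alignment has already been done by some other tool and that each read arrives as an ungapped pair of equal-length strings (called bases, matching phiX174 segment), both written in the order the bases were read. The function name `tabulate_errors` and the returned layout are hypothetical, and indels are ignored entirely.

```python
from collections import Counter


def tabulate_errors(pairs):
    """Count mismatches between called reads and the reference.

    pairs: iterable of (called, truth) strings of equal length, already
           aligned without gaps and oriented in cycle order (an assumption).
    Returns mismatch rates per cycle (1-based), per context
    (previous true base, true base), and raw (true, called) counts.
    """
    cycle_err, cycle_tot = Counter(), Counter()
    ctx_err, ctx_tot = Counter(), Counter()
    substitutions = Counter()
    for called, truth in pairs:
        if len(called) != len(truth):
            raise ValueError("read and reference segment must be the same length")
        for i, (c, t) in enumerate(zip(called, truth)):
            cycle = i + 1
            # "^" marks the first cycle, where there is no previous base
            context = (truth[i - 1] if i > 0 else "^", t)
            cycle_tot[cycle] += 1
            ctx_tot[context] += 1
            if c != t:
                cycle_err[cycle] += 1
                ctx_err[context] += 1
                substitutions[(t, c)] += 1
    return {
        "by_cycle": {k: cycle_err[k] / n for k, n in sorted(cycle_tot.items())},
        "by_context": {k: ctx_err[k] / n for k, n in sorted(ctx_tot.items())},
        "substitutions": substitutions,
    }


if __name__ == "__main__":
    toy = [("ACGT", "ACGT"), ("ACGA", "ACGT"), ("TCGT", "ACGT")]
    result = tabulate_errors(toy)
    print(result["by_cycle"])
    print(result["substitutions"])
```

These counts only describe errors in the final base calls. Fitting a model to the raw intensities, as suggested above, would be a separate and harder step.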

8 Replies to “Defining error rates in the Illumina sequence: A useful and feasible open project?”

Talk to Matt Wood and others at the Sanger first. I think they have more experience on Solexas than anyone else on the planet, and might be able to shed some light. If this is a problem that’s close to being addressed (answers vary depending on whom you talk to), then the project might be short lived (not necessarily a bad thing). There are enough people working on these error models anyway. Best would be to rope them in and publish something on Google Code, because it’s important to have re-usable access to those models.

This came out of a conversation at Sanger, although not with Matt Wood’s group. The guys I’m talking to are the ones actually handling the raw data coming off the instruments and trying to improve on the analysis pipeline. I think Nav is out there somewhere though so I’ll leave it to him to make more precise comments on exactly what they are doing, but the plan is very cool.

In that case, with them leading the way, I believe it would be a great idea, and not just for Solexa but for all the platforms.

Hi, there’s some exploratory work on creating an error model for Solexa sequencing being done here which should appear shortly. Most of the work is geared towards quantifying and identifying errors; what we’d be interested in doing is creating a good model, parameterizing that based on observations, and then using that to correct for a given systematic error or “artifact”.

Phasing is a particularly interesting target for this: it should be possible to create an accurate, well-founded model for phasing and then parameterize that based on observed phasing. This could then be used to correct the signal we get out of the device.
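A rough sketch of what such a model-based correction could look like follows. It is not the Illumina pipeline's method or the in-house one. It assumes a simple model in which each strand lags with probability `p_lag` or leads with probability `p_lead` per cycle, that a strand's signal always comes from the last template position it incorporated, and that template positions beyond the read length contribute nothing. Under those assumptions the observed intensities are a known linear mixture of the true ones, so the mixture can be inverted. All function names here are hypothetical.

```python
BASES = "ACGT"


def phasing_matrix(n_cycles, p_lag, p_lead):
    """P[t][k]: fraction of strands whose signal at cycle t+1 comes from
    template position k (0-based), under the assumed lag/lead model."""
    p_ok = 1.0 - p_lag - p_lead
    count = [1.0] + [0.0] * (n_cycles + 1)  # count[j]: fraction with j bases added
    rows = []
    for _ in range(n_cycles):
        new = [0.0] * len(count)
        for j, frac in enumerate(count):
            new[j] += frac * p_lag
            if j + 1 < len(new):
                new[j + 1] += frac * p_ok
            if j + 2 < len(new):
                new[j + 2] += frac * p_lead
        count = new
        rows.append(count[1:n_cycles + 1])
    return rows


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def apply_phasing(P, signal):
    """Mix per-cycle [A, C, G, T] intensities according to P."""
    n = len(P)
    return [[sum(P[t][k] * signal[k][c] for k in range(n)) for c in range(4)]
            for t in range(n)]


def correct_phasing(P, observed):
    """Undo the mixing, one colour channel at a time."""
    n = len(P)
    channels = [solve(P, [observed[t][c] for t in range(n)]) for c in range(4)]
    return [[channels[c][t] for c in range(4)] for t in range(n)]


def call_bases(signal):
    """Call the brightest channel at each cycle."""
    return "".join(BASES[max(range(4), key=row.__getitem__)] for row in signal)


if __name__ == "__main__":
    seq = "GATTACAGATTA"
    true = [[1.0 if b == base else 0.0 for b in BASES] for base in seq]
    P = phasing_matrix(len(seq), p_lag=0.02, p_lead=0.01)
    print(call_bases(correct_phasing(P, apply_phasing(P, true))))
```

In practice the rates would have to be estimated from the data rather than supplied, and real intensities are noisy, so a direct inversion like this can amplify noise at later cycles. A regularised or probabilistic fit would probably be needed.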

Obviously phasing is corrected for at the moment in the Illumina pipeline, and we have our own in-house corrections for the open source pipeline we’re developing, but both of these are heuristically defined.

Yep, we’d certainly be interested in looking at other next-gen technologies as well. We have 454s and SOLiDs here; at the moment we are focused on the Solexas though, as they seem to throw out the most high-quality data.
