Collaboration – Science in the Open

July 17, 2009December 30, 2009

Sci – Bar – Foo etc. Part II – SciFoo – Engaging with the world

Last Friday afternoon (was it really only a week ago?) about 200 people made their way to the Googleplex in Mountain View for the fourth SciFoo. There are many people who got their blog posts out well before me so I will focus on the sessions which don’t seem to have been heavily discussed and try to draw a few themes out.

For me, the over riding theme that came through was Engagement. Engaging people beyond the narrow confines of the professional research community in real research projects, making science more engaging for students, and engaging in a serious way with both the tools that are available to help us do these things, and increasingly with data generation and dissemination processes that are not under our control.

I was involved in running two sessions. The first with Peter Murray-Rust was on Open Data, focussed on getting feedback on the current form of the Panton Principles and has been blogged in detail by Peter. For me the main message from this was a lack of push-back. Many of the more technical people in the room were bemused that there was a problem. “Just put it on the web” was a common response. Other’s were concerned about where data stops and creative works begin but the main message for me was that “for published data just put it explicitly in the public domain” was seen as the right thing to do by the people in the room. Indeed most were suprised it was even worth discussing.

The second session I ran was on Google Wave in research and this will get a whole post of its own very soon so I won’t discuss it in detail here. Suffice to say that there was excitement, great ideas about what could be done, and concerns about the details of technical implementation. Which to me seems like an excellent mix to make progress with. Engagement for these two sessions was engagement with the data and engagement with the technology for generating, annotating, and sharing that data.

The other sessions I would like to draw a common theme through were more focussed on public engagement and education. The first session I attended on Saturday morning was run by Daniel Glaser called Doing Science in Non-Science Spaces. This was an interesting discussion on many levels but particularly for me because it challenged my ideas about multi-disciplinary working and deploying research projects into an educational setting. Daniel described disciplinary boundaries as fractal and described multidisciplinary projects as requiring as space where people can come together in a safe common space to share ideas, but also a requirement for people to then disperse again and re-intepret the outputs in the context of their own experience and discipline. In this view disciplinary boundaries are important in enabling effective summarisation and communication of outputs. I’ve been kicking myself ever since for not thinking to ask whether that means these boundaries are any less arbitrary than those of us who are interdisciplinary always feel.

Another challenge to my thinking from this session was the need to give up control over the shared collaboration space. In thinking about putting research projects into educational settings I’ve always looked at the process as trying to find a question within the research that can be understood and answered by students. The argument here was that to truly engage students it would be necessary to let them find and answer their own questions. I’m not sure how in practice to think about that in terms of drug discovery or how it maps on the success of projects like Galaxy Zoo but it bears some thinking about.

Also focussed on interactions beyond the professional research community was Ariel Waldman‘s session “Open collaboration between scientists, communities, and the unknown” which followed on from a session of the same title at SciBarCamp which I somehow missed. Here the focus was on problems with sharing research with the wider world, with similar problems to those of sharing between researchers identified,Â and potential solutions. Some great projects were discussed and showcased with contributions on a new collaboration site for research into Parkinsons, getting the public to search for surface exposed fossils in high resolution ground images (Louise Leakey, Turkana Basin Institute), and the experience of being the public conduit for a spacecraft from Veronica “Mars Phoenix” McGregor. Once again a major theme was “just get the data out there” so that people can do something with it if they want to. If it isn’t available no-one is going to do anything.

The final session was lead by Joan Peckham on Computational Thinking, the idea that the principles behind good computing design should be taught as a core skill on a par with reading and writing, and that this techniques are widely applicable beyond computing per se. For more on the background to this you can checkout John Udell interviewing Joan on his Interviews with Innovators podcast. The point for me was to try and understand how I can most effectively learn these principles and techniques as it is clear to me that I need a better understanding of good software and system design for the work I would like to do. What was interesting to me was whether my needs mapped onto what would be required for teaching children and whether willing and interested guinea pigs such as myself might be useful in helping to develop educational programmes. Here engagement means effective use of technology and design of systems that will make our work and collaborations efficient.

Scifoo is always challenging, requiring that you re-think and re-examine many of the assumptions that your everyday work is built on. Many smart people with very different perspectives and experiences make a great environment to stress test your ideas, sometimes to destruction. The challenge can be actually applying those insights in the real world with limited resources and time. But it provides some goals to work towards and much food for thought.

February 13, 2009December 30, 2009

The growth of linked up data in chemistry – and good community projects

It’s been an interesting week or so in the Chemistry online world. Following on from my musings about data services and the preparation I was doing for a talk the week before last I asked Tony Williams whether it was possible to embed spectra from ChemSpider on a generic web page in the same way that you would embed a YouTube video, Flickr picture, or Slideshare presentation. The idea is that if there are services out on the cloud that make it easier to put some rich material in your own online presence by hosting it somewhere that understands about your data type, then we have a chance of pulling all of these disparate online presences together.

Tony went on to release two features, one that enables you to embed a molecule, which Jean-Claude has demonstrated over on the ONS Challenge Wiki. Essentially by cutting and pasting a little bit of text from ChemSpider into Wikispaces you get a nicely drawn image of the molecule, and the machinery is in place to enable good machine readability of the displayed page (by embedding chemical identifiers within the code) as well as enabling the aggregation of web based information about the molecule back at Chemspider.

The second feature was the one I had asked about, the embedding of spectra. Again this is really useful because it means that as an experimentalist you can host spectra on a service that gets what they are, but you can also incorporate them in a nice way back into your lab book, online report, or whatever it is you are doing. This has already enabled Andy Lang and Jean-Claude to build a very cool game, initially in Second Life but now also on the web. Using the spectral and chemical information from Chemspider the player is presented with the spectrum and three molecules; if they select the correct molecule they get some points, if they get it wrong they lose some. As Tony has pointed out, this is also a way of crowdsourcing the curation process – if the majority of people disagree with the “correct” assignment then maybe the spectrum needs a second look. Chemistry Captchas anyone?

The other even this week has been the efforts by Mitch over at the Chemistry Blog to set up an online resource for named reactions by crowdsourcing contributions and ultimately turning it into a book. Mitch deserves plaudits for this because he’s gone on and done something rather than just talked about it and we need more people like that. Some of us have criticised the details (also see comments at the original post) of how he is going about it but from my perspective this is definitely criticism motivated by the hope that he will succeed and that by making some changes early on, there is the chance to get much more out of the contributions that he gets.

In particular Egon asked whether it would be better to use Wikipedia as the platform for aggregating the named reaction; a point which I agree with. The problem that people see with Wikipedia is largely that of image. People are concerned about inaccurate editing, about the sometimes combative approach of senior editors that are not necessarily expert in the are. Part of the answer is to just get in there and do it – particularly in chemistry there are a range of people working hard to try and get stuff cleaned up. Lots of work has gone into the Chemical boxes and named reactions would be an obvious thing to move on to. Nonetheless it may not work for some people and to a certain extent as long as the material that is generated can be aggregated back to Wikipedia I’m not really fussed.

The bigger concern for us “chemoinformatics jocks” (I can’t help but feel that categorising me as a foo-informatics anything is a little off beam but never mind (-;) was the example pages Mitch put up where there was very little linking back of data to other resources. So there was no way, for instance, to know that this page was even about a specific class of chemicals. The schemes were shown as plain images, making it very hard for any information aggregation service to do anything useful. Essentially the pages didn’t make full use of the power of the web to connect information.

Mitch in turn has taken the criticism offered in a positive fashion and has thrown down the gauntlet; effectively asking the question, “well if you want this marked up, where are the tools to make it easy, and the instructions in plain English to show how to do it?”. He also asks, if named reactions aren’t the best place to start, then what would be a good collaborative project. Fair questions, and I would hope the Chemspider services start to point in the right direction. Instead of drawing an image of a molecule and pasting it on a web page, use the service and connect to the molecule itself, this connects the data up, and gives you a nice picture for free. It’s not perfect. The ideal situation would be a little chemical drawing palette. You draw the molecule you want, it goes to the data service of choice, finds the correct molecule (meaning the user doesn’t need to know what SMILES, InChis or whatever are), and then brings back whatever you want; image, data, vendors, price. This would be a really powerful demonstration of the power of linked data and it probably could be pulled together from existing services.

But what about the second question? What would be a good project? Well this is where that second Chemspider functionality comes in. What about flooding Chemspider with high quality, electronic copies, of NMR spectra. Not the molecules that your supervisor will kill you for releasing, but all those molecules you’ve published, all those hiding in undergraduate chemistry practicals. Grab the electronic files, lets find a way of converting them all to JCamp online, and get ’em up on Chemspider as Open Data.

May 28, 2008December 30, 2009

Defining error rates in the Illumina sequence: A useful and feasible open project?

Regular readers will know I am a great believer in the potential of Web2.0 tools to enable rapid aggregation of loose networks of collaborators to solve a particular problem and the possibilities of using this approach to do science better, faster, and more efficiently. The reason why we havenâ€™t had great successes on this thus far is fundamentally down to the size of the network we have in place and the bias in the expertise of that network towards specific areas. There is a strong bioinformatics/IT bias in the people interested in these tools and this plays out in a number of fields from the people on Friendfeed, to the relative frequency of commenting on PLoS Computational Biology versus PLoS ONE.

Putting these two together one obvious solution is to find a problem that is well suited to the people who are around, may be of interest to them, and is also quite useful to solve. I think I may have found such a problem.

The Illumina next generation sequencing platform developed originally by Solexa is the latest kid on the block as far as the systems that have reached the market. I spent a good part of today talking about how the analysis pipeline for this system could be improved. But one thing that came out as an issue is that no-one seems to have publishedÂ detailed analysis of the types of errors that are generated experimentally by this system. Illumina probably have done this analysis in some form but have better things to do than write it up.

The Solexa system is based on sequencing by synthesis. A population of DNA molecules, all amplified from the same single molecule, is immobilised on a surface. A new strand of DNA is added, one base at a time. In the Solexa system each base has a different fluorescent marker on it plus a blocking reagent. After the base is added, and the colour read, the blocker is removed and the next base can be added. More details can be found on the genographia wiki. There are two major sources of error here. Firstly, for a proportion of each sample, the base is not added successfully. This means in the next round, that part of the sample may generate a readout for the previous base. Secondly the blocker may fail, leading to the addition of two bases, causing a similar problem but in reverse. As the cycles proceed the ends of each DNA strand in the sample get increasingly out of phase making it harder and harder to tell which is the correct signal.

These error rates are probably dependent both on the identity of the base being added and the identity of the previous base. It may also be related to the number of cycles that have been carried out. There is also the possibility that the sample DNA has errors in it due to the amplification process though these are likely to be close to insignificant. However there is no data on these error rates available. Simple you might think to get some of the raw data and do the analysis â€“ fit the sequence of raw intensity data to a model where the parameters are error rates for each base.

Well we know that the availability of data makes re-processing possible and we further believe in the power of the social network. And I know that a lot of you guys are good at this kind of analysis, and might be interested in having a play with some of the raw data. It could also be a good paper â€“ Nature Biotech/Nature Methods perhaps and I am prepared to bet it would get an interesting editorial writeup on the process as well. I donâ€™t really have the skills to do the work but if others out there are interested then I am happy to coordinate. This could all be done, in the wild, out in the open and I think that would be a brilliant demonstration of the possibilities.

Oh, the data? Weâ€™ve got access to the raw and corrected spot intensities and the base calls from a single â€˜tileâ€™ of the phiX174 control lane for a run from the 1000 Genomes Project which can be found at http://sgenomics.org/phix174.tar.gz courtesy of Nava Whiteford from the Sanger Centre. If youâ€™re interested in the final product you can see some of the final read data being produced here.

What I had in mind was taking the called sequence, align onto phiX174 so we know the ‘true’ sequence. Then use that sequence plus a model with error rates to parameterise those error rates. Perhaps there is a better way to approach the problem? There are a series of relatively simple error models that could be tried and if the error rates can be defined then it will enable a really significant increase in both the quality and quantity of data that can be determined by these machines. I figure splitting the job up into a few small groups working on different models, putting the whole thing up on google code with a wiki there to coordinate and capture other issues as we go forward. Anybody up for it (and got the time)?

Project to map 1,000 people’s DNA [viaÂ Zemanta]
Watson and … Kriek [viaÂ Zemanta]
Detailed gene map will lift lid on diseases [viaÂ Zemanta]

May 11, 2008December 30, 2009

More on the science exchance – or building and capitalising a data commons

Image from Wikipedia via Zemanta

Following on from the discussion a few weeks back kicked off by Shirley at One Big Lab and continued here I’ve been thinking about how to actually turn what was a throwaway comment into reality:

What is being generated here is new science, and science isnâ€™t paid for per se. The resources that generate science are supported by governments, charities, and industry but the actual production of science is not supported. The truly radical approach to this would be to turn the system on its head. Donâ€™t fund the universities to do science, fund the journals to buy science; then the system would reward increased efficiency.

There is a problem at the core of this. For someone to pay for access to the results, there has to be a monetary benefit to them. This may be through increased efficiency of their research funding but that’s a rather vague benefit. For a serious charitable or commercial funder there has to be the potential to either make money, or at least see that the enterprise could become self sufficient. But surely this means monetizing the data somehow? Which would require restrictive licences, which is not at the end what we’re about.

The other story of the week has been the, in the end very useful, kerfuffle caused by ChemSpider moving to a CC-BY-SA licence, and the confusion that has been revealed regarding data, licencing, and the public domain. John Wilbanks, whose comments on the ChemSpider licence, sparked the discussion has written two posts [1, 2] which I found illuminating and have made things much clearer for me. His point is that data naturally belongs in the public domain and that the public domain and the freedom of the data itself needs to be protected from erosion, both legal, and conceptual that could be caused by our obsession with licences. What does this mean for making an effective data commons, and the Science Exchange that could arise from it, financially viable? Continue reading “More on the science exchance – or building and capitalising a data commons”

April 13, 2008December 30, 2009

Bursty science depends on openness

Image via Wikipedia

There have been a number of interesting discussions going on in the blogosphere recently about radically different ways of practising science. Pawel Szczesny has blogged about his plans for freelancing science as a way of moving out of the rigid career structure that drives conventional academic science. Deepak Singh has blogged a number of times about ‘bursty science‘, the idea that projects can be rapidly executed by distributing them amongst a number of people, each with the capacity to undertake a small part of the project. Continue reading “Bursty science depends on openness”