Writing a Wave Robot – Some thoughts on good practice for Research Robots

ChemSpidey lives! Even in the face of Karen James’ heavy irony I am still amazed that someone like me, with very little programming experience, was able to pull together something that actually worked effectively in a live demo. As long as you’re not actively scared of trying to put things together, it is becoming relatively straightforward to build tools that do useful things. Building ChemSpidey relied heavily on existing services and other people’s code, but pulling those together was straightforward; the biggest problems were working around the strange, and in most cases undocumented, behaviour of some of the pieces I used. So what is ChemSpidey?

ChemSpidey is a Wave robot that can be found at chemspidey@appspot.com. The code repository is available at Github and you should feel free to re-use it in any way you see fit, although I wouldn’t really recommend doing so at the moment; it isn’t exactly the highest-quality code. One of the first applications I see for Wave is making it easy to author (semi-)semantic documents that link objects within the document to records on the web. In chemistry it would be helpful to link the names of compounds through to records about those compounds in the relevant databases.

If ChemSpidey is added to a wave it watches for text of the form “chem[ChemicalName{;weight {m}g}]”, where the curly-bracketed parts are optional. When a blip is submitted by hitting the “done” button, ChemSpidey searches the blip for this pattern and, if it finds it, strips out the name and sends it to the ChemSpider SimpleSearch service. ChemSpider returns a list of database ids; the robot currently just pulls the top one off the list and adds the text ChemicalName (csid:####) to the wave, with the id linked back to ChemSpider. If a weight is present it asks the ChemSpider MassSpec API for the nominal molecular weight, calculates the number of moles, and inserts that as well. You can see video of it working here (look along the timeline for the ChemSpidey tag).
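For the curious, here is a minimal sketch of that core text handling. The regular expression mirrors the syntax above, but the ChemSpider URLs, the token parameter, and the XML parsing are my best guesses at the service interface rather than verified details, so treat them as assumptions to check against the ChemSpider documentation:

```python
# A rough sketch of ChemSpidey's parse-and-lookup core (Python 2).
# The ChemSpider endpoints, "token" parameter, and XML tag names below
# are assumptions about the service, not verified API details.
import re
import urllib
from xml.etree import ElementTree

CHEM_PATTERN = re.compile(
    r"chem\[(?P<name>[^;\]]+)"                                   # the compound name
    r"(?:;\s*weight\s+(?P<amount>[\d.]+)\s*(?P<unit>m?g))?\]")   # optional weight

TOKEN = "YOUR-CHEMSPIDER-TOKEN"   # placeholder security token

def simple_search(name):
    """Return the top ChemSpider ID for a compound name, or None."""
    url = ("http://www.chemspider.com/Search.asmx/SimpleSearch?"
           + urllib.urlencode({"query": name, "token": TOKEN}))
    tree = ElementTree.parse(urllib.urlopen(url))
    ids = [e.text for e in tree.getiterator() if e.tag.endswith("int")]
    return ids[0] if ids else None

def nominal_mass(csid):
    """Assumed wrapper around the MassSpec API's extended compound info call."""
    url = ("http://www.chemspider.com/MassSpecAPI.asmx/GetExtendedCompoundInfo?"
           + urllib.urlencode({"CSID": csid, "token": TOKEN}))
    tree = ElementTree.parse(urllib.urlopen(url))
    for e in tree.getiterator():
        if e.tag.endswith("NominalMass"):
            return float(e.text)
    return None

def process_text(text):
    """Replace each chem[...] directive with 'Name (csid:####)', adding moles if a weight was given."""
    def replace(match):
        name = match.group("name").strip()
        csid = simple_search(name)
        result = "%s (csid:%s)" % (name, csid)
        if match.group("amount"):
            grams = float(match.group("amount"))
            if match.group("unit") == "mg":
                grams /= 1000.0
            result += ", %.2f mmol" % (1000.0 * grams / nominal_mass(csid))
        return result
    return CHEM_PATTERN.sub(replace, text)
```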


What have I learned? Some things that are probably obvious to anyone who is a proper developer, but also, I think, the beginnings of some more general lessons about how best to design robots for research. To kick off the discussion, my thoughts are listed below; many of them came out of discussions with Ian Mulvany as we prepared for last week’s demo. First, the obvious ones: use the current version of the API, and watch your string types. Google AppEngine pushes strings around as unicode, which broke my tests because I had developed everything against standard Python byte strings.
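A minimal sketch of the kind of defensive normalisation that would have caught this, assuming Python 2 (as the Wave robot API used at the time):

```python
# Coerce everything to unicode at the boundary so the rest of the code
# only ever deals with a single string type (Python 2).
def to_unicode(value, encoding="utf-8"):
    """Return value as a unicode object, decoding byte strings if necessary."""
    if isinstance(value, unicode):
        return value
    return value.decode(encoding)

# Byte strings (local tests) and unicode (AppEngine) now compare equal.
assert to_unicode("water") == to_unicode(u"water")
```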

  1. Always add a welcome blip when the robot is added to a wave. This makes the user confident that something has happened, lets you notify users when a new version has been released (which might change the way the robot works), and lets you provide some short instructions. It’s good to include a version number here as well (see the sketch after this list).
  2. Have some help available. Ian’s Janey robot responds to the request (janey help) in a blip with an extended help blip giving more context. Blips are easily deleted later if the user wants to get rid of them, and putting help in separate blips keeps it out of the main document.
  3. Where you modify text, leave an annotation. I’ve only just started to play with annotations but it seems immensely useful to at least attempt to leave a trace of what you’ve done, making it easy for your own robot, other robots, or human users to see who did what. I would suggest leaving annotations that identify the robot, include any text that was parsed, and ideally provide some domain information; the sketch below shows one way this might look. We need to discuss how to set up some namespaces for this.
  4. Try to isolate the “science handling” from the “wave handling”. ChemSpidey mixes up a lot of things in one Python script. Looking back at it now, it makes much more sense to isolate the interaction with the wave from the routines that parse text or do mole calculations. That way the different levels of code become easier for others to re-use, and if Wave doesn’t turn out to be the one system to rule them all, the code can still be re-used elsewhere. I am no architecture expert and it would be good to get some clues from people who are about how best to separate things out.
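To make points 1 and 3 concrete, here is a minimal sketch using the waveapi Python client from the 2009 robots API. The handler names and the annotation key (“chemspidey/csid”) are my own choices, and the library was changing quickly at the time, so treat the method signatures as approximate and check them against the current documentation:

```python
# Sketch of a welcome blip (point 1) and a provenance annotation (point 3)
# using the 2009 waveapi client. Method signatures are approximate.
from waveapi import document
from waveapi import events
from waveapi import robot

VERSION = '0.2'

def OnRobotAdded(properties, context):
    """Point 1: greet the wave with a version-stamped welcome blip."""
    wavelet = context.GetRootWavelet()
    wavelet.CreateBlip().GetDocument().SetText(
        "ChemSpidey v%s added. Write chem[Name; weight 10mg] and hit Done. "
        "Feel free to delete this blip." % VERSION)

def OnBlipSubmitted(properties, context):
    """Point 3: annotate any text the robot inserts so everyone can see who did what."""
    blip = context.GetBlipById(properties['blipId'])
    doc = blip.GetDocument()
    start = len(doc.GetText())
    inserted = " water (csid:937)"   # stand-in for real ChemSpider output
    doc.AppendText(inserted)
    doc.SetAnnotation(document.Range(start, start + len(inserted)),
                      'chemspidey/csid', '937')   # namespace convention up for discussion

if __name__ == '__main__':
    chemspidey = robot.Robot('chemspidey', version=VERSION,
                             profile_url='http://chemspidey.appspot.com/')
    chemspidey.RegisterHandler(events.WAVELET_SELF_ADDED, OnRobotAdded)
    chemspidey.RegisterHandler(events.BLIP_SUBMITTED, OnBlipSubmitted)
    chemspidey.Run()
```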

These are just some initial thoughts from a very novice Python programmer. My code satisfies essentially none of these suggestions but I will make a concerted attempt to improve on that. What I really want to do is kick the conversation on from where we are at the moment, which is basically playing around, into how we design an architecture that allows rapid development of useful and powerful functionality.

The growth of linked-up data in chemistry – and good community projects

It’s been an interesting week or so in the online chemistry world. Following on from my musings about data services, and the preparation I was doing for a talk the week before last, I asked Tony Williams whether it was possible to embed spectra from ChemSpider in a generic web page, in the same way that you would embed a YouTube video, Flickr picture, or Slideshare presentation. The idea is that if there are services out on the cloud that make it easier to put rich material in your own online presence, by hosting it somewhere that understands your data type, then we have a chance of pulling all of these disparate online presences together.

Tony went on to release two features. The first enables you to embed a molecule, which Jean-Claude has demonstrated over on the ONS Challenge Wiki. Essentially, by cutting and pasting a little bit of text from ChemSpider into Wikispaces you get a nicely drawn image of the molecule, the machinery is in place to make the displayed page machine-readable (by embedding chemical identifiers within the code), and web-based information about the molecule can be aggregated back at ChemSpider.

The second feature was the one I had asked about: the embedding of spectra. Again this is really useful because it means that as an experimentalist you can host spectra on a service that gets what they are, while still incorporating them in a nice way back into your lab book, online report, or whatever it is you are doing. This has already enabled Andy Lang and Jean-Claude to build a very cool game, initially in Second Life but now also on the web. Using the spectral and chemical information from ChemSpider, the player is presented with a spectrum and three molecules; selecting the correct molecule earns points, while getting it wrong loses them. As Tony has pointed out, this is also a way of crowdsourcing the curation process – if the majority of people disagree with the “correct” assignment then maybe the spectrum needs a second look. Chemistry Captchas anyone?
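The crowdsourced-curation idea is simple enough to sketch. Everything below is invented for illustration – the data structures and thresholds are not how Andy and Jean-Claude actually built the game:

```python
import random
from collections import defaultdict

# Invented toy dataset: spectrum files paired with their assigned structures.
SPECTRA = [("spec_001.jdx", "benzene"), ("spec_002.jdx", "toluene"),
           ("spec_003.jdx", "phenol"), ("spec_004.jdx", "aniline")]

votes = defaultdict(lambda: {"right": 0, "wrong": 0})

def play_round(score):
    """Show a spectrum with three candidate molecules; score the player's guess."""
    spectrum, correct = random.choice(SPECTRA)
    decoys = random.sample([n for _, n in SPECTRA if n != correct], 2)
    choices = sorted(decoys + [correct])
    print "Which molecule gave %s? Options: %s" % (spectrum, " / ".join(choices))
    guess = raw_input("> ").strip()
    outcome = "right" if guess == correct else "wrong"
    votes[spectrum][outcome] += 1
    return score + (1 if outcome == "right" else -1)

def needs_second_look(spectrum, min_votes=10, threshold=0.5):
    """Flag spectra where most players disagree with the 'correct' assignment."""
    counts = votes[spectrum]
    total = counts["right"] + counts["wrong"]
    return total >= min_votes and counts["wrong"] > threshold * total
```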

The other event this week has been the effort by Mitch over at the Chemistry Blog to set up an online resource for named reactions, by crowdsourcing contributions and ultimately turning them into a book. Mitch deserves plaudits for this because he has gone and done something rather than just talked about it, and we need more people like that. Some of us have criticised the details of how he is going about it (also see comments at the original post), but from my perspective this is criticism motivated by the hope that he will succeed, and that by making some changes early on he can get much more out of the contributions he receives.

In particular Egon asked whether it would be better to use Wikipedia as the platform for aggregating the named reactions, a point I agree with. The problem people see with Wikipedia is largely one of image: they are concerned about inaccurate editing, and about the sometimes combative approach of senior editors who are not necessarily expert in the area. Part of the answer is to just get in there and do it – particularly in chemistry there is a range of people working hard to get things cleaned up. Lots of work has gone into the chemical boxes, and named reactions would be an obvious thing to move on to. Nonetheless it may not work for some people, and to a certain extent, as long as the material that is generated can be aggregated back to Wikipedia, I’m not really fussed.

The bigger concern for us “chemoinformatics jocks” (I can’t help but feel that categorising me as a foo-informatics anything is a little off beam, but never mind (-;) was the example pages Mitch put up, which did very little linking of data back to other resources. There was no way, for instance, to know that this page was even about a specific class of chemicals. The schemes were shown as plain images, making it very hard for any information-aggregation service to do anything useful. Essentially the pages didn’t make full use of the power of the web to connect information.

Mitch in turn has taken the criticism in a positive fashion and thrown down the gauntlet, effectively asking, “well, if you want this marked up, where are the tools to make it easy, and the instructions in plain English to show how to do it?”. He also asks, if named reactions aren’t the best place to start, what would make a good collaborative project? Fair questions, and I would hope the ChemSpider services start to point in the right direction. Instead of drawing an image of a molecule and pasting it on a web page, use the service and connect to the molecule itself; this connects the data up, and gives you a nice picture for free. It’s not perfect. The ideal would be a little chemical drawing palette: you draw the molecule you want, it goes to the data service of your choice, finds the correct molecule (meaning the user doesn’t need to know what SMILES, InChIs, or whatever are), and then brings back whatever you want – image, data, vendors, prices. That would be a really powerful demonstration of the power of linked data, and it could probably be pulled together from existing services.
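The “connect to the molecule itself” step can be sketched by reusing the simple_search() wrapper from the ChemSpidey sketch above. The record and image URL patterns here are my assumptions about ChemSpider’s handlers, so verify them before relying on this:

```python
def embed_molecule(name):
    """Resolve a compound name to a ChemSpider record and return the pieces
    a web page would need: the id, a link to the record, and an image URL.
    Relies on simple_search() from the earlier sketch; URL patterns assumed."""
    csid = simple_search(name)
    if csid is None:
        return None
    return {
        "csid": csid,
        "record_url": "http://www.chemspider.com/Chemical-Structure.%s.html" % csid,
        "image_url": "http://www.chemspider.com/ImagesHandler.ashx?id=%s" % csid,
    }

# e.g. embed_molecule("benzene") would give links and an image for whatever
# csid the search returns, ready to drop into a wiki page or lab book.
```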

But what about the second question – what would be a good project? Well, this is where that second ChemSpider feature comes in. What about flooding ChemSpider with high-quality electronic copies of NMR spectra? Not the molecules your supervisor will kill you for releasing, but all those molecules you’ve published, and all those hiding in undergraduate chemistry practicals. Grab the electronic files, let’s find a way of converting them all to JCAMP online, and get ’em up on ChemSpider as Open Data.
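The conversion step is less scary than it sounds once a spectrum is reduced to x/y pairs. Here is a minimal sketch of writing such data out as JCAMP-DX; it assumes you have already extracted the points from the instrument file (the hard part, which varies by vendor) and covers only the uncompressed (XY..XY) form of the format:

```python
# Minimal, assumption-laden sketch: write x/y pairs as an uncompressed
# JCAMP-DX file. Extracting x/y from vendor raw files is not covered here.
def write_jcamp(path, title, x, y, xunits="PPM", yunits="ARBITRARY UNITS"):
    with open(path, "w") as f:
        f.write("##TITLE=%s\n" % title)
        f.write("##JCAMP-DX=4.24\n")
        f.write("##DATA TYPE=NMR SPECTRUM\n")
        f.write("##XUNITS=%s\n" % xunits)
        f.write("##YUNITS=%s\n" % yunits)
        f.write("##FIRSTX=%f\n" % x[0])
        f.write("##LASTX=%f\n" % x[-1])
        f.write("##NPOINTS=%d\n" % len(x))
        f.write("##XYPOINTS=(XY..XY)\n")
        for xi, yi in zip(x, y):
            f.write("%f, %f\n" % (xi, yi))
        f.write("##END=\n")
```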

How do we build the science data commons? A proposal for a SciFoo session

I realised the other day that I haven’t written an excitable blog post about getting an invitation to SciFoo! The reason is that I got overexcited over on FriendFeed instead and haven’t really had time to get my head together to write something here. But in this post I want to propose a session and think through what its focus and shape might be.

I am a passionate advocate of two things that I think are intimately related. The first is the need for, and the benefits that will arise from, building, using, and enabling the effective search and processing of a scientific data commons; I [1,2] and others (including John Wilbanks, Deepak Singh, and Plausible Accuracy) have written on this quite a lot recently. The second is the need for effective, usable, and generic tools to record science as it happens, and to process that record so that others can use it. By providing the tools that enable the record to be created, and integrating them with the systems that will store and process the data commons, we can enable scientists to record their work better, communicate it better, and make it available as a matter of course to other scientists (not necessarily immediately, I should add, but when they are comfortable with it).

More on the Science Exchange – or building and capitalising a data commons

[Image via Wikipedia/Zemanta: banknotes from all around the world donated by visitors to the British Museum, London]

Following on from the discussion a few weeks back, kicked off by Shirley at One Big Lab and continued here, I’ve been thinking about how to actually turn what was a throwaway comment into reality:

What is being generated here is new science, and science isn’t paid for per se. The resources that generate science are supported by governments, charities, and industry but the actual production of science is not supported. The truly radical approach to this would be to turn the system on its head. Don’t fund the universities to do science, fund the journals to buy science; then the system would reward increased efficiency.

There is a problem at the core of this. For someone to pay for access to the results, there has to be a monetary benefit to them. This may come through increased efficiency of their research funding, but that’s a rather vague benefit. For a serious charitable or commercial funder there has to be the potential either to make money or at least to see that the enterprise could become self-sufficient. But surely this means monetizing the data somehow? And that would require restrictive licences, which is not, in the end, what we’re about.

The other story of the week has been the, in the end very useful, kerfuffle caused by ChemSpider moving to a CC-BY-SA licence, and the confusion that has been revealed regarding data, licensing, and the public domain. John Wilbanks, whose comments on the ChemSpider licence sparked the discussion, has written two posts [1, 2] which I found illuminating and which have made things much clearer for me. His point is that data naturally belongs in the public domain, and that the public domain and the freedom of the data itself need to be protected from erosion, both legal and conceptual, that could be caused by our obsession with licences. What does this mean for making an effective data commons, and the Science Exchange that could arise from it, financially viable?