A (small) Feeding Frenzy

Following on from (but unrelated to) my post last week about feed tools, we have two posts, one from Deepak Singh and one from Neil Saunders, both talking about ‘friend feeds’ or ‘lifestreams’. The idea here is to aggregate all the content you are generating (or that is being generated about you?) into one place. There are a couple of these services about, but the main ones seem to be Friendfeed and Profiliac. See Deepak’s post (or indeed his Friendfeed) for details of the conversations that can come out of this type of thing.

What piqued my interest though was the comment Neil made at the bottom of his post about Workstreams.

Here’s a crazy idea – the workstream:

* Neil parsed SwissProt entry Q38897 using parser script swiss2features.pl
* Bob calculated all intersubunit contacts in PDB entry 2jdq using CCP4 package contact

This is exactly the kind of thing I was thinking about as the raw material for the aggregators that would suggest things that you ought to look at, whether it be a paper, a blog post, a person, or a specific experimental result. This type of system will rely absolutely on the willingness of people to make public what they are reading, doing, even perhaps thinking. Indeed I think this is the raw information that will make another one of Neil’s great suggestions feasible.

Following on from Neil’s post I had a short conversation with Alf in the comments about blogging (or Twittering) machines. Alf pointed out a really quite cool example. This is something that we are close to implementing in the open in the lab at RAL. We hope to have the autoclave, PCR machine, and balances all blogging out what they are seeing. This will generate a data feed that we can use to pull specific data items down into the LaBLog.

Perhaps more interesting is the idea of connecting this to people. At the moment the model is that the instruments are doing the blogging, which is probably a good way to go because it keeps the data stream straightforward and identifiable. Currently the trigger for an instrument to blog is a button press. However, at RAL we use RFID proximity cards for access to the buildings. This means we already have an easy identifier for people, so what we aim to do is use the RFID card to trigger data collection (or data feeding).

If this could be captured and processed there is the potential to capture a lot of the detail of what has happened in the laboratory. Combine this with a couple of Twitter posts giving a little more personal context and it may be possible to reconstruct a pretty complete record of what was done and precisely when. The primary benefit of this would be in troubleshooting, but if we could get a little bit of processing into this, and if there are specific actions with agreed labels, then it may be possible to automatically create a large portion of the lab book record.

This may be a great way of recording the kind of machine-readable description of experiments that Jean-Claude has been posting about. Imagine a simplistic Twitter interface with a limited set of options (I am stirring, I am mixing, I am vortexing, I have run a TLC, I have added some compound). Combine this with a balance, a scanner, and a heating mantle that are blogging out what they are currently seeing, plus a barcode reader (and printer) to identify what is being manipulated and which compound is which.
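To make that concrete, here is a minimal sketch (in Python) of what one of these instrument posts might look like. The ingest URL, the field names, and the idea of attaching an RFID-derived user identifier are all hypothetical rather than an existing LaBLog interface; the point is simply that each action becomes a small, structured, machine-readable item in a feed.

    import json
    import urllib.request
    from datetime import datetime, timezone

    # Hypothetical LaBLog ingest endpoint - not a real Chemtools/LaBLog API.
    LABLOG_FEED_URL = "https://lablog.example.org/api/workstream"

    def post_action(instrument, action, reading, user_rfid=None):
        """Post one structured 'workstream' item, e.g. a balance reading."""
        entry = {
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "instrument": instrument,   # e.g. "balance-01"
            "action": action,           # e.g. "weigh", "autoclave-cycle-complete"
            "reading": reading,         # instrument-specific payload
            "user": user_rfid,          # identifier from the RFID proximity card, if swiped
        }
        request = urllib.request.Request(
            LABLOG_FEED_URL,
            data=json.dumps(entry).encode("utf-8"),
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request) as response:
            return response.status

    # Example: the balance reports a weighing triggered by a card swipe.
    post_action("balance-01", "weigh",
                {"mass_g": 1.204, "sample_barcode": "RAL-00042"},
                user_rfid="rfid:04A2-27F1")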

One of the problems we have with our lab books is that they can never be detailed enough to capture everything that somebody might be interested in one day. However at the same time they are too detailed for easy reading by third parties. I think there is general agreement that on top of the lab book you need an interpretation layer, an extra blog that explains what is going on to the general public. Perhaps by capturing all the detailed bits automatically we can focus on planning and thinking about the experiments rather than worrying about how to capture everything manually. Then anyone can mash up the results, or the discussion, or the average speed of the stirrer bar, any way they like.

Who’s got the bottle?

Lots of helpful comments from people on my question about what to use as a good identifier for chemicals. I thought it might be useful to re-phrase what it was that I wanted, because some of the comments, while important discussion points, don’t really impinge directly on my current issue.

I have in mind a special type of page on our LaBLog system that will easily allow the generation of a post describing a new bottle of material that comes into the lab. From a user perspective you want to enter the minimum of necessary information: probably a name, a company, perhaps a lot number and/or catalogue number to enable reordering. From the system perspective you want to try and grab as many different ways of describing the material as possible, including, where appropriate, SMILES, InChI, CML, or whatever. My question was: how do I provide a simple key that will enable the system to go off and find (if possible) these other identifiers? This isn’t really a database per se but a collection of descriptors on a page (although we would like to pull the data out and into a proper database at a later stage). CAS numbers are great because they are written on most bottles and are a well curated system. However I thought that the only way of converting from CAS to anything else was to go through a CAS service. Therefore I thought PubChem CIDs (or SIDs) might be a good way to do this.

So from my perspective a lot of the technical issues with substances versus chemicals versus structures aren’t so important. All I want is to pull down, on a best-efforts basis, as many other descriptors as possible to expose in the post. For some things (e.g. yeast extract) the issues of substances versus compounds (not to mention supplier) get right out of hand (I am slightly bemused that it has a CAS number at all, and there are multiple SIDs in PubChem). Certainly it ain’t going to have an InChI. But if you search and get nothing back, it doesn’t really matter. Also, we are dealing here with common materials. If, as Dan Zaharevitz points out, we were dealing with compounds from synthetic chemists we would get into serious trouble, but in this case I think we could rely on our collaborating chemists to get InChIs/SMILES/CML correct and use those directly. In the ideal Open Notebook Science world we would simply point to their lab books anyway.

So the fundamental issue for me is this: is there something written on the bottle of material that we can use as a convenient search key to pull down as many other descriptors as we can?

Now I am with Antony Williams on this: if CAS got its act together and made their numbers an open standard then that would be the best solution. It is curated and all-pervasive as an identifier. Both Antony and Rich Apodaca have pointed out that I was wrong to say that CAS numbers aren’t in PubChem (and Rich pointed to two useful posts [1], [2] on how to get into PubChem using CAS numbers). So actually, my problem is probably solved by an application of Rich’s instructions on hacking PubChem (even if it turns out we have to download the entire database). The issue here is whether the CAS numbers will stay there or whether they may in the end get pulled.
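As an aside for anyone trying this today: PubChem’s current programmatic interface (the PUG REST service, which post-dates this post) makes the lookup step almost trivial, because most CAS numbers are registered as PubChem synonyms and can therefore be searched as names. A minimal, best-efforts sketch in Python, using aspirin’s CAS number as the example:

    import json
    import urllib.error
    import urllib.request

    def cas_to_cids(cas_number):
        """Best-efforts lookup of PubChem CIDs for a CAS number (searched as a synonym)."""
        url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/name/"
               f"{cas_number}/cids/JSON")
        try:
            with urllib.request.urlopen(url) as response:
                data = json.load(response)
        except urllib.error.HTTPError:
            return []  # unknown name or CAS number: nothing found
        return data.get("IdentifierList", {}).get("CID", [])

    print(cas_to_cids("50-78-2"))  # aspirin; expect something like [2244]

Anything PubChem doesn’t recognise (yeast extract, say) just comes back as an empty list, which is exactly the best-efforts behaviour wanted here.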

I do think that for my purposes PubChem CIDs and SIDs will do the job in this specific case. However, as has been pointed out, there are issues with reliability and curation, so I will accept that it is probably too early to start suggesting that suppliers label their bottles with PubChem IDs. This may happen anyway in the longer term (Aldrich seem to have them in the catalogue at least; I haven’t been able to check a bottle yet) and I guess we have to wait and see what happens.

Peter Murray-Rust has also updated with a series of posts [1], [2], [3] around the issues of chemical substance identity, CAS, Wikipedia et al. Peter Suber has aggregated many of the related posts together. And Glyn Moody has called us to the barricades.

What to use as the primary key for chemicals?

We are in the process of rolling out the LaBLog system to the new bioscience laboratory within ISIS at the Rutherford Appleton Laboratory. Because this is a new lab we have a real opportunity to embed the system in the way we run the laboratory and the way we practise our science. One of the things we definitely want to do is use it to maintain a catalogue of all our stocks of chemicals. This is important to us because we are a user laboratory and expect people to come in on a regular basis to do their experiments. This in turn means we need to keep track of everything they bring into the lab, and of any safety implications. Thus we want to use our system to log every bottle of material that comes into the lab.

Now, following on from my post about feeds, it is clear that we also want to provide a good range of searchable indexes so people can tell what we are using. So we would ideally want to expose InChI, InChIKey, SMILES, perhaps CML, PubChem IDs, etc. These can all be converted one to the other using web services, so we don’t need to type all of them in manually. All that is required is a nice logging screen where we can drop in one type of index key, the size of the bottle, supplier, lot numbers, and perhaps a link to safety data. The real question is: which index key is easiest to input? For those of you in or near a laboratory I suggest an exercise. Go and pick up the nearest bottle of commodity stuff from a commercial supplier (i.e. not oligos or peptides). What is written on it? What is a nice short identifier that can consistently be found on pretty much any bottle of chemicals? For those unlucky people who don’t have a laboratory at their fingertips I have provided a clue below.

The Chemical Abstracts Service number is the one identifier that can reasonably reliably be found on most commercially supplied substances. Yet, as described by Peter Murray-Rust and Antony Williams recently, you can’t look these up without paying for them. And indeed, by recording them for your own purposes (say in a database of the compounds we have in the laboratory) we may be violating the terms of the license.

So what to do? Well, we can adopt another standard or standards. Jean-Claude Bradley argued in a comment on my recent post that InChIKey is the way to go, but for this specific purpose (logging materials in) it may be too much to type in many cases (certainly SMILES, InChI and CML would be). You can’t expect people to draw in the structure each time a compound comes in, particularly if we get into arguments about which precise salt of cAMP we are using today. What is required is a simple, relatively short number. This is what makes the CAS number so appealing; it is short, easily typed in, and printed on most bottles.

So, along with Peter, I think the answer is to use PubChem CID numbers. PubChem doesn’t use CAS numbers, and CAS actively lobbied the US government to limit the scope of PubChem. PubChem CIDs are relatively short, and there are a range of web services from which other descriptors can be retrieved (see e.g. the PubChem Power User Gateway). The only thing that is missing is the addition of CIDs on bottles. If we can get wide enough agreement on this I think the answer is to start writing to the suppliers. It would not be a great effort on their part, I would have thought, to add CIDs (or, if there is something better, some other index) to the bottles, and it provides a lot of extra value for them. PubChem can provide links through to up-to-date safety data (without the potential legal issues that maintaining a database of MSDS forms with CAS numbers creates), it provides free access to a supplier index through which customers can find them, and it could also save them a small fortune in CAS license fees.
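Going the other way, from a CID already logged in the system back out to the other descriptors, is also straightforward with PubChem’s current REST interface (PUG REST, a REST-style layer over the Power User Gateway mentioned above, which did not exist when this was written). A rough sketch in Python, using property names as PubChem defines them:

    import json
    import urllib.request

    def cid_properties(cid):
        """Fetch a handful of alternative descriptors for a PubChem CID."""
        props = "InChI,InChIKey,CanonicalSMILES,IUPACName"
        url = ("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/"
               f"{cid}/property/{props}/JSON")
        with urllib.request.urlopen(url) as response:
            data = json.load(response)
        return data["PropertyTable"]["Properties"][0]

    print(cid_properties(2244))  # aspirin: returns InChI, InChIKey, SMILES and IUPAC name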

There is another side to this, which is that if there is a wholesale shift (or even the threat of a shift) away from CAS as the only provider of chemical indexing, then perhaps the ACS will wake up and realise that this protectionism is not only bad for chemistry, it is bad for their business. The database of CAS numbers has no real value in its own right. It is only useful as a pointer to other information. If the ACS were to make the use and indexing of CAS numbers free then it would be driving traffic to its own value-added services. The ACS needs to move into the 21st (or perhaps the 20th) century in terms of both its attitudes and its business models. We often criticise the former, but without shifts in the latter there is a real risk of critical damage to an organisation that still has the potential to make a big contribution to the chemical sciences. If the major chemical suppliers were to start printing PubChem CIDs on their bottles it might start to persuade the powers that be within the ACS that things need to change.

So, to finish: do people agree that the CID is a good standard index to aggregate around? If so, we should start writing to the major chemical manufacturers, perhaps through open letters in the general literature (obviously not JACS), to suggest that they include these on their packaging. I’m up for drafting something if people are prepared to sign up to it.

D2O bottle label

Give me the feed tools and I can rule the world!

Two things last week gave me more cause to think a bit harder about the RSS feeds from our LaBLog and how we can use them. First, when I gave my talk at UKOLN I made a throwaway comment about search and aggregation. I was arguing that the real benefits of open practice will come when we can use other people’s filters and aggregation tools to easily access the science that we ought to be seeing. Google searching for a specific thing isn’t enough. We need an aggregated feed of the science we want or need to see, delivered automatically; that is, we need systems to know what to look for even before the humans know it exists. I suggested the following as an initial target:

‘If I can automatically identify all the compounds recently made in Jean-Claude’s group and then see if anyone has used those compounds [or similar compounds] in inhibitor screens for drug targets then we will be on our way towards managing the information’

The idea here would be to take a ‘Molecules’ feed (such as the molecules blog at UsefulChem or molecules at Chemical Blogspace), extract the chemical identifiers (InChI, SMILES, CML or whatever), and then use these to search feeds from those people exposing experimental results from drug screening. You might think some sort of combination of Yahoo! Pipes and Google Search ought to do it.

So I thought I’d give this a go. And I fell at the first hurdle. I could grab the feed from the UsefulChem molecules blog, but what I actually did was set up a test post in the Chemtools Sandpit Blog. Here I put the InChI of one of the compounds from UsefulChem that was recently tested as a falcipain-2 inhibitor. The InChI went in both as clear text and using the microformat approach suggested by Egon Willighagen. Pipes was perfectly capable of pulling the feed down and reducing it to only the posts that contained InChIs, but I couldn’t for the life of me figure out how to extract the InChI itself. Pipes doesn’t seem to see microformats. Another problem is that there is no obvious way of converting a Google Search (or Google Custom Search) to an RSS feed.
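For what it’s worth, outside Pipes this extraction is only a few lines of Python. A rough sketch using the feedparser library, assuming (as in my test post) that the InChI appears somewhere in the entry as plain text; the feed URL is illustrative:

    import re
    import feedparser  # third-party feed parser: pip install feedparser

    # Any molecules-style feed will do; this URL is illustrative.
    FEED_URL = "http://usefulchem-molecules.blogspot.com/feeds/posts/default?alt=rss"

    # Match InChI strings appearing as plain text in a post body.
    INCHI_PATTERN = re.compile(r"InChI=1S?/[^\s<]+")

    feed = feedparser.parse(FEED_URL)
    for entry in feed.entries:
        # The post body may be in 'summary' or in 'content', depending on the feed.
        text = entry.get("summary", "") + "".join(
            part.get("value", "") for part in entry.get("content", []))
        for inchi in INCHI_PATTERN.findall(text):
            print(entry.get("title", "(untitled)"), "->", inchi)

It doesn’t understand microformats either, of course; it just pattern-matches on the ‘InChI=’ prefix. But it is enough to turn a molecules feed into a list of identifiers that can then be fed into a search.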

Now there may well be ways to do this, or perhaps other tools that do it better, but they aren’t immediately obvious to me. Would the availability of such tools help us to take the Open Research agenda forwards? Yes, definitely. I am not sure exactly how much or how fast, but without easy-to-use tools that are well presented and easily available, the case for making the information available is harder to make. What’s the point of having it on the cloud if you can’t customise your aggregation of it? To me this is the killer app: being able to identify, triage, and collate data as it happens with easily usable and automated tools. I want to see the stuff I need to see in my feed reader before I know it exists. It’s not that far away, but we ain’t there yet.

The other thing this brought home to me was the importance of feeds, and in particular of rich feeds. One of the problems with wikis is that they don’t in general provide an aggregated or user-configurable feed of the whole site or of a namespace such as a single lab book. They also don’t readily provide a means of tagging or adding metadata. Neither wikis nor blogs provide immediately accessible tools for configuring multiple RSS feeds, at least not in the world of freely hosted systems. The Chemtools blogs each put out an RSS feed, but these don’t currently include all the metadata. The more I think about this the more crucial I think it is.

To see why, I will use another example. One of the features that people liked about our blog-based framework at the workshop last week was the idea that they got a catalogue of various different items (chemicals, oligonucleotides, compound types) for free once the information was in the system and properly tagged. Now this is true, but you don’t get the full benefits of a database for searching, organisation, presentation, etc. We have been using DabbleDB to handle a database of lab materials, and one of our future goals has been to update that database automatically. What I hadn’t realised before last week was the potential to use user-configured RSS feeds to set up multiple databases within DabbleDB and so provide a more sophisticated laboratory stocks database.

DabbleDB can be set up to read RSS or other XML or JSON feeds to update itself, as was pointed out to me by Lucy Powers at the workshop. To update a database all we need is a properly configured RSS feed. As long as our templates are stable the rest of the process is reasonably straightforward, and we can generate databases of materials of all sorts along with expiry dates, lot numbers, ID numbers, safety data and so on. The key to this is rich feeds that carry as much information as possible, and in particular as much of the information we have chosen to structure as possible. We don’t even need the feeds to be user-configurable within the system itself, as we can use Pipes to easily configure custom feeds.
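To make the ‘rich feed’ idea concrete, here is a minimal sketch of what a single feed item might look like if the structured fields from our templates travelled alongside the usual RSS elements. The namespace and element names are invented for illustration (there is no such LaBLog or DabbleDB schema); the point is that anything we have structured once can be mapped straight onto database columns at the other end.

    import xml.etree.ElementTree as ET

    # Invented namespace for lab-materials metadata; purely illustrative.
    LAB_NS = "http://example.org/lablog/materials#"
    ET.register_namespace("lab", LAB_NS)

    item = ET.Element("item")
    ET.SubElement(item, "title").text = "D2O, 100 g bottle"
    ET.SubElement(item, "link").text = "https://lablog.example.org/materials/d2o-0042"
    ET.SubElement(item, "pubDate").text = "Mon, 10 Mar 2008 09:00:00 +0000"

    # The 'rich' part: structured fields a database such as DabbleDB could map to columns.
    ET.SubElement(item, f"{{{LAB_NS}}}supplier").text = "Sigma-Aldrich"
    ET.SubElement(item, f"{{{LAB_NS}}}lotNumber").text = "LOT-12345"   # made-up value
    ET.SubElement(item, f"{{{LAB_NS}}}pubchemCID").text = "123456"     # the matching CID would go here
    ET.SubElement(item, f"{{{LAB_NS}}}expiryDate").text = "2010-03-10"

    print(ET.tostring(item, encoding="unicode"))

Items like this could be generated directly by the templates, or assembled by a Pipes stage, and would be enough to drive a feed-based import.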

We, or rather a noob like me, can do an awful lot with some of the tools already available and a bit of judicious pointing and clicking. When these systems are just a little bit better at extracting information (and when we get just a little bit better at putting information in, by making it part of the process) we are going to be doing lots of very exciting things. I am trying to keep my diary clear for the next couple of months…

Data flow and sharing

Workshop on Blog Based Notebooks

DUE TO SEVERE COMMENT SPAM ON THIS POST I HAVE CLOSED IT TO COMMENTS

On February 28/29 we held a workshop on our Blog Based notebook system at the Cosener’s House in Abingdon, Oxfordshire. This was a small workshop with 13 people, including biochemists (from Southampton, Cardiff, and RAL), social scientists (from the Oxford Internet Institute and the Computing Laboratory), developers from the MyGrid and MyExperiment family, and members of the Blog development team. The purpose of the workshop was to try and identify the key problems that need to be addressed in the next version of the Blog Based notebook system and also to ask the question ‘Does this system deliver functionality that people want?’. Another aim was to identify specific ways in which we could interact with MyExperiment to deliver enhanced functionality for us as well as new members for MyExperiment.


A quick update

I have got very behind. I’ve only just realised how far behind, but my excuse is that I have been rather busy. Quite how far behind I was was brought home by the fact that I hadn’t yet mentioned that the proposal for an Open Science session at PSB, driven primarily by Shirley Wu, has gone in, and the proposal is now up at Nature Precedings. The posting there has already generated some new contacts.

On Tuesday I gave a talk at UKOLN at the University of Bath. Brian Kelly kindly videoed the first 10 minutes of the presentation when my attempts to record a screencast failed miserably, and has blogged about the talk and about recording talks more generally for public consumption. Jean-Claude does this very effectively, but it is something we should perhaps all be putting a lot more effort into (and can someone tell me what the best software for recording screencasts is?!?). I got a lot of the talk as an audio recording and will attempt to record a screencast when I can find time.
The talk was interesting; this was to a group of library/repository/curation/technical experts rather than the usual attempt to convince a group of sceptical scientists. Many of them are already ‘open’ advocates but are focused on technical issues. Lots of smart questions: how do you really manage secure identities across multiple systems; how do we make data on the cloud stable for the long term; how do you choose between competing standards for describing and collating data; fundamentally, how do you actually make all this work? Interesting discussion all in all, and great to meet the people at UKOLN and finally meet Liz Lyon in person.

The other thing happening this week is that tomorrow and Friday we are running a small workshop introducing potential users to our Blog based notebook. Our aim is to see how other people’s working processes do or don’t fit into our system. This is still focused on biochemistry/molecular biology but it will be very interesting to see what comes out of this. I will try to report as soon as possible.

Finally, I think there is something in the air. This week has seen a rush of emails from people who have seen blog posts, proposals, and other things, writing to offer support and, perhaps more crucially, access to more contacts.

And further on the PLoS front, the biggest story in the UK news on Tuesday morning was about the paper in PLoS Medicine reporting the results of a meta-study of the effectiveness of SSRIs in treating depression. I woke up to this story on BBC radio and by the time I gave my talk at 10:30 I’d had a chance at least to read the paper’s abstract. If I’d been on SSRIs this could be really important to me. Perhaps more to the point, if I were a doctor realising I’d be fielding phone calls from concerned patients all day, I could have read the paper. This story tells us a lot about why Open Access and Open Data are crucial. But more on that in another post sometime…I promise.

PSB Session – Final call for support

If you’ve been following the discussion here and over at the One Big Lab blog you will know that tomorrow is the deadline for submission of proposals for sessions at PSB. Shirley Wu has done a great job of putting together the proposal text, and the more support we can get from members of the community the better. Whether you can send us an email saying you would like to be there, will commit to being there, or (most preferred) can commit to submitting a presentation, it will all help. Don’t get too hung up on the idea that talks have to be ‘research’. This can include the description of tool development (including services), quantitative studies of the impact of open practices, or just a description or review of a particular experience or process.

The proposal text is visible at: http://docs.google.com/Doc?id=dv4t5rx_33fpxx9pw5 and if you want editing access just holler or leave some comments here or at One Big Lab.

Connecting Open Notebook data to a peer-reviewed paper

One thing that we have been thinking about a bit recently is how best to link elements of a peer-reviewed paper back to an Open Notebook. This raises a number of issues, both technical and philosophical, about how and why we might do it. Our first motivation is to provide access to the raw data if people want it. The dream here is that by clicking on a graph you are taken straight through to the processed data, which you can then backtrack through to get to the raw data itself. This is clearly some way off.

Other simple solutions are to provide a hyperlink back to the open notebook, or to an index page that describes the experiment, how it was done, and how the data was processed. Publishers are always going to have an issue with this because they can’t rely on the stability of external material. So the other solution is to package up a version of the notebook and provide it as supplementary information. This could still provide links back to the ‘real’ notebook but provides additional stability and also protects the data against disaster by duplicating it.

The problem with this is that many journals will only accept PDF. While we can process a notebook to produce a package wrapped up as a PDF, this has a lot of limitations, particularly when it comes to data scraping, which after all we want to encourage. An encouraging development was recently described on the BioMedCentral blog, where they describe the ability to upload a ‘mini website’ as supplementary information. This is great, as we can build a static version of our notebook with lots of lovely rich metadata built in. We can point out to the original notebook, and we can point from the original notebook back to the paper. I am supposed to be working on a paper at the moment and have been considering where to send it. I hope we can give BMC Biotechnology or perhaps BMC Research Notes a go to test this out.

Reflections from a parallel universe

On Wednesday and Thursday this week I was lucky to be able to attend a conference on Electronic Laboratory Notebooks run by an organisation called SMI. Lucky, because the registration fee was £1500 and I got a free ticket. Clearly this was not a conference aimed at academics. This was a discussion of the capabilities and implications of Electronic Laboratory Notebooks used in industry, primarily in big pharma.

For me it was very interesting to see these commercial packages. I am often asked how what we do compares to them, and I have always had to answer that I simply don’t know; I’ve never had the chance to look at one because they are way too expensive. Having now seen them I can say that they have very impressive user interfaces with lots of integrated tools and widgets. They are fundamentally built around specific disciplines, which allows them to be reasonably structured in their presentation and organisation. I think we would break them in our academic research setting, but it might take a while. More importantly, we wouldn’t be able to afford the customisation that it looks as though you need to get a product that does just what you want it to. Deployment costs of around £10,000 per person were being bandied around, with total contract costs clearly in the millions of dollars.

Coming out of various recent discussions, I would say that the overall software design of these products is flawed going forward. The vendors are being paid a lot by companies who want things integrated into their own systems, so there is no motivation for them to develop open platforms with data portability and easy integration of web services. All of these systems run as thick clients against a central database. They will have to move into web portals as a first step, before working towards a fully customisable interface with easily collectable widgets to enable end-user-configured integration.

But these were far from the most interesting things at the meeting. We commonly assume that keeping, preserving, and indexing data is a clear good, and indeed many of the attendees were assuming the same thing. Then we got a talk on ‘Compliance and ELNs’ by Simon Coles of Amphora Research Systems. The talk can be found here. In it was an example of just how bizarre the legal process for patent protection can make industrial practice. In preparing for a patent suit you will need to pay your lawyers to go through all the relevant data and paperwork. Indeed, if you lose you will probably pay for the opposition’s lawyers to go through all the relevant paperwork as well. These are not just lawyers, they are expensive lawyers. If you have a whole pile of raw data floating around, this is not just going to give the lawyers a field day finding something to pin you to the wall on, it is going to burn through money like nobody’s business. The simple conclusion: it is far cheaper to re-do the experiment than it is to risk the need for lawyers to go through raw data. Throw the raw data away as soon as you can afford to! Like I said, a parallel universe where you think things are normal until they suddenly go sideways on you.

On a more positive note, there were some interesting talks on big companies deploying ELNs. We can look at this, at some level, as a model of a community adopting open notebooks: at least within the company (in most cases) everyone can see everyone else’s notebook. A number of speakers mentioned that this had caused problems, and a couple said that it had been necessary to develop and promulgate standards of behaviour. This is interesting in the light of the recent controversy over the naming of a new dinosaur (see commentary at Blog around the Clock) and Shirley Wu’s post on One Big Lab. It reinforces the need for generally accepted standards of behaviour and the growing importance of these as data becomes more open.

The rules? The first two came from the talk; the rest are my suggestions. Basically they boil down to ‘Be Polite’.

  1. Always ask before using someone else’s data or results
  2. User beware: if you rely on someone else’s results it’s your problem if it blows up in your face (especially if you didn’t ask them about it)
  3. If someone asks if they can use your data or results, say yes. If you don’t want them to, give them a clear timeline after which they can, or specific reasons why you can’t release the data. Give clear warnings about any caveats or concerns
  4. If someone asks you not to use their results (whether or not they are helpful or reasonable about it) think very carefully about whether you should ignore their request. If, having done this, you still feel you are being reasonable in using them, then think again.
  5. Any data that has not been submitted for peer review after 18 months is fair game
  6. If you incorporate someone else’s data within a paper, discuss your results with them. Then include them as an author.
  7. Always, without fail and under any circumstances, acknowledge any source of information, and do so generously and without conditions.

OPEN Network proposal – referees’ comments are in

So we have received the referees’ comments on the network proposal, and after a bit of a delay I have received permission to make them public. You can find a PDF of the referees’ comments here. I have started to draft a reply, which is published on Google Docs. I have given a number of people access, but if you are feeling left out and would like to contribute just drop me a line.

These are broadly pretty critical comments, and our chance of getting this funded is not looking at all good on the basis of them. For those who have not written grant proposals before, or not dealt with these types of criticisms, there is an object lesson here. Many of the criticisms relate to assumptions the referees have made about how a UK Network proposal should be written and what it should do. It is always a good idea to identify precisely what the expectations are. In this case I simply didn’t have time to do this.

However, there are some good aspects to this. Many of the critical comments made by the referees are contradicted by other referees (too many meetings, not enough meetings). A couple arise from misunderstandings, or perhaps a lack of clarity in the proposal. The key thing is to answer the criticisms on how we expand the network, while at the same time explaining that until we have achieved this the network doesn’t really exist in its ideal form. Also, we are asking for relatively little money, so once all the big networks get their slice there may be relatively few proposals left small enough to pick up the scraps, as it were.

The reply is due back on Monday (UK time) and I will gratefully receive any assistance in getting this response honed to a fine point. I would also point you in the direction of Shirley Wu’s draft of the proposal for a PSB session which is due at the end of next week. We know this collaborative process can work and we also know it has weaknesses and disconnects. If we can use the good part to convince funding agencies that we need to sort out the weaknesses I think that would be a great step forward.