A whole series of things have converged in the last couple of days for me. First was Jean-Claude’s description of the work [1, 2] he and Brent Friesen of the Dominican University are doing putting the combi-Ugi project into an undergraduate laboratory setting. The students will make new compounds which will then be sent for testing as antimalarial agents by Phil Rosenthal at UCSF. This is a great story and a testament in particular to Brent’s work to make the laboratory practical more relevant and exciting for his students.
At the same time I got an email from Anna Croft, University of Bangor, Wales, after we had met up the previous day:
[…] I discussed the idea of blog-labbooks with my student/PDRAs on the way back and we were all keen to trial your stuff. I have also taken the liberty of talking to a couple of my colleagues over morning tea, and they were also very interested in the project and the idea of open science – I expect when a couple of my other colleagues are back next week they will also be keen.
We thought it might be particularly valuable as a teaching resource in the first instance – ie making the undergrads use it as [CN – my emphasis] 1. it is more like the technology they are used to and 2. transparency might improve the quality of their record keeping. It may come in very handy for undergrad practicals even, because it will be obvious to anyone who has copied each other. So I’m sure we could provide you with more than enough trouble shooting and commentary on the practice of open science to support your thesis that it is better for all ;)
I think this is emerging as a theme. If we can get these systems into the undergraduate laboratory we can do two things. We can help to engage the students by providing tools that, to them, may seem more relevant to their everyday lives. We can improve the educational experience by using the social networking and commenting tools available to help the students learn from each other. Perhaps most exciting, if, as Jean-Claude and Brent are doing, this is embedded in a real science project, particularly one with clear social significance, the students will be directly involved with the science. We talk a lot about ‘research led teaching’ in the UK but I’ve seen little evidence of people really doing it. Here is a real chance to make that happen, and because it’s Open Notebook Science they will be able to track the results from their own computers, at home, perhaps even while uploading the results to their Facebook page (all right maybe not…).
But let’s think bigger again. What if it were possible to spread this out globally, using a coherent information tracking system? Chemistry undergraduates all over the world could be making thousands of characterised drug-like molecules. Perhaps the screening could be done in undergraduate biochemistry laboratories with well defined screening assays for a wide range of targets: malaria, cancer cell lines, S. aureus, HIV protease. You could expand this into physical chemistry/biophysics with undergraduates looking at docking of the new or proposed compounds against the available targets. A global programme could link all of this together, generating a massive library of compounds screened against a wide range of targets.
And all of this would be open. Deepak asked the question the other day, ‘What if the result of every docking result was published online?‘ and I have previously imagined a ‘Genbank of SAR data‘. A programme like this could actually provide the impetus for such repositories, while simultaneously adding value with characterisation data for new compounds to existing initiatives like ChemSpider. Whether or not we actually found hits, the SAR data would be immensely valuable for future data mining. Whether or not the SAR was useful, the impetus towards building these databases would be immensely valuable.
So – what about the challenges. Converting research chemistry into undergraduate labs is not straightforward. Jean-Claude’s Ugi reactions are a particularly good example for this because you can just mix three compounds in methanol and in many cases the product precipitates. But even here it’s not trivial – one component is usually smelly; the compounds may not dissolve in methanol; the product may not precipitate. Getting the fundamentals of a particular reaction right will take person hours and quite a lot of them. In UK language this could be done with a final year undergraduate project student to get one type of reaction working in a three to six month project.
Screening can be just as hard. The reactions are not necessarily straightforward, the reagents (and positive and negative controls) can be expensive, and the equipment required to do it efficiently may not be available. But again many reasonably straightforward assays are available, and the rest just require time, expertise, and money. This could be tracked down from a range of sources.
So, resources are required which really means money. Which is where the third co-incidental strand comes in. Via Pawel Szczesny on FriendFeed I got a link to the following post about a call from the Bill and Melinda Gates foundation. The problem with this call is that it really needs a drug resistance angle to fly and while that is possible it complicates the whole idea significantly. But between the B&MG foundation, Google, and others all looking for projects that are both socially significant, make them look good, and co-incidentally generate lots of data in the health arena that they can process, this ought to be fundable. In particular this looks to me like a great fit for Google. For us, it’s great education, great publicity, great science, and it demonstrates the benefits of openness.
A small scale project funding a development round for 12 months to try and sort out appropriate reactions and assays, while trying to agree and implement a data framework that will work for us would probably cost around $200-300k total costs. This would be followed by a larger scale implementation phase, heading towards a few million probably, as it gets rolled out. As it embeds in the education system the costs would probably drop over time to eventually be absorbed in the programme costs.
What do people think?
Further Reading: Jean-Claude’s proposal on ‘Crowdsourcing Chemistry‘
While I know nothing about the reality of huge projects and applying for huge piles of money, this made me think about going even bigger. What about setting the aim of developing a data model/framework for the majority of scientific experiments, not only in chemistry? A significant portion of bioinformatics and molecular biology could be easily (at least I hope it would be easy) captured in such a framework. If we can then enrich this data with information from the Registry of Standard Biological Parts (after it captures parts as functional roles, as I suggested recently), we could have quite a good model describing not only the experiment, but also the bigger picture (in other words, if we can see not only a stream of different compounds tested against protein X, but also information propagated from the Registry that this protein has functional role Y and comes from a known pathogen, we could understand what the project is all about).
So maybe instead of answering the B&MGF call we should all together apply to Google/others/B&MGF directly? Does anybody have any experience with them in such an area?
BTW, Szczesny, not Szeczesny. It doesn’t really matter but I often try to correct (very frequent I must admit) misspellings – you know, it’s all about databases – mistakes propagate very quickly :).
Damn. I even checked it twice. As far as data models are concerned, I am still arguing with Frank Gibson about that :) I think the reality of dealing with a wide range of different sites and groups would mean that keeping it simple (at least to start with) would be a good plan. Not that I’m against tackling the problem – definitely all for it – but it’s a tough problem. And one that might well stretch Google’s pockets. But definitely I think the direct approach for this idea might be better. No idea how to go about it. Anyone got Sergey and Larry’s phone number? :)
Actually you could imagine structuring this as its own startup or perhaps a foundation. Gotta say, I really don’t have the contacts list for that kind of game at the moment though.
Proposal sounds good!
(and it means what follows below is not contrariness, just idle wondering)
One thing that’s been slightly bothering me is data reliability. By using undergraduates we are relying on potentially problematic data, i.e., without excellent prac skills, or with hidden problems (such as gone-off reagents, which might go unrecognised), many reactions may be reported as not working that may work in the hands of another worker. But by telling everyone it didn’t work, it might never be tried again – at least not for a long while. I presume failsafes will therefore need to be put in – replication by pairs/triplicates of students to rule out problems. I’m guessing with enough of them …
Does anyone know of a similar repository for (ab initio/DFT) calculation data? There must be millions of electronic structure calculations going on each day, and so much extractable data that could be 1. easily uploaded and 2. cross correlated with other chemical information.
I think part of the answer to that is that you have the characterisation data so you can really check this. For instance Peter M-R’s recent project on checking NMR spectra could be deployed to assess which assignments were problematic, and possibly to identify where purity was an issue.
As for issues of whether a reaction ‘really’ failed I think you just need to see it as data and put it in context. If two students in each of two different locations all had a specific reaction ‘fail’ then it’s reasonably trustworthy. If it works in one place and not another then reagents may be dodgy, and if it just doesn’t work for one student then maybe they need to lift their game. Perhaps one of the real benefits would be being able to check whether others have had the same problems or not.
Think of each reaction as raw data rather than the final story. All the data will be available and you can then draw your own conclusions.
As for databases there is an e-science project being led by Mark Sansom which is supposed to be a repository for protein MD but I don’t know whether there is something similar for ab initio or DFT data. Some of the people at Soton may know so I will ask today if I can find anyone.
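To make the replication reasoning in this comment concrete (two students at each of two sites, then reading the pattern of outcomes), here is a minimal Python sketch. It is an illustration only: the function name `classify_reaction`, the `(site, student, succeeded)` tuple layout and the verdict strings are all hypothetical and not part of any existing notebook system, and the thresholds are just the rough rules of thumb described above.

```python
from collections import defaultdict


def classify_reaction(reports):
    """Summarise replicate reports for ONE reaction.

    `reports` is a list of (site, student, succeeded) tuples.
    This layout and the verdict strings are assumptions for illustration.
    """
    if len(reports) < 2:
        return "insufficient data"

    # Group the pass/fail outcomes by the site where they were run.
    by_site = defaultdict(list)
    for site, _student, succeeded in reports:
        by_site[site].append(succeeded)

    # Reduce each site to a single verdict.
    site_verdicts = set()
    for outcomes in by_site.values():
        if all(outcomes):
            site_verdicts.add("works")
        elif not any(outcomes):
            site_verdicts.add("fails")
        else:
            site_verdicts.add("mixed")

    if site_verdicts == {"works"}:
        return "works"
    if site_verdicts == {"fails"}:
        # Failure everywhere is only convincing if more than one site tried it.
        if len(by_site) >= 2:
            return "likely genuine failure"
        return "fails at one site; replicate elsewhere"
    if "mixed" in site_verdicts:
        # Students at the same site disagree: look at individual technique.
        return "check individual technique"
    # Some sites consistently work and others consistently fail.
    return "check reagents at failing site"


if __name__ == "__main__":
    reports = [
        ("site A", "s1", True), ("site A", "s2", True),
        ("site B", "s3", False), ("site B", "s4", False),
    ]
    print(classify_reaction(reports))
```

The point of returning a label rather than discarding anything is that every individual report stays available as raw data; the label is only a prompt for where to look next.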
Cameron,
Your response to Anna’s question is dead on. We have to move away from the concept of “trusted sources” that can be relied upon to be completely accurate without providing adequate proof. Providing links to all the raw data is absolutely key and lets us use information from undergraduates, postdocs and machines with minimal assumptions.
That’s why I’m happy to see that Brent is asking students to take pictures of the precipitates. We can compare directly against the pictures in my lab. Students are not always clear about the definition of a solution – for example milky suspensions can be considered solutions by some but they are not for our purposes. Pictures clarify that nicely.
Now there is a limit to how much can be reported by students. For example, I can’t easily get proof that the micropipette was used correctly or that another mistake in measurement happened or that the wrong compound was used. Pictures help a lot with this too – for example I had a student who claimed that a compound was not soluble – from the pic it was obvious that there were several hundred milligrams in the vial instead of 40. Here the mistake was a decimal point error.
But other mistakes won’t be caught quickly. For example we had a bad batch of phenanthrene-9-carboxaldehyde. When we got around to taking NMRs of starting materials we had to remove a bunch of results from the data pool and start over. This is why redundancy is still important. If two identical experiments don’t come out the same we know to investigate it.
Unfortunately our NSF proposal on Crowdsourcing chemistry was rejected but we’ll try again. I think Google would be a logical place to seek funding, especially since we use their services whenever possible – Google Docs, Blogger, etc. But they need to be convinced that there is something behind the concept. We already have compounds that were active against the malarial parasite from our collaboration with Rajarshi Guha at Indiana U. and Phil Rosenthal at UCSF. With Brent, if he can show that his teaching lab can produce new science we can use that as a hard fact. Cameron, your support is extremely appreciated also!
I think that the role of funding is to ramp up an existing infrastructure of people already committed to participating. If I got more money I could support more students, get more chemicals, equipment like automated reactors, etc. Testers could get more resources to do more samples, modelers like Rajarshi could also get more assistance, etc. But, like open source software, the open source science core motor relies on the participation (even if limited) of people who have already bought into the concept.
Let’s please continue the discussion and brainstorm some additional ways of funding these activities!
One thing that we do with bioinformatics student research projects is to provide the tools so that data quality can be measured in an objective manner. Those tools allow us to evaluate the quality of student work and feel good about submitting the final results of that work to public databases.
We are collaborating on a bioinformatics project where students clone and sequence novel genes. If you want to see what we’re doing, you can go here, get the log-in info, and try it out.
Convergence indeed! However, I felt a visceral reaction building when ‘data-model framework’ was mentioned:
The process of data-model development nearly breaks otherwise worthwhile projects, and extreme caution will be required to maintain flexibility (or repeated infliction of blunt force trauma – take your pick). Your post about ontology development touches on this (http://blog.openwetware.org/scienceintheopen/2008/04/08/semantics-in-the-real-world-part-ii-probabilistic-reasoning-on-contingent-and-dynamic-vocabularies/) because what is important is to ensure the correct terminology is used in the appropriate context to convey the meaning.
Overall, I can see this could work as a kind of federated LIMS. Models for representing experimental conditions were developed by necessity in various high-throughput communities so process records could be effectively recorded in LIMSs, and they often use an in-house vocabulary. A similar approach is possible with computational experiments, but it becomes difficult when there are a number of ‘different’ methods applied to the same measurement problem (especially when new methods appear on the scene), and the next analysis stage requires some consensus amongst all alternative methods to be reached. The other problem is that there are many LIMS out there, often adapted to specific research processes, so federation has its own nightmares.
Concerning Sarah Porters comments above – personal validation tools are essential at the point of data creation, and I think that internal consistency checks could also be used to detect systematic errors by an individual (who can’t use scales) or group (who had a bad batch of reagent leading to consistent impurities). What the data infrastructure will need to leverage these things is facility for ‘data quality trackback’ to allow curation. Discussions about this have been raised in the protein annotation community (c.f. the BioSapiens work on annotation ontology in DAS), but the problem of easily managing the storage of secondary curations whilst preserving any access issues (be it intellectual property or otherwise) has not yet been elegantly solved.
All I can say is that if I were a practical student again, labs would have been even more fun if the results were a contribution to something bigger!