There has been a bit of discussion recently about identifying and promoting ‘wins’ for Open Science and Open Notebook Science. I was particularly struck by a comment made by Hemai Parthasarathy at the ScienceBlogging Meeting that she wasn’t aware of any really good examples that illustrate the power of open approaches. I think we sometimes miss the most powerful examples right under our noses because they are such a familiar part of the landscape that we have forgotten they are there. So let us imagine two alternate histories. I have to admit I am very ignorant of the actual history of these resources, but I am not sure that matters for making my point.
History the first…
In the second half of the twentieth century, scientists developed methods for sequencing proteins and DNA. Not long after, decades of hard work on methods for macromolecular structure determination started to bear fruit and the science of protein crystallography was born. There was a great feeling that, by understanding biological systems in molecular detail, disease was a beatable problem: it was simply a matter of understanding the systems well enough to know how to treat any disease. Scientists, their funders, pharmaceutical companies, and publishers could all see that this was an important area for development, both in terms of the science and its significant commercial potential.
There was huge excitement, and a wide range of proprietary databases containing this information proliferated. Later there came suggestions that the NIH and EMBL should fund public databases with mandated deposition of data, but a broad coalition of scientists, pharmaceutical companies, and publishers objected, saying that this would hamper their ability to exploit their research effort and reduce their ability to turn research into new drugs. Besides, the publishers said, all the important information is in the papers… By the mid-noughties a small group of scientists calling themselves ‘bioinformaticians’ started to appear and began to look at the evolution of genetic sequences using those pieces of information they could legally scrape from the now electronically available published literature. One scientist was threatened with legal action for taking seven short DNA sequences from a published paper…
Imagine a world with no GenBank, no PDB, no SwissProt, and no culture, growing out of these, of publicly funded, freely available databases of biological information like Brenda, KEGG, etc., etc. Would we still be living in the 90s, the 80s, or even the 70s compared to where we have got to?
History the second…
In the second half of the twentieth century, synthetic organic chemistry went through an enormous technical revolution. The availability of modern NMR and mass spectrometry radically changed the whole approach to synthesis. Previously the challenging problem had been figuring out what it was you had made: careful degradation, analysis, and induction were required to understand what a synthetic procedure had generated. NMR and MS made this part of the process much easier, shifting the problem to developing new synthetic methodology. Organic chemistry experienced a flowering as creative scientists flocked to develop new approaches that might, if they were lucky, bear their names.
There was tremendous excitement as people realised that virtually any molecule could be made, if only the methodology could be figured out. Diseases could be expected to fall as synthetic methodology was developed to match the advances in biological understanding. The new biological databases were providing huge quantities of information that could guide the targeting of synthetic approaches. However, it was clear that quality control was critical, and that sharing quality control data would make a huge difference to the rate of advance: so many new compounds were being generated that it was impossible for any one person to check the quality and accuracy of characterisation data. So, in the early 80s, taking inspiration from the biological community, a coalition of scientists, publishers, government funders, and pharmaceutical companies developed public databases of chemical characterisation data, with mandatory deposition policies for any published work. Agreed data formats were a sticking point, but workable solutions were found quickly enough.
The availability of this data kick-started the development of a ‘chemoinformatics’ community in the mid 80s, leading to sophisticated prediction tools that helped synthetic chemists identify and optimise new methodology. By 1990, large natural products were falling to the synthetic chemists with such regularity that new academics moved on to developing radically different methodologies targeted at entirely new classes of molecules. New databases containing information on the activity of compounds as substrates, inhibitors, and activators (again with mandatory deposition policies for published data) provided the underlying validation datasets that meant, by the mid 90s, structure-based drug discovery was a solved problem. By the late 90s, the available chemoinformatic tools made it relatively straightforward to identify test sets of small molecules to selectively target any biological process.
Ok. Possibly a little utopian, but my point is this. Imagine how far behind we would be without GenBank or the PDB, and without the culture of publicly available databases that these embedded in the biological sciences. And now imagine how much further ahead chemical biology, organic synthesis, and drug discovery might have been with an NMRBank, the Inhibitor Data Bank…