Picture this…

27 January 2008

There has been a bit of discussion recently about identifying and promoting ‘wins’ for Open Science and Open Notebook Science. I was particularly struck by a comment made by Hemai Parthasarathy at the ScienceBlogging Meeting that she wasn’t aware of any really good examples that illustrate the power of open approaches. I think sometimes we miss the most powerful examples right under our noses because they are such a familiar part of the landscape that we have forgotten they are there. So let us imagine two alternate histories; I have to admit I am very ignorant of the actual history of these resources but I am not sure that matters in making my point.

History the first…

In the second half of the twentieth century scientists developed methods for sequencing proteins and DNA. Not long after this the decades of hard work on developing methods for macromolecular structure determination started to bear fruit and the science of protein crystallography was born. There was a great feeling that, with an understanding of the molecular detail of biological systems, disease was a beatable problem: that knowing how to treat any disease was simply a matter of understanding the systems. Scientists, their funders, pharmaceutical companies, and publishers could see this was an important area for development, both in terms of the science and also with significant commercial potential.

There was huge excitement and a wide range of proprietary databases containing this information proliferated. Later there came suggestions that the NIH and EMBL should fund public databases with mandated deposition of data, but a broad coalition of scientists, pharmaceutical companies, and publishers objected, saying that this would hamper their ability to exploit their research effort and would reduce their ability to turn research into new drugs. Besides, the publishers said, all the important information is in the papers… By the mid-noughties a small group of scientists calling themselves ‘bioinformaticians’ started to appear and began to look at the evolution of genetic sequences using those pieces of information they could legally scrape from the, now electronically available, published literature. One scientist was threatened with legal action for taking seven short DNA sequences from a published paper…

Imagine a world with no GenBank, no PDB, no SwissProt, and no culture, growing out of these, of publicly funded, freely available databases of biological information like BRENDA, KEGG, etc. Would we still be living in the 90s, the 80s, or even the 70s compared to where we have got to?

History the second…

In the second half of the twentieth century synthetic organic chemistry went through an enormous technical revolution. The availability of modern NMR and mass spectrometry radically changed the whole approach to synthesis. Previously the challenging problem had been figuring out what it was you had made. Careful degradation, analysis, and induction were required to understand what a synthetic procedure had generated. NMR and MS made this part of the process much easier, shifting the problem to developing new synthetic methodology. Organic chemistry experienced a flowering as creative scientists flocked to develop new approaches that might bear their names if they were lucky.

There was tremendous excitement as people realised that virtually any molecule could be made, if only the methodology could be figured out. Diseases could be expected to fall as the synthetic methodology was developed to match the advances in the biological understanding. The new biological databases were providing huge quantities of information that could aid in the targeting of synthetic approaches. However it was clear that quality control was critical and sharing of quality control data was going to make a huge difference to the rate of advance. So many new compounds were being generated that it was impossible for anyone to check on the quality and accuracy of characterisation data. So, in the early 80s, taking inspiration from the biological community, a coalition of scientists, publishers, government funders, and pharmaceutical companies developed public databases of chemical characterisation data with mandatory deposition policies for any published work. Agreeing on data formats was a problem, but relatively simple solutions were found fast enough.

The availability of this data kick-started the development of a ‘chemoinformatics’ community in the mid 80s, leading to the development of sophisticated prediction tools that aided the synthetic chemists in identifying and optimising new methodology. By 1990, large natural products were falling to the synthetic chemists with such regularity that new academics moved into developing radically different methodologies targeted at entirely new classes of molecules. New databases containing information on the activity of compounds as substrates, inhibitors, and activators (with mandatory deposition policies for published data) provided the underlying datasets for validation, which meant that by the mid 90s structure-based drug discovery was a solved problem. By the late 90s the chemoinformatic tools available made it relatively straightforward to identify test sets of small molecules that selectively target any biological process.

Ok. Possibly a little utopian, but my point is this. Imagine how far behind we would be without GenBank, PDB, and without the culture of publicly available databases that this embedded in the biological sciences. And now imagine how much further ahead chemical biology, organic synthesis, and drug discovery might have been with NMRBank, the Inhibitor Data Bank…


  • It is really striking how much the culture of openness is field-dependent. Actually I suspect the openness is selective within fields also. Although DNA sequences are public, reading the experimental sections of molecular biology papers leaves me with the impression that the protocols can be as obscure as those in organic chemistry.

  • It was something that struck me about Shirley’s recent post. The cell biologists (and cancer biologists) depend absolutely on the open data provided by GenBank, PubMed, etc., but seem to be resistant to making their own stuff open.

  • You are both spot on. There is a fundamental disconnect in my field (cell/molecular biology): absolute dependence on Open Data, and an absolute refusal to admit same (or, perhaps better, inability to recognize it).

    In his recent talk to SPARC, Andre made the point that physicists and biologists of his acquaintance come to opposite conclusions regarding preprints. The biologists wail “but I’ll be scooped” and the physicists want to know why biology is so slow that one can afford, career-wise, to wait for papers and not use preprints. They are both talking about competition and credit, but the physicists are used to having arXiv. I have great hopes for Nature Precedings in this arena.

  • Further thought: why IS it that the extraordinary achievements in genomics don’t seem to count as a success for Open Science? I used that example in conversation at the conference (I think after someone else brought it up in Hemai’s talk), and the response was universally “oh yeah, why didn’t I think of that?”.

  • I think precisely because it is part of the wallpaper. I have had people ask ‘what was all that genome stuff for anyway?’. The answer I give is that it doesn’t radically change the world, it just makes everything that I do a bit (and in some cases, really quite a bit) easier. Multiply that by a few million scientists and you have lots of small benefits, meaning that we are probably 10-15 years further ahead than we would otherwise be.

    My tongue in cheek comments about chemoinformatics and bioinformatics were very deliberate. Chemoinformatics really seems to be at about the stage of the flint axe head compared to bioinformatics. But I think, and your comment about Andre’s talk here is absolutely apposite, that it is really about culture. Biologists _assume_ that data should be made public because there is a history of it. It is embedded in the culture.

  • The kind of data being produced, and the availability of high-quality and inexpensive tools for manipulating and sharing it, are also a factor. Chemistry in particular has a big problem here:

    http://depth-first.com/articles/2007/12/20/if-you-want-to-change-the-world-build-the-tool-first-part-2

  • @Rich. Agreed, but I guess part of my point was that if the data had been out there, or if people thought it was valuable, then solutions and formats would have been found. NMR machines would all pump out data files in a single format because they would have to. But the tools are critical. One thing that is interesting is how bad the file formats in bioinformatics are. Pretty much everything is in flat text files and we are starting to pay the price for that in some areas.