Re-inventing the wheel (again) – what the open science movement can learn from the history of the PDB

One of the many great pleasures of SciFoo was meeting people who had a different, and in many cases much more comprehensive, view of managing data and making it available. One of the long-term champions of data availability is Professor Helen Berman, the head of the Protein Data Bank (the international repository for biomacromolecular structures), and I had the opportunity to speak with her for some time on the Friday afternoon before SciFoo kicked off in earnest. (In fact this was one of many somewhat embarrassing situations where I would carefully explain my background in my very best ‘speaking to non-experts’ voice only to find they knew far more about it than I did. However, Jim Hardy of Gahaga Biosciences takes the gold medal for this event for turning to the guy called Larry next to him while having dinner at Google Headquarters and asking what line of work he was in.)

I have written before about how the world might look if the PDB and other biological databases had never existed, but as I said then I didn’t know much of the real history. One of the things I hadn’t realised was how long it took after the PDB was founded before deposition became expected for all atomic resolution biomacromolecular structures. The road from a repository of seven structures with a handful of new submissions a year to the standards that mean today that any structure published in a reputable journal must be deposited was a long and rocky one. The requirement to deposit structures on publication only became general in the early 1990s, nearly twenty years after the PDB was founded, and there was a long, drawn-out process in which the case for making the data available was only gradually accepted by the community.

Helen made the point strongly that it had taken 37 years to get the PDB to where it is today: a gold standard, international, and publicly available repository of a specific form of research data, supported by a strong set of community-accepted, and enforced, rules and conventions. We don’t want to take another 37 years to achieve the widespread adoption of high standards in data availability and open practice in research more generally. So it is imperative that we learn the lessons and benefit from the experience of those who built up the existing repositories. We need to understand where things went wrong and attempt to avoid repeating mistakes. We need to understand what worked well and use this to our advantage. We also need to recognise where the technological, social, and political environment that we find ourselves in today means that things have changed, and perhaps to recognise that in many ways, particularly in the way people behave, things haven’t changed at all.

I’ve written this in a hurry and so have not searched as thoroughly as I might, but I was unable to find any obvious ‘history of the PDB’ online. I imagine there must be some out there, but they are not immediately accessible. The Open Science movement could benefit from such documents being made available; indeed we might do well to make them required reading. While at SciFoo Michael Nielsen suggested the idea of a panel of the great and the good: those who would back the principles of data availability, open access publication, and the free transfer of materials. Such a panel would be great for publicity, but as an advisory group it could be even more valuable, giving us the chance to draw on the experience many of these people have in actually doing what we talk about.

Notes from SciFoo

I am too tired to write anything even vaguely coherent. As will have been obvious, there was little opportunity for microblogging; I managed to take no video at all, and not even any pictures. It was non-stop, at a level of intensity that I have very rarely encountered anywhere before. The combination of breadth and sharpness that many of the participants brought was, to be frank, pretty intimidating, but their willingness to engage and discuss, and my realisation that, at least in very specific areas, I can hold my own, made the whole process very exciting. I have many new ideas, have been challenged to my core about what I do and how, and in many ways I am emboldened about what we can achieve in the area of open data and open notebooks. Here are just some thoughts that I will try to collect some posts around in the next few days.

  • We need to stop fretting about what should be counted as ‘academic credit’. In another two years there will be another medium, another means of communication, and by then I will probably be conservative enough to dismiss it. Instead of just thinking that diversifying the sources of credit is a good thing we should ask what we want to achieve. If we believe that we need a more diverse group of people in academia then that is what we should articulate – Courtesy of a discussion with Michael Eisen and Sean Eddy.
  • ‘Open Science’ is a term so vague as to be actively dangerous (we already knew that). We need a clear articulation of principles or a charter: a set of standards that are clear and practical in the current climate. As these will be lowest common denominator standards at the beginning, we need a mechanism that enables or encourages a process of incrementally raising those standards. The electronic Geophysical Year Declaration is a good working model for this – Courtesy of a session led by Peter Fox.
  • The social and personal barriers to sharing data can be codified and made sense of (and this has been done). We can use this understanding to frame structures that will make more data available – Courtesy of a session led by Christine Borgman.
  • The Open Science movement needs to harness the experience of developing the open data repositories that we now take for granted. The PDB took decades of continuous work to bring to its current state and much of it was a hard slog. We don’t want to take that much time this time round – Courtesy of a discussion led by Helen Berman.
  • Data integration is tough, but it is not helped by the fact that bench biologists don’t get ontologies, and that ontologists and their proponents don’t really get what the biologists are asking. I know I have an agenda on this but social tagging can be mapped after the fact onto structured data (as demonstrated to me by Ben Good). If we get the keys right then much else will follow.
  • Don’t schedule a session at the same time as Martin Rees does one of his (aside from anything else you miss what was apparently a fabulous presentation).
  • Prosthetic limbs haven’t changed in 100 years and they suck. Might an open source approach to building a platform be the answer? – Discussion with Jon Kuniholm, founder of the Open Prosthetics Project.
  • The platform for Open Science is very close and some of the key elements are falling into place. In many ways this is no longer a technical problem.
  • The financial system backing academic research is broken when the cost of reproducing or refuting specific claims rises to 10- to 20-fold that of the original work. Open Notebook Science is a route to reducing this cost – Discussion with Jamie Heywood.
  • Chris Anderson isn’t entirely wrong – but he likes being provocative in his articles.
  • Google run a fantastically slick operation, down to the fact that the chocolate-coated oatmeal biscuit ice cream sandwiches are specially ordered in, made with proper sugar instead of high-fructose corn syrup.

Enough. Time to sleep.