A use case scenario for Mark…a description of the first experiment on the ISIS LaBLog

Two rather exciting things are happening at the moment. Firstly, we have finally got the LaBLog system up and running at RAL (http://biolab.isis.rl.ac.uk). Not a lot is happening there yet, but we are gradually working up to full Open Notebook status, starting by introducing people to the system bit by bit. My first experiment went up there late last week; it isn’t finished yet, but I’d better get some of the data analysis done as rpg, if no-one else, is interested in the results.

The other area of development is that back down in Southampton, Blog MkIII is being specced out and design is going forward. This is being worked on now by both Mark Borkum and Andrew Milsted. Last time I was down in Southampton Mark asked me for some use cases – so I thought I might use the experiment I’ve just recorded to try and explain both the good and bad points of the current system, and also my continuing belief that anything but a very simple data model is likely to be fatally flawed when recording an experiment. This will also hopefully mark the beginning of more actual science content on this blog as I start to describe some of what we are doing and why. As we get more of the record of what we are doing onto the web we will be trying to generate a useful resource for people looking to use our kind of facilities.

So, very briefly, the point of the experiment we started last week is to look at the use of GFP as a concentration and scattering standard in Small Angle Scattering. Small angle X-ray and neutron scattering provide an effective way of determining low resolution (say 5–10 Å) structures of proteins in solution. However they suffer from serious potential artefacts that must be rigorously excluded before the data analysis can be trusted. One of the most crucial of these is aggregation, whether random conversion of protein into visible crud or specific protein-protein interactions. Either of these, along with poor background subtraction or any one of a number of other problems, can very easily render the data, and the analysis that depends on it, meaningless.

So what to do? Well one approach is to use a very well characterised standard for which concentration, size, and shape are well established. There are plenty of proteins that are well behaved, pretty cheap, and for which the structure is known. However, as any biophysicist will tell you, measuring protein concentration accurately and precisely is tough; colorimetric assays are next to useless, and measuring the UV absorbance of aromatic residues is pretty insensitive, prone to interference from other biological molecules (particularly DNA), and a lot harder to do right than most people think.

Our approach is to look at whether GFP is a good potential standard (specifically an eGFP engineered to prevent the tetramerisation that is common with the natural proteins). It has a strong absorption at 490 nm, well clear of most other biological molecules; it is dead easy to produce in large quantities (in our hands at least; I know other people have had trouble with this, but we routinely pump out hundreds of milligrams and currently have a little over one gramme in the freezer); it is stable in solution at high concentrations; and it freeze-dries nicely. Sounds great! In principle we can do our scattering, then take the same sample cells, put them directly in a spectrophotometer, and measure the concentration. Last week was about doing some initial tests on a lab SAXS instrument to see whether the concept held up.

So – to our use case.

Maria, a student from Southampton, met me in Bath holding samples of GFP made up to 1, 2, 5, and 10 mg/mL in buffer. I quizzed Maria as to exactly how the samples had been made up and then recorded that in the LaBLog (post here). I then created the four posts representing each of the samples (1, 2, 3, 4). I also created a template for doing SAXS, and then, using that template I started filling in the first couple of planned samples (but I didn’t actually submit the post until sometime later).

At this point, as the buffer background was running, I realised that the 10 mg/mL sample actually had visible aggregate in it. As the 5 mg/mL sample didn’t have any aggregate we changed the planned order of SAXS samples, starting with the 5 mg/mL sample. At the same time, we centrifuged the 10 mg/mL sample, which appeared to work quite nicely, generating a new, cleared 10 mg/mL sample, and prepared fresh 5 mg/mL and fresh 2 mg/mL samples.

Due to a lack of confidence in how we had got the image plate into its reader, we actually ended up running the original 5 mg/mL sample three times. The second time we really did muck up the transfer, but comparison of the first and third runs made us confident that the first one was ok. At this point we were late for lunch and decided we would put the lowest concentration (1 mg/mL) sample on for an hour and grab something to eat. Note that by this time we had changed the expected order of samples three or four times, but none of this is recorded because I didn’t commit the record of data collection until the end of the day.

By this stage the running of samples was humming along quite nicely, so it was time to deal with the data. The raw data comes off the instrument in the form of an image. I haven’t actually got these off the original computer as yet because they are rather large. However, they are then immediately processed into relatively small two-column data files. It seems clear that each data file requires its own identity, so those were all created (using another template). Currently several of these do not even have the two-column text data: the big TIFF files broke the system on upload, and I got fed up with uploading the reduced data by hand into each file.

As a result of running out of time, and the patience to upload multiple files, the description of the data reduction is a bit terse. Although there are links to all the data, most of you will get a 404 if you try to follow them, so I need to bring all of that back down and put it into the LaBLog proper where it is accessible. If you look closely here, you will also see I made a mistake with some of the data analysis that needs fixing, and I’m not sure I can be bothered systematically uploading all the incorrect files. If the system were acting naturally as a file repository, and I were acting directly on those files, then making everything available automatically would be part of the main workflow. The problem here was that the instrument software forced me to do the analysis on a specific computer (which wasn’t networked), and that our LaBLog system has no means of multiple file upload.

So to summarise the use case.

  1. Maria created four samples
  2. Original plan created to run the four samples plus backgrounds
  3. Realised the 10 mg/mL sample was aggregating and centrifuged it to clear it (a new, unexpected procedure)
  4. Ran one of the pre-made samples three times: we weren’t confident about the first run, the second was a failure, and the third confirmed the first was ok
  5. Prepared two new samples (5 mg/mL and 2 mg/mL) from the cleared 10 mg/mL sample
  6. Re-worked the plan for running samples based on the time available
  7. Ran the 1 mg/mL sample for a different amount of time than the previous samples
  8. Ran the remaining samples for various amounts of time
  9. Data was collected from each sample after it was run and converted to a two-column text format
  10. Two-column data was rebinned and background subtracted (this is where I went wrong with some of them, forgetting that in some cases I had two lots of electronic background – see the sketch after this list)
  11. Subtracted data was rebinned again and then desmeared (the instrument has a slit geometry rather than a pinhole) to generate a new two-column data file.
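
The rebinning and background subtraction steps above are conceptually simple, which makes it all the more frustrating that they live outside the notebook. As a rough illustration of the kind of thing involved (this is emphatically not the instrument’s proprietary software, and the file names and q range are invented for the example), the operations look something like this:

```python
import numpy as np

def read_two_column(path):
    """Load a two-column (q, intensity) text file."""
    q, intensity = np.loadtxt(path, unpack=True)
    return q, intensity

def rebin(q, intensity, edges):
    """Average intensities into the q bins defined by edges."""
    idx = np.clip(np.digitize(q, edges) - 1, 0, len(edges) - 2)
    counts = np.bincount(idx, minlength=len(edges) - 1)
    q_sum = np.bincount(idx, weights=q, minlength=len(edges) - 1)
    i_sum = np.bincount(idx, weights=intensity, minlength=len(edges) - 1)
    keep = counts > 0
    return q_sum[keep] / counts[keep], i_sum[keep] / counts[keep]

def subtract_background(i_sample, i_background, scale=1.0):
    """Subtract a (scaled) background curve binned onto the same q grid."""
    return i_sample - scale * i_background

# Bin sample and background onto a common grid, then subtract.
# The mistake described above was forgetting that some runs needed
# two electronic backgrounds removed rather than one.
edges = np.linspace(0.01, 0.5, 201)   # q in 1/Angstrom; values invented for the example
q_s, i_s = rebin(*read_two_column("gfp_5mgml.txt"), edges)
q_b, i_b = rebin(*read_two_column("buffer_background.txt"), edges)
i_subtracted = subtract_background(i_s, i_b)
np.savetxt("gfp_5mgml_subtracted.txt", np.column_stack([q_s, i_subtracted]))
```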

So, four original samples and three unexpected ones were created. One set of data collection led to nine raw data files, which were then recombined in a range of different ways depending on collection times. Ultimately this generated four finalised reduced datasets, plus a number of files along the way. Two people were involved, and all of this was done under reasonable time pressure. If you look at the commit times on the posts you will realise that a lot of these were written (or at least submitted) rather late in the day, particularly the data analysis. This is because the data analysis was done offline, out of the notebook, in proprietary software. Not a lot can be done about that. The other things that were late were the posts associated with the ‘raw’ datafiles. In both cases a major help would be a ‘directory watcher’ that automatically uploads files and queues them up somewhere so they are available to link to.
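
To make the ‘directory watcher’ idea concrete, it needn’t be anything sophisticated. Here is a minimal sketch, assuming a hypothetical upload endpoint on the LaBLog (no such API currently exists) and using simple polling rather than proper filesystem events:

```python
import os
import time
import requests

WATCH_DIR = "/data/saxs/reduced"                       # assumed location of reduced data files
UPLOAD_URL = "http://biolab.isis.rl.ac.uk/api/upload"  # hypothetical endpoint, for illustration only

def watch(poll_interval=30):
    """Poll WATCH_DIR and push anything new up to the notebook server."""
    seen = set(os.listdir(WATCH_DIR))
    while True:
        current = set(os.listdir(WATCH_DIR))
        for name in sorted(current - seen):
            path = os.path.join(WATCH_DIR, name)
            with open(path, "rb") as fh:
                # Queue the file on the server; a later step would link the
                # returned URL into the relevant LaBLog post.
                requests.post(UPLOAD_URL, files={"file": (name, fh)})
            print("uploaded", name)
        seen = current
        time.sleep(poll_interval)

if __name__ == "__main__":
    watch()
```

Even something this crude would have removed most of the by-hand uploading described above, and it is the sort of small, generic module that would be worth building once and reusing.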

This was not an overly complicated or unusual experiment, but it illustrates the pretty common changes of direction mid-stream and reassessments of priorities as we went. What it does demonstrate is the essential messiness of the process. There is no single workflow that can be applied across the whole experiment, either in the practical or the data analysis parts. There is no straightforward parallel process applied to a single set of samples, but multiple, related samples that require slightly different tacks to be taken with the data analysis. What there are, are objects that have relationships. The critical thing in any laboratory recording system is making the recording of both the objects, and the relationships between them, as simple and as natural as possible. Anything else and the record simply won’t get made.

A personal view of Open Science – Part II – Tools

The second installment of the paper (first part here) where I discuss building tools for Open (or indeed any) Science.

Tools for open science – building around the needs of scientists

It is the rapid expansion and development of tools that are loosely categorised under the banner of ‘Web2.0’ or the ‘Read-write web’ that makes the sharing of research material possible. Many of the generic tools, particularly those that provide general document authoring capabilities, have been adopted and used by a wide range of researchers. Online office tools can enable collaborative development of papers and proposals without the need for emailing documents to multiple recipients and the resultant headaches over which version is which. Storing spreadsheets, databases, or data online means that collaborators have easy access to the most recent versions and can see how these are changing. More generally, the use of RSS feed readers and bookmarking sites to share papers of interest and, to some extent, to distribute the task of triaging the literature is catching on in some communities. Microblogging platforms such as Twitter, and aggregation and conversational tools such as Friendfeed, have recently been used very effectively to provide coverage of conferences in progress, including collaborative note-taking. In combination with streamed or recorded video, as well as screencasts and sharing of presentations online, the idea of a distributed conference, while not an everyday reality, is becoming feasible.

However it is often the case that, while useful, generic web based services do not provide the desired functionality or do not fit well into the existing workflows of researchers. Here there is the opportunity, and sometimes the necessity, to build specialised or adapted tools. Collaborative preparation of papers is a good example of this. Conventional web bookmarking services, such as del.icio.us, provide a great way of sharing the literature or resources that a paper builds on with other authors, but they do not automatically capture and recognise the metadata associated with published papers (journal, date, author, volume, page numbers). Specialised services such as CiteULike and Connotea have been developed to enable one-click bookmarking of published literature, and these have been used effectively, for example by using a specific tag for references associated with a specific paper in progress. The problem with these services as they exist at the moment is that they don’t provide the crucial element in the workflow that scientists want to aggregate the references for: the formatting of the references in the finalised paper. Indeed the lack of formatting functionality in GoogleDocs, the most widely used collaborative writing tool, means that in practice the finalised document is usually cut and pasted into Word and the references formatted using proprietary software such as EndNote. The available tools do not provide the required functionality.
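
The missing step is not technically hard; what is missing is the glue between the bookmarking service and the document. As a hedged sketch of what that glue might look like (the tag-to-DOI step and the use of DOI content negotiation are my assumptions here, not features of the services discussed), formatting a list of bookmarked DOIs could be as simple as:

```python
import requests

def format_reference(doi, style="apa"):
    """Ask the DOI resolver for a formatted bibliography entry via content negotiation."""
    resp = requests.get(
        "https://doi.org/" + doi,
        headers={"Accept": "text/x-bibliography; style=" + style},
    )
    resp.raise_for_status()
    return resp.text.strip()

# In practice these DOIs would come from the shared tag on the bookmarking service.
paper_dois = ["10.1371/journal.pbio.0060048"]   # Eisen 2008, cited later in this essay
for doi in paper_dois:
    print(format_reference(doi))
```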

A number of groups and organisations have investigated the use of blogs and wikis as collaborative and shareable laboratory notebooks. However, few of these systems offer good functionality ‘out of the box’. While there are many electronic laboratory notebook systems sold by commercial interests, most are actually designed around securing data rather than sharing it, so they are not of interest here. While the group of Jean-Claude Bradley has used the freely hosted Wikispaces as a laboratory notebook without further modification, much of the data and analysis is hosted on other services, including YouTube, Flickr, and GoogleDocs. The OpenWetWare group has made extensive modifications to the MediaWiki system to provide laboratory notebook functionality, whereas Garrett Lisi has adapted the TiddlyWiki framework as a way of presenting his notebook. The Chemtools collaboration at the University of Southampton has developed a specialised blog platform. Commercial offerings in the area of web based lab notebooks are also starting to appear. All of these different systems have developed because of the specialised needs of recording the laboratory work of the scientists they were designed for. The different systems make different assumptions about where they fit in the workflow of the research scientist, and what that workflow looks like. They are all, however, built around the idea that they need to satisfy the needs of the user.

This creates a tension in tool building. General tools, that can be used across a range of disciplines, are extremely challenging to design, because workflows, and the perception of how they work, differ between disciplines. Specialist tools can be built for specific fields but often struggle to translate into new areas. Because the market in any one field is small, the natural desire of designers is to make tools as general as possible. However, in the process of trying to build for a sufficiently general workflow, applicability to specific workflows is often lost. There is a strong argument based on this for building interoperable modules, rather than complete systems, that will allow domain specialists to stitch together specific solutions for specific fields or even specific experiments. Interoperability of systems, and the standards that enable it, is a criterion that is sometimes lost in the development process, but it is absolutely essential to making tools and processes shareable. Workflow management tools, such as Taverna, Kepler, and VisTrails, have an important role to play here.

While not yet at a stage where they are widely configurable by end users, the vision behind these tools has the potential both to make data analysis much more straightforward for experimental scientists and to solve many of the problems involved in sharing process, as opposed to data. The idea of visually wiring up online or local analysis tools to enable data processing pipelines is compelling. The reason most experimental scientists use spreadsheets for data analysis is that they do not wish to learn programming languages. Providing visual programming tools, along with services with clearly defined inputs and outputs, will make it possible for a much wider range of scientists to use more sophisticated and powerful analysis tools. What is more, the ability to share, version, and attribute workflows will go some significant distance towards solving the problem of sharing process. Services like MyExperiment, which provide an environment for sharing and versioning Taverna workflows, offer a natural way of sharing the details of exactly how a specific analysis is carried out. Along with an electronic notebook to record each specific use of a given workflow or analysis procedure (which can be achieved automatically through an API), the full details of the raw data, analysis procedure, and any specific parameters used can be recorded. This combination offers a potential route out of the serious problem of sharing research processes, if the appropriate support infrastructure can be built up.
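
The essence of what these workflow tools offer can be sketched in a few lines: steps with declared parameters, chained together so that a record of exactly what was run falls out for free. This is a toy illustration of the idea only, not the actual model used by Taverna or Kepler, and the step names and parameters are invented:

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, Callable, Dict]   # (name, function, parameters)

def run_pipeline(data, steps: List[Step]):
    """Run each step in order, keeping a record of what was done and with what parameters."""
    record = []
    for name, func, params in steps:
        data = func(data, **params)
        record.append({"step": name, "parameters": params})
    return data, record

def normalise(values, factor):
    return [v / factor for v in values]

def threshold(values, cutoff):
    return [v for v in values if v >= cutoff]

result, provenance = run_pipeline(
    [0.2, 1.5, 3.0, 7.2],
    [("normalise", normalise, {"factor": 7.2}),
     ("threshold", threshold, {"cutoff": 0.1})],
)
print(result)       # the processed data
print(provenance)   # exactly the record an electronic notebook could capture automatically
```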

Also critical to successful sharing is a shared language or vocabulary. The development of ontologies, controlled vocabularies, and design standards is important in sharing knowledge and crucial to achieving the ultimate goal of making this knowledge machine readable. While there are divisions in the technical development and user communities over the development and use of controlled vocabularies, there is little disagreement over the fact that good vocabularies combined with good tools are useful. The disagreements tend to lie in how they are best developed, when they should be applied, and whether they are superior or complementary to other approaches such as text mining and social tagging. An integrated and mixed approach to the use of controlled vocabularies and standards is the most likely to be successful. In particular it is important to match the degree of structure in the description to the natural degree of structure in the object or objects being described. Highly structured and consistent data types, such as crystal structures and DNA sequences, can benefit greatly from highly structured descriptions, which are relatively straightforward to create and in many cases are the standard outputs of an analysis process. For large scale experimental efforts the scale of the data and sample management problem makes an investment in detailed and structured descriptions worthwhile. In a small laboratory doing unique work, however, there may be a strong case for using local descriptions and vocabularies that are less rigorous but easier to apply and able to grow to fit the changing situation on the ground, ideally designed in such a way that mapping onto an external vocabulary is feasible if it is required or useful in the future.
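
To make the distinction concrete, a ‘local description’ of the kind a small laboratory might realistically maintain could be as lightweight as the following (field names and values are invented for the example). The point is that it costs almost nothing to record, yet the keys are consistent enough that a mapping onto a formal vocabulary could be bolted on later if it ever became worthwhile:

```python
# A deliberately simple, locally defined sample description.
sample_description = {
    "local_id": "GFP-batch-07",
    "material": "eGFP, monomeric variant",
    "concentration": {"value": 5.0, "unit": "mg/mL"},
    "buffer": "PBS, pH 7.4",
    "prepared_by": "Maria",
}

# If a formal ontology were adopted later, a separate mapping table could
# translate these local keys onto external terms, without having forced that
# structure on the person recording the sample in the first place.
```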

Making all of this work requires that researchers adopt these tools and that a community develops that is big enough to provide the added value these tools might deliver. For a broad enough community to adopt these approaches the tools must fit well into their existing workflow and help to deliver the things that researchers are already motivated to produce. For most researchers, published papers are the measure of their career success and the basis of their reward structures. Therefore tools that make it easier to write papers, or that help researchers to write better papers, are likely to get traction. As the expectations of the quality and completeness of supporting data for published papers increase, tools that make it easier for the researcher to collate and curate the record of their research will become important. It is the process of linking the record of what happened in the laboratory, or study, to the first-pass interpretation and analysis of data, through further rounds of analysis, until a completed version is submitted for review, that is currently poorly supported by available tools, and it is this need that will drive the development of improved tools. These tools will enable the disparate elements of the record of research, currently scattered between paper notebooks, various data files on multiple hard drives, and unconnected electronic documents, to be chained together. Once this record is primarily electronic, and probably stored online in a web based system, the choice to make the record public at any stage, from the moment the record is made to the point of publication, will be available. The reason to link this to publication is to tie it into an existing workflow in the first instance. Once the idea is embedded, the steps involved in making the record even more open are easily taken.

Part III covers social issues around Open Science.

The Southampton Open Science Workshop – a brief report

On Monday 1 September we had a one day workshop in Southampton discussing the issues that surround ‘Open Science’. This was very free form and informal and I had the explicit aim of getting a range of people with different perspectives into the room to discuss a wide range of issues, including tool development, the social and career structure issues, as well as ideas about standards and finally, what concrete actions could actually be taken. You can find live blogging and other commentary in the associated Friendfeed room and information on who attended as well as links to many of the presentations on the conference wiki.

Broadly speaking the day was divided into three chunks, the first was focussed on tools and services and included presentations on MyExperiment, Mendeley, Chemtools, and Inkspot Science. Branwen Hide of Research Information Network has written more on this part. Given that the room contained more than the usual suspects the conversation focussed on usability and interfaces rather than technical aspects although there was a fair bit of that as well.

The second portion of the day revolved more around social challenges and issues. Richard Grant presented his experience of blogging on an official university sanctioned site and the value of that for both outreach and education. One point he made was that the ‘lack of adoption problem’ seen in science just doesn’t seem to exist in the humanities. Perhaps this is because scientists don’t generally see ‘writing’ as a valuable thing in its own right. Certainly there is a preponderance of scientists who happen also to see themselves as writers on Nature Network.

Jennifer Rohn followed on from Richard, and objected to my characterising her presentation as “the skeptic’s view”. A more accurate characterisation would have been “I’d love to be open but at the moment I can’t: this is what has to change to make it work”. She presented a great summary of the problem, particularly from the biological scientist’s point of view, as well as potential solutions. Essentially the problem is that of the ‘Minimum Publishable Unit’ or research quantum, as well as what ‘counts’ as publication. Her main point was that for people to be prepared to publish material that falls short of a full paper they need to get some proportional credit for it. This folds closely into the discussion of what can be cited and what should be cited in particular contexts. I have used the phrase ‘data-sized peg into a paper-shaped hole’ to describe this in the past.

After lunch Liz Lyon from UKOLN talked about curation and long term archival storage, which led into an interesting discussion about the archiving of blogs and other material. Is it worth keeping? One answer to this was to look at the real interest today in diaries from the Second World War and earlier from ‘normal people’. You don’t necessarily need to be a great scientist, or even a great blogger, for the material to be of potential interest to historians in 50–100 years’ time. But doing this properly is hard – in the same way that maintaining and indexing data is hard. Disparate sites, file formats, places of storage, and in the end whose blog is it actually? Particularly if you are blogging for, or recording work done at, a research institution.

The final session was about standards or ‘brands’. Yaroslav Nikolaev talked about semantic representations of experiments. While important, it was probably a shame that we did this at the end of the day, because it would have been helpful to get more of the non-techie people into that discussion, both to iron out the communication issues around the semantic web and to describe the real potential benefits. This remains a serious gap – the experimental scientists who could really use semantic tools don’t really get the point, and the people developing the tools don’t communicate well what the benefits are, or in some cases (not all, I hasten to add!) actually build the tools the experimentalists want.

I talked about the possibility of a ‘certificate’ or standard for Open Science, and the idea of an organisation to police this. It would be safe to say that, while people agreed that clear definitions would be helpful, the enthusiasm level for a standards organisation was pretty much zero. There are more fundamental issues of actually building up enough examples of good practice, and working towards identifying best practice in open science, that need to be dealt with before we can really talk about standards.

On the other hand the idea of ‘the fully supported’ paper got immediate and enthusiastic support. The idea here is deceptively simple, and has been discussed elsewhere; simply that all the relevant supporting information for a paper (data, detailed methodology, software tools, parameters, database versions etc. as well as access to required materials at reasonable cost) should be available for any published paper. The challenge here lies in actually recording experiments in such a way that this information can be provided. But if all of the record is available in this form then it can be made available whenever the researcher chooses. Thus by providing the tools that enable the fully supported paper you are also providing tools that enable open science.

Finally we discussed what we could actually do: Jean-Claude Bradley discussed the idea of an Open Notebook Science challenge to raise the profile of ONS (this is now set up – more on this to follow). Essentially it is a competition-type approach where individuals or groups can contribute to a larger scientific problem by collecting data, and where the teams get judged on how well they describe what they have done and how quickly they make it available.

The most specific action proposed was to draft a ‘Letter to Nature’ proposing the idea of the fully supported paper as a submission standard. The idea would be to get a large number of high profile signatories on a document which describes a concrete step-by-step plan to work towards the final goal, and to send that as correspondence to a high profile journal. I have been having some discussions about how to frame such a document and hope to get a draft up for discussion reasonably soon.

Overall there was much enthusiasm for things Open and a sense that many elements of the puzzle are falling into place. What is missing is effective coordinated action, communication across the whole community of interested and sympathetic scientists, and critically the high profile success stories that will start to shift opinion. These ought to, in my opinion, be the targets for the next 6–12 months.

A personal view of Open Science – Part I

For the Open Science workshop at the Pacific Symposium on Biocomputing I wrote a very long essay as an introductory paper. It turned out that this was far too long for the space available so an extremely shortened version was submitted for the symposium proceedings. I thought I would post the full length essay in installments here as a prelude to cleaning it up and submitting to an appropriate journal.

Introduction

Openness is arguably the great strength of the scientific method. At its core is the principle that claims and the data that support them are placed before the community for examination and critique. Through open examination and critical analysis models can be refined, improved, or rejected. Conflicting data can be compared and the underlying experiments and methodology investigated to identify which, if any, is more reliable. While individuals may not always adhere to the highest standards, the community mechanisms of review, criticism, and integration have proved effective in developing coherent and useful models of the physical world around us. As Lee Smolin of the Perimeter Institute for Theoretical Physics recently put it, “we argue in good faith from shared evidence to shared conclusions”[1]. It is an open approach that drives science towards an understanding which, while never perfect, nevertheless enables the development of sophisticated technologies with practical applications.

The Internet and the World Wide Web provide the technical ability to share a much wider range of both the evidence and the argument and conclusions that drive modern research. Data, methodology, and interpretation can also be made available online at lower costs and with lower barriers to access than has traditionally been the case. Along with the ability to share and distribute traditional scientific literature, these new technologies also offer the potential for new approaches. Wikis and blogs enable geographically and temporally widespread collaborations, the traditional journal club can now span continents with online bookmarking tools such as Connotea and CiteULike, and the smallest details of what is happening in a laboratory (or on Mars [2]) can be shared via messaging applications such as Twitter.

The potential of online tools to revolutionise scientific communication and their ability to open up the details of the scientific enterprise so that a wider range of people can participate is clear. In practice, however, the reality has fallen far behind the potential. This is in part due to a need for tools that are specifically designed with scientific workflows in mind, partly due to the inertia of infrastructure providers with pre-Internet business models such as the traditional “subscriber pays” print literature and, to some extent, research funders. However it is predominantly due to cultural and social barriers within the scientific community. The prevailing culture of academic scientific research is one of possession – where control over data, methodological secrets, and exploitation of results are paramount. The tradition of Mertonian Science has receded, in some cases, so far that principled attempts to reframe an ethical view of modern science can seem charmingly naive.

It is in the context of these challenges that the movement advocating more openness in science must be seen. There will always be places where complete openness is not appropriate, such as where personal patient records may be identifiable, where research is likely to lead to patentable (and patent-worthy) results, or where the safety or privacy of environments, study subjects, or researchers might be compromised. These, however, are special instances for which exceptional cases can be made, and not the general case across the whole of the global research effort. Significant steps forward, such as funder and institutional pre-print deposition mandates and the adoption of data sharing policies by UK Research Councils, must be balanced against the legal and legislative attempts to overturn the NIH mandate and widespread confusion over what standards of data sharing are actually required and how they will be judged and enforced. Nonetheless there is a growing community interested in adopting more open practices in their research, and increasingly this community is developing as a strong voice in discussions of science policy, funding, and publication. The aim of this workshop is to strengthen this voice by focusing the attention of the community on areas requiring technical development, the development and implementation of standards, both technical and social, and the identification and celebration of success.

Why we need open science – Open Access publication, Open Data, and Open Process

The case for taxpayer access to the taxpayer funded peer reviewed literature was made personally and directly in Jonathan Eisen’s first editorial for PLoS Biology [3].

[…describing the submission of a paper to PLoS Biology as an ‘experiment’…] But then, while finalizing the paper, a two-month-long medical nightmare ensued that eventually ended in the stillbirth of my first child. While my wife and I struggled with medical mistakes and negligence, we felt the need to take charge and figure out for ourselves what the right medical care should be. And this is when I experienced the horror of closed-access publishing. For unlike my colleagues at major research universities that have subscriptions to all journals, I worked at a 300-person nonprofit research institute with a small library. So there I was—a scientist and a taxpayer—desperate to read the results of work that I helped pay for and work that might give me more knowledge than possessed by our doctors. And yet either I could not get the papers or I had to pay to read them without knowing if they would be helpful. After we lost our son, I vowed to never publish in non-OA journals if I was in control. […]

Eisen JA (2008) PLoS Biology 2.0. PLoS Biol 6(2): e48 doi:10.1371/journal.pbio.0060048

As a scientist in a small institution he was unable to access the general medical literature. More generally, as a US taxpayer he was unable to access the outputs of US government funded research, or indeed of research funded by the governments of other countries. The general case for enabling access for the general public, for scientists in less well funded institutions, and for those in the developing world has been accepted by most in principle. While US publishers continue to take actions to limit the effect of the NIH mandate, a wide range of research institutions have adopted deposition mandates. There remains much discussion about routes to open access, with the debate over ‘Green’ and ‘Gold’ routes continuing, as well as an energetic ongoing debate about the stability and viability of the business models of various open access journals. However it seems unlikely that the gradual increase in the number and impact of open access journals will slow or stop soon. The principle that the scientific literature should be available to all has been won. The question of how best to achieve that remains a matter of debate.

A similar case to that for access to the published literature can also be made for research data. At the extremes, withholding data could lead to preventable deaths or severely reduced quality of life for patients. Andrew Vickers, in a hard-hitting New York Times essay [4], dissected the reasons that medical scientists give for not making data from clinical cancer trials available; data that could, in aggregate, provide valuable insights into enhancing patient survival time and quality of life. He quotes work by John Kirwan (Bristol University) showing that three quarters of researchers in one survey opposed sharing data from clinical trials. While there may be specific reasons for retaining specific types of data from clinical trials, particularly in small specialised cases where maintaining the privacy of participants is difficult or impossible, it seems unarguable that the interests of patients and the public demand that such data be available for re-use and analysis. The argument is strongest where the taxpayer has funded the trials, but there is a public interest case for making clinical trial data public whoever the funder, including industrial funders.

In other fields the case for data sharing may seem less clear cut. There is little obvious damage done to the general public by not making the details of research available. However, while the argument is more subtle, it is similar to that for clinical data. There the argument is that reanalysis and aggregation can lead to new insights with an impact on patient care. In non-clinical sciences this aggregation and re-analysis leads to new insights, more effective analysis, and indeed new types of analysis. The massive expansion in the scale and ambition of biological sciences over the past twenty years is largely due to the availability of biological sequence, structural, and functional data in international and freely available archives. Indeed the entire field of bioinformatics is predicated on the availability of this data. There is a strong argument to be made that the failure of the chemical sciences to achieve a similar revolution is due to the lack of such publicly available data. Bioinformatics is a highly active and widely practiced field of science. By comparison, chemoinformatics is marginalised, and, what is most galling to those who care for the future of chemistry, primarily driven by the needs and desires of biological scientists. Chemists for the most part haven’t grasped the need because the availability of data is not part of their culture.

High energy particle physics by contrast is necessarily based on a community effort; without strong collaboration, communication, and formalised sharing of the details of what work is going on the research simply would not happen. Astronomy, genome sequencing, and protein crystallography are other fields where there is a strong history, and in some cases formalized standards of data sharing. While there are anecdotal cases of ‘cheating’ or bending the rules, usually to prevent or restrict the re-use of data, the overall impact of data sharing in these areas is generally seen as positive, leading to better science, higher data quality standards, and higher standards of data description. Again, to paraphrase Smolin, where the discussion proceeds from a shared set of evidence we are more likely to reach a valid conclusion. This is simply about doing better science by improving the evidence base.

The final piece of the puzzle, and in many ways the most socially and technically challenging, is the sharing of research procedures. Data has no value in and of itself unless the process used to generate it is appropriate and reliable. Disputes over the validity of claims are rarely based on the data themselves but on the procedures used either to collect them or to process and analyse them. A widely reported recent case turned on the details of how a protein was purified; whether with a step or a gradual gradient elution. This detail of procedure led laboratories to differing results, a year of wasted time for one researcher, and ultimately the retraction of several high profile papers [refs – nature feature, retractions, original paper etc]. Experimental scientists generally imagine that the computational sciences, where a much higher level of reproducibility and the ready availability of code and subversion repositories make sharing and documenting material relatively straightforward, would have much higher standards. However, a recent paper [5] by Ted Pedersen (University of Minnesota, Duluth), with the wonderful title ‘Empiricism is not a matter of faith’, criticized the standards of both code documentation and availability. He makes the case that working on the assumption that you will make the tools available to others not only allows you to develop better tools, and makes you popular in the community, but also improves the quality of your own work.

And this really is the crux of the matter. If the central principle of the scientific method is open analysis and criticism of claims, then making the data, process, and conclusions available and accessible is just doing good science. While we may argue about the timing of release, or the details of ‘how raw’ available data needs to be, or the file formats or ontologies used to describe it, there can be no argument that if the scientific record is to have value it must rest on an accessible body of relevant evidence. Scientists were doing mashups long before the term was invented: mixing data from more than one source and reprocessing it to provide a different view. The potential of online tools to help do this better is massive, but the utility of these tools depends on the sharing of data, workflows, ideas, and opinions.

There are broadly three areas of development required to enable the more widespread adoption of open practice by research scientists. The first is the development of tools that are designed for scientists. While many of the general purpose tools and services have been adopted by researchers, there are many cases where specialised design or adaptation is required for the specific needs of a research environment. In some cases the needs of research will push development in specific areas, such as controlled vocabularies, beyond what is being done in the mainstream. The second, and most important, area involves the social and cultural barriers within various research communities. These vary widely in type and importance across different fields, and understanding and overcoming the fears, as well as challenging entrenched interests, will be an important part of the open science programme. Finally, there is a value and a need to provide top-down guidance in the form of policies and standards. The vagueness of the term ‘Open Science’ means that while it is a good banner there is a potential for confusion. Standards, policies, and brands can provide clarity for researchers, a clear articulation of aspirations (and a guide to the technical steps required to achieve them), and the support required to help people actually make this happen in their own research.

Part II will cover the issues around tools for Open Science

References

  1. Smolin L (2008), Science as an ethical community, PIRSA ID#08090035, http://pirsa.org/08090035/
  2. Mars Phoenix on Twitter, http://twitter.com/MarsPhoenix
  3. Eisen JA (2008) PLoS Biology 2.0. PLoS Biol 6(2): e48 doi:10.1371/journal.pbio.0060048
  4. Vickers A (2008), http://www.nytimes.com/2008/01/22/health/views/22essa.html?_r=1
  5. Pedersen T (2008), Computational Linguistics, Volume 34, Number 3, pp. 465-470, Self archived.

Convergent evolution of scientist behaviour on Web 2.0 sites?

A thought sparked off by a comment from Maxine Clarke at Nature Network, where she posted a link to a post by David Crotty. The thing that got me thinking was Maxine’s statement:

I would add that in my opinion Cameron’s points about FriendFeed apply also to Nature Network. I’ve seen lots of examples of highly specific questions being answered on NN in the way Cameron describes for FF…But NN and FF aren’t the same: they both have the same nice feature of discussion of a partiular question or “article at a URL somewhere”, but they differ in other ways,…[CN- my emphasis]

Alright, in isolation this doesn’t look like much (read through both David’s post and the comments, and then come back to Maxine’s), but what struck me was that on many of these sites many different communities seem to be using very different functionality to do very similar things. In Maxine’s words, ‘…discussion of a…particular URL somewhere…’. And that leads me to wonder the extent to which all of these sites are failing to do what it is that we actually want them to do. And the obvious follow-on question: what is it we want them to do?

There seem to be two parts to this. One, as I wrote in my response to David, is that a lot of this is about the coffee room conversation, a process of building and maintaining a social network. It happens that this network is online, which makes it tough to drop into each other’s offices, but these conversational tools are the next best thing. In fact they can be better, because they let you choose when someone can drop into your office, a choice you often don’t have in the physical world. Many services – Friendfeed, Twitter, Nature Network, Facebook, or a combination – can do this quite well; indeed the conversation spreads across many services, helping the social network (which, bear in mind, probably has fewer than 500 total members) to grow, form, and strengthen the connections between people.

Great. So the social bit, the bit we have in common with the general populace, is sorted. What about the science?

I think what we want as scientists is two things. Firstly, we want the right URL delivered at the right time to our inbox (I am assuming anything important is a resource on the web – this may not be true now, but give it 18 months and it will be). Secondly, we want a rapid and accurate assessment of this item: its validity, its relevance, and its importance to us, judged by people we trust and respect. Traditionally this was managed by going to the library and reading the journals, and then going to the appropriate conference and talking to people. We know that the volume of material is now far too big, and the speed at which we need to deal with it far too fast, for that approach. Nothing new there.

My current thinking is that we are failing in building the right tools because we keep thinking of these two steps as separate, when actually combining them into one integrated process would provide efficiency gains for both phases. I need to sleep on this to get it straight in my head; there are issues of resource discovery, timeframes, and social network maintenance that are not falling into place for me at the moment, so that will be the subject of another post.

However, whether I am right or wrong in that particular line of thought, if it is true that we are reasonably consistent in what we want, then it is not surprising that we try to bend the full range of services available into achieving those goals. The interesting question is whether we can discern what the killer app would be by looking at the details of what people do with different services and where those services are failing. In a sense, if there is a single killer app for science, then it should be discernible what it would do based on what scientists try to do with different services…

Why a 25% risk of developing Parkinson’s really does matter

There has been a lot of commentary around the blogosphere about Sergey Brin’s blog post in which he announced that his SNP profile includes one variant which significantly increases his risk of developing Parkinson’s disease. The mainstream media seem mostly to be desperately concerned about the potential for the ignorant masses to be misinformed about what the results of such tests mean, and the potential for people to make unfortunate decisions based on them (on that argument I’m not sure why we are allowed to have either credit cards or mortgages, but never mind). The other, related stream, found more online, is that this is all a bit meaningless because the correlations between the SNPs 23andMe measures and any disease (or indeed any phenotype) are weak, so it’s not like he knows he’s going to get it – why not get a proper test for a proper genetic disease?

I think this is missing the point – and I think in fact, his post could represent the beginning of a significant change to the landscape of medical funding – precisely because that correlation is weak.

In the western world we have been talking for decades about how ‘prevention is better than cure’ yet funding for disease prevention has always remained poor. It’s not sexy, it is often long term, and it is much harder to get people to donate money for it.  Rich people donate money because relatives, or they themselves, are already ill. They are looking for a cure, or at least a legacy of helping people in the same situation.

The Health Commons project run by Science Commons has as its aim the reduction of the transactional costs involved in getting to a drug. If you can drop the cost of making a cure for a disease from $1B to $1M you have many orders of magnitude more people who can afford to just ‘buy a cure’ for a loved one, themselves, or to make themselves feel better.  This vision, and what it does to global health, even if the success rates are relatively low, is immensely powerful. And it needn’t apply just to drugs, or to cures.

But Brin doesn’t want a cure necessarily. Actually that’s not true – I’m sure he does, his mother has Parkinson’s and his great aunt suffered from it as well. Anyone who has seen someone suffer from a degenerative disease would want a cure. But equally, he’s a smart guy and knows just how tough that will be, and how much money is already going into such things.  But he can now look at his genetic profile now and look at which diseases he is predisposed to get. He can look at which of those he is really worried about. And then he can dig into his pocket and decide what the most efficient use of his $15B fortune is. Sure he will put millions into developing treatments, but the smart money will go on research into prevention or slowing onset. He has the time before any likely onset to allow programmes on prevention to run for 10-20 years before he will need to take a best guess on implementation.

And the point is that there will be a growing number of people with only the probabilities based on SNP data to work with when making similar judgements about a range of diseases. And don’t believe that if we have full genome sequences those probabilities will get any better either. More precise, yes. Better linked to phenotype? Not for a while yet. People who get these tests done don’t know exactly what they will get, but they have an idea, and they might have that idea up to 50 years in advance. Now consider what happens if the cost of developing methods to prevent or delay onset drops to the point that millionaires can make an impact. A thousand-fold, maybe a million-fold, more people with a deep interest in preventing the onset of specific diseases, an understanding of risk-based investment, and the money and the time to do something about it.

Preventative medicine just became the biggest growth area in medical research.

p.s. Attilla gets it – he’s just thinking regenerative rather than preventative – maybe they are the same thing in the end?

Won’t someone please think of the policy wonks?

I wouldn’t normally try to pick a fight with Chad Orzel, and certainly not over a post which I mostly agree with, but I wanted to take some issue with the slant in his relatively recent post We are science (see also a good discussion in the comments). Chad makes a cogent argument that there is a lot of whining about credit and rewards and that ‘Science’ or ‘The Powers That Be’ are blamed for a lot of these things. His point is that ‘We are science’ – and that it is the community of scientists, or parts of it, that makes these apparently barmy decisions. And as part of the community, if we want to see change, it is our responsibility to get on and change things. There is a strong case for more grass-roots approaches, and in particular for those of us with some influence over appointments and promotions procedures to make the case in those committees for widening criteria. I certainly don’t have any problem with his exhortation to ‘be the change’ rather than complain about it. I would hope that I do a reasonably good, though by no means perfect, job of trying to live by the principles I advocate.

Yet at the same time I think his apparent wholesale rejection of top-down approaches is a bit too much. There is a place for advocating changes in policy and standards. It is not always the right approach, either because it is not the right time or place, but sometimes it is. One reason for advocating changes in policy and standards is to provide a clear message about what the aspirations of a community or funding body are. This is particularly important in helping younger researchers assess the risks and benefits of taking more open approaches. Many of the most energetic advocates of open practice are actually in no position to act on their beliefs because as graduate students and junior postdocs they have no power to make the crucial decisions over where to publish and how (and whether) to distribute data and tools.

Articulations of policy such as the data sharing statements required by the UK BBSRC make it clear that there is an aspiration to move in this direction, that funding will be linked to delivering on these targets. This will both encourage young scientists to make the case to their PIs that this is the way forward and will also influence hiring committees. Chad makes the point that a similar mandate on public outreach for NSF grants has not been taken seriously by grantees. Here I would agree. There is no point having such policies if they are not taken seriously. But everything I have seen and heard so far suggests that the BBSRC does intend to take their policy and delivery on the data sharing statements very seriously indeed.

Top down initiatives are also sometimes needed to drive infrastructure development. The usability of the tools that will be needed to deliver on the potential of data and process sharing is currently woefully inadequate. Development is necessary, and funding for this is required. Without the clear articulation from funders that this is the direction in which they wish to go and that they expect to see standards rising year on year, and without them backing that up with money, nothing much will happen. Again BBSRC has done this, explicitly stating that it expects funding requests to include support for data availability. The implication is that if people haven’t thought about the details of what they will do, and its costs, there will be questions asked. I wonder whether this was true of the NSF outreach scheme?

Finally, policy and top-down fiat have the potential, when judiciously applied, to accelerate change. Funders, and indeed governments, want to see better value for money on the research investment and see greater data availability as one way of achieving that. Chad actually provides an example of this working. The NIH deposition mandate has significantly increased the proportion of NIH funded papers available in PubMedCentral (I suspect Chad’s figure of 30% is actually taken from my vague recollection of the results of the Wellcome Trust mandate – I think that current evidence suggests that the NIH mandate is getting about 50% now – I just saw a graph of this somewhere but can’t find it now! Mark Siegal in the comments provided a link to the data in an article in Science (sorry, behind a paywall) – here). Clearly providing funding and policy incentives can move us further in that direction quicker. Anybody who doesn’t believe that funding requirements can drive the behaviour of research communities extremely effectively clearly hasn’t applied for any funding in the last decade.

But it remains the case that policy and funding is a blunt instrument. It is far more effective in the long term to bring the community with you by persuasion than by force. A community of successful scientists working together and changing the world is a more effective message in many ways than a fiat from the funding body. All I’m saying is that a combination of both is called for.

Thinking about peer review of online material: The Peer Reviewed Journal of Open Science Online

I hold no particular candle for traditional peer review. I think it is inefficient, poorly selective, self reinforcing, often poorly done, and above all, far too slow. However I also agree that it is the least worst system we have available to us.  Thus far, no other approaches have worked terribly well, at least in the communication of science research. And as the incumbent for the past fifty years or so in the post of ‘generic filter’ it is owed some respect for seniority alone.

So I am considering writing a fellowship proposal that would be based around the idea of delivering on the Open Science Agenda via three independent projects, one focussed on policy and standards development, one on delivering a shareable data analysis pipeline for small angle scattering as an exemplar of how a data analysis process can be shared, and a third project based around building the infrastructure for embedding real science projects involving chemistry and drug discovery in educational and public outreach settings. I think I can write a pretty compelling case around these three themes and I think I would be well placed to deliver on them, particularly given the collaborative support networks we are already building in these areas.

The thing is I have no conventional track record in these areas. There are a bunch of papers currently being written but none that will be out in print by the time the application is supposed to go in. My recorded contribution in this area is in blog posts, blog comments, presentations and other material, all of which are available online. But none of which are peer-reviewed in the conventional sense.

One possibility is to make a virtue of this – stating that this is a rapidly moving field and that, while papers are in hand and starting to come out, the natural medium for communication with the specific community is online, through blogs and other media. There is an argument that conventional peer review simply does not map onto the web of data, tools, and knowledge that is starting to come together, and that measuring a contribution in this area by conventional means is simply misguided. All of which I agree with in many ways.

I just don’t think the referees will buy it.

Which got me thinking. It’s not just me: many of the seminal works for the Open Science community are not peer-reviewed papers. Bill Hooker’s three-parter [1, 2, 3] at Three Quarks Daily comes to mind, as does Jean-Claude’s presentation on Nature Precedings on Open Notebook Science, Michael Nielsen’s essay The Future of Science, and Shirley Wu’s Envisioning the scientific community as One Big Lab (along with many others). It seems to me that these ought to have the status of peer-reviewed papers, which raises a question: we are a community of peers, we can referee, and we can adopt some sort of standard of significance and decide to apply it selectively to specific works online. So why can’t we make them peer reviewed?

What would be required? Well, a stable citation obviously, so probably a DOI and some reasonably strong archival approach, probably using WebCite. There would need to be a clear process of peer review, which need not be anonymous, but there would have to be a clear probity trail to show that an independent editor or group of referees made a decision and that appropriate revisions had been made and accepted. The bar for acceptance would also need to be set pretty high to avoid the charge of simply rubber-stamping a bunch of online material. I don’t think open peer review is a problem for this community, so many of the probity questions can be handled by simply having the whole process out in the open.

One model would be for an item to be submitted by posting a link on a new page on an independent wiki. This would then be open to peer review. Once three (five?) independent reviewers had left comments and suggestions – and a version of the document that satisfied them had been posted – the new version could be re-posted at the author’s site, in a specified format that would include the DOI and archival links, along with a badge that would be automatically aggregated to create the index, à la researchblogging.org. There would need to be a charge, either for submission or acceptance – charging at submission would keep volume down and (hopefully) quality up.
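To make the moving parts concrete, here is a very rough sketch (in Python) of the sort of record the central index might keep for each submitted item – the original link, the archived snapshot, the DOI, the reviews, and the final author-posted version. The field names, review threshold, and workflow are assumptions for illustration only, not a description of any existing system.

    # Rough sketch only - field names and the review threshold are
    # hypothetical, chosen to illustrate the workflow described above.
    from dataclasses import dataclass, field

    REQUIRED_REVIEWS = 3  # or five - the bar is a policy decision

    @dataclass
    class Submission:
        author: str
        original_url: str           # the post/presentation as first published
        archived_url: str = ""      # e.g. a WebCite snapshot
        doi: str = ""               # minted only on acceptance
        reviews: list = field(default_factory=list)  # (reviewer, comments, approved)
        revised_url: str = ""       # the revised version re-posted at the author's site

        def is_accepted(self) -> bool:
            # accepted once enough reviewers approve and a revised version has been posted
            approvals = [r for r in self.reviews if r[2]]
            return len(approvals) >= REQUIRED_REVIEWS and bool(self.revised_url)

Nothing hangs on the details; the point is that the state to be tracked is small and could live on a single wiki page per item.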

How does this differ from setting up a journal? Well, two major things. One is that the author remains the publisher, so the costs of publication per se are taken out of the equation. This is important as it keeps costs down – not to zero; there is still the cost of the DOI and (even if it is donated) the time of editors and referees in managing the process and giving a stamp of authority. The main cost is in maintaining some sort of central index and server pointing to the approved items. It would also be appropriate to support WebCite if that is the backstop archive. But the big costs for journals are in providing storage that is stable in the long term and in managing peer review. If the costs of storage are offloaded and the peer review process can be self-organised, then the costs drop significantly.

The second major advantage is that, as a community, we already do a lot of this: looking over blog posts, linking to presentations, writing commentary or promoting them on FriendFeed. The reason arXiv worked was that there was already a culture of preprints in that community. The reason commenting, rating, and open peer review trials have not been as successful as people had hoped is that there is no pre-existing culture of doing these things. We already have a culture of open peer review in our community. Is it worth formalising it for the really high-quality material that’s already out there?

I am aware that this goes against many of the principles of open and continuous review that many of you hold dear, but I think it could serve two useful purposes. First, it means that members of the community, particularly younger members, can bolster their CVs with peer-reviewed papers. Come the revolution this won’t matter, but we’re not there yet, and making these contributions tangible for people could be quite powerful. Secondly, it takes the important material out of the constant stream of objects flitting past on our screens and gives it a static (I won’t say permanent) privileged place as part of the record of this field. Many of these items perhaps already have that status, but I think there is value in formalising it. Is it worth considering? This proposal is out for review.


Q&A in this week’s Nature – one or two (minor) clarifications

So, a bit of a first for me: I can vaguely claim to have contributed two things to the print version of Nature this week. Strictly speaking, my involvement in the first, the ‘From the Blogosphere’ piece on the Science Blogging Challenge, was really restricted to discussing the idea (originally from Richard Grant, I believe) and now a bit of cheerleading and ultimately some judging. The second item, though, I can claim some credit for, in as much as it is a Q&A with myself and Jean-Claude Bradley that was done when we visited Nature Publishing Group in London a few weeks back.

It is great that a journal like Nature views the ideas of data publication, open notebook science, and open science in general as worthy of featuring. This is not an isolated instance either: we can point to the good work of the Web Publishing Group in developing useful resources such as Nature Precedings, as well as previous features in the print edition, such as the Horizons article (there is also another version on Nature Precedings) written by Peter Murray-Rust. One thing I have heard said many times in recent months is that while people who advocate open science may not agree with everything NPG does with respect to copyright and access, they are impressed and encouraged by the degree of engagement it maintains with the community.

I did however just want to clarify one or two of the things I apparently said. I am not claiming that I didn’t say those things – the interview was recorded after all – but just that on paper they don’t really quite match what I think I meant to say. Quoting from the article:

CN-Most publishers regard what we do as the equivalent of presenting at a conference, or a preprint. That hasn’t been tested across a wide range of publishers, and there’s at least one — the American Chemical Society — that doesn’t allow prepublication in any form whatsoever.

That sounds a little more extreme than what I meant to say – there are a number of publishers that don’t allow submission of material that has appeared online as a pre-print, and the ACS has said that it regards online publication as equivalent to a pre-print. I don’t have any particular sympathy for the ACS, but I think it probably does allow publication of material that was presented at ACS conferences.

CN-Open notebooks are practical but tough at the moment. My feeling is that the tools are not yet easy enough to use. But I would say that a larger proportion of people will be publishing electronically and openly in ten years.

Here I think what I said is too conservative on one point and possibly not conservative enough on the other. I did stick my neck out and say that I think the majority of scientists will be using electronic lab notebooks of one sort or another in ten years. Funder data-sharing policies will drive a much greater volume of material online post-publication (hopefully with higher-quality description), and this may become the majority of all research data. I think more people will also be making material available openly as it is produced, but I doubt that this will be a majority of people in ten years – I hope for a sizeable and significant minority, and that is what we will continue to work towards.

The distinction between recording and presenting – and what it means for an online lab notebook

Something that has been bothering me for quite some time fell into place for me in the last few weeks. I had always been slightly confused by my reaction to the fact that on UsefulChem Jean-Claude actively works to improve and polish the description of the experiments on the wiki. Indeed, this is one of the reasons he uses a wiki: the process of making modifications to posts on blogs is generally less convenient, and in most cases there isn’t a robust record of the different versions. I have always felt uncomfortable about this because, to me, a lab book is about the record of what happened – including any mistakes in recording you make along the way. There is some more nebulous object (probably called a report) which aggregates and polishes the descriptions of the experiments.

Now this is fine, but the point is that the full history of a UsefulChem page is immediately available, so the full record is very clearly there – it is just not what is displayed. In our system we tend to capture a warts-and-all view of what was recorded at the time and only correct typos or append comments or observations to a post. This tends not to be very human-readable in most cases – to understand the point of what is going on you have to step above this to a higher level, one which we are arguably not very good at describing at the moment.

I had thought for a long time that this was a difference between our respective fields. The synthetic chemistry of UsefulChem lends itself to a slightly higher-level description, where the process of a chemical reaction is described in a fairly well-defined, community-accepted style. Our biochemistry is more a set of multistep processes, where each of those steps is quite stereotyped; in fact, for us it is difficult to define where the ‘experiment’ begins and ends. This is at least partly true, but if you delve a little deeper, and also look at Jean-Claude’s recent efforts to use a controlled vocabulary to describe the synthetic procedures, a different view arises. Each line of one of these ‘machine-readable’ descriptions actually maps very well onto each of our posts in the LaBLog. Something that maps even better is the log that appears near the bottom of each UsefulChem page. What we are actually recording is rather similar; it is simply that Jean-Claude is presenting it at a different level of abstraction.
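To illustrate the mapping I mean (with a purely invented example – this is not Jean-Claude’s actual vocabulary nor our actual post format), a single line of a machine-readable procedure and a single fine-grained LaBLog post carry essentially the same information:

    # Hypothetical illustration only - both structures are invented to show
    # the correspondence, not real UsefulChem or LaBLog formats.
    procedure_line = {"action": "ADD", "material": "reagent A",
                      "amount": "1.0 mmol", "vessel": "vial 1"}

    lablog_post = {
        "title": "Addition of reagent A to vial 1",
        "body": "Added 1.0 mmol of reagent A to vial 1.",
        "links": ["/post/reagent-A", "/post/vial-1"],  # connections to other posts
    }

Both are records of the same atomic step; the difference is whether that step is displayed as one line of a polished procedure or as one post in a stream.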

And that, I think, is the key. It is true that synthetic chemistry lends itself to a slightly different level of abstraction than biochemistry and molecular biology, but the key difference actually comes in motivation. Jean-Claude’s motivation from the beginning has been to make the research record fully available to other scientists – to present that information to potential users. My focus has always been on recording the process that occurs in the lab, and in particular on capturing the connections between objects and data files. Hence we have adopted a fine-grained approach that provides a good record but does not necessarily make it easy for someone to follow the process through. On UsefulChem the ideal final product contains a clear description of how to repeat the experiment; on the LaBLog this will require tracking through several posts to pick up the thread.

This also plays into the discussion I had some months ago with Frank Gibson about the use of data models. There is a lot to be said for using a data model to present the description of an experiment; it provides all sorts of added value to have an agreed model of what these descriptions look like. However, it is less clear to me that it provides a useful way of recording or capturing the research process as it happens, at least in the general case. Stream-of-consciousness recording of what has happened, rather than stopping halfway through to figure out how what you are doing fits into the data model, is what is required at the recording stage. One of the reasons people feel uncomfortable with electronic lab notebooks is that they feel they will lose the ability to scribble such ‘free-form’ notes – the lack of any presuppositions about what the page should look like is one of the strengths of pen and paper.

However, once the record, or records, have been made, it is appropriate to pull these together and make sense of them – to present the description of an experiment in a structured and sensible fashion. This can of course be linked back to the primary records and specific data files, but it provides a comprehensible and fine-grained description of the rationale for, and conduct of, the experiment, as well as placing the results in context. This ‘presentation layer’ is something that is missing from our LaBLog, but it could relatively easily be pulled together by writing up the methodology section of a report. This would be good for us and good for people coming into the system looking for specific information.
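A minimal sketch of this two-layer idea (structure and field names are illustrative assumptions, not the actual LaBLog data model) would be fine-grained posts captured as they happen, with a separate report object assembled afterwards that orders them, adds the narrative, and links back to the primary records and data files:

    # Illustrative sketch only - not the real LaBLog structures.
    records = [
        {"id": "post-101", "text": "Made up 10 ml of buffer ...", "files": []},
        {"id": "post-102", "text": "Set up digest, samples 1-4 ...", "files": ["gel-1.jpg"]},
    ]

    report = {
        "title": "Methodology: sample preparation and digest",
        "narrative": "Buffer was prepared and the digest set up as follows ...",
        "sources": [r["id"] for r in records],  # links back to the primary record
    }

The primary records stay untouched; the report is the comprehensible, human-facing view built on top of them.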
