Now that’s what I call social networking…

So there’s been a lot of antagonistic and cynical commentary about Web2.0 tools particularly focused on Twitter, but also encompassing Friendfeed and the whole range of tools that are of interest to me. Some of this is ill informed and some of it more thoughtful but the overall tenor of the comments is that “this is all about chattering up the back, not paying attention, and making a disruption” or at the very least that it is all trivial nonsense.

The counter argument for those of us who believe in these tools is that they offer a way of connecting with people, a means for the rapid and efficient organization of information, but above all, a way of connecting problems to the resources that can let us make things happen. The trouble has been that the best examples that we could point to were flashmobs, small scale conversations and collaborations, strangers meeting in a bar, the odd new connection made. But overall these are small things; indeed in most cases trivial things. Nothing that registers on the scale of “stuff that matters” to the powers that be.

That was two weeks ago. In the last couple of weeks I have seen a number of remarkable things happen and I wanted to talk about one of them here because I think it is instructive.

On Friday last week there was a meeting held in London to present and discuss the draft Digital Britain Report. This report, commissioned by the government, is intended to map out the UK's needs in terms of digital infrastructure: physical, legal, and perhaps even social. The current tenor of the draft report is what you might expect: heavy on the need to put broadband everywhere to get content to people, and heavy on the need to protect big media from the rising tide of piracy. Actually it's not all that bad, but many of the digerati felt that it was missing important points about what happens when consumers are also content producers, and what that means for rights management when the asymmetry of production and consumption is broken but the asymmetry of power is not. Anyway, that's not what's important here.

What is important is that the sessions were webcast, a number of people were twittering from the physical audience, and a much larger number were watching and twittering from outside, aggregated around the hashtag #digitalbritain. There was reportage going on in real time from within the room and a wide-ranging conversation going on beyond its walls. In this day and age there is nothing particularly remarkable there. It is still relatively unusual for the online audience to be bigger than the physical one for these kinds of events, but certainly not unheard of.

Nor was it remarkable when Kathryn Corrick tweeted the suggestion that an unconference should be organized to respond to the forum (actually it was Bill Thomson who was first with the suggestion but I didn’t catch that one). People say “why don’t we do something?” all the time; usually in a bar. No, what was remarkable was what followed this as a group of relative strangers aggregated around an idea, developed and refined it, and then made it happen. One week later, on Friday evening, a website went live, with two scheduled events [1, 2], and at least two more to follow. There is an agreement with the people handling the Digital Britain report on the form an aggregated response should take. And there is the beginning of a plan as to how to aggregate the results of several meetings into that form. They want the response by 13 May.

Let's rewind that. In a matter of hours a group of relative strangers, who met each other through something as intangible as a shared word, agreed on, and started to implement, a nationwide plan to gather the views of maybe a few hundred, perhaps a few thousand people, with the aim, and the expectation, of influencing government policy. Within a week there was a scalable framework for organizing the process of gathering the response (anyone can organize one of the meetings) and a process for pulling together a final report.

What made this possible? Essentially the range of low barrier communication, information, and aggregation tools that Web2.0 brings us.

  1. Twitter: without twitter the conversation could never have happened. Friendfeed never got a look-in because that wasn’t where this specific community was. But much more than just twitter, the critical aspect was:
  2. The hashtag #digitalbritain: the hashtag became the central point of a conversation between people who didn’t know each other, weren’t following each other, and without that link would never have got in contact. As the conversation moved to discussing the idea of an unconference the hashtag morphed first to #digitalbritain #unconference (an intersection of ideas) and then to #dbuc09. In a sense it became serious when the hashtag was coined. The barrier for a group of sufficiently motivated people to identify each other was low.
  3. Online calendars: it was possible for me to identify specific dates when we might hold a meeting at my workplace in minutes because we have all of our rooms on an online calendar system. Had it been more complex I might not have bothered. As it was it was easy to identify possible dates. The barrier to organization was low.
  4. Free and easy online services: A Yahoo Group was set up very early and used as a mailing list. WordPress.com provides a simple way of throwing up a website and giving specified people access to put up material. Eventbrite provides an easy method to manage numbers for the specific events. Sure, someone could have set these up for us on a private site, but the almost zero barrier of these services makes it easy for anyone to do this.
  5. Energy and community: these services lead to low barriers, not zero barriers. There still has to be the motivation to carry it through. In this case Kathryn provided the majority of the energy and others chipped in along the way. Higher barriers could have put a stop to the whole thing, or perhaps stopped it going national, but there needs to be some motivation to get over the barriers that do remain. What was key was that a small group of people had sufficient energy to carry this through.
  6. Flexible working hours: none of this would be possible if the people who would be interested in attending such meetings couldn’t come at short notice. The ability of people either to arrange their own working schedules or to take time out of work is crucial; otherwise no-one could come. Henry Gee had a marvellous riff on the economic benefits of flexible working just before the budget. The feasibility of our meetings is an example of the potential efficiency benefits that such flexibility could bring.

The common theme here is online services making it easy to aggregate the right people and the right information quickly, and to re-publish that information in a useful form. We will use similar services, blogs, wikis, and online documents, to gather back the outputs from these meetings and push them back into the policy-making process. Will it make a big difference? Maybe not, but even in showing that this kind of response, this kind of community consultation, can be done effectively in a matter of days and weeks, I think we’re showing what a Digital Britain ought to be about.

What does this mean for science or research? I will come back to more research-related examples over the next few weeks, but one key point was that this happened because there was a pretty large audience watching the webcast and communicating around it. As I and others have recently argued, in research the community sizes probably aren’t big enough in most cases for these sorts of network effects to kick in effectively. Building up community quantity and quality will be the main challenge of the next 6 – 12 months, but where the community exists and where the time is available we are starting to see rapid, agile, and bursty efforts in projects and particularly in preparing documents.

There is clearly a big challenge in taking this into the lab, but there is a good reason why, when I talk to my senior management about the resources I need, the keywords are “capacity” and “responsiveness”. Bursty work requires the capacity to be in place to resource it. In a lab this is difficult, but it is not impossible. It will probably require a reconfiguring of resource distribution to realize its potential. But if that potential can be demonstrated then the resources will almost certainly follow.

Use Cases for Provenance – eScience Institute – 20 April

On Monday I am speaking as part of a meeting on Use Cases for Provenance (Programme), which has a lot of interesting talks scheduled. I appear to be last. I am not sure whether that means I am the comedy closer or the pre-dinner entertainment. This may, however, be as a result of the title I chose:

In your worst nightmares: How experimental scientists are doing provenance for themselves

On the whole experimental scientists, particularly those working in traditional, small research groups, have little knowledge of, or interest in, the issues surrounding provenance and data curation. There is however an emerging and evolving community of practice developing the use of the tools and social conventions related to the broad set of web based resources that can be characterised as “Web 2.0”. This approach emphasises social, rather than technical, means of enforcing citation and attribution practice, as well as maintaining provenance. I will give examples of how this approach has been applied, and discuss the emerging social conventions of this community from the perspective of an insider.

The meeting will be webcast (link should be available from here) and my slides will with any luck be up at least a few minutes before my talk in the usual place.

Creating a research community monoculture – just when we need diversity

This post is a follow-on from a random tweet that I sent a few weeks back in response to a query on twitter from Lord Drayson, the UK’s Minister of State for Science and Innovation. I thought it might be an idea to expand on the 140 characters that I had to play with at the time, but it’s taken me a while to get to it. It builds on the ideas of a post from last year but is given a degree of urgency by the current changes in policy proposed by EPSRC.

Government money for research is limited, and comes from the pockets of taxpayers. It is incumbent on those of us who spend it to ensure that this investment generates maximum impact. Impact, for me, comes in two forms. Firstly there is straightforward (although not straightforward to measure) economic impact: increases in competitiveness, standard of living, development of business opportunities, social mobility, reductions in the burden of ill health, and hopefully in environmental burden at some point in the future. The problem with economic impact is that it is almost impossible to measure in any meaningful way. The second area of impact is, at least on the surface, a little easier to track: research outputs delivered. How efficiently do we turn money into science? Scratch beneath the surface and you realise rapidly that measurement is a nightmare, but we can at least look at where there are inefficiencies, where money is being wasted and lost from the pipelines before it can be spent on research effort.

The approach that is being explicitly adopted in the UK is to concentrate research in “centres of excellence” and to “focus research on areas where the UK leads” and where “they are relevant to the UK’s needs”. At one level this sounds like motherhood and apple pie. It makes sense in terms of infrastructure investment to focus research funding both geographically and in specific subject areas. But at another level it has the potential to completely undermine the UK’s history of research excellence.

There is a fundamental problem with trying to maximise the economic impact of research. And it is one that any commercial expert, or indeed politician, should find obvious. Markets are good at picking winners; committees are very bad at it. Using committees of scientists, with little or no experience of commercialising research outputs, is likely to be an unmitigated disaster. There is no question that some research leads to commercial outcomes, but to the best of my knowledge there is no evidence that anyone has ever had any success in picking the right projects in advance. The simple fact is that the biggest form of economic impact from research is in providing and supporting the diverse and skilled workforce that supports a commercially responsive, high technology economy. To a very large extent it doesn’t actually matter what specific research you support as long as it is diverse. And you will probably generate exactly the same amount of commercial outcomes by picking at random as you will by trying to pick winners.

The world, and the UK in particular, is facing severe challenges, both economic and environmental, for which there may be technological solutions. Indeed there is a real opportunity in the current economic climate to reboot the economy with low carbon technologies, and at the same time to rebuild the information economy in a way that takes advantage of the tools the web provides, and in turn to use this to improve outcomes in health and social welfare and to develop new environmentally friendly processes and materials. The UK has great potential to lead these developments precisely because it has a diverse research community and a diverse, highly trained research and technology workforce. We are well placed to solve today’s problems with tomorrow’s technology.

Now let us return to the current UK policy proposals. These are to concentrate research, to reduce diversity, and to focus on areas of UK strength. How will those strengths be identified? No doubt by committee. Will they be forward-looking strengths? No, they will be what a bunch of old men, already selected by their conformance to a particular stereotype (i.e. the ones doing fundable research in fundable places), identify in a closed room. It is easy to identify the big challenges. It is not easy, perhaps not even possible, to identify the technological solutions that will eventually solve them. Not the currently most promising solutions; the ones that will solve the problem five or ten years down the track.

As a thought experiment think back to what the UK’s research strengths and challenges were 20 years ago and imagine a world in which they were exclusively funded. It would be easy to argue that many of the UK’s current strengths simply wouldn’t even exist (web technology? biotechnology? polymer materials?). And that disciplines that have subsequently reduced in size or entirely disappeared would have been maintained at the cost of new innovation. Concentrating research in a few places, on a few subjects, will reduce diversity, leading to the loss of skills, and probably the loss of skilled people as researchers realise there is no future career for them in the UK. It will not provide the diverse and skilled workforce required to solve the problems we face today. Concentrating on current strengths, no matter how worthy, will lead to ossification and conservatism making UK research ultimately irrelevant on a world stage.

What we need more than ever now is a diverse and vibrant research community working on a wide range of problems, and better communication tools so as to efficiently connect unexpected solutions to problems in different areas. This is not the usual argument for “blue skies research”, whatever that may be. It is an argument for using market forces to do what they are best at (picking the winners from a range of possible technologies) and for using the smart people currently employed in research positions at government expense to do what they are good at: doing research and training new researchers. It is an argument for looking critically at the expenditure of government money in a holistic way and seriously considering radical change where money is being wasted. I have estimated in the past that the annual cost of failed grant proposals to the UK government is somewhere between £100M and £500M, a large sum of money in anybody’s books. More rigorous economic analysis of a Canadian government funding scheme has shown that the cost of preparing and refereeing the proposals ($CAN40k) is more than the cost of giving every eligible applicant a support grant of $CAN30k. This is not just farcical, it is an offensive waste of taxpayers’ money.

The funding and distribution of research money requires radical overhaul. I do not believe that simply providing more money is the solution. Frankly, we’ve had a lot more money; it makes life a little more comfortable if you are in the right places, but it has reduced the pressure to solve the underlying problems. We need responsive funding at a wide range of levels that enables both bursts of research, the kind of instant collaboration that we know can work, with little or no review, and large scale data gathering projects of strategic importance that need extensive and careful critical review before being approved. And we need mechanisms to tension these against each other. We need baseline funding to just let people get on with research, and we need access to larger sums where appropriate.

We need less bureaucracy, less direction from the top, and more direction from the sides, from the community, and not just necessarily the community of researchers. What we have at the moment are strategic initiatives announced by research councils that are around five years behind the leading edge, which distort and constrain real innovation. Now we have ministers proposing to identify the UK’s research strengths. No doubt these will be five to ten years out of date, and they will almost certainly stifle those pockets of excellence that will grow in strength over the next decade. No-one will ever agree on what tomorrow’s strengths will be. Much better would be to get on and find out.

Open Data, Open Source, Open Process: Open Research

There has been a lot of recent discussion about the relative importance of Open Source and Open Data (Friendfeed, Egon Willighagen, Ian Davis). I don’t fancy recapitulating the whole argument but following a discussion on Twitter with Glyn Moody this morning [1, 2, 3, 4, 5, 6, 7, 8] I think there is a way of looking at this with a slightly different perspective. But first a short digression.

I attended a workshop late last year on Open Science run by the Open Knowledge Foundation. I spent a significant part of the time arguing with Rufus Pollock about data licences, an argument that is still going on. One of Rufus’ challenges to me was to commit to working towards using only Open Source software. His argument was that there weren’t really any excuses any more. Open Office could do the job of MS Office, Python with SciPy was up to the same level as MatLab, and anything specialist needed to be written anyway so should be open source from the off.

I took this to heart and I have tried, I really have tried. I needed a new computer and, although I got a Mac (not really ready for Linux yet), I loaded it up with Open Office, I haven’t yet put my favourite data analysis package on the computer (Igor if you must know), and I have been working in Python to try to get some stuff up to speed. But I have to ask whether this is the best use of my time. As is often the case with my arguments, this is a return on investment question. I am paid by the taxpayer to do a job. At what point does the extra effort I am putting into learning to use, or in some cases fight with, new tools cost more than the benefit that is gained by making my outputs freely available?

Sometimes the problems are imposed from outside. I spent a good part of yesterday battling with an appalling, password protected, macroed-to-the-eyeballs Excel document that was the required format for me to fill in a form for an application. The file crashed Open Office and only barely functioned in Mac Excel. Yet it was required, in that format, before I could complete the application. Sometimes the software is just not up to scratch. Open Office Writer is fine, but the presentation and spreadsheet modules are, to be honest, a bit ropey compared to the commercial competitors. And with a Mac I now have Keynote, which is just so vastly superior that I have now transferred to it wholesale. And sometimes it is just a question of time. Is it really worth me learning Python to do data analysis that I could knock out in Igor in a tenth of the time?

In this case the answer is probably yes, because it means I can do more with it. There is the potential to build something that logs process the way I want to, the potential to convert it to run as a web service. I could do these things with other OSS projects as well in a way that I can’t with a closed product. And even better, because there is a big open community, I can ask for help when I run into problems.

It is easy to lose sight of the fact that for most researchers software is a means to an end. For the Open Researcher what is important is the ability to reproduce results, to criticize and to examine. Ideally this would include every step of the process, including the software. But for most issues you don’t need, or even want, to be replicating the work right down to the metal. You wouldn’t, after all, expect a researcher to be forced to run their software on an open source computer, with an open source chipset. You aren’t necessarily worried what operating system they are running. What you are worried about is whether it is possible to read their data files and reproduce their analysis. If I take this just one step further, it doesn’t matter if the analysis is done in MatLab or Excel, as long as the files are readable in Open Office and the analysis is described in sufficient detail that it can be reproduced or re-implemented.

Let’s be clear about this: it would be better if the analysis were done in an OSS environment. If you have the option to work in an OSS environment you can also save yourself time and effort in describing the process, and others have a much better chance of identifying the sources of problems. It is not good enough to just generate an Excel file; you have to generate an Excel file that is readable by other software (and here I am looking at the increasing number of instrument manufacturers providing software that generates so-called Excel files that often aren’t even readable in Excel). In many cases it might be easier to work with OSS so as to make it easier to generate an appropriate file. But there is another important point: if OSS generates a file type that is undocumented or, worse, obfuscated, then that is also unacceptable.

Open Data is crucial to Open Research. If we don’t have the data we have nothing to discuss. Open Process is crucial to Open Research. If we don’t understand how something has been produced, or we can’t reproduce it, then it is worthless. Open Source is not necessary, but, if it is done properly, it can come close to being sufficient to satisfy the other two requirements. However it can’t do that without Open Standards supporting it for documenting both file types and the software that uses them.

The point that came out of the conversation with Glyn Moody for me was that it may be more productive to focus on our ability to re-implement rather than to simply replicate. Re-implementability, while an awful word, is closer to what we mean by replication in the experimental world anyway. Open Source is probably the best way to do this in the long term, and in a perfect world the software and support would be there to make this possible, but until we get there, for many researchers, it is a better use of their time, and the taxpayer’s money that pays for that time, to do that line fitting in Excel. And the damage is minimal as long as source data and parameters for the fit are made public. If we push forward on all three fronts, Open Data, Open Process, and Open Source then I think we will get there eventually because it is a more effective way of doing research, but in the meantime, sometimes, in the bigger picture, I think a shortcut should be acceptable.

Eduserv Symposium 2009 – Evolution or revolution: The future of identity and access management for research

I am speaking at the Eduserv Symposium in London in late May on the subject of the importance of identity systems for advancing the open research agenda.

From the announcement:

The Eduserv Symposium 2009 will be held on Thursday 21st May 2009 at the Royal College of Physicians, London. More details about the event are available from: http://www.eduserv.org.uk/events/esym09

This symposium will be of value to people with an interest in the impact that the social Web is having on research practice and scholarly communication and the resulting implications for identity and access management. Attendees will gain an insight into the way the research landscape is evolving and will be better informed when making future decisions about policy or practice in this area.

Confirmed speakers include: James Farnhill, Joint Information Systems Committee; Nate Klingenstein, Internet2; Cameron Neylon, Science and Technology Facilities Council; Mike Roch, University of Reading; David Smith, CABI; John Watt, National e-Science Centre (Glasgow).

I’ve just written the abstract and title for my talk and will, in time-honoured fashion, be preparing the talk “just in time”, so I will try to make it as up to the minute as possible. Any comments or suggestions are welcome and the slides will be available on slideshare as soon as I have finalized them (probably just after I give the talk…)

Oh, you’re that “Cameron Neylon”: Why effective identity management is critical to the development of open research

There is a growing community developing around the need to make the outputs of research available more efficiently and more effectively. This ranges from efforts to improve the quality of data presentation in published peer reviewed papers through to efforts where the full record of research is made available online, as it is recorded. A major fear, as more material goes online in different forms, is that people will not receive credit for their contribution. The recognition of researchers’ contributions has always focussed on easily measurable quantities. As the diversity of measurable contributions increases there is a growing need to aggregate the contributions of a specific researcher together in a reliable and authoritative way. The key to changing researcher behaviour lies in creating a reward structure that acknowledges their contribution and allows them to be effectively cited. Effective mechanisms for uniquely identifying researchers are therefore at the heart of constructing reward systems that support an approach to research that fully exploits the communication technologies available to us today.

Capturing the record of research process – Part II

So in the last post I got all abstract about what the record of process might require and what it might look like. In this post I want to describe a concrete implementation that could be built with existing tools. What I want to think about is the architecture that is required to capture all of this information and what it might look like.

The example I am going to use is very simple. We will take some data and do a curve fit to it. We start with a data file, which we assume we can reference with a URI, and load it up into our package. That’s all, keep it simple. What I hope to start working on in Part III is to build a toy package that would do that and maybe fit some data to a model. I am going to assume that we are using some sort of package that utilizes a command line because that is the most natural way of thinking about generating a log file, but there is no reason why a similar framework can’t be applied to something using a GUI.

Our first job is to get our data. This data will naturally be available via a URI, properly annotated and described. In loading the data we will declare it to be of a specific type, in this case something that can be represented as two columns of numbers. So we have created an internal object that contains our data. Assuming we are running some sort of automatic logging program on our command line, our logfile will now look something like:
> start data_analysis
...loading data_analysis
...Version 0.1
...Package at: http://mycodeversioningsystem.org/myuserid/data_analysis/0.1
...Date is: 01/01/01
...Local environment is: Mac OS10.5
...Machine: localhost
...Directory: /usr/myuserid/Documents/Data/some_experiment
> data = load_data(URI)
...connecting to URI
...found data
...created two column data object "data"
...pushed "data" to http://myrepository.org/myuserid/new_object_id
..."data" aliased to http://myrepository.org/myuserid/new_object_id

Those last couple of lines are important because we want all of our intermediates to be accessible via a URI on the open web. The load_data routine will include the pushing of the newly created object, in some usable form, to an external repository. Existing services that could provide this functionality include a blog or wiki with an API, a code repository like GitHub, GoogleCode, or SourceForge, an institutional or disciplinary repository, or a service like MyExperiment.org. The key thing is that the repository must then expose the data set in a form which can be readily extracted by the data analysis tool being used. The tool then uses that publicly exposed form (or an internal representation of the same object for offline work).
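To make this a little more concrete, here is a minimal sketch, in Python, of what a logging load_data command might do behind the scenes. The file names, the repository push, and the parsing of the data are all assumptions for illustration; a real implementation would post to one of the services above and use whatever URI it was given back.

import urllib.request

LOG_FILE = "data_analysis.log"
SCRIPT_FILE = "data_analysis_script.py"

def log(message):
    # Append a human-readable line to the running log file.
    with open(LOG_FILE, "a") as f:
        f.write("..." + message + "\n")

def record_script(command):
    # Append the command, exactly as issued, to the re-runnable script record.
    with open(SCRIPT_FILE, "a") as f:
        f.write(command + "\n")

def push_to_repository(data):
    # Placeholder for the push to an external repository (blog, wiki, GitHub,
    # MyExperiment, etc.); returns the URI the repository assigns to the object.
    return "http://myrepository.org/myuserid/new_object_id"  # hypothetical

def load_data(uri):
    # Fetch a two-column data set, log each step, and record the command.
    log("connecting to " + uri)
    raw = urllib.request.urlopen(uri).read().decode()
    log("found data")
    data = [tuple(float(x) for x in line.split())
            for line in raw.splitlines() if line.strip()]
    log('created two column data object "data"')
    new_uri = push_to_repository(data)
    log('pushed "data" to ' + new_uri)
    log('"data" aliased to ' + new_uri)
    record_script("data = load_data(%r)" % uri)
    return data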

At the same time a script file is being created that, if run within the correct version of data_analysis, should generate the same results.
# Script Record
# Package: http://mycodeversioningsystem.org/myuserid/data_analysis/0.1
# User: myuserid
# Date: 01/01/01
# System: Mac OS 10.5
data = load_data(URI)

The script might well include some system scripting that would attempt to check whether the correct environment (e.g. Python) for the tool is available and to download and start up the tool itself if the script is directly executed from a GUI or command line environment. The script does not care what the new URI created for the data object was because when it is re-run it will create a new one. The Script should run independently of any previous execution of the same workflow.
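As a sketch of what that preamble might look like, again with a hypothetical package name and download location, and with the exact install mechanism left open:

import importlib
import subprocess
import sys

REQUIRED_PACKAGE = "data_analysis"   # the hypothetical toy package
REQUIRED_VERSION = "0.1"
PACKAGE_SOURCE = "http://mycodeversioningsystem.org/myuserid/data_analysis/0.1"

try:
    pkg = importlib.import_module(REQUIRED_PACKAGE)
    if getattr(pkg, "__version__", None) != REQUIRED_VERSION:
        raise ImportError("wrong version of " + REQUIRED_PACKAGE)
except ImportError:
    # Fetch the referenced version; in practice this might be a git clone or
    # a pip install pointing at the code repository.
    subprocess.check_call([sys.executable, "-m", "pip", "install", PACKAGE_SOURCE])
    pkg = importlib.import_module(REQUIRED_PACKAGE)

data = pkg.load_data("URI")  # the recorded command, replayed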

Finally there is the graph. What we have done so far is to take one data object and convert it to a new object which is a version of the original. That is then placed online to generate an accessible URI. We want our graph to assert that http://myrepository.org/myuserid/new_object_id is a version of URI (excuse my probably malformed RDF).

<data_analysis:data_object
    rdf:about= "http://myrepository.org/myuserid/new_object_id">
  <data_analysis:data_type>two_column_data</data_analysis:data_type>
  <data_analysis:generated>
    <data_analysis:generated_from rdf:resource="URI"/>
    <data_analysis:generated_by_command>load_data</data_analysis:generated_by_command>
    <data_analysis:generated_by_version rdf:resource="http://mycodeversioningsystem.org/myuserid/data_analysis/0.1"/>
    <data_analysis:generated_in_system>Mac OS 10.5</data_analysis:generated_in_system>
    <data_analysis:generated_by rdf:resource="http://myuserid.name"/>
    <data_analysis:generated_on_date dc:date="01/01/01"/>
  </data_analysis:generated>
</data_analysis:data_object>

Now this is obviously a toy example. It is relatively trivial to set up the data analysis package so as to write out these three different types of descriptive files. Each time a step is taken, that step is then described and appended to each of the three descriptions. Things will get more complicated if a process requires multiple inputs or generates multiple outputs but this is only really a question of setting up a vocabulary that makes reasonable sense. In principle multiple steps can be collapsed by combining a script file and the rdf as follows:

<data_analysis:generated_by_command
    rdf:resource="http://myrepository/myuserid/location_of_script"/>

I don’t know anything much about theoretical computer science, but it seems to me that any data analysis package that works through a stepwise process, running previously defined commands, could be described in this way. And given that this is how computer programs run, any data analysis process can be logged this way. The package obviously has to be implemented to write out the files, but in many cases this may not even be too hard. Building it in at the beginning is obviously better. The hard part is building vocabularies that make sense locally and are specific enough, but are appropriately wired into wider and more general vocabularies. It is obvious that the reference to data_analysis:data_type = “two_column_data” above should probably point to some external vocabulary that describes generic data formats and their representations (in this case probably a Python pickled two column array). It is less obvious where that should be, or whether something appropriate already exists.

This then provides a clear set of descriptive files that can be used to characterise a data analysis process. The log file provides a record of exactly what happened, that is reasonably human readable, and can be hacked using regular expressions if desired. There is no reason in principle why this couldn’t be in the form of an XML file with a style sheet appropriate for human readability. The script file provides the concept of what happened as well as the instructions for repeating the process. It could usefully be compared to a plan which would look very similar but might have informative differences. The graph is a record of the relationships between the objects that were generated. It is machine readable and can additionally be used to automate the reproduction of the process, but it is a record of what happened.

The graph is immensely powerful because it can be ultimately used to parse across multiple sets of data generated by different versions of the package and even completely different packages used by different people (provided the vocabularies have some common reference). It enables the comparison of analyses carried out in parallel by different people.

But what is most powerful about the idea of an RDF-based graph file of the process is that it can be automated and completely hidden from the user. The file may be presented to the user in some pleasant and readable form, but they need never know they are generating RDF. The process of wiring up the data web, and the subsequent process of wiring up the web of things in experimental science, will rest on having the connections captured from, and not created by, the user. This approach seems to provide a way towards making that happen.
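A sketch of how that might work in practice, using the Python rdflib library: each command simply adds its triples to a growing graph, and the user never sees the RDF unless they ask for it. The namespace and property names below mirror the hand-written example above and are assumptions, not a settled vocabulary.

from rdflib import Graph, Literal, Namespace, URIRef

DA = Namespace("http://example.org/data_analysis#")  # hypothetical vocabulary

graph = Graph()
graph.bind("data_analysis", DA)

def record_step(new_uri, source_uri, command, package_uri, system, user_uri, date):
    # Add the triples describing one step: the new object, what it was
    # generated from, and how it was generated.
    obj = URIRef(new_uri)
    graph.add((obj, DA.data_type, Literal("two_column_data")))
    graph.add((obj, DA.generated_from, URIRef(source_uri)))
    graph.add((obj, DA.generated_by_command, Literal(command)))
    graph.add((obj, DA.generated_by_version, URIRef(package_uri)))
    graph.add((obj, DA.generated_in_system, Literal(system)))
    graph.add((obj, DA.generated_by, URIRef(user_uri)))
    graph.add((obj, DA.generated_on_date, Literal(date)))

record_step(
    "http://myrepository.org/myuserid/new_object_id",
    "http://example.org/source_data",  # stands in for the original URI
    "load_data",
    "http://mycodeversioningsystem.org/myuserid/data_analysis/0.1",
    "Mac OS 10.5",
    "http://myuserid.name",
    "01/01/01",
)

graph.serialize(destination="process_graph.rdf", format="xml")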

What does this tell us about what a data analysis tool should look like? Well, ideally it will be open source, but at a minimum there must be a set of versions that can be referenced. Ideally these versions would be available on an appropriate code repository configured to enable an automatic download. They must provide, at a minimum, a log file, and preferably both script and graph versions of this log (in principle the script can be derived from either of the other two, which can be derived from each other; the log and graph can’t be derived from the script). The local vocabulary must be available online and should preferably be well wired into the wider data web. The majority of this should be trivial to implement for most command line driven tools and not terribly difficult for GUI driven tools. The more complicated aspects lie in the pushing out of intermediate objects and finalized logs onto appropriate online repositories.

A range of currently available services could play these roles, from code repositories such as Sourceforge and Github, through to the internet archive, and data and process repositories such as MyExperiment.org and Talis Connected Commons, or to locally provided repositories. Many of these have sensible APIs and/or REST interfaces that should make this relatively easy. For new analysis tools this shouldn’t be terribly difficult to implement. Implementing it in existing tools could be more challenging but not impossible. It’s a question of will rather than severe technical barriers as far as I can see. I am going to start trying to implement some of this in a toy data fitting package in Python, which will be hosted at Github, as soon as I get some specific feedback on just how bad that RDF is…

Capturing the record of research process – Part I

When it comes to getting data up on the web, I am actually a great optimist. I think things are moving in the right direction: with high profile people like Tim Berners-Lee making the case, with the meme of “Linked Data” spreading, and with a steadily improving set of tools and interfaces that make all the complexities of RDF, OWL, OAI-ORE, and other standards disappear for the average user, there is a real sense that this might come together. It will take some time; certainly years, possibly a decade, but we are well along the road.

I am less sanguine about our ability to record the processes that lead to that data, or the process by which raw data is turned into information. Process remains a real recording problem. Part of the problem is just that scientists are rather poor at recording process in most cases, often assuming that the data file generated is enough. Part of the problem is the lack of tools to do the recording effectively, which is the reason why we are interested in developing electronic laboratory recording systems.

Part of it is that the people who have been most effective at getting data onto the web tend to be informaticians rather than experimentalists. This has led in a number of cases to very sophisticated systems for describing data, and for describing experiments, that are a bit hazy about what it is they are actually describing. I mentioned this when I talked about Peter Murray-Rust and Egon Willighagen’s efforts at putting UsefulChem experiments into Chemical Markup Language. While we could describe a set of results, we got awfully confused because we weren’t sure whether we were describing a specific experiment, the average of several experiments, or a description of what we thought was happening in the experiments. This is not restricted to CML. Similar problems arise with many of the MIBBI guidelines, where the process of generating the data gets wrapped up with the data file. None of these things are bad; it is just that we may need some extra thought about distinguishing between classes and instances.

In the world of linked data, process can be thought of either as a link between objects or as an object in its own right. The distinction is important. A link between, say, a material and data files points to a vocabulary entry (e.g. PCR, HPLC analysis, NMR analysis). That is, the link creates a pointer to the class of a process, not to a specific example of carrying out that process. If the process is an object in its own right then it refers reasonably clearly to a specific instance of that process (it links this material, to that data file, on this date), but it breaks the links between the inputs and outputs. While RDF can handle this kind of break by following the links through, it creates the risk of a rapidly expanding vocabulary to hold all the potential relationships to different processes. There are other problems with this view. Take the object that represents the process: is it a log of what happened? Or is it the instructions to transform inputs into outputs? Both are important for reproducibility but they may be quite different things. They may even in some cases be incompatible, if for some reason the specific experiment went off piste (power cut, technical issues, or simply something that wasn’t recorded at the time but turned out to be important later).

The problem is we need to think clearly about what it is we mean when we point at a process. Is it merely a nexus of connections, things coming out, things going in? Is it the log of what happened? Or is it the set of instructions that lets us convert these inputs into those outputs? People who have thought about this from a vocabulary or informatics perspective tend to worry mostly about categorization: once you’ve figured out what the name of this process is and where it falls in the hierarchy, you are done. People who work on minimal description frameworks start from the presumption that we understand what the process is (a microarray experiment, electrophoresis, etc.). Everything within that is then internally consistent, but it is challenging in many cases to link out into other descriptions of other processes. Again, the distinction between what was planned to be done and what was actually done, what instructions you would give to someone else if they wanted to do it, and how you would do it yourself if you did it again, has a tendency to be somewhat muddled.

Basically process is complicated and multifaceted, with different requirements depending on your perspective. The description of a process will be made up of several different objects. There is a collection of inputs and a collection of outputs. There is a pointer to a classification, which may in itself place requirements on the process; the specific example will inherit requirements from the abstract class (any PCR has two primers, a specific PCR has two specific primers). Finally we need the log of what happened, as well as the “script” that would enable you to do it again. On top of this, for a complex process, you would need a clear machine-readable explanation of how everything is related: ideally a set of RDF statements that explain what the role of everything is, and that link the process into the wider web of data and things. Essentially the RDF graph of the process that you would use in an idealized machine reading scenario.

Diagrammatic representation of process
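To make the shape of that composite object slightly less abstract, here is a minimal sketch of it in code; the field names and example values are purely illustrative, not a proposed vocabulary.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class ProcessRecord:
    # A process as a composite object: inputs, outputs, a pointer to the
    # class of process, the log of what happened, the script to repeat it,
    # and the graph of relationships.
    process_class: str                                    # vocabulary entry, e.g. PCR
    inputs: List[str] = field(default_factory=list)       # URIs of input objects
    outputs: List[str] = field(default_factory=list)      # URIs of output objects
    log: List[str] = field(default_factory=list)          # what actually happened
    script: List[str] = field(default_factory=list)       # instructions to do it again
    graph: List[Tuple[str, str, str]] = field(default_factory=list)  # RDF-style triples

pcr = ProcessRecord(process_class="http://example.org/vocab#PCR")
pcr.inputs.append("http://myrepository.org/myuserid/template_dna")
pcr.outputs.append("http://myrepository.org/myuserid/pcr_product")
pcr.log.append("ran 30 cycles, annealing at 55C")          # illustrative values only
pcr.script.append("run_pcr(template, primer_1, primer_2, cycles=30)")
pcr.graph.append(("pcr_product", "generated_from", "template_dna"))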

This is all very abstract at the moment, but much of it is obvious and the rest I will try to tease out by example in the next post. At the core is the point that process is a complicated thing and different people will want different things from it. This means that it is necessarily going to be a composite object. The log is something that we would always want for any process (but often do not have at the moment). The “script”, the instructions to repeat the process, whether it be experimental or analytical, is also desirable and should be accessible either by stripping the correct commands out of the log or by adapting a generic set of instructions for that type of process. The new thing is the concept of the graph. This requires two things: firstly that every relevant object has an identity that can be referenced; secondly that we capture the context of each object and what we do to it at each stage of the process. It is building systems that capture this as we go that will be the challenge, and what I want to try and give an example of in the next post.

Do high art and science mix?

On Friday night we went down to London to see the final night of the English National Opera’s run of Doctor Atomic, the new opera from John Adams with a libretto by Peter Sellars. The opera focuses on the days and hours leading up to the first test firing of an atomic bomb at the Trinity site on 15 July 1945. The doctor of the title is J Robert Oppenheimer, who guided the Manhattan Project to its successful conclusion, and the drama turns around him as the linchpin connecting scientists, engineers, the military and politicians. The title recalls the science fiction novels of the 50s and 60s, but the key reference is to Thomas Mann’s much earlier doctor, Faust.

Sellars made the intriguing choice of drawing the libretto from archival documents, memoirs, recollections and interviews. The libretto places the words of the main protagonists in their own mouths; words that are alternately moving, disturbing, and dry and technical. These literal texts are supplemented with poetry drawn from Oppenheimer’s own interests. John Donne’s Holy Sonnet XIV, apparently the source of the name of the test site, Baudelaire, and the Bhagavad Gita (which, I learnt from the programme, Oppenheimer taught himself Sanskrit to read) all play a role. The libretto places the scientists in their historical context, giving them the human motivations, misgivings, and terrors, as well as the exultation of success and the technical genius that led to their ultimate triumph. The question is, does this presentation of science and scientists in the opera hall work?

The evening starts well, with a periodic table projected onto a sheer curtain. It appears to be drawn from a textbook circa 1945, with gaps beneath cesium, below manganese, and between neodymium and samarium. The choice is clever, providing a puzzle for people with all levels of scientific experience: what is it; what do the symbols mean; why are there gaps; what goes into the gaps? It also leads into the first chorus, “We believed, that matter can neither be created nor destroyed, but only altered in form…”. Unfortunately at this point things began to fall apart, the chorus and orchestra appearing to struggle with the formidable rhythmic challenges that Adams had set them. The majority of the first half felt largely shapeless, moments of rhythmic and dramatic excitement lost within large tracts of expository and largely atonal semi-recitative.

Gerald Finley and Sasha Cooke were both in magnificent voice throughout but were given limited opportunities for lyrical singing. The trademark rhythmic drive of Adams’ previous works seems broken to pieces throughout the first act; the chorus is used sparingly and the orchestra rarely feels like it is moving beyond a source of sound effects. Act 1 is redeemed, however, by a powerful closing scene in which Oppenheimer, in the shadow of the Trinity device, sings a lyrical and despairing setting of the Donne, “Batter my heart, three person’d God”, punctuated with a driving orchestral interlude reminiscent of the more familiar Adams of Klinghoffer and Nixon in China. The effect is shattering, leaving the audience reflective in the gap between acts, but it is also dramatically puzzling. Up until now Oppenheimer has shown few doubts; now suddenly he seems to be despairing.

The second half started more confidently, with both chorus and orchestra appearing to settle. It is better organised and touches the heights of Adams’ previous work, but opportunities are missed. The chorus and orchestra again seem under-utilised, and more generally the opportunity for effective musical treatment is passed over. The centre of the second act revolves around the night before the test, and the fears of those involved: fear that it won’t work, fear that the stormy weather will lead to the device detonating, the fears of General Groves that the scientists will run amok, and the largely unspoken fear of the consequences of the now inevitable use of the bomb in warfare. During the extended build to the climax and end of the opera the leads stand spread across the front of the stage, singing in turn but not together, expressing their separate fears and beliefs but not providing the dramatic synthesis that would link them all together. Leo Szilard’s petition to Truman is sung in the first half by Robert Wilson as part of the introduction but not used, as it might have been, as a dramatic chorus in the second half.

But the core dramatic failure lies in the choice of the test itself as the end and climax of the opera. In today’s world of CGI effects and surround sound it is simply not feasible to credibly render an atomic explosion on a theatre stage or in an orchestral score. The tension is built extremely effectively towards the climax, but then falls flat as the whole cast stand or crouch on stage, waiting for something to happen, the orchestra simulating the countdown through regular clock-like noises all going at different rates. The flash that signals the explosion itself is simply a large array of spots trained on the backdrop. The curtain drops and a recording of a woman asking for water in Japanese is played. Moving, yes. Dramatically satisfying? No.

We do not see the consequences of the test. We don’t see the undoubted exultation of scientists achieving what they set out to do. Nor do we see their growing realisation of the consequences. We are shown in short form the human consequences of the use of the weapon, but not the way this affects the people who built it. Ultimately, as an exploration of the inner world of Oppenheimer, Teller, and the other scientists it feels as though it falls short. Given the inability to represent the explosion itself, my personal choice would have been to let the action flow on, to the party and celebration that followed, allowing the cast to drift away, gradually sobering, and leaving Oppenheimer finally on stage to reprise the Donne.

The opera is moving because we know the consequences of what followed. I found it dramatically unsatisfying, both for structural reasons and because it did not take the full opportunity to explore the complexities of the motivations of the scientists themselves. There are two revealing moments. In the first half, having heard the impassioned Wilson reading Leo Szilard’s petition, Oppenheimer asks Wilson how he is feeling. The answer, “pretty excited actually”, captures the tension at the heart of the scientists beautifully: the excitement and adventure of the science, and the fears and disquiet over the consequences. In the second act, Edward Teller comes on stage to announce that Fermi is taking bets on whether the test will ignite the atmosphere. He goes on at some length about how his calculations suggested this would happen, Oppenheimer telling him to shut up and saying that Bethe had checked the figures. Teller continues to grandstand but then agrees magnanimously that he has now revised his figures and the world will not come to an end. Humour or arrogance? Or the heady combination of both that is not an unusual characteristic amongst scientists.

These flashes of insight into the leads’ minds never seem to be fully explored, merely occasionally exposed. Later in Act 2 Teller teases Oppenheimer and the other scientists for placing bets on explosive yields for the device much lower than they have calculated. We never find out who was right. The playful, perhaps childish, nature of the scientists is exposed but their response after the test is not dissected. What was the response of those betting when they realised they knew exactly what the yield of the device to be dropped on Hiroshima would be? Were the bets paid out or abandoned in disgust? The opera draws on the Faust legend for inspiration, but the drama of Faust turns on the inner conflict of the scientist and the personal consequences of the choices made, things we don’t see enough of here.

Finally there is the question of the text itself. The chorus is restricted largely to technical statements. Richard Morrison, reviewing in the Times (I don’t seem to be able to find the review in question online), said that Adams struggled to bring poetry to some of these texts. It seems to me, though, that there is poetry here for the right composer. I was slightly disappointed that Adams was not that composer. Don’t let me put you off seeing this. It is not one of the great operas but it is very good and in places very powerful. It has flaws and weaknesses, but if you want to spend your life waiting for perfection you are going to need to be very patient. This is a good night at the opera and a moving drama. We just need to keep waiting for the first successful opera on a scientific subject.

We believed that
“Matter can be neither
created nor destroyed
but only altered in form.”

We believed that
“Energy can be neither
created nor destroyed
but only altered in form.”

But now we know that
energy may become matter,
and now we know that
matter may become energy
and thus be altered in form.
[….]

We surround the plutonium core
from thirty two points
spaced equally around its surface,
the thirty-two points
are the centers of the
twenty triangular faces
of an icosahedron
interwoven with the
twelve pentagonal faces
of a dodecahedron.
We squeeze the sphere.
Bring the atoms closer.
Til the subcritical mass
goes supercritical.

We disturb the stable nucleus.

From the Chorus Text, Act I Scene I, Dr Atomic, Libretto arranged by Peter Sellars from a variety of texts, copyright Boosey and Hawkes

Doc Bushwell has also reviewed Dr Atomic

Fantasy Science Funding: How do we get peer review of grant proposals to scale?

This post is both a follow-up to last week’s post on the costs of peer review and a response to Duncan Hull’s post of nine or so months ago proposing a game of “Fantasy Science Funding”. The game requires you to describe how you would distribute the funding of the BBSRC if you were a benign (or not so benign) dictator. The post and the discussion should be read bearing in mind my standard disclaimer.

Peer review is in crisis. Anyone who tells you otherwise either has their head in the sand or is trying to sell you something. Volumes are increasing, quality of review is decreasing. The willingness of scientists to take on refereeing is increasingly the major problem for those who commission it. This is a problem for peer reviewed publication but the problems for the reviewing of funding applications are far worse.

For grant review, the problems that are already evident in scholarly publishing, fundamentally the increasing volume, are exacerbated by the fact that success rates for grants are falling and that successful grants are increasingly in the hands of a smaller number of people in a smaller number of places. Regardless of whether you agree with this process of concentrating grant funding, it creates a very significant perception problem. If the perception of your referees is that they have no chance of getting funding, why on earth should they referee?

Is this really happening? Well, in the UK chemistry community last year there was an outcry when two EPSRC grant rounds in a row had success rates of 10% or lower. Bear in mind this was the success rate of grants that made it to panel, i.e. it is an upper bound, assuming there weren’t any grants removed at an earlier stage. As you can imagine there was significant hand-wringing and a lot of jumping up and down, but what struck me was two statements I heard made. The first, from someone who had sat on one of the panels, was that it “raises the question of whether it is worth our time to attend panel meetings”. The second was the suggestion that the chemistry community could threaten to unilaterally withdraw from EPSRC peer review. These sentiments are now being repeated on UK mailing lists in response to EPSRC’s most recent changes to grant submission guidelines. Whether serious or not, credible or not, this shows that the compact of community contribution to the review process is perilously close to breaking down.

The research council response to this is to attempt to reduce the number of grant proposals, generally by threatening to block those who have a record of serial rejection. This will fail. With success rates as low as they are, and with successful grants concentrated in the hands of the few, most academics are serial failures. The only way departments can increase income is by increasing the volume and quality of grant applications. With little effective control over quality the focus will necessarily be on increasing volume. The only way research councils will control this is either by making applications a direct cost to departments, or by reducing the need of academics to apply.

The cost of refereeing is enormous and largely hidden. But it pales into insignificance compared to the cost of applying for grants. Low success rates make the application process an immense waste of departmental resources. The approximate average cost of running a UK academic for a year is £100,000. If you assume that each academic writes one grant proposal per year and that this takes around two weeks of full-time work, that amounts to ~£4k per academic per year. If there are 100,000 academics in the UK this is £400M, which with a 20% success rate means that £320M is lost in the UK each year. Let’s say, conservatively, that £100M is a reasonable ballpark figure.
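The arithmetic is rough, but it is easy to check; a minimal sketch with the assumptions spelled out (one proposal per academic per year, costs as above):

# Back-of-envelope check of the figures above; all numbers are rough estimates.
cost_per_academic_per_year = 100_000   # £, approximate full cost of a UK academic
weeks_per_proposal = 2                 # assumed time to write one grant proposal
academics = 100_000                    # assumed number of UK academics, one proposal each
success_rate = 0.20

cost_per_proposal = cost_per_academic_per_year * weeks_per_proposal / 52
total_cost = cost_per_proposal * academics
wasted = total_cost * (1 - success_rate)

print(f"Cost per proposal: £{cost_per_proposal:,.0f}")   # ~£3,800, call it £4k
print(f"Total cost:        £{total_cost / 1e6:.0f}M")    # ~£385M
print(f"Lost on failures:  £{wasted / 1e6:.0f}M")        # ~£300M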

In more direct terms this means that academics who are selected for their ability to do research, are being taken away from what they are good at to play a game which they will on average lose four times out of five. It would be a much more effective use of government funding to have those people actually doing research.

So this is a game of Fantasy Funding: how would I spend the money? Well, rather than discuss my biases about what science is important, which are probably not very interesting, it is perhaps more useful to think about how the system might be changed to reduce this wastage. And there is a simple, if somewhat radical, way of doing this.

Cut the budget in two and distribute half of it directly to academics on a pro-rata basis.

By letting researchers focus on getting on with research you will reduce their need for funding and reduce the burden. By setting the bar for grant funding naturally higher you still maintain the perception that everyone is in with a chance and reduce the risk of referees dropping out through disenchantment with the process. More importantly you enable innovative research by allowing it to keep ticking over, and in particular you enable a new type of peer review.

If you look at the amounts of money involved, say a few hundred million pounds for BBSRC, and divide that up amongst all bioscience academics, you end up with figures of £5-20K per academic per year. Not enough to hire a postdoc, but just about enough to run a PhD student (at least at UK rates). But what if you put that together with the money from a few other academics? If you can convince your peers that you have an interesting and fun idea then they can pool funds together. Or perhaps share a technician between two groups so that you don’t lose the entire group memory every time a student leaves? Effective collaboration will lead to a win on all sides.

If these arguments sound familiar it is because they are not so different to the notion of 20% time, best known as a Google policy of having all staff spend some time on personal projects. By supporting low level innovation and enabling small scale judging of ideas and pooling of resources it is possible to enable bottom up innovation of precisely the kind that is stifled by top down peer review.

No doubt there would be many unintended consequences, and probably a lot of wastage, but in amongst that I wouldn’t bet against the occasional brilliant innovation which is virtually impossible in the current climate.

What is clear is that doing nothing is not an option. Look at that EPSRC statement again. People with a long-term success rate below 25% will be blocked… I just checked my success rate over the past ten years (about 15% by number of grants, 70% by value, but that is dominated by one large grant). The current success rate at chemistry panel is around 15%. And that is skewed towards a limited number of people and places.

The system of peer review relies absolutely on the community’s agreement to contribute and some level of faith in the outcome. It relies absolutely on trust. That trust is perilously close to a breakdown.