Everything I know about software design I learned from Greg Wilson – and so should your students

[Image: Visualization of the "history tree", via Wikipedia]

Which is not to say that I am any good at software engineering, good practice, or writing decent code. And you shouldn’t take Greg to task for some of the dodgy demos I’ve done over the past few months either. What he does need to take the credit for is enabling me to go from someone who knew nothing at all about software design, the management of software development, or testing, to someone who can talk about these things, ask some of the right questions, and even begin to make my own judgements about code quality, all in an amazingly short period of time. I have gone from someone who didn’t know how to execute a Python script to someone who feels uncomfortable working with services where I can’t run a testing framework before deploying software.

This was possible through the online component of the training programme, called Software Carpentry, that Greg has been building, delivering and developing over the past decade. This isn’t a course in software engineering and it isn’t built for computer science undergraduates. It is a course focussed on taking scientists who have done a little bit of tinkering or scripting and giving them the tools, the literacy, and the knowledge to apply the best of the software engineering knowledge base to building useful, high quality code that solves their problems.

Code and computational quality have never been a priority in science, and there is a strong argument that we are currently paying, and will continue to pay, a heavy price for that unless we sort out the fundamentals of computational literacy and practice as these tools become ubiquitous across the whole spread of scientific disciplines. We teach people how to write up an experiment, but we don’t teach them how to document code. We teach people the importance of significant figures, but many computational scientists have never even heard of version control. And we teach the importance of proper experimental controls, but never provide the basic training in testing and validating software.
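To make that last point concrete, here is a minimal sketch, in Python, of the kind of testing habit this sort of training instils; the function and the data are my own invention for illustration, not taken from the Software Carpentry materials:

```python
# A minimal sketch of test-first habits, with an invented example
# function. Run with pytest; each test states an expected property.

def rescale(values):
    """Scale a list of numbers linearly onto the range 0..1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def test_rescale_endpoints():
    # The smallest value should map to 0 and the largest to 1.
    result = rescale([2.0, 4.0, 6.0])
    assert result[0] == 0.0
    assert result[-1] == 1.0

def test_rescale_midpoint():
    # A value halfway between the extremes should map to 0.5.
    assert rescale([2.0, 4.0, 6.0])[1] == 0.5
```

It is a tiny example, but it is exactly the analogue of a proper experimental control: an explicit, repeatable check that the computation does what you believe it does.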

Greg is seeking support to enable him to update Software Carpentry to provide an online resource for the effective training of scientists in basic computational literacy. It won’t cost very much money; we’re talking a few hundred thousand dollars here. And the impact is potentially both important and large. If you care about the training of computational scientists – not computer scientists, but the people who need, or could benefit from, some coding, data management, or processing in their day to day scientific work – and you have money, then I encourage you to contribute. If you know people or organizations with money, please encourage them to contribute. Like everything important, especially anything to do with education and preparing for the future, these things are tough to fund.

You can find Greg at his blog: http://pyre.third-bit.com

His description of what he wants to do and what he needs to do it is at: http://pyre.third-bit.com/blog/archives/3400.html


Creating a research community monoculture – just when we need diversity

This post is a follow-on from a random tweet that I sent a few weeks back in response to a query on Twitter from Lord Drayson, the UK’s Minister of State for Science and Innovation. I thought it might be an idea to expand from the 140 characters that I had to play with at the time, but it’s taken me a while to get to it. It builds on the ideas of a post from last year but is given a degree of urgency by the current changes in policy proposed by EPSRC.

Government money for research is limited, and comes from the pockets of taxpayers. It is incumbent on those of us who spend it to ensure that this investment generates maximum impact. Impact, for me, comes in two forms. Firstly there is straightforward (although not straightforward to measure) economic impact: increases in competitiveness, standard of living, development of business opportunities, social mobility, reductions in the burden of ill health and, hopefully, in environmental burden at some point in the future. The problem with economic impact is that it is almost impossible to measure in any meaningful way. The second area of impact is, at least on the surface, a little easier to track: research outputs delivered. How efficiently do we turn money into science? Scratch beneath the surface and you rapidly realise that measurement is a nightmare, but we can at least look at where there are inefficiencies, where money is being wasted and lost from the pipeline before it can be spent on research effort.

The approach that is being explicitly adopted in the UK is to concentrate research in “centres of excellence” and to “focus research on areas where the UK leads” and where “they are relevant to the UK’s needs”. At one level this sounds like motherhood and apple pie. It makes sense in terms of infrastructure investment to focus research funding both geographically and in specific subject areas. But at another level it has the potential to completely undermine the UK’s history of research excellence.

There is a fundamental problem with trying to maximise the economic impact of research. And it is one that any commercial expert, or indeed politician, should find obvious. Markets are good at picking winners; committees are very bad at it. Using committees of scientists, with little or no experience of commercialising research outputs, is likely to be an unmitigated disaster. There is no question that some research leads to commercial outcomes, but to the best of my knowledge there is no evidence that anyone has ever had any success in picking the right projects in advance. The simple fact is that the biggest form of economic impact from research is in providing and supporting the diverse and skilled workforce that supports a commercially responsive, high technology economy. To a very large extent it doesn’t actually matter what specific research you support as long as it is diverse. And you will probably generate exactly the same number of commercial outcomes by picking at random as you will by trying to pick winners.

The world, and the UK in particular, is facing severe challenges, both economic and environmental, for which there may be technological solutions. Indeed there is a real opportunity in the current economic climate to reboot the economy with low carbon technologies, and at the same time to really rebuild the information economy in a way that takes advantage of the tools the web provides, and in turn to use this to improve outcomes in health and social welfare and to develop new environmentally friendly processes and materials. The UK has great potential to lead these developments precisely because it has a diverse research community and a diverse, highly trained research and technology workforce. We are well placed to solve today’s problems with tomorrow’s technology.

Now let us return to the current UK policy proposals. These are to concentrate research, to reduce diversity, and to focus on areas of UK strength. How will those strengths be identified? No doubt by committee. Will they be forward looking strengths? No, they will be whatever a bunch of old men, already selected for their conformance to a particular stereotype – i.e. the ones doing fundable research in fundable places – identify in a closed room. It is easy to identify the big challenges. It is not easy, perhaps not even possible, to identify the technological solutions that will eventually solve them. Not the currently most promising solutions: the ones that will solve the problem five or ten years down the track.

As a thought experiment, think back to what the UK’s research strengths and challenges were 20 years ago, and imagine a world in which they were exclusively funded. It would be easy to argue that many of the UK’s current strengths simply wouldn’t exist (web technology? biotechnology? polymer materials?), and that disciplines that have subsequently shrunk or disappeared entirely would have been maintained at the cost of new innovation. Concentrating research in a few places, on a few subjects, will reduce diversity, leading to the loss of skills, and probably the loss of skilled people as researchers realise there is no future career for them in the UK. It will not provide the diverse and skilled workforce required to solve the problems we face today. Concentrating on current strengths, no matter how worthy, will lead to ossification and conservatism, making UK research ultimately irrelevant on a world stage.

What we need more than ever now is a diverse and vibrant research community working on a wide range of problems, and better communication tools so as to efficiently connect unexpected solutions to problems in different areas. This is not the usual argument for “blue skies research”, whatever that may be. It is an argument for using market forces to do what they are best at (picking the winners from a range of possible technologies) and for using the smart people currently employed in research positions at government expense to do what they are good at: research and training new researchers. It is an argument for looking critically at the expenditure of government money in a holistic way and seriously considering radical change where money is being wasted. I have estimated in the past that the annual cost of failed grant proposals to the UK government is somewhere between £100M and £500M, a large sum of money in anybody’s books. More rigorous economic analysis of a Canadian government funding scheme has shown that the cost of preparing and refereeing the proposals ($CAN40k) is more than the cost of giving every eligible applicant a support grant of $CAN30k. This is not just farcical, it is an offensive waste of taxpayers’ money.

The funding and distribution of research money requires radical overhaul. I do not believe that simply providing more money is the solution. Frankly, we’ve had a lot more money; it makes life a little more comfortable if you are in the right places, but it has reduced the pressure to solve the underlying problems. We need responsive funding at a wide range of levels that enables both bursts of research – the kind of instant collaboration that we know can work – with little or no review, and large scale data gathering projects of strategic importance that need extensive and careful critical review before being approved. And we need mechanisms to tension these against each other. We need baseline funding to just let people get on with research, and we need access to larger sums where appropriate.

We need less bureaucracy, less direction from the top, and more direction from the sides, from the community – and not just necessarily the community of researchers. What we have at the moment are strategic initiatives, announced by research councils, that are around five years behind the leading edge, and which distort and constrain real innovation. Now we have ministers proposing to identify the UK’s research strengths. No doubt these will be five to ten years out of date, and they will almost certainly stifle those pockets of excellence that will grow in strength over the next decade. No-one will ever agree what tomorrow’s strengths will be. Much better would be to get on and find out.

A specialist OpenID service to provide unique researcher IDs?

Following on from Science Online 09, and particularly the discussions on Impact Factors and researcher incentives (also on FriendFeed, with some video available at Mogulus via video on demand), as well as the article in PLoS Computational Biology by Phil Bourne and Lynn Fink, the issue of unique researcher identifiers has emerged as absolutely central to making traditional publication work better, to building a real data web that works, and to making it possible to automatically aggregate the full list of ways people contribute to the community.

Good citation practice lies at the core of good science. The value of research data is not so much in the data itself as in its context, its connection with other data and ideas. How then is it that we have no way of citing a person? We need a single, unique way of identifying researchers. This will help traditional publishers and the existing ecosystem of services by making it possible to uniquely identify authors and referees. It will make it easier for researchers to be clear about who they are and what they have done. And finally, it is a critical step in making it possible to automatically track all the contributions that people make. We’ve all seen CVs where people say they have refereed for Nature or the NIH or served on this or that panel. We can talk about micro credits, but until there are validated ways of pulling that information together and linking it to an identity that follows the person, not who they work for, we won’t make much progress.

On the other hand, most of us do not want to be locked into one system, particularly if it is controlled by one commercial organization. Thomson ISI’s ResearcherID is positioned as a solution to this problem, but I for one am not happy with being tied into using one particular service, regardless of who runs it.

In the PLoS Comp Biol article Bourne and Fink argue that one solution to this is OpenID. OpenID isn’t a service, it is a standard. This means that an identity can be hosted by a range of services and people can choose between them based on the service provided, personal philosophy, or any other reason. The central idea is that you have a single identity which you can use to sign on to a wide range of sites. In principle you sign into your OpenID and then you never see another login screen. In practice you often end up typing in your ID but at least it reduces the pain in setting up new accounts. It also provides in most cases a “home page”. If you go to http://cameron.neylon.myopenid.com you will see a (pretty limited) page with some basic information.

OpenID is becoming more popular, with a wide range of web services providing it as a login option, including Dopplr and Blogger, and research sites such as MyExperiment. Enabling OpenID is also on the list for a wide range of other services, although not always high up the priority list. As a starting point it would be very easy for researchers with an OpenID simply to add it to their address when publishing papers, thus providing a unique and easily trackable identifier that is carried through the journal, abstracting services, and the whole ecosystem of services built around them.

There are two major problems with OpenID. The first is that it is poorly supported by big players such as Google and Yahoo: both will let you use your account with them as an OpenID, but neither accepts other OpenID providers. More importantly, people just don’t seem to get OpenID. It seems unnatural, for some reason, for a person’s identity marker to be a URL rather than a number, a name, or an email address. Compounded with the limited options provided by OpenID service providers, this makes the practical use of such identifiers for researchers very much a minority activity.

So what about building an OpenID service specifically for researchers? Imagine a setup screen that asks sensible questions about where you work and what field you are in. Imagine that on the second screen, having searched through the literature databases, it presents you with a list of publications to check through, remove any mistakes from, and add any that have been missed. And then imagine that the default homepage format is similar to an academic CV.

Problem 1: People already have multiple IDs, and sometimes multiple OpenIDs. So we make at least part of the back-end file format, and much of what is exposed on the homepage, FOAF, making it possible to at least assert that you are the same person as, say, cameronneylon@yahoo.com.
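To make that concrete, here is a minimal sketch in Python, using rdflib, of the kind of FOAF statements such a profile could expose; the URIs, the mailbox, and the second identity provider are invented for illustration, not a schema the service would actually mandate:

```python
# A sketch of FOAF identity assertions for a researcher profile.
# The URIs and mailbox below are illustrative only.
from rdflib import Graph, Literal, URIRef
from rdflib.namespace import FOAF, OWL, RDF

g = Graph()
g.bind("foaf", FOAF)

me = URIRef("http://cameron.neylon.myopenid.com")
g.add((me, RDF.type, FOAF.Person))
g.add((me, FOAF.name, Literal("Cameron Neylon")))
# Tie this OpenID to other identities the person already uses.
g.add((me, FOAF.mbox, URIRef("mailto:cameronneylon@yahoo.com")))
g.add((me, OWL.sameAs, URIRef("https://another-provider.example/cameron")))

print(g.serialize(format="turtle"))
```

The point of using an open vocabulary like FOAF is exactly that any other service can read these assertions and stitch the identities together, rather than the data being locked inside one provider.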

Problem 2: Aren’t we just locking people into a specific service again? Well, no: if people don’t want to use it they can use any OpenID provider, or even set one up themselves. It is an open standard.

Problem 3: What is there to make people sign up? This is the tough one, really, and it falls into two parts. Firstly, for those of us who already have OpenIDs or accounts on other systems, isn’t this just (yet) another “me too” service? So, in accordance with the five rules I have proposed for successful researcher web services, there has to be a compelling case for using it.

For me the answer comes in part from the question itself. One of the things that comes up again and again as a complaint from researchers is the need to re-format their CV (see Schleyer et al., 2008 for a study of this). Remember that the aim here is to automatically aggregate most of the information you would put in a CV. Papers should be (relatively) easy; grants might be possible. Because we are doing this for researchers, we know what the main categories are and what they look like. That is, we have semantically structured data.
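To give a flavour of what “semantically structured” means here, this Python sketch shows one possible shape for a publication record and a renderer for it; the field names, the paper, and the citation style are all hypothetical:

```python
# A sketch of one semantically structured CV entry. The fields are
# invented for illustration; a real service would settle on a shared
# vocabulary so the same record could render as HTML, PDF or BibTeX.
publication = {
    "type": "journal-article",
    "title": "An example paper title",
    "authors": ["C. Neylon", "A. N. Other"],
    "journal": "Journal of Examples",
    "year": 2008,
    "doi": "10.1000/example",
}

def as_citation(record):
    """Render a record in one simple citation style."""
    return "{} ({}). {}. {}. doi:{}".format(
        ", ".join(record["authors"]), record["year"],
        record["title"], record["journal"], record["doi"])

print(as_citation(publication))
```

Because the structure is explicit, re-formatting for a different funder or journal is just a different renderer over the same records.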

OK, so great, I can re-format my CV more easily and I don’t need to worry about whether it is up to date with all my papers, but what about all these other sites where I need to put the same information? For this we need functionality that lets all of it be carried easily to other services: simple embed functionality, like that you see on YouTube and most other good file hosting services, which generates a little fragment of code that can easily be put in place on other sites (obviously this requires those services to allow it, which could be a problem in some cases). But imagine the relief if all the poor people who try to manage university department websites could just throw in some embed codes to automatically keep their staff pages up to date. Anyone seeing a business model here yet?
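As a sketch of the embed idea, assuming an invented service domain and URL scheme, the fragment handed out to a departmental webmaster might be generated along these lines:

```python
# A sketch of the kind of embed fragment such a service might hand out.
# The domain, URL scheme and parameters are invented for illustration.
def embed_fragment(researcher_id, section="publications", width=480):
    """Return an HTML snippet that keeps a staff page up to date."""
    src = "https://researcher-id.example/{}/{}".format(researcher_id, section)
    return '<iframe src="{}" width="{}" frameborder="0"></iframe>'.format(
        src, width)

print(embed_fragment("cameronneylon"))
```

The department page then never goes stale: the iframe always shows whatever the researcher’s profile currently holds.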

But for this to work, the real problem to be solved is the vast majority of researchers for whom this concept is totally alien. How do we get them to bother signing up for something which apparently solves a problem they don’t have? The best approach would be if journals and grant awarding bodies used OpenIDs as identifiers. This would be a dream result but doesn’t seem likely. It would require significant work on changing many existing systems, and frankly, what is in it for them? Well, one answer is that it would provide a mechanism for journals and grant bodies to publicly acknowledge the people who referee for them. An authenticated RSS feed from each journal or funder could be parsed and displayed on each researcher’s home page. The feed would expose a record of how many grants or papers each person has reviewed (probably with some delay, to prevent people linking reviews to the publication of specific papers). Of course such a feed could be used for a lot of other interesting things as well, but none of them will work without a unique person identifier.
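A minimal sketch of the aggregation side, in Python using the feedparser library; the feed URL and the convention that each entry’s author field carries the reviewer’s OpenID are assumptions, not an existing standard:

```python
# A sketch of aggregating refereeing acknowledgements from journal and
# funder feeds. The feed URL is hypothetical, and we assume each entry's
# author field holds the reviewer's OpenID URL.
from collections import Counter
import feedparser

FEEDS = ["https://journal.example/refereeing.rss"]  # hypothetical feed

counts = Counter()
for url in FEEDS:
    feed = feedparser.parse(url)
    for entry in feed.entries:
        reviewer = entry.get("author")  # assumed to carry the OpenID URL
        if reviewer:
            counts[reviewer] += 1

for reviewer, n in counts.most_common():
    print("{} refereed {} submissions".format(reviewer, n))
```

Each researcher’s home page could run exactly this kind of aggregation over the feeds of every journal and funder they work with, turning invisible service work into a visible, verifiable record.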

I don’t think this is compelling enough in itself, for the moment, but a simpler answer is what was proposed above: just encouraging people to include an OpenID as part of their address. Researchers will bend over backwards to make people happy if they believe those people have an impact on their chances of being published or getting a grant. A little thing could provide a lot of impetus, and that might bring into play the kind of effects that could result from acknowledgement, and ultimately make the case that shifting to OpenID as the login system is worth the effort. This would be particularly the case for funders, who really want to be able to aggregate information about the people they fund effectively.

There are many details to think about here. Can I use my own domain name? (Yes, redirects should be possible.) Will people who use another service be at a disadvantage? (Probably; otherwise any business model won’t really work.) Is there a business model that holds water? (I think there is, but the devil is in the details.) Should it be non-profit, for profit, or run by a respected body? (I would argue that for-profit is possible and should be pursued to make sure the service keeps improving – but then we’re back with a commercial provider.)

There are many good questions that need to be thought through but I think the principle of this could work, and if such an approach is to be successful it needs to get off the ground soon and fast.

Note: I am aware that a number of people are working behind the scenes on components of this and on similar ideas. Some of what is written above is derived from private conversations with these people and as soon as I know that their work has gone public I will add references and citations as appropriate at the bottom of this post. 

New Year’s Resolutions 2009

[Image: Sydney Harbour Bridge NYE fireworks]

All good traditions require someone to make an arbitrary decision to do something again. Last year I threw up a few New Year’s resolutions in the hours before NYE in the UK. Last night I was out on the shore of Sydney Harbour. I had the laptop – I thought about writing something – and then I thought, nah, I can just lie here and look at the pretty lights. However I did want to follow up the successes and failures of last year’s resolutions and maybe make a few more for this year.

So last year’s resolutions were, roughly speaking, 1) to adopt the principles of the NIH Open Access mandate when choosing journals for publications, 2) to get more of the existing data within my group online and available, 3) to take the whole research group fully open notebook, 4) to mention Open Notebooks in every talk I gave, and 5) attempt to get explicit funding for developing open notebook approaches.

So, successes: the research group at RAL is now (technically) working on an Open Notebook basis. This has taken a lot longer than we expected, and the guys are still getting a feel for what that means, both in terms of how they record things and how they feel about it. I think it will improve over time, and it just reinforces the message that none of this is easy. I also made a point of talking about the Open Notebook approach in every talk I gave – mostly this was well received; often there was some scepticism, but the message is getting out there.

However we didn’t do so well on picking journals – most of the papers I was on this year were driven by other people, or were directed requests for special issues, or both. The papers that I had in mind I still haven’t got written; some drafts exist, but they’re definitely not finished. I also haven’t done any real work on getting older data online – it has been enough work just trying to manage the stuff we already have.

Funding is a mixed bag – the network proposal that went in last New Year was rejected. A few proposals have gone in – more haven’t gone in but exist in draft form – and a group of us went close to getting a tender to do some research into the uptake of Web 2.0 tools in science (more on that later, but Gavin Baker has written about it and our tender document itself is available). The success of the year was the funding that Jean-Claude Bradley obtained from Submeta (as well as support from Aldrich Chemicals and Nature Publishing Group) to support the Open Notebook Science Challenge. I can’t take any credit for this, but I think it is a good sign that we may have more luck this coming year.

So for this year, there are some follow-ons and some new ones:

  1. I will re-write the network application (and will be asking for help) and re-submit it to a UK funder
  2. I will clean up the “Personal View of Open Science” series of blog posts and see if I can get it published as a perspectives article in a high ranking journal
  3. I will get some of those damn papers finished – and decide which ones are never going to be written and give up on them. Papers I have full control over will go by first preference to Gold OA journals.
  4. I will pull together the pieces needed to take action on the ideas that came out of the Southampton Open Science workshop, specifically the idea of a letter, signed by a wide range of scientists and interested people, to a high ranking journal stating the importance of working towards published papers being fully supported by data and methodological detail that is fully available
  5. I will focus on doing fewer things and doing them better – or at least making sure the resources are available to do more of the things I take on…

I think five is enough things to be going on with. Hope you all have a happy new year, whenever it may start, and that it takes you further in the direction you want to go (whether you know what that is now or not) than you thought was possible.

p.s. I noticed in the comments to last year’s post a comment from one Shirley Wu suggesting the idea of running a session at the 2009 Pacific Symposium on Biocomputing – a proposal that resulted in the session we are holding in a few days (again, more later on – we hope – streaming video, microblogging etc.). Just thinking about how much has changed in the way such an idea would be raised and explored over the last twelve months is food for thought.

The problem of academic credit and the value of diversity in the research community

This is the second in a series of posts (first one here) in which I am trying to process and collect ideas that came out of Scifoo. This post arises out of a discussion I had with Michael Eisen (UC Berkeley) and Sean Eddy (HHMI Janelia Farm) at lunch on the Saturday. We had drifted from a discussion of the problem of attribution stacking and citing datasets (and datasets made up of datasets) into the problem of academic credit. I had trotted out the usual spiel about the need for giving credit for datasets and for tool development.

Michael made two interesting points. The first was that he felt people got too much credit for datasets already and that making them more widely citeable would actually devalue the contribution. The example he cited was genome sequences. This is a case where, for historical reasons, the publication of a dataset as a paper in a high ranking journal is considered appropriate.

In a sense I agree in this case. The problem is that for this specific case it is allowable to push a dataset-sized peg into a paper-sized hole. This has arguably led to an overvaluing of the sequence data itself and an undervaluing of the science it enables. Small molecule crystallography is similar in some regards, with the publication of crystal structures in paper form bulking out the publication lists of many scientists. There is a real sense in which having a publication stream for data, making the data itself directly citeable, would lead to a devaluation of these contributions. On the other hand, it would lead to a situation where you would cite what you actually used, rather than the paper in which it was, perhaps peripherally, described. More broadly, I think the publication of data will lead to greater efficiency in research generally and more diversity in the streams to which people can contribute.

Michael’s comment on tool development was more telling though. As people at the bottom of the research tree (and I count myself amongst this group) it is easy to say ‘if only I got credit for developing this tool’, or ‘I ought to get more credit for writing my blog’, or any one of a thousand other things we feel ‘ought to count’. The problem is that there is no such thing as ‘credit’. Hiring and promotion decisions are made on the basis of perceived need, and the primary needs of any academic department are income and prestige. If we believe that people who develop tools should be more highly valued, then there is little point in giving them ‘credit’ unless that ‘credit’ will be taken seriously in hiring decisions. We have this almost precisely backwards. If a department wanted tool developers then it would say so, and would look at CVs for evidence of this kind of work. If we believe that tool developers should get more support, then we should be saying so at a higher, strategic level, not just trying to get tool development added as a standard section in academic CVs.

More widely, there is a question as to why we might think that blogs, or public lectures, or code development, or more open sharing of protocols are things for which people should be given credit. There is often a case to be made for the contribution of a specific person in a non-traditional medium, but that doesn’t mean that every blog written by a scientist is a valuable contribution. In my view it isn’t the medium that is important, but the diversity of media and the concomitant diversity of contributions that they enable. In arguing that these contributions are significant, what we are actually arguing for is diversity in the academic community.

So is diversity a good thing? The tightening and concentration of funding has, in my view, led to a decrease in diversity, both geographical and social, in the academy. In particular there is a tendency towards large groups clustered together in major institutions, generally led by very smart people. There is a strong argument that these groups can be more productive, more effective, and, crucially, offer better value for money. Scifoo is a place where those of us who are less successful come face to face with the fact that there are many people a lot smarter than us, and that these people are probably more successful for a reason. And you have to question whether your own small contribution, with a small research group, is worth the taxpayer’s money. In my view this is something you should question anyway as an academic researcher – there is far too much comfortable complacency and sense of entitlement – but that’s a story for another post.

So the question is: do I make a valid contribution? And does that provide value for money? And again, for me, Scifoo provides something of an answer. I don’t think I spoke to any person over the weekend without at least giving them something new to think about, a slightly different view on a situation, or just an introduction to something they hadn’t heard of before. These contributions were in very narrow areas, ones small enough for me to be expert in, but my background and experience provided a different view. What does this mean for me? Probably that I should focus more on what makes my background and experience unique, and build out from that in the directions most likely to provide a complementary view.

But what does it mean more generally? I think it means that a diverse set of experiences, contributions, and abilities will improve the quality of the research effort. At one session of Scifoo, on how to support ground-breaking science, I made the tongue-in-cheek comment that I thought we needed more incremental science, more filling in of tables, more laying of foundations properly. The more I think about this the more I think it is important. If we don’t have proper foundations, filled out with good data and thought through in detail, then there are real risks in building new skyscrapers. Diversity adds reinforcement by providing better tools, better datasets, and different views from which to examine the current state of opinion and knowledge. There is an obvious tension between delivering radical new technologies and knowledge, and the incremental process of filling in, backing up, and checking over the details. But too often the discussion is purely about how to achieve the first, with no attention given to the importance of the second. This is about balance, not absolutes.

So to come back around to the original point, the value of different forms of contribution is not due to the fact that they are non-traditional, or because of the medium per se; it is because they are different. If we value diversity at hiring committees, and I think we should, then by looking at a diverse set of contributions, and at the contribution a given person is likely to make in the future based on their CV, we can assess more effectively how they will differ from the people we already have. The tendency of ‘the academy’ to hire people in its own image is well established. No monoculture can ever be healthy, certainly not in a rapidly changing environment. So diversity is something we should value for its own sake, something we should try to encourage, and something that we should search CVs for evidence of. Then the credit for these activities will flow of its own accord.

More on the science exchange – or building and capitalising a data commons

[Image: Banknotes from all around the world donated by visitors to the British Museum, London. From Wikipedia via Zemanta]

Following on from the discussion a few weeks back, kicked off by Shirley at One Big Lab and continued here, I’ve been thinking about how to actually turn what was a throwaway comment into reality:

What is being generated here is new science, and science isn’t paid for per se. The resources that generate science are supported by governments, charities, and industry but the actual production of science is not supported. The truly radical approach to this would be to turn the system on its head. Don’t fund the universities to do science, fund the journals to buy science; then the system would reward increased efficiency.

There is a problem at the core of this. For someone to pay for access to the results, there has to be a monetary benefit to them. This may be through increased efficiency of their research funding, but that’s a rather vague benefit. For a serious charitable or commercial funder there has to be the potential either to make money, or at least to see that the enterprise could become self-sufficient. But surely this means monetizing the data somehow? And that would require restrictive licences, which is not, in the end, what we’re about.

The other story of the week has been the, in the end very useful, kerfuffle caused by ChemSpider moving to a CC-BY-SA licence, and the confusion that has been revealed regarding data, licensing, and the public domain. John Wilbanks, whose comments on the ChemSpider licence sparked the discussion, has written two posts [1, 2] which I found illuminating and which have made things much clearer for me. His point is that data naturally belongs in the public domain, and that the public domain and the freedom of the data itself need to be protected from erosion, both legal and conceptual, that could be caused by our obsession with licences. What does this mean for making an effective data commons, and the Science Exchange that could arise from it, financially viable?

Open Science in the Undergraduate Laboratory: Could this be the success story we’re looking for?

A whole series of things have converged in the last couple of days for me. First was Jean-Claude’s description of the work [1, 2] he and Brent Friesen of Dominican University are doing to put the combi-Ugi project into an undergraduate laboratory setting. The students will make new compounds which will then be sent for testing as antimalarial agents by Phil Rosenthal at UCSF. This is a great story and a testament in particular to Brent’s work to make the laboratory practical more relevant and exciting for his students.

At the same time I get an email from Anna Croft of the University of Bangor, Wales, after meeting up the previous day…