BioBarCamp – Meeting friends old and new and virtual

So BioBarCamp started yesterday with a bang and a great kick off. Not only did we somehow manage to start early, we were consistently running ahead of schedule. With several hours initially scheduled for introductions this actually went pretty quickly, although it was quite comprehensive. During the introductions many people expressed an interest in ‘Open Science’, ‘Open Data’, or some other open stuff, yet it was already pretty clear that many people meant many different things by this. It was suggested that with the time available we have a discussion session on what ‘Open Science’ might mean. Pedro and myself live blogged this at FriendFeed and the discussion will continue this morning.

I think for me the most striking outcome of that session was that not only is this a radically new concept for many people but that many people don’t have any background understanding of open source software either, which can make the discussion totally impenetrable to them. This, in my view, strengthens the need for having some clear brands, or standards, that are easy to point to and easy to sign up to (or not). I pitched the idea, basically adapting from John Wilbanks’ pitch at the meeting in Barcelona, that our first target should be that all data and analysis associated with a published paper should be available. This seems an unarguable basic standard, but is one that we currently fall far short of. I will pitch this again in the session I have proposed on ‘Building a data commons’.

The schedule for today is up as a googledoc spreadsheet with many difficult decisions to make. My current thinking is:

  1. Kaitlin Thaney – Open Science Session
  2. Ricardo Vidal and Vivek Murthy (OpenWetWare and Epernicus).  Using online communities to share resources efficiently.
  3. Jeremy England & Mark Kaganovich – Labmeeting, Keeping Stalin Out of Science (though I would also love to do John Cumbers on synthetic biology for space colonization, that is just so cool)
  4. Pedro Beltrao & Peter Binfield – Dealing with Noise in Science / How should scientific articles be measured.
  5. Hard choice: Andrew Hessel – building an open source biotech company or Nikesh Kotecha + Shirley Wu – Motivating annotation
  6. Another doozy: John Cumbers – Science Worship / Science Marketing or Hilary Spencer & Mathias Crawford – Interests in Scientific IP – Who Owns/Controls Scientific Communication and Data?  The Major Players.
  7. Better turn up to mine I guess :)
  8.  Joseph Perla – Cloud computing, Robotics and the future of Science and  Joel Dudley & Charles Parrot – Open Access Scientific Computing Grids & OpenMac Grid

I am beginning to think I should have brought two laptops and two webcams. Then I could have recorded one and gone to the other. Whatever happens I will try to cover as much as I can in the BioBarCamp room at FriendFeed, and where possible and appropriate I will broadcast and record via Mogulus. The wireless was a bit tenuous yesterday so I am not absolutely sure how well this will work.

Finally, this has been great opportunity to meet up with people I know and have met before, those who I feel I know well but have never met face to face, and indeed those whose name I vaguely know (or should know) but have never connected with before. I’m not going to say who is in which list because I will forget someone! But if I haven’t said hello yet do come up and harass me because I probably just haven’t connected your online persona with the person in front of me!

We need to get out more…

The speaker had started the afternoon with a quote from Ian Rogers, ‘Losers wish for scarcity. Winners leverage scale.’ He went on to eloquently, if somewhat bluntly, make the case for exposing data and discuss the importance of making it available in a useable and re-useable form. In particular he discussed the sophisticated re-analysis and mashing that properly exposed data enables while excoriating a number of people in the audience for forcing him to screen scrape data from their sites.

All in all, as you might expect, this was music to my ears. This was the case for open science made clearly and succinctly, and with passion, over the course of several days. The speaker? Mike Ellis from EduServ; I suspect both a person and an organization of which most of the readers of this blog have never heard. Why? Because he comes from a background in museums, the data he wanted was news streams, addresses, and lat long for UK higher education institutions, or library catalogues, not NMR spectra or gene sequences. Yet the case to be made is the same. I wrote last week about the need to make better connections between the open science blogosphere and the wider interested science policy and funding community. But we also need to make more effective connections with those for whom the open data agenda is part of their daily lives.

I spent several enjoyable days last week at the UKOLN Institutional Web Managers’ Workshop in Aberdeen. UKOLN is a UK centre of excellence for web based activities in HE in the UK and IWMW is their annual meeting. It is attended primarily by the people who manage web systems within UK HE including IT services, Web services, and library services, as well as the funders, and support organisations associated with these activities.

There were a number of other talks that would be of interest to this community and many of the presentations are available as video at the conference website. James Curral on Web Archiving, Stephanie Taylor on Institutional Repositories, and David Hyett of the British Antarctic Survey providing the sceptic’s view of implementing Web2.0 services for communicating with the public. His central point, which was well made, was that there is no point adding a whole bunch of whizz-bang features to an institutional website if you haven’t got the fundamentals right: quality content; straightforward navigation; relevance to the user. Where I disagreed with his position was that I felt he extrapolated from the fact that most user generated content is poor to the presumption that ‘user generated content on my site will be poor’. This to me misses the key point: that it is by focussing on community building that you generate high quality content that is of relevance to that community. Nonetheless, his central point, don’t build in features that your users don’t want or need, is well made.

David made the statement ‘90% of blogs are boring’ during his talk. I took some exception to this (I am sure the situation is far, far worse than that). In a question I made the point that it was generally accepted that Google had made the web useable by making things findable amongst the rubbish, but that for social content we needed to adopt a different kind of ‘social search’ strategy with different tools. That with the right strategies and the right tools every person could find their preferred 10% (or 1% or 0.00001%) of the world’s material. That in fact this social search approach led to the formation of new communities and new networks.

After the meeting however it struck me that I had failed to successfully execute my own advice. Mike Ellis blogs a bit, twitters a lot, and is well known within the institutional web management community. He lives not far away from me. He is a passionate advocate of data availability and has the technical smarts to do clever stuff with the data that is available. Why hadn’t I already made this connection? If I go around making the case that web based tools will transform our ability to communicate, where is the evidence that this happens in practice? Our contention is that online publishing frees up communication and allows the free flow of information and ideas. The sceptic’s contention is that it just allows us to be happy in our own little echo chamber. Elements of both are true but I think it is fair to say that we are not effectively harnessing the potential of the medium to drive forward our agenda. By broadening the community and linking up with like minded people in museums, institutional web services, archives, and libraries we can undoubtedly do better.

So there are two approaches to solving this problem, the social approach and the technical approach. Both are intertwined but can be separated to a certain extent. The social approach is to link existing communities and allow the interlinks between them to grow. This blog post is one attempt – some of you may go on to look at Mike’s Blog. Another is for people to act as supernodes within the community network. Michael Nielsen’s joining of the (mostly) life science oriented community on FriendFeed and more widely in the blogosphere has connected that community with a theoretical physics community and another ‘Open Science’ community that was largely separate from the existing online community. A small number of connections made a big difference to overall network size. I was very happy to accept the invitation to speak at the IWMW meeting precisely because I hoped to make these kinds of connections. Hopefully a few people from the meeting may read this blog post (if so please do leave a comment – let’s build on this!). We make contacts, we expand the network – but this relies very heavily on supernodes within the network and their ability to cope with the volume.

So is there a technical solution to the problem? Well in this specific case there is a technical problem to the problem. Mike doesn’t use Friendfeed but is a regular Twitter user. My most likely connection to Mike is Brian Kelly, based at UKOLN, who does have a Friendfeed account but I suspect doesn’t monitor it. The connection fails because the social networks don’t effectively interconnect. It turns out the web management community aren’t convinced by FriendFeed and prefer Twitter. So a technical solution would somehow have to bridge this gap. Right at the moment that bridge is most likely to be a person, not a machine, which leaves us back where we started, and I don’t see that changing anytime soon. The problem is an architectural one, not an application or service one. I can aggregate Twitter, FriendFeed or anything else in one place but unless everyone else does the same thing it’s not really going to help.
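The aggregation itself is mechanically trivial, which is rather the point: the hard part is social, not technical. As a minimal sketch (the services, messages, and timestamps here are invented for illustration), merging per-service feeds into one timeline is little more than a sort:

```python
from datetime import datetime

def merge_timelines(*feeds):
    """Merge per-service feeds of (timestamp, service, text) entries
    into a single chronological timeline."""
    merged = [entry for feed in feeds for entry in feed]
    merged.sort(key=lambda entry: entry[0])  # order by timestamp
    return merged

# Invented example entries from two services that don't interconnect
twitter = [
    (datetime(2008, 7, 28, 9, 15), "twitter", "Slides up for IWMW talk"),
    (datetime(2008, 7, 28, 11, 2), "twitter", "Screen scraping again..."),
]
friendfeed = [
    (datetime(2008, 7, 28, 10, 40), "friendfeed", "New blog post on open data"),
]

for when, service, text in merge_timelines(twitter, friendfeed):
    print(when.isoformat(), service, text)
```

The catch, as above, is that this only helps the person running it: unless everyone aggregates, the networks stay disconnected.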

I don’t really have a solution except once again to make the case for the value of those people who build stronger connections between poorly interconnected networks. It is not just that information is valuable, but the timely delivery of that information is valuable. These people add value. What is more, if we are going to fully exploit the potential of the web in the near term, not to mention demonstrate the value of exploiting it to others, we need to value these people and support their activities. How we do that is an open question. It will clearly cost money. The question is where to get it from and how to get it to where it needs to be.

Pedro Beltrao writes on the backlash against open science

Pedro has written a thoughtful post detailing arguments he has received against Open Practice in science. He makes a good point that as the ideas around Open Science spread there will inevitably be a backlash. Part of the response to this is to keep saying – as Pedro does, and as Jean-Claude, Bill Hooker, and others have said repeatedly – that we are not forcing anyone to take this approach. Research funders, such as the BBSRC, may have data sharing policies that require some measure of openness but, at the end of the day, if they are paying they get to call the shots.

The other case to make is that this is a more efficient and effective way of doing science. There is a danger, particularly in the US, that open approaches get labelled as ‘socialist’ or something similar. PRISM and the ACS when attacking open access have used the term ‘socialized science’. This has a particular resonance in the US and, I think, is seen as a totally bizarre argument elsewhere in the world but that is not the point. The key point to make is that the case for Open Science is a pure market based argument. Reducing barriers for re-use, breaking out of walled gardens, adds value and makes the market more efficient not less. John Wilbanks has some great blog posts on this subject and an article in Nature Precedings which I highly recommend.

In the comments in Pedro’s post Michael Kuhn asks:

Hmm, just briefly some unbalanced thoughts (I don’t have time to offer more than the advocatus diaboli argument):

Open Science == Communism? I’m wondering if a competition of scientific theories is actually necessary to further science in a sound way. Just to draw the parallel, a lot of R&D in the private sector is done in parallel and in competition, with the result of increased productivity. On the other side, we’ve had things like Comecon and five-year plans to “order” the development and reduce competition, and the result was lower productivity.

I think it is important to counter this kind of argument (and I note that Michael is playing devil’s advocate here – albeit in Latin) with arguments that use the economic benefits, and case studies, such as those used by James Boyle in his talk at the recent Science Commons run meeting in Barcelona (which I blogged about here), to show that there is a strong business case to be made. Openness may be more social but it isn’t in any sense socialist. In fact it drives us closer to a pure market than the current system in many ways. The business of building value on open content has taken off on the web. Science can do the same and open approaches are more efficient.


Policy for Open Science – reflections on the workshop

Written on the train on the way from Barcelona to Grenoble. This life really is a lot less exotic than it sounds… 

The workshop that I’ve reported on over the past few days was both positive and inspiring. There is a real sense that the ideas of Open Access and Open Data are becoming mainstream. As several speakers commented, within 12-18 months it will be very unusual for any leading institution not to have a policy on Open Access to its published literature. In many ways as far as Open Access to the published literature is concerned the war has been won. There will remain battles to be fought over green and gold routes – the role of licenses and the need to be able to text mine – successful business models remain to be made demonstrably sustainable – and there will be pain as the inevitable restructuring of the publishing industry continues. But be under no illusions that this restructuring has already begun and it will continue in the direction of more openness as long as the poster children of the movement like PLoS and BMC continue to be successful.

Open Data remains further behind, both with respect to policy and awareness. Many people spoke over the two days about Open Access and then added, almost as an addendum ‘Oh and we need to think about data as well’. I believe the policies will start to grow and examples such as the BBSRC Data Sharing Policy give a view of the future. But there is still much advocacy work to be done here. John Wilbanks talked about the need to set achievable goals, lines in the sand which no-one can argue with. And the easiest of these is one that we have discussed many times. All data associated with a published paper, all analysis, and all processing procedures, should be made available. This is very difficult to argue with – nonetheless we know of examples where the raw data of important experiments is being thrown away. But if an experiment cannot be shown to have been done, cannot be replicated and checked, can it really be publishable? Nonetheless this is a very useful marker, and a meme that we can spread and promote.

In the final session there was a more critical analysis of the situation. A number of serious questions were raised but I think they divide into two categories. The first involves the rise of the ‘Digital Natives’ or the ‘Google Generation’. The characteristics of this new generation (a gross simplification in its own right) are often presented as a pure good. Better networked, more sharing, better equipped to think in the digital network. But there are some characteristics that ought to give pause. A casualness about attribution, a sense that if something is available then it is fine to just take it (it’s not stealing after all, just copying). There is perhaps a need to recover the roots of ‘Mertonian’ science, to, as I think James Boyle put it, publicise and embed the attitudes of the last generation of scientists, for whom science was a public good and a discipline bounded by strict rules of behaviour. Some might see this as harking back to an elitist past but if we are constructing a narrative about what we want science to be then we can take the best parts of all of our history and use it to define and refine our vision. There is certainly a place for a return to the compulsory study of science history and philosophy.

The second major category of issues discussed in the last session revolved around the question of what do we actually do now. There is a need to move on many fronts, to gather evidence of success, to investigate how different open practices work – and to ask ourselves the hard questions. Which ones do work, and indeed which ones do not. Much of the meeting revolved around policy with many people in favour of, or at least not against, mandates of one sort or another. Mike Carroll objected to the term mandate – talking instead about contractual conditions. I would go further and say that until these mandates are demonstrated to be working in practice they are aspirations. When they are working in practice they will be norms, embedded in the practice of good science. The carrot may be more powerful than the stick but peer pressure is vastly more powerful than both.

So the key questions to me revolve around how we can convert aspirations into community norms. What is needed in terms of infrastructure, in terms of incentives, and in terms of funding to make this stuff happen? One thing is to focus on the infrastructure and take a very serious and critical look at what is required. It can be argued that much of the storage infrastructure is in place. I have written on my concerns about institutional repositories but the bottom line remains that we probably have a reasonable amount of disk space available. The network infrastructure is pretty good so these are two things we don’t need to worry about. What we do need to worry about, and what wasn’t really discussed very much in the meeting, is the tools that will make it easy and natural to deposit data and papers.

The incentive structure remains broken – this is not a new thing – but if sufficiently high profile people start to say this should change, and act on those beliefs, and they are, then things will start to shift. It will be slow but bit by bit we can imagine getting there. Can we take shortcuts? Well there are some options. I’ve raised in the past the idea of a prize for Open Science (or in fact two, one for an early career researcher and one for an established one). Imagine if we could make this a million dollar prize, or at least enough for someone to take a year off. High profile, significant money, and visible success for someone each year. Even without money this is still something that will help people – give them something to point to as recognition of their contribution. But money would get people’s attention.

I am sceptical about the value of ‘microcredit’ systems where a person’s diverse and perhaps diffuse contributions are aggregated together to come up with some sort of ‘contribution’ value, a number by which job candidates can be compared. Philosophically I think it’s a great idea, but in practice I can see this turning into multiple different calculations, each of which can be gamed. We already have citation counts, H-factors, publication number, integrated impact factor as ways of measuring and comparing one type of output. What will happen when there are ten or 50 different types of output being aggregated? Especially as no-one will agree on how to weight them. What I do believe is that those of us who mentor staff, or who make hiring decisions, should encourage people to describe these contributions, to include them in their CVs. If we value them, then they will value them. We don’t need to compare the number of my blog posts to someone else’s – but we can ask which is the most influential – we can compare, if subjectively, the importance of a set of papers to a set of blog posts. But the bottom line is that we should actively value these contributions – let’s start asking the questions ‘Why don’t you write online? Why don’t you make your data available? Where are your protocols described? Where is your software, your workflows?’
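To see how simple the arithmetic behind any one of these numbers is – and so how easy each would be to game once fifty of them are being weighted together – here is a minimal sketch of the H-factor (h-index) mentioned above: the largest h such that h of a researcher’s papers have at least h citations each. The citation counts are invented for illustration.

```python
def h_index(citation_counts):
    """Largest h such that h papers have at least h citations each."""
    counts = sorted(citation_counts, reverse=True)
    h = 0
    for rank, cites in enumerate(counts, start=1):
        if cites >= rank:  # the rank-th paper still clears the bar
            h = rank
        else:
            break
    return h

print(h_index([10, 8, 5, 4, 3]))  # 4: four papers with >= 4 citations
print(h_index([25, 8, 5, 3, 3]))  # 3: one huge paper doesn't move it much
```

Note how the two profiles differ enormously in total citations yet land within one point of each other – exactly the kind of collapse of diverse contributions into a single number that the paragraph above is wary of.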

Funding is key, and for me one of the main messages to come from the meeting was the need to think in terms of infrastructure, and in particular, to distinguish what is infrastructure and what is science or project driven. In one discussion over coffee I discussed the problem of how to fund development projects where the two are deeply intertwined and how this raises challenges for funders. We need new funding models to make this work. It was suggested in the final panel that as these tools become embedded in projects there will be less need to worry about them in infrastructure funding lines. I disagree. Coming from an infrastructure support organisation I think there is a desperate need for critical strategic oversight of the infrastructure that will support all science – both physical facilities, network and storage infrastructure, tools, and data. This could be done effectively using a federated model and need not be centralised but I think there is a need to support the assumption that the infrastructure is available and this should not be done on a project by project basis. We build central facilities for a reason – maybe the support and development of software tools doesn’t fit this model but I think it is worth considering.

This ‘infrastructure thinking’ goes wider than disk space and networks, wider than tools, and wider than the data itself. The concept of ‘law as infrastructure’ was briefly discussed. There was also a presentation looking at different legal models of a ‘commons’; the public domain, a contractually reconstructed commons, escrow systems etc. In retrospect I think there should have been more of this. We need to look critically at different models, what they are good for, how they work. ‘Open everything’ is a wonderful philosophical position but we need to be critical about where it will work, where it won’t, and where it needs contractual protection, or where such contractual protection is counter productive. I spoke to John Wilbanks about our ideas on taking Open Source Drug Discovery into undergraduate classes and schools and he was critical of the model I was proposing, not from the standpoint of the aims or where we want to be, but because it wouldn’t be effective at drawing in pharmaceutical companies and protecting their investment. His point was, I think, that by closing off the right piece of the picture with contractual arrangements you bring in vastly more resources and give yourself greater ability to ensure positive outcomes. That sometimes to break the system you need to start by working within it by, in this case, making it possible to patent a drug. This may not be philosophically in tune with my thinking but it is pragmatic. There will be moments, especially when we deal with the interface with commerce, where we have to make these types of decisions. There may or may not be ‘right’ answers, and if there are they will change over time but we need to know our options and know them well so as to make informed decisions on specific issues.

But finally, as is my usual wont, I come back to the infrastructure of tools. The software that will actually allow us to record and order this data that we are supposed to be sharing. Again there was relatively little on this in the meeting itself. Several speakers recognised the need to embed the collection of data and metadata within existing workflows but there was very little discussion of good examples of this. As we have discussed before this is much easier for big science than for ‘long tail’ or ‘small science’. I stand by my somewhat provocative contention that for the well described central experiments of big science this is essentially a solved problem – it just requires the will and resources to build the language to describe the data sets, their formats, and their inputs. But the problem is that even for big science, the majority of the workflow is not easily automated. There are humans involved, making decisions moment by moment, and these need to be captured. The debate over institutional repositories and self archiving of papers is instructive here. Most academics don’t deposit because they can’t be bothered. The idea of a negative click repository – where this is a natural part of the workflow can circumvent this. And if well built it can make the conventional process of article submission easier. It is all a question of getting into the natural workflow of the scientist early enough that not only do you capture all the contextual information you want, but that you can offer assistance that makes them want to put that information in.

The same is true for capturing data. We must capture it at source. This is the point where it has the potential to add the greatest value to the scientist’s workflow by making their data and records more available, by making them more consistent, by allowing them to reformat and reanalyse data with ease, and ultimately by making it easy for them to share the full record. We can and we will argue about where best to order and describe the elements of this record. I believe that this point comes slightly later – after the experiment – but wherever it happens it will be made much easier by automatic capture systems that hold as much contextual information as possible. Metadata is context – almost all of it should be possible to catch automatically. Regardless of this we need to develop a diverse ecosystem of tools. It needs to be an open and standards based ecosystem and in my view needs to be built up of small parts, loosely coupled. We can build this – it will be tough, and it will be expensive but I think we know enough now to at least outline how it might work, and this is the agenda that I want to explore at SciFoo.
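As a sketch of what ‘capture at source’ might look like in practice – the function names and the in-memory log here are purely illustrative assumptions, not any existing tool – a thin wrapper can record the contextual metadata (who ran what, when, with which parameters) automatically as an analysis step executes:

```python
import functools
import json
import os
import platform
import time

# Illustrative in-memory provenance log; a real system would write
# these records alongside the data files they describe.
provenance_log = []

def capture_context(func):
    """Automatically record contextual metadata each time a step runs."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        record = {
            "step": func.__name__,
            "args": repr(args),
            "kwargs": repr(kwargs),
            "user": os.environ.get("USER", "unknown"),
            "host": platform.node(),
            "started": time.time(),
        }
        result = func(*args, **kwargs)
        record["finished"] = time.time()
        provenance_log.append(record)
        return result
    return wrapper

@capture_context
def baseline_correct(spectrum, offset=0.0):
    """A stand-in for a routine analysis step."""
    return [point - offset for point in spectrum]

corrected = baseline_correct([1.2, 1.5, 1.9], offset=0.2)
print(json.dumps(provenance_log[0], indent=2))
```

The point of the sketch is that the scientist calls `baseline_correct` exactly as before – the metadata, being context, is caught automatically rather than demanded as an extra chore.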

John Wilbanks had the last word, and it was a call to arms. He said ‘We are the architects of Open’. There are two messages in this. The first is we need to get on and build this thing called Open Science. The moment to grasp and guide the process is now. The second is that if you want to have a part in this process the time to join the debate is now. One thing that was very clear to me was that the attendees of the meeting were largely disconnected from the more technical community that reads this and related blogs. We need to get the communication flowing in both directions – there are things the blogosphere knows, that we are far ahead on, and we need to get the information across. There are things we don’t know much about, like the legal frameworks, the high level policy discussions that are going on. We need to understand that context. It strikes me though that if we can combine the strengths of all of these communities and their differing modes of communication then we will be a powerful force for taking forward the open agenda.

Policy for open science – the wrap up session

Today Science Commons sponsored a meeting looking at the policy issues involved in Open Access and Open Science more widely. I blogged James Boyle’s keynote earlier and there were some notes along the way via Twitter. This is a set of notes from the last session of the meeting, a panel with Alexis-Michel Mugabushaka (European Science Foundation), Javier Hernandez-Ros (European Commission), Michael Carroll (University of Villanova, Creative Commons) and John Wilbanks (Science Commons). These notes were taken at speed and are my own record of what happened. They should not necessarily be taken as a transcript of what the panellists said and any inaccuracies are my own fault.

It is important to focus on the barriers to open science. Much has been said about ideals and beliefs but there remains a paucity of real evidence to support the assertions of the open science community. Changing policy, culture, and challenging entrenched positions, putting 50,000 jobs in the publishing industry at risk, requires strong evidence of benefits. On top of this sits the consideration of efficiency, with respect to both time and money. Details are important. What is data? What is meant by data sharing? Both from a legal and descriptive perspective.

Some issues looked at by the European Commission:

1) Who does what, and how is this to be organised? At national, institutional, discipline, or European level? A mixture of top down and bottom up will be required to make progress. Again, efficient provision is required because unnecessary duplication will be counterproductive and expensive.

2) The legal issues, copyright etc. The issue is not copyright per se but licensing and the power games associated with the economic needs of commercial entities. There is a need for a meaningful discussion across the stakeholders.

3) Technical issues. The whole area is developing and it is not clear where the funding streams will continue as these become more embedded in general scientific practice.

The key issue is to bring everyone on board as part of the discussion. What is the contribution that publishers can make? View this from a positive angle, not a combative angle. What are the roles of the various stakeholders?

Things actually need to be done! Much talk about what should be done. Less on how to actually do it.

This meeting is a moment in time where the idea of Openness has become mainstream. The argument started on the fringe. There is a risk of over confidence in the arguments. The issue of evidence is an important one. Can ROI actually be shown? It is known that there are unexpected audiences for information from the web – new innovation will emerge but it is unclear how much or where it will come from.

In the short term there is still a fight but in the long term the world will change – as the next generation of scientists come through. But we don’t necessarily need or want to wait for them to arrive. We may want to ensure that some values from the previous generation of scientists are transmitted effectively. This moment in time is still transitional. Most of us were born into an analogue world but we still don’t think digitally. When people design initiatives and projects with the network in mind – rather than as an afterthought or add on – the efficiencies will be seen much more strongly. There is an argument that the EU model of funding and thinking can help to drive the network effects that will demonstrate the benefits more effectively than the US funding models. A missing part is the link to business models – what is the profit model for openness? Again, Google works (and is profitable).

The fight will go on – but thinking forward, how would we work and plan for future projects with the network built in?

The law as infrastructure. Creative Commons came out of the wish of lawyers to build a technical infrastructure that works for open networks. The legal issues of sharing data haven’t been discussed very much. How can a legal infrastructure be built to assist openness? This is the role of Creative/Science Commons. ‘No-one wants to invite lawyers to the party’ but some of them really want to help it go off well!

The purpose of the meeting was to bring people together. There is a need for people to understand they are doing the same thing in different domains. It is difficult to fund the commons – but by bringing people together it is possible to build a social network that may assist in identifying funding options. Science funders don’t do policy, but policy people don’t fund science. A policy statement will be forthcoming in the future taking the current recommendations.

The benefits of the open web come from the explosion of people actually using a computer network. We must think of the users of an open-architected science with the same potential for explosion. Can we make it possible to do science without being a) rich and b) a member of a closed guild? What would happen if 100 million people could ask scientific questions – even if relatively few of them were really smart questions?

The tower of Babel problem is found all over science, but also all over computation. What we need is an institute of science languages to standardise. A cancer marker with 94 synonyms would make the Académie Française apoplectic. We are a long way from being able to solve either of these problems. How can the tools and standards be built?

We are the architects of Open. We need to link all the work that has been described in the meeting. We need an Open architecture to describe and deliver openness. The short-term impact of technology is always overestimated but the long-term impact is underestimated. We don’t have to worry too much about the long term – we can just get on and do it.

‘In theory there is no difference between theory and practice. In practice there is.’ Let’s just deploy and get on with it. Let’s build the evidence and the tools and the standards that will take us forward.

Policy and technology for e-science – A forum on open science policy

I’m in Barcelona at a satellite meeting of the EuroScience Open Forum organised by Science Commons and a number of their partners. Today is when most of the meeting will be, with forums on ‘Open Access Today’, ‘Moving OA to the Scientific Enterprise: Data, materials, software’, ‘Open access in the knowledge network’, and ‘Open society, open science: Principle and lessons from OA’. There is also a keynote from Carlos Morais-Pires of the European Commission, and the lineup for the panels is very impressive.

Last night was an introduction and social kickoff as well. James Boyle (Duke Law School, Chair of board of directors of Creative Commons, Founder of Science Commons) gave a wonderful talk (40 minutes, no slides, barely taking breath) where his central theme was the relationship between where we are today with open science and where international computer networks were in 1992. He likened making the case for open science today with that of people suggesting in 1992 that the networks would benefit from being made freely accessible, freely useable, and based on open standards. The fears that people have today of good information being lost in a deluge of dross, of there being large quantities of nonsense, and nonsense from people with an agenda, can to a certain extent be balanced against the idea that, to put it crudely, Google works. As James put it (not quite a direct quote): ‘You need to reconcile two statements, both true. 1) 99% of all material on the web is incorrect, badly written, and partial. 2) You probably haven’t opened an encyclopedia as a reference in ten years.’

James gave two further examples, the first being the availability of legal data in the US. Despite the fact that none of this is copyrightable in the US, there are thriving businesses based on it. The second, which I found compelling, concerned weather data, for reasons that Peter Murray-Rust has described in some detail. Weather data in the US is free. In a recent attempt to obtain long-term weather data, a research effort was charged on the order of $1500 for all existing US weather data – essentially the cost of the DVDs needed to ship it. By comparison, a single German state wanted millions for theirs. The consequence was that the European data didn’t go into the modelling. James made the point that while the European return on investment for weather data was a respectable nine-fold, that for the US (where, remember, they are giving it away) was 32-fold. To me, though, the really compelling part of this argument is that if that data is not made available we run the risk of being underwater in twenty years with nothing to eat. This particular case is not about money; it is potentially about survival.

Finally – and this you will not be surprised was the bit I most liked – he went on to issue a call to arms to get on and start building this thing that we might call the data commons. The time has come to actually sit down and start to take these things forward, to start solving the issues of reward structures, of identifying business models, and to build the tools and standards to make this happen. That, he said was the job for today. I am looking forward to it.

I will attempt to do some updates via twitter/friendfeed (cameronneylon on both) but I don’t know how well that will work. I don’t have a roaming data tariff and the charges in Europe are a killer so it may be a bit sparse.

Science in the YouTube Age – introductory screencast for a talk I’m giving at IWMW

The UKOLN Institutional Web Managers Workshop is running in Aberdeen from 22-24 July and I am giving a talk discussing the impact of Web2.0 tools on science. My main theme will be that the main cultural reasons for lack of uptake relate to the fear of losing control over data and ideas. Web2.0 tools rely absolutely on the willingness of people to make useful material available. In science this material is data, ideas, protocols, and analyses. Prior to publication most scientists are very sceptical of making their hard-earned data available – but without this the benefits that we know can be achieved through network effects, re-use of data, and critical analysis of data and analysis, will not be seen. The key to benefiting from web based technologies is adopting more open practices.

The video below is a screencast of a shorter version of the talk, intended to give people the opportunity to make comments, ask questions, or offer suggestions. I wanted to keep it short so there are relatively few examples in there – there will be much more in the full talk. For those who can’t make it to Aberdeen I am told that the talks are expected to be live videocast and I will provide a URL as soon as I can. If this works I am also intending to try and respond to comments and questions via FriendFeed or Twitter in real time. This may be foolhardy but we’ll see how it goes. Web2.0 is supposed to be about real-time interaction after all!

I don’t seem to be able to embed the video but you can find it here.

Open Science Workshop at Southampton – 31 August and 1 September 2008

Southampton, England, United Kingdom (image via Wikipedia)

I’m aware I’ve been trailing this idea around for some time now but it’s been difficult to pin down due to issues with room bookings. However I’m just going to go ahead, and if we end up meeting in a local bar then so be it! If Southampton becomes too difficult I might organise to have it at RAL instead, but Southampton is more convenient in many ways.

Science Blogging 2008: London will be held on August 30 at the Royal Institution, and as a number of people are coming to that it seemed a good opportunity to get a few more people together to discuss how we might move things forward. This now turns out to be one of a series of such workshops following on from Collaborating for the future of open science, organised by Science Commons as a satellite meeting of EuroScience Open Forum in Barcelona next month, BioBarCamp/Scifoo from 5-10 August and a possible Open Science Workshop at Stanford on Monday 11 August, as well as the Open Science Workshop in Hawaii (can’t let the bioinformaticians have all the good conference sites to themselves!) at the Pacific Symposium on Biocomputing.

For the Southampton meeting I would propose that we essentially look at having four themed sessions: Tools, Data standards, Policy/Funding, and Projects. Within this we adopt an unconference style where we decide who speaks based on who is there and wants to present something. My idea is essentially to meet on the Sunday evening at a local hostelry to discuss and organise the specifics of the program for Monday. On the Monday we spend the day with presentations and leave plenty of room for discussion. People can leave in the afternoon, or hang around into the evening for further discussion. We have absolutely zero, zilch, nada funding available so I will be asking for a contribution (to be finalised later but probably £10-15 each) to cover coffee/tea and lunch on the Monday.


Data is free or hidden – there is no middle ground

Science Commons and others are organising a workshop on Open Science issues as a satellite meeting of the European Science Open Forum meeting in July. This is pitched as an opportunity to discuss issues around policy, funding, and social issues with an impact on the ‘Open Research Agenda’. In preparation for that meeting I wanted to continue to explore some of the conflicts that arise between wanting to make data freely available as soon as possible and the need to protect the interests of the researchers that have generated the data and (perhaps) have a right to the benefits of exploiting it.

John Cumbers proposed the idea of a ‘Protocol’ for open science that included the idea of a ‘use embargo’; the idea that when data is initially made available, no-one else should work on it for a specified period of time. I proposed more generally that people could ask that people leave data alone for any particular period of time, but that there ought to be an absolute limit on this type of embargo to prevent data being tied up. These kinds of ideas revolve around the need to forge community norms – standards of behaviour that are expected, and to some extent enforced, by a community. The problem is that these need to evolve naturally, rather than be imposed by committee. If there isn’t community buy in then proposed standards have no teeth.

An alternative approach to solving the problem is to adopt some sort of ‘licence’: a legal or contractual framework that creates obligations about how data can be used and re-used. This could impose embargoes of the type that John suggested, perhaps as flexible clauses in the licence. One could imagine an ‘Open data – six month analysis embargo’ licence. This is attractive because it apparently gives you control over what is done with your data while also allowing you to make it freely available. This is why people who first come to the table with an interest in sharing content always start with CC-BY-NC. They want everyone to have their content, but not to make money out of it. It is only later that people realise what other effects this restriction can have.

I had rejected the licensing approach because I thought it could only work in a walled garden, something which goes against my view of what open data is about. More recently John Wilbanks has written some wonderfully clear posts on the nature of the public domain, and the place of data in it, that make clear that it can’t even work in a walled garden. Because data is in the public domain, no contractual arrangement can protect your ability to exploit that data, it can only give you a legal right to punish someone who does something you haven’t agreed to. This has important consequences for the idea of Open Science licences and standards.

If we argue as an ‘Open Science Movement’ that data is in and must remain in the public domain then, if we believe this is in the common good, we should also argue for the widest possible interpretation of what is data. The results of an experiment, regardless of how clever its design might be, are a ‘fact of nature’, and therefore in the public domain (although not necessarily publicly available). Therefore if any person has access to that data they can do whatever they like with it as long as they are not bound by a contractual arrangement. If someone breaks a contractual arrangement and makes the data freely available there is no way you can get that data back. You can punish the person who made it available if they broke a contract with you. But you can’t recover the data. The only way you can protect the right to exploit data is by keeping it secret. This is entirely different to creative content, where if someone ignores or breaks licence terms you can legally recover the content from anyone that has obtained it.

Why does this matter to the Open Science movement? Aren’t we all about making the data available for people to do whatever anyway? It matters because you can’t place any legal limitations on what people do with data you make available. You can’t put something up and say ‘you can only use this for X’ or ‘you can only use it after six months’ or even ‘you must attribute this data’. Even in a walled garden, once there is one hole, the entire edifice is gone. The only way we can protect the rights of those who generate data to benefit from exploiting it is through the hard work of developing and enforcing community norms that provide clear guidelines on what can be done. It’s that or simply keep the data secret.

What is important is that we are clear about this distinction between legal and ethical protections. We must not tell people that their data can be protected because essentially they can’t. And this is a real challenge to the ethos of open data because it means that our only absolutely reliable method for protecting people is by hiding data. Strong community norms will, and do, help but there is a need to be careful about how we encourage people to put data out there. And we need to be very strong in condemning people who do the ‘wrong’ thing. Which is why a discussion on what we believe is ‘right’ and ‘wrong’ behaviour is incredibly important. I hope that discussion kicks off in Barcelona and continues globally over the next few months. I know that not everyone can make the various meetings that are going on – but between them and the blogosphere and the ‘streamosphere‘ we have the tools, the expertise, and hopefully the will, to figure these things out.


Defining error rates in the Illumina sequence: A useful and feasible open project?

Panorama image of the EBI (left) and Sulston Laboratories (right) of the Sanger Institute on the Genome campus in Cambridgeshire, England.

Regular readers will know I am a great believer in the potential of Web2.0 tools to enable rapid aggregation of loose networks of collaborators to solve a particular problem and the possibilities of using this approach to do science better, faster, and more efficiently. The reason why we haven’t had great successes on this thus far is fundamentally down to the size of the network we have in place and the bias in the expertise of that network towards specific areas. There is a strong bioinformatics/IT bias in the people interested in these tools and this plays out in a number of fields from the people on Friendfeed, to the relative frequency of commenting on PLoS Computational Biology versus PLoS ONE.

Putting these two together one obvious solution is to find a problem that is well suited to the people who are around, may be of interest to them, and is also quite useful to solve. I think I may have found such a problem.

The Illumina next generation sequencing platform, developed originally by Solexa, is the latest kid on the block as far as the systems that have reached the market. I spent a good part of today talking about how the analysis pipeline for this system could be improved. But one thing that came out as an issue is that no-one seems to have published a detailed analysis of the types of errors that are generated experimentally by this system. Illumina have probably done this analysis in some form but have better things to do than write it up.

The Solexa system is based on sequencing by synthesis. A population of DNA molecules, all amplified from the same single molecule, is immobilised on a surface. A new strand of DNA is added, one base at a time. In the Solexa system each base has a different fluorescent marker on it plus a blocking reagent. After the base is added, and the colour read, the blocker is removed and the next base can be added. More details can be found on the genographia wiki. There are two major sources of error here. Firstly, for a proportion of each sample, the base is not added successfully. This means in the next round, that part of the sample may generate a readout for the previous base. Secondly the blocker may fail, leading to the addition of two bases, causing a similar problem but in reverse. As the cycles proceed the ends of each DNA strand in the sample get increasingly out of phase making it harder and harder to tell which is the correct signal.
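To make the phasing problem concrete, here is a purely illustrative sketch (not based on any published Illumina figures – the failure probabilities are invented for demonstration) of how a population of strands drifts out of phase over cycles when incorporation occasionally fails (phasing) or the blocker occasionally fails (prephasing):

```python
# Toy simulation of phasing/prephasing in sequencing-by-synthesis.
# p_miss: probability a strand fails to incorporate a base this cycle (phasing).
# p_over: probability the blocker fails and a strand adds two bases (prephasing).
# The values used here are illustrative assumptions, not measured rates.

def simulate_phasing(n_cycles, p_miss=0.01, p_over=0.005):
    # dist[k] = fraction of strands that have incorporated k bases so far
    dist = {0: 1.0}
    in_phase = []
    for cycle in range(1, n_cycles + 1):
        new = {}
        for k, frac in dist.items():
            new[k] = new.get(k, 0.0) + frac * p_miss                    # no incorporation
            new[k + 1] = new.get(k + 1, 0.0) + frac * (1 - p_miss - p_over)
            new[k + 2] = new.get(k + 2, 0.0) + frac * p_over            # blocker failed
        dist = new
        # the usable signal is the fraction of strands exactly in phase
        in_phase.append(dist.get(cycle, 0.0))
    return in_phase

signal = simulate_phasing(36)
print(f"in-phase fraction, cycle 1: {signal[0]:.3f}, cycle 36: {signal[-1]:.3f}")
```

Even with small per-cycle failure rates the in-phase fraction decays steadily, which is exactly why read quality deteriorates towards the end of a run and why knowing the actual rates matters.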

These error rates are probably dependent both on the identity of the base being added and the identity of the previous base. They may also be related to the number of cycles that have been carried out. There is also the possibility that the sample DNA has errors in it due to the amplification process, though these are likely to be close to insignificant. However there is no data on these error rates available. Simple, you might think: get some of the raw data and do the analysis – fit the sequence of raw intensity data to a model in which the parameters are error rates for each base.

Well we know that the availability of data makes re-processing possible and we further believe in the power of the social network. And I know that a lot of you guys are good at this kind of analysis, and might be interested in having a play with some of the raw data. It could also be a good paper – Nature Biotech/Nature Methods perhaps and I am prepared to bet it would get an interesting editorial writeup on the process as well. I don’t really have the skills to do the work but if others out there are interested then I am happy to coordinate. This could all be done, in the wild, out in the open and I think that would be a brilliant demonstration of the possibilities.

Oh, the data? We’ve got access to the raw and corrected spot intensities and the base calls from a single ‘tile’ of the phiX174 control lane for a run from the 1000 Genomes Project which can be found at http://sgenomics.org/phix174.tar.gz courtesy of Nava Whiteford from the Sanger Centre. If you’re interested in the final product you can see some of the final read data being produced here.

What I had in mind was taking the called sequence and aligning it onto phiX174 so we know the ‘true’ sequence, then using that sequence plus a model with error rates to parameterise those error rates. Perhaps there is a better way to approach the problem? There are a series of relatively simple error models that could be tried, and if the error rates can be defined then it will enable a really significant increase in both the quality and quantity of data that can be determined by these machines. I figure splitting the job up into a few small groups working on different models, putting the whole thing up on Google Code with a wiki there to coordinate and capture other issues as we go forward. Anybody up for it (and got the time)?
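As a very rough sketch of the first step – tallying mismatch rates against the known reference, keyed by the current and previous reference base – something like the following could serve as a starting point. The reference string, read positions, and error here are all made-up toy data, and a real analysis would of course work from the raw intensities and a proper alignment rather than pre-aligned base calls:

```python
from collections import defaultdict

def error_rates_by_context(reference, reads):
    """Tally mismatch rates keyed by (previous reference base, current
    reference base). `reads` is a list of (start_position, called_sequence)
    tuples assumed to be already aligned to the reference."""
    counts = defaultdict(lambda: [0, 0])  # context -> [mismatches, total]
    for start, called in reads:
        for i, base in enumerate(called):
            pos = start + i
            if pos == 0 or pos >= len(reference):
                continue  # no previous-base context, or read overruns reference
            context = (reference[pos - 1], reference[pos])
            counts[context][1] += 1
            if base != reference[pos]:
                counts[context][0] += 1
    return {ctx: mism / total for ctx, (mism, total) in counts.items()}

# Tiny invented example: a fake 10-base 'reference' and two 'reads',
# the second of which miscalls position 4 (A called as T).
ref = "ACGTACGTAC"
reads = [(0, "ACGTACGTAC"), (2, "GTTCGTAC")]
rates = error_rates_by_context(ref, reads)
print(rates[('T', 'A')])  # the context containing the single miscall
```

Extending this to condition on cycle number as well would just mean adding `i` (the position within the read) to the key, which is where the cycle-dependence question above could start to be answered.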
