Q&A in this week’s Nature – one or two (minor) clarifications

So a bit of a first for me. I can vaguely claim to have contributed two things to the print version of Nature this week. Strictly speaking my involvement in the first, the ‘From the Blogosphere’ piece on the Science Blogging Challenge, was really restricted to discussing the idea (originally from Richard Grant I believe), then a bit of cheerleading, and ultimately some judging. The second item, though, I can claim some credit for, in as much as it is a Q&A with myself and Jean-Claude Bradley that was done when we visited Nature Publishing Group in London a few weeks back.

It is great that a journal like Nature views the ideas of data publication, open notebook science, and open science in general as worthy of featuring. This is not an isolated instance either, as we can point to the good work of the Web Publishing Group, in developing useful resources such as Nature Precedings, as well as previous features in the print edition such as the Horizons article (there is also another version on Nature Precedings) written by Peter Murray-Rust. One thing I have heard said many times in recent months is that while people who advocate open science may not agree with everything NPG does with respect to copyright and access, people are impressed and encouraged by the degree of engagement that they maintain with the community.

I did however just want to clarify one or two of the things I apparently said. I am not claiming that I didn’t say those things – the interview was recorded after all – but just that on paper they don’t really quite match what I think I meant to say. Quoting from the article:

CN-Most publishers regard what we do as the equivalent of presenting at a conference, or a preprint. That hasn’t been tested across a wide range of publishers, and there’s at least one — the American Chemical Society — that doesn’t allow prepublication in any form whatsoever.

That sounds a little more extreme than what I meant to say – there are a number of publishers that don’t allow submission of material that has appeared online as a pre-print and the ACS has said that they regard online publication as equivalent to a pre-print. I don’t have any particular sympathy for the ACS but I think they probably do allow publication of material that was presented at ACS conferences.

CN-Open notebooks are practical but tough at the moment. My feeling is that the tools are not yet easy enough to use. But I would say that a larger proportion of people will be publishing electronically and openly in ten years.

Here I think what I said is too conservative on one point and possibly not conservative enough on the other. I did put my neck out and say that I think the majority of scientists will be using electronic lab notebooks of one sort or another in ten years. Funder data sharing policies will drive a much greater volume of material online post publication (hopefully with higher quality description) and this may become the majority of all research data. I think that more people will be making more material available openly as it is produced as well but I doubt that this will be a majority of people in ten years – I hope for a sizeable and significant minority and that’s what we will continue to work towards.

Linking up open science online

I am currently sitting at the dining table of Peter Murray-Rust with Egon Willighagen opposite me talking to Jean-Claude Bradley. We are pulling together sets of data from Jean-Claude’s UsefulChem project into CML to make it more semantically rich and do a bunch of cool stuff. Jean-Claude has a recently published preprint on Nature Precedings of a paper that has been submitted to JoVE. Egon was able to grab the InChIKeys from the relevant UsefulChem pages and, by passing those to CDK via a script that he wrote on the spot (which he has also just blogged), generate CMLReact for those molecules.

Peter at the same time has cut and pasted an existing example of a CMLReact XML document into a GoogleDoc, which we will then modify to represent one example of the Ugi reactions that Jean-Claude reported in the Precedings paper. You will be able to see all of these documents. The only way we could do this is with four laptops, all online, and with all the relevant documents and services available from where we are sitting. I’ve never been involved in a hackfest like this before, but the ability to handle different aspects of the same document via GoogleDocs is a very powerful way of running multiple processes at the same time.
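Egon’s actual script used the CDK Java libraries, which I haven’t reproduced here. As a rough illustration of just the first step he describes, spotting InChIKey-shaped strings in a page and wrapping them in a skeletal CML fragment, a sketch might look like the following (the sample page text, element layout, and namespace handling are simplified assumptions on my part, not the real UsefulChem pipeline):

```python
import re
import xml.etree.ElementTree as ET

# Standard InChIKey layout: 14 uppercase letters, hyphen, 10 letters, hyphen, 1 letter
INCHIKEY_RE = re.compile(r"\b[A-Z]{14}-[A-Z]{10}-[A-Z]\b")

def extract_inchikeys(page_text):
    """Pull anything shaped like an InChIKey out of a page dump."""
    return sorted(set(INCHIKEY_RE.findall(page_text)))

def to_cml(inchikeys):
    """Wrap the keys in a skeletal CML moleculeList."""
    root = ET.Element("moleculeList", xmlns="http://www.xml-cml.org/schema")
    for key in inchikeys:
        mol = ET.SubElement(root, "molecule", id=key)
        ET.SubElement(mol, "name", convention="iupac:inchikey").text = key
    return ET.tostring(root, encoding="unicode")

# Invented stand-in for a UsefulChem page
page = "Product: BSYNRYMUTXBXSQ-UHFFFAOYSA-N (aspirin), solvent MeOH"
print(to_cml(extract_inchikeys(page)))
```

In the real workflow the keys would of course be resolved to full structures before anything reaction-level could be generated; this only shows how little glue code the extraction step needs.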

Re-inventing the wheel (again) – what the open science movement can learn from the history of the PDB

One of the many great pleasures of SciFoo was to meet with people who had a different, and in many cases much more comprehensive, view of managing data and making it available. One of the long term champions of data availability is Professor Helen Berman, the head of the Protein Data Bank (the international repository for biomacromolecular structures), and I had the opportunity to speak with her for some time on the Friday afternoon before SciFoo kicked off in earnest (in fact this was one of many somewhat embarrassing situations where I would carefully explain my background in my very best ‘speaking to non-experts’ voice only to find they knew far more about it than I did – however Jim Hardy of Gahaga Biosciences takes the gold medal for this event for turning to the guy called Larry next to him while having dinner at Google Headquarters and asking what line of work he was in).

I have written before about how the world might look if the PDB and other biological databases had never existed, but as I said then I didn’t know much of the real history. One of the things I hadn’t realised was how long it was after the PDB was founded before deposition became expected for all atomic resolution biomacromolecular structures. The road from a repository of seven structures with a handful of new submissions a year to today’s standard, under which any structure published in a reputable journal must be deposited, was a long and rocky one. The requirement to deposit structures on publication only became general in the early 1990s, nearly twenty years after the PDB was founded, and even then only after a drawn-out process in which the case for making the data available was gradually accepted by the community.

Helen made the point strongly that it had taken 37 years to get the PDB to where it is today; a gold standard international and publicly available repository of a specific form of research data supported by a strong set of community accepted, and enforced, rules and conventions. We don’t want to take another 37 years to achieve the widespread adoption of high standards in data availability and open practice in research more generally. So it is imperative that we learn the lessons and benefit from the experience of those who built up the existing repositories. We need to understand where things went wrong and attempt to avoid repeating mistakes. We need to understand what worked well and use this to our advantage. We also need to recognise where the technological, social, and political environment that we find ourselves in today means that things have changed, and perhaps to recognise that in many ways, particularly in the way people behave, things haven’t changed at all.

I’ve written this in a hurry and therefore not searched as thoroughly as I might but I was unable to find any obvious ‘history of the PDB’ online. I imagine there must be some out there – but they are not immediately accessible. The Open Science movement could benefit from such documents being made available – indeed we could benefit from making them required reading. While at Scifoo Michael Nielsen suggested the idea of a panel of the great and the good – those who would back the principles of data availability, open access publication, and the free transfer of materials. Such a panel would be great from the perspective of publicity but as an advisory group it could have an even greater benefit by providing the opportunity to benefit from the experience many of these people have in actually doing what we talk about.

Southampton Open Science Workshop 31 August and 1 September

An update on the Workshop that I announced previously. We have a number of people confirmed to come down and I need to start firming up numbers. I will be emailing a few people over the weekend so sorry if you get this via more than one route. The plan of attack remains as follows:

Meet on evening of Sunday 31 August in Southampton, most likely at a bar/restaurant near the University to coordinate/organise the details of sessions.

Commence on Monday at ~9:30 and finish around 4:30pm (with the option of discussion going into the evening) with three or four sessions over the course of the day broadly divided into the areas of tools, social issues, and policy. We have people interested and expert in all of these areas coming so we should be able to have a good discussion. The object is to keep it very informal while keeping the discussion productive. Numbers are likely to be around 15-20 people. For those not lucky enough to be in the area we will aim to record and stream the sessions, probably using a combination of dimdim, mogulus, and slideshare. Some of these may require you to be signed into our session so if you are interested drop me a line at the account below.

To register for the meeting please send me an email to my gmail account (cameronneylon). To avoid any potential confusion, even if you have emailed me in the past week or so about this please email again so that I have a comprehensive list in one place. I will get back to you with a request via PayPal for £15 to cover coffees and lunch for the day (so if you have a PayPal account you want to use please send the email from that address). If there is a problem with the cost please say so in your email and we will see what we can do. We can suggest options for accommodation but will ask you to sort it out for yourself.

I have set up a wiki to discuss the workshop which is currently completely open access. If I see spam or hacking problems I will close it down to members only (so it would be helpful if you could create an account) but hopefully it might last a few weeks in the open form. Please add your name and any relevant details you are happy to give out to the Attendees page and add any presentations or demos you would be interested in giving, or would be interested in hearing about, on the Programme suggestion page.

Notes from Scifoo

I am too tired to write anything even vaguely coherent. As will have been obvious there was little opportunity for microblogging, I managed to take no video at all, and not even any pictures. It was non-stop, at a level of intensity that I have very rarely encountered anywhere before. The combination of breadth and sharpness that many of the participants brought was, to be frank, pretty intimidating, but their willingness to engage and discuss, and my realisation that, at least in very specific areas, I can hold my own, made the whole process very exciting. I have many new ideas, have been challenged to my core about what I do and how I do it, and in many ways I am emboldened about what we can achieve in the area of open data and open notebooks. Here are just some thoughts that I will try to collect some posts around in the next few days.

  • We need to stop fretting about what should be counted as ‘academic credit’. In another two years there will be another medium, another means of communication, and by then I will probably be conservative enough to dismiss it. Instead of just thinking that diversifying the sources of credit is a good thing we should ask what we want to achieve. If we believe that we need a more diverse group of people in academia then that is what we should articulate – Courtesy of a discussion with Michael Eisen and Sean Eddy.
  • ‘Open Science’ is a term so vague as to be actively dangerous (we already knew that). We need a clear articulation of principles or a charter. A set of standards that are clear, and practical in the current climate. As these will be lowest common denominator standards at the beginning we need a mechanism that enables or encourages a process of incrementally raising those standards. The electronic Geophysical Year Declaration is a good working model for this – Courtesy of session led by Peter Fox.
  • The social and personal barriers to sharing data can be codified and made sense of (and this has been done). We can use this understanding to frame structures that will make more data available – session led by Christine Borgman
  • The Open Science movement needs to harness the experience of developing the open data repositories that we now take for granted. The PDB took decades of continuous work to bring to its current state and much of it was a hard slog. We don’t want to take that much time this time round – Courtesy of discussion led by Helen Berman.
  • Data integration is tough, but it is not helped by the fact that bench biologists don’t get ontologies, and that ontologists and their proponents don’t really get what the biologists are asking. I know I have an agenda on this but social tagging can be mapped after the fact onto structured data (as demonstrated to me by Ben Good). If we get the keys right then much else will follow.
  • Don’t schedule a session at the same time as Martin Rees does one of his (aside from anything else you miss what was apparently a fabulous presentation).
  • Prosthetic limbs haven’t changed in 100 years and they suck. Might an open source approach to building a platform be the answer – discussion with Jon Kuniholm, founder of the Open Prosthetics Project.
  • The platform for Open Science is very close and some of the key elements are falling into place. In many ways this is no longer a technical problem.
  • The financial system backing academic research is broken when the cost of reproducing or refuting specific claims rises to 10 to 20-fold higher than the original work. Open Notebook Science is a route to reducing this cost – discussion with Jamie Heywood.
  • Chris Anderson isn’t entirely wrong – but he likes being provocative in his articles.
  • Google run a fantastically slick operation. Down to the fact that the chocolate-coated oatmeal biscuit ice cream sandwiches are specially ordered in, made with proper sugar instead of high-fructose corn syrup.

Enough. Time to sleep.

We need to get out more…

The speaker had started the afternoon with a quote from Ian Rogers, ‘Losers wish for scarcity. Winners leverage scale.’ He went on to eloquently, if somewhat bluntly, make the case for exposing data and discuss the importance of making it available in a useable and re-useable form. In particular he discussed the sophisticated re-analysis and mashing that properly exposed data enables while excoriating a number of people in the audience for forcing him to screen scrape data from their sites.

All in all, as you might expect, this was music to my ears. This was the case for open science made clearly and succinctly, and with passion. The speaker? Mike Ellis from EduServ; I suspect both a person and an organization of which most of the readers of this blog have never heard. Why? Because he comes from a background in museums, and the data he wanted was news streams, addresses, and lat/long coordinates for UK higher education institutions, or library catalogues, not NMR spectra or gene sequences. Yet the case to be made is the same. I wrote last week about the need to make better connections between the open science blogosphere and the wider interested science policy and funding community. But we also need to make more effective connections with those for whom the open data agenda is part of their daily lives.

I spent several enjoyable days last week at the UKOLN Institutional Web Managers’ Workshop in Aberdeen. UKOLN is a UK centre of excellence for web based activities in HE in the UK and IWMW is their annual meeting. It is attended primarily by the people who manage web systems within UK HE including IT services, Web services, and library services, as well as the funders, and support organisations associated with these activities.

There were a number of other talks that would be of interest to this community and many of the presentations are available as video at the conference website. James Curral spoke on Web Archiving, Stephanie Taylor on Institutional Repositories, and David Hyett of the British Antarctic Survey provided the sceptic’s view of implementing Web2.0 services for communicating with the public. His central point, which was well made, was that there is no point adding a whole bunch of whizz-bang features to an institutional website if you haven’t got the fundamentals right: quality content; straightforward navigation; relevance to the user. Where I disagreed with his position was that I felt he extrapolated from the fact that most user generated content is poor to the presumption that ‘user generated content on my site will be poor’. This to me misses the key point: that it is by focussing on community building that you generate high quality content that is of relevance to that community. Nonetheless, his central point, don’t build in features that your users don’t want or need, is well made.

David made the statement ‘90% of blogs are boring’ during his talk. I took some exception to this (I am sure the situation is far, far worse than that). In a question I made the point that it was generally accepted that Google had made the web useable by making things findable amongst the rubbish, but that for social content we needed to adopt a different kind of ‘social search’ strategy with different tools. That with the right strategies and the right tools every person could find their preferred 10% (or 1% or 0.00001%) of the world’s material. That in fact this social search approach led to the formation of new communities and new networks.

After the meeting however it struck me that I had failed to successfully execute my own advice. Mike Ellis blogs a bit, twitters a lot, and is well known within the institutional web management community. He lives not far away from me. He is a passionate advocate of data availability and has the technical smarts to do clever stuff with the data that is available. Why hadn’t I already made this connection? If I go around making the case that web based tools will transform our ability to communicate, where is the evidence that this happens in practice? Our contention is that online publishing frees up communication and allows the free flow of information and ideas. The sceptic’s contention is that it just allows us to be happy in our own little echo chamber. Elements of both are true but I think it is fair to say that we are not effectively harnessing the potential of the medium to drive forward our agenda. By broadening the community and linking up with like minded people in museums, institutional web services, archives, and libraries we can undoubtedly do better.

So there are two approaches to solving this problem, the social approach and the technical approach. Both are intertwined but can be separated to a certain extent. The social approach is to link existing communities and allow the interlinks between them to grow. This blog post is one attempt – some of you may go on to look at Mike’s blog. Another is for people to act as supernodes within the community network. Michael Nielsen’s joining of the (mostly) life science oriented community on FriendFeed, and more widely in the blogosphere, has connected that community with a theoretical physics community and another ‘Open Science’ community that was largely separate from the existing online community. A small number of connections made a big difference to overall network size. I was very happy to accept the invitation to speak at the IWMW meeting precisely because I hoped to make these kinds of connections. Hopefully a few people from the meeting may read this blog post (if so please do leave a comment – let’s build on this!). As we make contacts we expand the network – but this relies very heavily on supernodes within the network and their ability to cope with the volume.

So is there a technical solution to the problem? Well in this specific case there is a technical dimension to the problem. Mike doesn’t use FriendFeed but is a regular Twitter user. My most likely connection to Mike is Brian Kelly, based at UKOLN, who does have a FriendFeed account but I suspect doesn’t monitor it. The connection fails because the social networks don’t effectively interconnect. It turns out the web management community aren’t convinced by FriendFeed and prefer Twitter. So a technical solution would somehow have to bridge this gap. Right at the moment that bridge is most likely to be a person, not a machine, which leaves us back where we started, and I don’t see that changing anytime soon. The problem is an architectural one, not an application or service one. I can aggregate Twitter, FriendFeed or anything else in one place but unless everyone else does the same thing it’s not really going to help.
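The personal aggregation step really is mechanically trivial, which is what makes the architectural problem so frustrating. A sketch of merging per-service item lists into one reverse-chronological stream might look like this (the item shape and the sample entries are invented for illustration, not any real API):

```python
from datetime import datetime, timezone

def merge_feeds(*feeds):
    """Merge per-service item lists into one newest-first stream.
    Each item is a dict with 'when' (datetime), 'who', 'text', 'service'."""
    merged = [item for feed in feeds for item in feed]
    return sorted(merged, key=lambda item: item["when"], reverse=True)

# Invented sample items standing in for fetched Twitter/FriendFeed entries
twitter = [{"when": datetime(2008, 7, 25, 9, 30, tzinfo=timezone.utc),
            "who": "mikeellis", "text": "leverage scale", "service": "twitter"}]
friendfeed = [{"when": datetime(2008, 7, 25, 10, 0, tzinfo=timezone.utc),
               "who": "cameronneylon", "text": "open data", "service": "friendfeed"}]

stream = merge_feeds(twitter, friendfeed)
print([item["service"] for item in stream])
```

But this only solves the problem for the one person running it; unless the people at the other end of the failed connection aggregate too, the networks still don’t touch.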

I don’t really have a solution except once again to make the case for the value of those people who build stronger connections between poorly interconnected networks. It is not just that information is valuable, but the timely delivery of that information is valuable. These people add value. What is more, if we are going to fully exploit the potential of the web in the near term, not to mention demonstrate the value of exploiting it to others, we need to value these people and support their activities. How we do that is an open question. It will clearly cost money. The question is where to get it from and how to get it to where it needs to be.

Pedro Beltrao writes on the backlash against open science

Pedro has written a thoughtful post detailing arguments he has received against Open Practice in science. He makes a good point that as the ideas around Open Science spread there will inevitably be a backlash. Part of the response to this is to keep saying – as Pedro does and as Jean-Claude, Bill Hooker and others have said repeatedly that we are not forcing anyone to take this approach. Research funders, such as the BBSRC, may have data sharing policies that require some measure of openness but, at the end of the day, if they are paying they get to call the shots.

The other case to make is that this is a more efficient and effective way of doing science. There is a danger, particularly in the US, that open approaches get labelled as ‘socialist’ or something similar. PRISM and the ACS when attacking open access have used the term ‘socialized science’. This has a particular resonance in the US and, I think, is seen as a totally bizarre argument elsewhere in the world but that is not the point. The key point to make is that the case for Open Science is a pure market based argument. Reducing barriers for re-use, breaking out of walled gardens, adds value and makes the market more efficient not less. John Wilbanks has some great blog posts on this subject and an article in Nature Precedings which I highly recommend.

In the comments in Pedro’s post Michael Kuhn asks:

Hmm, just briefly some unbalanced thoughts (I don’t have time to offer more than the advocatus diaboli argument):

Open Science == Communism? I’m wondering if a competition of scientific theories is actually necessary to further science in a sound way. Just to draw the parallel, a lot of R&D in the private sector is done in parallel and in competition, with the result of increased productivity. On the other side, we’ve had things like Comecon and five-year plans to “order” the development and reduce competition, and the result was lower productivity.

I think it is important to counter this kind of argument (and I note that Michael is playing devil’s advocate here – albeit in Latin) with arguments that use the economic benefits, and case studies, such as those used by James Boyle in his talk at the recent Science Commons run meeting in Barcelona (which I blogged about here), to show that there is a strong business case to be made. Openness may be more social but it isn’t in any sense socialist. In fact it drives us closer to a pure market than the current system in many ways. The business of building value on open content has taken off on the web. Science can do the same and open approaches are more efficient.

Policy for Open Science – reflections on the workshop

Written on the train on the way from Barcelona to Grenoble. This life really is a lot less exotic than it sounds… 

The workshop that I’ve reported on over the past few days was both positive and inspiring. There is a real sense that the ideas of Open Access and Open Data are becoming mainstream. As several speakers commented, within 12-18 months it will be very unusual for any leading institution not to have a policy on Open Access to its published literature. In many ways, as far as Open Access to the published literature is concerned, the war has been won. There will remain battles to be fought over green and gold routes, over the role of licenses and the need to be able to text mine; successful business models remain to be made demonstrably sustainable; and there will be pain as the inevitable restructuring of the publishing industry continues. But be under no illusions that this restructuring has already begun and it will continue in the direction of more openness as long as the poster children of the movement like PLoS and BMC continue to be successful.

Open Data remains further behind, both with respect to policy and awareness. Many people spoke over the two days about Open Access and then added, almost as an addendum ‘Oh and we need to think about data as well’. I believe the policies will start to grow and examples such as the BBSRC Data Sharing Policy give a view of the future. But there is still much advocacy work to be done here. John Wilbanks talked about the need to set achievable goals, lines in the sand which no-one can argue with. And the easiest of these is one that we have discussed many times. All data associated with a published paper, all analysis, and all processing procedures, should be made available. This is very difficult to argue with – nonetheless we know of examples where the raw data of important experiments is being thrown away. But if an experiment cannot be shown to have been done, cannot be replicated and checked, can it really be publishable? Nonetheless this is a very useful marker, and a meme that we can spread and promote.

In the final session there was a more critical analysis of the situation. A number of serious questions were raised but I think they divide into two categories. The first involves the rise of the ‘Digital Natives’ or the ‘Google Generation’. The characteristics of this new generation (a gross simplification in its own right) are often presented as a pure good. Better networked, more sharing, better equipped to think in the digital network. But there are some characteristics that ought to give pause. A casualness about attribution, a sense that if something is available then it is fine to just take it (it’s not stealing after all, just copying). There is perhaps a need to recover the roots of ‘Mertonian’ science, to, as I think James Boyle put it, publicise and embed the attitudes of the last generation of scientists, for whom science was a public good and a discipline bounded by strict rules of behaviour. Some might see this as harking back to an elitist past but if we are constructing a narrative about what we want science to be then we can take the best parts of all of our history and use it to define and refine our vision. There is certainly a place for a return to the compulsory study of science history and philosophy.

The second major category of issues discussed in the last session revolved around the question of what do we actually do now. There is a need to move on many fronts, to gather evidence of success, to investigate how different open practices work – and to ask ourselves the hard questions. Which ones do work, and indeed which ones do not. Much of the meeting revolved around policy with many people in favour of, or at least not against, mandates of one sort or another. Mike Carroll objected to the term mandate – talking instead about contractual conditions. I would go further and say that until these mandates are demonstrated to be working in practice they are aspirations. When they are working in practice they will be norms, embedded in the practice of good science. The carrot may be more powerful than the stick but peer pressure is vastly more powerful than both.

So the key questions to me revolve around how we can convert aspirations into community norms. What is needed in terms of infrastructure, in terms of incentives, and in terms of funding to make this stuff happen? One thing is to focus on the infrastructure and take a very serious and critical look at what is required. It can be argued that much of the storage infrastructure is in place. I have written on my concerns about institutional repositories but the bottom line remains that we probably have a reasonable amount of disk space available. The network infrastructure is pretty good so these are two things we don’t need to worry about. What we do need to worry about, and what wasn’t really discussed very much in the meeting, is the tools that will make it easy and natural to deposit data and papers.

The incentive structure remains broken – this is not a new thing – but if sufficiently high profile people start to say this should change, and act on those beliefs, and they are, then things will start to shift. It will be slow but bit by bit we can imagine getting there. Can we take shortcuts? Well, there are some options. I’ve raised in the past the idea of a prize for Open Science (or in fact two, one for an early career researcher and one for an established one). Imagine if we could make this a million dollar prize, or at least enough for someone to take a year off. High profile, significant money, and visible success for someone each year. Even without money this is still something that will help people – give them something to point to as recognition of their contribution. But money would get people’s attention.

I am sceptical about the value of ‘microcredit’ systems where a person’s diverse and perhaps diffuse contributions are aggregated together to come up with some sort of ‘contribution’ value, a number by which job candidates can be compared. Philosophically I think it’s a great idea, but in practice I can see this turning into multiple different calculations, each of which can be gamed. We already have citation counts, H-factors, publication number, integrated impact factor as ways of measuring and comparing one type of output. What will happen when there are ten or 50 different types of output being aggregated? Especially as no-one will agree on how to weight them. What I do believe is that those of us who mentor staff, or who make hiring decisions should encourage people to describe these contributions, to include them in their CVs. If we value them, then they will value them. We don’t need to compare the number of my blog posts to someone else’s – but we can ask which is the most influential – we can compare, if subjectively, the importance of a set of papers to a set of blog posts. But the bottom line is that we should actively value these contributions – let’s start asking the questions ‘Why don’t you write online? Why don’t you make your data available? Where are your protocols described? Where is your software, your workflows?’
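A toy example makes the gaming problem concrete: with the same outputs, two equally defensible weighting schemes (both invented here purely for illustration, as are the candidates) reverse the ranking:

```python
def score(outputs, weights):
    """Weighted sum of output counts; every choice of weights is contestable."""
    return sum(weights.get(kind, 0) * n for kind, n in outputs.items())

# Hypothetical candidates with different mixes of contributions
alice = {"papers": 4, "blog_posts": 40, "datasets": 2}
bob   = {"papers": 8, "blog_posts": 0,  "datasets": 0}

# Two plausible weighting schemes
traditional = {"papers": 10, "blog_posts": 0, "datasets": 5}
web_native  = {"papers": 10, "blog_posts": 2, "datasets": 20}

print(score(alice, traditional), score(bob, traditional))  # 50 80 -> Bob ahead
print(score(alice, web_native),  score(bob, web_native))   # 160 80 -> Alice ahead
```

Neither scheme is ‘wrong’, which is exactly why a single aggregated number invites gaming rather than settling anything; the subjective comparison of actual contributions has to stay in the loop.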

Funding is key, and for me one of the main messages to come from the meeting was the need to think in terms of infrastructure, and in particular, to distinguish what is infrastructure and what is science or project driven. In one discussion over coffee I discussed the problem of how to fund development projects where the two are deeply intertwined and how this raises challenges for funders. We need new funding models to make this work. It was suggested in the final panel that as these tools become embedded in projects there will be less need to worry about them in infrastructure funding lines. I disagree. Coming from an infrastructure support organisation I think there is a desperate need for critical strategic oversight of the infrastructure that will support all science – both physical facilities, network and storage infrastructure, tools, and data. This could be done effectively using a federated model and need not be centralised but I think there is a need to support the assumption that the infrastructure is available and this should not be done on a project by project basis. We build central facilities for a reason – maybe the support and development of software tools doesn’t fit this model but I think it is worth considering.

This ‘infrastructure thinking’ goes wider than disk space and networks, wider than tools, and wider than the data itself. The concept of ‘law as infrastructure’ was briefly discussed. There was also a presentation looking at different legal models of a ‘commons’: the public domain, a contractually reconstructed commons, escrow systems, and so on. In retrospect I think there should have been more of this. We need to look critically at different models, what they are good for, and how they work. ‘Open everything’ is a wonderful philosophical position, but we need to be critical about where it will work, where it won’t, where it needs contractual protection, and where such contractual protection is counterproductive. I spoke to John Wilbanks about our ideas on taking Open Source Drug Discovery into undergraduate classes and schools, and he was critical of the model I was proposing – not from the standpoint of the aims or where we want to be, but because it wouldn’t be effective at drawing in pharmaceutical companies and protecting their investment. His point was, I think, that by closing off the right piece of the picture with contractual arrangements you bring in vastly more resources and give yourself greater ability to ensure positive outcomes. Sometimes, to break the system, you need to start by working within it – in this case, by making it possible to patent a drug. This may not be philosophically in tune with my thinking, but it is pragmatic. There will be moments, especially when we deal with the interface with commerce, where we have to make these types of decisions. There may or may not be ‘right’ answers, and if there are they will change over time, but we need to know our options, and know them well, so as to make informed decisions on specific issues.

But finally, as is my usual wont, I come back to the infrastructure of tools: the software that will actually allow us to record and order this data that we are supposed to be sharing. Again there was relatively little on this in the meeting itself. Several speakers recognised the need to embed the collection of data and metadata within existing workflows, but there was very little discussion of good examples of this. As we have discussed before, this is much easier for big science than for ‘long tail’ or ‘small science’. I stand by my somewhat provocative contention that for the well described central experiments of big science this is essentially a solved problem – it just requires the will and resources to build the language to describe the data sets, their formats, and their inputs. But the problem is that even for big science, the majority of the workflow is not easily automated. There are humans involved, making decisions moment by moment, and these need to be captured. The debate over institutional repositories and self-archiving of papers is instructive here. Most academics don’t deposit because they can’t be bothered. The idea of a negative-click repository – one where deposit is a natural part of the workflow – can circumvent this. And, if well built, it can make the conventional process of article submission easier. It is all a question of getting into the natural workflow of the scientist early enough that not only do you capture all the contextual information you want, but you can also offer assistance that makes them want to put that information in.

The same is true for capturing data. We must capture it at source. This is the point where it has the potential to add the greatest value to the scientist’s workflow: by making their data and records more available, by making them more consistent, by allowing them to reformat and reanalyse data with ease, and ultimately by making it easy for them to share the full record. We can and will argue about where best to order and describe the elements of this record. I believe that this point comes slightly later – after the experiment – but wherever it happens, it will be made much easier by automatic capture systems that hold as much contextual information as possible. Metadata is context, and almost all of it should be possible to capture automatically. Regardless of this, we need to develop a diverse ecosystem of tools. It needs to be an open and standards-based ecosystem, and in my view it needs to be built up of small parts, loosely coupled. We can build this – it will be tough, and it will be expensive, but I think we know enough now to at least outline how it might work, and this is the agenda that I want to explore at SciFoo.
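To make the idea of ‘capture at source’ concrete, here is a minimal sketch (not any existing tool – all names and the sidecar-file convention are my own illustration) of what automatic metadata capture could look like: at the moment a data file is saved, record everything that can be gathered without asking the scientist for anything, and only add human-supplied context if it is offered.

```python
# A hypothetical sketch of capture-at-source: when a data file is saved,
# record as much context as can be gathered automatically.
import hashlib
import json
import os
import platform
from datetime import datetime, timezone

def capture_metadata(path, extra=None):
    """Write a sidecar .meta.json next to a data file.

    Everything in the record is captured automatically; 'extra' holds
    optional human-supplied context (e.g. instrument settings).
    """
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    record = {
        "file": os.path.abspath(path),
        "sha256": digest,                        # identity of the data itself
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "size_bytes": os.path.getsize(path),
        "host": platform.node(),                 # which machine produced it
        "user": os.environ.get("USER") or os.environ.get("USERNAME", "unknown"),
    }
    if extra:
        record.update(extra)
    with open(path + ".meta.json", "w") as f:
        json.dump(record, f, indent=2)
    return record
```

The point of the sketch is that the scientist does nothing extra: the context rides along with the data from the moment it exists, which is exactly what makes later ordering, description, and sharing of the record cheap.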

John Wilbanks had the last word, and it was a call to arms. He said ‘We are the architects of Open’. There are two messages in this. The first is that we need to get on and build this thing called Open Science; the moment to grasp and guide the process is now. The second is that if you want a part in this process, the time to join the debate is now. One thing that was very clear to me was that the attendees of the meeting were largely disconnected from the more technical community that reads this and related blogs. We need to get the communication flowing in both directions. There are things the blogosphere knows, areas where we are far ahead, and we need to get that information across. There are things we don’t know much about, like the legal frameworks and the high-level policy discussions that are going on; we need to understand that context. It strikes me, though, that if we can combine the strengths of all of these communities and their differing modes of communication, then we will be a powerful force for taking forward the open agenda.

Policy and technology for e-science – A forum on open science policy

I’m in Barcelona at a satellite meeting of the EuroScience Open Forum, organised by Science Commons and a number of their partners. Today is when most of the meeting takes place, with forums on ‘Open Access Today’, ‘Moving OA to the Scientific Enterprise: Data, materials, software’, ‘Open access in the knowledge network’, and ‘Open society, open science: Principle and lessons from OA’. There is also a keynote from Carlos Morais-Pires of the European Commission, and the lineup for the panels is very impressive.

Last night was an introduction and social kickoff as well. James Boyle (Duke Law School, chair of the board of directors of Creative Commons, founder of Science Commons) gave a wonderful talk (40 minutes, no slides, barely taking breath) whose central theme was the relationship between where we are today with open science and where international computer networks were in 1992. He likened making the case for open science today to people suggesting in 1992 that the networks would benefit from being made freely accessible, freely usable, and based on open standards. The fears that people have today – of good information being lost in a deluge of dross, of there being large quantities of nonsense, and nonsense from people with an agenda – can to a certain extent be balanced against the observation that, to put it crudely, Google works. As James put it (not quite a direct quote): ‘You need to reconcile two statements, both true. 1) 99% of all material on the web is incorrect, badly written, and partial. 2) You probably haven’t opened an encyclopedia as a reference in ten years.’

James gave two further examples. The first was the availability of legal data in the US: despite the fact that none of it is copyrightable there, thriving businesses are based on it. The second, which I found compelling for reasons that Peter Murray-Rust has described in some detail, was weather data. Weather data in the US is free. In a recent attempt to obtain long-term weather data, a research effort was charged on the order of $1500 for all existing US weather data – the cost of the DVDs needed to ship it. By comparison, a single German state wanted millions for theirs. The consequence was that the European data didn’t go into the modelling. James made the point that while the European return on investment for weather data was a respectable nine-fold, that for the US (where they are giving it away, remember) was 32-fold. To me, though, the really compelling part of this argument is that if the data is not made available we run the risk of being underwater in twenty years with nothing to eat. This particular case is not about money; it is potentially about survival.

Finally – and this, you will not be surprised, was the bit I most liked – he went on to issue a call to arms to get on and start building this thing that we might call the data commons. The time has come to actually sit down and take these things forward: to start solving the issues of reward structures, of identifying business models, and to build the tools and standards to make this happen. That, he said, was the job for today. I am looking forward to it.

I will attempt to do some updates via Twitter/FriendFeed (cameronneylon on both), but I don’t know how well that will work. I don’t have a roaming data tariff and the charges in Europe are a killer, so it may be a bit sparse.

Data is free or hidden – there is no middle ground

Science Commons and others are organising a workshop on Open Science issues as a satellite meeting of the EuroScience Open Forum in July. This is pitched as an opportunity to discuss issues around policy, funding, and social issues with an impact on the ‘Open Research Agenda’. In preparation for that meeting I wanted to continue to explore some of the conflicts that arise between wanting to make data freely available as soon as possible and the need to protect the interests of the researchers who have generated the data and (perhaps) have a right to the benefits of exploiting it.

John Cumbers proposed the idea of a ‘protocol’ for open science that included a ‘use embargo’: the idea that when data is initially made available, no-one else should work on it for a specified period of time. I proposed more generally that researchers could ask that others leave their data alone for a particular period, but that there ought to be an absolute limit on this type of embargo to prevent data being tied up. These kinds of ideas revolve around the need to forge community norms – standards of behaviour that are expected, and to some extent enforced, by a community. The problem is that these need to evolve naturally rather than be imposed by committee. If there isn’t community buy-in then proposed standards have no teeth.

An alternative approach to solving the problem is to adopt some sort of ‘licence’: a legal or contractual framework that creates obligations about how data can be used and re-used. This could impose embargoes of the type that John suggested, perhaps as flexible clauses in the licence. One could imagine an ‘Open data – six month analysis embargo’ licence. This is attractive because it apparently gives you control over what is done with your data while also allowing you to make it freely available. This is why people who first come to the table with an interest in sharing content always start with CC-BY-NC: they want everyone to have their content, but not to make money out of it. It is only later that people realise what other effects this restriction can have.

I had rejected the licensing approach because I thought it could only work in a walled garden, something which goes against my view of what open data is about. More recently John Wilbanks has written some wonderfully clear posts on the nature of the public domain, and the place of data in it, which make clear that it can’t work even in a walled garden. Because data is in the public domain, no contractual arrangement can protect your ability to exploit it; a contract can only give you a legal right to punish someone who does something you haven’t agreed to. This has important consequences for the idea of Open Science licences and standards.

If we argue as an ‘Open Science movement’ that data is in, and must remain in, the public domain then, if we believe this is in the common good, we should also argue for the widest possible interpretation of what counts as data. The results of an experiment, regardless of how clever its design might be, are a ‘fact of nature’, and therefore in the public domain (although not necessarily publicly available). Therefore, if any person has access to that data they can do whatever they like with it, as long as they are not bound by a contractual arrangement. If someone breaks a contractual arrangement and makes the data freely available, there is no way you can get that data back. You can punish the person who made it available if they broke a contract with you, but you can’t recover the data. The only way you can protect the right to exploit data is by keeping it secret. This is entirely different from creative content, where if someone ignores or breaks licence terms you can legally recover the content from anyone who has obtained it.

Why does this matter to the Open Science movement? Aren’t we all about making the data available for people to do whatever anyway? It matters because you can’t place any legal limitations on what people do with data you make available. You can’t put something up and say ‘you can only use this for X’ or ‘you can only use it after six months’ or even ‘you must attribute this data’. Even in a walled garden, once there is one hole, the entire edifice is gone. The only way we can protect the rights of those who generate data to benefit from exploiting it is through the hard work of developing and enforcing community norms that provide clear guidelines on what can be done. It’s that or simply keep the data secret.

What is important is that we are clear about this distinction between legal and ethical protections. We must not tell people that their data can be protected because essentially they can’t. And this is a real challenge to the ethos of open data because it means that our only absolutely reliable method for protecting people is by hiding data. Strong community norms will, and do, help but there is a need to be careful about how we encourage people to put data out there. And we need to be very strong in condemning people who do the ‘wrong’ thing. Which is why a discussion on what we believe is ‘right’ and ‘wrong’ behaviour is incredibly important. I hope that discussion kicks off in Barcelona and continues globally over the next few months. I know that not everyone can make the various meetings that are going on – but between them and the blogosphere and the ‘streamosphere‘ we have the tools, the expertise, and hopefully the will, to figure these things out.
