Sci – Bar – Foo etc. Part III – Google Wave Session at SciFoo

Google Wave has got an awful lot of people quite excited. And others are more sceptical. A lot of SciFoo attendees were therefore very excited to be able to get an account on the developer sandbox as part of the weekend. At the opening plenary Stephanie Hannon gave a demo of Wave and, although there were numerous things that didn’t work live, that was enough to get more people interested. On the Saturday morning I organized a session to discuss what we might do and also to provide an opportunity for people to talk about technical issues. Two members of the wave team came along and kindly offered their expertise, receiving a somewhat intense grilling as thanks for their efforts.

I think it is now reasonably clear that there are two short to medium term applications for Wave in the research process. The first is the collaborative authoring of documents and the conversations around those. The second is the use of wave as a recording and analysis platform. Both types of functionality were discussed with many ideas for both. Martin Fenner has also written up some initial impressions.

Naturally we recorded the session in Wave and even as I type, over a week later, there is a conversation going on in real time about the details of taking things forward. There are many things to get used to, not least when it is polite to delete other people’s comments and clean them up, but the potential (and the weaknesses and areas for development) are becoming clear.

I’ve pasted our functionality brainstorm at the bottom to give people an idea of what we talked about, but the discussion was very wide ranging. Functionality divided into a few categories. Firstly, robots for bringing scientific objects, chemical structures, DNA sequences, biomolecular structures, videos, and images into the wave in a functional form, with links back to a canonical URI for the object. In its simplest form this might just provide a link back to a database: typing “chem:benzene” or “pdb:1ecr” would trigger a robot to insert a link back to the database entry. More complex robots could insert an image of the chemical (or protein structure), or perhaps rdf or microformats that provide a more detailed description of the molecule.
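To make the trigger idea concrete, here is a minimal sketch of the text-scanning half of such a robot. The `prefix:identifier` convention and the URL templates are my assumptions for illustration; the actual Wave robot machinery (event handlers, blip manipulation) is not shown.

```python
import re

# Hypothetical prefix -> URL templates; the actual databases and URL schemes
# a real robot would link to are assumptions made for this sketch.
LINK_TEMPLATES = {
    "chem": "http://www.ncbi.nlm.nih.gov/pccompound?term={id}",
    "pdb": "http://www.pdb.org/pdb/explore.do?structureId={id}",
}

# Match triggers of the form "chem:benzene" or "pdb:1ecr" in blip text.
TRIGGER = re.compile(r"\b(chem|pdb):(\w+)")

def annotate(text):
    """Return (trigger, url) pairs for each database trigger found in text."""
    links = []
    for prefix, identifier in TRIGGER.findall(text):
        url = LINK_TEMPLATES[prefix].format(id=identifier)
        links.append((prefix + ":" + identifier, url))
    return links

print(annotate("The structure pdb:1ecr was solved with chem:benzene present"))
```

A real robot would then replace each matched trigger in the blip with a link (or an embedded image), but the pattern-match-and-lookup core would look much like this.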

Taking this one step further we also explored the idea of pulling data or status information from laboratory instruments to create a “laboratory dashboard”, and perhaps controlling them. This discussion was helpful in getting a feel for what Wave can and can’t do, as well as how different functionalities are best implemented. A robot can be built to populate a wave with information or data from laboratory instruments, and in principle such a robot could also pass information from the wave back to the instrument. However both of these will still require some form of client running on the instrument side that is capable of talking to the robot web service, so the actual problem of interfacing with the instrument will remain. We can hope that instrument manufacturers might think of writing out nice simple XML log files at some point, but in the meantime this is likely to involve hacking things together. If you can manage this then a Gadget will provide a nice visual dashboard-type interface to keep you updated as to what is happening.

Sharing data analysis is of significant interest to me, and the fact that there is already a robot (called Monty) that will interpret Python is a very interesting starting point for exploring this. There is some basic graphing functionality (Graphy, naturally). For me this is where some of the most exciting potential lies: not just sharing printouts or the results of data analysis procedures, but the details of the data and a live representation of the process that led to the results. Expect much more from me on this in the future as we start to take it forward.
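As an example of the difference between sharing a printout and sharing the live process, here is the kind of small, self-contained analysis snippet one might paste into a wave for an interpreter robot to evaluate; the calibration data are invented:

```python
# Invented calibration data, shared alongside the analysis itself so that
# collaborators see the data, the method, and the result together.
concentrations = [0.0, 1.0, 2.0, 3.0, 4.0]
absorbances = [0.02, 0.21, 0.39, 0.61, 0.80]

def linear_fit(xs, ys):
    """Ordinary least-squares slope and intercept."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

slope, intercept = linear_fit(concentrations, absorbances)
print("slope=%.3f intercept=%.3f" % (slope, intercept))
```

Anyone in the wave can tweak the data or the fit and re-run it, which is precisely the live representation of the process that a static figure in a manuscript cannot give you.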

The final area of discussion, and the one we probably spent the most time on, was looking at Wave in the authoring and publishing process. Formatting of papers, sharing of live diagrams and charts, automated reference searching and formatting, as well as submission processes, both to journals and to other repositories, and even the running of peer review process were all discussed. This is the area where the most obvious and rapid gains can be made. In a very real sense Wave was designed to remove the classic problem of sending around manuscript versions with multiple figure and data files by email so you would expect it to solve a number of the obvious problems. The interesting thing in my view will be to try it out in anger.

Which was where we finished the session. I proposed the idea of writing a paper, in Wave, about the development and application of tools needed to author papers in Wave. As well as the technical side, such a paper would discuss the user experience and any of the social issues that arise out of such a live collaborative authoring experience. If it were possible to run an actual peer review process in Wave that would also be very cool; however, this might not be feasible given existing journal systems. If not we will run a “mock” peer review process and look at how that works. If you are interested in being involved, drop a note in the comments, or join the Google Group that has been set up for discussions (or if you have a developer sandbox account and want access to the Wave, drop me a line).

There will be lots of details to work through but the overall feel of the session for me was very exciting and very positive. There will clearly be technical and logistical barriers to be overcome, not least that a significant quantity of legacy tooling may not be a good fit for Wave. Some architectural thinking on how to most effectively re-use existing code may be required. But overall the problem seems to be where to start on the large set of interesting possibilities. And that seems a good place to be with any new technology.


Sci – Bar – Foo etc. Part I – SciBarCamp Palo Alto

Last week I was lucky enough to attend both SciBarCamp Palo Alto and SciFoo; both for the second time. In the next few posts I will give a brief survey of the highlights of both, kicking off with SciBarCamp. I will follow up with more detail on some of the main things to come out of these meetings over the next week or so.

SciBarCamp followed on from last year’s BioBarCamp and was organized by Jamie McQuay, John Cumbers, Chris Patil, and Shirley Wu. It was held at the Institute for the Future at Palo Alto which is a great space for a small multisession meeting for about 70 people.

A number of people from last year’s camp came, but there was a good infusion of new people as well, with a strong element of astronomy and astronautics, and a significant number of people with one sort of media experience or another who were interested in science, providing a different kind of perspective.

After introductions and a first pass at the session planning, the meeting was kicked off by a keynote from Sean Mooney on web tools for research. The following morning kicked off for me with a session led by Chris Patil on open source textbooks, with an interesting discussion on how to motivate people to develop content. I particularly liked the notion of several weeks in a pleasant place, drinking cocktails and hammering out the details of the content. Joanna Scott and Andy Lang gave a session on the use of Second Life for visualization and scientific meetings. You can see Andy’s slides at slideshare.

Tantek Celik gave a session on how to make data available from a technical perspective, with a focus on microformats as a means of marking up elements. His list of five key points for publishing data on the web makes a good checklist. Unsurprisingly, being a key player at microformats.org, he played up microformats. There was a pretty good discussion, which continued through some other sessions, on the relative value of microformats versus XML or rdf. Tantek was dismissive of the latter, which I would agree with for much of the consumer web, but I would argue that the place where semantic web tools are starting to make a difference is the sciences, and microformats, at least in their controlled vocabulary form, are unlikely to deliver there. In any case a discussion worth having, and continuing.

An excellent Indian lunch (although I would take issue with John’s assertion that it was the best outside of Karachi, we don’t do too badly here in the UK) was followed by a session from Alicia Grubb on Scooping, Patents, and Open Science. I tried to keep my mouth shut and listen but pretty much failed. Alicia is also running a very interesting project looking at researchers’ attitudes towards reproducibility and openness. Do go and fill out her survey. After this (or actually maybe it was before – it’s becoming a blur) Pete Binfield ran a session on how (or whether) academic publishers might survive the next five years. This turned into a discussion more about curation and archival than anything else, although there was a lengthy discussion of business models as well.

Finally myself, Jason Hoyt, and Duncan Hull did a tag team effort entitled “Bending the Internet to Scientists (not the other way around)”. I re-used the first part of the slides from my NESTA Crucible talk to raise the question of how we maximise the efficiency of the public investment in research. Jason talked about why scientists don’t use the web, using Mendeley as an example of trying to fit the web to scientists’ needs rather than the other way around, and Duncan closed up with a discussion of online researcher identities. Again this kicked off an interesting discussion.

Video of several sessions is available thanks to Naomi Most. The friendfeed room is naturally chock full of goodness and there is always a Twitter search for #sbcPA. I missed several sessions which sounded really interesting, which is the sign of a great BarCamp. It was great to catch up with old friends, finally meet several people who I know well from online, as well as meet a whole new bunch of cool people. As Jamie McQuay said in response to Kirsten Sanford, it’s the attendees that make these conferences work. Congrats to the organizers for another great meeting. Here’s looking forward to next year.

How I got into open science – a tale of opportunism and serendipity

So Michael Nielsen, one morning at breakfast at SciFoo, asked one of those questions which never has a short answer: ‘So how did you get into this open science thing?’ And I realised that although I have told the story to many people I haven’t ever written it down. Perhaps this is a meme worth exploring more generally, but I thought others might be interested in my story, partly because it illustrates how funding drives scientists, and partly because it shows how opportunism and serendipity can make for successful bedfellows.

In late 2004 I was spending a lot of my time on the management of a large collaborative research project and had had a run of my own grant proposals rejected. I had a student interested in doing a PhD but no direct access to funds to support the consumables cost of the proposed project. Jeremy Frey had been on at me for a while to look at implementing the electronic lab notebook system that he had led the development of, and at the critical moment he pointed out to me a special call from the BBSRC for small projects to prototype, develop, or implement e-science technologies in the biological sciences. It was a light touch review process and a relatively short application. More to the point it was a way of funding some consumables.

So the grant was written. I wrote the majority of it, which makes somewhat interesting reading in retrospect. I didn’t really know what I was talking about at the time (which seems to be a theme with my successful grants). The original plan was to use the existing, fully semantic, rdf backed electronic lab notebook and develop models for use in a standard biochemistry lab. We would then develop systems to enable a relational database to be extracted from the rdf representation and present this on the web.

The grant was successful but the start was delayed due to shenanigans over the studentship that was going to support the grant and the movement of some of the large project to another institution with one of the investigators. Partly due to the resulting mess I applied for the job I ultimately accepted at RAL and after some negotiation organised an 80:20 split between RAL and Southampton.

By the time we had a student in place and had got the grant started it was clear that the existing semantic ELN was not in a state that would enable us to implement new models for our experiments. However at this stage there was a blog system that had been developed in Jeremy’s group and it was thought it would be an interesting experiment to use this as a notebook. This would be almost the precise opposite of the rdf backed ELN. Looking back at it now I would describe it as taking the opportunity to look at a Web 2.0 approach to the notebook as compared to a Web 3.0 approach but bear in mind that at the time I had little or no idea of what these terms meant, let alone the care with which they need to be used.

The blog based system was great for me as it meant I could follow the student’s work online, and doing this I gradually became aware of blogs in general and the use of feed readers. The RSS feed of the LaBLog was a great help as it made following the details of experiments remotely straightforward. This was important as by now I was spending three or four days a week at RAL while the student was based in Southampton. As we started to use the blog, at first in a very naïve way, we found problems and issues which ultimately led to us thinking about and designing the organisational approach I have written about elsewhere [1, 2]. By this stage I had started to look at other services online and was playing around with OpenWetWare and a few other services, becoming vaguely aware of Creative Commons licenses and getting a grip on the Web 2.0 versus Web 3.0 debate.

To implement our newly designed approach to organising the LaBLog we decided the student would start afresh with a clean slate in a new blog. By this stage I was playing with using the blog for other things and had started to discover that there were issues that meant the ID authentication we were using didn’t always work through the RAL firewall. I ended up with complicated VPN setups, particularly working from home, where I couldn’t log on to the blog and have my email online at the same time. This, obviously, was a pain and as we were moving to a new blog which could have new security settings I said, ‘stuff it, let’s just make it completely visible and be done with it’.

So there you go. The critical decision to move to an Open Notebook status was taken as the result of a firewall. So serendipity, or at least the effect of outside pressures, was what made it happen. I would like to say it was a carefully thought out philosophical decision but, although my awareness of the open access movement, Creative Commons, OpenWetWare, and others no doubt prepared the background that led me to think down that route, it was essentially the result of frustration.

So, so far, opportunism and serendipity, which brings us back to opportunism again, or at least seizing an opportunity. Having made the decision to ‘go open’ two things clicked in my mind. Firstly the fact that this was rather radical. Secondly, the fact that all of these Web 2.0 tools combined with an open approach could lead to a marked improvement in the efficiency of collaborative science, a kind of ‘Science 2.0’ [yes, I know, don’t laugh, this would have been around March 2007]. Here was an opportunity to get my name on a really novel and revolutionary concept! A quick Google search revealed that, funnily enough, I wasn’t the first person to think of this (yes! I’d been scooped!), but more importantly it led to what I think ought to be three of the Standard Works of Open Science, Bill Hooker’s three part series on Open Science at 3 Quarks Daily [1, 2, 3], Jean-Claude Bradley’s presentation on Open Notebook Science at Nature Precedings (and the associated original blog post coining the term), and Deepak Singh’s talk on Open Science at Ignite Seattle. From there I was inspired to seize the opportunity, get a blog of my own, and get involved. The rest of my story, so far, is more or less available online here and via the usual sources.

Which leads me to ask. What got you involved in the ‘open’ movement? What, for you, were the ‘primary texts’ of open science and open research? There is a value in recording this, or at least our memories of it, for ourselves, to learn from our mistakes and perhaps discern the direction going forward. Perhaps it isn’t even too self serving to think of it as history in the making. Or perhaps, more in line with our own aims as ‘open scientists’, that we would be doing a poor job if we didn’t record what brought us to where we are and what is influencing our thinking going forward. I think the blogosphere does a pretty good job of the latter, but perhaps a little more recording of the former would be helpful.

The problem of academic credit and the value of diversity in the research community

This is the second in a series of posts (first one here) in which I am trying to process and collect ideas that came out of Scifoo. This post arises out of a discussion I had with Michael Eisen (UC Berkeley) and Sean Eddy (HHMI Janelia Farm) at lunch on the Saturday. We had drifted from a discussion of the problem of attribution stacking and citing datasets (and datasets made up of datasets) into the problem of academic credit. I had trotted out the usual spiel about the need for giving credit for data sets and for tool development.

Michael made two interesting points. The first was that he felt people got too much credit for datasets already and that making them more widely citeable would actually devalue the contribution. The example he cited was genome sequences. This is a case where, for historical reasons, the publication of a dataset as a paper in a high ranking journal is considered appropriate.

In a sense I agree with this case. The problem here is that for this specific case it is allowable to push a dataset-sized peg into a paper-sized hole. This has arguably led to an overvaluing of the sequence data itself and an undervaluing of the science it enables. Small molecule crystallography is similar in some regards, with the publication of crystal structures in paper form bulking out the publication lists of many scientists. There is a real sense in which having a publication stream for data, making the data itself directly citeable, would lead to a devaluation of these contributions. On the other hand it would lead to a situation where you would cite what you used, rather than the paper in which it was perhaps peripherally described. I think more broadly that the publication of data will lead to greater efficiency in research generally and more diversity in the streams to which people can contribute.

Michael’s comment on tool development was more telling though. As people at the bottom of the research tree (and I count myself amongst this group) it is easy to say ‘if only I got credit for developing this tool’, or ‘I ought to get more credit for writing my blog’, or any one of a thousand other things we feel ‘ought to count’. The problem is that there is no such thing as ‘credit’. Hiring decisions and promotion decisions are made on the basis of perceived need. And the primary needs of any academic department are income and prestige. If we believe that people who develop tools should be more highly valued then there is little point in giving them ‘credit’ unless that ‘credit’ will be taken seriously in hiring decisions. We have this almost precisely backwards. If a department wanted tool developers then it would say so, and would look at CVs for evidence of this kind of work. If we believe that tool developers should get more support then we should be saying that at a higher, strategic level, not just trying to get it added as a standard section in academic CVs.

More widely there is a question as to why we might think that blogs, or public lectures, or code development, or more open sharing of protocols are something for which people should be given credit. There is often a case to be made for the contribution of a specific person in a non-traditional medium, but that doesn’t mean that every blog written by a scientist is a valuable contribution. In my view it isn’t the medium that is important, but the diversity of media and the concomitant diversity of contributions that they enable. In arguing for these contributions being significant what we are actually arguing for is diversity in the academic community.

So is diversity a good thing? The tightening and concentration of funding has, in my view, led to a decrease in diversity, both geographical and social, in the academy. In particular there is a tendency towards large groups clustered together in major institutions, generally led by very smart people. There is a strong argument that these groups can be more productive, more effective, and crucially offer better value for money. Scifoo is a place where those of us who are less successful come face to face with the fact that there are many people a lot smarter than us and that these people are probably more successful for a reason. And you have to question whether your own small contribution with a small research group is worth the taxpayer’s money. In my view this is something you should question anyway as an academic researcher – there is far too much comfortable complacency and sense of entitlement, but that’s a story for another post.

So the question is: do I make a valid contribution? And does that provide value for money? And again for me Scifoo provides something of an answer. I don’t think I spoke to any person over the weekend without at least giving them something new to think about, a slightly different view on a situation, or just an introduction to something that they hadn’t heard of before. These contributions were in very narrow areas, ones small enough for me to be expert in, but my background and experience provided a different view. What does this mean for me? Probably that I should focus more on what makes my background and experience unique – that I should build out from that in the directions most likely to provide a complementary view.

But what does it mean more generally? I think that it means that a diverse set of experiences, contributions, and abilities will improve the quality of the research effort. At one session of Scifoo, on how to support ground breaking science, I made the tongue in cheek comment that I thought we needed more incremental science, more filling in of tables, of laying the foundations properly. The more I think about this the more I think it is important. If we don’t have proper foundations, filled out with good data and thought through in detail, then there are real risks in building new skyscrapers. Diversity adds reinforcement by providing better tools, better datasets, and different views from which to examine the current state of opinion and knowledge. There is an obvious tension between delivering radical new technologies and knowledge and the incremental process of filling in, backing up, and checking over the details. But too often the discussion is purely about how to achieve the first, with no attention given to the importance of the second. This is about balance not absolutes.

So to come back around to the original point, the value of different forms of contribution is not due to the fact that they are non-traditional or because of the medium per se, it is because they are different. If we value diversity at hiring committees, and I think we should, then looking at a diverse set of contributions, and the contribution that a given person is likely to make in the future based on their CVs, we can assess more effectively how they will differ from the people we already have. The tendency of ‘the academy’ to hire people in its own image is well established. No monoculture can ever be healthy; certainly not in a rapidly changing environment. So diversity is something we should value for its own sake, something we should try to encourage, and something that we should search CVs for evidence of. Then the credit for these activities will flow of its own accord.

Re-inventing the wheel (again) – what the open science movement can learn from the history of the PDB

One of the many great pleasures of SciFoo was to meet with people who had a different, and in many cases much more comprehensive, view of managing data and making it available. One of the long term champions of data availability is Professor Helen Berman, the head of the Protein Data Bank (the international repository for biomacromolecular structures), and I had the opportunity to speak with her for some time on the Friday afternoon before Scifoo kicked off in earnest (in fact this was one of many somewhat embarrassing situations where I would carefully explain my background in my very best ‘speaking to non-experts’ voice only to find they knew far more about it than I did – however Jim Hardy of Gahaga Biosciences takes the gold medal for this event for turning to the guy called Larry next to him while having dinner at Google Headquarters and asking what line of work he was in).

I have written before about how the world might look if the PDB and other biological databases had never existed, but as I said then I didn’t know much of the real history. One of the things I hadn’t realised was how long it was after the PDB was founded before deposition of structures became expected for all atomic resolution biomacromolecular structures. The road from a repository of seven structures, with a handful of new submissions a year, to the standards of today, where any structure published in a reputable journal must be deposited, was a long and rocky one. The requirement to deposit structures on publication only became general in the early 1990s, nearly twenty years after the PDB was founded, and there was a very long and extended process during which the case for making the data available was only gradually accepted by the community.

Helen made the point strongly that it had taken 37 years to get the PDB to where it is today; a gold standard international and publicly available repository of a specific form of research data supported by a strong set of community accepted, and enforced, rules and conventions. We don’t want to take another 37 years to achieve the widespread adoption of high standards in data availability and open practice in research more generally. So it is imperative that we learn the lessons and benefit from the experience of those who built up the existing repositories. We need to understand where things went wrong and attempt to avoid repeating mistakes. We need to understand what worked well and use this to our advantage. We also need to recognise where the technological, social, and political environment that we find ourselves in today means that things have changed, and perhaps to recognise that in many ways, particularly in the way people behave, things haven’t changed at all.

I’ve written this in a hurry and therefore not searched as thoroughly as I might but I was unable to find any obvious ‘history of the PDB’ online. I imagine there must be some out there – but they are not immediately accessible. The Open Science movement could benefit from such documents being made available – indeed we could benefit from making them required reading. While at Scifoo Michael Nielsen suggested the idea of a panel of the great and the good – those who would back the principles of data availability, open access publication, and the free transfer of materials. Such a panel would be great from the perspective of publicity but as an advisory group it could have an even greater benefit by providing the opportunity to benefit from the experience many of these people have in actually doing what we talk about.

Notes from Scifoo

I am too tired to write anything even vaguely coherent. As will have been obvious there was little opportunity for microblogging, I managed to take no video at all, and not even any pictures. It was non-stop, at a level of intensity that I have very rarely encountered anywhere before. The combination of breadth and sharpness that many of the participants brought was, to be frank, pretty intimidating but their willingness to engage and discuss and my realisation that, at least in very specific areas, I can hold my own made the whole process very exciting. I have many new ideas, have been challenged to my core about what I do, and how; and in many ways I am emboldened about what we can achieve in the area of open data and open notebooks. Here are just some thoughts that I will try to collect some posts around in the next few days.

  • We need to stop fretting about what should be counted as ‘academic credit’. In another two years there will be another medium, another means of communication, and by then I will probably be conservative enough to dismiss it. Instead of just thinking that diversifying the sources of credit is a good thing we should ask what we want to achieve. If we believe that we need a more diverse group of people in academia then that is what we should articulate – Courtesy of a discussion with Michael Eisen and Sean Eddy.
  • ‘Open Science’ is a term so vague as to be actively dangerous (we already knew that). We need a clear articulation of principles or a charter. A set of standards that are clear, and practical in the current climate. As these will be lowest common denominator standards at the beginning we need a mechanism that enables or encourages a process of incrementally raising those standards. The electronic Geophysical Year Declaration is a good working model for this – Courtesy of session led by Peter Fox.
  • The social and personal barriers to sharing data can be codified and made sense of (and this has been done). We can use this understanding to frame structures that will make more data available – session led by Christine Borgman
  • The Open Science movement needs to harness the experience of developing the open data repositories that we now take for granted. The PDB took decades of continuous work to bring to its current state and much of it was a hard slog. We don’t want to take that much time this time round – Courtesy of discussion led by Helen Berman
  • Data integration is tough, but it is not helped by the fact that bench biologists don’t get ontologies, and that ontologists and their proponents don’t really get what the biologists are asking. I know I have an agenda on this but social tagging can be mapped after the fact onto structured data (as demonstrated to me by Ben Good). If we get the keys right then much else will follow.
  • Don’t schedule a session at the same time as Martin Rees does one of his (aside from anything else you miss what was apparently a fabulous presentation).
  • Prosthetic limbs haven’t changed in 100 years and they suck. Might an open source approach to building a platform be the answer? – discussion with Jon Kuniholm, founder of the Open Prosthetics Project.
  • The platform for Open Science is very close and some of the key elements are falling into place. In many ways this is no longer a technical problem.
  • The financial system backing academic research is broken when the cost of reproducing or refuting specific claims is 10- to 20-fold higher than the original work. Open Notebook Science is a route to reducing this cost – discussion with Jamie Heywood.
  • Chris Anderson isn’t entirely wrong – but he likes being provocative in his articles.
  • Google run a fantastically slick operation, down to the fact that the chocolate-coated oatmeal biscuit ice cream sandwiches are specially ordered in, made with proper sugar instead of high fructose corn syrup.

Enough. Time to sleep.

BioBarCamp – Meeting friends old and new and virtual

So BioBarCamp started yesterday with a bang and a great kick off. Not only did we somehow manage to start early, we were consistently running ahead of schedule. With several hours initially scheduled for introductions, this actually went pretty quickly, although it was quite comprehensive. During the introductions many people expressed an interest in ‘Open Science’, ‘Open Data’, or some other open stuff, yet it was already pretty clear that many people meant many different things by this. It was suggested that with the time available we have a discussion session on what ‘Open Science’ might mean. Pedro and myself live blogged this at FriendFeed and the discussion will continue this morning.

I think for me the most striking outcome of that session was that not only is this a radically new concept for many people, but that many people don’t have any background understanding of open source software either, which can make the discussion totally impenetrable to them. This, in my view, strengthens the need for having some clear brands, or standards, that are easy to point to and easy to sign up to (or not). I pitched the idea, basically adapting from John Wilbanks’s pitch at the meeting in Barcelona, that our first target should be that all data and analysis associated with a published paper are made available. This seems an unarguable basic standard, but it is one that we currently fall far short of. I will pitch this again in the session I have proposed on ‘Building a data commons’.

The schedule for today is up as a googledoc spreadsheet with many difficult decisions to make. My current thinking is:

  1. Kaitlin Thaney – Open Science Session
  2. Ricardo Vidal and Vivek Murthy (OpenWetWare and Epernicus) – Using online communities to share resources efficiently.
  3. Jeremy England & Mark Kaganovich – Labmeeting, Keeping Stalin Out of Science (though I would also love to do John Cumbers on synthetic biology for space colonization, that is just so cool)
  4. Pedro Beltrao & Peter Binfield – Dealing with Noise in Science / How should scientific articles be measured?
  5. Hard choice: Andrew Hessel – building an open source biotech company or Nikesh Kotecha + Shirley Wu – Motivating annotation
  6. Another doozy: John Cumbers – Science Worship / Science Marketing or Hilary Spencer & Mathias Crawford – Interests in Scientific IP – Who Owns/Controls Scientific Communication and Data?  The Major Players.
  7. Better turn up to mine I guess :)
  8. Joseph Perla – Cloud computing, Robotics and the future of Science, and Joel Dudley & Charles Parrot – Open Access Scientific Computing Grids & OpenMac Grid

I am beginning to think I should have brought two laptops and two webcams. Then I could have recorded one and gone to the other. Whatever happens I will try to cover as much as I can in the BioBarCamp room at FriendFeed, and where possible and appropriate I will broadcast and record via Mogulus. The wireless was a bit tenuous yesterday so I am not absolutely sure how well this will work.

Finally, this has been a great opportunity to meet up with people I know and have met before, those who I feel I know well but have never met face to face, and indeed those whose name I vaguely know (or should know) but have never connected with before. I’m not going to say who is in which list because I will forget someone! But if I haven’t said hello yet do come up and harass me, because I probably just haven’t connected your online persona with the person in front of me!

Policy for Open Science – reflections on the workshop

Written on the train on the way from Barcelona to Grenoble. This life really is a lot less exotic than it sounds… 

The workshop that I’ve reported on over the past few days was both positive and inspiring. There is a real sense that the ideas of Open Access and Open Data are becoming mainstream. As several speakers commented, within 12-18 months it will be very unusual for any leading institution not to have a policy on Open Access to its published literature. In many ways, as far as Open Access to the published literature is concerned, the war has been won. There remain battles to be fought over green and gold routes – the role of licenses and the need to be able to text mine – successful business models remain to be made demonstrably sustainable – and there will be pain as the inevitable restructuring of the publishing industry continues. But be under no illusions: this restructuring has already begun, and it will continue in the direction of more openness as long as the poster children of the movement like PLoS and BMC continue to be successful.

Open Data remains further behind, both with respect to policy and awareness. Many people spoke over the two days about Open Access and then added, almost as an addendum ‘Oh and we need to think about data as well’. I believe the policies will start to grow and examples such as the BBSRC Data Sharing Policy give a view of the future. But there is still much advocacy work to be done here. John Wilbanks talked about the need to set achievable goals, lines in the sand which no-one can argue with. And the easiest of these is one that we have discussed many times. All data associated with a published paper, all analysis, and all processing procedures, should be made available. This is very difficult to argue with – nonetheless we know of examples where the raw data of important experiments is being thrown away. But if an experiment cannot be shown to have been done, cannot be replicated and checked, can it really be publishable? Nonetheless this is a very useful marker, and a meme that we can spread and promote.

In the final session there was a more critical analysis of the situation. A number of serious questions were raised but I think they divide into two categories. The first involves the rise of the ‘Digital Natives’ or the ‘Google Generation’. The characteristics of this new generation (a gross simplification in its own right) are often presented as a pure good: better networked, more sharing, better equipped to think in the digital network. But there are some characteristics that ought to give pause. A casualness about attribution, a sense that if something is available then it is fine to just take it (it’s not stealing after all, just copying). There is perhaps a need to recover the roots of ‘Mertonian’ science, to, as I think James Boyle put it, publicise and embed the attitudes of the last generation of scientists, for whom science was a public good and a discipline bounded by strict rules of behaviour. Some might see this as harking back to an elitist past but if we are constructing a narrative about what we want science to be then we can take the best parts of all of our history and use it to define and refine our vision. There is certainly a place for a return to the compulsory study of the history and philosophy of science.

The second major category of issues discussed in the last session revolved around the question of what do we actually do now. There is a need to move on many fronts, to gather evidence of success, to investigate how different open practices work – and to ask ourselves the hard questions. Which ones do work, and indeed which ones do not. Much of the meeting revolved around policy with many people in favour of, or at least not against, mandates of one sort or another. Mike Carroll objected to the term mandate – talking instead about contractual conditions. I would go further and say that until these mandates are demonstrated to be working in practice they are aspirations. When they are working in practice they will be norms, embedded in the practice of good science. The carrot may be more powerful than the stick but peer pressure is vastly more powerful than both.

So the key questions for me revolve around how we can convert aspirations into community norms. What is needed in terms of infrastructure, in terms of incentives, and in terms of funding to make this stuff happen? One thing is to focus on the infrastructure and take a very serious and critical look at what is required. It can be argued that much of the storage infrastructure is in place. I have written on my concerns about institutional repositories but the bottom line remains that we probably have a reasonable amount of disk space available. The network infrastructure is pretty good, so these are two things we don’t need to worry about. What we do need to worry about, and what wasn’t really discussed very much in the meeting, is the tools that will make it easy and natural to deposit data and papers.

The incentive structure remains broken – this is not a new thing – but if sufficiently high profile people start to say this should change, and act on those beliefs (and they are), then things will start to shift. It will be slow, but bit by bit we can imagine getting there. Can we take shortcuts? Well, there are some options. I’ve raised in the past the idea of a prize for Open Science (or in fact two: one for an early career researcher and one for an established one). Imagine if we could make this a million dollar prize, or at least enough for someone to take a year off. High profile, significant money, and visible success for someone each year. Even without money this is still something that will help people – give them something to point to as recognition of their contribution. But money would get people’s attention.

I am sceptical about the value of ‘microcredit’ systems where a person’s diverse and perhaps diffuse contributions are aggregated together to come up with some sort of ‘contribution’ value, a number by which job candidates can be compared. Philosophically I think it’s a great idea, but in practice I can see this turning into multiple different calculations, each of which can be gamed. We already have citation counts, H-factors, publication number, integrated impact factor as ways of measuring and comparing one type of output. What will happen when there are ten or 50 different types of output being aggregated? Especially as no-one will agree on how to weight them. What I do believe is that those of us who mentor staff, or who make hiring decisions should encourage people to describe these contributions, to include them in their CVs. If we value them, then they will value them. We don’t need to compare the number of my blog posts to someone else’s – but we can ask which is the most influential – we can compare, if subjectively, the importance of a set of papers to a set of blog posts. But the bottom line is that we should actively value these contributions – let’s start asking the questions ‘Why don’t you write online? Why don’t you make your data available? Where are your protocols described? Where is your software, your workflows?’

Funding is key, and for me one of the main messages to come from the meeting was the need to think in terms of infrastructure, and in particular, to distinguish what is infrastructure and what is science or project driven. In one discussion over coffee I discussed the problem of how to fund development projects where the two are deeply intertwined and how this raises challenges for funders. We need new funding models to make this work. It was suggested in the final panel that as these tools become embedded in projects there will be less need to worry about them in infrastructure funding lines. I disagree. Coming from an infrastructure support organisation I think there is a desperate need for critical strategic oversight of the infrastructure that will support all science – both physical facilities, network and storage infrastructure, tools, and data. This could be done effectively using a federated model and need not be centralised but I think there is a need to support the assumption that the infrastructure is available and this should not be done on a project by project basis. We build central facilities for a reason – maybe the support and development of software tools doesn’t fit this model but I think it is worth considering.

This ‘infrastructure thinking’ goes wider than disk space and networks, wider than tools, and wider than the data itself. The concept of ‘law as infrastructure’ was briefly discussed. There was also a presentation looking at different legal models of a ‘commons’: the public domain, a contractually reconstructed commons, escrow systems, etc. In retrospect I think there should have been more of this. We need to look critically at different models, what they are good for, and how they work. ‘Open everything’ is a wonderful philosophical position but we need to be critical about where it will work, where it won’t, and where it needs contractual protection, or where such contractual protection is counterproductive. I spoke to John Wilbanks about our ideas on taking Open Source Drug Discovery into undergraduate classes and schools and he was critical of the model I was proposing, not from the standpoint of the aims or where we want to be, but because it wouldn’t be effective at drawing in pharmaceutical companies and protecting their investment. His point was, I think, that by closing off the right piece of the picture with contractual arrangements you bring in vastly more resources and give yourself greater ability to ensure positive outcomes. Sometimes, to break the system, you need to start by working within it – in this case, by making it possible to patent a drug. This may not be philosophically in tune with my thinking but it is pragmatic. There will be moments, especially when we deal with the interface with commerce, where we have to make these types of decisions. There may or may not be ‘right’ answers, and if there are they will change over time, but we need to know our options and know them well so as to make informed decisions on specific issues.

But finally, as is my usual wont, I come back to the infrastructure of tools: the software that will actually allow us to record and order the data that we are supposed to be sharing. Again there was relatively little on this in the meeting itself. Several speakers recognised the need to embed the collection of data and metadata within existing workflows but there was very little discussion of good examples of this. As we have discussed before, this is much easier for big science than for ‘long tail’ or ‘small science’. I stand by my somewhat provocative contention that for the well described central experiments of big science this is essentially a solved problem – it just requires the will and resources to build the language to describe the data sets, their formats, and their inputs. But the problem is that even for big science, the majority of the workflow is not easily automated. There are humans involved, making decisions moment by moment, and these need to be captured. The debate over institutional repositories and self archiving of papers is instructive here. Most academics don’t deposit because they can’t be bothered. The idea of a negative click repository – where deposition is a natural part of the workflow – can circumvent this. And if well built it can make the conventional process of article submission easier. It is all a question of getting into the natural workflow of the scientist early enough that not only do you capture all the contextual information you want, but that you can offer assistance that makes them want to put that information in.

The same is true for capturing data. We must capture it at source. This is the point where it has the potential to add the greatest value to the scientist’s workflow by making their data and records more available, by making them more consistent, by allowing them to reformat and reanalyse data with ease, and ultimately by making it easy for them to share the full record. We can and we will argue about where best to order and describe the elements of this record. I believe that this point comes slightly later – after the experiment – but wherever it happens it will be made much easier by automatic capture systems that hold as much contextual information as possible. Metadata is context – almost all of it should be possible to catch automatically. Regardless of this we need to develop a diverse ecosystem of tools. It needs to be an open and standards based ecosystem and in my view needs to be built up of small parts, loosely coupled. We can build this – it will be tough, and it will be expensive but I think we know enough now to at least outline how it might work, and this is the agenda that I want to explore at SciFoo.

John Wilbanks had the last word, and it was a call to arms. He said ‘We are the architects of Open’. There are two messages in this. The first is we need to get on and build this thing called Open Science. The moment to grasp and guide the process is now. The second is that if you want to have a part in this process the time to join the debate is now. One thing that was very clear to me was that the attendees of the meeting were largely disconnected from the more technical community that reads this and related blogs. We need to get the communication flowing in both directions – there are things the blogosphere knows, that we are far ahead on, and we need to get the information across. There are things we don’t know much about, like the legal frameworks, the high level policy discussions that are going on. We need to understand that context. It strikes me though that if we can combine the strengths of all of these communities and their differing modes of communication then we will be a powerful force for taking forward the open agenda.