Data is free or hidden – there is no middle ground

Science Commons and others are organising a workshop on Open Science issues as a satellite meeting of the European Science Open Forum in July. This is pitched as an opportunity to discuss issues around policy, funding, and social issues with an impact on the ‘Open Research Agenda’. In preparation for that meeting I wanted to continue to explore some of the conflicts that arise between wanting to make data freely available as soon as possible and the need to protect the interests of the researchers who have generated the data and (perhaps) have a right to the benefits of exploiting it.

John Cumbers proposed the idea of a ‘Protocol’ for open science that included a ‘use embargo’: when data is first made available, no-one else should work on it for a specified period of time. I proposed, more generally, that people could ask others to leave their data alone for a particular period, but that there ought to be an absolute limit on this type of embargo to prevent data being tied up. These kinds of ideas revolve around the need to forge community norms – standards of behaviour that are expected, and to some extent enforced, by a community. The problem is that these need to evolve naturally rather than be imposed by committee. If there isn’t community buy-in then proposed standards have no teeth.

An alternative approach to solving the problem is to adopt some sort of ‘licence’: a legal or contractual framework that creates obligations about how data can be used and re-used. This could impose embargoes of the type that John suggested, perhaps as flexible clauses in the licence. One could imagine an ‘Open data – six month analysis embargo’ licence. This is attractive because it apparently gives you control over what is done with your data while also allowing you to make it freely available. This is why people who first come to the table with an interest in sharing content always start with CC-BY-NC. They want everyone to have their content, but not to make money out of it. It is only later that they realise what other effects this restriction can have.

I had rejected the licensing approach because I thought it could only work in a walled garden, something which goes against my view of what open data is about. More recently John Wilbanks has written some wonderfully clear posts on the nature of the public domain, and the place of data in it, which make it clear that a licensing approach can’t work even in a walled garden. Because data is in the public domain, no contractual arrangement can protect your ability to exploit that data; it can only give you a legal right to punish someone who does something you haven’t agreed to. This has important consequences for the idea of Open Science licences and standards.

If we argue as an ‘Open Science Movement’ that data is in, and must remain in, the public domain then, if we believe this is in the common good, we should also argue for the widest possible interpretation of what counts as data. The results of an experiment, regardless of how clever its design might be, are a ‘fact of nature’, and therefore in the public domain (although not necessarily publicly available). Therefore if any person has access to that data they can do whatever they like with it, as long as they are not bound by a contractual arrangement. If someone breaks a contractual arrangement and makes the data freely available, there is no way you can get that data back. You can punish the person who made it available if they broke a contract with you, but you can’t recover the data. The only way you can protect the right to exploit data is by keeping it secret. This is entirely different from creative content, where if someone ignores or breaks licence terms you can legally recover the content from anyone who has obtained it.

Why does this matter to the Open Science movement? Aren’t we all about making the data available for people to do whatever they want with anyway? It matters because you can’t place any legal limitations on what people do with data you make available. You can’t put something up and say ‘you can only use this for X’, or ‘you can only use it after six months’, or even ‘you must attribute this data’. Even in a walled garden, once there is one hole, the entire edifice is gone. The only way we can protect the rights of those who generate data to benefit from exploiting it is through the hard work of developing and enforcing community norms that provide clear guidelines on what can be done. It’s that or simply keep the data secret.

What is important is that we are clear about this distinction between legal and ethical protections. We must not tell people that their data can be protected, because essentially it can’t be. And this is a real challenge to the ethos of open data, because it means that our only absolutely reliable method for protecting people is to hide their data. Strong community norms will, and do, help, but we need to be careful about how we encourage people to put data out there. And we need to be very strong in condemning people who do the ‘wrong’ thing. Which is why a discussion on what we believe is ‘right’ and ‘wrong’ behaviour is incredibly important. I hope that discussion kicks off in Barcelona and continues globally over the next few months. I know that not everyone can make the various meetings that are going on – but between them, the blogosphere, and the ‘streamosphere’ we have the tools, the expertise, and hopefully the will, to figure these things out.


Defining error rates in the Illumina sequence: A useful and feasible open project?

[Image: Panorama of the EBI (left) and Sulston Laboratories (right) of the Sanger Institute on the Genome campus in Cambridgeshire, England.]

Regular readers will know I am a great believer in the potential of Web 2.0 tools to enable the rapid aggregation of loose networks of collaborators to solve a particular problem, and in the possibility of using this approach to do science better, faster, and more efficiently. The reason we haven’t had great successes with this so far comes down fundamentally to the size of the network we have in place and the bias of its expertise towards specific areas. There is a strong bioinformatics/IT bias among the people interested in these tools, and this plays out in a number of ways, from the people on FriendFeed to the relative frequency of commenting on PLoS Computational Biology versus PLoS ONE.

Putting these two together, one obvious solution is to find a problem that is well suited to the people who are around, likely to interest them, and also genuinely useful to solve. I think I may have found such a problem.

The Illumina next-generation sequencing platform, originally developed by Solexa, is the latest kid on the block among the systems that have reached the market. I spent a good part of today talking about how the analysis pipeline for this system could be improved. One issue that came up is that no-one seems to have published a detailed analysis of the types of errors that are generated experimentally by this system. Illumina have probably done this analysis in some form, but have better things to do than write it up.

The Solexa system is based on sequencing by synthesis. A population of DNA molecules, all amplified from the same single molecule, is immobilised on a surface. A new strand of DNA is built up, one base at a time. In the Solexa system each base carries a different fluorescent marker plus a blocking reagent. After the base is added, and the colour read, the blocker is removed and the next base can be added. More details can be found on the genographia wiki. There are two major sources of error here. Firstly, for a proportion of each sample, the base is not added successfully, which means that in the next round that part of the sample may generate a readout for the previous base. Secondly, the blocker may fail, leading to the addition of two bases and causing a similar problem but in reverse. As the cycles proceed, the ends of the DNA strands in the sample get increasingly out of phase, making it harder and harder to tell which is the correct signal.
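
To make the dephasing effect concrete, here is a minimal toy simulation – my own sketch, with invented failure rates rather than anything measured on the platform – of how a population of identical molecules drifts out of phase under the two failure modes described above:

```python
# Toy simulation of dephasing in sequencing-by-synthesis.
# Two failure modes: incomplete extension (no base added, no signal this cycle)
# and blocker failure (two bases added at once). Rates are invented for
# illustration only, not measured values.
import random
from collections import Counter

def simulate_cluster(template, n_molecules=10000, p_no_add=0.01, p_double_add=0.005):
    """Return, for each cycle, a Counter of which base each molecule reported."""
    positions = [0] * n_molecules             # next template position per molecule
    signals = []
    for _cycle in range(len(template)):
        counts = Counter()
        for i, pos in enumerate(positions):
            if pos >= len(template):
                continue                       # this molecule has run off the end
            r = random.random()
            if r < p_no_add:
                continue                       # incomplete extension: no signal, stays put
            if r < p_no_add + p_double_add:
                # blocker failure: two bases incorporated, both fluoresce
                for p in (pos, pos + 1):
                    if p < len(template):
                        counts[template[p]] += 1
                positions[i] = pos + 2
            else:
                counts[template[pos]] += 1     # normal single incorporation
                positions[i] = pos + 1
        signals.append(counts)
    return signals

if __name__ == "__main__":
    for cycle, sig in enumerate(simulate_cluster("ACGGTCATGCCATTGACGTA"), start=1):
        base, n = sig.most_common(1)[0]
        print(f"cycle {cycle:2d}: call {base}, signal purity {n / sum(sig.values()):.3f}")
```

Running it shows the signal purity dropping off cycle by cycle, which is exactly what makes the later cycles progressively harder to call.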

These error rates are probably dependent both on the identity of the base being added and on the identity of the previous base. They may also depend on the number of cycles that have been carried out. There is also the possibility that the sample DNA has errors in it due to the amplification process, though these are likely to be close to insignificant. At present, however, no data on these error rates is available. Simple, you might think: get some of the raw data and do the analysis – fit the sequence of raw intensity data to a model in which the parameters are the error rates for each base.

Well, we know that the availability of data makes re-processing possible, and we believe in the power of the social network. I know that a lot of you are good at this kind of analysis and might be interested in having a play with some of the raw data. It could also make a good paper – Nature Biotech or Nature Methods perhaps – and I am prepared to bet it would get an interesting editorial write-up on the process as well. I don’t really have the skills to do the work myself, but if others out there are interested then I am happy to coordinate. This could all be done in the wild, out in the open, and I think that would be a brilliant demonstration of the possibilities.

Oh, the data? We’ve got access to the raw and corrected spot intensities and the base calls from a single ‘tile’ of the phiX174 control lane for a run from the 1000 Genomes Project which can be found at http://sgenomics.org/phix174.tar.gz courtesy of Nava Whiteford from the Sanger Centre. If you’re interested in the final product you can see some of the final read data being produced here.
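
For anyone who wants to have a look, something like the following should pull the archive down and unpack it (a sketch only; it assumes the URL is still live and serves a standard gzipped tar archive):

```python
# Fetch and unpack the phiX174 tile data linked above.
# Assumes the URL is still live and serves a gzipped tar archive.
import tarfile
import urllib.request

URL = "http://sgenomics.org/phix174.tar.gz"

def fetch_and_unpack(url=URL, archive="phix174.tar.gz", dest="phix174_data"):
    urllib.request.urlretrieve(url, archive)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)    # worth inspecting member names first if you are cautious
    return dest

if __name__ == "__main__":
    print("unpacked into", fetch_and_unpack())
```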

What I had in mind was taking the called sequence and aligning it onto phiX174 so we know the ‘true’ sequence, then using that sequence plus an error model to parameterise the error rates. Perhaps there is a better way to approach the problem? There is a series of relatively simple error models that could be tried, and if the error rates can be pinned down it would enable a really significant increase in both the quality and quantity of data that these machines can produce. I figure we split the job up into a few small groups working on different models, and put the whole thing up on Google Code with a wiki to coordinate and capture other issues as we go. Anybody up for it (and got the time)?
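
As a very rough starting point, here is a sketch of the simplest version of that analysis. It assumes the file parsing and the alignment against phiX174 have already been done elsewhere, so that each read arrives as a (true sequence, called sequence) pair, and it simply tabulates miscall rates conditioned on the previous base and on cycle number:

```python
# Tabulate miscall rates from reads already aligned to the phiX174 reference.
# Each input item is a (true_seq, called_seq) pair of equal length; the
# alignment and file parsing are assumed to have been done elsewhere.
from collections import defaultdict

def count_miscalls(aligned_pairs):
    """Return (totals, errors) keyed by (previous true base, true base, cycle)."""
    totals, errors = defaultdict(int), defaultdict(int)
    for true_seq, called_seq in aligned_pairs:
        prev = "-"                                     # no previous base at cycle 1
        for cycle, (t, c) in enumerate(zip(true_seq, called_seq), start=1):
            key = (prev, t, cycle)
            totals[key] += 1
            if c != t:
                errors[key] += 1
            prev = t
    return totals, errors

def error_rates(totals, errors, group=lambda key: key):
    """Aggregate counts under an arbitrary grouping of the key, e.g.
    group=lambda k: k[2] for per-cycle rates, or k[:2] for base context."""
    t_agg, e_agg = defaultdict(int), defaultdict(int)
    for key, n in totals.items():
        g = group(key)
        t_agg[g] += n
        e_agg[g] += errors.get(key, 0)
    return {g: e_agg[g] / t_agg[g] for g in t_agg}

if __name__ == "__main__":
    # Two toy 'reads' standing in for real aligned data
    pairs = [("ACGTACGT", "ACGTACCT"), ("ACGTACGT", "ACGAACGT")]
    totals, errors = count_miscalls(pairs)
    print(error_rates(totals, errors, group=lambda k: k[2]))   # per-cycle rates
```

A proper treatment would work from the raw intensities and fit a phasing model of the kind sketched earlier, but even this crude tabulation would show whether the errors are strongly context- and cycle-dependent.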


Twittering labs? That is just so last year…

[Image: Mars Phoenix Twitter stream.]

The Mars Phoenix landing has got a lot of coverage around the web, particularly from some misty-eyed old blokes who remember watching landings via the Mosaic browser in an earlier, simpler age. The landing is cool, but one thing I thought was particularly clever was the use of Twitter by JPL to publicise the landing and what is happening on a minute-to-minute basis. Now my suspicion is that they haven’t actually installed Twhirl on the Phoenix Lander and that there is a person at JPL writing the Tweets. But that isn’t the point. The point is that the idea of an instrument (or in this case a spacecraft) outputting a stream of data is completely natural to people. The idea of the overexcited lander saying ‘come on rocketsssssss!!!!!!!!’ is very appealing (you can tell it’s a young spaceship, it hasn’t learnt not to shout yet; although if your backside was at 2,000 °C you might have something to say about it as well).

I’ve pointed out some cool examples of this in the past, including London Bridge, and Steve Wilson, in Jeremy Frey’s group at Southampton, has been doing some very fun stuff logging what happens in a laboratory and blogging it out to the web, using the tools developed by the Simile team at MIT. The notion of the instrument generating a data stream, and of using that stream as an input to an authoring tool like a laboratory notebook or into other automated processes, is a natural one that fits well both with the way we work in the laboratory (even when your laboratory is the solar system) and with our tendency to anthropomorphise our kit. However, the day the FPLC tells me it had a hard night and doesn’t feel like working this morning is the day it gets tossed out. The fact that it was me that fed it the 20% ethanol is neither here nor there.
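
By way of illustration only, the plumbing for this sort of thing can be very thin. The sketch below polls an imaginary instrument and pushes each reading to a hypothetical notebook or microblog endpoint; both read_instrument() and the ENDPOINT URL are placeholders rather than real APIs:

```python
# Push readings from an instrument to an authoring endpoint as a 'stream'.
# read_instrument() and ENDPOINT are placeholders standing in for a real
# instrument driver and a real notebook/microblogging API.
import json
import time
import urllib.request

ENDPOINT = "https://example.org/lab-notebook/api/entries"   # hypothetical

def read_instrument():
    """Stand-in for a real driver call (serial port, vendor library, etc.)."""
    return {"timestamp": time.time(), "absorbance_280nm": 0.42}

def post_reading(reading):
    req = urllib.request.Request(
        ENDPOINT,
        data=json.dumps(reading).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

if __name__ == "__main__":
    while True:
        post_reading(read_instrument())
        time.sleep(60)       # roughly the minute-to-minute cadence of the Phoenix feed
```

Swap the placeholders for a real driver and a real endpoint and you have an instrument ‘twittering’ its own data stream into whatever record you like.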

Now the question is: can I persuade JPL to include actual telemetry, command, and acknowledgement data in the Twitter stream? That would be very cool.

How do we build the science data commons? A proposal for a SciFoo session

[Image: Sign at the Googleplex.]

I realised the other day that I haven’t written an excitable blog post about getting an invitation to SciFoo! The reason for this is that I got overexcited over on FriendFeed instead and haven’t really had time to get my head together to write something here. But in this post I want to propose a session and think through what the focus and shape of that might be.

I am a passionate advocate of two things that I think are intimately related. First, I believe strongly in the need for, and the benefits that will arise from, building, using, and enabling the effective search and processing of a scientific data commons. I [1,2] and others (including John Wilbanks, Deepak Singh, and Plausible Accuracy) have written on this quite a lot recently. Second, I believe strongly in the need for effective, useable, and generic tools to record science as it happens and to process that record so that others can use it effectively. By providing the tools that enable the record to be created, and integrating them with the systems that will store and process the data commons, we can enable scientists to record their work better, communicate it better, and make it available as a matter of course to other scientists (not necessarily immediately, I should add, but when they are comfortable with it). Continue reading “How do we build the science data commons? A proposal for a SciFoo session”

Approaching deadline for Open Science@PSB

Just a gentle reminder that the deadline for submissions to the Open Science Workshop at the Pacific Symposium on Biocomputing is approaching. The purpose of the early deadline is to give people plenty of notice that they have a talk, so that they or we can sort out funding. At this stage we really only need an abstract, or even an outline, so that we can organise the programme. We are hoping to be able to make a contribution to the costs of speakers and poster presenters, so if you will need or would appreciate support then please make a note of that in your submission. And please do get in touch if you want to come but money is an issue – we will work with you to see what we can do. If there is enough UK/Europe interest we can look at putting in a small grant application; the US seems a bit harder, but we are working on that as well. As Shirley Wu has mentioned, we are actively pursuing a range of fundraising options (watch this space for more t-shirts and other open science merchandise – and yes, we are doing more than just designing t-shirts). Any help with contacts, or cold hard cash, will also be greatly appreciated.

It is looking like it will be an exciting programme and we would like as many people to be there as possible. Submission instructions are at the Call for Proposals. You know you want to spend the middle of January in Hawaii, so this is the excuse you’ve been looking for.

Avoid the pain and embarrassment – make all the raw data available


A story of two major retractions from a well-known research group has been getting a lot of play over the last few days, with a News Feature (1) and an Editorial (2) in the 15 May edition of Nature. The story turns on the claim that Homme Hellinga’s group was able to convert the E. coli ribose binding protein into a triose phosphate isomerase (TIM) using a computational design strategy. Two papers on the work appeared, one in Science (3) and one in J Mol Biol (4). However, another group, having obtained plasmids for the designed enzymes, could not reproduce the claimed activity. After many months of work they established that the supposed activity appeared to be that of the bacterium’s native TIM and not that of the designed enzyme. The papers were retracted, and Hellinga went on to accuse the graduate student who did the work of fabricating the results, a charge of which she was completely cleared.

Much of the heat the story is generating is about the characters involved and possible misconduct of various players, but that’s not what I want to cover here. My concern is with how much time, effort, and tears could have been saved if all the relevant raw data had been made available in the first place. Demonstrating a new enzymatic activity is very difficult work. It is absolutely critical to rigorously exclude the possibility of any contaminating activity, and in practice this is virtually impossible to guarantee, so a negative control experiment is very important. It appears that this control experiment was carried out, but possibly only once, against a background of significant variability in the results. All of this led to another group wasting on the order of twelve months trying to replicate the results. Well, not wasting exactly, but correcting the record – arguably a very important activity, but one for which they will get little credit in any meaningful sense (an issue for another post, and one mentioned by Noam Harel in a comment on the News Feature online).

So what might have happened if the original raw data had been available? Would it have prevented the publication of the papers in the first place? It’s very hard to tell. The referees were apparently convinced by the quality of the data. But if this was ‘typical data’ (using the special scientific meaning of typical, viz. ‘the best we’ve got’) and the referees had seen the raw data with its greater variability, then maybe they would have wanted to see more or better controls; perhaps not. Certainly, if the raw data had been available, the second group would have realised much sooner that something was wrong.

And this is a story we see over and over again: the selective publication of results without reference to the full set of data; a slight shortcut taken, or potential issues with the data, that are not revealed to referees or to the readers of the paper; other groups spending months or years attempting to replicate results or simply to use a method described by another group. And in the meantime graduate students and postdocs get burnt on the pyre of scientific ‘progress’, discovering that something isn’t reproducible.

The Nature editorial is subtitled ‘Retracted papers require a thorough explanation of what went wrong in the experiments’. In my view this goes nowhere near far enough. There is no longer any excuse for not providing all the raw and processed data as part of the supplementary information for published papers. Even in the form of scanned lab-book pages, this could have made a big difference in this case, immediately indicating the degree of variability and the purity of the proteins. Many may say that this is too much effort, or that the data cannot be found. But if that is the case then serious questions need to be asked about the publication of the work. Publishers also need to play a role by providing more flexible and better-indexed facilities for supplementary information, and by making sure it is indexed by search engines.

Some of us go much further than this, and believe that making the raw data immediately available is simply a better way to do science. Certainly in this case it might have reduced the pressure to rush to publish, and might have forced a more open and more thorough scrutiny of the underlying data. This kind of radical openness is perhaps not for everyone, but it should be less prone to gaffes of the sort described here. I know I can have more faith in the work of my group where I can put my fingers on the raw data and check through the detail. We are still going through the process of implementing this move to complete (or as complete as we can manage) openness, and it’s not easy. But it helps.

Science has moved on from the days when a paper could only contain what would fit on the printed pages. It has moved on from the days when an informal circle of contacts would tell you which group’s work was repeatable and which was not. The pressures are high and the potential for career disaster probably higher. In this world the reliability and completeness of the scientific record are crucial. Yes, there are technical difficulties in making it all available. Yes, it takes effort, and yes, it will involve more work and possibly fewer papers. But the only thing that can ultimately be relied on is the raw data (putting aside deliberate fraud). If the raw data doesn’t form a central part of the scientific record then perhaps we need to start asking whether the usefulness of that record, in its current form, is starting to run out.

  1. Editorial. Nature 453, 258 (2008).
  2. Wenner, M. Nature 453, 271–275 (2008).
  3. Dwyer, M. A., Looger, L. L. & Hellinga, H. W. Science 304, 1967–1971 (2004).
  4. Allert, M., Dwyer, M. A. & Hellinga, H. W. J. Mol. Biol. 366, 945–953 (2007).

A new type of chemistry journal: Nature Chemistry requests input

As has been noted in a few places, Neil Withers, one of the editors of the soon-to-be newest Nature journal, Nature Chemistry, put out a request last week for input on a range of issues to do with how people use journals, formats, and technical widgets. Egon Willighagen, Rich Apodaca, and Oscar the Journal Munching Robot (masquerading as Peter Murray-Rust, or is that the other way around?) have already posted responses. Here I want to add my own thoughts and possibly amplify some of the points others have made. Continue reading “A new type of chemistry journal: Nature Chemistry requests input”

More on the science exchange – or building and capitalising a data commons

[Image from Wikipedia via Zemanta: Banknotes from all around the world donated by visitors to the British Museum, London.]

Following on from the discussion a few weeks back, kicked off by Shirley at One Big Lab and continued here, I’ve been thinking about how to actually turn what was a throwaway comment into reality:

What is being generated here is new science, and science isn’t paid for per se. The resources that generate science are supported by governments, charities, and industry but the actual production of science is not supported. The truly radical approach to this would be to turn the system on its head. Don’t fund the universities to do science, fund the journals to buy science; then the system would reward increased efficiency.

There is a problem at the core of this. For someone to pay for access to the results, there has to be a monetary benefit to them. This may be through increased efficiency of their research funding, but that’s a rather vague benefit. For a serious charitable or commercial funder there has to be the potential either to make money or at least to see that the enterprise could become self-sufficient. But surely this means monetizing the data somehow? Which would require restrictive licences, which is not, in the end, what we’re about.

The other story of the week has been the, in the end very useful, kerfuffle caused by ChemSpider moving to a CC-BY-SA licence, and the confusion it has revealed about data, licensing, and the public domain. John Wilbanks, whose comments on the ChemSpider licence sparked the discussion, has written two posts [1, 2] which I found illuminating and which have made things much clearer for me. His point is that data naturally belongs in the public domain, and that the public domain and the freedom of the data itself need to be protected from erosion, both legal and conceptual, that could be caused by our obsession with licences. What does this mean for making an effective data commons, and the Science Exchange that could arise from it, financially viable? Continue reading “More on the science exchange – or building and capitalising a data commons”

Attribution for all! Mechanisms for citation are the key to changing the academic credit culture

[Image via Wikipedia: A reviewer at the National Institutes of Health evaluates a grant proposal.]

Once again, a range of conversations in different places has collided in my feed reader. Over on Nature Networks, Martin Fenner posted on Researcher ID, which led to a discussion about attribution and in particular to Martin’s comment that there was a need to be able to link to comments, and hence the necessity of timestamps. Then DrugMonkey posted a thoughtful piece about the issue of funding-body staff introducing ideas from unsuccessful grant proposals they have handled into projects they have a responsibility for guiding. Continue reading “Attribution for all! Mechanisms for citation are the key to changing the academic credit culture”