How do we build the science data commons? A proposal for a SciFoo session

24 May 2008 38 Comments

I realised the other day that I haven’t written an excitable blog post about getting an invitation to SciFoo! The reason for this is that I got overexcited over on FriendFeed instead and haven’t really had time to get my head together to write something here. But in this post I want to propose a session and think through what the focus and aspects of that might be.

I am a passionate advocate of two things that I think are intimately related. The first is that I believe strongly in the need for, and the benefits that will arise from, building, using, and enabling the effective search and processing of a scientific data commons. I [1,2] and others (including John Wilbanks, Deepak Singh, and Plausible Accuracy) have written on this quite a lot recently. The second is that I believe strongly in the need for effective, useable, and generic tools to record science as it happens and to process that record so that others can use it effectively. By providing the tools that enable the record to be created, and integrating them with the systems that will store and process the data commons, we can enable scientists to record their work better, communicate it better, and make it available as a matter of course to other scientists (not necessarily immediately I should add, but when they are comfortable with it).

At last year’s SciFoo Chris DiBona ran a session called ‘Give us the Data’ [Ed. As noted below, Jon Trowbridge led the session with Chris DiBona acting as ‘session chair’.] Now I wasn’t there, but Deepak Singh wrote the session up in a blog post which is my main inspiration. The idea here seemed to be that people would send in big data sets and Google could slurp these up for processing. Now the world has moved on from last year and I believe we can propose something much more ambitious with the tools we have available. We can build the systems, or at least the prototypes of systems, that will be so useful for recording and sharing data that people will want to use them. Google AppEngine, Amazon Web Services, Freebase, CouchDB, Twine, and the growing efforts in actually making controlled vocabularies useful and useable, as well as a growing web of repositories and services, make this look like an achievable target in a way that it wasn’t twelve months ago. Once we start building these repositories of data there will be a huge incentive both for scientists and big players such as Google, Microsoft, Yahoo, IBM, and Amazon to look at how to effectively process, re-process, and present this data. That will be exciting, but first we need to build the tools and systems that will enable us to put the data there.

What does this ‘system’ look like? I’m going to suggest a possible model, based on my non-technical understanding of how these things work. I will probably get the details wrong (and please comment where I do). Let’s start at the storage end. We are only interested in open repositories, where it is explicitly stated that data is in the public domain and therefore freely re-useable. Some of these repositories exist for specific domains: ChemSpider, NMRShiftDB, and crystallography databases in the chemistry domain; GenBank, PDB, and many others in the biosciences domain. To be honest, for most large datasets repositories already exist. The problem is not ‘big science’, the problem is ‘small science’ or ‘long tail’ science: the vast quantity of data and description that is generated by small numbers of people in isolated labs. Huge amounts of data languishing on millions of laptops in non-standard formats. All that stuff that no-one can really be bothered dealing with.

The storage itself is not really a problem. It can go on Amazon S3, Google servers, or into the various Institutional Repositories that Universities are rapidly deploying internationally. In fact, it should go into all three and a few other places besides. Storage is cheap, redoing the experiment is not. The digital curation people are also working on the format problem with standards like ORE providing a means of wrapping collections of data up and describing the format internally. Combine this with an open repository of data formats and the problem of storing and understanding data goes away, or rather it is still a large technical problem but one that is reasonably soluble with sufficient effort. One thing that is needed is standards for describing data formats; both standards in terms of the description and quality standards for what are the minimal requirements for describing data (MiSDaFD anyone?). A lot of work is required in the area of controlled vocabularies and ontologies, particularly in science domains where they haven’t yet got much traction. These don’t need to be perfect, they just need to be descriptive. But once again, how to go about doing this is reasonably clear (devil is in the details I know).
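To make the idea of ‘wrapping and describing’ a little more concrete, here is a minimal sketch in Python. It is not ORE, and it is not any existing standard: every element and attribute name, the file name, and the example.org address are assumptions made up purely for illustration. It simply shows a wrapper that lists the files in a collection, points each one at a machine readable description of its format, and records a checksum so that copies held in several repositories can be compared.

```python
import hashlib
import xml.etree.ElementTree as ET


def wrap_dataset(files, licence):
    """Build a small XML wrapper describing a collection of data files.

    `files` is a list of (name, raw_bytes, format_description_url) tuples.
    All element and attribute names are hypothetical, not part of ORE
    or of any other published standard.
    """
    root = ET.Element("dataCollection", {"licence": licence})
    for name, raw, format_url in files:
        item = ET.SubElement(root, "item", {"name": name})
        # Pointer to a machine readable description of the file format.
        ET.SubElement(item, "formatDescription", {"href": format_url})
        # A checksum lets copies stored in different places be compared.
        ET.SubElement(item, "sha256").text = hashlib.sha256(raw).hexdigest()
        ET.SubElement(item, "sizeBytes").text = str(len(raw))
    return ET.tostring(root, encoding="unicode")


# Placeholder bytes standing in for a real data file.
wrapper_xml = wrap_dataset(
    [("spectrum_001.jdx", b"##TITLE=example\n##END=\n",
      "https://example.org/formats/jcamp-dx")],
    licence="public-domain",
)
print(wrapper_xml)
```

The point of the sketch is that the wrapper says nothing about what is inside the data file. It only says where to find out, which is why an open repository of format descriptions matters as much as the wrapper itself.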

The fundamental problem is how to capture the data and associated metadata, and do the wrapping, in a way that is easy and convenient enough that scientists will actually do it. It will come as no surprise to the regular readers of this blog that I would advocate the development of tools to enable scientists to do this effectively at the stage at which they already record their experiments: when they are done. Not only is this the point where the scientist already has the task of planning, executing, and recording their experiment, it is also the point at which embedding good metadata and recording practice will be of the most benefit to them.

So we need to build a lab book system, or a virtual research environment if you prefer, because this will encompass much more than the traditional paper notebook. It will also encompass much more than the current crop of electronic lab notebooks. I don’t want to go into the details of what I think this should look like here; that’s the subject for a post in its own right. But to give an outline, this system will consist of a feed aggregator (not unlike FriendFeed) where the user can subscribe to various data streams: some from people, some from instruments or sensors, and some that they have generated themselves, such as pictures at Flickr or comments on Twitter. The lab book component itself will look more or less like a word processor, or a blog, or a wiki, depending on what the user wants. In this system the user will generate documents (or pages); some will describe samples, some procedures, and some ideas. Critically this authoring system will have an open plugin architecture supported by an open source community that enables the user to grab extra functionality automatically.
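As a rough illustration of the aggregator idea, the sketch below merges a couple of time-ordered streams into a single feed using Python’s standard library. The `Entry` class, the `aggregate` function, and the source names and messages are all invented for the example; a real system would be pulling from RSS/Atom feeds or instrument logs rather than from hand-written lists.

```python
import heapq
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(order=True)
class Entry:
    """One item in a stream; entries are ordered by timestamp only."""
    timestamp: datetime
    source: str = field(compare=False)
    text: str = field(compare=False)


def aggregate(*streams):
    """Merge several streams, each already in time order, into one feed."""
    return list(heapq.merge(*streams))


# Hypothetical streams: one from an instrument, one from the notebook itself.
instrument = [
    Entry(datetime(2008, 5, 24, 9, 0), "plate-reader", "Run 12 finished"),
    Entry(datetime(2008, 5, 24, 11, 30), "plate-reader", "Run 13 finished"),
]
notes = [
    Entry(datetime(2008, 5, 24, 10, 15), "notebook", "Prepared sample B"),
]

feed = aggregate(instrument, notes)
for entry in feed:
    print(entry.timestamp.isoformat(), entry.source, entry.text)
```

Ordering the merged feed by time is only the simplest possible choice. The interesting work is in letting the user pick out and link the items that belong to one experiment.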

A simple plugin might recognise a DNA sequence, or an InChI code, and provide a functional graphical display, complete with automatically generated links back to the appropriate databases. Another plugin might provide access to a range of ontologies or controlled vocabularies which enable autocompletion of typed text with the appropriate name, or easily or automatically generate the PDB/SBML/MIBBI compliant datafiles required for deposition. Journal submission templates could also be plugins or gadgets that assist in pulling all the relevant information together for submission and wrapping it up in an appropriate format. Semantic authoring at the journal submission stage is never going to work. Building it in early on and sucking it through when required just might. Adding semantic information could be as easy as (but probably should be no easier than) formatting text: select, right click, add semantics from a drop down menu.
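For what it’s worth, the ‘recognising’ half of such a plugin can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the registry, the function names, the rule that a DNA sequence is a run of twelve or more A/C/G/T characters, and the rule that an InChI is anything starting with ‘InChI=1/’ or ‘InChI=1S/’ up to the next space. A real plugin would need much more careful validation, and the display and database linking are left out entirely.

```python
import re
from typing import Callable, Dict, List, Tuple

# Registry mapping a plugin name to a function that finds spans in text.
PLUGINS: Dict[str, Callable[[str], List[Tuple[int, int]]]] = {}


def plugin(name):
    """Decorator that registers a recogniser under the given name."""
    def register(func):
        PLUGINS[name] = func
        return func
    return register


# Assumption: a "DNA sequence" is a run of 12 or more A/C/G/T characters.
DNA_PATTERN = re.compile(r"\b[ACGT]{12,}\b")
# Assumption: an InChI starts with "InChI=1/" or "InChI=1S/" and runs to whitespace.
INCHI_PATTERN = re.compile(r"InChI=1S?/\S+")


@plugin("dna")
def find_dna(text):
    return [m.span() for m in DNA_PATTERN.finditer(text)]


@plugin("inchi")
def find_inchi(text):
    return [m.span() for m in INCHI_PATTERN.finditer(text)]


def annotate(text):
    """Run every registered plugin and return (plugin name, matched text) pairs."""
    found = []
    for name, finder in PLUGINS.items():
        for start, end in finder(text):
            found.append((name, text[start:end]))
    return found


note = ("Ligated ATGGCCATTGTAATGGGCCGC into the vector; "
        "solvent was InChI=1S/CH4O/c1-2/h2H,1H3 (methanol).")
print(annotate(note))
```

Because each recogniser is just a registered function, someone else can add a new one without touching the authoring tool, which is the property the open plugin architecture is meant to provide.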

This environment will have a proper versioning system, web integration to push documents into the wider world, and a WYSIWYG interface. It will expose human readable and machine readable versions of the files, and will almost certainly use an underlying XML document format. It will probably expose microformats, RDF, and free text tags.

I think there are essentially two routes towards this kind of system. One is to build into existing word processor and document handling ecosystems, essentially building a plugin architecture for Word or OpenOffice. This would have the advantage of a system that is already widely used and familiar to scientists. It has the disadvantage (at least in the Microsoft case) that it would be built on top of a closed source product and be reliant on the continued existence and support of a large company. The alternative approach is to bring a completely open source product through. This will require much more work, particularly on the user interface side, but may be easier to handle in terms of an open source plugin architecture. It could be entirely web based. Google AppEngine provides a very interesting environment for building such a thing. CouchDB may be a good way of maintaining the underlying data. Google Gadgets may well provide a lot of the desired plugin architecture, and widgets of various types, JavaScript, bookmarklets, and all the various tools on the web may provide a lot of the rest.

The purpose of this system is ultimately to encourage people to place research results in the public domain in a way that makes them useful. For it to get traction it has to be a no-brainer for people to adopt as a central part of the way they run their laboratory. If it doesn’t make people’s lives easier, a lot easier, it’s not going to get any traction. The user interface will be crucial, reliability will be crucial, and the availability of added functionality which is currently simply unavailable will be crucial.

Agreeing an architecture is likely to be a challenge but an open source project is probably a more effective way of leveraging effort from a wide community. There are many open standards available that can be used here and the key is getting them all working together. Ideally the choice of authoring environment won’t matter because it can still push documents to the web in the same format(s), still use the same plugins, still interface with the databases. Building this will cost money, quite a lot of money, but the flip side is getting the data and the procedures into the public domain. As I heard Christoph Steinbeck say yesterday, we can either get on and do this, or wait for the singularity. I think we’ve got a while to wait for that, so I think it’s time to get on with the build. Who wants to be involved?

  • Cameron,
    You’re covering a lot of ground here from different angles – this will make a great session at SciFoo certainly.

    My view is that the path of least resistance is to try to reformat lab notebooks that are already being maintained – I am thinking here of you, Gus and me as 3 easy examples. The system you describe should be able to handle these somewhat disparate fields and there won’t be any time spent on convincing people of the value of doing it.

    Concerning your point of capturing the data at the point of where it happens, it is going to be extremely difficult to do so with the active co-operation of most scientists. The most striking example to me is the lack of use of the JCAMP-DX format for NMR spectra. Even though it is trivial to take the raw data in that format so that it can be expanded and mined at any point in the future, most organic chemists still just print out their spectra! Not only does this take more time but it may mean redoing the spectrum to get a particular expansion that was not printed out initially. And that does not even take into account the benefit of being able to share spectra with the world through a browser and the free JSpecView.

    In the longer run I think most of the open experimental data collected will be IN SPITE of the average researcher because they will be executing through machines (for example of the type that I described from Mettler-Toledo or ChemSpeed). The machines doing the experiments will always record what happens objectively. All that will be required is permission to use – which is an entirely different conversation – and probably another session at SciFoo…

    I look forward to discussing this in detail in person with you.

  • Eva

    First sentence: “I realized the other data…”
    A scientist’s Freudian slip? =)

  • Eva: indeed! Now corrected.

  • Further update: Attila Csordas has pointed out that the session I mention above was organised by Jon Trowbridge and points to some more commentary and slides on his blog

    Jean-Claude, I think we’re coming at this from the same angle in essence, just that I think it is possible to build tools that will enable scientists to generate useful feeds in spite of themselves. Or perhaps rather that people will realise the value of aggregating their own information and then make the leap to the ability to aggregate others. A lot of it is inertia. What would happen if NMR technicians turned off the printer and started just emailing JCamp files to people for instance? And if journals insisted on receiving characterisation data fully electronically?

    But hey, if we agreed completely it would make a rather dull session! The other thing I wanted to suggest was ‘Science in the streamosphere’ but I thought I’d restrict myself to one at the moment :)

  • As Attila notes, Jon was the leader. Chris was more of the session chair, since I think all the open data stuff is done within his group. Also, Jon’s slides at Scifoo were more detailed than his XTech presentation, but that gives you an idea of what they are thinking about.

    I think the key is to keep it simple. There is a lot that can be done, but we need to go with the basic building blocks and use the existing fabric. Getting past that inertia will be impossible otherwise.

  • Cameron – yes we are pretty much on the same side of things most of the time. I was just pointing out that “if you build it most will not come”, no matter how obviously useful it would be. And that is ok – just don’t expect it.

    Humans are creatures of habit. It took me a really long time to get used to reading article pdfs online instead of printing them out. It required repeatedly going over the inefficiency of that process until it became natural. (Books are a different story – I enjoy reading a physical book for pleasure)

    The point is that students learn by example. I’m sure you keep hearing the same arguments about how the new ways are not as beneficial as the old – for example many of my colleagues have pointed out that they keep to paper versions of journals for the ability to “browse”. A few days with FriendFeed or a properly configured blog reader will surely demonstrate how this is actually “browsing” on a whole new level using collaborative filtering.

    OK so what would happen if the NMR printer was down or if a journal required a JCAMP-DX format of an NMR for publication? Well first there would be lots of reflexive bitching and moaning. From that point how well the transition happens is dependent on how much support staff facilitate the process. In our case the support staff didn’t know about conversion from the proprietary NMR vendor format to the open JCAMP-DX. My group did some research and we found out how easy it actually was to do.

    This is really a minefield for many of the people involved and because of that change is difficult. I am going to continue to promote the use of the JCAMP format and how useful it is to use with JSpecView as I give talks and collaborate with people. But it is not my primary objective to convert scientists. I think we can do more good by helping the already convinced to implement these technologies.

  • Wow… this idea sounds *fantastic*, and also remarkably complicated from a software/uptake point of view. Not necessarily complicated in a bad way, but complicated in a “we’re going to need some very intelligent people to look at this” way. Which makes it an excellent topic for SciFoo.

    I’m just jealous that you get to go :)

  • Deepak, absolutely it needs to be kept simple and as far as possible it should involve tools that either people are already using or that are very easy to transfer to. I’m thinking less of an application and more of a platform that gives people a way of pulling things together.

    Jean-Claude, yes re JCamp I’ve experienced the same problem with not being able to get the conversion done (being told ‘that’s not the format it does’). Little steps could help though – I can see our NMR technicians at some point saying the paper is too expensive and starting to charge extra for that. But equally we embed the need to print out the spectrum in our training program. At their transfer vivas students are expected to bring a folder of their printed out characterisation data. They get marked down if they don’t. It’s mad.

    In both cases I think the route forward is to offer a fair bit of carrot and a little stick. If journals asked for electronic characterisation data then the impetus to at least keep the electronic files would be greater. But if people were paid to deposit marked up data then you can see analytical services using that as a way to cover their costs. Provide the students with tools to organise and collate their data and they might buy into it – give their supervisors money for making it happen and there could be a flood. The stick would be funders holding back some of a grant until they were happy that data had been deposited. I suspect this will probably happen eventually anyway.

    Julius Lucks described to me once a conversation he had with Paul Ginsparg in which he said ‘If you build it, they won’t come’ (I think it was with respect to a Digg-like system for arxiv.org). I think the way to overcome this is to offer these tools at the point where people realise they need them. If we can persuade funders/journals/etc to value high quality data deposition then people will want to know how to do it. If you can slide in at that point with ‘here are some tools that will make your life easier’ then I think we can make things shift faster.

    Partly I’m just impatient – twenty years ago email was viewed with suspicion, 10-15 years ago publishers thought it would probably take 50 years before journals went fully online. Things have already changed beyond recognition. And they will continue to do so. I’d just like to influence how they do that a bit.

  • Cameron, this is a fascinating but challenging proposal. I think that there are at least three different goals here: 1) creating a common data format, 2) building a software platform that does these things and 3) having most or all data openly available. 3) is a desired goal, but is not necessarily connected to 1) and 2). You propose an open repository, but technically there is no reason to do this (although it might be easier). And there are probably many different approaches to the right software platform, starting with the programming language: python, ruby, perl, java, php, erlang? As for the data format, it would be easy to spend months or years discussing the format.

    What will your proposed system look like at its core? Is it a wiki? Or a platform with plugins à la facebook? Or a set of network protocols used by many different applications? And where do we start? Probably with one of the existing tools you mentioned. But you could also start with a publishing platform such as TOPAZ used by PLoS and then work yourself backwards.

  • Cameron – nothing wrong with impatience – it makes the world go round :)

  • Hi Martin, I think I need to draw some pictures of what I mean here. The system at core is an authoring tool (or tools) that can access a common plugin architecture. These tools do two things. They take the record that we create by ordering the datastreams we generate, wrap it, and place it in an appropriate repository or repositories. And they also enable us to do the same thing with external datastreams (literature, data, social network tips, conversations etc). It is both an authoring and planning tool, and a layer through which we can interact with our own data (wherever it happens to be) and anyone else’s data (wherever it happens to be).

    In my mind it looks a bit like the offspring of a liaison between Word and Google Home page but I think it’s important that it should be flexible. Ideally it should work as a layer on top of anything you want: Word, Wiki, Blog, Text editor. It’s just a set of panels in your web browser if you like.

    It’s true that there is no technical reason for making the repositories open, it just happens that that is my agenda. For me, that is the end game. My aim is to make these tools available because they will help to expand the data commons. It is the case, however, that by making these tools you are also making a good data storage and processing system for a closed repository. And someone could make money from that as a way to recoup the development costs.

    Finally, I think ‘common data format’ is the wrong idea. What we need are common standards for wrapping and describing data formats. ORE is one way to do this but a simple XML wrapper that points to a machine readable format description on the cloud could probably do the job. By the time we build this a lot of these data format issues will probably be soluble by machine processing on the fly anyway.

  • Oops, forgot something. The idea of building backwards from a publishing platform is a very interesting one. When I get as far as writing up the RSC meeting on Open Access publishing in chemistry I want to talk about expanding the role of the publisher template as a way of getting more semantic information. Rather than just a wordprocessor template imagine that you’re writing your paper and something pops up saying ‘Would you like some help with this figure? Just point me at your data and I can draw a graph for you…’

    (Not my idea incidentally. The concept came from Jeremy Frey, Liz Lyons, or Simon Coles at a recent meeting)

  • Cameron, I am starting to understand where you want to go. Would Eclipse (Rich Client Platform) be a tool for that job (see for example Bioclipse)? Or does it have to be web-based?

    I’m all for open repositories, I just think that this would have an easier start if open data were not a requirement.

  • Martin,
    If I dare speak for Cameron – openness is central to his objectives.

    There are already many commercial Electronic Notebook Systems and Content Management Systems out there but they are designed with precisely the opposite intent: to keep people out.

    This was made very clear a few weeks ago when I met with a scientific software distributor. There was not even an option to store data in anything other than an encrypted, proprietary format.

    Some of us think that there will be a qualitative change in what science can accomplish and how it gets done when data are truly open and free. Distributed intelligence is extremely difficult with limited access to information.

  • You’re more than welcome to speak for me Jean-Claude, someone needs to cover for me when I’m asleep :)

    Yes, the key aim for me is this idea of the data commons. At the end of the day the rest is just software development, which doesn’t really interest me for its own sake. But more than that, I don’t think we’ll convince the general science community to buy in unless they see the network effects that we know and love in Web2.0. And I don’t think we’ll see that until the data is available. I think JC puts it well in the comment above. ‘Distributed intelligence…’

    I’ve looked at Bioclipse briefly and had a good talk with Christoph Steinbeck last week at the RSC meeting. I certainly think these things have a place in the ecosystem. The ideal situation would be everything loosely coupled with people able to use whatever authoring and ordering tools they like. This is at the core of the ‘web as platform’ concept.

    Technically I agree the build would be easier in a closed ecosystem with a defined repository architecture, but as JC says, that to a certain extent already exists. The technical challenges in building (and getting people to buy into) an open system are much larger, but both JC and I, amongst others, believe that the benefits are exponentially larger as well.

    But also there is no technical reason why this system shouldn’t be able to cope with internal or closed data. The only problem is the additional security and authentication issues. My argument is that we will get buy-in by actually paying people to make their data open. At the moment they’re not interested in making data available, closed, open, or whatever. By creating a significant carrot to do so, and thus making it part of the mainstream of science practice (as is anything that generates money), I think we take the agenda forward faster than it would otherwise naturally go.

    So not only do we need to build an open source, interoperable software framework/architecture/applications, we also need a significant quantity of money to persuade people to use it. Two impossible things before breakfast? :)

  • The issues we discuss here are also important in clinical research with patients (what I do most of the time). An essay about the need for an open source clinical trial data-management system was recently published in PLoS Medicine:
    http://medicine.plosjournals.org/perlserv/?request=get-document&doi=10.1371/journal.pmed.0050006

  • Cameron, brilliant as always.

    The biggest problem with ‘If you build it, they won’t come’ is that yours and other Open Science blogs are all preaching to the choir. How do we get the message out to the broader scientific community?

    I 100% agree with you that the key is to “persuade funders/journals/etc to value high quality data deposition” and to somehow manipulate the current warped incentives in science to favor openness over secrecy. It will take money, which means it will take the conversion of a few key players such as big funders and journals…

    Hopefully you guys can pull some serious strings at SciFoo – enjoy!

  • Anna

    Noam – I think the message will get out through word of mouth. A few evangelists, suitably placed, and enough people for whom it works, and suddenly we are there.

    :)

  • We also need an interface to query the database in natural language. This guy (http://www.media.mit.edu/cogmac/projects/hsp.html) in the Media Lab has a nice interface for dealing with teras and teras of data… language, I mean.

    Maybe it will be interesting to you.

    :)
