Sourceforge for science

6 November 2007 19 Comments

I got to meet Jeremiah Faith this morning and we had an excellent, wide-ranging discussion, which I will try to capture in more detail later. However, I wanted to get down some thoughts we had at the end of the discussion. We were talking about how to publicise and generate more interest and activity for Open Notebook Science. Jeremiah suggested the idea of a Sourceforge for science: a central clearing house somewhere on the web where projects could be described and people could opt in to contribute. There have been some ideas in this direction, such as Totally retrosynthetic, but I don’t think there has been a lot of uptake there.

This was all tied into the idea of making lab books findable and indexed in places where people might look for them. I have been taken with the way PostGenomic and ChemicalBlogSpace aggregate blogs, particularly blog posts on the peer reviewed literature, and, in the case of ChemicalBlogSpace, aggregate comments on molecules by trawling for InChI keys (I think). So can we propose that one (or both?) of these sites start aggregating online notebook posts? If we could make these point at peer reviewed papers online, it would also be possible to use a modified version of the Blue Obelisk Greasemonkey script that would pop up whenever you were looking at a paper for which there was raw data online.

It wouldn’t be necessary, or perhaps even advisable, to limit these to people strictly practising Open Notebook Science. People could put up data once a paper was published or after a delay. Perhaps we need not even require that all the raw data be put up. If the barriers are lowered, more people may do it. A range of appropriate tags (‘Partial Raw Data is available for this paper’, ‘Full raw data is available for this paper’, ‘Full raw data and associated data is available as an open notebook’) would distinguish between what people are making available. Data could be dropped anywhere online, and through aggregation it would gain more visibility, encouraging people to move from making specific data available towards making all their data available.
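
As a rough sketch of how an aggregator might use such tags (the tag strings are the ones suggested above; the `NotebookPost` record, the function names, and the example DOIs and URLs are all made-up placeholders, not anything PostGenomic or ChemicalBlogSpace actually provides):

```python
from collections import defaultdict
from dataclasses import dataclass

# Availability tags from the post, ordered from least to most open.
TAGS = [
    "Partial Raw Data is available for this paper",
    "Full raw data is available for this paper",
    "Full raw data and associated data is available as an open notebook",
]


@dataclass(frozen=True)
class NotebookPost:
    """Hypothetical record an aggregator might keep for one online post."""
    url: str        # where the data or notebook page lives
    paper_doi: str  # the peer reviewed paper the post points at
    tag: str        # expected to be one of TAGS


def index_by_paper(posts):
    """Group posts by the paper they reference, skipping unknown tags."""
    index = defaultdict(list)
    for post in posts:
        if post.tag in TAGS:
            index[post.paper_doi].append(post)
    return dict(index)


def best_availability(posts_for_paper):
    """Return the most open tag among the posts for a single paper."""
    return max((p.tag for p in posts_for_paper), key=TAGS.index)


# Placeholder data: two posts point at one paper, one at another.
posts = [
    NotebookPost("http://example.org/lab/run-12", "10.1000/example.1", TAGS[0]),
    NotebookPost("http://example.org/notebook/42", "10.1000/example.1", TAGS[2]),
    NotebookPost("http://example.org/data/7", "10.1000/example.2", TAGS[1]),
]
index = index_by_paper(posts)
for doi, found in index.items():
    print(doi, "->", best_availability(found))
```

A browser script of the kind mentioned above would then only need to look up the paper being viewed in `index` and show the links it finds, if any.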

Any thoughts?


19 Comments »

  • » Patient drug information at PubMed » business|bytes|genes|molecules said:

    […] then use to create our own mashups and provide information to others (perhaps listed in the “Sourceforge for science” that Cameron mentions). Certainly some companies are doing that today, either through […]

  • Jean-Claude Bradley said:

    Cameron,
    This relates to the discussion we had during your talk at Drexel about the different formats for different audiences. My UsefulChem blog is on both PostGenomic and ChemicalBlogSpace. (All you have to do is request that they add yours.) And I repost the less technical stuff to my Chemistry Wide Open column at ScientificBlogging.com.

    I don’t think many people will be interested in reading our notebooks in that way and that is why it is critical for them to be indexed in Google so that people who don’t know about our work will find it when they look. Of course we should link our “big picture” blog posts to specific notebook pages to support our arguments.

    This is the beauty of not giving away your copyright to a publisher :) – we can keep experimenting with different systems to communicate the same information

    You know I am always game for playing with new toys – if you and Jeremiah set something up, let’s see if we can populate it.

  • Bill said:

    Doesn’t OpenWetWare have a page of open projects? I just had a quick look but couldn’t find it… even if memory doesn’t serve (it happens) and they don’t have any such thing, they might be up for creating one.

  • Pedro Beltrao said:

    I think it would be very interesting to go down to the notebook level in some cases. If I am interested in reproducing the work in the paper, it would be very useful to be able to follow the steps the authors took to produce the final results.

  • Bill said:

    Ah, memory is indeed faulty, but not completely so. I was thinking of The Synaptic Leap’s collaborative communities.

  • Jeremiah Faith said:

    That “collaborative communities” idea is along the right lines. However, I think Sourceforge already has almost everything that is needed, without having to reinvent it all (except that I think their space limits would be too small for practical purposes with large datasets).

    1) they have all kinds of ways for contributors to communicate (forums, mailing lists, feature requests)
    2) projects are well defined (you define how far along the project is, whether you’re willing to accept collaborators, what type of collaborators you need, the language, and the categories the project falls under)
    3) all code changes are kept in CVS, which tracks all changes from all the project’s coders, making sure that everyone is using the same code and that you can always revert to any previous version (like a wiki; and what’s currently missing from blogs).

    In our open notebook science case, CVS would hold the notebook and all revisions to the notebook by all contributors to the project (CVS automatically tracks who enters what and when). Then all that is needed are some viewers to provide interfaces to the CVS notebook. For example, you could write a LaTeX-based engine to convert the work to PDF. You could write a second interface to post the information to a blog (e.g. via a custom WordPress plugin). The last thing needed would be a simple web interface to make it easy for non-computer nerds to enter their data into CVS (this too could probably be done by hacking on WordPress).

  • Cameron Neylon said:

    I was thinking less of the collaboration aspects – people could actually do that with our system (and JCB’s) as they are – but more of the potential of indexing. If there is a handle people can grab hold of, might we be able to persuade PubMed/Google Scholar etc. to aggregate the data stream and link it through to the published paper?

    In particular I was thinking that if we could somehow label papers for which raw data is available then this might encourage people to make more raw data available and in turn encourage more to pursue ONS.

    The ‘Sourceforge’ concept was very much Jeremiah’s so I will leave him to expand on that. I guess something similar could also probably be done in Facebook (I really am going to have to look at it at some point). The problem with any central site is getting a big enough community to drive added value – which is kind of the problem we already have, and the central problem with any “{Generic Web2.0 tool} for Science”.

  • Jeremiah Faith said:

    Yeah, the lack of ONS people is certainly the fundamental problem. That’s why it’s good to toss ideas around and try things now to figure out what works and what doesn’t.

    For now, I think one of the best ideas was Cameron’s idea to publish all of the raw data and notebook as a PDF supplement with the publication. That will certainly boost awareness amongst the general non-blog-reading scientific community (particularly if the paper is high profile).

  • Julius said:

    Referring to Jeremiah’s first comment on the 3 things provided by SourceForge: I would argue that the OpenWetWare wiki itself covers the first 2 pretty well – communication methods and project definitions. What we really need is to fill the 3rd – a revision control system (of which CVS is one).

    There has been some discussion in a recent OWW steering committee meeting about providing ‘code hosting’ on OWW whereby a user would request a code repository, and it would be hosted on OWW servers (much as is done on SourceForge). The advantages to this are many, but two of them are that:

    1.) From a user’s perspective, OWW can now house all the prose associated with a project (lab notebooks, protocols, paper writing, etc.) as well as the code used to perform data analysis. With both of these sources of information in one place, OWW can start to provide very convenient inter-linking, such as the ability to reference the code files used to generate the plots on a particular wiki page (or blog post). It would also be easy for people to look up the wiki pages that were used to write a paper, click on links to access the code in the repositories and run the analysis again with the parameters or inputs of their choice.

    2.) OWW now has another piece of the puzzle to being an ‘Open Science’ project management system, very analogous to SourceForge being an Open Source project management system.

    The advantages to having things under one roof (rather than promoting the use of SourceForge) are several as well including:

    1.) It will be much easier to integrate things into the current OWW resources (mentioned above).
    2.) It will certainly turn out that the needs of scientists are different from the needs of software developers. I think scientists could certainly benefit from learning a few of the practices of modern software development (like version control), but there will be differences. We could tweak the OWW system to be more suited to scientists, while promoting the natural parts of the software development system.
    3.) This can eventually be built upon, including versioning of scientific data sets that could be integrated into OWW services as described above.

    If OWW were to offer a revision control system for code (or anything else you can put into such a system), would people be interested?

  • Jeremiah Faith said:

    Sorry to reply so late, Julius. I think those are some fantastic ideas and I definitely would be interested in such a system. I agree that OWW is a better spot than Sourceforge. It would be nice to have a wish list somewhere on OWW to start bouncing around ideas of what would be essential, useful, etc…

    Couple details:
    1) the repository should probably use svn, not cvs. I use cvs for my open lab notebook now, but it isn’t terribly good for binary raw data files
    2) you all would need quite a lot of space, since raw data can be pretty hefty (particularly versioned raw data)

    One definite thing for the wish list:
    Any project blog needs a feature on the comment submission that allows folks to get an email when new comments are added to a post they’ve commented on. It’s really difficult to have a group discussion like this with a blog comment system, because it’s hard to know when new comments have been added (unless you are the author of the post).
