12 Replies to “Proposing a data model for Open Notebooks”

  1. I think it depends on how you want to handle it. I would put it in the procedure or, conversely, handle it at a higher level in a separate note/blog/wiki post. I am not trying to cover all parts of the process here, just the recording of facts. So you could treat interpretation either as a running note in the procedure (‘I think this is working’), or as a direct characteristic of a product (you can isolate a precipitate), or perhaps more formally as the product of a consideration of your analysis data and other products (e.g. input = product [weighs X], NMR, MS, etc.; the procedure is ‘I decided whether I’d made the right thing’; the product is ‘yes’), but that seems a little ridiculous. A piece of metadata that says ‘this worked’ would probably do the job in many cases. More sophisticated interpretation will probably require more words and explanation.

    And I am assuming this would sit on a versioning system like a wiki. One thing that this system doesn’t capture well is the way you interact with students on the UsefulChem Wiki by pointing out gaps or mistakes. But maybe this is captured by the version system, which logs who made what additions (which perhaps means you don’t need an author as a required tag?).

    In any case I am sure there are serious problems with this as it stands, not least that if you do a time-course experiment it requires you to keep splitting up the procedure. But maybe we can iterate this a few times and see whether it can ever work.

    The key aspect of my proposal, I think, is that things need a URI, and that the rest should be as flexible as possible.
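
    To make that concrete, here is a minimal sketch of what one record might look like (all field names are hypothetical; the only thing I would insist on is the URI):

        # One 'fact' record: the URI is required, everything else is optional.
        # Field names are illustrative, not a fixed schema.
        entry = {
            "uri": "http://example.org/notebook/expt-42/step-3",
            "procedure": "stirred amine with acid chloride, 2 h, room temp",
            "inputs": [
                "http://example.org/notebook/samples/amine-batch-7",
                "http://example.org/notebook/samples/acid-chloride-2",
            ],
            "products": ["http://example.org/notebook/samples/crude-amide-1"],
            "notes": "this worked",  # lightweight interpretation as plain metadata
        }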

  2. There is a data model for scientific experiments that should set you on your way. It’s called the Functional Genomics Experiment model (FuGE: http://fuge.sourceforge.net/). It’s a UML model, which allows it to be realised in several concrete formats, for example XML or relational tables. I declare a bias, as I have extended it to create an XML transfer format for gel electrophoresis (GelML: http://psidev.info/index.php?q=node/254), and therefore highly recommend it. The specificity comes from the use of ontologies or controlled vocabularies to add semantics to the generic structure: for example, the prospective use of the Ontology for Biomedical Investigations, or any number from the OBO Foundry. Even though the name suggests it’s for functional genomics, I am currently using it to represent neuroscience electrophysiology experiments within CARMEN (http://carmen.org.uk/), via Symba (http://symba.sourceforge.net/), an open-source database implementation of the FuGE data model that provides a versioning mechanism and a user/query interface, although it is still in early development.
    Mm, managed to get a few plugs in there :) but I do think these technologies can provide a platform for a standard representation of scientific experiments. Is a wiki the best option? It mimics a traditional paper-based lab book very well; however, it is not very computationally amenable. Do we need to think more outside the box? If we make our data open in our lab book, do we not also want to perform analysis over it where it sits, rather than downloading it back to our desktop?

  3. Hi Frank, I don’t think we fundamentally disagree. I think controlled vocabs are great where they can be applied, but in my experience you either end up in a situation where your vocab no longer fits (should I use FuGE or CML?) or, worse, where your vocab actually constrains your experiment.

    Where they can be used, and where things can be agreed, tools should be written to help people use them. But at the moment I don’t think those tools exist for the humble experimenter such as myself.

    To give an example: your GelML documentation runs to N pages. From my perspective I would just say ‘I ran a gel’; see e.g. http://chemtools.chem.soton.ac.uk/projects/blog/blogs.php/bit_id/2733
    Now what would be brilliant would be if our system could automatically say, ‘well, this is a gel, therefore I should try to expose the description in GelML with all the info that I have available’. But then someone has to implement that for gels, for columns, for NMR, for neutron scattering data, … (a rough sketch of this idea follows at the end of this comment).

    My argument is that we need to be able to use all of these vocabs, or none, to describe experiments to a practical level of detail. My proposal isn’t to ‘use a wiki’; in fact we use a blog. Most other people who are really using these systems in anger in a wet lab are using wikis because they prefer the functionality available. We need an extremely general framework that we can then slot all of these different systems into.
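
    As a sketch of that dispatch idea (everything here is hypothetical; it just shows the shape of the thing):

        # Hypothetical sketch: map an experiment type to an exporter that emits
        # a richer vocabulary (GelML, CML, ...) from whatever fields we have.
        EXPORTERS = {}

        def exporter(experiment_type):
            """Register a function that serialises an entry into one vocab."""
            def register(fn):
                EXPORTERS[experiment_type] = fn
                return fn
            return register

        @exporter("gel")
        def to_gelml(entry):
            # Placeholder only: a real exporter would emit valid GelML XML.
            return "<gel>%s</gel>" % entry.get("procedure", "")

        def expose(entry):
            """Use the richer vocabulary when one applies, else the raw record."""
            fn = EXPORTERS.get(entry.get("type"))
            return fn(entry) if fn is not None else entry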

  4. Just to add to that: looking at the GelML documentation, I would see that as a data file. Ideally it would be generated by some electrophoresis system. The ‘lab book’ would then point at that file as the ‘product’ of the procedure ‘ran gel’, with the inputs ‘these samples’ (sketched below). A clever system would either a) pull the procedure and the inputs from the GelML file, or b) take these from the lab book and then add them to the GelML file.

    My thinking primarily relates to the generation of samples. Once you get into instrumentation, a lot of the issues are already taken care of by the systems available.
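
    In sketch form (URIs invented for illustration), the ‘ran gel’ entry would just be:

        # The lab book entry points at the externally generated GelML file as
        # its product; a clever system would sync procedure/inputs both ways.
        ran_gel = {
            "uri": "http://example.org/notebook/expt-51/ran-gel",
            "procedure": "ran gel",
            "inputs": ["http://example.org/notebook/samples/lysate-3"],
            "products": ["http://example.org/data/expt-51/gel-1.gelml"],  # data file
        }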

  5. “…My argument is that we need to be able to use all of these vocabs, or none, to describe experiments to a practical level of detail….”

    I agree; FuGE will allow you to do this.

    With all these things, especially modelling, there are lots of ways to reach the end goal. With respect to vocabularies, often these are just recommended strings (which are valuable in themselves). However, ontologies, specifically OBI, are an effort to assign meaning and machine interpretability to terms as a community-wide consensus, to avoid the issues you mention.

    Tools: tools are a big issue and often a stumbling block. Ideally, as you suggest, the information or data files should be automatically exported from instrumentation. This is currently not the case, and therefore there is a high cost involved; for some, such as the “humble experimenter”, this is an unacceptable cost. Often the benefits of annotation and sharing are not obvious to the person who generated the data, but I am preaching to the converted :)

    In regards to the data model: FuGE is the data model [PMID: 17921998]. It is a heavyweight solution, although it is designed to represent the common aspects of a scientific experiment. The GelML case is an extension of the FuGE data model to be more specific in the representation of a gel electrophoresis experiment.

    However, if you are looking for a data model for the representation of scientific experiments, FuGE should be your first port of call. You do not have to use it out of the box as one entity; you can re-use components. It allows the representation of protocols (the standard instructions/templates) and of the applications of those protocols, capturing the day-to-day changeable information such as dates and lot numbers without having to fill in the full protocol again (see the sketch at the end of this comment). Ontologies or controlled vocabularies are recommended but not required; you can use free text.

    It’s feasible that text from wikis or blogs could be extracted, by text mining or RDF queries, and then placed in a model like this. However, this is close to the data integration paradigm that currently exists in bioinformatics, and it’s a lot of wasted effort. I would argue it is better to structure and annotate the data at the source on day one than to try to extract the information later. I have this problem with my current project, in that semantic extraction involves picking up the phone and asking the experimenter what they did and what it all means. I can see similar parallels here; the only difference is that the lab book is electronic and not paper: it is still unstructured information.

    Like I say, there are many ways to achieve the same end. A common data model for scientific experiments exists, one which has undergone a long, cross-community, public consultation process. It does need tool support. However, just hypothesising, even an XSLT representation might be able to be “plugged” into a blog or a wiki to provide the structure.

    I am afraid I am in the data model, structured and semantic data camp. Electronic notebooks and open data are a step in the right direction, a massive step. However, I still believe that unless the data comes structured with semantics, whether via data models, RDF, or whatever the Zeitgeist technology is, we are still not maximising the potential of this information.
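
    To sketch that protocol/application split in code (simplified, hypothetical class names, not the actual FuGE classes):

        from dataclasses import dataclass, field

        @dataclass
        class Protocol:
            """The reusable template: fixed instructions, written once."""
            name: str
            steps: list

        @dataclass
        class ProtocolApplication:
            """One run of a Protocol: only the day-to-day variables are recorded."""
            protocol: Protocol
            date: str
            parameters: dict = field(default_factory=dict)  # e.g. lot numbers

        sds_page = Protocol("SDS-PAGE", ["cast gel", "load samples", "run at 120 V"])
        run_1 = ProtocolApplication(sds_page, "2008-03-14", {"acrylamide_lot": "A1234"})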

Comments are closed.