Science in the Open – Page 34 – The online home of Cameron Neylon

March 30, 2008 (updated December 30, 2009)

Data models for capturing and describing experiments – the discussion continues

Frank Gibson has continued the discussion that kicked off here and has carried on here [1, 2, 3, 4] and in other places [1, 2] along the way. Frank’s exposition on using FuGE as a data model is very clear about what it says and does not say, and some of his questions have revealed sloppiness in the way I originally described what I was trying to do. Here I will respond to his responses and try to clarify what it is that I want, and what I want it to achieve. I still feel that we are trying to describe and achieve different things, but this discussion is a great way of getting to the bottom of that and of achieving some clarity in our description and language.

March 26, 2008 (updated December 30, 2009)

Responding to PM-R on the structured experiment

This started out as a comment on Peter Murray-Rust’s response to my post and grew to the point where it seemed to warrant its own post. We need a better medium (or perhaps a semantic markup framework for Blogs?) in which to capture discussions like this, but that’s a problem for another day…


March 26, 2008 (updated December 30, 2009)

The structured experiment

More on the discussion of structured vs unstructured experiment descriptions. Frank has put up a description of the Minimal Information about a Neuroscience Investigation standard at Nature Precedings, which comes out of the CARMEN project. Neil Saunders has also made some comments on the resistance amongst the lab monkeys to thinking about structure. Lots of good points here. I wanted to pick out a couple in particular.

From Neil:

My take on the problem is that biologists spend a lot of time generating, analysing and presenting data, but they don’t spend much time thinking about the nature of their data. When people bring me data for analysis I ask questions such as: what kind of data is this? ASCII text? Binary images? Is it delimited? Can we use primary keys? Not surprisingly this is usually met with blank stares, followed by “well…I ran a gel…”.

Part of this is a language issue. Computer scientists and biologists actually mean something quite different when they refer to ‘data’. For a comp sci person data implies structure. For a biologist data is something that requires structure to be made comprehensible. So don’t ask ‘what kind of data is this?’, ask ‘what kind of file are you generating?’. Most people don’t even know what a primary key is, including me, as demonstrated by my misuse of the term when talking about CAS numbers, which led to significant confusion.

I do believe that any experiment [CN – my emphasis] can be described in a structured fashion, if researchers can be convinced to think generically about their work, rather than about the specifics of their own experiments. All experiments share common features such as: (1) a date/time when they were performed; (2) an aim (“generate PCR product”, “run crystal screen for protein X”); (3) the use of protocols and instruments; (4) a result (correct size band on a gel, crystals in well plate A2). The only free-form part is the interpretation.

Here I disagree, but only at the level of detail. The results of any experiment can probably be structured after the event. But not all experiments can be clearly structured either in advance or as they happen. Many can, and here Neil’s point is a good one: by making some slight changes in the way people think about their experiments, much more structure can be captured. I have said before that the process of using our ‘unstructured’ lab book system has made me think about and plan my experiments more carefully. Nonetheless I still frequently go off piste; things happen. What started as an SDS-PAGE gel turns into something else (say a quick column on the FPLC).
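To make the point of disagreement concrete, here is a minimal sketch in Python of Neil’s four common features plus the free-form interpretation. The class name, the field names, the extra `notes` field and the `missing_structure` helper are all assumptions made for illustration; none of this is the LaBLog’s actual data model. Only the date/time is required up front, so that a record can be started at the bench and filled in as the experiment changes shape.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class ExperimentRecord:
    """Hypothetical record built from Neil's list of common features."""

    performed_at: datetime                      # (1) date/time performed
    aim: Optional[str] = None                   # (2) aim, may be unknown at first
    protocols: List[str] = field(default_factory=list)    # (3) protocols used
    instruments: List[str] = field(default_factory=list)  # (3) instruments used
    result: Optional[str] = None                # (4) result, filled in afterwards
    interpretation: str = ""                    # the free-form part
    notes: List[str] = field(default_factory=list)  # assumed extra: off-piste remarks

    def missing_structure(self) -> List[str]:
        """Return the names of structured fields that are still empty."""
        missing = []
        if self.aim is None:
            missing.append("aim")
        if not self.protocols:
            missing.append("protocols")
        if not self.instruments:
            missing.append("instruments")
        if self.result is None:
            missing.append("result")
        return missing


# A record that starts as a gel and drifts into something else.
record = ExperimentRecord(performed_at=datetime(2008, 3, 26, 10, 0),
                          aim="run SDS-PAGE gel")
record.notes.append("switched to a quick column on the FPLC")
record.instruments.append("FPLC")
print(record.missing_structure())
```

Leaving most fields optional is the design choice that matters here: the structure is available when the experimenter has it, and a tool can ask for what is still missing rather than refusing the record.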

Without wishing to pick a fight, most people with a computer science background who lean towards the heavily semantic end of the spectrum are dealing with the wet lab scientists after the data has been taken and partially processed. I don’t disagree that it would help the comp sci people if the experimenters worked harder at structuring the data as they generate it, and I do think in general this is a good thing. The problem is that it doesn’t map well onto how the work is actually carried out. The solution, I think, is a mixture of the free-form approach combined with useful tools and widgets that do two things: firstly, they make capturing the process easier; secondly, they encourage the collection and structuring of data as it comes off. This is what the templates in our system do, and there is no reason in principle why they couldn’t be driven by agreed data models.

Actually the Frey group (who have done the development of the LaBLog system) already have a highly semantic lab book system, developed during the MyTea project. One of our future aims is to take the best of both forward into a ‘semi-semantic’ or ‘freely semantic’ system. One of the main problems with implementing the MyTea notebook is that it requires data models. It was developed for synthetic chemistry, but in expanding it into the biochemistry/molecular biology area it would make sense to utilise existing data models, with FuGE the obvious main source.

One more point: we need to teach students that every activity leading to a result is an experiment. From my time as a Ph.D. student in the wet lab, I remember feeling as though my day-to-day activities: PCR reactions, purifications, cloning weren’t really experiments […] Experiments were clever, one-shot procedures performed by brilliant postdocs to answer big questions […] Break your activities into steps and ways to describe them as structured data should suggest themselves.

This is very true, and harks back to my comment about language. A lot of the issues here are actually because we mean very different things by ‘experiment’. We probably should use better words, although I think procedure and protocol are similarly loaded with conflicting meanings. Control of language is important and agreement on meaning is, after all, at the root of semantics (or is that semiotics, I’m never sure…)

March 25, 2008 (updated December 30, 2009)

The heavyweights roll in…distinguishing recording the experiment from reporting it

Frank Gibson of peanutbutter has left a long comment on my post about data models for lab notebooks which I wanted to respond to in detail. We have also had some email exchanges. This is essentially an incarnation of the heavyweight vs lightweight debate when it comes to tools and systems for description of experiments. I think this is a very important issue and that it is also subject to some misunderstandings about what we and others are trying to do. In particular I think we need to draw a distinction between recording what we are doing in the lab and reporting what we have done after the fact.

March 25, 2008 (updated December 30, 2009)

Semantics in the real world? Part I – Why the triple needs to be a quint (or a sext, or…)

I’ve been mulling over this for a while, and seeing as I am home sick (can’t you tell from the rush of posts?) I’m going to give it a go. This definitely comes with a health warning as it goes way beyond what I know much about at any technical level. This is therefore handwaving of the highest order. But I haven’t come across anyone else floating the same ideas so I will have a shot at explaining my thoughts.

The Semantic Web, RDF, and XML are all the product of computer scientists thinking about computers and information. You can tell this because they deal with straightforward declarations that are absolute: X has property Y. Putting aside the issues with the availability of tools and applications, and the fact that triple stores don’t scale well, a central issue with applying these types of strategy to the real world is that absolutes don’t exist. I may assert that X has property Y, but what happens when I change my mind, or when I realise I made a mistake, or when I find out that the underlying data wasn’t taken properly? How do we get this to work in the real world?
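As a rough illustration of the question, here is a minimal Python sketch in which each statement carries who asserted it and when, plus a flag for withdrawing an earlier claim. The extra fields, the `Assertion` class and the `current_claims` function are assumptions made for this example only; they are not taken from RDF tooling and are not necessarily the scheme the full post goes on to propose.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Assertion:
    """A triple plus hypothetical context about the act of asserting it."""

    subject: str
    predicate: str
    obj: str
    asserted_by: str          # who made (or withdrew) the claim
    asserted_at: datetime     # when they did so
    retracts: bool = False    # True if this entry withdraws the claim


def current_claims(log: List[Assertion]) -> Set[Tuple[str, str, str]]:
    """Replay the log in time order and return the triples still standing."""
    claims: Set[Tuple[str, str, str]] = set()
    for entry in sorted(log, key=lambda a: a.asserted_at):
        triple = (entry.subject, entry.predicate, entry.obj)
        if entry.retracts:
            claims.discard(triple)   # a later retraction removes the claim
        else:
            claims.add(triple)
    return claims


log = [
    Assertion("X", "hasProperty", "Y", "me", datetime(2008, 3, 1)),
    Assertion("X", "hasProperty", "Y", "me", datetime(2008, 3, 25), retracts=True),
]
print(current_claims(log))
```

Nothing is ever deleted from the log in this sketch, so the history of a changed mind stays available even though the claim no longer counts as current.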

March 25, 2008 (updated December 30, 2009)

Incorporating My Experiment and Taverna into the LaBLog – A possible example

During the workshop in late February we had discussions about possible implementations of Taverna workflows to automate specific processes and make our lives easier. One specific example we discussed was the reduction and initial analysis of Small Angle Neutron Scattering data. Here I want to describe a bit of the background to what this is and what we might do, to kick off the discussion.

March 24, 2008 (updated December 30, 2009)

Open Science at PSB – Call for submissions

What Shirley said:

The call for participation for the Open Science workshop at PSB 2009 is now up! We welcome anyone with an interest in open science to submit proposals for talks. Note that although space is limited for talks and demos, anyone who registers for the conference can present a poster, so we also encourage poster submissions!

If you are interested in submitting a talk or poster, please get in touch. We would like to have a good and robust discussion with a range of perspectives on a range of topics. We are limited with respect to the time available so there will be some tough decisions to make. Nonetheless, please do get in touch; we would very much like to have a good representation of posters as well as talks. If there is interest then we can organise an unofficial session on the side of the meeting to take things further, perhaps towards ‘Open Science 2009’, a meeting in its own right?

March 23, 2008 (updated December 30, 2009)

Proposing a data model for Open Notebooks

‘No data model survives contact with reality’ – Me, Cosener’s House Workshop 29 February 2008

This flippant comment was in response to (I think) Paolo Missier asking me ‘what the data model is’ for our experiments. We were talking about how we might automate various parts of the blog system, but the point I was making was that we can’t have a data model with any degree of specificity because we very quickly find situations that it doesn’t fit. However, having spent some time thinking about machine readability and the possibility of converting a set of LaBLog posts to RDF, as well as the issues raised by the problems we have with tables, I think we do need some sort of data model. These are my initial thoughts on what that might look like.

March 14, 2008 (updated December 30, 2009)

Open Science at Pacific Symposium on Biocomputing (and BioSysBio)

Shirley has already posted a quick notice on this but I thought I would follow up. Our proposal for a session at the Pacific Symposium on Biocomputing on Open Science was successful and we have been asked to put together a workshop session to run on January 5 next year in Hawaii.

March 13, 2008 (updated December 30, 2009)

Giving credit, filtering, and blogs versus traditional research papers

Another post prompted by an exchange of comments on Neil Saunders’ blog. The discussion here started about the somewhat arbitrary nature of what does and does not get counted as ‘worthy contributions’ in the research community. Neil was commenting on an article in Nature Biotech that had similar subject matter to some blog posts, and he was reflecting on the fact that one would look convincing on a CV and the others wouldn’t. The conversation in the comments drifted somewhat into a discussion of peer review with Maxine (I am presuming Maxine Clarke from Nature?). You should read her comment and the post and other comments in full but I wanted to pick out one bit.