science2.0 – Science in the Open

November 6, 2009December 30, 2009

Reflections on Science 2.0 from a distance – Part II

This is the second of two posts discussing the talk I gave at the Science 2.0 Symposium organized by Greg Wilson in Toronto in July. As I described in the last post, Jon Udell pulled out the two key points from my talk and tweeted them. The first suggested some ideas about what the limiting unit of science, or rather science communication, might be. The second takes me into rather more controversial areas:

@cameronneylon uses tags to classify records in a bio lab wiki. When emergent ontology doesn’t match the standard, it’s useful info. #osci20

It may surprise many to know that I am a great believer in ontologies and controlled vocabularies. This is because I am a great believer in effectively communicating science, and without agreed language effective communication isn’t possible. Where I differ with many is in the assumption that because an ontology exists it provides the best means of recording my research. This is born out of my experiences trying to figure out how to apply existing data models and structured vocabularies to my own research work. Very often the fit isn’t very good and, more seriously, it is rarely clear why, or how to go about adapting or choosing the right ontology or vocabulary.

What I was talking about in Toronto was the use of key-value pairs within the Chemtools LaBLog system and the way we use them in templates. To recap briefly, the templates were initially developed so that users can avoid having to manually mark up posts, particularly ones with tables, for common procedures. The fact that we were using a one item-one post system meant that we knew that important inputs into that table would have their own post and that the entry in the table could link to that post. This in turn meant that we could provide the user of a template with a drop down menu populated with post titles. We filter those on the basis of tags, in the form of key-value pairs, so as to provide the right set of possible items to the user. This creates a remarkably flexible, user-driven system that has a strong positive reinforcement cycle. To make the templates work well, and to make your life easier, you need to have the metadata properly recorded for research objects, but in turn you can create templates for your objects that make sure that the metadata is recorded correctly.
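To make the filtering step concrete, here is a minimal sketch in Python of how a template's drop down menu could be populated from tagged posts. It is not the LaBLog implementation: the `Post` class, the `menu_options` function, and the example post titles are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    """One item, one post: each research object gets its own post (hypothetical model)."""
    title: str
    metadata: dict[str, str] = field(default_factory=dict)  # key-value tags


def menu_options(posts, key, value):
    """Return the titles of posts whose metadata contains the given key-value pair."""
    return [p.title for p in posts if p.metadata.get(key) == value]


# Illustrative posts only; the titles and tags are made up for this sketch.
posts = [
    Post("Plasmid pXY-01", {"type": "sample", "DNA": "plasmid"}),
    Post("Primer fwd-17", {"type": "sample", "DNA": "oligonucleotide"}),
    Post("Buffer stock A", {"type": "sample", "material": "solution"}),
]

# A PCR template would offer only oligonucleotides in its primer menu.
primer_menu = menu_options(posts, "DNA", "oligonucleotide")
```

The point of the sketch is the feedback loop: the menu is only as good as the recorded metadata, so a post with a missing or wrong tag simply never shows up where it is needed.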

The effectiveness of the templates clearly depends very strongly on the organization of the metadata. The more the pattern of organization maps on to the realities of how substances and data files are used in the local research process, and the more the templates reflect the details of that process, the more effective they are. We went through a number of cycles of template and metadata re-organization. We would re-organize, thinking we had things settled, and then we would come across another instance of a template breaking, or not working effectively. The motivation to re-organize was to make the templates work well, and save effort. The system aided us in this by allowing us to make organizational changes without breaking any of the previous schemes.

Through repeated cycles of modification and adaptation we identified an organizational scheme that worked effectively. Essentially this is a scheme that categorizes objects based on what they can be used for. A sample may be in the material form of a solution, but it may also be some form of DNA. Some procedures can usefully be applied to any solution, some are only usefully applied to DNA. If it is a form of DNA then we can ask whether it is a specific form, such as an oligonucleotide, that can be used in specific types of procedure, such as a PCR. So we ended up with a classification of DNA types based on what they might be used for (any DNA can be a PCR template, but only a relatively short single stranded DNA can be used as a conventional PCR primer). However, in my work I also had to allow for the fact that something that was DNA might also be protein; I have done work on protein-DNA conjugates and I might want to run these on both a protein gel and a DNA gel.

We had, in fact, built our own small-scale laboratory ontology that maps onto what we actually do in our laboratory. There was little or no design that went into this, only thinking of how to make our templates work. What was interesting was the process of then mapping our terms and metadata onto designed vocabularies. The example I used in the talk was the Sequence Ontology terms relating to categories of DNA. We could map the SO term plasmid on to our key-value pair DNA:plasmid, meaning a double stranded circular DNA capable in principle of transforming bacteria. SO:ss_oligo maps onto DNA:oligonucleotide (kind of; I’ve just noticed that synthetic oligo is another term in SO).

But we ran into problems with our type DNA:double_stranded_linear. In SO there is more than one term, including restriction fragments and PCR products. This distinction was not useful to us. In fact it would create a problem. For our purposes restriction fragments and PCR products were equivalent in terms of what we could do with them. The distinction the SO makes is in where they come from, not what they can do. Our schema is driven by what we can do with them. Where they came from and how they were generated is also implicit in our schema but it is separated from what an object can be used for.
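The mapping problem described above can be sketched as a simple lookup table. This is an illustration only: the local key-value pairs come from the post, but the strings standing in for Sequence Ontology terms are placeholders based on the wording here and have not been checked against the actual SO identifiers. The function names are hypothetical.

```python
# Local key-value pair -> candidate Sequence Ontology terms.
# The SO term strings are placeholders, not verified SO names or accessions.
LOCAL_TO_SO = {
    ("DNA", "plasmid"): ["plasmid"],
    ("DNA", "oligonucleotide"): ["ss_oligo"],
    # One local type, several SO terms: SO distinguishes by origin,
    # while the local scheme only cares about what the object can be used for.
    ("DNA", "double_stranded_linear"): ["restriction_fragment", "PCR_product"],
}


def so_candidates(key, value):
    """Return the SO terms a local key-value pair could map onto (empty if unknown)."""
    return LOCAL_TO_SO.get((key, value), [])


def is_ambiguous(key, value):
    """True when the local type alone cannot pick a single SO term."""
    return len(so_candidates(key, value)) > 1
```

In this picture, resolving an ambiguous mapping needs the provenance of the object (how it was made), which our schema records separately from what the object can be used for.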

There is another distinction here. The drop down menus in our templates do not have an “or” logic in the current implementation. This drives us to classify the possible use of objects in as general a way as possible. We might wish to distinguish between “flat ended” linear double stranded DNA (most PCR products) and “sticky ended” or overhanging linear ds DNA (many restriction fragments), but we are currently obliged to have at least one key-value pair that places these together, as many standard procedures can be applied to both. In ontology construction there is a desire to describe as much detail as possible. Our framework drives us towards being as general as possible. Both approaches have their uses and neither is correct. They are built for different purposes.
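A small sketch of why the lack of “or” logic pushes towards general tags. It assumes a filter that can match exactly one key-value pair; the `ends` key, its values, and the object names are hypothetical and chosen only to illustrate the point.

```python
def select(objects, key, value):
    """Single-pair filter with no 'or': return names whose metadata has key == value."""
    return sorted(name for name, md in objects.items() if md.get(key) == value)


# Both objects share one general pair and differ in a more specific one.
# The "ends" key is a hypothetical addition for this sketch.
objects = {
    "PCR product 12": {"DNA": "double_stranded_linear", "ends": "flat"},
    "EcoRI fragment 3": {"DNA": "double_stranded_linear", "ends": "sticky"},
}

# A procedure that works on any linear ds DNA filters on the general pair...
general = select(objects, "DNA", "double_stranded_linear")

# ...while a procedure that needs flat ends filters on the specific pair.
flat_only = select(objects, "ends", "flat")
```

Without the shared general pair, a template for a procedure that applies to both kinds of fragment would need two menus or an “or” filter, which the current implementation does not offer.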

The bottom line is that for a structured vocabulary to be useful and used it has to map well onto two things. First, it must map onto the processes that the user is operating and the inputs and outputs of those processes; that is, it must match the mental model of the user. Secondly, it must map well onto the tools that the user has to work with. Most existing biological ontologies do not map well onto our LaBLog system, although we can usually map to them relatively easily for specific purposes in a post-hoc fashion. However, I think our system is mapped quite well by some upper ontologies.

I’m currently very intrigued by an idea that I heard from Allyson Lister, which matches well onto some other work I’ve recently heard about that involves “just in time” and “per-use” data integration. It also maps onto the argument I made in my recent paper that we need to separate the issues of capturing research from those involved in describing and communicating research. The idea was that for any given document or piece of work, rather than trying to fit it into a detailed existing ontology, you build a single-use local ontology that describes what is happening in this specific case and is derived from a more general ontology, perhaps OBO, perhaps something even more general. Then this local description can be mapped onto more widely used and more detailed ontologies for specific purposes.

At the end of the day the key is effective communication. We don’t all speak the same language and we’re not going to. But if we had the tools to help us capture our research in an appropriate local dialect in a way that makes it easy for us, and others, to translate into whatever lingua franca is best for a given purpose, then we will make progress.

August 11, 2009December 30, 2009

The trouble with business models (Facebook buys Friendfeed)

…is that someone needs to make money out of them. It was inevitable at some point that Friendfeed would take a route that led it towards mass adoption and away from the needs of the (rather small) community of researchers that have found a niche that works well for them. I had thought it more likely that Friendfeed would gradually move away from the aspects that researchers found attractive rather than being absorbed wholesale by a bigger player, but then I don’t know much about how Silicon Valley really works. It appears that Friendfeed will continue in its current form as the two companies work out how they might integrate the functionality into Facebook, but in the long term it seems unlikely that the current service will survive. In a sense the sudden break may be a good thing because it forces some of the issues about providing this kind of research infrastructure out into the open in a way a gradual shift probably wouldn’t.

What is it about Friendfeed that makes it particularly attractive to researchers? My answer is based more on hunches than hard data, but in comparing it with services like Twitter and Facebook there are a few things that stand out.

Conversations are about objects. At the core of the way Friendfeed works are digital objects, images, blog posts, quotes, thoughts, being pushed into a shared space. Most other services focus on the people and the connections between them. Friendfeed (at least the way I use it) is about the objects and the conversations around them.
Conversation is threaded and aggregated. This is where Twitter loses out. It is almost impossible to track a specific conversation via Twitter unless you do so in real time. The threaded nature of FF makes it possible to track conversations days or months after they happen (as long as you can actually get into them).
Excellent “person discovery” mechanisms. The core functionality of Friendfeed means that you discover people who “like” and comment on things that either you, or your friends like and comment on. Friendfeed remains one of the most successful services I know of at exploiting this “friend of a friend” effect in a useful way.
The community. There is a specific community, with a strong information technology, information management, and bioinformatics/structural biology emphasis, that grew up and aggregated on Friendfeed. That community has immense value and it would be sad to lose it in any transition.

So what can be done? One option is to sit back and wait to be absorbed into Facebook. This seems unlikely to be either feasible or popular. Many people in the FF research community don’t want this for reasons ranging from concerns about privacy, through the fundamentals of how Facebook works, to just not wanting to mix work and leisure contacts. All reasonable and all things I agree with.

We could build our own. Technically feasible but probably not financially. Let’s assume a core group of say 1000 people (probably overoptimistic), each prepared to pay maybe $25 a year subscription as well as do some maintenance or coding work. That’s still only $25k, not enough to pay a single person to keep a service running, let alone actually build something from scratch. Might the FF team make some of the codebase Open Source? Obviously not what they’re taking to Facebook, but maybe an earlier version? That would help, but I suspect there would still need to be either a higher subscription or many more subscribers to keep it running. Chalk one up for the importance of open source services though.

Reaggregating around other services and distributing the functionality would be feasible perhaps. A combination of Google Reader, Twitter, with services like Tumblr, Posterous, and StoryTlr perhaps? The community would be likely to diffuse but such a distributed approach could be more stable and less susceptible to exactly this kind of buyout. Nonetheless these are all commercial services that can easily disappear. Google Wave has been suggested as a solution but I think it has fundamental differences in design that make it at best a partial replacement. And it would still require a lot of work.

There is a huge opportunity for existing players in the Research web space to make a play here. NPG, Research Gate, and Seed, as well as other publishers or research funders and infrastructure providers (you know who you are) could fill this gap if they had the resource to build something. Friendfeed is far from perfect: the barrier to entry is quite high for most people, and the different effective usage patterns are unclear for new users. Building something that really works for researchers is a big opportunity but it would still need a business model.

What is clear is that there is a significant community of researchers now looking for somewhere to go. People with a real critical eye for the best services and functionality and people who may even be prepared to pay something towards it. And who will actively contribute to help guide design decisions and make it work. Build it right and we may just come.

July 17, 2009December 30, 2009

Sci – Bar – Foo etc. Part II – SciFoo – Engaging with the world

Last Friday afternoon (was it really only a week ago?) about 200 people made their way to the Googleplex in Mountain View for the fourth SciFoo. There are many people who got their blog posts out well before me so I will focus on the sessions which don’t seem to have been heavily discussed and try to draw a few themes out.

For me, the overriding theme that came through was Engagement. Engaging people beyond the narrow confines of the professional research community in real research projects, making science more engaging for students, and engaging in a serious way with both the tools that are available to help us do these things, and increasingly with data generation and dissemination processes that are not under our control.

I was involved in running two sessions. The first, with Peter Murray-Rust, was on Open Data. It focussed on getting feedback on the current form of the Panton Principles and has been blogged in detail by Peter. For me the main message from this was a lack of push-back. Many of the more technical people in the room were bemused that there was a problem. “Just put it on the web” was a common response. Others were concerned about where data stops and creative works begin, but the main message for me was that “for published data just put it explicitly in the public domain” was seen as the right thing to do by the people in the room. Indeed most were surprised it was even worth discussing.

The second session I ran was on Google Wave in research and this will get a whole post of its own very soon so I won’t discuss it in detail here. Suffice to say that there was excitement, great ideas about what could be done, and concerns about the details of technical implementation. Which to me seems like an excellent mix to make progress with. Engagement for these two sessions was engagement with the data and engagement with the technology for generating, annotating, and sharing that data.

The other sessions I would like to draw a common theme through were more focussed on public engagement and education. The first session I attended on Saturday morning, run by Daniel Glaser, was called Doing Science in Non-Science Spaces. This was an interesting discussion on many levels but particularly for me because it challenged my ideas about multi-disciplinary working and deploying research projects into an educational setting. Daniel described disciplinary boundaries as fractal and described multidisciplinary projects as requiring a safe common space where people can come together to share ideas, but also a requirement for people to then disperse again and re-interpret the outputs in the context of their own experience and discipline. In this view disciplinary boundaries are important in enabling effective summarisation and communication of outputs. I’ve been kicking myself ever since for not thinking to ask whether that means these boundaries are any less arbitrary than those of us who are interdisciplinary always feel.

Another challenge to my thinking from this session was the need to give up control over the shared collaboration space. In thinking about putting research projects into educational settings I’ve always looked at the process as trying to find a question within the research that can be understood and answered by students. The argument here was that to truly engage students it would be necessary to let them find and answer their own questions. I’m not sure how in practice to think about that in terms of drug discovery, or how it maps onto the success of projects like Galaxy Zoo, but it bears some thinking about.

Also focussed on interactions beyond the professional research community was Ariel Waldman’s session “Open collaboration between scientists, communities, and the unknown” which followed on from a session of the same title at SciBarCamp which I somehow missed. Here the focus was on problems with sharing research with the wider world, with similar problems to those of sharing between researchers identified, and potential solutions. Some great projects were discussed and showcased with contributions on a new collaboration site for research into Parkinson’s, getting the public to search for surface exposed fossils in high resolution ground images (Louise Leakey, Turkana Basin Institute), and the experience of being the public conduit for a spacecraft from Veronica “Mars Phoenix” McGregor. Once again a major theme was “just get the data out there” so that people can do something with it if they want to. If it isn’t available no-one is going to do anything.

The final session was led by Joan Peckham on Computational Thinking, the idea that the principles behind good computing design should be taught as a core skill on a par with reading and writing, and that these techniques are widely applicable beyond computing per se. For more on the background to this you can check out Jon Udell interviewing Joan on his Interviews with Innovators podcast. The point for me was to try and understand how I can most effectively learn these principles and techniques, as it is clear to me that I need a better understanding of good software and system design for the work I would like to do. What was interesting to me was whether my needs mapped onto what would be required for teaching children and whether willing and interested guinea pigs such as myself might be useful in helping to develop educational programmes. Here engagement means effective use of technology and design of systems that will make our work and collaborations efficient.

SciFoo is always challenging, requiring that you re-think and re-examine many of the assumptions that your everyday work is built on. Many smart people with very different perspectives and experiences make a great environment to stress test your ideas, sometimes to destruction. The challenge can be actually applying those insights in the real world with limited resources and time. But it provides some goals to work towards and much food for thought.

October 28, 2008December 30, 2009

Call for submissions for a project on The Use and Relevance of Web 2.0 Tools for Researchers

The Research Information Network has put out a call for expressions of interest in running a research project on how Web 2.0 tools are changing scientific practice. The project will be funded up to £90,000. Expressions of interest are due on Monday 3 November (yes, next week) and the projects are due to start in January. You can see the call in full here, but in outline RIN is seeking evidence on whether Web 2.0 tools are:

• making data easier to share, verify and re-use, or otherwise facilitating more open scientific practices;
• changing discovery techniques or enhancing the accessibility of research information;
• changing researchers’ publication and dissemination behaviour (for example, due to the ease of publishing work-in-progress and grey literature);
• changing practices around communicating research findings (for example through opportunities for iterative processes of feedback, pre-publishing, or post-publication peer review).

Now we as a community know that there are cases where all of these are occurring and have fairly extensively documented examples. The question is obviously one of the degree of penetration. Again we know this is small, although I’m not exactly sure how you would quantify it.

My challenge to you is whether it would be possible to use the tools and community we already have in place to carry out the project? In the past we’ve talked a lot about aggregating project teams and distributed work but the problem has always been that people don’t have the time to spare. We would need to get some help from social scientists on process and design of the investigation but with £90,000 there is easily enough money to pay people properly for their time. Indeed I know there are some people out there freelancing already who are in many ways already working on these issues anyway. So my question is: Are people interested in pursuing this? And if so, what do you think your hourly rate is?