Reflections on Science 2.0 from a distance – Part II

This is the second of two posts discussing the talk I gave at the Science 2.0 Symposium organized by Greg Wilson in Toronto in July. As I described in the last post, Jon Udell pulled out the two key points from my talk and tweeted them. The first suggested some ideas about what the limiting unit of science, or rather science communication, might be. The second takes me into rather more controversial areas:

@cameronneylon uses tags to classify records in a bio lab wiki. When emergent ontology doesn’t match the standard, it’s useful info. #osci20

It may surprise many to know that I am a great believer in ontologies and controlled vocabularies. This is because I am a great believer in effectively communicating science, and without agreed language effective communication isn’t possible. Where I differ from many is in the assumption that because an ontology exists it provides the best means of recording my research. This is born out of my experiences trying to figure out how to apply existing data models and structured vocabularies to my own research work. Very often the fit isn’t very good and, more seriously, it is rarely clear why, or how to go about adapting or choosing the right ontology or vocabulary.

What I was talking about in Toronto was the use of key-value pairs within the Chemtools LaBLog system and the way we use them in templates. To re-cap briefly the templates were initially developed so that users can avoid having to manually mark up posts, particularly ones with tables, for common procedures. The fact that we were using a one item-one post system meant that we knew that important inputs into that table would have their own post and that the entry in the table could link to that post. This in turn meant that we could provide the user of a template with a drop down menu populated with post titles. We filter those on the basis of tags, in the form of key-value pairs, so as to provide the right set of possible items to the user. This creates a remarkably flexible, user-driven, system that has a strong positive reinforcement cycle. To make the templates work well, and to make your life easier you need to have the metadata properly recorded for research objects, but in turn you can create templates for your objects that make sure that the metadata is recorded correctly.
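As a rough sketch of the filtering step described above (this is not the actual LaBLog code, and the `Post` class, the `dropdown_options` function, and the sample post titles are all invented for illustration), a template's drop-down could be populated like this:

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    """One item, one post: each research object gets its own post."""
    title: str
    # Tags are (key, value) pairs, e.g. ("DNA", "plasmid").
    tags: set = field(default_factory=set)


def dropdown_options(posts, key, value):
    """Return the titles of posts tagged key:value, in their original order."""
    return [post.title for post in posts if (key, value) in post.tags]


# Hypothetical notebook contents.
posts = [
    Post("Plasmid prep, batch 3", {("DNA", "plasmid")}),
    Post("Forward primer FP-01", {("DNA", "oligonucleotide")}),
    Post("Tris buffer pH 8", {("solution", "buffer")}),
]

# A PCR template would only offer oligonucleotides in its primer menu.
primer_menu = dropdown_options(posts, "DNA", "oligonucleotide")
print(primer_menu)
```

The point of the sketch is the feedback loop: the menu is only as good as the tags, so anyone who wants a short, correct menu has a reason to tag their posts properly.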

The effectiveness of the templates clearly depends very strongly on the organization of the metadata. The more the pattern of organization maps on to the realities of how substances and data files are used in the local research process, and the more the templates reflect the details of that process, the more effective they are. We went through a number of cycles of template and metadata re-organization. We would re-organize, thinking we had things settled, and then we would come across another instance of a template breaking, or not working effectively. The motivation to re-organize was to make the templates work well, and save effort. The system aided us in this by allowing us to make organizational changes without breaking any of the previous schemes.

Through repeated cycles of modification and adaptation we identified an organizational scheme that worked effectively. Essentially this is a scheme that categorizes objects based on what they can be used for. A sample may be in the material form of a solution, but it may also be some form of DNA. Some procedures can usefully be applied to any solution, some are only usefully applied to DNA. If it is a form of DNA then we can ask whether it is a specific form, such as an oligonucleotide, that can be used in specific types of procedure, such as a PCR. So we ended up with a classification of DNA types based on what they might be used for (any DNA can be a PCR template, only a relatively short single stranded DNA can be used as a – conventional – PCR primer). However in my work I also had to allow for the fact that something that was DNA might also be protein; I have done work on protein-DNA conjugates and I might want to run these on both a protein gel and a DNA gel.

We had, in fact, built our own, small scale laboratory ontology that maps onto what we actually do in our laboratory. There was little or no design that went into this, only thinking of how to make our templates work. What was interesting was the process of then mapping our terms and metadata onto designed vocabularies. The example I used in the talk was the Sequence Ontology terms relating to categories of DNA. We could map the SO term plasmid on to our key value pair DNA:plasmid, meaning a double stranded circular DNA capable in principle of transforming bacteria. SO:ss_oligo maps onto DNA:oligonucleotide (kind of, I’ve just noticed that synthetic oligo is another term in SO).

But we ran into problems with our type DNA:double_stranded_linear. In SO there is more than one term, including restriction fragments and PCR products. This distinction was not useful to us. In fact it would create a problem. For our purposes restriction fragments and PCR products were equivalent in terms of what we could do with them. The distinction the SO makes is in where they come from, not what they can do. Our schema is driven by what we can do with them. Where they came from and how they were generated is also implicit in our schema but it is separated from what an object can be used for.
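The mismatch can be made concrete with a small lookup table. This is only a sketch: the table and function names are invented, and the SO labels are written as plain strings based on the terms mentioned above (`restriction_fragment` and `PCR_product` in particular are illustrative spellings, not checked identifiers).

```python
# Local key-value pair -> candidate Sequence Ontology labels (illustrative).
LOCAL_TO_SO = {
    ("DNA", "plasmid"): ["plasmid"],
    ("DNA", "oligonucleotide"): ["ss_oligo"],
    # One local type, more than one SO term: SO separates these by origin.
    ("DNA", "double_stranded_linear"): ["restriction_fragment", "PCR_product"],
}


def so_candidates(pair):
    """Return the SO labels a local key-value pair could map to."""
    return LOCAL_TO_SO.get(pair, [])


def local_for_so(term):
    """Return the local pair for an SO label, or None if we have no match."""
    for pair, terms in LOCAL_TO_SO.items():
        if term in terms:
            return pair
    return None


def needs_provenance(pair):
    """True when the tag alone cannot pick a single SO label."""
    return len(so_candidates(pair)) > 1


print(so_candidates(("DNA", "double_stranded_linear")))
print(local_for_so("PCR_product"))
print(needs_provenance(("DNA", "plasmid")))
```

Going from SO to our scheme is unambiguous, but going the other way needs the extra information about where the object came from, which our schema records separately from what the object can be used for.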

There is another distinction here. The drop down menus in our templates do not have an “or” logic in the current implementation. This drives us to classify the possible use of objects in as general a way as possible. We might wish to distinguish between “flat ended” linear double stranded DNA (most PCR products) and “sticky ended” or overhanging linear ds DNA (many restriction fragments) but we are currently obliged to have at least one key-value pair that places these together, as many standard procedures can be applied to both. In ontology construction there is a desire to describe as much detail as possible. Our framework drives us towards being as general as possible. Both approaches have their uses and neither is correct. They are built for different purposes.
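To illustrate why a filter without “or” pushes towards general tags, here is a minimal sketch. The sample names, the `DNA_ends` key and the `menu` function are hypothetical; only `DNA:double_stranded_linear` comes from our actual scheme.

```python
# Each sample carries a general pair plus, optionally, a more specific one.
samples = {
    "PCR product, gene X": {
        ("DNA", "double_stranded_linear"),
        ("DNA_ends", "blunt"),
    },
    "EcoRI fragment of plasmid Y": {
        ("DNA", "double_stranded_linear"),
        ("DNA_ends", "sticky"),
    },
    "Primer Z": {("DNA", "oligonucleotide")},
}


def menu(samples, pair):
    """Single-pair filter: there is no way to ask for 'blunt OR sticky'."""
    return sorted(title for title, tags in samples.items() if pair in tags)


# A procedure that works on any linear ds DNA filters on the general pair.
any_linear = menu(samples, ("DNA", "double_stranded_linear"))
# A procedure that needs blunt ends can still filter on the specific pair.
blunt_only = menu(samples, ("DNA_ends", "blunt"))
print(any_linear)
print(blunt_only)
```

Without the shared general pair, a template for a procedure that applies to both kinds of fragment would need two separate menus, or an “or” that the current implementation does not offer.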

The bottom line is that for a structured vocabulary to be useful and used it has to map well onto two things. First, it must map onto the processes that the user is operating and the inputs and outputs of those processes; that is, it must match the mental model of the user. Secondly, it must map well onto the tools that the user has to work with. Most existing biological ontologies do not map well onto our LaBLog system, although we can usually map to them relatively easily for specific purposes in a post-hoc fashion. However I think our system is mapped quite well by some upper ontologies.

I’m currently very intrigued by an idea that I heard from Allyson Lister, which matches well onto some other work I’ve recently heard about that involves “just in time” and “per-use” data integration. It also maps onto the argument I made in my recent paper that we need to separate the issues of capturing research from those involved in describing and communicating research. The idea was that for any given document or piece of work, rather than trying to fit it into a detailed existing ontology you build a single-use local ontology based on what is happening in this specific case based on a more general ontology, perhaps OBO, perhaps something even more general. Then this local description can be mapped onto more widely used and more detailed ontologies for specific purposes.

At the end of the day the key is effective communication. We don’t all speak the same language and we’re not going to. But if we had the tools to help us capture our research in an appropriate local dialect in a way that makes it easy for us, and others, to translate into whatever lingua franca is best for a given purpose, then we will make progress.

The trouble with business models (Facebook buys Friendfeed)

…is that someone needs to make money out of them. It was inevitable at some point that Friendfeed would take a route that led it towards mass adoption and away from the needs of the (rather small) community of researchers that have found a niche that works well for them. I had thought it more likely that Friendfeed would gradually move away from the aspects that researchers found attractive rather than being absorbed wholesale by a bigger player, but then I don’t know much about how Silicon Valley really works. It appears that Friendfeed will continue in its current form as the two companies work out how they might integrate the functionality into Facebook, but in the long term it seems unlikely that the current service will survive. In a sense the sudden break may be a good thing because it forces some of the issues about providing this kind of research infrastructure out into the open in a way a gradual shift probably wouldn’t.

What is it about Friendfeed that makes it particularly attractive to researchers? My answer is based more on hunches than hard data, but in comparing it with services like Twitter and Facebook a few things stand out.

  1. Conversations are about objects. At the core of the way Friendfeed works are digital objects, images, blog posts, quotes, thoughts, being pushed into a shared space. Most other services focus on the people and the connections between them. Friendfeed (at least the way I use it) is about the objects and the conversations around them.
  2. Conversation is threaded and aggregated. This is where Twitter loses out. It is almost impossible to track a specific conversation via Twitter unless you do so in real time. The threaded nature of FF makes it possible to track conversations days or months after they happen (as long as you can actually get into them).
  3. Excellent “person discovery” mechanisms. The core functionality of Friendfeed means that you discover people who “like” and comment on things that either you, or your friends like and comment on. Friendfeed remains one of the most successful services I know of at exploiting this “friend of a friend” effect in a useful way.
  4. The community. There is a specific community, with a strong information technology, information management, and bioinformatics/structural biology emphasis, that grew up and aggregated on Friendfeed. That community has immense value and it would be sad to lose it in any transition.

So what can be done? One option is to sit back and wait to be absorbed into Facebook. This seems unlikely to be either feasible or popular. Many people in the FF research community don’t want this for reasons ranging from concerns about privacy, through the fundamentals of how Facebook works, to just not wanting to mix work and leisure contacts. All reasonable and all things I agree with.

We could build our own. Technically feasible but probably not financially. Let’s assume a core group of say 1000 people (probably overoptimistic) each prepared to pay maybe $25 a year subscription as well as do some maintenance or coding work. That’s still only $25k, not enough to pay a single person to keep a service running let alone actually build something from scratch. Might the FF team make some of the codebase Open Source? Obviously not what they’re taking to Facebook but maybe an earlier version? That would help, but I suspect there would still need to be either a higher subscription or many more subscribers to keep it running. Chalk one up for the importance of open source services though.

Reaggregating around other services and distributing the functionality would be feasible perhaps. A combination of Google Reader, Twitter, with services like Tumblr, Posterous, and StoryTlr perhaps? The community would be likely to diffuse but such a distributed approach could be more stable and less susceptible to exactly this kind of buy out. Nonetheless these are all commercial services that can easily disappear. Google Wave has been suggested as a solution but I think it has fundamental differences in design that make it at best a partial replacement. And it would still require a lot of work.

There is a huge opportunity for existing players in the Research web space to make a play here. NPG, Research Gate, and Seed, as well as other publishers or research funders and infrastructure providers (you know who you are) could fill this gap if they had the resource to build something. Friendfeed is far from perfect: the barrier to entry is quite high for most people, and the different effective usage patterns are unclear for new users. Building something that really works for researchers is a big opportunity but it would still need a business model.

What is clear is that there is a significant community of researchers now looking for somewhere to go. People with a real critical eye for the best services and functionality and people who may even be prepared to pay something towards it. And who will actively contribute to help guide design decisions and make it work. Build it right and we may just come.

Call for submissions for a project on The Use and Relevance of Web 2.0 Tools for Researchers

The Research Information Network has put out a call for expressions of interest in running a research project on how Web 2.0 tools are changing scientific practice. The project will be funded up to £90,000. Expressions of interest are due on Monday 3 November (yes, next week) and the projects are due to start in January. You can see the call in full here but in outline RIN is seeking evidence on whether web 2.0 tools are:

• making data easier to share, verify and re-use, or otherwise facilitating more open scientific practices;

• changing discovery techniques or enhancing the accessibility of research information;

• changing researchers’ publication and dissemination behaviour (for example, due to the ease of publishing work-in-progress and grey literature);

• changing practices around communicating research findings (for example through opportunities for iterative processes of feedback, pre-publishing, or post-publication peer review).

Now we as a community know that there are cases where all of these are occurring and have fairly extensively documented examples. The question is obviously one of the degree of penetration. Again we know this is small – I’m not exactly sure how you would quantify it.

My challenge to you is whether it would be possible to use the tools and community we already have in place to carry out the project? In the past we’ve talked a lot about aggregating project teams and distributed work but the problem has always been that people don’t have the time to spare. We would need to get some help from social scientists on process and design of the investigation but with £90,000 there is easily enough money to pay people properly for their time. Indeed I know there are some people out there freelancing already who are in many ways already working on these issues anyway. So my question is: Are people interested in pursuing this? And if so, what do you think your hourly rate is?