More on the science exchange – or building and capitalising a data commons

Following on from the discussion a few weeks back, kicked off by Shirley at One Big Lab and continued here, I’ve been thinking about how to actually turn what was a throwaway comment into reality:

What is being generated here is new science, and science isn’t paid for per se. The resources that generate science are supported by governments, charities, and industry but the actual production of science is not supported. The truly radical approach to this would be to turn the system on its head. Don’t fund the universities to do science, fund the journals to buy science; then the system would reward increased efficiency.

There is a problem at the core of this. For someone to pay for access to the results, there has to be a monetary benefit to them. This may come through increased efficiency of their research funding, but that’s a rather vague benefit. For a serious charitable or commercial funder there has to be the potential either to make money or at least to see that the enterprise could become self-sufficient. But surely this means monetizing the data somehow? That would require restrictive licences, which is not, in the end, what we’re about.

The other story of the week has been the (in the end very useful) kerfuffle caused by ChemSpider moving to a CC-BY-SA licence, and the confusion it has revealed regarding data, licensing, and the public domain. John Wilbanks, whose comments on the ChemSpider licence sparked the discussion, has written two posts [1, 2] which I found illuminating and which have made things much clearer for me. His point is that data naturally belongs in the public domain, and that the public domain and the freedom of the data itself need to be protected from the erosion, both legal and conceptual, that could be caused by our obsession with licences. What does this mean for making an effective data commons, and the Science Exchange that could arise from it, financially viable?

I start with a few premises:

Data belongs in the public domain
An effective and useful data commons requires well structured data
Preparing high quality data costs money, and the tools do not really exist to support this in general
Academic career progression depends at core on one thing: how much money you bring in.

Getting the data

So I started with the notion of paying researchers to make data available, originally phrased as ‘pay the journals to buy papers’. What I really meant was paying people to put research results somewhere useful. So let us imagine we can pay people to deposit data (we’ll figure out how later). We don’t want to be swamped with rubbish so the data has to be well structured, tagged up and machine readable. If we’re paying for it, we set the standards. We also want to encourage re-use of data, perhaps by paying a premium for the deposition of data that re-uses other data. And in turn, perhaps pay a premium to those whose data is re-used. The key point is that the presence of capital and cash flow will drive patterns of behaviour. By being a market maker you create the market, and to some extent control the behaviour of the market, whether that be in commodities like coffee, or scientific data.
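To make the incentive structure concrete, here is a minimal sketch of how such payments might be tallied. Everything in it is an assumption for illustration: the `Deposit` record, the `payouts` function, and the payment amounts (in arbitrary units) are hypothetical, and the post does not specify any actual rates.

```python
from dataclasses import dataclass, field


@dataclass
class Deposit:
    depositor: str
    structured: bool  # does the data meet the standards set by the payer?
    reuses: list[str] = field(default_factory=list)  # depositors whose data was re-used


def payouts(deposits, base=100, reuse_bonus=20, reused_bonus=10):
    """Tally payments per depositor. All amounts are hypothetical units."""
    totals = {}
    for d in deposits:
        if not d.structured:
            continue  # unstructured data earns nothing: the payer sets the standards
        totals[d.depositor] = totals.get(d.depositor, 0) + base
        if d.reuses:
            # premium for depositing data that builds on existing data
            totals[d.depositor] += reuse_bonus
        for source in d.reuses:
            # premium for those whose data is re-used
            totals[source] = totals.get(source, 0) + reused_bonus
    return totals


example = [
    Deposit("alice", structured=True),
    Deposit("bob", structured=True, reuses=["alice"]),
    Deposit("carol", structured=False),
]
print(payouts(example))
```

In this toy version the rejected deposit earns nothing, the re-user earns the base payment plus a bonus, and the original depositor earns a bonus each time the data is re-used, which is the behaviour the market maker is trying to encourage.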

The data has to be dedicated to the public domain. The value of having a central organisation, let’s call it the Foundation, directing this is that this dedication can be centrally managed and effectively handled. In fact, such an organisation would probably outsource the work to an expert organisation such as Science Commons or the Open Knowledge Foundation. The system could allow embargoes or delayed release, but this would be a timed release, not a ‘release on publication’ clause. The release cannot be indefinitely delayed. The fundamental principle must remain that researchers are being paid to support committing their data to the public domain.
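The difference between a timed release and a ‘release on publication’ clause can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the one-year cap (`MAX_EMBARGO_DAYS`) are hypothetical, and the post does not propose a specific embargo length.

```python
from datetime import date, timedelta

MAX_EMBARGO_DAYS = 365  # hypothetical cap; no specific limit is proposed in the text


def release_date(deposited_on: date, embargo_days: int) -> date:
    """The release date is fixed at deposit time and never depends on publication."""
    if not 0 <= embargo_days <= MAX_EMBARGO_DAYS:
        raise ValueError("embargo must be between 0 and MAX_EMBARGO_DAYS days")
    return deposited_on + timedelta(days=embargo_days)


def is_public(deposited_on: date, embargo_days: int, today: date) -> bool:
    """True once the fixed release date has been reached."""
    return today >= release_date(deposited_on, embargo_days)
```

Because the release date is computed from the deposit date alone, there is no event (such as a paper that never appears) that can postpone the release indefinitely.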

Providing the systems

How do we get the big boys interested in this? What will drive Google, Microsoft, Amazon, or IBM to put real money into supporting such a foundation, activity, or company? I think the answer lies in the commercial opportunities created by having insider knowledge of the system that is indexing, collating, and crawling the data. The data is in the public domain, and probably not in a repository as we understand it, but in the cloud, or federated amongst multiple repositories. But insider knowledge of the form, structure, location, and tagging of the data is a commercially privileged position.

We are only accepting well structured data, and this means providing systems that make structuring research data easy. I’m not talking here about a few webforms, or a slightly improved LIMS. I’m talking about fully functional laboratory notebook systems that talk to instruments, that already have the API to most equipment built in, that will provide helpful hints on how to describe your data using the appropriate controlled vocabularies, and above all something that looks and behaves like a combination of Word and the Google home page. Something that is so easy to use that people actually do use it; that they use because it is useful to them and they like using it. Something that is going to cost on the order of $100M to build.

But once it is built, you’ve got the best research recording tool around. Something you can sell to commercial research organisations to run in their own systems. Running these systems in house will enable them to control their competitive research but will also make sharing of pre-competitive research more effective for them. ELN implementations in big pharma are budgeted at around the $10M mark. There is a market to recoup investments at the $100M level, particularly if this can be used as the starting point for the next generation of generic enterprise data management tools.

Making it pay

The Foundation at the centre of this has privileged access to the data through knowing and controlling the system that runs it. It will also have the most timely access. In addition, by providing the tools (free at point of use for the basic service) that enable data deposition, it is providing a service to the community. That service may come with some conditions: first right of refusal to commercialise opportunities that the Foundation identifies; the right to sell access to the products of processing the data; the right to process embargoed data to identify commercial opportunities.

Research funders may also be interested in using the Foundation to drive the availability of data. It is possible to imagine schemes whereby part of a research grant is funnelled through the organisation to pay for the deposition of data. Perhaps a 20% premium over what would be available through a conventional grant could be paid to the researcher on deposition of the data, with perhaps 5% being paid to the Foundation. Funders are putting hundreds of millions into data centres that no-one is too sure what to do with. Maybe that money could be used more effectively to drive the quality of data deposition. Some funders may also see this as a good model for direct funding: putting money in to drive the generation of specific data sets, and channelling funding through the Foundation to pay groups to deposit the results rather than paying them to do the research. For small foundations or charitable concerns this may be a much more effective means of driving the outcomes they want.
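As a rough worked example of the premium scheme, the sketch below splits a grant using the 20% and 5% figures floated above. It rests on an assumption the text leaves open: that both percentages are taken of the conventional grant amount and are paid on top of it. The names `DepositionSplit` and `split_grant` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DepositionSplit:
    base_grant: float          # what a conventional grant would have paid
    researcher_premium: float  # paid to the researcher on deposition of the data
    foundation_fee: float      # paid to the Foundation

    @property
    def funder_total(self) -> float:
        return self.base_grant + self.researcher_premium + self.foundation_fee


def split_grant(base_grant: float, premium_pct: int = 20, foundation_pct: int = 5) -> DepositionSplit:
    """Assumes both percentages apply to the conventional grant amount."""
    if base_grant < 0:
        raise ValueError("base_grant must be non-negative")
    return DepositionSplit(
        base_grant=base_grant,
        researcher_premium=base_grant * premium_pct / 100,
        foundation_fee=base_grant * foundation_pct / 100,
    )


split = split_grant(100_000)
print(split.researcher_premium, split.foundation_fee, split.funder_total)
```

Under that reading, a funder who would otherwise award 100,000 commits 125,000 in total, of which 20,000 reaches the researcher only once the data is deposited. If the 5% were instead carved out of the premium, the totals would differ, so the interpretation matters for any real scheme.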

Then there is the service model. This commons is not just for academic research. Commercial bodies will be interested as well, and they will want to exploit the insider knowledge the Foundation has of these systems. Think of premium services to search not just for the right data or the right connection, but for the right person for an internal position, or the right group to collaborate with on a specific problem. The Foundation will be able to provide premium services on top of the data and it will be able to charge for them. Scientific journals could buy access for the lodgement of supplementary data. Why hold it on their own servers, costing money and not necessarily being much use, when it could be provided in a more useful form for re-use and further exploitation? Or perhaps turn it on its head: could the Foundation pay journals for the right to host supplementary data (including all the raw data associated with a paper)? Could this be another route to funding Open Access journals?

Targeted advertising should not be rejected either. It works rather well for Google as a money raising venture so there is no reason why it shouldn’t here. Care is obviously required not to allow advertising to drive the process, but as a complementary revenue stream it could be useful. Overall there are funding streams here that could be exploited to generate serious quantities of cash flow, running towards hundreds of millions of dollars per year. The markets involved are much, much bigger, so incremental increases in efficiency are worth these sums.

Does it scale?

This can work on a small scale with small amounts of money. The amount of data currently available in a useful form is vanishingly small, which restricts the potential for a flood of submissions. Structuring data is hard work and costs money, so if the rewards are small and the tools are not readily available a flood is unlikely. As it grows there is a problem, though. This is essentially a problem of liquidity: a large body of capital is required to support both the flow of information in and the exploitation of that information. It is unlikely that the Foundation will be able to fund the full costs of the research that generates data. It will be able to pay a market driven rate to support the care, structuring, and processing of that data. Essentially the Foundation is buying futures in the forward value of the data. The complication is that it is also the market maker, and driver for those futures. In some markets this is called insider trading. Here I don’t think anyone is necessarily losing out.

On a large scale, as the dominant index (or one of a set of dominant indexes), I believe the model is sustainable as long as there is sufficient liquidity. The problem lies in moving from the small scale to the large scale while undertaking the tools development project. Defining the financial and commercial viability of that trajectory is beyond my expertise, but I’d be interested to hear what people think. The Foundation can control its capital reserves by not accepting data, but this is likely to cause the equivalent of a run on a bank. The details of the business model need to be looked at by people with proper business heads. But it seems to me that this is plausible, if properly planned and executed.

This will take large sums of money. And it may well not make money. But it will enable more effective research. And that is worth a lot of money, even if indirectly.