
Science in the Open

The online home of Cameron Neylon


Tag: metadata

Posted on October 19, 2015

PolEcon of OA Publishing II: What’s the technical problem with reforming scholarly publishing?

[Image: Reproduction of a Charles Mills painting (Photo credit: Wikipedia)]

In the first post in this series I identified a series of challenges in scholarly publishing while stepping through some of the processes that publishers undertake in the management of articles. A particular theme was the challenge of managing a heterogeneous stream of articles, with their associated heterogeneous formats and problems, particularly at large scale. An immediate reaction many people have is that there must be technical solutions to many of these problems. In this post I will briefly outline some of the characteristics of possible solutions and why they are difficult to implement. Specifically we will focus on how the pipeline of the scholarly publishing process that has evolved over time makes large-scale implementation of new systems difficult.

The shape of solutions

Many of the problems raised in the previous post, as well as many broader technical challenges in scholarly communications, can be characterised as issues of either standardisation (too many different formats) or many-to-many relationships (many different kinds of source image software feeding a range of different publisher systems, but equally many institutions trying to pay APCs to many publishers). Broadly speaking, the solutions to this class of problem involve adopting standards.

Without focussing too much on technical details, because there are legitimate differences of opinion on how best to implement this, solutions would likely involve re-thinking the form of the documents flowing through the system. In an ideal world authors would be working in a tool that interfaces directly and cleanly with publisher systems. At the point of submission they would be able to see what the final published article would look like, not because the process of conversion to its final format had been automated but because that conversion would be unnecessary. The document would be stored throughout the whole process in a consistent form from which it could always be rendered to web or PDF or whatever is required on the fly. Metadata would be maintained throughout in a single standard form, ideally one shared across publisher systems.

The details of these solutions don’t really matter for our purposes here. What matters is that they are built on a coherent standard representation of documents, objects and bibliographic metadata that flow from the point of submission (or ideally even before) through to the point of publication and syndication (or ideally long after that).
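To make that a little more concrete, here is a minimal sketch in Python of what such a canonical representation might look like: a single article record that carries its bibliographic metadata from submission onwards, and from which any output format is rendered on demand. The class and field names are purely illustrative and are not any publisher's actual schema.

# A minimal sketch (hypothetical, not any publisher's actual schema) of the idea:
# one canonical article record that carries its metadata through the whole process,
# and is rendered to whatever output format is needed on the fly.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Author:
    name: str
    orcid: str = ""            # a persistent identifier travels with the record


@dataclass
class Article:
    title: str
    authors: List[Author]
    doi: str = ""              # assigned later, but lives in the same record
    abstract: str = ""
    body_paragraphs: List[str] = field(default_factory=list)

    def render_html(self) -> str:
        """Render the canonical record to HTML; a PDF or JATS renderer would
        read exactly the same record, so nothing ever needs converting."""
        authors = ", ".join(a.name for a in self.authors)
        paras = "\n".join(f"<p>{p}</p>" for p in self.body_paragraphs)
        return f"<article><h1>{self.title}</h1><p>{authors}</p>{paras}</article>"


if __name__ == "__main__":
    article = Article(
        title="An example manuscript",
        authors=[Author(name="A. Researcher", orcid="0000-0000-0000-0000")],
        body_paragraphs=["First paragraph.", "Second paragraph."],
    )
    print(article.render_html())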

The challenges of the solutions

Currently the scholarly publishing pipeline is really two pipelines. The first is the pipeline in which format conversions of the document are carried out, and the second is a pipeline in which data about the article is collected and managed. There is usually redundancy, and often inconsistency, between these two pipelines, and they are frequently managed by Heath Robinson-esque processes which have been developed to patch together systems from different suppliers. A lot of the inefficiencies in the publication process are the result of this jury-rigged assembly, but at the same time its structure is one of the main things preventing change and innovation.

If the architectural solution to these problems is one of adopting a coherent set of standards throughout the submission and publication process, then the technical issue is one of systems re-design to adopt those standards. From the outside this looks like a substantial job, but not an impossible one. From the inside of a publisher it is not merely Sisyphean, but more like pushing an ever-growing boulder up a slope which someone is using as a (tilted) bowling alley.

There is a critical truth which few outside the publishing industry have grasped. Most publishers do not control their own submission and publication platforms. Some small publishers successfully build their entire systems; PeerJ and PenSoft are two good examples. But these systems are built for (relatively) small scale and struggle as they grow beyond a thousand papers a year. Some medium-sized publishers can successfully maintain their own web server platforms (PLOS is one example of this), but very few large publishers retain technical control over their whole pipelines.

Most medium to large publishers outsource both their submission systems (to organisations like Aries and ScholarOne) and their web server platforms (to companies like Atypon and HighWire). There were good tactical reasons for this in the past, advantages of scale and expertise, but strategically it is a disaster. It leaves essentially the entire customer-facing part of a publishing business, whether the customer is the author or the reader, in the hands of third parties. And the scale gained by that centralisation, as well as the lack of flexibility that scale tends to create, makes the kind of re-tooling envisaged above next to impossible.

Indeed the whole business structure is tipped against change. These third-party players hold quasi-monopoly positions. Publishers have nowhere else to go. The service providers, with publishers as their core customers, need to cover the cost of development and do so by charging their customers. This is a perfectly reasonable thing to do, but with the choice of providers limited (and a shift from one to the other costly and painful), and with no plausible DIY option for most publishers, the reality is that it is in the interests of the providers to hold off implementing new functionality until they can charge the maximum for it. Publishers are essentially held to ransom for relatively small changes; radical re-organisation of the underlying platform is simply impossible.

Means of escape

So what are the means of escaping this bind, particularly given that it is a natural consequence of the market structure and scale? No-one designed the publishing industry to work this way. The biggest publishers have the scale to go it alone. Elsevier took this route when they purchased the source code to Editorial Manager, the submission platform that is run by Aries as a service for many other publishers. Alongside their web platforms (and probably the best back-end data system in the business) this gives Elsevier end-to-end control over their technical systems. It didn't stop Elsevier spending a rumoured $25M on a new submission system that has never really surfaced, illustrating just how hard these systems are to build.

But Elsevier has a scale and technical expertise (not to mention cash) that most other publishers can only envy. Running such systems in house is a massive undertaking, and doing further large-scale development on them is bigger again. Most publishers cannot do this alone.

Where there is a shared need for a scalable platform, Community Open Source projects can provide a solution. Publishing has not traditionally embraced Open Source or shared tools, preferring to re-invent the wheel internally. The Public Knowledge Project's Open Journal Systems is the most successful play in this space, running more journals (albeit small ones) than any other software platform. There are technical criticisms to be made of OJS and it is in need of an overhaul, but a big problem is that, while it is a highly successful community, it has never become a true Open Source Community Project. The Open Source code is taken and used in many places, but it is locally modified and a culture of contributing code back to the core has never developed.

The same could be said of many other Open Source projects in the publishing space. Frequently the code is available and re-usable, but there is little or no effort put into creating a viable community of contribution. It is Open Source Code, not an Open Source Project. Publishers are often so focussed on maintaining their edge against competition that the hard work of creating a viable community around a shared resource gets lost or forgotten.

It is also possible that the scale of publishing is insufficient to support true Open Source Community projects. The big successful Open Source infrastructure projects are critical across the web, toolsets that are so widely used that it's a no-brainer for big corporations more used to patent and copyright fights to invest in them. It could be that there are simply too few developers and too little resource in publishing to make pooling them viable. My view is that the challenge is more political than one of resourcing, more a result of the fact that there are maybe one or two CTOs or CEOs in the entire publishing industry with deep experience of the Open Source world, but there are differences from those other web-scale Open Source projects and it is important to bear that in mind.

Barriers to change

When I was at PLOS I got about one complaint a month about "why can't PLOS just do X", where X was a different image format, a different process for review, a step outside the standard pipeline. The truth of the matter is that it can almost always be done, if a person can be found to hand-hold that specific manuscript through the entire process. You can do it once, and then someone else wants it, and then you are in a downward spiral of monkey patching to try and keep up. It simply doesn't scale.

A small publisher, one that has complete control over their pipeline and is operating at hundreds or low thousands of articles a year, can do this. That's why you will continue to see most of the technical innovation within the articles themselves come from small players. The very biggest players can play both ends: they have control over their systems, the technical expertise to make something happen, and the resources to throw money at solving a problem manually to boot. But then those solutions don't spread, because no-one else can follow them. Elsevier is alone in this category, with Springer-Nature possibly following if they can successfully integrate the best of both companies.

But the majority of content passes through the hands of medium-sized players, and players with no real interest in technical developments. These publishers are blocked. Without a significant structural change in the industry it is unlikely we will see significant change. To my mind that structural change can only happen if a platform is developed that provides scale by operating across multiple publishers. Again, to my mind, that can only be provided by an Open Source platform, one with a real community programme behind it. No medium-sized publisher wants to shift to a new proprietary platform which will just lock them in again. But publishers across the board have collectively demonstrated a lack of willingness to engage, as well as a lack of understanding of what an Open Source Project really is.

Change might come from without, from a new player providing a fresh look at how to manage the publishing pipeline. It might come from projects within research institutions or a collaboration of scholarly societies. It could even come from within: publishers can collaborate when they realise it is in their collective interests. But until we see new platforms that provide flexible and standards-based mechanisms for managing documents and their associated metadata throughout their life cycle, that operate successfully at a scale of tens to hundreds of thousands of articles a year, and that above all are built and governed by systems that publishers trust, we will see at most incremental change.

Posted on June 10, 2008 (updated December 30, 2009)

The trouble with institutional repositories

[Image: A tag cloud with terms related to Web 2.0]

I spent today at an interesting meeting at Talis headquarters where there was a wide range of talks. Most of the talks were liveblogged by Andy Powell and also by Owen Stephens (who has written a much more comprehensive summary of Andy's talk) and there will no doubt be some slides and video available on the web in future. The programme is also available. Here I want to focus on Andy Powell's talk (slides), partly because he obviously didn't liveblog it but primarily because it crystallised for me many aspects of the way we think about Institutional Repositories. For those not in the know, these are warehouses that are becoming steadily more popular, run generally by universities to house their research outputs, in most cases peer reviewed papers. Self archiving of some version of published papers is the so-called 'Green Route' to open access.

The problem with institutional repositories in their current form is that academics don’t use them. Even when they are being compelled there is massive resistance from academics. There are a variety of reasons for this: academics don’t like being told how to do things; they particularly don’t like being told what to do by their institution; the user interfaces are usually painful to navigate. Nonetheless they are a valuable part of the route towards making more research results available. I use plenty of things with ropey interfaces because I see future potential in them. Yet I don’t use either of the repositories in the places where I work – in fact they make my blood boil when I am forced to. Why?

So Andy was talking about the way repositories work and the reasons why people don't use them. He had already talked about the language problem. We always talk about 'putting things in the repository' rather than 'making them available on the web'. He had also mentioned that the institutional nature of repositories does not map well onto the social networks of the academic users, which probably bear little relationship to institutions and are much more closely aligned with discipline and possibly geographic boundaries (although they can easily be global).

But for me the key moment was when Andy asked 'How many of you have used SlideShare?' Half the people in the room put their hands up. Most of the speakers during the day pointed to copies of their slides on SlideShare. My response was to mutter under my breath 'And how many of them have put presentations in the institutional repository?' The answer to this: probably none. SlideShare is a much better 'repository' for slide presentations than IRs. There are more presentations there, people may find mine, and it is (probably) Google indexed. But more importantly I can put slides up with one click, it already knows who I am, and I don't need to put in reams of metadata, just a few tags. And on top of this it provides added functionality, including embedding in other web documents as well as all the social functions that are a natural part of a 'Web2.0' site.

SlideShare is a very good model of what a Repository can be. It has issues. It is a third party product, it may not have long term stability, it may not be as secure as some people would like. But it provides much more of the functionality that I want from a service for making my presentations available on the web. It does not serve the purpose of an archive, and maybe an institutional repository is better in that role. But for the author, the reason for making things available is so that people use them. If I make a video that relates to my research it will go on YouTube, Bioscreencast, or JoVE, not in the institutional repository; I put research-related photos on Flickr, not in the institutional repository; and critically, I leave my research papers on the websites of the journals that published them, and cannot be bothered with the work required to put them in the institutional repository.

Andy was arguing for global discipline-specific repositories. I would suggest that the lesson of the Web2.0 sites is that we should have data-type-specific repositories. Flickr is for pictures, SlideShare for presentations. In each case the specialisation enables a sort of implicit metadata and lets the site concentrate on providing functionality that adds value to that particular data type. Science repositories could win by doing the same. PDB, GenBank, SwissProt deal with specific types of data. Some might argue that GenBank is breaking under the strain of the different types and quantities of data generated by the new high throughput sequencing tools. Perhaps a new repository is required that is specially designed for this data.

So what is the role for the institutional repository? The preservation of data is one aspect: pulling down copies of everything to provide an extra backup and retain an institutional record. If not copying, then indexing and aggregating so as to provide a clear guide to the institution's outputs. This needn't be handled in house, of course, and can be outsourced. As Paul Miller suggested over lunch, the role of the institution need not be to keep a record of everything, but to make sure that such a record is kept. Curation may be another role, although that may be too big a job to be tackled at institutional level. When is a decision made that something isn't worth keeping anymore? What level of metadata or detail is worth preserving?

But the key thing is that all of this should be done automatically and must not require intervention by the author. Nothing drives me up the wall more than having to put the same set of data into two subtly different systems. And as far as I can see there is no need to do so. Aggregate my content automatically, wrap it up and put it in the repository, but I don't want to have to deal with it. Even in the case of peer reviewed papers it ought to be feasible to pull down the vast majority of the metadata required; indeed, even for toll access publishers, everything except the appropriate version of the paper. Send me a polite automated email and ask me to attach that and reply. Job done.
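As a rough sketch of that kind of automated metadata capture, the snippet below pulls most of the bibliographic record for a paper from nothing but its DOI, using the CrossRef REST API (one present-day way of doing it; the field names follow CrossRef's public works endpoint, error handling is minimal, and the deposit step itself is left out).

# A sketch of automated metadata capture: given only a DOI, fetch the
# bibliographic record from the CrossRef REST API so the author never
# has to re-type it for the repository.

import requests


def fetch_metadata(doi: str) -> dict:
    """Return a small bibliographic record for a DOI, suitable for deposit."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    resp.raise_for_status()
    work = resp.json()["message"]
    return {
        "title": (work.get("title") or [""])[0],
        "authors": [
            f"{a.get('given', '')} {a.get('family', '')}".strip()
            for a in work.get("author", [])
        ],
        "journal": (work.get("container-title") or [""])[0],
        "year": work.get("issued", {}).get("date-parts", [[None]])[0][0],
        "doi": doi,
    }


if __name__ == "__main__":
    # Replace with a real DOI; this placeholder is for illustration only.
    print(fetch_metadata("10.1234/example-doi"))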

For this to really work we need to take an extra step in the tools available. We need to move beyond files that are simply 'born digital', because these files are in many ways stillborn. This current blog post, written in Word on the train, is a good example. The laptop doesn't really know who I am, it probably doesn't know where I am, and it has no context for the particular Word document I'm working on. When I plug this into the WordPress interface at OpenWetWare all of this changes. The system knows who I am (and could do that through OpenID). It knows what I am doing (writing a blog post) and the Zemanta Firefox plug-in does much better than that, suggesting tags, links, pictures and keywords.

Plugins and online authoring tools really have the potential to automatically generate those last pieces of metadata that aren't already there. When the semantics comes baked in, then the semantic web will fly and the metadata that everyone knows they want, but can't be bothered putting in, will be available and re-usable, along with the content. When documents are not only born digital but born on and for the web, the repositories will probably still need to trawl and aggregate. But they won't have to worry me about it. And then I will be a happy depositor.
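A small sketch of what "baked in" could mean in practice: at the point of publishing, the authoring system already knows the author, date and tags, so it can emit standard Dublin Core meta tags for harvesters and repositories without the author typing anything twice. The helper below is illustrative only, not any particular plugin's API.

# A sketch of metadata baked in at publication time: emit Dublin Core
# meta tags from information the authoring tool already holds.

from datetime import date
from typing import List


def dublin_core_head(title: str, creator: str, tags: List[str],
                     published: date) -> str:
    """Return an HTML <head> fragment carrying Dublin Core metadata."""
    metas = [
        '<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/" />',
        f'<meta name="DC.title" content="{title}" />',
        f'<meta name="DC.creator" content="{creator}" />',
        f'<meta name="DC.date" content="{published.isoformat()}" />',
    ] + [f'<meta name="DC.subject" content="{tag}" />' for tag in tags]
    return "\n".join(metas)


if __name__ == "__main__":
    print(dublin_core_head(
        title="The trouble with institutional repositories",
        creator="Cameron Neylon",
        tags=["metadata", "repositories"],
        published=date(2008, 6, 10),
    ))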


Related articles
  • SWAP and ORE [via Zemanta]
  • Slideshare Ramping Up – Leading Online Presentations App? [via Zemanta]
  • Powerset vs. Cognition: A Semantic Search Shoot-out [via Zemanta]


Zemanta Pixie

Posted on April 8, 2008 (updated December 30, 2009)

Semantics in the real world? Part II – Probabilistic reasoning on contingent and dynamic vocabularies

[Image: Rendering of a human brain]

And other big words I learnt from mathematicians…

The observant amongst you will have realised that the title of my previous post pushing a boat out into the area of semantics and RDF implied there was more to come. Those of you who followed the reaction [comments in original post, 1, 2, 3] will also be aware that there are much smarter and more knowledgeable people out there thinking about these problems. Nonetheless, in the spirit of thinking aloud I want to explore these ideas a little further because they underpin the way I think about the LaBLog and its organization. As with the last post this comes with the health warning that I don’t really know what I’m talking about. Continue reading “Semantics in the real world? Part II – Probabilistic reasoning on contingent and dynamic vocabularies”

Posted on March 30, 2008 (updated December 30, 2009)

Data models for capturing and describing experiments – the discussion continues

Frank Gibson has continued the discussion that kicked off here and has continued here [1, 2, 3, 4] and in other places [1, 2] along the way. Frank’s exposition on using FuGE as a data model is very clear in what it says and does not say and some of his questions have revealed sloppiness in the way I originally described what I was trying to do. Here I will respond to his responses and try to clarify what it is that I want, and what I want it to achieve. I still feel that we are trying to describe and achieve different things, but that this discussion is a great way of getting to the bottom of this and achieving some clarity in our description and language. Continue reading “Data models for capturing and describing experiments – the discussion continues”

Posted on March 26, 2008 (updated December 30, 2009)

Responding to PM-R on the structured experiment

This started out as a comment on Peter Murray-Rust’s response to my post and grew to the point where it seemed to warrant its own post. We need a better medium (or perhaps a semantic markup framework for Blogs?) in which to capture discussions like this, but that’s a problem for another day…

Continue reading “Responding to PM-R on the structured experiment”

Posted on March 26, 2008 (updated December 30, 2009)

The structured experiment

More on the discussion of structured vs unstructured experiment descriptions. Frank has put up a description of the Minimal Information about a Neuroscience Investigation standard at Nature Precedings, which comes out of the CARMEN project. Neil Saunders has also made some comments on the resistance amongst the lab monkeys to thinking about structure. Lots of good points here. I wanted to pick out a couple in particular:

From Neil:

My take on the problem is that biologists spend a lot of time generating, analysing and presenting data, but they don’t spend much time thinking about the nature of their data. When people bring me data for analysis I ask questions such as: what kind of data is this? ASCII text? Binary images? Is it delimited? Can we use primary keys? Not surprisingly this is usually met with blank stares, followed by “well…I ran a gel…”.

Part of this is a language issue. Computer scientists and biologists actually mean something quite different when they refer to 'data'. For a comp sci person data implies structure. For a biologist data is something that requires structure to be made comprehensible. So don't ask 'what kind of data is this?'; ask 'what kind of file are you generating?'. Most people don't even know what a primary key is, including me, as demonstrated by my misuse of the term when talking about CAS numbers, which led to significant confusion.

I do believe that any experiment [CN – my emphasis] can be described in a structured fashion, if researchers can be convinced to think generically about their work, rather than about the specifics of their own experiments. All experiments share common features such as: (1) a date/time when they were performed; (2) an aim (“generate PCR product”, “run crystal screen for protein X”); (3) the use of protocols and instruments; (4) a result (correct size band on a gel, crystals in well plate A2). The only free-form part is the interpretation.

Here I disagree, but only at the level of detail. The results of any experiment can probably be structured after the event. But not all experiments can be clearly structured either in advance, or as they happen. Many can, and here Neil’s point is a good one, by making some slight changes in the way people think about their experiment much more structure can be captured. I have said before that the process of using our ‘unstructured’ lab book system has made me think and plan my experiments more carefully. Nonetheless I still frequently go off piste, things happen. What started as an SDS-PAGE gel turns into something else (say a quick column on the FPLC).

Without wishing to pick a fight, most people with a computer science background who lean towards the heavily semantic end of the spectrum are dealing with the wet lab scientists after the data has been taken and partially processed. I don't disagree that it would help the comp sci people if the experimenters worked harder at structuring the data as they generate it, and I do think in general this is a good thing. The problem is that it doesn't map well onto how the work is actually carried out. The solution I think is a mixture of the free-form approach combined with useful tools and widgets that do two things: firstly they make the process of capturing the process easier; secondly they encourage the collection and structuring of data as it comes off. This is what the templates in our system do, and there is no reason in principle why they couldn't be driven by agreed data models.
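By way of illustration only (this is not the LaBLog's or FuGE's actual model), here is a minimal sketch of the kind of generic experiment record Neil describes: the common structured fields, plus a free-form notes field for the parts that go off piste.

# A minimal, hypothetical experiment record: structured common fields plus
# free text for what actually happened. Illustrative only.

from dataclasses import dataclass, field
from datetime import datetime
from typing import List


@dataclass
class Experiment:
    aim: str
    performed: datetime
    protocols: List[str] = field(default_factory=list)    # names/links of protocols used
    instruments: List[str] = field(default_factory=list)
    results: List[str] = field(default_factory=list)       # e.g. "band at expected size"
    interpretation: str = ""                                # often structured after the event
    notes: str = ""                                         # free-form record of what really happened


example = Experiment(
    aim="Generate PCR product for cloning",
    performed=datetime(2008, 3, 20, 14, 30),
    protocols=["standard PCR protocol v2"],
    instruments=["thermocycler"],
    results=["single band at ~500 bp"],
    notes="Reaction 3 looked odd; ran an extra gel lane to check.",
)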

Actually the Frey group (who have done the development of the LaBLog system) already have a highly semantic lab book system, developed during the MyTea project. One of our future aims is to take the best of both forward into a 'semi-semantic' or 'freely semantic' system. One of the main problems with implementing the MyTea notebook is that it requires data models. It was developed for synthetic chemistry, but it would make sense, in expanding it into the biochemistry/molecular biology area, to utilise existing data models, with FuGE the obvious main source.

One more point: we need to teach students that every activity leading to a result is an experiment. From my time as a Ph.D. student in the wet lab, I remember feeling as though my day-to-day activities: PCR reactions, purifications, cloning weren’t really experiments […] Experiments were clever, one-shot procedures performed by brilliant postdocs to answer big questions […] Break your activities into steps and ways to describe them as structured data should suggest themselves.

This is very true, and harks back to my comment about language. A lot of the issues here are actually because we mean very different things by ‘experiment’. We probably should use better words, although I think procedure and protocol are similarly loaded with conflicting meanings. Control of language is important and agreement on meaning is, after all, at the root of semantics (or is that semiotics, I’m never sure…)

Posted on March 25, 2008 (updated December 30, 2009)

The heavyweights roll in…distinguishing recording the experiment from reporting it

Frank Gibson of peanutbutter has left a long comment on my post about data models for lab notebooks which I wanted to respond to in detail. We have also had some email exchanges. This is essentially an incarnation of the heavyweight vs lightweight debate when it comes to tools and systems for description of experiments. I think this is a very important issue and that it is also subject to some misunderstandings about what we and others are trying to do. In particular I think we need to draw a distinction between recording what we are doing in the lab and reporting what we have done after the fact. Continue reading “The heavyweights roll in…distinguishing recording the experiment from reporting it”

Posted on March 23, 2008 (updated December 30, 2009)

Proposing a data model for Open Notebooks

‘No data model survives contact with reality’ – Me, Cosener’s House Workshop 29 February 2008

This flippant comment was in response to (I think) Paolo Missier asking me ‘what the data model is’ for our experiments. We were talking about how we might automate various parts of the blog system but the point I was making was that we can’t have a data model with any degree of specificity because we very quickly find the situation where they don’t fit. However, having spent some time thinking about machine readability and the possibility of converting a set of LaBLog posts to RDF, as well as the issues raised by the problems we have with tables, I think we do need some sort of data model. These are my initial thoughts on what that might look like. Continue reading “Proposing a data model for Open Notebooks”

Posted on November 5, 2007 (updated December 30, 2009)

Discussion with OpenWetWare people

This morning I got to sit down with Bill Flanagan, Barry Canton, Austin Che, and Jason Kelly and throw some ideas around about electronic notebooks. This is an approximate summary of some of the points that came out of this. It may be a bit of a brain dump, so I might re-edit later.

  1. Neither Wikis nor Blogs provide all the functionality required. Wikis are good at providing a framework within which to organise information, whereas blogs are good at logging information and providing it in a journal format. Barry showed me a hack that he uses in his Wiki-based notebook that essentially provides a means of organising his lab book into experiments and projects but also provides a date-style view. In the Southampton system we would achieve this through creating categories for different experiments, possibly independent blogs for different projects.
  2. Feature requests at Southampton have been driven largely by me, which means that the system is being driven by the needs of the PI. At OpenWetWare the development has been driven by grad students, which means it has focussed on their issues. The question was raised of where the best place to 'promote' these systems is. Is it to the PIs who, at least at the moment, will get the greatest tangible benefits from the system? Or is it better to persuade grad students to take this up, as they are the end users? Both have very different needs.
  3. Development based on the needs of a single person is unlikely to take us forward as the needs of a specific person are probably not general enough to be useful. Development should focus on enabling the interactions between people, therefore the minimum size ‘user unit’ is two (PI plus researcher, or group of researchers).
  4. The biggest wins for these systems are where collaboration is required and is enabled by a shared space to work in. This is shown by the IGEM lab books and by uptake by my collaborators in the UK. This will be the best place to take development forward.

I need to add links to this post but will do so later.

Posted on October 29, 2007 (updated December 30, 2009)

The Soton Lab Blog Book US Tour

Given that most people reading this probably also read the UsefulChem Blog, I would guess that they have already figured out I am visiting the States. However, as I am now here and due to jet lag have a few hours to kill before breakfast, I thought I might detail the itinerary for anyone interested. Continue reading "The Soton Lab Blog Book US Tour"


License

CC0 To the extent possible under law, Cameron Neylon has waived all copyright and related or neighboring rights to Science in the Open. Published from the United Kingdom.
