Four ways to make a business out of research data

The recent announcement from Elsevier that it has implemented a data-sharing infrastructure across its materials science journals underlines the fact that publication of research data is a hot topic right now. To my mind, the details of this particular initiative indicate sensible incremental enhancements rather than anything particularly innovative or ground-breaking – however, as a positioning statement, it sends a powerful message.

Publisher reactions to funder mandates on open data so far have been less than enthusiastic, ranging from cynicism – as manifested at our New York dinner (‘like anyone’s going to do that’) – through bafflement (‘why would we want to do that?’) to anxious acceptance (‘how are we going to manage that?’). A generally negative stance that, personally, I find a little hard to understand.

Data is a real opportunity for publishers.

In this post I’d like to explain why – and sketch out four routes to making money out of it.

The data opportunity

In the first place, the basic business opportunity in data is not hard to spot. Funding revenue is being given to scientists to help them make their data publicly accessible. In the same way that publishers now charge Article Processing Charges (APCs) for authors to publish articles in (gold) Open Access journals, money that usually comes from funders, they could charge to handle the data too. So why haven’t publishers (so far) shown any interest in doing that?

It’s not as if this is a hotly contested space. As far as I can see there is only one publisher in it so far: Figshare, the startup incubated by Digital Science, which is owned by Holtzbrinck (and, interestingly, was not part of the Springer/Macmillan deal).

Their vector is not to go to the author/researcher directly, but to publishers. Doing a deal with Figshare gives publishers a way of providing data publishing for their authors, but given that the company is owned by a rival publisher, it would seem strange if they were to let Figshare have the market entirely to itself – ceding the battle ‘without a shot fired’.

It is true that funders aren’t particularly keen for publishers to pick up this role (possibly because they think they have too much power already) but no-one on earth can stop them doing it.

Meanwhile the current provision in this area is hardly fit for purpose: it is small-scale, fragmented and poorly funded. There are probably in excess of 1,000 data repositories across the globe, many of which will inevitably disappear because they don’t have sustainable long-term funding. At a time when funders are progressively hardening up their requirements about data, and we are beginning to worry more and more as an industry about digital preservation, this state of affairs surely cannot last.

The novelty/refinement axis

But to get a sense of the wider, more strategic opportunity in data, take a look at the graph below (it always helps me to put things into a graph of some kind).

What I’ve done here is to list some of the different types of content that get published in the scholarly world, and to arrange them along two axes. The horizontal axis shows the varying degrees of refinement or summarisation that these various content types have undergone, starting at the left with research data, the raw material, if you like, of research. The inferences and conclusions that are drawn from this data go into the next-door type, the research article. This content type is more refined in the sense that the raw data has been turned into knowledge. But that is just the beginning of the refinement process.

Review articles summarise the research in a given topic area, while monographs go further in summarising or ‘boiling down’ what is known in that topic area, and offer a particular view. Once domain knowledge has been reviewed and contested to the point where it is the subject of broad agreement – for its relevance, at least, if not its deeper meanings – then it becomes part of the subject matter for reference works: professional reference, in the first instance, covering defined areas, then more generalist reference works, such as Britannica or Wikipedia – encyclopaedias about everything.

At each step, information becomes gradually more and more distilled, more and more stable. And as it does so, it loses novelty: encyclopaedic content, for instance, at a late stage of refinement, contains no surprises at all for the specialist in a given field and – to the extent that this is humanly possible – supports no particular angle on a subject, but records the consensus view. For this reason, encyclopaedic knowledge is rarely cited. It’s too vanilla.

Heading back up the vertical axis, however, we find more and more uncertainty, novelty and news-worthiness as we progress. And at the top of the tree, novelty-wise, sits research data. The data set is not very refined, but it’s full of novelty; full of the potential for new insights to be made, new discoveries – including the very real process of new discoveries made from existing data.

Ergo it ought to follow that the publisher has an ability to add value here, as publishers do at each of the other steps in this progress towards ever greater refinement of information, by facilitating the process of refinement.

Four ways that publishers can make money out of data

Perhaps my argument itself is beginning to feel a little over-refined at this point; a bit abstracted from publishing reality. So here are the four ways I promised to lay out that publishers can make money out of data.

1. Share it

Publishers rightly want to draw a strong distinction between making data available online and publishing it. Eliding these two activities – something that often happens in the debate over making research data publicly available – does a disservice to the value publishers add through the activities that sit under the verb ‘to publish’: curation, editing, peer review, and so forth. That having been said, there is no reason at all why publishers can’t provide an archival or hosting service to authors which, while less than publishing, still fulfils the terms of funder mandates. Route one is to help authors share their data in a way that is accessible, secure and safe for the long term.

2. Publish it

This route is where publishers perform the sort of activities with the shared data that they normally would around articles – most importantly, summarising and peer review. Elsevier’s Data in Brief does something like this (although incongruously, to my way of thinking, the actual data sits somewhere else).

3. Visualise it

A further level of value-add is attained when publishers start to provide tools that allow researchers to interact with the data – for example, through graphical APIs (Figshare does this).

I am aware that we might need a better label for this category than data visualisation – it is about much more than giving us infographics to look at. Interactive graphs are a fairly widely used example of this type of interaction, and I have blogged in the past about developments in image processing, where the ‘research object’ is actually a piece of software, or an algorithm. In this category, novelty and reusability – the potential for fresh insights and discoveries – are at their highest, and publishers can play a role in facilitating researchers that goes well beyond the traditional one of disseminating their text narratives.

4. Mine it

Admittedly, while examples of routes one to three can be seen in the wild, route four is a little further in the future, as it has to do with the progress of semantic technologies. However, if you are looking forward to where your business is going to be in five years’ time, this really ought to be on your radar. This one is a strategic play.

To explain it, I’ll redirect your attention to the graph further up the page. If you look at where novelty is lowest, and refinement and stability highest – encyclopaedic content – that is where we are beginning to see the application of semantic technologies. The obvious example here is Google’s Knowledge Graph (that box of facts that appears on the right-hand side of the SERP), which draws a lot of its data from Wikipedia. This is low-hanging fruit: here, the coverage is very wide and not very deep.

As we move up the novelty axis where information is less refined and less stable, however, and more open to contention, it gets harder and harder with current tools to semantically mine data. However, this is definitely the future, and the technology is moving fast. And from a semantic mining point of view, there is more value up here, potentially.

Meanwhile Google is building a massive asset base of facts, which will increasingly be the IP engine it leverages to ensure its continued dominance. Similarly, Facebook is strikingly clear in its mission to build its own IP assets (somewhat cynically labelled “the Open Graph”). Data is the name of the game, and you have to be in it to win it.
