Taxonomizing the future is not only a thankless task but also, at root, an impossible one – which is perhaps why we have so many treacherously imprecise terms to describe what we do in digital publishing. These terms can muddy the waters and result in messy debates, because it’s not always clear exactly what somebody means when they use them. One such term is data mining.
In the cause of clarity, I’d like to tease out four themes that represent some of the different activities we often discuss under this rubric. And taking an Occam’s Razor approach, driven by the commercial imperatives publishers face, I hope to outline what use each can be to those who are trying to make digital pay, or at the very least what sort of threat each might represent to an existing business.
Four data mining themes
At the risk of creating further taxonomic confusion, I’m going to assign temporary and provisional labels to each of the areas I want to talk about. No need to panic – I will attempt to explain what each of these means as we go along.
- Content enrichment
- The RDF/triples view of the world
- The robots’ view of the world
1. Content enrichment
I’m talking here about the idea that you can automatically analyse content and annotate it with useful extra information.
Take an example from ancient biblical studies: ‘useful extra information’ in this case could be as simple as identifying all the place names in a series of publications in the field. Being able to identify and tag place names and distinguish them from the names of people (not always easy to do) would let you, for instance, place historical statements on a map and work out times and places, which gives you more value than the sort of searching you might otherwise be able to do.
This is an example chosen more or less at random, but the principle holds true for many different fields; rather than identifying people and places we might be tagging chemicals, say, or genetic structures. It’s all about identifying entities in text. This also gets called entity extraction, or entity recognition, but whatever we call it, this species of activity points very strongly towards discoverability.
It can strip away the ‘noise’ that dogs so much search activity, where searching for London, the city, will also bring up references to Jack London, the American author. Greater selectivity and contextual relevance in search means greater discoverability for relevant content.
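By way of illustration, here is a minimal sketch of that kind of tagging in Python. It assumes two tiny hand-made name lists (the `PEOPLE` and `PLACES` lists and the `tag_entities` function are invented for this example); a production system would more likely rely on large curated vocabularies or a trained entity-recognition model.

```python
import re

# Hypothetical name lists, for illustration only.
PEOPLE = ["Jack London"]
PLACES = ["London", "Jerusalem"]

def tag_entities(text):
    """Return sorted (start, end, label, surface) tuples for known names.

    Longer matches win, so "Jack London" is tagged as a person and the
    "London" inside it is not also tagged as a place.
    """
    candidates = []
    for label, names in (("PERSON", PEOPLE), ("PLACE", PLACES)):
        for name in names:
            for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
                candidates.append((m.start(), m.end(), label, m.group()))

    # Consider the longest candidates first, then keep only those that
    # do not overlap something already chosen.
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))
    chosen = []
    for cand in candidates:
        if all(cand[1] <= c[0] or cand[0] >= c[1] for c in chosen):
            chosen.append(cand)
    return sorted(chosen)

def mentions_place(text, place):
    """True if the text mentions `place` as a place, not as part of a person's name."""
    return any(label == "PLACE" and surface == place
               for _, _, label, surface in tag_entities(text))
```

With these lists, `mentions_place("Jack London wrote novels.", "London")` is false, while the same call on a sentence about the city itself is true, which is exactly the noise reduction described above.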
There is a clear business case around making information discoverable. It translates directly into ‘eyeballs’ or ‘bums on seats’ – or whatever other technical marketing term you want to use; all of which mean, basically, that it has an impact on the bottom line.
Of course, scale is necessary to make this business case work. If you have only hundreds of documents about a given subject it is far less compelling than if you have millions.
There is another, less volume-dependent aspect to the value to be gained from content enrichment, however, and this is about creating a better user experience. On a technical level, now that we’ve identified and tagged all these interesting entities in our text, we can use that to link in a contextually relevant way to other resources, which might not even be resources under our control – linking antibodies, for instance, to an external database or crystallographic structures to an experimental data archive.
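Once entities carry a type and an identifier, the linking step itself can be very simple. The sketch below assumes a hypothetical table of URL templates; the `example.org` addresses, the type names and the `link_for` function are placeholders rather than real services.

```python
from urllib.parse import quote

# Hypothetical URL templates; example.org stands in for real external archives.
LINK_TEMPLATES = {
    "ANTIBODY": "https://antibodies.example.org/record/{id}",
    "STRUCTURE": "https://structures.example.org/entry/{id}",
}

def link_for(entity_type, identifier):
    """Build an outbound link for a tagged entity, or None if no archive is known."""
    template = LINK_TEMPLATES.get(entity_type)
    if template is None:
        return None
    # Percent-encode the identifier so it is safe to place in a URL path.
    return template.format(id=quote(identifier, safe=""))
```

Keeping the templates in one table means a new external resource can be added without touching the tagged content itself.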
The business case for enhancing the user experience in this way is all about deepening the relationship with that proportion of your visitors who are going to be serious customers for you, and providing them with value-added services.
This leads onto a second important theme, which involves looking at the activities described under our first theme from a slightly different angle. I spoke there about external resources that can be linked to from a publisher site. Many of these, as I said, will be out of the publisher’s control; archives where scientists put their results. But there is also the opportunity for publishers who specialize in particular subject areas to build their own archives – including databases which might need to be curated – that can be transformed into resources for third parties through provision of an API. And then to monetize that API.
I have written elsewhere on this blog (see researchers are doing it for themselves) about the provisional, contingent nature of many of these researcher-led archival projects, and how publishers, with their seasoned expertise in such matters, have the opportunity to lead on quality in the business of digital archiving. I don’t want to repeat those arguments at length here: suffice it to say that data mining activities in this area also have a potential business case, which lies in the opportunity to build quality services for publisher customers.
3. The RDF/triples view of the world
Now we come to an area of data mining activities where a business case is perhaps harder to find.
There is another view of content enrichment that goes well beyond the entity extraction and tagging mentioned above and pushes it to an extreme. This is content enrichment turned up to eleven. Now that I’ve got all my Londons and Sodium Chlorides and
The result is a bit like a leaf skeleton, retaining all the ‘facty’ bits of the text and leaving out all the extraneous words and linguistic forms that authors use just to make their ideas intelligible to other humans. The idea is that facts and arguments can be machine-extracted from the flow of text and be made to stand alone, in a way that makes them readable by machines (rather than humans) – and it is a slightly contentious one.
This is what computer scientists mean when they talk about The Semantic Web – a “triples” or RDF based view of the world championed by the W3C. It’s a project that is a scientific work-in-progress, so nobody is quite clear, presently, on all of the technological and organisational pieces needed to eventually enable universal knowledge fields to be connected, visualised, accessed and shared, or what the business benefits might currently be.
Which is not to say that it might not eventually yield results – Watson’s win at Jeopardy points the way to the many practical applications such a semantic engine could have in decision-making – just that it is an interesting area for research whose commercial deliverability is uncertain.
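For readers who have not met the “triples” idea before, a small sketch may help. It uses plain Python tuples rather than an RDF library, and the facts and the `ex:` prefix are made up for illustration, not taken from any published vocabulary.

```python
# Each fact is a (subject, predicate, object) triple.
# The facts and the "ex:" prefix are illustrative assumptions.
TRIPLES = [
    ("ex:London", "rdf:type", "ex:City"),
    ("ex:London", "ex:locatedIn", "ex:England"),
    ("ex:JackLondon", "rdf:type", "ex:Person"),
    ("ex:SodiumChloride", "rdf:type", "ex:ChemicalCompound"),
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    ]

# "Which things are cities?" becomes a pattern with the subject left open.
cities = [s for (s, _, _) in match(TRIPLES, predicate="rdf:type", obj="ex:City")]
```

The point of the format is that every statement has the same three-part shape, so a machine can query facts from many sources without parsing any prose.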
4. The robots’ view of the world
If there is a purpose behind my talking about the fourth theme I want to identify as an important area of data mining, it is to get publishers to take it out of the ‘threat’ quadrant of their SWOT charts.
Recent legislation in the UK has made it clear that researchers are allowed to data mine publishers’ works.
Publishers need to accommodate that because there is clearly a demand for it (though it is probably a very small demand, in reality), and it is now clear, in UK law at least, that the access rights customers have already purchased with their subscription to a publisher’s content cover data mining as well.
The clarity brought to the situation by this legislation reduces to a large extent the opportunity that was previously thought to exist in this area, but also reduces the imagined threat.
The typical scenario here, in reality, is likely to be that of a linguist studying the usage of words across a very large corpus of text – and there shouldn’t be anything to prevent linguists doing that. In doing so they are unlikely to undermine your published product. The threat of some mad scientist waving his magic semantic wand over your content and walking away with the crown jewels was, and still is, a pipe dream.
Neither, by the same token, do these users and their use of data mining on publisher content represent a huge commercial opportunity.
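To give a sense of how modest the linguist’s scenario is in practice, here is a minimal sketch of counting word usage across a corpus. The `word_frequencies` function and its simple tokenising rule are assumptions for this example; real corpus linguistics tools are considerably more careful about tokenisation.

```python
import re
from collections import Counter

def word_frequencies(documents):
    """Count how often each lower-cased word appears across all documents."""
    counts = Counter()
    for text in documents:
        # Simplistic tokeniser: runs of ASCII letters, with an optional
        # apostrophe part so that "don't" stays one word.
        counts.update(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
    return counts
```

The output is a table of words and counts, which says a great deal about language use and very little that could substitute for the published articles themselves.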
Untangling the world of data mining and semantics is an important part of the service we provide for publishers. We love taxonomies, content enrichment and discoverability. If these are issues that are affecting your business, and you’d like some help with working through them, then get in touch today.