Nothing is closer to a publisher’s heart than their content. In fact it can seem almost sacrilegious to call it by such a cold and technical term: publisher content is a valuable and often unique thing.
This uniqueness also means that it is highly diverse. And it comes in many different shapes and sizes. ‘Publisher content’ could refer to a life-saving new discovery in oncology, a revelatory new monograph on Shakespeare – or even a series of videos detailing primacy rituals among rodents.
Machines, however, don’t make qualitative distinctions in this area. Whether your content is about mice, men or medical research doesn’t figure for them – to the machines it’s all just content.
This difference of ‘world view’ (if you like) between humans and machines can cause problems when it comes to designing a platform to make the most of publisher content. Our job, at HighWire, is often to translate between the two, so we have a lot of experience in this area. Based on that experience, here are some tips for how to approach conversations about content, so that the best and most profitable outcome can be arrived at with the minimum of confusion.
I spoke to a business analyst at HighWire, whose role involves helping clients to better understand the opportunities and potential pitfalls in planned developments, and to maximize the value of their investment. One thing that makes the role so necessary is that the level of clients’ technical knowledge varies widely. There is an awful lot to know about web technology, and devils lurk in the detail.
Tip 1 – Things that seem simple can turn out to be complex … and vice versa
Where clients have a relatively low level of technical knowledge, they can come up with what seem to them like perfectly reasonable requirements, which on examination turn out to be difficult (i.e. requiring manual intervention) and therefore expensive.
An example that does the rounds among technical teams at HighWire is a client from the natural sciences who wanted his platform’s onboard search to return results relating to animals in order of size, largest first – e.g. blue whales ahead of elephants. Since information about weight was not consistently recorded either in the text or metadata, this proved far from straightforward.
It could have been done with recourse to some extra tagging work carried out by humans (expensive, and not budgeted for), or perhaps using an external look-up to some notional database of animals by weight – but the point is, it wasn’t as easy as the client supposed.
By contrast, other requirements are much simpler and cheaper to implement, but sometimes don’t even occur to clients as being possible, so don’t figure in the initial specification.
Tip 2 – It’s simplest to select on things that are already in the data
As a rule of thumb, things that flow with the data, or are in the data, tend to be simpler. Obvious things (to a machine) would be the date order of articles, since all articles are dated using fairly stable formats, and author attribution (the latter made simpler, now, by ORCID). The most popular articles on a platform (the most read, the most downloaded, and so on) should also be simple to deliver, because this information is routinely recorded by websites. Things which are consistently recorded in metadata, or in the text itself, can be used to order and facet content with relative ease.
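To make the point concrete, here is a minimal Python sketch of ordering and faceting on values that already travel with the content. The records, field names (`published`, `subject`, `downloads`) and function names are invented for illustration and do not describe any real platform's data model.

```python
from collections import Counter
from datetime import date

# Hypothetical article records; the field names are assumptions for this sketch.
articles = [
    {"title": "Jaguar predation at waterholes", "published": date(2019, 3, 14),
     "subject": "zoology", "downloads": 420},
    {"title": "A new oncology trial", "published": date(2020, 1, 9),
     "subject": "oncology", "downloads": 1310},
    {"title": "Dominance rituals among rodents", "published": date(2018, 11, 2),
     "subject": "zoology", "downloads": 95},
]


def newest_first(records):
    """Order records by publication date, most recent first."""
    return sorted(records, key=lambda r: r["published"], reverse=True)


def most_downloaded(records, limit=10):
    """Return up to `limit` records with the highest download counts."""
    return sorted(records, key=lambda r: r["downloads"], reverse=True)[:limit]


def facet_counts(records, field):
    """Count how many records carry each value of a metadata field."""
    return Counter(r[field] for r in records)
```

Each of these is a one-line operation because the date, the download count and the subject are recorded for every item. Ordering by animal weight would need a field that is simply not there.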
Next after this, in terms of difficulty, come instances where machines are required to make inferences about the meanings within certain content items based on context.
Tip 3 – Machines can now make inferences about content based on context
An example of this, again from natural sciences, would be a zoologist searching for information about jaguars. If the content crawled contains a lot of references to tyre pressure, torque ratios and ‘cornering like it’s on rails’, the machine might deduce that this is a low-scoring hit. If closely related terms in the text include words such as ‘savannah’, ‘predation’ and ‘waterhole’, on the other hand, the content item would be scored more highly, and feature nearer the top of the SERP.
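The idea can be illustrated with a deliberately naive Python sketch that scores a text by counting context words. The two vocabularies and the scoring rule are assumptions made for this example; real search engines and enrichment tools use far richer models than simple word counts.

```python
import re

# Hypothetical context vocabularies, invented for this sketch.
ANIMAL_CONTEXT = {"savannah", "predation", "waterhole", "prey", "habitat"}
CAR_CONTEXT = {"tyre", "torque", "cornering", "engine", "gearbox"}


def context_score(text, supporting=ANIMAL_CONTEXT, conflicting=CAR_CONTEXT):
    """Score a text: +1 per supporting context word, -1 per conflicting one."""
    words = re.findall(r"[a-z]+", text.lower())
    support = sum(1 for w in words if w in supporting)
    conflict = sum(1 for w in words if w in conflicting)
    return support - conflict


def rank(documents):
    """Order documents so the best contextual matches come first."""
    return sorted(documents, key=context_score, reverse=True)
```

Given one text about a jaguar at a waterhole and another about a Jaguar's torque and tyres, `rank` would place the zoological text first, which is the behaviour the zoologist wants.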
Using the right software, together with thesauri or other ontologies, a similar approach based on context can be taken to machine-enriching content so that it is more crawlable, and content becomes more discoverable and/or more findable.
Let’s not complicate matters by thinking about buffalos (!). Suffice it to say that this type of automation is now becoming far more mainstream. A publisher with limited technical knowledge might not be aware of this as an option. However, it is equally the case that another publisher, who has perhaps been to some conferences about semantic enrichment and got fired up about the possibilities, might not be aware of certain practical limitations to do with the maturity of these technologies, and believe they can do more than is actually the case.
The point being made here is the importance of a good conversation – if the specification is too nailed down before you talk to your platform developer, you risk pitching your expectations either too high or too low.
Tip 4 – Expect a conversation, not just a tender response and a quote
Chris is ardent about ensuring the quality of this conversation, feeling it is HighWire’s job to guide clients to what can be simply implemented so as to deliver most value; not overlooking opportunities, but on the other hand not trying to shoot for the moon / boil the ocean / [insert your own cliché] …
It is not as simple as just going for the low-hanging fruit: more expensive features might deliver higher value. The trick is not to spend all your money on expensive features that only ten people use, or conversely to skimp on providing something that will help attract new customers or move existing ones up the value chain. It’s a cost/benefit decision, based on your business drivers.
Tip 5 – Make sure sample data is truly representative
Sometimes sample data is about as representative of what we finally get as are estate agents’ details of the houses and flats they describe. No accusations are being made here of deliberate mendaciousness: content can be a diverse beast, even within a single collection.
But it is easy for a developer to be wrong-footed by unrepresentative data. They will design on the basis of what the sample data shows them, attempting always to make things clearer for the users based on the picture they get of what the content is all about. It is not unknown, however, for a design based on a set of sample data to be thrown back with words along the lines of: ‘no that won’t work for these other bits of our collection’ … And that is fine, so long as the throwing back happens before expensive development work occurs and money is spent.
Often, and most helpfully, this becomes an iterative conversation. The client sends sample data, the developer draws up a set of assumptions about the design based on that sample and plays it back, then the client spots exceptions and perhaps sends a further, different data sample for the developer to work with, incorporating the new requirements indicated into a new design.
Clearly this cannot be the case where the client is at a later stage of development and certain things are set in stone – however the guiding principle should be to expose as much information as possible about the content at the earliest stage.
An example of such a conversation at work is given by some work we did on a platform about diseases. It was required that search should autocomplete on the names of diseases (which are often difficult to spell and type out in full); however, our analysis of the data showed that maybe five synonyms would be given for any particular disease, meaning that much relevant content could be accidentally excluded from search.
The solution was to work with a single synonyms file built into the ‘plumbing’ of the platform so that a search for one of the disease names would also pick up all the synonyms.
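A synonym look-up of this kind can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation used on the platform described: the synonym groups are held inline rather than loaded from a file, and the names `SYNONYM_GROUPS`, `expand_query` and `search` are invented for the example.

```python
# Hypothetical synonym groups; in practice these would be loaded from a
# maintained synonyms file rather than written inline.
SYNONYM_GROUPS = [
    {"chickenpox", "varicella"},
    {"whooping cough", "pertussis"},
]


def build_index(groups):
    """Map every name to the full set of names in its synonym group."""
    index = {}
    for group in groups:
        normalised = {name.lower() for name in group}
        for name in normalised:
            index[name] = normalised
    return index


SYNONYMS = build_index(SYNONYM_GROUPS)


def expand_query(term):
    """Return the search term plus any synonyms known for it."""
    key = term.strip().lower()
    return SYNONYMS.get(key, {key})


def search(term, documents):
    """Return documents that mention the term or any of its synonyms."""
    terms = expand_query(term)
    return [doc for doc in documents if any(t in doc.lower() for t in terms)]
```

With this in place, a search for ‘chickenpox’ would also return a document that only ever uses the word ‘varicella’, so the user does not need to know which name the author chose.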
This latter example also shows how a simple, practical feature can take a development into the area of semantic enrichment – although this had not been considered as a part of the initial specification. Through this sort of detailed – but not over-lengthy or grindingly granular – dialogue, we can point out opportunities and advantages the client might not have seen at the outset.