The potential impact of Artificial Intelligence on the scholarly publishing ecosystem
Artificial Intelligence (AI) is a ‘hot’ technology in many different sectors from Financial Services to Retail to Transportation. When a technological innovation becomes the stuff of widespread media attention, the mainstream definitions tend to become malleable, and terms that describe different processes or applications quickly become synonymous.
So in considering the current and potential impact of AI within the scholarly publishing ecosystem, we will first clarify terms. The two most used phrases in this field are artificial intelligence and machine learning (ML). There’s a useful article explaining the difference here. In summary, AI is the general concept that machines may be able to act in ways that humans consider ‘intelligent’ and ML a subset of AI that describes the ability of machines to ‘learn’ from datasets and make ‘decisions’ accordingly.
Another useful term to clarify is Natural Language Processing (NLP): “a field of computer science, artificial intelligence concerned with the interactions between computers and human (natural) languages, and, in particular, concerned with programming computers to fruitfully process large natural language data”.
A look at the Gartner Technology Hype Cycle from 2017 shows that - probably unsurprisingly - Machine Learning is situated at the exact peak of ‘inflated expectations’ and is predicted to need another two to five years to work its way through to the plateau of productivity.
The application of AI that has had by far the most traction in the consumer sphere, and is therefore driving both consumer expectation and understanding of AI, is Amazon’s Alexa. Alexa is an excellent example of the application of NLP to aid machine learning. An NLP system like Alexa uses a cloud-based ‘brain’ to determine, within a sentence, both the intent and the attributes of the sentence that relate to that intent. So, if you say to Alexa ‘I would like a taxi to London’, Alexa will extract the intent as “find taxi” and the critical attribute as “Destination City: London”. Armed with this simple data derived from a complex sentence, Alexa can then go and provide an answer.
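The intent-and-attribute step described above can be illustrated with a toy example. This is only a minimal sketch under loud assumptions: it uses a hand-written regular expression rather than a trained language model, it is not how Alexa is actually implemented, and the names `ParsedUtterance`, `INTENT_PATTERNS`, `parse` and `destination_city` are hypothetical.

```python
import re
from dataclasses import dataclass, field


@dataclass
class ParsedUtterance:
    """The two things extracted from a sentence: an intent and its attributes."""
    intent: str
    attributes: dict = field(default_factory=dict)


# Hypothetical rule table: each intent has a pattern whose named groups
# become the attributes. A real system would learn this from data.
INTENT_PATTERNS = [
    (
        "find taxi",
        re.compile(
            r"\b(?:taxi|cab)\b"                      # keyword that signals the intent
            r"(?:.*?\bto\s+"                         # optional "... to <City>"
            r"(?P<destination_city>[A-Z][\w-]*(?:\s+[A-Z][\w-]*)*))?"
        ),
    ),
]


def parse(utterance: str) -> ParsedUtterance:
    """Return the first matching intent and any attributes found in the sentence."""
    for intent, pattern in INTENT_PATTERNS:
        match = pattern.search(utterance)
        if match:
            # Keep only the named groups that actually matched something.
            attributes = {k: v for k, v in match.groupdict().items() if v}
            return ParsedUtterance(intent, attributes)
    # Nothing matched: the sentence falls outside the known 'skills'.
    return ParsedUtterance("unknown")


result = parse("I would like a taxi to London")
print(result.intent, result.attributes)
```

For the sentence from the text, the sketch yields the intent `find taxi` with the attribute `destination_city` set to `London`; a sentence that matches no rule comes back as `unknown`, which mirrors the limitation discussed next.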
However, Alexa is only as good as the skills she has been taught (in this case ‘taxi booking’), and Amazon’s masterstroke has of course been to open up the platform so that third-party developers can write custom ‘skills’. But ask Alexa a question that falls outside her ‘skills’ and she will very quickly reveal herself to be not so intelligent after all. Until Alexa ups her game on the ML/NLP side (and there are plenty of developers working away to help her do this), she will be little more than a novel I/O device, the logical next step in a lineage that started with the punch card. Indeed, to a large extent, the sheer novelty of her voice-operated, NLP-driven interaction is obscuring her lack of true ‘intelligence’. Considering Alexa’s success through the lens of ‘novelty’ helps explain why a leading study has shown that 97% of newly installed Alexa skills are left unused after one week. This probably isn’t very different from the track record of newly downloaded apps on a smartphone.
That all said, and as David Smith points out in his recent post on this topic, there may still be some useful applications gained from applying a ‘scholarly publishing’ skill set for Alexa to train her to search and alert researchers to the publication of new and relevant research.
But there is of course more. The application of machine learning to scholarly content is not a new phenomenon. Semantic Enrichment and Semantic Search were pioneering examples of using computational power to extract usable data from large ‘unstructured’ corpora. They were, in their time, exemplars of ‘hot’ technologies within our ecosystem. Advances in computational power and improvements in the ‘intelligence’ of the algorithms set to work on those corpora are beginning to yield a new set of exciting applications that are directly relevant to improving both discoverability and workflows. One could make the argument that the vast quantities of structured content that make up the backbone of the scholarly ecosystem create an almost perfect testbed for innovation in AI (and in particular ML and NLP).
We therefore see it as business-critical that we take an active and leading role in exploring the potential for AI within the scholarly ecosystem. As well as chairing a panel on this topic at APE 2018, we have convened a two-day, invitation-only workshop at Stanford on 29-30 January. At this event, editors and publishers will actively explore pragmatic integrations of AI and other technologies within editorial workflows with two of our industry partners - Meta and Google.
Products and integrations that we see as potentially impactful are as follows.
- Installed at Stanford and some other institutions, Yewno uses ML and computational linguistics to underpin products including Discover and Unearth. Advances in AI enable such products to mine content more intelligently, bringing benefits to both publishers and researchers. There will be new opportunities for publishers to monetize content (through the drawing together of previously disparate or discrete themes, data or concepts) and improved discoverability of content that will benefit undergraduates in particular as they explore a new academic subject and need help with unfamiliar vocabularies and complex hierarchies of terms.
- HighWire-hosted EMBO is prototyping SourceData. It creates a metadata structure that enables the description of scientific figures and provides a set of tools that allow this information to be searched, analysed and edited. The machine readability of the data is central to enabling Machine Learning to power the system to answer ever more complex queries. Ongoing improvements in ML and Natural Language Processing (NLP) will undoubtedly drive this product forward, likely enabling automated cataloging of image content; this is essential to the scalability of SourceData. HighWire and Meta, our AI development partner, look forward to further development in this direction.
- We are partnering with Meta to deploy Meta Bibliometric Intelligence. While some editors might look for initial triage and streamlined identification of potential high impact research, others may want to take advantage of reviewer recommendations first – we will learn editors’ priorities at the workshop mentioned above. We believe this is the most exciting application to date of AI within the editorial workflow component of the scholarly publishing ecosystem. Editorial workflow may be a high-impact area for AI innovations in the short to medium term.
- But it is vital to see beyond the ‘hype’ of technology at the peak of ‘inflated expectations’. To this end, we are working with Meta to ensure that results from algorithms are validated and that the integrations directly support editorial teams in the specific areas of workflow where they most require support. We think it is unwise – especially in the context of scholarly communications’ focus on evidence – to trust results without validation.
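One simple way to picture what validating an algorithm's results could involve is to compare its output with the judgements of editors on the same manuscripts. The following is a minimal sketch under stated assumptions, not a description of how Meta or any publisher validates results: the manuscript identifiers are invented, and the function name `precision_recall` is hypothetical.

```python
from __future__ import annotations


def precision_recall(flagged: set[str], confirmed: set[str]) -> tuple[float, float]:
    """Compare algorithm flags against editors' decisions.

    precision: share of flagged manuscripts that editors also selected.
    recall:    share of editor-selected manuscripts that the algorithm flagged.
    """
    true_positives = len(flagged & confirmed)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(confirmed) if confirmed else 0.0
    return precision, recall


# Invented example data: manuscripts the algorithm flagged as potentially
# high impact, and those the editors independently judged to be so.
algorithm_flagged = {"MS-001", "MS-002", "MS-003", "MS-004"}
editor_confirmed = {"MS-002", "MS-003", "MS-005"}

precision, recall = precision_recall(algorithm_flagged, editor_confirmed)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this invented sample, half of the algorithm's flags agree with the editors, and the algorithm misses one of the three manuscripts the editors selected. Figures of this kind give editorial teams evidence on which to decide how far to rely on a recommendation.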
In the long term, AI will undoubtedly have a significant impact on all aspects of the scholarly publishing ecosystem: on readers, searchers, authors, editors, reviewers, publishers, librarians, etc. In the short to medium term, AI (specifically ML) will make its mark by further enabling discoverability and by driving efficiency and effectiveness in the editorial workflow. Few contentious issues will stem from the former - from the latter there will undoubtedly be many, including the biases of algorithms. A vision of the future that automates or (more likely) augments such critical processes as Peer Review will, of course, create tension right at the heart of the ecosystem. Validation of results and understanding biases will be paramount.
Whatever happens, one thing can be predicted for certain - standards will be required to ensure that content is structured in a way that enables the algorithms that make a system ‘intelligent’ to get to work. As Tahir Mansoori rightly says: “The decision to make the most of the breakthrough artificial intelligence toolbox stands with scholarly publishers, with an imminent need to reach a consensus on data sharing standards for ingestion by these algorithms and with the external innovators in this space”.