Augmented Intelligence in Scholarly Communication: The Machine as Reader
This past January I was in Berlin for APE 2016 (“Academic Publishing in Europe”). I was both a keynote speaker and a first-time attendee. Several others have blogged summaries of the meeting (Kent Anderson of Caldera Publishing Solutions has two posts, on Day 1 and Day 2; Fiona Murphy of Maverick has blogged for ALPSP on the meeting). There was a lively Twitter flow under the hashtag #APE2016. It was an excellent meeting for looking at both challenges and opportunities well into the future. The perspectives on challenges and opportunities are different in Europe from those in the US – perhaps surprising because scholarship and publishing are so internationalized. I look forward to the presentation videos being posted! [Update: the videos are now posted to the APE2016 videos page.]
My keynote, “Friction in the Workflow: Where are we Generating More Heat than Light?”, covers a wide range of friction-producing challenges that researchers hit in moving through today’s researcher workflow:
- Manuscript Submission
- Peer Review
- Barrier to Dissemination
- Missing Media
- Supplementary Information
- Grey Literature
- Reading is Hard
- Starting Up
- Keeping Up
- Distracted Driving
- Lifestyle Journal Sites
- Referencing and Reading
- Book-Chapter Indexing
- Scholarly-Image Indexing
- Off-Campus Access
At the end of my talk, I note what I think are going to be large forces – what I call “discontinuities” – that will drive even more divergence between the workflow that publishers support and the capabilities of the technologies that researchers are surrounded by as both researchers and consumers.
There are three discontinuities I highlight:
- A change in the architecture and the metaphor around scholarly communication, from treating scholarship as a library in which to look things up and write, to treating it as a channel in which to act and interact, to question and answer.
- Pressure building for scholarly publishing to be more open to communication and conversation, not just the one-way “broadcast” flow typical of Web 1.0 styles.
- Machine learning, allowing machines to not only read, understand and interpret the literature but potentially to write it.
What is remarkable is that two other speakers – and there is no previous connection between us – highlighted the machine learning theme as well.
Barend Mons, chair of the High Level Expert Group on the European Open Science Cloud, spoke about the European Open Science Cloud – he was very clear that “open science” is not about “open access” – and talked about the FAIR Data initiative. As a speaker he seemed to enjoy standing things on their head: publishers should “publish data, with supplementary [text] articles”, and “don’t try to make text [readable] for computers, or make RDF [readable] for people”. His essential point is that the narrative form of the article is not machine-appropriate, and to enable greater discovery we need to be creating the “internet of data”. Using text as the universal language to communicate for both people and computers is “mixing tools for helicoptering with tools for digging” – the point being that nobody builds helicopters with big shovels on them (you can imagine the comical slide that went along with this point). Enabling this kind of data publishing will require “data stewardship for data discovery”. The data stewards will be found in science programs and labs.
Todd Toler, VP of Digital Product Management for Wiley, spoke to the title “Digital First: A Publisher’s Value Proposition in the age of Reproducible Science & Open Data”. While he used a different vocabulary than Barend did, you can see some parallel streams: “make network linked data the primary product, not the [article] XML”, “the current publication process breaks all the links by flattening the data into a publication”, “don’t break the linked data pipe as part of the publishing process”, and “expand [the publishing process] so that the paper is a linked data package that the user can install”. Todd described the tools for getting this going, and content engineers at publishing houses and platforms should watch for the video of his talk when it becomes available. Todd referenced the development of “Scholarly HTML”. [A high-level background is here.]
My own take is that we will have at least two levels of machine intelligence at work in scholarly communication. Just as we now have articles that are available in PDF only (plus metadata) alongside articles with fully-tagged XML, machines will be reading articles from the past (articles intended for humans) and reading “articles” that are optimized for machine mining. It will be important for the future that we neither ignore the past nor act as if that not-machine-optimized literature is “offline” to machines. In the early years of electronic publishing we built many journal sites that didn’t have pre-1995 content in them because that content wasn’t digitally available. But we quickly saw that “offline was out-of-mind” as researchers and especially students treated the offline content as if it didn’t exist. (The authors of that offline content were none too pleased to be digitally marginalized.) And so that back content was optically scanned, minimally tagged, and put online. We will probably see this same lifecycle recapitulated with machine reading of the literature: we will be able to demonstrate the value by handling the literature prepared for machines, and then the value will be extended to the earlier literature.
What might some of the early applications/uses of machine intelligence be? In the consumer space we are already seeing some. I sat next to someone on Friday who was carrying on what seemed to be a conversation with Siri about Super Bowl statistics. And we are seeing the beginnings of autonomous-acting vehicles (self-parking, and highway-self-driving cars). And there now seem to be a thousand flowers blooming:
Image courtesy of Shivon Zilis in “The current state of machine intelligence 2.0”
In scholarly communication, I expect that early applications will be targeted to either niche subject domains (databases that are focused on a specific area, e.g., drug interactions) or niche tasks (e.g., alerting on literature I should read, based on my specific interests). Google Scholar already has an excellent alerting system that builds its profile of my interests off of what I have written. It might be a big step to expand its comprehension of my interests to cover what I read as well. Some are working on this problem.
On my wish list is a service that will not only read and select the literature I am interested in, but also provide a specific-to-me summary of what I will find important in each day’s or week’s research publications. This would be something like the equivalent of personalized medicine: tailoring a drug regimen to my specific genetic makeup. We do this alert-and-summarize manually every time we recommend a specific paper (or news item) to an individual colleague, pointing out why s/he should read it and what will be notable inside.
Well beyond these examples are the places where the data layers described by Mons and Toler can take us: a form of the scholarly literature that can be understood by machines, and understood well enough to have insight into important correlations. We already see presentations at publishing meetings by researchers at IBM about curation of the research literature integrated with Watson, for domains such as pharmaceuticals. And STM’s Future Lab has identified machine reading as a future trend as well. The combination of understanding and insight with the brute force of being able to examine vast amounts of the literature (and its data) will be a game changer.