Protecting Publisher Content in the Age of AI: Why Scholarly Publishing Must Confront the New Reality Now

As HighWire’s Director of Hosting Products, Josh Routh, put it in his opening remarks during the October Best Practice Webinar, Protecting Publisher Content in the Age of AI, “publishing as we know it is changing, and has already changed.” His framing was stark. In a matter of months, the core expectations that once anchored digital publishing have become unreliable. Search referral patterns are shifting as AI systems answer questions directly, bypassing publisher websites entirely. Robots.txt files, once a foundational tool for signaling crawling rules, are frequently ignored by aggressive AI scrapers. Usage analytics, once a dependable indicator of engagement, now capture only a small fraction of actual readership because Large Language Models (LLMs) ingest and reuse content without triggering measurable events. And in this new environment, even the representation of content can no longer be trusted: LLM outputs often misquote, misattribute, or oversimplify scientific work, leaving authors and publishers struggling to recognize their own material.

The entire scholarly communication ecosystem depends on fidelity to the scientific record. Scientific information must be accurate, attributable, and verifiable. When AI systems remix and re-express that information, those norms must carry through. Yet generative models, designed to produce fluent language rather than traceable scholarship, often omit citations, conflate research findings, or hallucinate entirely new material. The consequences are not abstract. In fields like medicine, psychology, engineering, and public health, distorted or decontextualized information can directly affect human lives.

This first post in our four-part series examines the scale and nature of the challenge. It draws on the webinar’s insights from three experts approaching the problem from different perspectives: Pascal Hetzscholdt, who leads AI Strategy and Content Integrity at Wiley; Mark Hahnel, VP Open Research at Digital Science and founder of Figshare; and Josh Nicholson, Chief Strategy Officer at Research Solutions and co-founder of Scite. Their perspectives converge on a shared message: AI can strengthen scholarly communication, but only if publishers act decisively to shape how their content is used, represented, and licensed in the age of machine consumption.

A Digital World No Longer Governed by Familiar Rules

Routh’s introduction set the tone by cataloging four assumptions that publishers long took for granted, each now rendered obsolete by AI. For years, the industry operated on the simple premise that if a publisher optimized content for Google, discovery would follow. Today, however, Google referrals are falling as AI-generated answers sit atop most results pages, providing self-contained responses that eliminate the need for users to click through to the underlying articles.

Even the barriers publishers once relied on to control crawling (robots.txt, rate limiting, bot detection) have become porous. Several large AI developers ignore robots.txt entirely, and publishers report having to dramatically scale up their infrastructure simply to handle the increased bot traffic. At the same time, analytics systems that once provided a clear lens into how content was being read no longer reflect reality; AI systems can ingest an entire corpus in one sweep and reuse that material without ever returning to the publisher’s site. Traditional COUNTER metrics and pageview analyses reveal only a small sliver of how content is actually being consumed.
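For context, robots.txt is simply a plain-text file of voluntary directives served from a site’s root. The hypothetical example below shows how a publisher might try to disallow well-known AI crawlers; as the webinar made clear, nothing in the protocol forces compliance.

```
# Hypothetical robots.txt a publisher might serve (illustrative only).
# These directives are advisory; the user agents named below are examples
# of known AI crawlers, and compliance is entirely voluntary.
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
```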

Perhaps most unsettling, researchers increasingly find AI tools summarizing their work incorrectly or weaving together elements from outdated, withdrawn, or misremembered sources. Citation lists are truncated or fabricated. Study details are distorted. The version of record is quietly replaced by the version the AI happens to have assembled. For a workflow built on trust, attribution, and accuracy, this constitutes a fundamental breach.

Why These Changes Are Not Just a Technical Problem

The broader stakes extend far beyond web traffic or licensing structures. AI systems now sit in a position of learned authority in the eyes of the general public. When a model interprets and re-expresses the scientific record, the values of scholarly publishing—attribution, verifiability, replicability, transparency, and accuracy—must inform that process. Otherwise, the outputs risk degrading the integrity of scientific communication at the moment it is most needed.

Pascal Hetzscholdt gave an example: patients are increasingly using ChatGPT not only before consultations but during them. Armed with AI-generated insights, they challenge clinicians using arguments whose provenance is invisible. The physician often cannot determine whether the information is drawn from high-quality sources, outdated literature, or outright hallucination. In some cases, patients trust the AI more than the trained professional standing in front of them. The risk to public trust, and to public health, is real.

Responsible AI: Wiley’s Push for Ethical, Explainable, and Traceable Models

Pascal outlined Wiley’s comprehensive approach to “Responsible AI,” built around the belief that AI should enhance, not replace, human expertise. The first pillar of that approach is the integrity of training data. Many major models have been trained on a corrupted dataset that includes more than 41,000 retracted papers, undermining confidence in any downstream conclusions. Others have been trained on dark-web data, including breached personal information, another indicator of how little control publishers historically had over the use of their materials.

The second pillar is explainability. Scholarly publishing rests on traceability: readers must be able to see where a claim came from and how it was derived. But many AI models remain black boxes, making it difficult to identify the sources, data, or reasoning behind their outputs. Without such transparency, publishers lack the visibility needed to protect their content and ensure that users receive reliable, citation-backed information.

The third pillar is human oversight. Wiley requires authors to disclose how AI assistance was used in manuscript preparation, prohibits listing generative AI as an author, and stresses the need for human accountability in all editorial decisions. Rather than ceding authority to models, Pascal argued that experts must become AI-literate allies who help users interpret outputs, validate claims, and understand the limitations and failure modes of the tools they employ.

Quality, Provenance, and the AI Knowledge Flywheel: Mark Hahnel’s Perspective

Where Pascal focused on ethics and governance, Mark Hahnel explored the implications from the data side of the equation. AI, he argued, has the potential to accelerate scientific discovery to a degree unprecedented in modern research. His most striking example was DeepMind’s work on protein folding. Over fifty years, the Protein Data Bank amassed roughly 170,000 experimentally solved protein structures. DeepMind’s model, trained on that dataset, produced one million predicted structures in its first year and ultimately scaled to 200 million. Economists estimate that reproducing the PDB’s experimental output alone would cost more than $20 billion, making the model’s contribution not merely impressive but transformative.

But Mark’s broader argument was that this kind of acceleration only works when AI systems ingest high-quality, well-structured scientific information. The volume of content is exploding; Elsevier alone received 3.5 million submissions and published 700,000 papers in a single year. Combined with the rise of preprints, the proliferation of AI-generated manuscripts, and the flood of uncurated datasets, that growth has created a landscape in which “garbage in, garbage out” is a serious risk.

Machine readability is the key factor determining whether AI systems can use content correctly. Yet full-text XML adoption has plateaued among smaller publishers, abstract openness sits at around 52%, and PDF accessibility falls short: in one study of 20,000 PDFs, fewer than 4% met accessibility standards. Structured references, schema markup, and FAIR metadata are unevenly implemented across the industry.

Mark’s conclusion was unambiguous: if publishers want AI to treat their content responsibly, they must make it legible to machines. This means structured XML, enriched metadata, open references, semantic tagging, and consistent accessibility practices. Without this foundation, even the best-intentioned AI models will misinterpret the scholarly record.
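As a small, hedged illustration of what machine-legible metadata can look like in practice, the JSON-LD snippet below uses schema.org’s ScholarlyArticle vocabulary, one common way of embedding structured metadata in article landing pages; every value is an invented placeholder rather than anything discussed in the webinar.

```json
{
  "@context": "https://schema.org",
  "@type": "ScholarlyArticle",
  "headline": "Placeholder article title",
  "author": [{ "@type": "Person", "name": "A. Researcher" }],
  "datePublished": "2024-05-01",
  "identifier": {
    "@type": "PropertyValue",
    "propertyID": "DOI",
    "value": "10.1234/placeholder.5678"
  },
  "isPartOf": { "@type": "Periodical", "name": "Placeholder Journal" },
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "abstract": "A machine-readable abstract would appear here."
}
```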

Grounding AI in Evidence: Josh Nicholson’s Model for AI-Ready Citations

Josh Nicholson brought the conversation from theory into implementation. His work at Scite has long centered on extracting meaning from citations, identifying whether a study’s claims have been supported, challenged, or merely mentioned by subsequent research. Over nearly a decade, his team built machine-learning pipelines capable of classifying citation context at scale, now applied to millions of articles.
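To make the shape of that task concrete, here is a deliberately simplified Python sketch of citation-context classification. It is a keyword heuristic for illustration only; Scite’s production pipeline relies on trained machine-learning models, and the cue lists and labels below are assumptions.

```python
# Toy illustration only: a heuristic classifier for citation context.
# The real task is handled by trained models; this sketch just shows
# the input/output shape (citing sentence in, context label out).

SUPPORTING_CUES = ("consistent with", "confirms", "in agreement with", "replicates")
CONTRASTING_CUES = ("in contrast to", "contradicts", "fails to replicate", "unlike")

def classify_citation_context(sentence: str) -> str:
    """Label a citing sentence as 'supporting', 'contrasting', or 'mentioning'."""
    text = sentence.lower()
    if any(cue in text for cue in SUPPORTING_CUES):
        return "supporting"
    if any(cue in text for cue in CONTRASTING_CUES):
        return "contrasting"
    return "mentioning"

if __name__ == "__main__":
    examples = [
        "Our results are consistent with Smith et al. (2020).",
        "Unlike Jones (2019), we observed no effect of dosage.",
        "Prior work has examined this question (Lee, 2021).",
    ]
    for s in examples:
        print(f"{classify_citation_context(s):12s} <- {s}")
```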

In the LLM era, this work takes on newfound relevance. ChatGPT and similar tools often fabricate citations or misattribute findings. Josh argued that if AI is to serve as a trustworthy research assistant, it must be grounded in verifiable evidence. The solution, he suggested, lies in exposing AI-ready snippets or “chunks” of articles that allow models to perform retrieval-augmented generation. Instead of relying on whatever text the AI may have scraped months earlier, the model would retrieve curated, structured fragments supplied directly by publishers.
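A minimal sketch of what publisher-prepared, AI-ready chunks might look like is shown below; the field names and chunking strategy are assumptions for illustration, not a published specification.

```python
from dataclasses import dataclass

@dataclass
class ArticleChunk:
    """A hypothetical AI-ready snippet a publisher might expose for retrieval."""
    doi: str         # persistent identifier of the version of record
    section: str     # e.g. "Methods", "Results"
    text: str        # the retrievable passage itself
    license: str     # terms under which the snippet may be reused
    source_url: str  # link back to the version of record

def chunk_article(doi: str, source_url: str, license_terms: str,
                  sections: dict[str, str], max_chars: int = 800) -> list[ArticleChunk]:
    """Split each section into passages small enough for a retrieval index."""
    chunks: list[ArticleChunk] = []
    for section, body in sections.items():
        for start in range(0, len(body), max_chars):
            chunks.append(ArticleChunk(doi, section, body[start:start + max_chars],
                                       license_terms, source_url))
    return chunks
```

Keeping the DOI, license, and source URL attached to every fragment is what lets a downstream model cite and link back to the version of record rather than to anonymous scraped text.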

Building on this idea, Josh outlined a Pan-Publisher Model Context Protocol (MCP). In this architecture, a user’s question prompts the AI to search across internal corporate documents, the web, open datasets, and, most importantly, publisher-provided content snippets. The system assembles these fragments, generates a grounded answer, and directs the user back to the version of record. This approach not only improves accuracy and attribution but also enables new licensing models in which publishers can charge for AI rights at the article, portfolio, or library level. It also opens the door to tracking how often AI systems interrogate specific publications, restoring some of the visibility lost in traditional analytics.
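The sketch below illustrates, under assumed function names and a stubbed search_publisher_snippets endpoint, how such a retrieval layer could assemble publisher-provided evidence and point users back to the version of record; a real MCP implementation would involve an actual protocol handshake and licensed publisher APIs.

```python
from dataclasses import dataclass

@dataclass
class Snippet:
    doi: str
    text: str
    source_url: str

def search_publisher_snippets(query: str) -> list[Snippet]:
    """Hypothetical call to a pan-publisher snippet index (stubbed for illustration)."""
    # In a real deployment this would query publisher-provided APIs and
    # return licensed, structured fragments rather than scraped text.
    return [
        Snippet(doi="10.1234/placeholder.5678",
                text="Example finding relevant to the query.",
                source_url="https://doi.org/10.1234/placeholder.5678"),
    ]

def answer_with_grounding(question: str) -> str:
    """Assemble grounded evidence that links back to the version of record."""
    snippets = search_publisher_snippets(question)
    evidence = "\n".join(f"- {s.text} (see {s.source_url})" for s in snippets)
    # A production system would pass `evidence` to an LLM as retrieval context;
    # here we simply return it so the end-to-end flow is visible.
    return f"Question: {question}\nEvidence used:\n{evidence}"

if __name__ == "__main__":
    print(answer_with_grounding("What does the literature say about X?"))
```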

A Converging Vision for the Future

Although Pascal, Mark, and Josh approached the issue from different angles, their messages converge cleanly. Together they paint a picture of a scholarly ecosystem in which publishers must no longer see themselves solely as custodians of content but as active participants in the AI supply chain. Pascal emphasized the need for ethical, licensed, explainable AI built on trustworthy data. Mark stressed that for AI to function responsibly, publishers must deliver content in high-quality, machine-readable formats. Josh demonstrated that publishers can go further still, creating structured pipelines that allow AI models to retrieve fragments of the scholarly record in ways that preserve context, accuracy, and attribution.

The combined message is clear: AI will shape the future of research discovery, but publishers must shape the role AI plays within it. With coordinated efforts across licensing, standards, metadata enrichment, and structured delivery, publishers can ensure that AI becomes a powerful ally in advancing scientific knowledge, not a threat to its integrity.

As Pascal concluded in his presentation, “AI won’t replace publishers—but publishers who use AI responsibly will replace those who don’t.”

– By Tony Alves
