When the conversation turns to AI and scholarly publishing, it is common to focus on the spectacular: the sudden breakthroughs, the dramatic shifts in workflow, the existential questions about authorship or originality. But in the October Best Practice Webinar on Protecting Publisher Content in the Age of AI, Mark Hahnel, VP of Open Research at Digital Science and founder of Figshare, offered a different lens.
He asked the participants to pause, take a breath, and step back from the headlines. Before worrying about what AI tools can produce, he suggested we must understand what they consume. Because in the end, the only reason generative AI is so powerful in scientific contexts is that it ingests the scholarly record, a record publishers have spent decades curating, validating, and enriching. If the quality or structure of that record is compromised, AI will not compensate for it. It will amplify the weaknesses.
Mark’s central argument was this: AI models can accelerate scientific discovery at astonishing speed, but only if they are built on trustworthy, high-quality, machine-readable content. Without that foundation, AI becomes a generator not of insight, but of noise. And in a world already overflowing with information, noisy models could cause real harm. This is why the scholarly ecosystem must take metadata, machine readability, and provenance seriously, not just for human researchers, but for the machines that increasingly mediate access to knowledge.
The Flywheel Effect: When Good Content Enables Exponential Scientific Progress
To illustrate the transformative potential of AI when paired with high-quality scholarly data, Mark began with a case study that has already reshaped biological research: DeepMind’s work on protein folding.
For half a century, scientists painstakingly solved the three-dimensional structures of proteins through expensive, laborious experiments. By 2020, the Protein Data Bank (PDB) held about 170,000 experimentally determined structures. The economic replacement cost of that hard-won, resource-intensive archive exceeded $20 billion.
DeepMind then trained its model, AlphaFold, on that corpus of high-quality, meticulously curated data. Within a year, AlphaFold released one million predicted structures. Its successor models expanded this to 200 million, covering virtually all known proteins on Earth. What had taken the scientific community fifty years to build was suddenly expanded by orders of magnitude, all because the AI model could rely on trusted data with clean metadata, controlled vocabularies, and standardized formats.
This was the Flywheel Effect in action: high-quality, semantically rich data begets better AI; better AI begets new knowledge; new knowledge enhances future research outputs; and the cycle accelerates with each rotation. As Mark put it, “the cycle continues, with each rotation producing more valuable knowledge.”
This is one of the most inspiring developments in contemporary science. But it also contains a warning. The Flywheel Effect magnifies errors just as quickly as truths. If poor-quality material is fed into the system, or if high-quality content is not structured in ways that machines can understand, then AI will accelerate confusion instead of understanding. This is why the scholarly community must pay attention now, not when the Flywheel has already spun out of control.
A System Flooded With Content—But Not Always With Quality
Mark then shifted to the more sobering reality: the scholarly communication system is producing more content than at any point in history, but the quality and provenance of that content vary widely. He cited a striking statistic from Elsevier: the company fielded 3.5 million submissions in a single year and published about 700,000, an increase of 600,000 submissions year over year. If submission numbers at established publishers are rising this quickly, the source cannot be solely human productivity. Something else is happening.
Some of this growth is benign. Preprints have become a standard part of the scientific conversation. Datasets are being published at rates that dwarf article output. Open science has expanded the definition of publishable research objects.
But a substantial portion of the increase is undeniably due to AI-assisted or AI-generated manuscripts. Many of these papers are legitimate scholarly contributions; others are thinly rewritten summaries of existing work; still others are fraudulent, part of the growing global challenge of paper mills and fabricated scholarship.
From an AI training perspective, this is extremely dangerous. LLMs cannot easily distinguish between high-quality experimental work, questionable manuscripts, and outright fabrications. If the models train on a mix of all three, the Flywheel Effect breaks.
This is why provenance (knowing where data came from, who authored it, and whether it has been validated) is not a trivial metadata exercise but the backbone of responsible AI.
Machine Readability: The New Prerequisite for Discovery
As more content is produced, researchers increasingly rely on AI to help them navigate it. But AI cannot correctly interpret scientific publications without structure, metadata, and consistency. And here, Mark showed, the industry’s progress is uneven.
For instance, PDF accessibility remains dismal. In a study of more than 20,000 PDFs, less than 3.2% were fully accessible; roughly 75% failed all accessibility checks. Poor PDF extraction is one of the primary reasons LLMs misinterpret equations, conflate figure legends, or lose essential methodological detail. If humans struggle with poorly structured PDFs, AI struggles an order of magnitude more.
Meanwhile, abstract openness, though improving, still leaves large gaps. In 2020, only about 21% of abstracts were openly available; by 2024, the number had grown to around 52%. But this means that nearly half of published scientific abstracts remain invisible to AI tools that rely on open metadata.
Structured references also remain inconsistent. Even though Crossref made open references the default in 2022—and 88% of references were open even before the rule change—variability persists between publishers, particularly among smaller societies.
Machine-readable formats like JATS XML and semantic markup (schema.org, JSON-LD) are essential for making content discoverable to both humans and machines. Yet adoption has slowed in recent years, especially outside biomedicine. In many cases, the XML exists but is incomplete or lacks semantic depth, limiting its usefulness for text-and-data mining.
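To make that concrete, here is a minimal sketch of what lightweight semantic markup for an article landing page could look like, written as a Python script that emits schema.org JSON-LD. The property selection is illustrative rather than anything prescribed in the webinar, and the DOI, title, and journal name are placeholders.

```python
import json

# Minimal, illustrative schema.org description of a journal article,
# serialized as JSON-LD for embedding in a landing page. All identifiers
# and values are placeholders, not real records.
article_jsonld = {
    "@context": "https://schema.org",
    "@type": "ScholarlyArticle",
    "headline": "Example article title",
    "author": [{"@type": "Person", "name": "A. Researcher"}],
    "datePublished": "2024-05-01",
    "identifier": "https://doi.org/10.xxxx/example",  # placeholder DOI
    "isAccessibleForFree": True,
    "license": "https://creativecommons.org/licenses/by/4.0/",
    "isPartOf": {"@type": "Periodical", "name": "Example Journal"},
    # Subject terms ideally come from a controlled vocabulary or ontology.
    "about": [{"@type": "DefinedTerm", "name": "protein folding"}],
}

# Emit the JSON-LD block that would sit alongside the article HTML.
print(json.dumps(article_jsonld, indent=2))
```

Even this small amount of structure lets crawlers and AI pipelines recover the title, license, and subject terms without having to parse a PDF.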
Machine readability is no longer an operational nicety. It is the prerequisite for being included in the AI-mediated future of research. If content is not FAIR for machines, it will not exist in the environments where researchers increasingly work.
From Metadata to Meaning: How Semantic Enrichment Unlocks Knowledge
Mark then turned to the practical question: what can publishers gain from improving machine readability? The benefits, he explained, extend far beyond discoverability. Once content is normalized and enriched, it becomes input for semantic document processing, allowing AI tools to extract meaning instead of merely text.
During the webinar, he demonstrated how semantic annotation can identify domain-specific entities like proteins, diseases, or chemical compounds, mark their roles, and detect patterns across publications. He showed how ontological search can surpass keyword-based search by recognizing synonyms and contextual relationships and by disambiguating abbreviations.
Semantic enrichment effectively turns the scholarly record into a structured knowledge graph. And knowledge graphs, as Mark emphasized, “can be the base of AI models.” A publisher with enriched content could theoretically build its own domain-specific AI model, one grounded not in general web knowledge but in a corpus of peer-reviewed, validated research.
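As a rough illustration of that idea (not the specific tooling Mark demonstrated), the sketch below turns invented entity annotations from two papers into subject–predicate–object triples with provenance, then runs a synonym-aware lookup of the kind a plain keyword search would miss. The DOIs and the synonym table are hypothetical.

```python
# Illustrative sketch only: turning entity annotations into a tiny
# knowledge graph of (subject, predicate, object, source) triples.
# The DOIs and synonym table below are invented for demonstration.
annotations = [
    {"doi": "10.xxxx/paper-1", "entity": "BRCA1",
     "relation": "associated_with", "target": "breast cancer"},
    {"doi": "10.xxxx/paper-2", "entity": "TP53",
     "relation": "associated_with", "target": "Li-Fraumeni syndrome"},
]

# Keep the source DOI with every triple so each claim retains provenance.
triples = [(a["entity"], a["relation"], a["target"], a["doi"]) for a in annotations]

# A one-entry synonym table stands in for a real ontology lookup.
synonyms = {"breast carcinoma": "breast cancer"}

def find_associations(disease: str):
    """Return (entity, source DOI) pairs linked to a disease or its synonym."""
    canonical = synonyms.get(disease, disease)
    return [(s, doi) for s, p, o, doi in triples
            if p == "associated_with" and o == canonical]

# A keyword search for "breast carcinoma" would miss paper-1;
# the synonym-aware lookup finds it.
print(find_associations("breast carcinoma"))  # [('BRCA1', '10.xxxx/paper-1')]
```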
This represents a profound shift in the publisher’s role. Rather than simply hosting content, publishers could become stewards of domain-specific AI systems that outperform general-purpose LLMs in accuracy, reliability, and scientific rigor.
The Threat of “Garbage In, Garbage Out”: AI Cannot Fix the Data It Trains On
Mark was careful not to present AI as a panacea. He warned that without quality controls, AI will amplify weaknesses rather than solve them. The danger of “garbage in, garbage out” is particularly acute in scholarly communication because much of the value of scientific literature lies not in its linguistic shape but in its methodological rigor, evidentiary grounding, and contextual nuance.
An LLM can rewrite sentences, but it cannot distinguish between a methodologically flawed study and a gold-standard randomized trial unless those distinctions are encoded into the metadata or citation network. Nor can it detect when a paper has been retracted unless publishers mark that status reliably and consistently, and even then, AI developers must ingest and respect those signals.
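As a sketch of what “ingesting and respecting those signals” could look like, the filter below excludes records from a training corpus when provenance is missing or a retraction flag is set. The record layout and field names (`source_doi`, `peer_reviewed`, `retracted`) are hypothetical, not a standard publisher or Crossref schema.

```python
# Hypothetical metadata records; the field names are illustrative only,
# not a standard publisher or Crossref schema.
records = [
    {"source_doi": "10.xxxx/a", "peer_reviewed": True,  "retracted": False, "text": "..."},
    {"source_doi": "10.xxxx/b", "peer_reviewed": True,  "retracted": True,  "text": "..."},
    {"source_doi": None,        "peer_reviewed": False, "retracted": False, "text": "..."},
]

def eligible_for_training(record: dict) -> bool:
    """Keep only records with a provenance identifier, peer review, and no
    retraction flag. A real pipeline would also re-check update notices
    over time, since retractions happen after publication."""
    return bool(record["source_doi"]) and record["peer_reviewed"] and not record["retracted"]

corpus = [r for r in records if eligible_for_training(r)]
print(len(corpus))  # 1: only the first record passes the checks
```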
These challenges explain why AI hallucinations are so common in scientific domains. When the information space is large, inconsistent, and poorly structured, models learn to approximate patterns rather than recall verified truths.
Mark’s solution is straightforward but demanding: publishers must take responsibility for the clarity, completeness, and machine readability of their content. If they do not, AI systems will misrepresent their material, and the Flywheel will accelerate misinformation instead of insight.
Zero-Click Search: A Future Publishers Must Prepare For
Mark also discussed the looming shift in user behavior. For two decades, researchers discovered content by typing queries into search engines and clicking through to publisher sites. Now, AI tools like Perplexity, ChatGPT, and Claude increasingly answer questions directly, summarizing scholarly content without requiring a click.
This “zero-click search” model is already reshaping traffic patterns. As Mark noted, BMJ recently reported that 2.8% of its traffic now comes from AI tools, and that number is expected to grow dramatically. The question for publishers is no longer “How do I optimize my content for search engines?” but “How do I make sure AI systems treat my content fairly, accurately, and with attribution?”
The answer begins with structured, machine-readable metadata, but it does not end there. It also requires:
- licensing frameworks that govern AI training and use
- machine-readable signals indicating allowable uses (see the sketch after this list)
- infrastructure for delivering content to AI through APIs or gateways
- provenance metadata that asserts the trustworthiness of the version of record
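For the second item, one coarse but widely used signal today is a robots.txt block addressed to AI crawlers. The sketch below generates such a block; the user-agent tokens are examples that publishers should verify against each vendor’s current documentation, and robots.txt is a request rather than an enforcement or licensing mechanism, so it complements rather than replaces licensing metadata.

```python
# Illustrative only: generate a robots.txt fragment that limits a set of
# AI crawlers. The user-agent tokens are examples; confirm each vendor's
# current documentation before relying on them.
AI_CRAWLERS = ["GPTBot", "Google-Extended", "CCBot"]

def robots_txt(disallowed_paths=("/fulltext/",)) -> str:
    """Build a robots.txt fragment disallowing the given paths for each crawler."""
    lines = []
    for agent in AI_CRAWLERS:
        lines.append(f"User-agent: {agent}")
        lines.extend(f"Disallow: {path}" for path in disallowed_paths)
        lines.append("")  # blank line between blocks
    return "\n".join(lines)

print(robots_txt())
```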
Publishers who fail to prepare for zero-click search may find that AI intermediaries become the de facto source of truth for their content, without the guarantees that make the scholarly record reliable.
A Vision for the Future: FAIR for Machines as a Strategic Imperative
Mark closed his presentation with a call to action. FAIR principles (Findability, Accessibility, Interoperability, and Reusability) have long been associated with open science and human readability. Now, he argued, they must be extended to machines. Machine readability is not simply a technical best practice; it is foundational in an AI-driven research ecosystem. Publishers who invest in semantic enrichment, open references, structured formats, and provenance metadata will find that their content is not only better understood by AI but more valuable to it.
In this future, publishers are not losing control; they are gaining a new kind of influence. They shape the ontologies, the data structures, and the quality standards that guide scientific AI models. They define what trustworthy information looks like. They protect the scholarly record not by walling it off, but by making it legible in a responsible, controlled way. The scholarly record needs structure, consistency, and clarity. If publishers provide it, AI can become a powerful ally. If they do not, the Flywheel may spin in dangerous directions.
– By Tony Alves