For decades, scientific communication has relied on one of the most fundamental but misunderstood structures in the entire research ecosystem: the citation. Citations bind claims to evidence, give credit to authors, allow readers to trace an argument back to its sources, and signal whether research builds upon, contradicts, or merely acknowledges previous work. They are the connective tissue of the scholarly record.
Yet in the age of generative AI, citations face a crisis of relevance, not because they are less important, but because AI systems often cannot or do not use them correctly. Tools such as ChatGPT, Perplexity, and Claude routinely present scientific narratives without citations, or with citations that are fabricated, outdated, or misleading. In an era where many users increasingly treat AI as their first point of contact with scientific information, this breakdown threatens the essential link between knowledge and evidence.
In the Protecting Publisher Content in the Age of AI webinar, Wiley’s Pascal Hetzscholdt warned that AI must use trustworthy content, and Digital Science’s Mark Hahnel argued that content must be machine-readable. Josh Nicholson, Co-founder and Chief Strategy Officer at Research Solutions and co-founder of Scite, delivered the next piece of the puzzle: AI must be grounded in citations if it is to be trusted in scientific contexts. In his presentation, Josh laid out a compelling vision for how publishers can strengthen the relationship between AI and the scholarly record, not by fighting generative tools, but by giving them the building blocks they need to produce accurate, verifiable, evidence-based answers.
The Reproducibility Crisis: The Origin Story of Scite
To understand why citations are so central to Josh’s thinking, one must return to the early 2010s, before today’s LLMs existed. Josh was then a PhD student in cell biology and an early witness to a crisis unfolding across biomedical research: the growing realization that many published findings could not be reproduced.
In 2011 and 2012, major pharmaceutical companies like Amgen and Bayer reported that they were unable to replicate the majority of landmark cancer studies they attempted to validate. Amgen successfully reproduced only 6 out of 53 key studies; Bayer reproduced 21 out of 67. These failures cast doubt on the reliability of the scientific record and intensified calls for tools that could help the research community identify which claims were robust and which were fragile.
Around this time, a group of scholars proposed the R-factor, a metric based on whether a study’s claims had been supported or challenged by subsequent research. This idea became the intellectual seed from which Scite grew.
Josh and his colleagues realized that while traditional citation metrics count how many times a paper is cited, they ignore how it is cited. A citation may support a finding, contradict it, or simply mention it in passing, yet all three appear identical in citation counts. Scite set out to change that by using machine learning to analyze citation context at scale.
Over nearly a decade, Scite built a pipeline capable of extracting citation statements from full-text articles, classifying them as supporting, contrasting, or mentioning, and displaying them in a way that clarifies the scholarly conversation around a research claim. This work laid the foundation for the role Scite now plays in bridging the worlds of scholarly publishing and AI.
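To make the idea concrete, here is a minimal sketch of what a classified citation statement might look like in code. The field names and the toy keyword matcher are illustrative assumptions only; Scite’s actual pipeline uses machine learning models trained on large volumes of labeled citation statements, not keyword rules.

```python
from dataclasses import dataclass

# Illustrative sketch only: Scite's real system classifies citation context
# with trained ML models, not keyword matching. This just shows the shape
# of the data a citation-context pipeline produces.

@dataclass
class CitationStatement:
    citing_doi: str      # paper containing the citation sentence
    cited_doi: str       # paper being cited
    sentence: str        # the sentence in which the citation occurs
    classification: str  # "supporting", "contrasting", or "mentioning"

def classify_citation_context(sentence: str) -> str:
    """Toy stand-in for a citation-context classifier."""
    text = sentence.lower()
    if any(cue in text for cue in ("consistent with", "confirms", "replicates")):
        return "supporting"
    if any(cue in text for cue in ("in contrast to", "contradicts", "failed to replicate")):
        return "contrasting"
    return "mentioning"

sentence = "Our results are consistent with those reported by Smith et al. (2015)."
stmt = CitationStatement(
    citing_doi="10.1234/example.2021",
    cited_doi="10.5678/original.2015",
    sentence=sentence,
    classification=classify_citation_context(sentence),
)
print(stmt.classification)  # -> "supporting"
```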
LLMs Break Citations—Unless We Fix Them
When ChatGPT launched in 2022, Josh was among the first to test whether the model could cite sources correctly. The results were disastrous. ChatGPT routinely invented citations that looked plausible but did not exist. In scientific contexts, these hallucinations were not merely embarrassing but actively harmful, as they could mislead researchers, students, and journalists into believing that evidence supported claims that never appeared in the literature.
But as Josh explained in the webinar, the problem is deeper than fake citations. Even when an AI system links to real papers, it may misrepresent them. It can conflate study results, attribute findings to the wrong authors, or blend the conclusions of multiple papers into a single, unsourced narrative. These distortions are hard to detect, especially when AI outputs appear polished and authoritative.
This is why citations are indispensable to any AI system that claims to answer questions about science. Citations are not decorative footnotes; they are the mechanism through which truth is traced. And so, the question becomes: How do we give AI systems the context and access they need to cite correctly? Josh’s answer involves rethinking not only how AI accesses content, but how publishers prepare and deliver that content to AI systems.
Chunking the Literature: Why AI Needs Article Fragments, Not Entire PDFs
One of the most innovative parts of Josh’s talk centered on the idea of “chunking” scholarly articles. AI models, he argued, do not need entire PDFs. In fact, giving models full articles often leads to worse results, as LLMs struggle to parse long, unstructured documents.
Instead, AI systems need small, contextual snippets: self-contained “chunks” of text that describe specific findings, claims, or statements, each linked directly to the version of record. This mirrors the structure Scite has worked with for years: breaking papers down into the sentences where citations occur, extracting meaning, and classifying how those sentences relate to the cited claim.
Wiley has already begun implementing this strategy at scale. As Josh noted, the publisher is now delivering AI-ready chunks of its articles, even articles hosted on other publishers’ platforms, to AI companies in structured, machine-readable formats. These fragments can be used in retrieval-augmented generation (RAG), a method in which LLMs ground their answers in verified external data.
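The webinar did not specify what an “AI-ready chunk” looks like on the wire, but a plausible shape, sketched below with hypothetical field names and values, is a small passage of text paired with the metadata needed to attribute it and link back to the version of record.

```python
import json

# Hypothetical chunk structure for retrieval-augmented generation (RAG).
# Field names and values are illustrative; actual publisher feeds will differ.
chunk = {
    "chunk_id": "10.1002/example.12345#para-7",
    "doi": "10.1002/example.12345",
    "publisher": "Wiley",
    "section": "Results",
    "text": (
        "Treatment A reduced tumor volume by 40% relative to treatment B "
        "in the xenograft model (p < 0.01)."
    ),
    "version_of_record_url": "https://doi.org/10.1002/example.12345",
    "license": "ai-reuse-with-attribution",  # machine-use rights, not just reading rights
}

# A RAG system retrieves chunks like this, passes the text to the LLM as
# grounding context, and surfaces the DOI link so readers can verify the claim.
print(json.dumps(chunk, indent=2))
```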
Chunking accomplishes three critical things:
- It reduces hallucinations by anchoring AI outputs in authoritative textual evidence.
- It preserves attribution and directs users back to the version of record.
- It protects publishers from having their full content indiscriminately scraped while still enabling legitimate AI use cases.
This concept forms the backbone of a more transparent relationship between AI tools and scholarly publishers.
The Pan-Publisher Model Context Protocol (MCP): A Gateway for Responsible AI
If chunking represents the building blocks, the Pan-Publisher Model Context Protocol (MCP) represents the architecture.
Josh presented a diagram illustrating how an AI system equipped with MCP could answer research questions responsibly. It begins with a user prompt—something simple, like “What is the evidence that treatment A is more effective than treatment B in prostate cancer?”
Traditional LLMs would answer using whatever they already absorbed during training—a black box. But in this model, the AI performs several retrieval steps:
- It searches internal corporate documents (for industry users).
- It searches public websites, patents, and clinical trial registries.
- It searches trusted open-access sources like PubMed.
- And most importantly: it queries the Pan-Publisher Gateway, retrieving AI-ready content snippets directly from participating publishers.
The AI then combines these retrieved fragments into a structured, explainable answer—one that cites its sources and links back to the version of record.
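A highly simplified sketch of that flow is shown below. The function names and the stubbed-out gateway call are hypothetical; no real MCP client or publisher API is used here, and a production system would pass the retrieved chunks to an LLM rather than concatenate them directly.

```python
# Hypothetical sketch of the retrieval flow described above.
# None of these functions correspond to a real MCP client or publisher API.

def search_internal_documents(query: str) -> list[dict]:
    return []  # corporate knowledge base (industry users)

def search_public_sources(query: str) -> list[dict]:
    return []  # public websites, patents, clinical trial registries, PubMed

def query_pan_publisher_gateway(query: str) -> list[dict]:
    # Would return AI-ready chunks with DOIs and version-of-record links.
    return [{"text": "example chunk text", "doi": "10.1002/example.12345"}]

def answer_with_citations(query: str) -> str:
    evidence = (
        search_internal_documents(query)
        + search_public_sources(query)
        + query_pan_publisher_gateway(query)
    )
    # In a real system the evidence chunks would be given to an LLM as
    # grounding context; here we only show that every claim carries a source.
    citations = ", ".join(c["doi"] for c in evidence if "doi" in c)
    return f"Answer grounded in retrieved evidence. Sources: {citations}"

print(answer_with_citations(
    "What is the evidence that treatment A is more effective than treatment B in prostate cancer?"
))
```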
This gateway achieves what today’s LLMs cannot:
- verifiable provenance
- controlled reuse
- correct attribution
- transparent reasoning
- alignment with publisher rights and licensing frameworks
It also gives publishers visibility into how their content is being interrogated by AI systems, a form of analytics that traditional COUNTER reports cannot provide. Josh emphasized that this is not science fiction. It is happening now. Wiley is doing it. Research Solutions is doing it. More publishers will need to do it.
The Business Model: AI Rights, Not AI Risk
While much of the webinar focused on ethics and accuracy, Josh did not shy away from the economic dimension. For publishers, AI is not only a threat to content integrity but also a new revenue opportunity.
Today, many corporate users access research content through Research Solutions’ document delivery platform. Josh explained that these users increasingly want AI rights attached to the content they purchase—rights that allow them to summarize, analyze, or extract insights from full-text documents using internal AI tools.
To meet this need, Research Solutions has created licensing structures in which AI rights can be purchased alongside articles. These rights may apply at the:
- individual article level,
- library subscription level, or
- enterprise-wide level across large corporate R&D environments.
Publishers can also license chunks, gateways, or RAG-compatible metadata to AI developers directly, something Wiley has already done with Perplexity, Anthropic (Claude), and others. As generative AI tools become central to research workflows, publishers will need to sell not just reading rights but machine-use rights, and they will need to track not just human usage but AI engagement.
This shift is comparable to the rise of digital licensing in the 2000s. At the time, many publishers feared that digital distribution would cannibalize print revenue. Instead, it created new models (site licenses, bundled packages, perpetual access) that ultimately strengthened the industry. AI rights, Josh suggested, may follow a similar trajectory.
Smart Citations as the Bridge Between AI and Evidence
The final part of Josh’s talk concerned how Scite’s contextual citation database could be used to strengthen AI outputs. Imagine an AI-generated answer that not only cites relevant papers but shows whether those papers have been supported or challenged by subsequent studies. Imagine a model that could say:
- “This finding has been replicated in three independent studies,” or
- “This conclusion has been contested and may not be reliable,” or
- “This treatment has strong preclinical evidence but limited clinical validation.”
This is possible only when AI systems have access to structured citation context at scale, data that Scite has spent years building and that publishers can help disseminate through partnerships.
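As a rough illustration, the kind of evidence summary described above could be produced by aggregating classified citation statements for a given paper. The numbers, thresholds, and tally structure below are made up for the example and do not reflect Scite’s API or scoring rules.

```python
from collections import Counter

# Hypothetical classified citation statements for one paper; in practice
# these would come from a citation-context database such as Scite's.
citation_classifications = [
    "supporting", "supporting", "supporting",
    "contrasting",
    "mentioning", "mentioning",
]

tally = Counter(citation_classifications)

def evidence_summary(tally: Counter) -> str:
    """Turn a tally of citation classifications into a plain-language summary."""
    if tally["contrasting"] > tally["supporting"]:
        return "This conclusion has been contested and may not be reliable."
    if tally["supporting"] >= 3:
        return (
            f"This finding has been supported by {tally['supporting']} "
            "subsequent studies."
        )
    return "This finding has limited independent support so far."

print(evidence_summary(tally))
# -> "This finding has been supported by 3 subsequent studies."
```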
Josh argued that contextual citations might become the foundation of a more responsible AI ecosystem. Just as Google’s PageRank algorithm transformed how the web was navigated by using citations (links) to evaluate authority, AI systems that use citation context could transform how scientific reasoning is mediated digitally. In a world where hallucinations are common and trust is fragile, this kind of infrastructure is not just useful; it is essential.
Why Publishers Must Build the Plumbing, Not Just Guard the Walls
Throughout his presentation, Josh returned to a theme echoed by the other speakers: publishers must move from a defensive posture (“how do we block AI?”) to a constructive one (“how do we shape AI’s use of our content?”).
Blocking may feel safe, but it is only partially effective. AI developers have already ingested vast amounts of publisher content—sometimes legally, sometimes not. Blocking also prevents legitimate uses of AI that could strengthen discovery, improve research rigor, and enhance the value of the version of record.
The more strategic path, Josh argued, is to build the infrastructure (gateways, chunking pipelines, citation networks, licensing frameworks) that guides AI tools toward responsible use. Publishers cannot stop AI from interacting with science. But they can determine how it interacts and under what terms.
This requires technical investment, legal clarity, industry coordination, and a willingness to rethink long-standing assumptions about content distribution. But the alternative is to allow opaque models to shape the scientific narrative without oversight, accountability, or attribution.
Conclusion: AI Is Here—Now Let’s Teach It to Respect the Scientific Record
Josh Nicholson’s presentation offered a practical, implementable blueprint for aligning AI tools with the values of evidence-based science.
His message can be summarized simply:
- AI cannot be trusted with science unless it is grounded in citations.
- Citations cannot guide AI unless they are structured, contextual, and machine-readable.
- And machine-readable content cannot reach AI responsibly without publisher-built gateways and licensing models.
Together, these ideas form a vision of scholarly publishing in which publishers are not victims of AI disruption but architects of a more accurate, transparent, and accountable AI ecosystem. They are stewards of the version of record in an age when the version of record is no longer what most readers see directly.
The future of scientific communication will be shaped by how well we integrate AI into the research lifecycle. If publishers embrace their role as providers of high-quality, machine-ready content, and as partners in building responsible AI pipelines, then AI can become a powerful ally, not a threat, to the integrity of science.
With this final post, the four-part blog series has explored the problem, the ethical framework, the structural requirements, and the technical solutions needed to protect publisher content in the age of AI.
– By Tony Alves
Read the previous part