Online Indexing of Scholarly Publications: Part 1, What We All Have Accomplished

“Let no one tell you that ‘Scholarly communication hasn’t changed’”

HighWire conducted its first extensive user studies in 2002. Since then, several things have completely altered the workflow of the researcher:

  • the full text of most current journal articles is centrally indexed;
  • back archives of a significant fraction of the full-text research literature are online, and centrally indexed as well.

“Centrally indexed” was a watershed point. In 2002, Google’s web search began indexing the full text of journal literature, including the portion behind paywalls, starting with HighWire and its publishing partners. HighWire saw the use of journal article content go up by one and in some cases two orders of magnitude following this! And then, in 2004, Google Scholar was born out of the recognition that the workflow and goals of a researcher are not best supported by a general-purpose internet search engine, no matter how good its ranking algorithms are.

Now, a decade after our first user studies, users report to us that “Finding is easy; reading is hard.”

This transformation in discovery – and its consequences – was the topic of the opening keynote at the September 2015 ALPSP Annual Meeting. Anurag Acharya, co-founder of Google Scholar, spoke and answered questions for an hour. That’s forever in our sound-bite culture, but the talk was both inspirational about what we had collectively accomplished and exciting and challenging about the directions ahead. Anurag’s talk and the Q&A are online as a video and as audio in parts one and two.

This post is in two parts: the present Part One covers Anurag’s presentation of what we have accomplished. Part Two, to be posted on Monday, October 12, covers the consequences. Anurag has agreed to address questions that readers put in the comments.

Here is my take on the key topics from Anurag’s talk.

Search is the new browse

Prior to the introduction of web-based search engines in the 1990s, researchers would select their reading by looking at issue tables of contents – an article list selected by the journal editor. Reference lists in an article were also browsed; this was an article list selected by the article’s authors. A few fields featured primitive indexing via publications like Current Contents.

In the mid-1990s, these tables of contents began to be emailed, which saved many trips to libraries. Even today, “eTOC” reading is a common part of the researcher workflow. The eTOC allows researchers to skim perhaps 8-10 journals regularly, where earlier we heard anecdotally that researchers were able to cover only 3-4 journals regularly in print.

By the mid-2000s, the editor- or author-assembled list of articles in TOCs and references was replaced by a list assembled dynamically in response to a search engine query that suited the individual’s requirement at that time. Search became the new browse.

In the mid-1990s, the scope of what you could cover in your current-awareness was essentially limited to what you could scan and recall from what your library, and your personal or departmental/lab subscriptions, made available. But ten years later, these limits were gone. Now you browse relevance-ranked (not just date-ranked) search result lists.  It is possible (though Anurag did not speak to this) that the “Just in Case” scanning of journal tables of contents to stay informed on your subject is now being replaced generationally by “Just in Time” scanning of search results.

High-quality relevance ranking that understands ‘scholarly filtering’ was a huge step forward. But relevance ranking is not the whole story.

Full text indexing of current articles plus significant backfiles joined with relevance ranking to change how we searched and what we did with the results.

Sometimes it takes a combination of factors to change a workflow. (For example, Uber would not have made much of a difference in urban transportation without mobile phones to run on.) Broad search engines of the 1990s indexed only abstracts for the most part; full text indexing allowed searches to find details such as methods, conclusions, specific assertions (‘x catalyzes y in the presence of z’), and drug interactions. As Anurag said, “full text indexing allows all parts of articles to rise.”

Huge backfiles of a significant number of journals went online in the early/mid 2000s, in part because their discovery became possible. This, combined with scholarly relevance ranking, effectively allowed historical portions of the research literature to rise from the previous bias of most-recent-first ranking.

One wonders whether this combination of backfiles and full text indexing would have made a difference in some fatal situations in clinical trials in 2001. We don’t often think of our work as involving life and death, but perhaps it does.

“Articles stand on their own merit”

The ‘democratizing’ effect goes beyond full text allowing articles to ‘rise above their abstracts’ and back literature to rise above current articles. This effect also enables “articles to stand on their own merit” in a search result list, not primarily on the merits of the venue (journal) in which they appear – the “distribution pyramid has flattened,” as Anurag said.

“Bring all researchers to the frontier”

A further ‘democratizing’ effect of the freely available and comprehensive scholarly-content search engine is that it “helps bring all researchers to the frontier.” That is, scholars beyond the world’s premier research institutions can now discover what they should read.

“So much more you can actually read”

Of course, finding isn’t reading (even Cliff’s Notes don’t claim that, much less Google’s snippets). There is “so much more you can actually read,” thanks to free back archives (which about 300 HighWire-hosted journals provide), preprints, repositories, open access journals and open access articles within subscription journals, and ‘big deal’ licenses. And where there are multiple copies of a work online, Scholar’s “subscriber links” and OpenURL “library links” can help an institution-based reader find the available copy.

Anurag concluded his historical view by saying “Let no one tell you that ‘scholarly communication hasn’t changed’”.

In Part Two of this post, I will cover Anurag’s view of “What Happens When Finding Everything is So Easy?”

As noted above, Anurag has agreed to address questions that readers put in the comments.
