This year sees HighWire’s 25th anniversary, a huge milestone in our history. Founded by Stanford University during the early days of the web, HighWire pioneered the online revolution in scholarly publishing.
In this blog post, Anurag Acharya – co-creator of Google Scholar – looks back at how HighWire collaborated with Google to drive new discoverability in academia.
Anurag: In the early 2000s, my role at Google was running web indexing: the system that crawls the web, making pages and content discoverable and accessible through search. Nowadays, there’s an assumption that looking for something via Google searches everything, but that wasn’t the case in the early days. Part of my role was to expand the index by reaching out to many different types of organizations – government, business, publishers – to make sure their web sites were included in the index.
A key group among these was scholarly publishers hosting journals and conferences. Having grown up on a university campus, I had been surrounded by scholarly articles, and I wanted to make sure that they were as easy to find as everything else.
As a part of this, I reached out to HighWire to explore the possibility of indexing the hosted journals. I remember our first call in the Fall of 2002 with John Sack, Todd McGee and several others. A few quick calls, a couple of meetings in person and we were off.
Unlike all the other groups I was reaching out to, HighWire was just a few miles down the road at Stanford. This proximity was a key enabler of our collaboration over the years. Together, we have worked on many indexing-related features for scholarly articles that have since been widely adopted.
HighWire Metatag Set
One of the challenges in indexing journal articles, as distinct from web pages, is the detailed bibliographic metadata associated with scholarly articles. Citations, sources and journals are key elements of the scholarly ecosystem which need to be taken into account when analyzing and indexing articles. Getting bibliographic metadata right has always been a challenge in scholarly publishing. However, it was even more of a problem for Google Scholar, which was trying to build a single search across all disciplines. Different fields and journals had different conventions, and there was a huge diversity in format, layout, and structure of texts.
Traditionally, libraries and the scholarly communication world had kept ‘metadata’, the bibliographic information about articles, separate from the ‘data’, the articles themselves. This approach worked well enough in the pre-web days when the ‘metadata’ was online or in card catalogs and the ‘data’ was on the shelves. Google Scholar’s goal, however, was to index the fulltext of articles. Keeping the bibliographic information separate from the fulltext of the article meant the indexing system would often run into mismatches between the two.
To solve this problem, John, Todd and I got together with Alex Verstak, the co-creator of Google Scholar and my co-conspirator in almost everything I have done at Google. The four of us came up with a way to embed bibliographic metadata for scholarly articles as HTML metatags. This approach is now used widely and the set of metatags we came up with is today known as ‘The HighWire tag set’.
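As a rough illustration of the idea, the sketch below renders bibliographic metadata as HTML metatags that sit in the same page as the article itself. The `citation_*` tag names follow the convention documented in Google Scholar's inclusion guidelines; the helper name `highwire_meta_tags`, the dictionary layout and the example article are hypothetical, and this is not HighWire's actual implementation.

```python
from html import escape

# Hypothetical example record; title, authors, journal and URL are invented.
EXAMPLE_ARTICLE = {
    "title": "Carry & Borrow: Chip Designs for Digital Arithmetic",
    "authors": ["A. Author", "B. Writer"],
    "publication_date": "2002/10/15",
    "journal": "Journal of Hypothetical Results",
    "pdf_url": "https://example.org/articles/1.pdf",
}


def highwire_meta_tags(article: dict) -> str:
    """Return citation_* meta tags for an article, one tag per line."""
    lines = []

    def add(name: str, value: str) -> None:
        # Escape the value so quotes or ampersands cannot break the attribute.
        lines.append(f'<meta name="{name}" content="{escape(value)}">')

    add("citation_title", article["title"])
    # One tag per author, in the order the authors appear on the paper.
    for author in article["authors"]:
        add("citation_author", author)
    add("citation_publication_date", article["publication_date"])
    add("citation_journal_title", article["journal"])
    if "pdf_url" in article:
        add("citation_pdf_url", article["pdf_url"])
    return "\n".join(lines)


if __name__ == "__main__":
    print(highwire_meta_tags(EXAMPLE_ARTICLE))
```

Because the tags travel inside the article page, an indexer that fetches the full text gets the matching bibliographic record in the same request, which avoids the mismatches described above.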
Indexing digitized backfiles
HighWire worked with many of the hosted publishers to digitize their backfiles. Scanned papers are very large, though, and fetching them for indexing taxed the network so much that there would sometimes not be enough bandwidth left to crawl recently published articles. This is where Google and HighWire’s geographical proximity really came in handy. We augmented the new and magical powers of the Internet with the familiar and reliable powers of the SneakerNet! Todd would periodically load journal backfiles onto a batch of disk drives. Alex or I would swing by HighWire on our way to Google in the morning, pick up the disk drives and feed them into indexing once we got to work. Once indexing was complete, we would take the drives back to Todd and pick up new ones that he had filled in the meantime. My old Honda Civic had long helped me out with good mileage. During this time, it also helped me out with good bandwidth.
Seamless off-campus access
Many years later, John came to speak to the Google Scholar team as part of the Scholar Talk Series. The focus of his talk was ‘Friction in the Workflow’. John described structural issues and limitations that created friction in the research workflow and slowed down researchers. One of these was the number of hoops researchers have to jump through to access the collections their institutions have subscribed to for them.
After John’s presentation, Alex and I brainstormed for a while and came up with an idea that would extend Google Scholar’s Subscriber Links system to enable seamless off-campus subscribed access. The next time we met John, we described the idea to him and CASA (Campus Activated Subscriber Access) was born. CASA builds on the Subscriber Links program which provides direct links in the search interface to subscribed collections for on-campus users. With CASA, a researcher can start a literature survey on campus and resume where she left off once she is home, or traveling, with no hoops to jump through. Her subscribed collections are highlighted in Google Scholar searches, and she is able to access articles in exactly the same way as on campus.
Today, CASA has been widely adopted. What was initially created to fix an inconvenience, a friction, has come to play a critical part during the COVID-19 pandemic, enabling researchers worldwide to continue working from home during shelter-in-place guidelines.
Breaking down barriers: free access to developing economies
Researchers from the developing world have long been hampered by limited access to the published literature. Growing up on campus in India, I saw this all around me. As an undergrad, I myself ran into blind alleys while trying to track down papers about the design of chips for digital arithmetic.
Over the years, several initiatives, including HINARI, AGORA, OARE and others, have tried to help researchers in developing countries get access to scholarly literature. However, the administrative complexity of most of these approaches meant that only a few researchers and students were able to take advantage of them.
HighWire spearheaded a different approach by allowing publishers to select which countries would be granted free access. Using IP mapping, the software would then automatically detect the country a user was connecting from and grant access accordingly, with no administrative or manual barriers for the end-user.
This was, and to this day remains, the most effective way of making research available to developing countries. At Google Scholar, we worked with HighWire to ensure that researchers from a listed country had access to the articles that was just as seamless as that of their peers at well-resourced institutions.
As somebody who comes from a developing country, and who grew up seeing limitations in access to information, this initiative was particularly close to my heart.
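The IP-based check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the address ranges are reserved documentation ranges rather than real country allocations, and the country codes, the publisher's list and the function names are all hypothetical. A production system would rely on a maintained IP geolocation database.

```python
import ipaddress
from typing import Optional

# Hypothetical IP-to-country table built from reserved documentation ranges.
# These are NOT real country allocations; they stand in for a geolocation database.
COUNTRY_RANGES = {
    "192.0.2.0/24": "IN",
    "198.51.100.0/24": "KE",
    "203.0.113.0/24": "US",
}
NETWORKS = [
    (ipaddress.ip_network(cidr), country)
    for cidr, country in COUNTRY_RANGES.items()
]

# Hypothetical list of countries a publisher has chosen for free access.
FREE_ACCESS_COUNTRIES = {"IN", "KE"}


def country_for_ip(ip: str) -> Optional[str]:
    """Return the country code for an address, or None if it is unknown."""
    address = ipaddress.ip_address(ip)
    for network, country in NETWORKS:
        if address in network:
            return country
    return None


def has_free_access(ip: str, free_countries: set) -> bool:
    """Grant access only when the address maps to a listed country."""
    country = country_for_ip(ip)
    return country is not None and country in free_countries


if __name__ == "__main__":
    for ip in ("192.0.2.17", "203.0.113.5", "10.0.0.1"):
        print(ip, has_free_access(ip, FREE_ACCESS_COUNTRIES))
```

The reader never fills in a form or requests credentials: the decision is made entirely from the connecting address and the publisher's chosen list, which is what removes the administrative barrier.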
HighWire continues to work closely with Google to drive discoverability of academic content, reduce barriers and friction within the workflow, and create more seamless experiences for researchers.