What I Learned at NISO Plus 2023, Part 1: Metadata collection, curation and usage


Our SVP of Product Management, Tony Alves, shares his experience and takeaways from NISO Plus.

In this three-part series I discuss what I learned at NISO Plus, an online conference hosted by the National Information Standards Organization that features presenters from all over the world and from every part of the scholarly communications ecosystem. The three-day conference runs multiple concurrent sessions spanning time zones and continents. I selected sessions based on three main themes: Metadata, Persistent Identifiers, and Measuring Open Access Usage. This first installment looks at three sessions on metadata collection, curation and usage.

COLLECTING METADATA

The first session, entitled “Telling a Story with Metadata”, featured two presenters who advocate for good curation of metadata early in the content collection process. Julie Zhu, who manages discovery partners such as search engines, link resolvers, and proxy services at IEEE, reminds us that metadata captures the key components of an article: author-related information, publication and article information, access information, funding, and more. This metadata flows through the entire ecosystem, through pipelines that feed all of the systems within the scholarly publishing infrastructure. It is important to keep this data flowing, and Zhu points out that content providers, discovery services and libraries are “data plumbers” – building and fixing the metadata plumbing to ensure that this information moves frictionlessly through the system.
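To make those key components concrete, here is a minimal sketch of what a single article's metadata record might look like in Python. The field names are illustrative assumptions on my part; they do not reflect IEEE's actual schema or any specific standard.

```python
from dataclasses import dataclass, field

# Illustrative sketch only: hypothetical field names, not IEEE's actual schema.
@dataclass
class ArticleMetadata:
    doi: str                              # persistent identifier for the article
    title: str
    authors: list                         # e.g. [{"name": ..., "orcid": ..., "affiliation": ...}]
    journal: str
    publication_date: str                 # ISO 8601, e.g. "2023-02-14"
    access: str                           # e.g. "open", "subscription", "hybrid"
    funders: list = field(default_factory=list)   # e.g. [{"name": ..., "award": ...}]
    keywords: list = field(default_factory=list)

record = ArticleMetadata(
    doi="10.1234/example.5678",
    title="An Example Article About Metadata",
    authors=[{"name": "A. Author",
              "orcid": "0000-0000-0000-0000",
              "affiliation": "Example University"}],
    journal="Journal of Examples",
    publication_date="2023-02-14",
    access="open",
    funders=[{"name": "Example Funder", "award": "EX-1234"}],
    keywords=["metadata", "discovery", "plumbing"],
)
print(record.doi, record.title)
```

Every downstream system in the pipeline reads some subset of a record like this, which is why a gap or error introduced early tends to propagate.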

Authors are the first to provide metadata, generally during the submission process. IEEE has created best practices for authors to follow, such as “think like a search engine”: use the terms most relevant to the article, be succinct, use synonyms to broaden searchability, and avoid special characters and abbreviations. The burden does not fall on the author alone; Zhu reminds us that just because one team creates metadata correctly doesn’t mean that teams downstream will maintain it correctly. Monitoring and encouraging teams to communicate about proper metadata hygiene is important for ensuring good, clean metadata throughout the scholarly communication pipeline.

The second presenter, Jenny Evans, who is responsible for scholarly communications, research integrity policy and information systems at the University of Westminster, talked about non-traditional and non-text scholarly outputs. She discussed a concept called “Practice Research”, a methodology in which knowledge is gained by doing something rather than by reading about it or by asking people what they know about the subject through interviews and other means. Capturing metadata for this sort of scholarly output is difficult, especially in machine-readable form, because the metadata might first appear verbally or visually and is therefore open to interpretation when it gets recorded into a system or database. Also, for non-text, live-event capture, the process is as significant as the final product, meaning the metadata has to reflect more than just facts about the content. For example, it is often time-based and location-oriented, and there is usually a range of contributors who do not fit neatly into scholarly categories.

Evans describes a community effort to develop a database that can adequately collect Practice Research metadata. Working with the community, they surveyed those conducting Practice Research to understand their needs, rather than taking a technology-first approach. They attempted to use many of the current standards created by organizations like ORCID, DataCite, RAiD and Crossref. She emphasized that they don’t want to develop new standards; rather, they have provided suggestions for expanding existing standards to better fit the Practice Research model. She also points out that repositories need to be able to capture process, not just serve as retrospective archives.

MULTILINGUAL METADATA

The second session on metadata that I want to highlight was called “Multilanguage Metadata”. This session had four presenters, all of whom work within organizations that deal with multilingual information gathering, processing and/or dissemination. Scholarly communication is a global endeavor, and though much of it is conducted in English, there are many efforts seeking to ensure that metadata, and the systems that support metadata collection and distribution, can handle non-English and non-Romanized content.

The first presenter, Juan Pablo Alperin of the Public Knowledge Project, discussed how their submission system, Open Journal Systems (OJS), encourages multilingual metadata and why it matters. OJS is used across languages and cultures, and unlike other systems, it is not just the user interface that is multilingual; the database fields also accommodate multiple languages. Alperin emphasizes that the language these journals publish in is important, and that it is essential that the article metadata sent out to the rest of the publishing ecosystem is not restricted to English-language outputs only.
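As a rough illustration of that idea, the sketch below keys each metadata field by locale so that a non-English value is never silently dropped on export. This is a simplified, hypothetical structure of my own, not OJS's actual database schema.

```python
# Illustrative, hypothetical structure: metadata fields keyed by locale,
# loosely inspired by the idea of multilingual database fields (not OJS's schema).
article = {
    "primary_locale": "es",                # language the article is actually published in
    "title": {
        "en": "Soil health in the Andes",
        "es": "La salud del suelo en los Andes",
    },
    "abstract": {
        "en": "A short English abstract.",
        "es": "Un breve resumen en español.",
    },
}

def export_field(record: dict, name: str, preferred: str = "en") -> tuple:
    """Return (locale, value) for a field, falling back to the primary locale
    so exported metadata is never silently reduced to English only."""
    values = record[name]
    if preferred in values:
        return preferred, values[preferred]
    locale = record["primary_locale"]
    return locale, values[locale]

print(export_field(article, "title", preferred="pt"))   # falls back to ("es", ...)
```

The design point is the fallback: an exporter that only looks for an English value would erase exactly the multilingual metadata Alperin argues the ecosystem needs.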

Alperin also described a project that PKP is working on called Metadata for Everyone, which examines the metadata quality problems that exist around language and culture. Some findings so far include: naming conventions are cultural; non-Roman characters are hard to capture; metadata is sometimes in a different language from the article; the language of the article is often misstated; multiple languages are captured in the same field; and translations are often missing. Alperin cited some interesting statistics related to multilingual publishing collected from their user base: smaller publishers publish in more languages than larger publishers do; the language of the article is left out of the metadata 20% of the time; and 88% of multilingual records have English as one of their languages.
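One of these quality problems, the language of the article being misstated or missing, lends itself to automated checking. Below is a small, hypothetical sketch of such a check, assuming the third-party langdetect package; it is my own illustration, not part of PKP's Metadata for Everyone tooling.

```python
# Hypothetical quality check, not part of PKP's tooling: flag records whose
# declared metadata language does not match the language detected in the abstract.
# Assumes the third-party `langdetect` package (pip install langdetect).
from langdetect import detect
from langdetect.lang_detect_exception import LangDetectException

def flag_language_mismatches(records):
    flagged = []
    for rec in records:
        declared = rec.get("language")            # e.g. "en", "pt", or missing
        abstract = rec.get("abstract", "")
        if not declared:
            flagged.append((rec.get("doi"), "language field missing"))
            continue
        try:
            detected = detect(abstract)
        except LangDetectException:
            continue                              # abstract empty or too short to judge
        if detected != declared:
            flagged.append((rec.get("doi"), f"declared {declared}, detected {detected}"))
    return flagged

sample = [{"doi": "10.1234/x1", "language": "en",
           "abstract": "Un resumen escrito en español."}]
print(flag_language_mismatches(sample))
```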

The second presenter was Hideaki Takeda, a professor at the National Institute of Informatics in Japan, who addressed multilingual issues in Japanese scholarly publishing. Takeda pointed out that there is a mix of domestic and international scholarly activity in Japan, and he outlined the current state of multilingual metadata and how it differs from discipline to discipline. For example, in the natural sciences English is the major scholarly language; in engineering, medicine and agriculture there is a mix of English and Japanese; and in social science and literature it is mostly Japanese. Metadata may be in two or more languages: Japanese publications often carry English metadata as well as Japanese metadata, which is important for ensuring that Japanese literature and science are included in international scholarly activity.

Takeda credited Crossref with driving the expansion of multilingual accommodation in scholarly communications, noting that most systems can now handle multiple languages in their metadata fields. However, searching and discoverability can still be a problem, since many discovery services accommodate only a single language. His organization offers language mapping to systems that don’t accommodate multiple languages. Takeda also noted that automatic translation is an emerging issue with drawbacks to consider: the accuracy of the translation; whether the translation has been authorized; whether it is a formal or informal translation; and whether the author expected the translation to take place and whether it raises ethical issues.

Third up was Jinseop Shin from the Korea Institute of Science and Technology Information (KISTI), who talked about KoreanScience, a platform that distributes Korean research worldwide. KoreanScience displays article metadata in both English and Korean, although the articles themselves are generally written in either English or Korean. He also talked about the Korean DOI Center, which manages and assigns DOIs in South Korea; when DOIs are assigned, it encourages the submitter to provide both English and Korean metadata.

Shin described a particularly interesting project designed to provide automated translation of research articles to make research more accessible. KISTI employed 2,000 people during COVID-19 to convert 500,000 articles to HTML, then used those articles to train an AI model that automatically converts PDFs to HTML. This tool allows people to convert PDFs to HTML and then translate the articles into their preferred language using Google Translate.

The final speaker was Farrah Lehman Den, an index editor and instructional technology producer at the Modern Language Association in New York, whose talk was titled “Working with the Hebrew alphabet, Metadata and Bidirectionality”. Den started by discussing how difficult it is to search databases using transliterated Hebrew because of differences in transliteration rules and customs. Different transliteration authorities focus on different types of Hebrew, for example modern versus Biblical. As a result, many scholars prefer to search using Hebrew characters rather than Romanized characters.

Den also discussed the problems that bidirectionality can cause when searching databases that include Hebrew characters. Because Hebrew and Arabic are written and read from right to left, many systems have trouble accommodating research created and published in these languages. Systems also struggle when words, phrases and characters from these languages appear inside Romanized metadata, because a mix of left-to-right and right-to-left directionality can cause complications. For example, an article title that mixes English and Hebrew must preserve the correct directionality of each part in order to make sense. Google Translate does not recognize Hebrew and Arabic that has been mistakenly rendered backward, and will output gibberish.
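Unicode provides directional isolate characters for exactly this mixed-direction situation. The sketch below shows one general way a system might wrap an embedded Hebrew run inside a Romanized title so its display order is preserved; it illustrates the standard Unicode mechanism, not any particular database's implementation.

```python
# Illustration of Unicode bidirectional isolates for mixed-direction metadata.
# This is the general Unicode mechanism, not any specific system's implementation.
RLI = "\u2067"   # RIGHT-TO-LEFT ISOLATE: start an isolated right-to-left run
PDI = "\u2069"   # POP DIRECTIONAL ISOLATE: end the isolated run

def isolate_rtl(run: str) -> str:
    """Wrap a Hebrew or Arabic run so its direction does not leak into the
    surrounding left-to-right title when it is displayed."""
    return f"{RLI}{run}{PDI}"

# A mixed English/Hebrew title; the Hebrew word is "shalom" (peace).
title = f"Reading {isolate_rtl('שלום')} in context"
print(title)
```

Storing the title with explicit isolates means a rendering engine does not have to guess where the right-to-left run begins and ends, which is where the garbling Den describes usually creeps in.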

SETTING NATIONAL METADATA POLICIES

The third session I want to highlight is entitled “Aligning national priorities when it comes to open science metadata requirements”. Key to a consistent national policy on open science is the setting of standards for the easy findability, accessibility, interoperability and reusability of content and research data, and consistent metadata rules are a very important part of building a research infrastructure that makes this possible. The US federal government, through the White House Office of Science and Technology Policy (OSTP), has put forward a policy mandating many open science practices for all federally funded research. The mandate does not set specific standards or policies; rather, it leaves individual agencies to set up their own. This can become problematic, since the numerous government funding agencies could come up with disparate rules, and the picture is further complicated by the many industry and philanthropic organizations that also provide funding, as well as the hundreds, if not thousands, of other entities like publishers, institutions, repositories and systems, all of which have already-established processes and standards.

The first presenter, Dan Valen, Director of Strategic Partnerships at Figshare, discussed the state of open data in the time of new funder policies. Figshare has been running a survey called State of Open Data for seven years, looking at trends in the open data space. With over 25,000 respondents from over 192 countries, it provides a sustained look at the state of open data over time. Valen focused on how researchers’ opinions align with the recent OSTP mandate, citing researchers’ concerns around data misuse, the cost of submitting data, and not getting credit for the data. Despite these concerns, approximately 70% of respondents said that funders should make data sharing a requirement for getting grants and should withhold funding if data is not shared. The 2022 State of Open Data survey shows that 72% of researchers would rely on internal resources to help manage and make data openly available; four out of five are in favor of making research data openly available as a common practice; and 75% of researchers say they receive too little credit for sharing their research.

Although sentiment toward open science and open data is positive, researchers still tend to shy away from actually practicing it, because managing and maintaining research data is complicated and potentially expensive. To show the benefits of data management plans, Valen cited a paper published in PLOS which showed that papers linked to their supporting data saw a 25% increase in citations. He also noted that as of 2022, all National Institutes of Health-supported research must have a data management plan, and the costs of data management can be included in the grant request. The OSTP memo requires all federal agencies to issue a plan for making the data and research they fund openly available. Including data, and not just the research reports, in the mandate shows that data is an important part of the research record.

The second presenter, Nokuthula Mchunu, Deputy Director for the African Open Science Platform, talked about a continent-wide implementation of open science practices in Africa and the development of a federated network of computing facilities to support African researchers. The purpose of these facilities is to provide access to data-intensive science and cutting-edge resources, and to create a network of education and skills that ensures dialogue between scientists, thus promoting open science and open access. For example, the Open Science Platform and the computing facilities have helped create a regional Space Science and Technology Strategy, connecting space science researchers and allowing them to share their research. Similarly, the SADC Cyberinfrastructure Weather and Climate network now allows weather and climate science to be shared, resulting in the development of early warning systems as well as shared observations, modelling and dissemination of information.

The third presenter, Jo Havemann, an open science consultant, continued the focus on Africa by discussing AfricArXiv, an African-centric preprint server, and described the effort to refine Africa-specific metadata schemas for continent-wide, inclusive discoverability. Havemann pointed out that a regionally hosted repository system like AfricArXiv, built on existing digital scholarly infrastructure, can address problems such as the low visibility of African scholarly output, restricted access to funding, and the underrepresentation of African scholars in research networks. The infrastructure they are using includes open-science systems like Zenodo, ScienceOpen, Qeios, Figshare, PubPub and the Open Science Foundation, and they are also building infrastructure with the help of local institutions and universities. Each of the 54 African countries is getting its own profile to display research about and from that country stored in its institutional repositories, making the research output of those countries more discoverable and encouraging collaboration across the continent.

In my next installment of “What I Learned at NISO Plus 2023” I will summarize what I learned from a collection of sessions on Persistent Identifiers (PIDs). Those sessions include “One identifier to rule them all? Or not?”; “National PID strategies and what they mean for the NISO community”; and “PID Innovations and Developments in Scholarly Infrastructure”.

By Tony Alves
