Improving Information Retrieval: The Arctic Data Center Unveils New Semantic Search Product

Cross-posted from the ARCUS website

Improving Search Across Concepts

The Arctic Data Center is a repository for all National Science Foundation-funded Arctic research data. The wide range of Arctic research communities represented in the repository often leads to a lack of cohesion in semantics: how terminology is used and defined. To address this challenge, the Arctic Data Center released a refined semantic search interface in September 2019 as part of its data discovery platform.

“In ecology, as well as other environmental and social sciences, there is little standardization in how variables are named, the methods used to gather them, and indeed, clarity about what was actually measured. So that is why there is a semantic challenge–to better understand the contents and implications of the data.” – Mark Schildhauer, Co-Investigator at the Arctic Data Center

Archiving data for different disciplines often entails managing diverse formats and descriptions of data. The new approach of adding semantic annotations to data files offers a way to standardize data descriptions by attaching terms from controlled vocabularies, which provide definitions of concepts and show the relationships between different terms. So why is it important to archive data, and to be able to search through it, in the first place?

“The primary reason to archive data is so that other scientists can use that data – (they) can get a better understanding historically, or synthesize them with other data that are complementary, and gain new scientific insights.” – Mark Schildhauer

As Matt Jones, a Principal Investigator at the Arctic Data Center, puts it, “We are here to build on the work that came before us, so that we can ‘stand on the shoulders of giants.’”

We search so that we can discover. Let’s dig into what is happening behind the scenes in the search browser:

When a user enters a term (or any string of characters) in the general search box, the search engine scours the metadata (the documentation that describes the data) for matches to that character string. This approach generally returns results that are neither comprehensive nor “semantically relevant.” In contrast, the improved annotation field in the left-hand panel of search options lets users search across concepts, based on defined terms from controlled vocabularies, rather than only the character strings found in the metadata, which increases the number of relevant results (Figure 1).

During the initial stages of building this semantically based search improvement, the semantics team at the Arctic Data Center has focused on constructing a controlled vocabulary and annotating the datasets related to carbon measurements. For example, if you search for the character string “carbon dioxide flux” in the general search box, not all relevant results are shown, because vocabulary conventions vary across disciplines: only datasets containing the exact words “carbon dioxide flux” are returned (see Figure 2). However, if you search for the concept of “carbon dioxide flux” under the annotation search feature instead, additional data packages show up, such as the one highlighted in red in Figure 3. Notice that the string “carbon dioxide flux” does not appear anywhere in that package’s metadata.

(Note that if you interact with the site and explore the results of Figure 2, the dataset highlighted in red in Figure 3 does not appear in the typical search for “carbon dioxide flux.”)
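The difference between the two kinds of search can be sketched in a few lines of Python. This is only an illustration, not the Arctic Data Center's implementation: the `Dataset` class, the `text_search` and `annotation_search` functions, and the three sample records are all hypothetical and invented for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A hypothetical data package: free-text metadata plus concept annotations."""
    title: str
    metadata: str
    annotations: set[str] = field(default_factory=set)


# Invented sample records, not real Arctic Data Center datasets.
DATASETS = [
    Dataset(
        "Tundra flux towers",
        "Eddy covariance measurements of carbon dioxide flux at a tundra site.",
        {"carbon dioxide flux"},
    ),
    Dataset(
        "Lake gas exchange",
        "Diffusive CO2 exchange between a lake surface and the atmosphere.",
        {"carbon dioxide flux"},
    ),
    Dataset("Snow depth survey", "Snow depth along repeated transects.", {"snow depth"}),
]


def text_search(query: str, datasets: list[Dataset]) -> list[str]:
    """General search box: match the raw character string against the metadata."""
    needle = query.lower()
    return [d.title for d in datasets if needle in d.metadata.lower()]


def annotation_search(concept: str, datasets: list[Dataset]) -> list[str]:
    """Annotation search: match the concept attached to the dataset."""
    return [d.title for d in datasets if concept in d.annotations]


print(text_search("carbon dioxide flux", DATASETS))
# ['Tundra flux towers']
print(annotation_search("carbon dioxide flux", DATASETS))
# ['Tundra flux towers', 'Lake gas exchange']
```

In this toy example the second record never uses the words “carbon dioxide flux” in its metadata, so only the annotation search finds it.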

Why do the results differ?

Using the annotation search feature to search for “carbon dioxide flux” expands the search to include more specific kinds of carbon dioxide flux, such as “carbon dioxide diffusion flux” and “stomatal conductance”, as illustrated in Figure 4 below.

Explore Further with Semantic Search Features

There are two different components to the semantic annotation search feature: (1) semantic annotation browsing, where users can navigate through the term hierarchy and select a term to search (see Figure 4); and (2) the semantic annotation search box, where users type in a search term and then select a term of interest (see Figure 5).

The improved annotation interface displays synonyms of a term as well as its subclasses, and both are automatically searched and included in the results. For example, if “carbon dioxide” is searched, the results also include data annotated with the synonymous term “CO2.” With respect to subclasses, if a user searches for “carbon flux”, the interface also displays datasets that are tagged with “carbon dioxide flux”, because the latter is a subclass of the former.
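One way to picture the synonym and subclass behavior is as a query expansion step that runs before matching. The sketch below is a simplified assumption about how such an expansion could work, not a description of the actual search engine: the `SUBCLASSES` and `SYNONYMS` tables contain only the terms mentioned in this article, and the function names `canonical`, `expand`, and `concept_search` are hypothetical.

```python
# Toy vocabulary built only from the terms named in this article.
SUBCLASSES = {
    "carbon flux": ["carbon dioxide flux"],
    "carbon dioxide flux": ["carbon dioxide diffusion flux", "stomatal conductance"],
}
SYNONYMS = {"co2": "carbon dioxide"}


def canonical(term: str) -> str:
    """Normalize case and map a synonym to its preferred term."""
    key = term.strip().lower()
    return SYNONYMS.get(key, key)


def expand(term: str) -> set[str]:
    """Return the term plus every subclass beneath it in the hierarchy."""
    root = canonical(term)
    seen = {root}
    stack = [root]
    while stack:
        current = stack.pop()
        for child in SUBCLASSES.get(current, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def concept_search(term: str, annotated: dict[str, set[str]]) -> list[str]:
    """Return names of datasets annotated with the term, a synonym, or a subclass."""
    wanted = expand(term)
    return sorted(
        name
        for name, tags in annotated.items()
        if wanted & {canonical(tag) for tag in tags}
    )


# Invented annotations for three hypothetical data packages.
packages = {
    "package A": {"carbon dioxide flux"},
    "package B": {"CO2"},
    "package C": {"snow depth"},
}
print(concept_search("carbon flux", packages))     # ['package A']
print(concept_search("carbon dioxide", packages))  # ['package B']
```

Walking the hierarchy downward is what lets a broad query such as “carbon flux” pick up datasets tagged only with a narrower term.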

Moving Forward

As mentioned above, the semantics team has initially focused on datasets related to measurements of carbon. With continued activity, the team expects to annotate all of the data in the Arctic Data Center; however, this also depends on the existence of well-constructed controlled vocabularies, or “ontologies” as they are called in the semantic web community. To support this effort, we encourage all principal investigators (PIs), data submitters, and users to explore the search interface and provide feedback to the development team by email at support@arcticdata.io. Because the Arctic Data Center has already annotated a number of datasets, we encourage PIs to review their own datasets for annotation accuracy. Eventually, annotations will be allowed at the data submission stage, so stay tuned for updates.

If you are curious to explore hundreds of thousands of Arctic data files using refined semantics in the annotation browser, please visit the Arctic Data Center. Additionally, Arctic Data Center staff are in the Exhibit Hall at the 2019 AGU Fall Meeting, held in San Francisco, California.

Written by Cézanna Semnacher and Steven Chong