While we were all hoping the curve of COVID-19 would flatten enough to host AGU in person this year, unfortunately that won’t be the reality. Instead, we’re looking forward to the opportunity to reach more folks virtually at the AGU 2020 Annual Meeting! The meeting will be held online everywhere from 1 December to 17 December.

Arctic Data Center staff will be convening the following sessions and giving the following talks, with bold indicating presenter status. Plus, we’ll be answering questions on Twitter as an expert with the #DataHelpDesk – stay tuned for talks, questions, and more to be released with the two hashtags #DataHelpDesk and #AGU20.

Hope to see you there!

Arctic / Arctic Data Center focused:

C016-08: The Permafrost Discovery Gateway: A web platform to enable knowledge-generation from big geospatial data
Dr. Amber E. Budden, Matthew B. Jones, Christopher S. Jones
Tuesday 8 December, 7:28-7:32 Pacific Time

Permafrost thaw has been observed at several locations across the pan-Arctic in recent decades, yet the pan-Arctic extent and potential spatial-temporal variations in thaw are poorly constrained. Thawing of ice-rich permafrost can be inferred and quantified with satellite imagery due to the subsequent differential ground subsidence and erosion that in turn affect land surface cover. Information contained within existing and rapidly growing collections of high-resolution satellite imagery (Big Imagery) is here extracted across the Arctic region through a collaboration between software engineers, computer scientists, and earth scientists. More specifically, we are a) developing geospatial data down to sub-meter resolution, and b) enabling discovery and knowledge-generation through visualization tools. This cyberinfrastructure platform, the Permafrost Discovery Gateway (PDG), is being designed with input from its users, primarily the Arctic earth science community but also the general public. The PDG builds upon other NSF supported data management resources (Arctic Data Center and Clowder) and the Fluid Earth Viewer. We are also actively engaging with the user community to ensure that the PDG becomes useful, both in terms of the type of data contained within the PDG and the design of the visualization tools. The PDG has the potential to fill key Arctic science gaps, such as bridging plot to pan-Arctic scale findings, while also serving as a resource informing decisions regarding the economy, security, and resilience of the Arctic region.

IN008-09: Arctic Data Center: Archiving Arctic Data for Preservation and Re-use
Erin McLean, MS | Community Engagement and Outreach Coordinator
Tuesday 8 December, 10:54-10:57 Pacific Time

Serving as the primary data and software repository for the Arctic section of NSF Polar Programs, the NSF Arctic Data Center’s mission is to help the Arctic research community reproducibly preserve and discover all data, metadata, and software products of NSF-funded science in the Arctic. In fulfilling this mission, the Arctic Data Center has developed tools and resources to support and engage researchers in their data submission. The Arctic Data Center recently launched a data portals service in which researchers can curate collections of data in the catalog that are relevant to their research. The development team also created a provenance web tool to track and attribute the origin of derived data. To ensure proper data citation, submitters are required to use an ORCID iD and datasets are given unique digital object identifiers (DOIs). We will discuss the tools and resources the Arctic Data Center is employing to support and encourage researchers to share their data, increasing awareness and usage and facilitating further collaboration on pressing Arctic research questions.

IN015-05: Challenges in Assessing Data Citation and Reuse in Arctic Research Repositories
Maya Samet | Data Fellow
Wednesday 9 December, 17:46-17:50 Pacific Time

As journals, funding agencies, and researchers increasingly acknowledge the importance of making data publicly available, the ability to track the impact and use of published datasets is an important step in quantifying the effect of open data practices, and imperative for duly crediting impactful data creators and repositories in an open science landscape. The aim of the Data Citation and Reuse project at the NSF Arctic Data Center is to produce accurate citation and research impact metrics for data housed in the Arctic Data Center that can inform researchers about data reuse, as well as to assess the culture and condition of data citation practices in the Arctic science community. The project takes a multi-pronged approach to capturing citations, including programmatic queries to publisher APIs, text mining dataset abstracts for references to related academic work, and investigative case studies of known impactful datasets. We have also developed tools to automate discovery of data citations, including the `scythe` R package. These methods capture a higher number of citations and references than existing citation aggregators do, since researchers do not consistently cite datasets in a way that is captured by these services, and publishers often do not report dataset citations in the same way as they report article citations. In this contribution we will present results comparing the number of citations of Arctic Data Center datasets captured by different methods and discuss future directions to continue capturing data citations and improving data citation practices.

ED042-09: Using Data Repositories to Transform Undergraduate Learning
Sarah Erickson | Data Fellow
Monday 14 December, 17:55-17:58 Pacific Time

There are numerous obstacles (e.g. funding, class load, technical ability, racial or social inequities) for undergraduates to obtain field research experience, especially at remote study sites. The COVID-19 quarantine has further exacerbated these barriers. The associated transition to virtual learning is challenging many undergraduate instructors to rethink the structure of their course(s). Together, these challenges present an excellent opportunity for educators to draw on the growing body of open data stored in public data repositories for data-focused educational activities. Using data from long-lived, publicly accessible archives in lessons not only allows educators to incorporate real datasets into their curriculum, but also introduces students to essential data management skills. As data-intensive environmental and multidisciplinary research grows, it is increasingly necessary to equip students with the skills and knowledge they need for success after graduating. Over the past year, the Arctic Data Center has developed a set of resources and modules for undergraduate educators to use. These modules incorporate publicly available datasets from our repository that allow students to explore a range of concepts in the biological, earth, and environmental sciences. In this session, we will discuss these materials and how instructors can use them to incorporate authentic data-analysis and Arctic science into their courses.

ED054-06: Redesigning an Intensive, Interactive Data Science Training for Remote Participation
S. Jeanette Clark, MS | Projects Data Coordinator
Wednesday 16 December, 16:24-16:28 Pacific Time

The Arctic Data Center has consistently delivered high-quality, data-intensive short courses for the Arctic research community. These one-week courses support the community in developing skills necessary for conducting research in an open and reproducible manner. A combination of instruction, demonstration, live-coding, practice, and peer mentorship results in a highly engaging experience that enables participants to cover a large amount of content in a short time. Additionally, participants have the opportunity to network and form collaborations with aligned researchers. With changes to travel, capacity limits, and policies surrounding indoor events following COVID-19, it was necessary to transition to a remote format. While many conferences and discussion-based workshops had successfully transitioned to an online environment, there were limited examples of online courses that provided the small-group, code-based instruction and problem solving that our training seeks to deliver. In this presentation, we discuss the process used in course development, changes that were required, and report on the challenges and successes of the activity from instructor and participant perspectives.

Data repository focused:

IN015-04: Publishing to the DataONE Network of Repositories for Improved Discovery, Assessment, and Interoperability
Matthew B. Jones | PI
Wednesday 9 December, 17:42-17:46 Pacific Time

Publishing data to repositories is now well recognized as a critical component of the scientific process, but determining which repositories are best suited for your data remains challenging. The DataONE network (https://dataone.org) of repositories makes this process simpler by improving cross-repository interoperability and cross-repository search, and thereby lessening the impact of which repository is chosen. DataONE repositories serve multiple disciplines and constituencies, including earth and environmental science, ecology, geoscience, hydrology, Arctic research, social science, archaeology, and myriad other domains. Repository members of DataONE benefit from the ability to expose their holdings across the broader network, and researchers can search the nearly 50 repositories in DataONE from a single interface, rather than visiting each repository individually. DataONE provides researcher-focused services, including custom cross-repository data portals for researcher data collections, as well as institutional and thematic collections. In addition, DataONE provides advanced metrics across all of the network so both researchers and repositories can understand data usage and citation trends, assess the quality of holdings against the FAIR principles, and ensure preservation of data through cross-repository data replication. New value-added services enable smaller organizations like field stations and libraries to showcase their own collections as part of DataONE Plus, or to affordably host a repository that is well-connected to the rest of the network. We will present characteristics of the repositories in DataONE, and an analysis of holdings across the network from the perspective of the FAIR principles, data usage, and data citation.

IN015-08: The Knowledge Network for Biocomplexity (KNB) – A data repository for ecology and environmental science data
S. Jeanette Clark, MS | Projects Data Coordinator
Wednesday 9 December, 17:58-18:02 Pacific Time

The Knowledge Network for Biocomplexity (KNB) is an international repository intended to facilitate ecological and environmental research. The KNB was launched in 1998 with a grant from the National Science Foundation (NSF), with the purpose of being the long-term home for synthesis datasets and research products generated by working groups at the National Center for Ecological Analysis and Synthesis (NCEAS). Since then, NCEAS has continued to operate the KNB not only as an archive for NCEAS working group products, but also for the broader ecology and environmental science community. The KNB accepts all environmental or ecological data and publishes datasets with Digital Object Identifiers for the express purpose of ensuring long-term access to these datasets. We strive to abide by the FAIR (findable, accessible, interoperable, reusable) principles of data sharing and preservation.

IN008-02: Connecting Environmental Systems Science and Digital Library Practices
Matthew B. Jones, Christopher S. Jones
Tuesday 8 December, 10:33-10:36 Pacific Time

The U.S. Department of Energy’s (DOE’s) Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) stores and publicly distributes data from observational, experimental, and modeling research funded by the DOE’s Environmental Systems Science activity. The diversity of data and interdisciplinary nature of projects present challenges in developing recommendations for data management, reporting, and publication. As representatives of Environmental Systems Science researchers, we can also provide valuable feedback within the informatics community and influence existing practices to better support interdisciplinary science. In this presentation, we demonstrate a community-focused approach in connecting our scientists with best practices for data curation and publication developed in broader informatics and digital library communities. We explore other challenges encountered as a broad, interdisciplinary repository, such as efficiently curating interdisciplinary data types, ensuring that data is FAIR and of high quality, and that authors receive appropriate credit for contributing quality datasets. Overall, the success of our repository relies on our ability to support specific community needs, and incorporate practices that help maximize the value of Environmental Systems Science data now and in the future.

IN008-03: The ESS-DIVE repository and next steps toward a usable, trusted, and FAIR repository
Matthew B. Jones, Christopher S. Jones
Tuesday 8 December, 10:36-10:39 Pacific Time

The US Department of Energy’s (DOE) Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) data repository focuses on three areas of development: expanding adoption and use by ESS users, standardizing data, and supporting projects that provide data to the repository. The priorities of the repository are continually revised and refined based on input from the community. Our current focus is on expanding the user base and functionality of ESS-DIVE through five key innovations: (1) understanding user needs; (2) supporting early data archiving by projects; (3) reaching a broader portion of the ESS community; (4) supporting search of extracted ESS-DIVE data with a fusion database; and (5) federating with other repositories. We are focused on providing a scalable, robust repository and long-term curation of ESS data that adhere to Findable, Accessible, Interoperable, and Reusable (FAIR) principles, with the goal of increasing the ease and capacity of storing data in the repository. We are working with our community to evaluate the available methods of providing usable citations for large subsets of the data from a project. Our end goal is to have a repository that is trusted by the community and that is the preferred storage facility for data generated by the DOE ESS program and the preferred provider of ESS data. One challenge is that FAIR principles are designed to address the needs of the data user, and largely ignore the needs of the data provider. As publishers move to require CoreTrustSeal certification, we expect to see increased pressure to obtain the certification.

IN015-07: Letting the community lead the way to data integration: Data standards and documentation developed by domain experts and the ESS-DIVE repository
Matthew B. Jones, Christopher S. Jones
Wednesday 9 December, 17:54-17:58 Pacific Time

Many repositories, including the US Department of Energy’s (DOE) Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) repository, see data integration and synthesis as a key step in harnessing the power of the large datasets contained within the repositories. However, the lack of standardization in data contributed by users can inhibit data reuse and integration. To kickstart the generation of reporting standards, the ESS-DIVE repository funded six community partners from national labs around the US to develop seven metadata- and data-related standards. We begin by describing how our community partners achieved consensus on standards for some of the most common data types uploaded to ESS-DIVE. One challenge community partners faced was providing robust documentation so that any data producer could adopt the standards prior to uploading their data to ESS-DIVE. Documentation also needed to be dynamic so that standards could be modified relatively easily when required. To overcome this challenge, ESS-DIVE has begun to implement a software versioning-style framework to allow for data standards to be transparently developed and updated. Data uploaded to the ESS-DIVE repository that adhere to these community standards will be more interoperable and reusable, facilitating synthesis across datasets. These standardized data contributions to ESS-DIVE would then enable a deeper integrated search of the individual data files within the repository through the ESS-DIVE “fusion database”. Ultimately, by developing standards, providing clear documentation, and offering a transparent way of updating standards, ESS-DIVE provides a sustainable path toward data integration through community-driven standard development.

IN047-09: Optimizing the Efficiency of Metadata Curation in Large Scale Data Repositories
Matthew B. Jones, Christopher S. Jones
Thursday 17 December, 4:24-4:27

The Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) data repository stores highly diverse Earth and environmental science data generated by projects funded by the U.S. Department of Energy (DOE). A system of metadata quality standards was developed through extensive community collaboration to ensure the data submitted to ESS-DIVE remain findable, accessible, interoperable, and reusable (FAIR) for data users. However, ongoing implementation of these checks requires a metadata review process capable of scaling with the growth of the repository as increasing emphasis is placed on the importance of data archival within the environmental sciences. To address this challenge, ESS-DIVE created a robust data package review workflow incorporating both automated and manual checks for each data package submitted for publication. A suite of automated metadata quality FAIR checks was developed by the National Center for Ecological Analysis and Synthesis (NCEAS) and tailored to fit ESS-DIVE’s needs through research into metadata best practices, review of journal metadata requirements, and community feedback. The results are compiled into Metadata Quality Reports, which provide instantaneous feedback to both the data contributor and ESS-DIVE reviewers on problem areas within the metadata. Reviewers then carry out manual checks focused on metadata content and complete post-review assessments that record how long each review takes. Standardized feedback responses are generated by both series of checks and are used by the reviewer to collaborate 1:1 with contributors until all standards are met and the data package is eligible for publication. This system has improved the quality of ESS-DIVE data while decreasing review time by ~60% from the start of implementation. This system of metadata review will sustain and support higher volumes of publication requests, ensuring that metadata quality standards are enforced throughout the continued growth of the ESS-DIVE repository.

IN047-10: Increasing visibility of historical datasets through modern repository practices
Christopher S. Jones, Matthew B. Jones
Thursday 17 December, 4:27-4:30

The Environmental Systems Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) repository preserves, expands access to, and improves usability of Earth and environmental science data. Amongst several efforts to improve the visibility of ESS-DIVE data, we’ve adapted the “Portals” feature from the National Center for Ecological Analysis and Synthesis (NCEAS) Metacat platform, providing our repository users a space to showcase their custom data collections. Here, ESS-DIVE demonstrates the utility of Portals for data discovery using the legacy data collection of Carbon Dioxide Information Analysis Center (CDIAC) datasets. CDIAC was a DOE climate-change data archive containing high-value fossil fuel emission and vegetation response data that ceased operations in 2017. When ESS-DIVE took on the responsibility of maintaining these decades’ worth of vital climate change data, we had the opportunity to increase the discoverability of these datasets using a modern, manageable user interface. In collaboration with the DOE’s Office of Scientific and Technical Information (OSTI), we enhanced the CDIAC metadata previously obscured from users and coupled the datasets and metadata into packages on ESS-DIVE. Then, using the NCEAS feature, we created a portal to easily view all CDIAC datasets within the ESS-DIVE repository and transferred project information into portal landing pages, providing an archive-centric view of CDIAC data. Portals are a permanent feature in ESS-DIVE that any user can leverage to create custom, branded landing pages about their research topic with any related datasets published on ESS-DIVE.

Sessions convened:

IN013 – Best Practices and Realities of Research Data Repositories: Which One Should I Choose to Publish My Data? I
Wednesday 9 December, 4:00-5:00 Pacific Time
IN015 – Best Practices and Realities of Research Data Repositories: Which One Should I Choose to Publish My Data? II
Wednesday 9 December, 17:30-18:30 Pacific Time
IN008 – Best Practices and Realities of Research Data Repositories: Which One Should I Choose to Publish My Data? III eLightning
Tuesday 8 December, 10:30-11:30 Pacific Time
Dr. Amber E. Budden (Convening)

In recent years, the number of Earth and environmental research data repositories has increased markedly, as has their range of maturities and capabilities to integrate into the ecosystem of modern scientific communication. The FAIR Data Principles, the CoreTrustSeal Certification and the Enabling FAIR Data Commitment Statement have all raised expectations on the capabilities of repositories. As funders and publishers increasingly require that research data be made publicly accessible, researchers are challenged to learn where and how to publish their data. How do researchers know which repositories meet these benchmarks and future expectations? This session will showcase the range of practices in research data repositories, data publication and the integration of data, software and samples into the scholarly publication process. It invites repositories to discuss challenges they are facing in meeting community best practice. This session will also help researchers to answer “Which repository should I choose to publish my data?”

ED054 – Inclusive Research Collaboration and Learning in a Virtual Environment I
Wednesday 16 December, 16:00-17:00 Pacific Time
ED051 – Inclusive Research Collaboration and Learning in a Virtual Environment II Posters
Dr. Amber E. Budden (Convening)

Broad, collaborative, interdisciplinary research provides a unique opportunity for novel insights at a global scale. While there are extensive benefits to convening diverse groups, coordination across distributed teams can be challenging. Variation in seasonal fieldwork, extensive travel, and funding availability limit opportunities for in-person engagement, and remote participation can be stymied by time zones and availability of infrastructure at the local level. At a time when society is increasingly concerned with its global carbon footprint and is grappling with the consequences of the COVID-19 pandemic, researchers have had to fast-track use of virtual technology to support their work and collaborations. There are many lessons being learned along the way and in this session we will highlight the leading practices that have emerged, in addition to areas for improvement. Bringing together in-person presentations and remote delegates, we will ‘practice what we teach’ through an interactive, engaging session focusing on supporting the Earth science community in developing methodologies for working with diverse, interdisciplinary groups in a productive and inclusive virtual environment.