Avoiding Data Risk: Strategies for Protecting Against Data Loss at the Arctic Data Center

The NSF Arctic Data Center, the primary data and software repository for the Arctic section of NSF Polar Programs, is now CoreTrustSeal certified. This means our established data preservation practices and infrastructure have been endorsed by an international, community-based non-profit organization.

CoreTrustSeal certification involves meeting 16 requirements intended to ensure the reliability and durability of the repository itself, so that the data it holds can be used, shared, and preserved over the long term. The process of becoming certified is similar to the acceptance of an article in a peer-reviewed journal: the repository must supply evidence that it meets the 16 requirements, and that evidence is then reviewed by community peers and external professionals. Reviewers provide comments wherever a requirement is not fully met, and the repository corrects any oversights before gaining certification. We are proud to report that we meet all 16 requirements, outlined below, and have achieved CoreTrustSeal certification:

1. The repository has an explicit mission to provide access to and preserve data in its domain.
2. The repository maintains all applicable licenses covering data access and use and monitors compliance.
3. The repository has a continuity plan to ensure ongoing access to and preservation of its holdings.
4. The repository ensures, to the extent possible, that data are created, curated, accessed, and used in compliance with disciplinary and ethical norms.
5. The repository has adequate funding and sufficient numbers of qualified staff managed through a clear system of governance to effectively carry out the mission.
6. The repository adopts mechanism(s) to secure ongoing expert guidance and feedback (either in-house, or external, including scientific guidance, if relevant).
7. The repository guarantees the integrity and authenticity of the data.
8. The repository accepts data and metadata based on defined criteria to ensure relevance and understandability for data users.
9. The repository applies documented processes and procedures in managing archival storage of the data.
10. The repository assumes responsibility for long-term preservation and manages this function in a planned and documented way.
11. The repository has appropriate expertise to address technical data and metadata quality and ensures that sufficient information is available for end users to make quality-related evaluations.
12. Archiving takes place according to defined workflows from ingest to dissemination.
13. The repository enables users to discover the data and refer to them in a persistent way through proper citation.
14. The repository enables reuse of the data over time, ensuring that appropriate metadata are available to support the understanding and use of the data.
15. The repository functions on well-supported operating systems and other core infrastructural software and is using hardware and software technologies appropriate to the services it provides to its Designated Community.
16. The technical infrastructure of the repository provides for protection of the facility and its data, products, services, and users.
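Several of these requirements translate directly into day-to-day engineering practice. Requirement 7, for example, is commonly met through routine fixity checking: computing a cryptographic checksum of every file at ingest and re-verifying it on a schedule so that silent corruption is caught before it becomes permanent loss. The snippet below is a minimal sketch of that general technique in Python; the file name is hypothetical, and this is an illustration of the idea rather than the Arctic Data Center's actual implementation.

```python
import hashlib

def sha256_digest(path, chunk_size=8192):
    """Compute the SHA-256 digest of a file, reading in chunks
    so that large data files don't have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record the checksum once, at ingest (file name is hypothetical).
recorded = sha256_digest("arctic_sample.csv")

# Later, re-verify on a schedule to detect silent corruption.
if sha256_digest("arctic_sample.csv") == recorded:
    print("Fixity check passed: file unchanged since ingest.")
else:
    print("Fixity check FAILED: possible corruption or tampering.")
```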

In addition to seeking certifications like CoreTrustSeal, data repositories can mitigate data risk in other ways. A recent paper from Mayernik et al. (2020) presents a framework for assessing the various risks that come with preserving and archiving data.

How can data, one might wonder, be at risk? Anything that limits current or future use of the data can be considered a risk to its preservation and longevity. There are many factors that can put data at risk, which Mayernik and his collaborators catalog in a table in their paper.

For many of these risk factors, steps can be taken to reduce the chance of the data becoming inaccessible. A basic step researchers can take is ensuring that complete metadata are recorded at the time the data are collected, as sketched below. A more complex and labor-intensive step might be a full data rescue initiative, such as the coordinated efforts taken by grassroots organizations to pre-emptively “rescue” data after the 2016 US Presidential election. The bottom line, though, is that it takes effort to guard against data loss.
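As a concrete (and hypothetical) illustration of that basic step, the sketch below captures a small metadata record and a checksum alongside a data file at collection time, written as a JSON sidecar so the documentation travels with the data. All field values and file names here are invented for the example; real Arctic Data Center submissions use richer community standards such as the Ecological Metadata Language (EML).

```python
import hashlib
import json
from datetime import datetime, timezone

data_file = "station_temps_2020.csv"  # hypothetical data file

# Checksum the data so later fixity checks can detect corruption.
with open(data_file, "rb") as f:
    checksum = hashlib.sha256(f.read()).hexdigest()

# A deliberately minimal metadata record; real submissions should
# document variables, units, methods, and spatial/temporal coverage.
metadata = {
    "title": "Hourly air temperature, Station A, 2020",
    "creator": "J. Researcher",
    "collected": "2020-07-15",
    "variables": {"air_temp": {"units": "degrees Celsius"}},
    "checksum_sha256": checksum,
    "recorded_at": datetime.now(timezone.utc).isoformat(),
}

# Write a sidecar file so the metadata stays with the data file.
with open(data_file + ".metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```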

While some of the risks associated with loss of data utility must be addressed by the researchers collecting the data, a number of the risk factors identified by Mayernik et al. (2020) can be mitigated by data repositories. Establishing that a repository has systems in place to minimize these risks is an important part of building trust between researcher and repository. One way to communicate these efforts effectively is for repositories to seek external certification – as we have done at the Arctic Data Center by fulfilling all of the CoreTrustSeal requirements.

Additionally, we’re doing our part to help mitigate data loss from other repositories. The International Arctic Research Center (IARC) is winding down its operations as a data repository, and the Arctic Data Center was identified as the new home for its data. We’re currently working with the IARC team to add their collection of 260 datasets to the holdings of the Arctic Data Center so that they can continue to be used, shared, and preserved.

Bibliography

Mayernik, Matthew S., et al. “Risk Assessment for Scientific Data.” Data Science Journal 19.1 (2020).