Data Submission Guidelines

The National Science Foundation Office of Polar Programs (NSF OPP) mandates that metadata, full datasets, and derived data products be stored in a long-lived and publicly accessible archive.

To meet these requirements, the Arctic Data Center was established with NSF funding to serve as the archive for Arctic Sciences Section (ARC) data and metadata. The center ensures that ARC-related information is preserved and readily available for public access.

Who Must Submit?

The principal investigator of any NSF OPP-funded project is required to submit and publish all relevant metadata and data to a publicly accessible archive.

Data from ARC-supported scientific research should be deposited in long-lived and publicly-available archives appropriate for the specific type of data collected (by default, the NSF-supported Arctic Data Center or others where appropriate). Metadata for projects, regardless of where they are archived, should be submitted to the Arctic Data Center for centralized access and discoverability.

For all ARC supported projects, see the NSF OPP Data Management and Data Reporting Requirements, which include the following conditions:

  • Complete metadata must be submitted to a national data center or another long-lived, publicly accessible archive within two years of collection or before the end of the award, whichever comes first.
  • All data and derived data products that are appropriate for submission (see exceptions below) must be submitted within two years of collection or before the end of the award, whichever comes first.

For all ARC supported Arctic Observing Network (AON) projects, NSF also requires:

  • Real-time data must be made publicly available immediately. If there is any question about what constitutes real-time data, please contact the appropriate NSF Program Officer.
  • All data must be submitted to a national data center or another long-lived publicly accessible archive within 6 months of collection, and be fully quality controlled.
  • All data sets and derived data products must be accompanied by a metadata profile and full documentation that allows the data to be properly interpreted and used by other researchers.

For sensitive social science data:

  • NSF policies include special exceptions for Arctic Social Sciences (ASSP) awards and other awards that contain sensitive data, including human subjects data and data that are governed by an Institutional Review Board (IRB) policy. These special conditions exist for sharing social science data that are ethically or legally sensitive or at risk of decontextualization.
    • If you are unfamiliar with the IRB, the Arctic Data Center has a set of resources that can serve as a guide to navigating the application process and planning for data handling after collection.
  • In these cases, NSF has requested that a metadata record be created to document non-sensitive aspects of the project and data, including the title, contact information for the dataset creators and contacts, and an abstract and methods description summarizing the data collection methodologies that do not include any sensitive information or data.
  • Please let us know when submitting your record that your data contains sensitive information so that we can adjust our review process accordingly.
    • The Arctic Data Center has data tags, found early in the data submission process, which serve as a guide for how to proceed. Each data tag indicates the level of sensitivity and/or restriction of the data.
  • Please contact your NSF Program Manager if you have questions about what to submit or what is required for any particular award.

Please write to support@arcticdata.io with any questions, and we will clarify these policies to the best of our ability. Ultimately, NSF makes the final policy decisions on these data submissions.

Organizing Your Data

What is a Data Package?

Data packages on the Arctic Data Center are simply defined as a collection of related data and metadata files. Each data package should contain, when possible, all of the relevant data and metadata from a specific research project (or sub-project/project component).

Depending on the size of a research project, multiple data packages may be associated with a single research project.

  • For example, if a research project consists of field sampling at several distinct sites or over several distinct sampling seasons, each site/season may have its own unique data package.
  • When submitting to the Arctic Data Center, it is up to the best judgment of the submitting researcher how their research should be organized.
  • If multiple data packages are needed for a research project, they can be created by going through the Arctic Data Center website submission tool separately for each subset.
  • After submitting multiple data packages, a data portal can be created to allow the related data packages to be discovered together.

File Guidelines

All of the observations belonging to one type should go in ONE file unless there’s a compelling reason to do otherwise.

You should consider splitting your data into multiple files only in the following cases:

  • The data are too big (e.g. > 1 GB) so segmenting them makes access or upload more convenient, or
  • The data are collected incrementally so new files will need to be added monthly or annually.

File Content

To optimally document and share a project’s output, following best practices for data management is required. Some resources for these best practices can be found here and here.

The following are a few guidelines that are encouraged for file organization for projects that plan to submit to the Arctic Data Center. Following these guidelines should help ensure a project’s outputs are easy to access and understand.

  • All files should have short, descriptive names (e.g., “sea_ice_extent_2020.csv”).
  • Only letters, numbers, hyphens (“-“), and underscores (“_”) should be used in file names. Always avoid spaces and special characters when naming files.
  • All files should be stored in open, ubiquitous, and easy-to-read file formats (see File Format Guidelines). This will help ensure the long-term reusability of your data without the need for proprietary software.
  • All data should be tidy in order to be accessible and reusable by future researchers. The Arctic Data Center support team will not edit submitted data; however, they can guide researchers on how to tidy their data.
  • Tabular data should be submitted in a long (versus wide) format if possible. Long formats make documentation of column attributes (variables) much easier and allow future users to more easily process the data programmatically (see the sketch after this list).
  • For models/scripts, all files necessary to run the code should be included and organized in a manner that makes running the code as accessible as possible. If outside dependencies (software, hardware, or otherwise) are needed to run code and cannot be submitted to the Arctic Data Center, details of these dependencies should be made clear within the metadata description of the code files as well as within the methods metadata. For large models, see Guidelines for Large Models.
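As an illustration of the long-format recommendation above, the following is a minimal sketch using the tidyr R package; all column names and values are hypothetical.

```r
# A minimal sketch of reshaping a wide table into long format with tidyr.
# All column names and values here are hypothetical.
library(tidyr)

wide <- data.frame(
  sample_date = c("2021-06-01", "2021-06-02"),
  site_a_temp = c(4.1, 5.3),  # water temperature at site A (degrees C)
  site_b_temp = c(3.8, 4.9)   # water temperature at site B (degrees C)
)

# In long format, each row is one observation: a date, a site, and a value.
long <- pivot_longer(
  wide,
  cols      = c(site_a_temp, site_b_temp),
  names_to  = "site",
  values_to = "temperature_c"
)

write.csv(long, "water_temperature_long.csv", row.names = FALSE)
```

In the resulting long table, documenting a single “temperature_c” attribute suffices, rather than one attribute per site column.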

An example of tidy data for optimal reproducibility:

[Figure: tidy data illustration. Artwork by Allison Horst.]

File Format

The Arctic Data Center primarily supports and encourages the upload of open, ubiquitous, and easy-to-read file formats. Examples of such formats are Comma Separated Values (CSV) files; text (TXT) files; PNG, JPEG or TIFF image files; R or Python scripts; and NetCDF files, among many others.

For projects that plan to submit to the Arctic Data Center, we strongly advise researchers to incorporate plans for creating files using open data formats in the initial stages of project development (i.e., within the data management plan of the project proposal). 

If your data are in proprietary formats such as Excel workbooks or MATLAB files (<v7.3), plan ahead to convert them into open data formats before submitting to the Arctic Data Center, as proprietary formats can pose a barrier to reuse by others who lack the required software. If you must submit files in a proprietary format, the Center may request an explanation before publishing the files and will ask you to provide an open-source software alternative that can process them.
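As one hedged sketch of such a conversion (assuming the readxl R package is available; the workbook name is hypothetical), each sheet of an Excel workbook can be exported to CSV before submission:

```r
# A minimal sketch: export every sheet of an Excel workbook to CSV.
# The workbook file name is a hypothetical example.
library(readxl)

workbook <- "field_measurements.xlsx"

for (sheet in excel_sheets(workbook)) {
  df <- read_excel(workbook, sheet = sheet)
  write.csv(df, paste0(sheet, ".csv"), row.names = FALSE)
}
```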

Open source file format recommendations:

  • For tabular data, we advise researchers to use common file formats such as CSV or TXT.
  • For image files, we advise researchers to use common file formats such as PNG, JPEG, or TIFF.
  • For GIS files, it is acceptable to submit the de facto standard ESRI shapefiles or GeoJSON files.
  • For MATLAB or other matrix-based programs, we advise researchers to export NetCDF files (a minimal export sketch follows this list).
    • We encourage researchers to use the NetCDF format when large numbers of uniform matrices or arrays are being archived. See here for more information about NetCDF files.
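As an illustration of the NetCDF recommendation above, the following is a minimal, hedged sketch using the ncdf4 R package; the coordinates, values, and variable name are hypothetical.

```r
# A minimal sketch of writing a gridded matrix to NetCDF with the ncdf4
# package. Coordinates, values, and the variable name are hypothetical.
library(ncdf4)

lon <- seq(-170, -140, by = 1)
lat <- seq(65, 75, by = 1)
sst <- matrix(runif(length(lon) * length(lat), min = 271, max = 278),
              nrow = length(lon), ncol = length(lat))

dim_lon <- ncdim_def("lon", "degrees_east",  lon)
dim_lat <- ncdim_def("lat", "degrees_north", lat)
var_sst <- ncvar_def("sea_surface_temperature", "kelvin",
                     list(dim_lon, dim_lat), missval = -999)

nc <- nc_create("sst_example.nc", list(var_sst))
ncvar_put(nc, var_sst, sst)
ncatt_put(nc, 0, "title", "Hypothetical sea surface temperature grid")
nc_close(nc)
```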

For transparency and ease of automated parsing, we advise researchers to upload files individually and avoid uploading zipped archives.

  • Zipped archives make it difficult for researchers to view or access specific files, and researchers often prefer not to download entire archives. Additionally, zipped archives present challenges in accurately assigning metadata to nested files. Exceptions apply when compatibility with widely used software is necessary or when it is essential for files to stay together, such as with ESRI shapefiles.

Metadata

The Arctic Data Center primarily stores metadata in structured, XML-based files. When submitting a data package through the Arctic Data Center website, a structured XML metadata file is automatically created. However, we advise researchers to create complete metadata records in a format convenient to them prior to submitting to the Arctic Data Center. Ideally, plans to create and store metadata records should be made during the initial stages of project development (i.e., within the data management plan of the project proposal).

The goal of metadata is to document a project’s output so that another scientist will be able to understand and use all the components of the output without any outside consultation. The following is a non-exhaustive list of components typically expected within metadata records submitted to the Arctic Data Center:

  • A descriptive title that includes the topic, geographic location, dates, and, if applicable, the scale of the data.
  • A descriptive data package abstract that provides a brief overview summarizing the specific contents and purpose of the data package.
  • Funding information (typically the NSF award number).
  • A list of all people or organizations associated with the data package, with at least one person or organization acting as a creator and one acting as a contact (these can be the same). See the Identification Guidelines for more information about listing people within Arctic Data Center metadata records.
  • Full records of field and laboratory sampling times and locations, including a geographic description interpretable by a general scientific audience.
  • Full records of taxonomic coverage within the data package (if applicable).
  • Full descriptions of field and laboratory sample collection and processing methods.
  • Full descriptions of any hardware and software used (including make, model, and version, where applicable).
  • Full attribute/variable information for all data.
  • Quality control procedures.
  • Relevant explanations for why the particular components detailed above were chosen for the project.

Additional guidance for specific metadata cases is included below.

Tabular and Spatial Data

Submitted metadata should include detailed descriptions of every attribute collected.

Attributes in tabular data (e.g. age, sex, length of fish encoded in a CSV file) are often referred to as variables and are arranged in either columns or rows. Note that storage of data in a long versus wide format will allow for more succinct metadata (see File Organization Guidelines).

In spatial vector data (e.g. lake features encoded in a shapefile), attributes describe a feature characteristic (e.g. lake area). In spatial raster data, the attribute of interest is encoded as a cell value for a given location (e.g. Advanced Very High Resolution Radiometer Sea Surface Temperature (AVHRR SST) encoded in a NetCDF matrix).

The following components are needed to describe each attribute (a sketch of encoding them programmatically follows this list):

  • A name (often the column or row header in the file). Like file names, only letters, numbers, hyphens (“-“), and underscores (“_”) should be used in attribute names. Always avoid spaces and special characters when naming attributes.
  • A complete definition. The definition should fully clarify the measurement to a broad scientific audience. For example, a definition like “%C” may be interpreted uniformly within one discipline but quite differently within another. A full technical definition such as “percent soil carbon by dry soil mass” limits possible confusion.
  • Any missing value codes along with explanations for those codes (e.g.: “-999 = instrument malfunction”, “NA = site not found”).
  • For all numeric data, unit information is needed (e.g.: meters, kelvin, etc.). If your unit is not found in the standard unit list, please select “Other / None” and we will change this to the appropriate custom unit.
  • For all date-time data, a date-time format is needed (e.g.: “YYYY-MM-DD”).
  • For all spatial data, the spatial reference system details are needed.
  • For text data, full descriptions for all patterns/codes are needed if the text is constrained to a list of patterns or codes (e.g. a phone number would be constrained to a certain pattern and abbreviations for site locations may be constrained to a list of codes).
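As one hedged sketch of how such attribute descriptions can be encoded, the rOpenSci EML R package can assemble attribute-level metadata programmatically. The attribute names, definitions, units, and missing value codes below are hypothetical, and the column names expected by set_attributes() should be verified against the package documentation.

```r
# A minimal sketch of building attribute-level metadata with the EML package.
# Attribute names, definitions, units, and codes are hypothetical examples.
library(EML)

attributes <- data.frame(
  attributeName               = c("soil_carbon_pct", "sample_date"),
  attributeDefinition         = c("Percent soil carbon by dry soil mass",
                                  "Date the soil sample was collected"),
  unit                        = c("dimensionless", NA),
  numberType                  = c("real", NA),
  formatString                = c(NA, "YYYY-MM-DD"),
  missingValueCode            = c("-999", NA),
  missingValueCodeExplanation = c("instrument malfunction", NA),
  stringsAsFactors            = FALSE
)

# col_classes declares each attribute's measurement type.
attribute_list <- set_attributes(attributes, col_classes = c("numeric", "Date"))
```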

Software

For data packages with software (including models), submitted metadata should contain the following components, among others:

  • Instructions on how to use the software.
  • Version information.
  • Licensing information.
  • A list of software/hardware used in development.
  • A list of software/hardware dependencies needed to run the software.
  • Information detailing source data for models.
  • Any mathematical/physical explanations needed to understand models.
  • Any methods used to evaluate models.

Data Package Size

The Arctic Data Center does not have data package or file size limitations for NSF-funded projects, and many multi-terabyte data packages have been archived on the Arctic Data Center. In most cases, all data and metadata relevant to each project should be archived regardless of total file size (note that non-NSF-funded projects may be subject to a one-time processing fee depending on the total data package size). The Arctic Data Center website can handle the upload of multiple large files simultaneously. However, researchers with slow internet connections, or those who experience trouble uploading any file through the website, should contact the Arctic Data Center support team at support@arcticdata.io. The support team has many options for uploading large data packages when connection speed is limited or files are exceptionally large.

Large Number of Files

When a dataset has a large number of files (around 1,000 or more), uploading them through our web editor can take a very long time. In this case, we ask that you email our support team so that we can guide you on uploading the files directly to our server. We will provide you with credentials to remotely connect and upload your data directly to our server. These files will be stored in a web-accessible location on our server that will be referenced in your data package.

We will then ask you to submit a data package through our web editor without adding any files, and to provide the file path and file naming convention used for your dataset, a brief 1-2 sentence description of each file type, and attribute definitions and units where applicable.

Large Models

When models produce extensive outputs, the resulting data packages can become very large. We advise researchers to archive the model output if it is crucial to the project. However, in some cases, the model output can be regenerated from the archived input data and model code. Therefore, depending on the specifics of a project, it may be reasonable to archive only the code, model inputs, and a clear workflow on how to recompute the model output.

Considerations for larger models:

  • If re-running the model is resource intensive (e.g., multiple compute cycles, specialized or expensive software and/or hardware), the output should be archived.
    • We recommend this when the output is a valuable data package. Example: climate model outputs are difficult for the average scientist to recompute, but are valuable for various downstream uses.
  • If models are stochastic, their outputs can vary with each run, so the interpretation of subsequent products depends on the specific model runs.
    • We recommend archiving the model output to ensure consistency in results and to reduce variability in model outputs.

The Submission Process

The following sections provide details on accessing the Arctic Data Center using an ORCID account, the licensing requirements for distributing data and metadata, the publication process, and the currently available tools for submission.

Identification Guidelines

The Arctic Data Center requires submitters to have ORCID iDs for proper identification and attribution of each data package. ORCID iDs are not required for all associated parties (contacts, additional creators, etc.), but are strongly encouraged, especially for the primary creator. Only individuals with ORCID iDs can be granted editing access to data packages. Therefore, we advise researchers to register and record ORCID iDs for each individual involved with a project during the initial stages of project development.

Licensing and Data Distribution

All data and metadata will be released under either the CC-0 Public Domain Dedication or the Creative Commons Attribution 4.0 International License, with the potential exception of social science data that have certain sensitivities related to privacy or confidentiality. In cases where legal (e.g., contractual) or ethical (e.g., human subjects) restrictions to data sharing exist, restrictions on data publication must be requested in advance and in writing, and are subject to the approval of NSF, which will ensure compliance with all federal, university, and Institutional Review Board (IRB) policies on the use of restricted data.

As a repository dedicated to helping researchers increase collaboration and the pace of science, the Arctic Data Center needs certain rights to copy, store, and redistribute data and metadata. By uploading data, metadata, and any other content, users warrant that they hold the rights to the content, or are otherwise authorized to upload it, under copyright or any other right that might pertain to the content. Data and facts themselves are not eligible for copyright in the US and most other countries. That said, some associated metadata and some particular compilations of data could potentially be covered by copyright in some jurisdictions.

By uploading content, users grant the Arctic Data Center repository and the University of California at Santa Barbara (UCSB) all rights needed to copy, store, redistribute, and share data, metadata, and any other content. By marking content as publicly available, users grant the Arctic Data Center repository, UCSB, and any other users the right to copy the content and redistribute it to the public without restriction under the terms of the CC-0 Public Domain Dedication or the Creative Commons Attribution 4.0 International License, depending on which license users choose at the time of upload.

Publication

The Arctic Data Center provides a long-lived and publicly accessible system from which other researchers can freely obtain data and metadata files. Complete submissions to the Center must meet the requirements set by the NSF OPP, which require that metadata files, full data sets, and derived data products be deposited in a long-lived and publicly accessible archive.

Submission of data packages to the Arctic Data Center is free for all NSF-funded projects. Projects not funded by NSF are permitted to submit data packages to the Center but may be subject to a one-time processing fee, depending on the size and processing needs of the data package. Additionally, to be published on the Arctic Data Center, data packages from projects not funded by NSF should cover relevant Arctic science research. Contact us with any questions on these submissions. Please note that submissions of NSF-funded data packages are prioritized in our processing queue.

  • Researchers will submit data to the Arctic Data Center through the Arctic Data Center website (see Submission Support if needed).
  • The Center’s support and curation team will review the initial data submission to check for any issues that may need to be resolved as quickly as possible. Extensive corrections might delay processing time; therefore, it is critical that researchers communicate early and diligently.
    • The Center’s team will correspond with submitters primarily via email from support@arcticdata.io to the email address registered with the submitter’s ORCID iD.
  • We advise researchers to submit data packages well before deadlines. 
  • For most ARC-funded projects, all data and metadata submissions are due within two years of collection or before the end of the award, whichever comes first. The data submission deadline is stricter for Arctic Observing Network (AON) projects, with real-time data to be made publicly available immediately and all data required to be fully quality controlled and submitted within 6 months of collection. See the submission requirement exceptions for sensitive social science data described above.

Depending on the complexity of the data package and the quality of the initial submission, the review process can take anywhere from a few hours to several weeks; around two weeks is typical for a dataset with a few files. Long processing times generally occur when initial submissions have incomplete metadata, poorly organized files, or a submitter who is unresponsive to follow-up emails. Compliance with the guidelines detailed here should ensure quick processing times, and well-organized, complete data packages can potentially be published within one business day. After the review process, each data package will be given a unique Digital Object Identifier (DOI) that assists with attribution and discovery. The DOI is registered with DataCite using the EZID service and is discoverable through multiple data citation networks, including DataONE.

Once the data package is published with the Arctic Data Center, it can still be edited and updated with new data or metadata. Additionally, the original data and metadata will remain archived and available to anyone who might have cited it. Updates to data and metadata can be made by clicking the green “Edit” button on the website of any data package (researchers will need to log in and have edit access to see the green button).

Each data package DOI represents a unique, immutable version, just as for a journal article. Any update to a data package therefore qualifies as a new version and requires a new DOI. DOIs and URLs for previous versions of data packages remain active on the Arctic Data Center (i.e., they will continue to resolve to the data set landing page for the specific version they are associated with), but a clear message will appear at the top of the page stating that “A newer version of this data package exists,” with a hyperlink to the latest version. With this approach, any past uses of a DOI (such as in a publication) will remain functional and will reference the specific version of the data package that was cited, while pointing researchers to the newest version if one exists.

Submission Tools

The materials in Section 5.5, “Publishing Data from the Web,” of our Arctic Data Center training can help guide you through the submission process.

Most researchers submit to the Arctic Data Center through the Arctic Data Center website. However, there are several alternative tools to submit data packages to the Center.

Researchers can also submit data inside an R workflow using the DataONE R package; a minimal sketch of that workflow follows.
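This sketch assumes the dataone and datapack packages; the file names are hypothetical, and the token option, format identifiers, and node ID should be verified against current package documentation.

```r
# A minimal sketch of assembling and uploading a data package with the
# dataone and datapack R packages. File names are hypothetical, and the
# token option and node identifier should be checked against current docs.
library(dataone)
library(datapack)

# Authenticate with a token copied from your Arctic Data Center profile page.
options(dataone_token = "PASTE_YOUR_TOKEN_HERE")

client <- D1Client("PROD", "urn:node:ARCTIC")

dp <- new("DataPackage")

metadata <- new("DataObject",
                format   = "https://eml.ecoinformatics.org/eml-2.2.0",
                filename = "metadata.xml")
dp <- addMember(dp, metadata)

csv_file <- new("DataObject", format = "text/csv", filename = "site_data.csv")
dp <- addMember(dp, csv_file, metadata)  # link the data file to its metadata

package_id <- uploadDataPackage(client, dp, public = TRUE)
```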

Considerations when using the Submission Tool editor:

  • Our Submission Tool editor has six sections covering different aspects of your dataset. Each section contains required information needed to process your dataset. Once you have submitted, you will work with our data curators to ensure all required fields are ready for publication.
  • Once your submission is complete, you can still make changes to the dataset both before and after publication.
  • We urge researchers to pay close attention to the Ethical Research Practices field and to describe how, and the extent to which, data collection procedures followed community standards. If guidance is needed, consider whether IRB approvals, consent waivers, data sovereignty, or other issues related to reproducible and ethical research applied to the project.
    • The CARE Principles serve as an additional guide to help researchers consider any ethical concerns in their research practice. 
    • The Arctic Data Center developed an Ethical Arctic Research Practices Guide to aid researchers in providing information in their Ethical Research Practices Statement. For more information, please visit our data ethics page.
  • For further information, please visit our “Publishing data from the web” reference.

Developers: REST API

In addition to the web and data tools shown above, the Arctic Data Center provides the ability to access and submit data via the DataONE REST API. This allows the community to use many programming languages to add data and metadata to the repository, search for and access data, and automate processes that might otherwise be highly time consuming. Most useful to groups with repetitive tasks to perform, such as submitting many data files of the same type, the REST API can be a real time saver. For more details, please contact us at support@arcticdata.io.

The Arctic Data Center currently encodes science metadata in the Ecological Metadata Language (EML). Packaging information (how metadata files are associated with data files) is currently encoded in resource maps using the Open Archives Initiative Object Reuse and Exchange specification. Please contact support@arcticdata.io for detailed help with programmatically producing these XML-based files.
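As one hedged illustration, a Solr search against the Center’s DataONE member node might look like the following in R; the base URL and field names reflect the DataONE API as commonly documented and should be verified against current documentation.

```r
# A minimal sketch of querying the Arctic Data Center's DataONE member node
# Solr index with httr. The base URL and field names should be verified
# against current DataONE API documentation.
library(httr)

base_url <- "https://arcticdata.io/metacat/d1/mn/v2/query/solr/"

response <- GET(base_url, query = list(
  q  = 'title:"sea ice" AND formatType:METADATA',
  fl = "identifier,title,dateUploaded",
  wt = "json"
))

results <- content(response, as = "parsed")
```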

Submission Support

If, for any reason, support with a submission is needed, contact support@arcticdata.io and an Arctic Data Center support team member will respond promptly.

If you have a large volume of files to submit or the total size of your data is too large to upload via the web form, please first submit your complete data package description (metadata) through the Arctic Data Center website without uploading any data files, and then write to support@arcticdata.io to arrange another method for the data transfer. The support team has multiple options for transferring large amounts of data, including via Google Drive or our SFTP service.

Have questions not answered in this guide? Please see the Frequently Asked Questions section of the Center’s website, or contact support@arcticdata.io and a member of the Arctic Data Center support team will respond promptly.
