RFC_001_data_citation_identifier.md


A. Title [insert version and date, version 0.9 is the first stable version for public comment, version 1.0 is the first approved version.]

  1. Registration of DOIs for data citation
  2. version 0.9.1
  3. 2023-01-03

B. Summary [A concise articulation of what issues the guideline will address, what organizations are and should be adopting it, necessary context or background, and the overall impact on NASA individuals and organizations]

  1. NASA SMD’s information policy (SPD-41a) states that SMD-funded data collections shall be citable using a persistent identifier. It does not require a particular implementation of identifiers, although Digital Object Identifiers (DOIs) are listed as an example in the appendix. The purpose of this document is to describe how NASA data repositories will comply with the SPD-41a policy requirement.
  2. DataCite is an organization that was founded specifically to support the creation of DOIs and metadata associated with data sets and to aid in referencing and discovery of research data. DataCite also meets NASA data citation policy requirements. Several NASA data repositories already have a relationship with DataCite and most register DOIs for at least some of their data holdings, but data citation is still an evolving process.
  3. The guidelines described in this document recommend that NASA data repositories use DOIs as persistent identifiers to facilitate citation of data in the scientific literature. The guidelines provide an overview of the scenarios in which NASA data repositories may need to register a DOI and the mechanisms through which they can do so. They provide guidance on the registration process and appropriate roles. The intent is to identify services and define a process that allow data repositories the flexibility they need to respond to the needs of their user communities while also enabling the creation of a central NASA registry of DOIs managed through the Scientific and Technical Information (STI) Program Office.
  4. Much of the document provides background, rationale, context, and definitions. The primary guidance (what a repository should do) is in Section H: Guideline Description. Implementation details are in Section K: Adoption Plan and Timeline.

C. Problem Statement [Why is a guideline needed? What specific problem is being addressed?]

  1. The 8 November 2021 Draft SMD Information policy (SPD-41a) states:
    1. SMD-funded data collections shall be citable using a persistent identifier. SMD should encourage that data users to cite the sources of the information used to conduct peer-reviewed, published research.
    2. SMD-funded data collections shall be indexed as part of the NASA catalog of data.
  2. While the SMD policy does not include a definition of data collection, we interpret it as a logical assemblage of data that can be cited as a whole. What defines a logical assemblage is very contextual and is usually determined by the data repository in collaboration with the data provider.
  3. In some communities, the assignment of persistent identifiers such as DOIs (digital object identifiers) is still a relatively new technique for referencing research products other than traditional publications. The DOI metadata used to describe traditional publications is not generally applicable to research data (volume, journal name, page number, etc.), yet research data is a valuable resource that requires time, funding, and significant effort to produce. To acknowledge the debt that researchers and analysts owe to those who produce high-quality research data, and to enable more transparent research, publishers have begun to require that analysts cite their data (Stall et al. 2019). This has created a related requirement on data repositories to provide persistent identifiers to ensure that future researchers can find the data products that contributed to the scientific findings published. Because this is a recent development, data repositories need guidance in understanding when and how to assign persistent identifiers, finding sources for persistent identifier services, and providing sufficient metadata to meet the goals expressed in SPD-41a. Moreover, the requirement that data be indexed requires a coordinated approach across SMD.

D. Value Proposition [A specific description of who will benefit from the adoption or implementation of the guideline and what tangible impacts should result.]

  1. The beneficiaries of these guidelines will be:
  • NASA Programs. By making data citable and encouraging proposers and authors to cite their data, NASA funded programs will be able to track generations of research returns on the investments made in basic research, exploration, and data analysis.
  • Data Repositories. Data repositories will have clear procedures and contacts for DOI services (obtaining DOIs, updating the associated metadata, etc.) and will be able to respond to the needs of data users who are increasingly required to cite the data they analyze. Where automation of DOI production is possible, the guidelines will inform the requirements on the automation, enabling efficient and consistent generation of new DOIs. The visibility of a data repository’s holdings will be elevated, as the metadata will be visible in DataCite’s DOI metadata catalog and through the catalogs of other organizations that harvest or aggregate metadata from DataCite (such as ADS and CrossRef).
  • Data Providers. By clearly defining what is expected and required and the processes involved, data providers will be able to efficiently supply appropriate metadata and make another important aspect of their research output more visible, and in particular citable. Data identified by a DOI can and will generate citation counts and impact statistics alongside the more traditional research publications.
  • Data Users. Consumers of research data will, first and foremost, be able to cite their source data formally, leading to improved reproducibility/replicability of research results. Through the DOI metadata, they will have improved search options to locate data through the aggregated DOI metadata databases as well as the archives, repositories, and data portals they know now. Additionally, inclusion of the formal licensing notice and the associated URI ensures that data users can be confident they have permission to reuse the data however they might want to, without restriction. (Licensing requirements for NASA-funded data are defined elsewhere.)

E. Scope of Impact [What NASA entities will be affected? What external entities will be involved?]

  1. NASA (and NASA funded) data repositories (institutional data stewards).
  2. Scientific authors using NASA data.
  3. Scientific publishers requesting data citation of NASA funded data.
  4. DataCite, which is currently the only DOI registration agency for data sets.

F. Assumptions [Please list any assumptions you have made that constrain the nature of the guideline. This might include the technologies of those affected, the intent of a policy, the nature of data models, or many other things that could affect the adoption and interpretation of the guideline. As appropriate, indicate your strategy if the assumption proves false. It may also be necessary to define specific terms]

  1. Terminology
    1. Note that in this document, “data” is being used in the sense defined in SPD-41a, section I: “Data includes any scientifically or technically relevant, electronically stored information.” Data resulting from SMD funding can be voluminous, so in this document the term “data collection” is used to refer to an assemblage of “data” that is identified by a single persistent identifier. DataCite defines a data collection as “An aggregation of resources, which may encompass collections of one resourceType as well as those of mixed types. A collection is described as a group; its parts may also be separately described.”
    2. We use the general term “data repository” for SMD-funded archives and data centers which have an institutional commitment to preserve and maintain the data to which they assign a DOI. Levels of curation and long-term stewardship may vary depending on the mission of the data repository and the data in question.
  2. The policy states that data should be citable with a “persistent identifier”. We assume that means a “globally unique and resolvable persistent identifier”.
    1. We accept this OSTP/SOS Definition - Persistent Identifier: A Persistent identifier is a digital identifier that is globally unique, machine resolvable, and with an associated metadata schema that identifies an entity (e.g., individual researcher, digital research object) in perpetuity and is used to disambiguate as well as build associations between entities.
    2. We recognize that persistence requires ongoing social and institutional commitment. It is not solely a technical problem.
  3. Data repositories will be able to define what is an appropriately citable object based on the norms and expectations of their discipline.
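
As an illustration of what “globally unique and machine resolvable” means in practice for a DOI, the following minimal sketch splits a DOI name into its prefix and suffix and builds the corresponding https://doi.org/ resolver link. The function names and the regular expression are assumptions made for this example; the pattern is a deliberately permissive approximation and is not the normative DOI syntax from ISO 26324.

```python
import re
from typing import Tuple
from urllib.parse import quote

# Permissive approximation (an assumption for this sketch): a DOI name is
# "<prefix>/<suffix>", where the prefix starts with "10." followed by digits.
DOI_PATTERN = re.compile(r"^(10\.\d{4,9}(?:\.\d+)*)/(\S+)$")


def split_doi(doi: str) -> Tuple[str, str]:
    """Return (prefix, suffix); raise ValueError if the string is not DOI-shaped."""
    match = DOI_PATTERN.match(doi.strip())
    if match is None:
        raise ValueError(f"not a DOI name: {doi!r}")
    return match.group(1), match.group(2)


def resolver_url(doi: str) -> str:
    """Build the resolver link for a DOI name, percent-encoding the suffix."""
    prefix, suffix = split_doi(doi)
    return f"https://doi.org/{prefix}/{quote(suffix, safe='/')}"


# The DOI of the Joint Declaration of Data Citation Principles (see References).
print(resolver_url("10.25490/a97f-egyk"))
```

A syntactically valid DOI is not necessarily registered; whether it resolves depends on the registration and on the repository keeping the landing page available.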

G. Background and Context [Provide any background necessary to understand the application of the guideline along with the general context in which the guideline is to be applied. Be clear on how NASA sees the issue which may be somewhat distinct from a more generic view. Similarly, clarify how NASA is currently or has historically approached the problem (if it has).]:

  1. Background and current practice

    1. Data citation can have multiple purposes including enabling reference to a particular data set, improving the transparency and verifiability of research, credit attribution for data producers, tracking the reuse and potentially the impact of a data set, and more. The use of persistent, resolvable identifiers specifically aids in the implementation of four of the eight Force11 Data Citation Principles[1]: 4) Unique Identification; 5) Access; 6) Persistence; and 7) Specificity and Verifiability.
    2. Through the aggregation of the associated metadata in public databases, persistent identifiers for data can also serve purposes beyond citation, such as tracking provenance and graphing interconnections between data, software, funding, authors, etc. The purpose here is to focus specifically on the application of persistent identifiers for citation of data in the scientific literature.
    3. NASA-funded data comes from a wide variety of sources (observatories, Earth-orbiting satellites, remote sensing missions, laboratory studies, analytical derivations, etc.) and in a wide range of bulk sizes - from a single table of results, to collections of millions of individual observational products collected in one phase of observations, to accumulating data stores that add data at regular cadences over the course of years. Persistent identifiers are a tool available to the curators of all these types of data to augment the curation and dissemination of the data located in these various data repositories. A DOI, in particular, with its accompanying descriptive metadata and its general acceptance by the research publishing community for use in formal referencing, makes it possible to integrate research datasets into the literature. With a DOI, data can be uniquely identified, linked to its source for distribution and reuse, and discovered by search engines that mine the DOI metadata aggregated in public databases.
    4. In addition to these immediate consequences, the DOI can link the data to the greater body of literature, making it possible to trace connections from the funding agency (identified in the DOI metadata), through the original data products, to the publications and derived data products resulting from analysis of the source data, and down through subsequent generations of analysis, synthesis, and publication. The DOI can enable the data repository to trace the impact of its holdings, and for the original funding agency to point to tangible return on investment in the basic research that gathered the original data. The same metrics allow the authors of research data products to demonstrate the value of the contributions they have made to their community over and above the traditional publications on their CVs. Furthermore, an increasing number of scientific publishers are encouraging, if not requiring, data be cited and available before publication of results. Practices are still evolving rapidly, but typically the publisher will require the data to be referenced with a persistent identifier, usually a DOI.
    5. Consequently, there is increasing demand on data repositories, from both data creators and data users, to provide DOI minting and maintenance services. Standards and leading practices are still in the development stage at this point, and not all data repositories have extensive experience either creating or managing DOIs.
    6. The task group that developed these guidelines conducted a survey of 33 major NASA-supported data repositories asking about any existing data citation guidance and their use of persistent identifiers and permanent links. Although data citation is an evolving practice, the data citation environment at NASA is quite mature. Practices may be ad hoc, but they are very professional and detail oriented. There is solid consensus on the use of DOIs at some level, almost all of which are registered through DataCite (ADS is the only one that also uses CrossRef, because it registers documents). Some DOIs are registered at the division level, some by the archives. Responsibilities for DOI maintenance are variable. A few archives are still exploring the data citation concept.
    7. Guidelines are needed to outline basic procedures and identify sources of DOI services available to NASA data repositories, and to establish expectations and requirements that will ensure a consistent approach to gathering and documenting the metadata that makes the DOI itself a robust link to the growing body of digital knowledge.
    8. DOI Request Scenarios: Because of the diversity in data sources, volumes, and applications, there is no one-size-fits-all approach to DOI services. For the purposes of these guidelines, there are three scenarios that represent the bulk of DOI service requests that data repositories are likely to be expected to respond to:
      1. A Planned Submission. Some data repositories know in advance what data is likely or planned to be submitted. Reserving a DOI for data still in preparation or as it is initially received enables the data repository to provide DOIs as the data become publicly available or as needed in advance of associated publications or proposals to data analysis programs. Most, but not all, NASA data repositories have processes for this situation and have identified some level of aggregation or collection to which they assign a DOI.
      2. A Provider Request. Some data repositories will need to be able to respond to requests for DOI assignment by data providers depositing data in the center. The provider may either want to reference the data in the near term, or wish to be able to track the impact of the data submission through reference tracking databases. This scenario will likely become more common under the new information policy, as more individual researchers and teams submit data produced through NASA grants.
      3. A User Request. A data repository may receive a request from a data user for a DOI to be assigned to a particular set of data so that it can be used to reference the data in a publication. The request may be for an already defined data set or collection which has not been assigned a DOI, or it may be for a custom subset of dynamic data or aggregation of data from multiple sources. This may currently be the least common scenario, but is likely to become more common, and these requests typically arrive with some degree of urgency due to planned publication of the findings.
    9. Each data repository will need to develop DOI policies and procedures appropriate to their holdings, operations, and user communities; consistent with the NASA-wide guidelines described in this document.
  2. Relationship with the DOI Registration Authority

    1. Currently, the DataCite DOI Registration Authority is the only one with metadata designed specifically to support data products, and thus is the de facto standard for data-related DOI services.
    2. DataCite is a membership-driven organization. It currently defines three types of members with different fee structures (see https://datacite.org/become.html):
      1. Member-only: Does not create DOIs, but actively participates in DataCite governance and cooperative endeavors.
      2. Direct Member: Creates and otherwise uses DOI services, as well as participating in Member activities.
      3. Consortium Members: A group of five or more similar organizations with a designated lead that collectively act as a Direct Member. Costs are distributed (or not) by agreement with the Consortium lead.
    3. The potential benefits of joining a consortium for individual repositories would be working with an experienced lead organization that might provide a shared interface to DOI services; and potentially some cost-saving depending on the funding and cost-recovery model of the consortium sponsor.
    4. NASA currently has a consortium membership managed through the Scientific and Technical Information (STI) program, while some data repositories are themselves individual members of DataCite[2]. DataCite membership fees scale with the number of DOIs created each year. The NASA STI consortium membership in DataCite makes it possible to manage costs of DOI creation without requiring each individual data repository to acquire an individual DataCite membership to access DOI services.
    5. There are pros and cons for data repositories to participate in the consortium or to become direct members. In general, DataCite would prefer that NASA consolidate its multiple memberships, but some individual repositories may wish to remain or become direct DataCite members.

H. Guideline Description: [This is the core of the document. It should provide a specific and detailed description of the recommended guidance, including how to access relevant standards and how they will be used within NASA. This could include links to existing guidance, but any unique or agreed aspects of how the guideline will be used within NASA SMD should be described in detail.]

  1. Primary Guidance:
    1. NASA SMD-funded data intended or used for citation in the scientific literature should have a DOI registered through a NASA-designated DOI registration authority.
      1. While most, if not all, data repositories assign identifiers that are unique and persistent to some degree, the value of the DOI is the international network of supporting organizations that are committed to preserving the resolvability of DOIs in perpetuity - past the lifetime of any single DOI Registration Authority or DOI creator. The DOIs are also typically defined by metadata that describes the target object in terms that are contextually relevant, increasing the value of the DOI by linking it directly to searchable databases of metadata. Assigning DOIs to research data incorporates the data into the “published literature”, the network of knowledge that serves as the foundation for future discoveries and achievements. An identifier, persistent or not, that will not outlive the organization that created it cannot offer the permanence and interoperability of a DOI. For research data products, the DOI has become the community standard for identification and resolution. Consequently, these guidelines recommend and assume the use of DOIs, created through a designated Registration Authority with a metadata schema appropriate to describing data, as the appropriate avenue for assigning persistent identifiers to SMD-funded data. Currently, DataCite provides the only DOI Registration Authority and metadata schema suitable for data.
    2. All DOI requests for SMD-funded data should be processed through the NASA data repository that is responsible for archiving and distributing the digital object for which the DOI is being requested. Thus, PIs, science teams, and other data providers should contact the appropriate data repository to request a DOI. No data repository will be expected to issue a DOI for data that it does not manage or preserve.
    3. Metadata requirements for the registration of a DOI should be met by the data repository in collaboration with data providers.
    4. The data repository should be responsible for maintaining the metadata for the digital object as well as the landing page and resolvability of the DOI, as determined by leading practices in their discipline (see the “existing guidance” in the references).
    5. Data repositories should establish their own guidelines on what is required to register and maintain a DOI including how changes to data, different versions of data, and deprecated data will be addressed.
    6. Data repositories that do not currently issue DOIs will need to establish a path to add DOI services to their workflow. There are three options appropriate for different scenarios:
      1. Contact NASA STI and request DOI services (see https://nasa.sharepoint.com/sites/dpiservices - accessible to NASA employees). This would be the recommended solution for data repositories needing very few DOI services in a year - few enough that a manual process for creating or updating DOIs is not onerous.
      2. Join the SMD-wide DataCite Consortium Membership (see https://nasa.sharepoint.com/sites/dpiservices - accessible to NASA employees). Data repositories expecting a higher volume of DOI requests and looking to incorporate DOI services into automated workflows should consider this option. The majority of data repositories would most likely benefit from this option of working with an organization that has broad experience in supporting DOI services through DataCite.
      3. Become a Direct Member of DataCite. Data repositories should consider this option only if they are also planning to become active participants in DataCite governance because it will give them a specific voice in DataCite and its myriad activities and partnerships beyond the general NASA consortium membership.
    7. SMD should coordinate with the STI Program Office to develop and implement an agency registry of DOIs.
  2. Supporting guidance
    1. NASA SMD should consider providing additional guidance on the use of DataCite metadata beyond the minimum required by DataCite. This SMD-wide guidance should focus on administrative and bibliographic information, not on scientific information, which is discipline specific.
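
To make “the minimum required by DataCite” concrete, a repository might check a record before registration with something like the sketch below. It assumes the mandatory properties of DataCite Metadata Schema 4.x (identifier, creator, title, publisher, publication year, resource type) and key names modeled on DataCite's JSON representation; both should be confirmed against the current schema documentation. The function name, the sample record, and the DOI in it are hypothetical placeholders.

```python
# Keys assumed to correspond to DataCite's mandatory properties
# (an assumption to verify against the current schema documentation).
REQUIRED_KEYS = ("doi", "creators", "titles", "publisher", "publicationYear", "types")


def missing_required(record: dict) -> list:
    """List the mandatory keys that are absent or empty in a metadata record."""
    missing = [key for key in REQUIRED_KEYS if not record.get(key)]
    types = record.get("types") or {}
    # The general resource type (e.g. "Dataset") is expected inside "types".
    if "types" not in missing and not types.get("resourceTypeGeneral"):
        missing.append("types.resourceTypeGeneral")
    return missing


# Hypothetical record: every value here is a placeholder, not real metadata.
example_record = {
    "doi": "10.12345/example-collection",
    "creators": [{"name": "Example Science Team"}],
    "titles": [{"title": "Example Data Collection"}],
    "publisher": "Example NASA Data Repository",
    "publicationYear": 2023,
    "types": {"resourceTypeGeneral": "Dataset"},
}

print(missing_required(example_record))
```

Any additional SMD-wide administrative or bibliographic fields could be added to the same check once they are defined.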

I. Guideline Maintenance: [What organization provides the official version of the standards the guideline deploys? This could be an external organization like ISO or W3C, but it could also be an internal body within NASA.]

  1. The NASA Science Data Officer will provide the official version of these guidelines, and will update them as needed. Questions about this guideline should be directed to [generic email address TBD]
  2. Updates may be required in the event that any of the technologies or organizations used or referenced by these guidelines are significantly modified or supplanted by newer technologies or entities. These include:
    1. The DOI specification (ISO 26324) is defined and maintained by the DOI Foundation (https://www.doi.org/). The specification has been stable since its ISO designation in 2012.
    2. The DataCite organization (https://datacite.org/) was founded in 2009 as a not-for-profit to support data citation as a DOI Registration Authority specializing in providing DOIs and metadata for research data output. They provide open access to the metadata they curate, and work closely with other DOI Registration Authorities and members to promote citation, discovery, and reuse of research data.
    3. The NASA Scientific and Technical Information (STI) Program Office is a consortium-member of DataCite and provides DOI services for many NASA projects and organizations. NASA can
    4. The DataCite Metadata Schema includes metadata specific to research data (size, file format, licensing information, contributor roles, etc.). This enables discoverability by providing relevant metadata to raise the visibility of research data to those who seek it in the DataCite database - including the Astrophysics Data System Abstract Service, which harvests metadata from DOI databases.
    5. Individual data repositories should maintain the metadata and a resolvable URL for the landing page of any object identified by a DOI they created/requested.

J. Level of adherence [Describe the anticipated impact on relevant parties and the level and nature of adherence required. Is it normative or just informative? We do not have prescribed categories at the moment, but the scope document may provide some guidance.]

  1. This is a normative guideline. All NASA SMD-funded data repositories which provide data to be cited in the scientific literature should follow these guidelines.

K. Adoption Plan and Timeline: A specific plan for spreading adoption or implementation of the WG Recommendation and other outcomes across relevant NASA organizations.

  1. STI will coordinate the adoption process. STI has the vendor relationship with DataCite for the NASA Consortium Membership and will work with the data repositories who provide the metadata and assign the DOIs to specified data collections. Many data repositories are already implementing these guidelines to some degree. STI needs to connect with those repositories unaware of STI services or the NASA Consortium DataCite membership. STI will need assistance from the Mission Directorates to identify those repositories.
  2. Data repositories not currently assigning DOIs will need to develop processes to do so. STI can help with this.
  3. Data repositories currently assigning DOIs need to consider their relationship with DataCite and consider taking advantage of the NASA Consortium Membership.
  4. STI will coordinate with SMD to compile a defined set of NASA SMD repository accounts and prefixes.
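
As a rough sketch of how a compiled set of repository prefixes could support an agency registry, the example below groups DOIs by the repository that owns each prefix. The prefixes, repository names, and function name are hypothetical placeholders invented for this illustration; they do not correspond to actual NASA accounts.

```python
from collections import defaultdict

# Hypothetical prefix-to-repository table (placeholders, not real assignments).
PREFIX_OWNERS = {
    "10.11111": "Repository A",
    "10.22222": "Repository B",
}


def group_by_repository(dois, prefix_owners):
    """Group DOI names by owning repository, using the part before the first '/'."""
    groups = defaultdict(list)
    for doi in dois:
        prefix = doi.split("/", 1)[0]
        owner = prefix_owners.get(prefix, "unregistered prefix")
        groups[owner].append(doi)
    return dict(groups)


print(group_by_repository(
    ["10.11111/abc", "10.22222/x/y", "10.33333/z"], PREFIX_OWNERS
))
```

DOIs whose prefix is not in the table are collected under "unregistered prefix", which is one way STI could spot repositories it has not yet connected with.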

L. Acknowledgements

  1. This guideline was developed by an expert technical team representing data repositories from the five SMD Divisions as well as the STI Program Office:
  • Dan Berrios, BPS
  • Robert M. Candey, HPD
  • Mitch Gordon, PSD
  • Nathan James, ESD
  • Steve Joy, PSD
  • Mark Parsons, NASA IMPACT, University of Alabama in Huntsville
  • Josh Peek, ASD
  • Anne C. Raugh, University of Maryland, College Park
  • Aaron Roberts, HPD
  • Gerald Steeman, STI

M. References

  • DataCite website and metadata schema.
  • SMD information policy
  • Following are existing broad reviews, principles, and guidelines on data citation:
    1. Task Group on Data Citation Standards and Practices, CODATA/ICSTI. 2013. “Out of Cite, Out of Mind: The Current State of Practice, Policy, and Technology for the Citation of Data.” Data Science Journal 12 (0): CIDCR1–CIDCR75. https://datascience.codata.org/articles/abstract/253/ .
    2. Data Citation Synthesis Group. 2014. Joint Declaration of Data Citation Principles. Martone, Maryanne (eds.). San Diego, CA: Force11. https://doi.org/10.25490/a97f-egyk
    3. Rauber, A., B. Gößwein, C. M. Zwölf, C. Schubert, F. Wörister, J. Duncan, K. Flicker, K. Zettsu, K. Meixner, L. D. McIntosh, R. Jenkyns, S. Pröll, T. Miksa, and M. A. Parsons. 2021. “Precisely and Persistently Identifying and Citing Arbitrary Subsets of Dynamic Data.” Harvard Data Science Review Issue 3.4, Fall 2021: https://doi.org/10.1162/99608f92.be565013.
    4. Stall, S., L. Yarmey, J. Cutcher-Gershenfeld, B. Hanson, K. Lehnert, B. Nosek, M. Parsons, E. Robinson, and L. Wyborn. 2019. “Make scientific data FAIR.” Nature 570 (7759): 27–29. https://doi.org/10.1038/d41586-019-01720-7.
    5. ESIP Data Preservation and Stewardship Committee. 2019. Data Citation Guidelines for Earth Science Data, Version 2. Earth Science Information Partners. https://doi.org/10.6084/m9.figshare.8441816.v1.
    6. NOAA Data Citation Procedural Directive: US Dept of Commerce/NOAA Environmental Data Management Committee (EDMC) - NOAA PD Data Citation
    7. Fenner, M., Crosas, M., Grethe, J.S. et al. A data citation roadmap for scholarly data repositories. Sci Data 6, 28 (2019). https://doi.org/10.1038/s41597-019-0031-8
    8. A Model for Data Citation in Astronomical Research Using Digital Object Identifiers (DOIs) https://ui.adsabs.harvard.edu/abs/2018ApJS..236...20N/abstract

Existing Guidance for data repositories on assigning DOIs to data:

Some guidance for data users on how to reference data from NASA data repositories that can be used along with any guidance provided by publishers:


[1] Data Citation Synthesis Group: Joint Declaration of Data Citation Principles. Martone M. (ed.) San Diego CA: FORCE11; 2014 https://doi.org/10.25490/a97f-egyk

[2] See DataCite membership at https://datacite.org/members.html.