This R package provides a workflow and tools to manage and publish research datasets to online repositories. It includes functions to harvest metadata from a metadata database, or "Metabase", create EML documents, and publish datasets to the EDI data repository. The Metabase is expected to follow the LTER-core-metabase schema.
jerald depends on a few required R packages:
- MetaEgress to access the Metabase
- aws.s3
- EDIutils
It also requires credentials to use the Metabase and web resources.
The package was developed by the Jornada LTER, so jerald is a bit specific to LTER sites and their research data workflows. It should, however, be somewhat extensible for other sites and purposes.
Install the GitHub version with devtools:
devtools::install_github("jornada-im/jerald")
The requirements listed above should be pulled in at install time if you don't already have them.
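To confirm that the requirements were actually pulled in, a quick check like the following can help. This is a minimal sketch using only base R; it reports which of the packages named above can be loaded and installs nothing.

```r
# Packages named in this README; jerald itself is included in the check.
required <- c("MetaEgress", "aws.s3", "EDIutils", "jerald")

# TRUE for each package whose namespace can be loaded, FALSE otherwise.
installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

if (!all(installed)) {
  message("Not installed: ", paste(required[!installed], collapse = ", "))
}
```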
Once jerald is installed you will need to direct it to stored credentials for accessing the metadata database and the necessary web resources, which include a data repository (usually EDI) and an s3 bucket. Ask your lead IM or administrator for help with this.
jerald has several user-facing functions, described below, that will help accomplish most data management and publication tasks performed by information managers at the Jornada. A template script that uses these functions to update an example dataset (id=210000000) on EDI is provided when making a new dataset directory using the instructions below, or is available at inst/template/build_eml.210000000.R. This script gives a demonstration of a basic Jornada workflow.
jerald includes a number of templates (see inst/template/) for setting up a data management project directory and scripts. To create a new dataset directory and populate it with recommended files and subdirectories, run:
template_dataset_dir(datasetid)
where datasetid is a unique identifier for the dataset in the Metabase and at EDI.
An existing dataset directory formatted for use with EMLassemblyline (EAL) can be migrated to jerald's template format using:
migrate_eal_dir(eal.dir, jerald.dir)
where eal.dir is the path to the EAL directory, and jerald.dir is the path to a new dataset directory created with template_dataset_dir. All metadata and data files from eal.dir should be copied into a new EAL_archive directory in jerald.dir. Be aware that there is the potential for data or metadata loss here, so check that all necessary metadata has been copied before deleting the old directory.
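One way to run that check before deleting anything is to compare file names between the old directory and the archive. The sketch below uses only base R. The helper missing_from_archive is hypothetical (it is not part of jerald), and it compares file names only, since how migrate_eal_dir lays out files inside EAL_archive is an assumption here. The temporary directories stand in for a real eal.dir and jerald.dir.

```r
# Hypothetical helper, not part of jerald: lists file names found in the
# old EAL directory that do not appear anywhere in the archive directory.
# Only names are compared, not file contents.
missing_from_archive <- function(eal.dir, archive.dir) {
  stopifnot(dir.exists(eal.dir), dir.exists(archive.dir))
  src <- basename(list.files(eal.dir, recursive = TRUE))
  dst <- basename(list.files(archive.dir, recursive = TRUE))
  setdiff(src, dst)
}

# Demonstration with temporary directories standing in for real paths.
eal.dir <- tempfile("eal_")
dir.create(eal.dir)
archive.dir <- file.path(tempfile("jerald_"), "EAL_archive")
dir.create(archive.dir, recursive = TRUE)

writeLines("abstract text", file.path(eal.dir, "abstract.txt"))
writeLines("methods text", file.path(eal.dir, "methods.txt"))

# Simulate an incomplete migration: only one of the two files is copied.
file.copy(file.path(eal.dir, "abstract.txt"), archive.dir)

missing <- missing_from_archive(eal.dir, archive.dir)
if (length(missing) > 0) {
  message("Not found in archive: ", paste(missing, collapse = ", "))
}
```

An empty result only shows that every file name is present in the archive. It does not confirm that the contents match, so a manual look at the key metadata files is still worthwhile.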
jerald doesn't prepare data entities directly yet, but there is a template script called build_dataset.210000000.R that demonstrates how to prepare a tabular data entity. This template script will be created in any new jerald dataset directory (using template_dataset_dir) or can be found in inst/template/.
There are several user functions for publishing data to EDI. The first step is to load credentials for your Metabase and web resources:
load_metabase_cred(mb.pathname)
load_destination_cred(dest.path)
where mb.pathname is the path to a Metabase credentials file and dest.path is the directory path containing a jerald_destination_keys.R file. Again, your lead IM or system administrator can provide templates for the Metabase credentials file, and will likely set up the destination credentials file for you.
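Before loading credentials, it can help to confirm that both files are where you expect them. The following is a minimal sketch in base R. The helper check_cred_paths is hypothetical (not part of jerald), and both paths are placeholders to replace with your own.

```r
# Hypothetical helper, not part of jerald: reports whether the Metabase
# credentials file and the destination keys file exist at the given paths.
check_cred_paths <- function(mb.pathname, dest.path) {
  c(
    metabase = file.exists(mb.pathname),
    destination = file.exists(file.path(dest.path, "jerald_destination_keys.R"))
  )
}

# Placeholder paths (assumptions): substitute your own locations.
mb.pathname <- "path/to/metabase_credentials_file"
dest.path <- "path/to/destination_keys_directory"

found <- check_cred_paths(mb.pathname, dest.path)

# Load credentials only if both files exist and jerald is installed.
if (all(found) && requireNamespace("jerald", quietly = TRUE)) {
  mb <- jerald::load_metabase_cred(mb.pathname)
  dest <- jerald::load_destination_cred(dest.path)
} else {
  message("Missing: ", paste(names(found)[!found], collapse = ", "))
}
```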
Once the credentials have been loaded you can create and upload datasets at EDI (or possibly other repositories in the future...). To update an existing dataset on EDI use:
update_dataset_edi(datasetid, mb.name, mb.cred, edi.cred)
where datasetid is the unique identifier for the dataset in your Metabase and EDI; mb.name is the name of your Metabase and is returned by load_metabase_cred; mb.cred is a list of credentials for Metabase and is returned by load_metabase_cred; edi.cred is a list of credentials for EDI and is returned by load_destination_cred. For safety, the default arguments for this function will either do a "dry-run" of the process (no update to EDI), or will update the data package at the EDI "staging" repository. You can change both behaviors using the edi.env and publish arguments once you are sure you are ready to publish. Note that a dataset is called a data package in EDI terms.
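Since the accepted values for edi.env and publish are not listed here, one way to see the current defaults without guessing is to inspect the function's formal arguments. This is a minimal sketch in base R. The helper arg_defaults is hypothetical (not part of jerald), and the lookup runs only if jerald is installed.

```r
# Hypothetical helper, not part of jerald: returns the default values of
# the named arguments of a function, skipping names the function lacks.
arg_defaults <- function(fn, args) {
  f <- as.list(formals(fn))
  f[intersect(args, names(f))]
}

# Print the defaults that control where, and whether, the update is sent.
if (requireNamespace("jerald", quietly = TRUE)) {
  print(arg_defaults(jerald::update_dataset_edi, c("edi.env", "publish")))
}
```

The function's help page, if one is provided, is the better source for the full set of allowed values.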
To create a new dataset on EDI use:
create_dataset_edi(datasetid, mb.name, mb.cred, edi.cred)
where the arguments are the same as above. Note that this will be revision "1" of the data package, so make sure your EML reflects this before publishing.
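A quick check of the revision can be done on the package identifier taken from your EML. This is a minimal sketch in base R, assuming the identifier follows the EDI scope.identifier.revision pattern. The helper package_revision is hypothetical (not part of jerald), and the identifier shown is illustrative only.

```r
# Hypothetical helper, not part of jerald: extracts the revision number
# from a package identifier of the form scope.identifier.revision.
package_revision <- function(package.id) {
  parts <- strsplit(package.id, ".", fixed = TRUE)[[1]]
  as.integer(parts[length(parts)])
}

# Illustrative identifier (an assumption): replace with the one in your EML.
package.id <- "knb-lter-jrn.210000000.1"

if (package_revision(package.id) != 1L) {
  message("A new data package is expected to be revision 1.")
}
```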
There are also some lower-level functions that might occasionally be useful. These are not yet documented.