In computationally demanding data analysis pipelines, the targets R package maintains an up-to-date set of results while skipping tasks that do not need to rerun. This saves time and enhances the reproducibility of the end product. However, it also overwrites old output with new output, so past results disappear by default. Two major enhancements to the targets ecosystem now preserve historical output. The first is version-aware cloud storage. If you opt into Amazon-backed storage formats and supply an Amazon S3 bucket with versioning turned on, the pipeline metadata automatically records the version ID of each target. That way, if the metadata file is tracked in the pipeline's source code version control repository, the user can roll back to a previous code commit and automatically recover the old data, all without invalidating any targets or cueing the pipeline to rerun. The second is gittargets, an alternative, cloud-agnostic data version control system. The gittargets package captures version-controlled snapshots of the local data store, and each snapshot points to the underlying commit of the source code. That way, when the user rolls back the code to a previous branch or commit, gittargets recovers the data contemporaneous with that commit, so all targets remain up to date. With cloud versioning and gittargets, the targets package now combines the virtues of both Airflow-like and Make-like tools.
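As a rough illustration of the first enhancement, a `_targets.R` file might opt into S3 storage along the following lines. This is a minimal sketch, not the talk's own code: it assumes a recent targets release in which the `repository = "aws"` option plays the role of the Amazon-backed storage formats mentioned above, and the bucket name, prefix, and target names are hypothetical. Running it requires AWS credentials and a bucket that already has versioning enabled.

```r
# _targets.R: sketch of a pipeline whose targets live in a versioned S3 bucket.
library(targets)

tar_option_set(
  repository = "aws", # store target data in S3 instead of the local data store
  resources = tar_resources(
    aws = tar_resources_aws(
      bucket = "my-versioned-bucket", # hypothetical; versioning must be enabled
      prefix = "my-project"           # hypothetical key prefix inside the bucket
    )
  )
)

# Two toy targets (hypothetical names) so that the pipeline has something to build.
list(
  tar_target(raw_data, data.frame(x = seq_len(10))),
  tar_target(result, mean(raw_data$x))
)
```

For the second enhancement, the usual gittargets workflow is to call `tar_git_init()` once, call `tar_git_snapshot()` after `tar_make()` and a code commit, and call `tar_git_checkout()` after checking out an older code commit to restore the matching data snapshot.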
useR! 2022 talk
wlandau/user-conf-2022