use DVC to version control big data expensive to wrangle/download #18

Open
maxheld83 opened this issue Nov 3, 2021 · 0 comments

@maxheld83
Contributor

There are three sources of data diffs that need to be version controlled, each too big for git:

  1. changes in large datasets (for example: ISSN/ISSN-L) (already covered by document raw data version control best practice #13)
    These changes can have a big impact on the reproducibility of downstream results.
    This is separate from a substantive interest in longitudinal data (for example: cr dumps), where the change over time may be interesting in and of itself.
    For ISSN/ISSN-L, we only care about the mapping as it stands at a given point in time; we have no interest in how it changed historically.
    For cr dumps, we may (e.g. for the development of HOAD) be interested, at any one point in time, in the changes up to that point.
    (Actually, cr dumps for any given month/year can also change after the fact, so that's a source of diffs, too 😐.)
  2. (git-diffed) changes in how we wrangle data. The resulting objects can be so expensive to compute that simply recomputing them from the git history may be too costly; instead, we should keep the objects resulting from these changes as well.
  3. changes in bq queries / tables, for which there's document query version control best practice #12

Source 1 may already be well covered by just storing GCS SHAs or whatever.
Source 2, if needed, should use DVC.
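
To make the "just store a SHA" option for source 1 concrete, here is a minimal sketch using only the Python standard library. It hashes a local copy of a dataset and records the result in a small JSON manifest that is cheap to commit to git. The function names, the manifest layout, and the choice of SHA-256 are assumptions for illustration; nothing in this issue fixes which checksum or storage location would actually be used.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so that large dumps never have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_checksum(data_file: Path, manifest: Path) -> dict:
    """Add or update the entry for data_file in a JSON manifest (hypothetical layout)."""
    entries = json.loads(manifest.read_text()) if manifest.exists() else {}
    entries[data_file.name] = {
        "sha256": sha256_of_file(data_file),
        "bytes": data_file.stat().st_size,
    }
    # Sorted keys keep the manifest diff-friendly when it is committed to git.
    manifest.write_text(json.dumps(entries, indent=2, sort_keys=True) + "\n")
    return entries
```

Committing the manifest, rather than the data, means a changed upstream dataset shows up as a one-line git diff, which is enough to detect the change but not to restore the old data.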

And if we use DVC, we might as well use it for 1, too.
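
The reason one tool could cover both sources is that a stored result can be keyed on the input data and the wrangling code together. The sketch below illustrates that idea with the standard library only; it is not DVC and does not use DVC's API or file formats. `cached_wrangle`, `cache_key`, and the `code_version` argument (assumed here to be something like the git commit SHA of the wrangling code) are hypothetical names.

```python
import hashlib
import pickle
from pathlib import Path


def cache_key(input_file: Path, code_version: str) -> str:
    """Derive one key from the code version (source 2) and the input bytes (source 1)."""
    digest = hashlib.sha256()
    digest.update(code_version.encode())
    with input_file.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cached_wrangle(input_file: Path, code_version: str, wrangle, cache_dir: Path):
    """Return the stored result if data and code are unchanged, else recompute and store."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / f"{cache_key(input_file, code_version)}.pkl"
    if target.exists():
        return pickle.loads(target.read_bytes())
    result = wrangle(input_file)  # the expensive step
    target.write_bytes(pickle.dumps(result))
    return result
```

A change in either the data or the code version produces a new key, so old results stay available next to new ones instead of being overwritten.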

@maxheld83 maxheld83 moved this to Todo in max's scrum(ish) Nov 3, 2021
@maxheld83 maxheld83 self-assigned this Nov 4, 2021