There are three sources of data diffs to be version controlled, each too big for git:

1. Changes in large datasets (for example: ISSN/ISSN-L), already covered by "document raw data version control best practice" (#13).
   These changes can have a big impact on the reproducibility of downstream results.
   This is separate from a substantive interest in longitudinal data (for example: cr dumps), where the change over time may be interesting in and of itself.
   - For ISSN/ISSN-L, at any given point in time we care only about the current mapping; we have no interest in how it changed historically.
   - For cr dumps, we may (e.g. for the development of HOAD) at any one point in time be interested in the changes up to that point.

   (Actually, cr dumps for any given month/year can also change after the fact, so that's a source of diffs, too 😐.)
2. Changes in how we wrangle data (themselves diffed in git). The resulting objects can be so expensive to compute that simply recomputing them from the git history may not be feasible; instead, we should keep versions of these objects as well.
Source 1 may already be well covered by just storing GCS object hashes or similar.
Source 2, if needed, should use DVC.
And if we use DVC anyway, we might as well use it for source 1, too.
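
To make the "just store hashes" option concrete, here is a minimal sketch of recording content hashes for raw data files in a small manifest that is itself cheap enough to commit to git. It assumes the files are available locally and uses SHA-256; the function names `sha256_of_file` and `write_manifest`, the manifest layout, and the file names are hypothetical and not part of any existing tooling in this project. DVC would replace this with its own metadata files.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large dumps never have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_paths, manifest_path):
    """Write a JSON manifest mapping each data file name to its digest.

    The manifest is small, so it can be committed to git even though
    the data files themselves cannot (hypothetical layout).
    """
    manifest = {
        Path(p).name: sha256_of_file(p) for p in sorted(map(str, data_paths))
    }
    text = json.dumps(manifest, indent=2, sort_keys=True) + "\n"
    Path(manifest_path).write_text(text)
    return manifest
```

A changed digest in the committed manifest then shows up as an ordinary git diff, which marks the point at which downstream results may stop being reproducible.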