status: don't recompute checksums and hashes for each stage with a shared dependency #10604

Open
dluks opened this issue Oct 24, 2024 · 3 comments

dluks commented Oct 24, 2024

Consider a pipeline that looks something like this:

"prep_data" -> "train_models";
"train_models" -> "predict";
"train_models" -> "analyze";
"train_models" -> "validate";
"predict" -> "publish";
"analyze" -> "publish";
"validate" -> "publish";

Each time dvc status (or commit or repro) is run, DVC collects files and computes hashes for every dependency of every stage, independently of any stage that was already checked, committed, or reproduced before it. This means that the output of train_models is re-hashed/checksummed first for train_models, then for predict, then for analyze, and then for validate, even though nothing between those stages (after the initial train_models stage) could have changed the output of train_models.

Additionally, although the outputs of predict, analyze, and validate are collected/hashed/checksummed during their own execution/status/etc., their hashes are recomputed yet again for the publish stage.

For a project like mine, where each stage outputs rather large files (~66 files totaling ~120GB), this takes at least an hour. This is especially problematic when I make minor updates to upstream files that don't affect outputs and need to recommit or re-run dvc status.

It seems like there should be some persistence between downstream stages that share dependencies to reduce this redundancy.
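
To make the request concrete, here is a minimal sketch of what such persistence could look like within a single command invocation. It is not DVC code: `RunScopedHashCache` and `check_stages` are hypothetical names, and the sketch assumes that nothing modifies the files while the command is running.

```python
import hashlib
from pathlib import Path


class RunScopedHashCache:
    """Memoize file hashes for the lifetime of one command (hypothetical helper)."""

    def __init__(self):
        self._hashes = {}
        self.computed = 0  # number of real hash computations, for illustration

    def md5(self, path):
        key = Path(path).resolve()
        if key not in self._hashes:
            digest = hashlib.md5()
            with open(key, "rb") as fobj:
                # Read in 1 MiB chunks so large files are not loaded at once.
                for chunk in iter(lambda: fobj.read(1024 * 1024), b""):
                    digest.update(chunk)
            self._hashes[key] = digest.hexdigest()
            self.computed += 1
        return self._hashes[key]


def check_stages(stage_deps, cache):
    """Return {stage: {dep: md5}}, hashing each distinct file only once."""
    return {
        stage: {dep: cache.md5(dep) for dep in deps}
        for stage, deps in stage_deps.items()
    }
```

With predict, analyze, and validate all listing the same train_models output, `cache.computed` stays at 1 after `check_stages` returns. For repro, an entry would have to be dropped whenever a stage rewrites that path.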


Side note, though perhaps indicative of the underlying issue: when committing a stage whose dependencies have changed, DVC builds the dependency tree twice, once to determine that the deps have changed, and once more after I answer "y" to confirm the commit.

For example:

$ dvc commit analyze -v
2024-10-24 11:36:35,380 DEBUG: v3.56.0 (pip), CPython 3.10.12 on Linux-5.15.0-122-generic-x86_64-with-glibc2.35
2024-10-24 11:36:35,381 DEBUG: command: <HOME_DIR>/.local/bin/dvc commit analyze -v
2024-10-24 11:36:35,576 DEBUG: Checking if stage 'analyze' is in 'dvc.yaml'
2024-10-24 11:36:35,640 DEBUG: Lockfile 'dvc.lock' needs to be updated.
2024-10-24 11:37:13,915 DEBUG: built tree 'object 3772f172da6d2560f8b1922c7300ee46.dir'    <----- Computes once
dependencies ['src/models/multires_stats.py', 'models/Shrub_Tree_Grass/001', 'params.yaml'] of stage: 'analyze' changed. Are you sure you want to commit it? [y/n] y
2024-10-24 11:37:42,012 DEBUG: built tree 'object 3772f172da6d2560f8b1922c7300ee46.dir'    <----- Computes again
2024-10-24 11:37:42,057 DEBUG: Computed stage: 'analyze' md5: '3213ffd0fb4741f26c242b6f78879476'
Updating lock file 'dvc.lock'
2024-10-24 11:37:42,215 DEBUG: Analytics is disabled.

Note that the hash was computed for the dependency twice...


dluks commented Oct 24, 2024

For some additional context, in my case the DVC cache is located on a NAS drive (CIFS, unfortunately) and the cache type is "symlink".

$ dvc doctor
DVC version: 3.56.0 (pip)
-------------------------
Platform: Python 3.10.12 on Linux-5.15.0-122-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.16.6
        dvc_objects = 5.1.0
        dvc_render = 1.0.2
        dvc_task = 0.3.0
        scmrepo = 3.3.8
Supports:
        gs (gcsfs = 2024.3.1),
        http (aiohttp = 3.9.1, aiohttp-retry = 2.8.3),
        https (aiohttp = 3.9.1, aiohttp-retry = 2.8.3)
Config:
        Global: <USER DIR>/.config/dvc
        System: /etc/xdg/dvc
Cache types: symlink
Cache directory: cifs on //<NAS DRIVE>
Caches: local
Remotes: local
Workspace directory: ext4 on <LOCAL SYSTEM>
Repo: dvc, git
Repo.site_cache_dir: <LOCAL SYSTEM>

skshetry (Member) commented

Can you please share profiling data? See https://github.com/iterative/dvc/wiki/Debugging,-Profiling-and-Benchmarking-DVC#generating-cprofile-data.
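
Once a dump file exists, it can be inspected locally with Python's standard `pstats` module. A small sketch follows; `top_functions` is a hypothetical helper, and the path passed to it is whatever file the profiling run wrote.

```python
import pstats


def top_functions(dump_path, limit=20, sort_key="cumulative"):
    """Print the `limit` most expensive entries of a cProfile dump file."""
    stats = pstats.Stats(dump_path)
    # strip_dirs() shortens file paths; sort_stats() orders by the given key.
    stats.strip_dirs().sort_stats(sort_key).print_stats(limit)
    return stats
```

Sorting by `"cumulative"` tends to surface the high-level call that dominates the run, while `"tottime"` points at the individual functions where the time is actually spent.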

DVC caches checksums, so it does not recompute the checksum of each file (the log message may suggest otherwise, but it refers to the Output as a whole, not to individual files). As you said, DVC does stat files once for each stage that depends on them, which may be expensive depending on the filesystem, hardware, OS, etc.
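
The difference between re-hashing and re-stat-ing can be illustrated with a simplified sketch. This is not DVC's actual implementation: `StatKeyedChecksumCache` is a hypothetical class, and the (inode, mtime, size) signature is an assumption chosen for illustration.

```python
import hashlib
import os


class StatKeyedChecksumCache:
    """Reuse a stored checksum while a file's stat signature is unchanged."""

    def __init__(self):
        self._entries = {}  # path -> (signature, checksum)
        self.stat_calls = 0
        self.hash_calls = 0

    def _signature(self, path):
        self.stat_calls += 1
        st = os.stat(path)
        return (st.st_ino, st.st_mtime_ns, st.st_size)

    def checksum(self, path):
        signature = self._signature(path)  # always touches the filesystem
        entry = self._entries.get(path)
        if entry is not None and entry[0] == signature:
            return entry[1]  # contents are not read again
        self.hash_calls += 1
        with open(path, "rb") as fobj:
            value = hashlib.md5(fobj.read()).hexdigest()
        self._entries[path] = (signature, value)
        return value
```

Looking up the same unchanged file for four stages costs one hash but four stat calls. On a network filesystem each of those stat calls may be a round trip, which could add up over many files.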

skshetry added the "awaiting response" label Oct 24, 2024

dluks commented Oct 25, 2024

Hopefully I did this right. Let me know if I should change something about my profiling setup.

cprofile dump: https://file.io/c2ChbfBj8Z1H

shcheklein added the "triage" label and removed the "awaiting response" label Nov 10, 2024