Bug Report
Description
After upgrading to DVC 3.65.0+, dvc data status --not-in-remote --granular incorrectly reports files as "not in remote" even though they have been successfully pushed. This happens when multiple tracked files have identical content (same hash).
The regression was introduced in 3.65.0 by PR #10923 "Use bulk calls for checking which entries are not in remote".
Root Cause
The issue is in the dvc-data package, specifically in dvc_data/index/index.py:
entry_map: dict[str, DataIndexEntry] = {
    self.get(entry)[1]: entry for entry in entries_with_hash
}
This dictionary comprehension keys entries by their path, as returned by self.get(entry)[1]. When multiple files share the same hash, they produce the same key, so each one overwrites the previous and only one entry survives in the map. The other entries are never checked against the actual remote and are therefore incorrectly reported as "not in remote".
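The collapse can be illustrated outside of DVC with a minimal sketch. Everything here is hypothetical: `Entry` stands in for `DataIndexEntry`, and `storage_path` stands in for `self.get(entry)[1]` under the assumption that its result depends only on the content hash. The grouping variant at the end is just one way to keep all entries, not necessarily the fix adopted upstream.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for DataIndexEntry: a workspace key plus a hash."""
    key: str
    hash_value: str


def storage_path(entry: Entry) -> str:
    """Hypothetical stand-in for self.get(entry)[1].

    Assumption: the returned path is derived only from the content hash.
    """
    return f"files/md5/{entry.hash_value[:2]}/{entry.hash_value[2:]}"


entries = [
    Entry("file1.txt", "aabbcc"),
    Entry("file2.txt", "aabbcc"),  # identical content, so identical hash
    Entry("file3.txt", "ddeeff"),
]

# Same shape as the comprehension in the report: one entry per key,
# and a later entry silently replaces an earlier one with the same key.
entry_map = {storage_path(e): e for e in entries}

# Grouping variant: keep every entry that maps to a given path.
grouped: dict[str, list[Entry]] = defaultdict(list)
for e in entries:
    grouped[storage_path(e)].append(e)

collapsed_keys = sorted(e.key for e in entry_map.values())
grouped_keys = sorted(e.key for group in grouped.values() for e in group)
```

In this sketch `file1.txt` drops out of `entry_map` because `file2.txt` overwrites it, which mirrors the single file listed in the actual output of the reproduction script.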
Reproduce
#!/bin/bash
set -ex
# Clean up any previous test
rm -rf test_repo_minimal test_remote_minimal
# Create local remote directory
mkdir -p test_remote_minimal
# Create and initialize test repo
mkdir -p test_repo_minimal
pushd test_repo_minimal
git init
dvc init
# Configure local remote
dvc remote add -d myremote ../test_remote_minimal
# Create two files with identical content
echo "identical content" > file1.txt
echo "identical content" > file2.txt
# Track files with dvc add
dvc add file1.txt
dvc add file2.txt
# Commit to git
git add .
git commit -m "Add two identical files"
# Push to remote
dvc push
# Check data status - this should show nothing if files are pushed
dvc data status --not-in-remote --granular
popd
Actual output:
Not in remote:
(use "dvc push <file>..." to upload files)
file1.txt
Expected
No output from dvc data status --not-in-remote --granular since all files were successfully pushed to remote.
Environment information
Output of dvc doctor:
╰─❯ dvc doctor
DVC version: 3.66.1.dev2+gff8752c3d.d20260108
---------------------------------------------
Platform: Python 3.11.11 on Linux-5.15.0-140-generic-x86_64-with-glibc2.35
Subprojects:
dvc_data = 3.18.0
dvc_objects = 5.2.0
dvc_render = 1.0.2
dvc_task = 0.40.2
scmrepo = 3.6.1
Supports:
http (aiohttp = 3.13.3, aiohttp-retry = 2.9.1),
https (aiohttp = 3.13.3, aiohttp-retry = 2.9.1),
s3 (s3fs = 2025.12.0, boto3 = 1.41.5)
Config:
Global: /home/pchmelar/.config/dvc
System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: nfs4 on 10.11.72.32:/mlops_data/workplace
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/73fdc15e5692b65f491613dd4ea3b5d1
Additional Information:
- Regression introduced in DVC 3.65.0
- Related PR: #10923 "Use bulk calls for checking which entries are not in remote"
- The bug affects any scenario with multiple files having identical content (same hash)
- Also reproducible with DVC pipeline stage outputs (directories containing files with duplicate hashes)