
data status: --not-in-remote incorrectly reports files with duplicate hashes as not pushed #10959

@petrchmelar

Bug Report

Description

After upgrading to DVC 3.65.0+, dvc data status --not-in-remote --granular incorrectly reports files as "not in remote" even though they have been successfully pushed. This happens when multiple tracked files have identical content (same hash).

The regression was introduced in 3.65.0 by PR #10923 "Use bulk calls for checking which entries are not in remote".

Root Cause

The issue is in the dvc-data package, specifically in dvc_data/index/index.py:

https://github.com/treeverse/dvc-data/blob/6301414f1c8feb46d3ed7077555c80fd7a08fe6b/src/dvc_data/index/index.py#L270-L272

        entry_map: dict[str, DataIndexEntry] = {
            self.get(entry)[1]: entry for entry in entries_with_hash
        }

This dictionary comprehension keys each entry by the storage path returned by self.get(entry)[1]. That path appears to be derived from the entry's hash, so entries for files with identical content collide on the same key and only the last one survives in the map. The other entries are dropped, never checked against the actual remote, and therefore incorrectly reported as "not in remote".
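The collision can be illustrated outside of DVC with a minimal sketch. Entry and storage_path below are hypothetical stand-ins for DataIndexEntry and self.get(entry)[1], not the real dvc-data API, and the hash value is made up. The grouped map at the end is only one possible way to keep all entries, not necessarily the fix the maintainers would choose.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for DataIndexEntry: a workspace key plus a content hash."""
    key: str
    md5: str


def storage_path(entry: Entry) -> str:
    """Hypothetical stand-in for self.get(entry)[1]: a path derived only from the hash."""
    return f"files/md5/{entry.md5[:2]}/{entry.md5[2:]}"


entries = [
    Entry("file1.txt", "0123456789abcdef0123456789abcdef"),
    Entry("file2.txt", "0123456789abcdef0123456789abcdef"),  # same content, same hash
]

# Pattern from the issue: one slot per storage path, so a later entry
# with the same hash overwrites the earlier one.
lossy_map = {storage_path(e): e for e in entries}

# One possible alternative: keep every entry that shares a storage path.
grouped_map = defaultdict(list)
for e in entries:
    grouped_map[storage_path(e)].append(e)

print(len(lossy_map), [e.key for e in lossy_map.values()])
print({path: [e.key for e in group] for path, group in grouped_map.items()})
```

In this sketch the lossy map keeps only file2.txt, which is consistent with file1.txt being the one reported in the reproduction below.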

Reproduce

#!/bin/bash
set -ex

# Clean up any previous test
rm -rf test_repo_minimal test_remote_minimal

# Create local remote directory
mkdir -p test_remote_minimal

# Create and initialize test repo
mkdir -p test_repo_minimal

pushd test_repo_minimal

git init
dvc init

# Configure local remote
dvc remote add -d myremote ../test_remote_minimal

# Create two files with identical content
echo "identical content" > file1.txt
echo "identical content" > file2.txt

# Track files with dvc add
dvc add file1.txt
dvc add file2.txt

# Commit to git
git add .
git commit -m "Add two identical files"

# Push to remote
dvc push

# Check data status - this should show nothing if files are pushed
dvc data status --not-in-remote --granular

popd

Actual output:

Not in remote:
  (use "dvc push <file>..." to upload files)
        file1.txt

Expected

No output from dvc data status --not-in-remote --granular since all files were successfully pushed to remote.
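To cross-check the expected result independently of dvc data status, one can look for the pushed objects in the local remote directly. The sketch below assumes the DVC 3.x object layout files/md5/<first 2 hex chars>/<remaining chars> and that the hash is the plain MD5 of the file bytes. Both are assumptions about the storage format, and the helper names are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path: Path) -> str:
    """Return the hex MD5 digest of a file's raw bytes."""
    return hashlib.md5(path.read_bytes()).hexdigest()


def remote_object_path(remote: Path, md5: str) -> Path:
    # ASSUMPTION: a DVC 3.x local remote stores objects under
    # files/md5/<first 2 hex chars>/<remaining hex chars>.
    return remote / "files" / "md5" / md5[:2] / md5[2:]


def pushed_status(remote: Path, files: list[Path]) -> dict[str, bool]:
    """Map each file name to whether an object with its hash exists in the remote."""
    return {f.name: remote_object_path(remote, md5_of(f)).is_file() for f in files}
```

With the reproduction script's layout, calling pushed_status(Path("test_remote_minimal"), [Path("test_repo_minimal/file1.txt"), Path("test_repo_minimal/file2.txt")]) from the parent directory should report both files as present if the push succeeded, since they resolve to the same object.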

Environment information

Output of dvc doctor:

╰─❯ dvc doctor 
DVC version: 3.66.1.dev2+gff8752c3d.d20260108
---------------------------------------------
Platform: Python 3.11.11 on Linux-5.15.0-140-generic-x86_64-with-glibc2.35
Subprojects:
        dvc_data = 3.18.0
        dvc_objects = 5.2.0
        dvc_render = 1.0.2
        dvc_task = 0.40.2
        scmrepo = 3.6.1
Supports:
        http (aiohttp = 3.13.3, aiohttp-retry = 2.9.1),
        https (aiohttp = 3.13.3, aiohttp-retry = 2.9.1),
        s3 (s3fs = 2025.12.0, boto3 = 1.41.5)
Config:
        Global: /home/pchmelar/.config/dvc
        System: /etc/xdg/dvc
Cache types: <https://error.dvc.org/no-dvc-cache>
Caches: local
Remotes: None
Workspace directory: nfs4 on 10.11.72.32:/mlops_data/workplace
Repo: dvc, git
Repo.site_cache_dir: /var/tmp/dvc/repo/73fdc15e5692b65f491613dd4ea3b5d1

Additional Information:
