find_probable_copy incorrectly compares original files instead of historical versions #28

@dguido

Summary

The find_probable_copy method in src/vendetect/detector.py has a logic error: it always compares the files from the original detection, regardless of which historical commit is being examined. This defeats the purpose of history traversal and prevents the tool from accurately finding when code was copied between repositories.

Problem Description

The bug is in src/vendetect/detector.py at line 283 in the find_probable_copy method:

def find_probable_copy(self, detection: Detection, max_depth: int | None = None) -> Detection:
    # ... initialization code ...
    
    while to_test:
        test_repo, source_repo, depth = to_test.pop()
        
        # ... depth checking ...
        
        if history:
            new_detections = tuple(self.compare((detection.test,), (detection.source,)))  # BUG
            if new_detections:
                best = min(best, *new_detections)
        
        pv = test_repo.previous_version(detection.test.relative_path)
        spv = source_repo.previous_version(detection.source.relative_path)
        
        # ... rest of the method ...

The problematic line compares detection.test and detection.source, which are File objects from the original detection pointing to the HEAD commits of their respective repositories. However, test_repo and source_repo may be RepositoryCommit objects representing historical versions of the repositories.

Expected Behavior

When traversing history, the method should compare the files at their historical versions, not always compare the HEAD versions. Each iteration should compare the files as they existed at the specific commits being examined.

Actual Behavior

The method always compares the same files from the HEAD commits, regardless of which historical commits are being examined. This means:

  1. The history traversal doesn't actually examine how the files looked in the past
  2. The tool cannot accurately determine when code was copied
  3. The comparison results are identical for every iteration through history
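
To make the three consequences above concrete, here is a minimal sketch of the failure mode. It does not use vendetect itself: ToyRepo, ToyFile, and similarity are hypothetical stand-ins invented for this illustration, and difflib is swapped in for whatever comparison vendetect actually performs. The loop mirrors the shape of the buggy code by ignoring the repository versions it iterates over.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

PATH = "util.py"


@dataclass
class ToyRepo:
    """Hypothetical stand-in for a repository pinned at one commit."""
    label: str
    files: dict  # relative path -> file contents at this commit


@dataclass
class ToyFile:
    """Hypothetical stand-in for a File: a path resolved against one repo version."""
    relative_path: str
    repo: ToyRepo

    def read(self) -> str:
        return self.repo.files[self.relative_path]


def similarity(a: ToyFile, b: ToyFile) -> float:
    """Toy comparison (assumption): difflib ratio, 1.0 means identical."""
    return SequenceMatcher(None, a.read(), b.read()).ratio()


# Histories ordered newest first; the oldest versions are identical.
test_history = [
    ToyRepo("test@HEAD", {PATH: "def add(x, y):\n    return x + y\n"}),
    ToyRepo("test@HEAD~1", {PATH: "def add(a, b):\n    return b + a\n"}),
    ToyRepo("test@HEAD~2", {PATH: "def add(a, b):\n    return a + b\n"}),
]
source_history = [
    ToyRepo("source@HEAD", {PATH: "def add(a, b, c=0):\n    return a + b + c\n"}),
    ToyRepo("source@HEAD~1", {PATH: "def add(a, b):\n    return a + b\n"}),
    ToyRepo("source@HEAD~2", {PATH: "def add(a, b):\n    return a + b\n"}),
]

# The original detection holds files bound to the HEAD versions.
head_test = ToyFile(PATH, test_history[0])
head_source = ToyFile(PATH, source_history[0])

buggy_scores = []
for test_repo, source_repo in zip(test_history, source_history):
    # Same shape as the reported bug: test_repo and source_repo are
    # never used, so every iteration re-compares the HEAD files.
    buggy_scores.append(similarity(head_test, head_source))

# Every iteration produced the same score.
print(len(set(buggy_scores)) == 1)
```

In this toy setup the oldest versions are byte-for-byte identical, yet the loop never sees them because it never reads anything other than the HEAD contents.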

Recommended Fix

Replace line 283 with code that creates new File objects pointing to the correct repository versions:

if history:
    test_file = File(detection.test.relative_path, test_repo)
    source_file = File(detection.source.relative_path, source_repo)
    new_detections = tuple(self.compare((test_file,), (source_file,)))
    if new_detections:
        best = min(best, *new_detections)

This ensures that the comparison uses files from the specific repository versions (potentially RepositoryCommit objects) being examined in each iteration.
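
The effect of rebinding the paths to each repository version can be sketched outside vendetect. The classes and the similarity function below are hypothetical stand-ins, not vendetect's API, and the sketch assumes a higher score means a better match (the real code selects with min over Detection objects, whose ordering is not shown in this issue).

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

PATH = "util.py"


@dataclass
class ToyRepo:
    """Hypothetical stand-in for a repository pinned at one commit."""
    label: str
    files: dict  # relative path -> file contents at this commit


@dataclass
class ToyFile:
    """Hypothetical stand-in for a File: a path resolved against one repo version."""
    relative_path: str
    repo: ToyRepo

    def read(self) -> str:
        return self.repo.files[self.relative_path]


def similarity(a: ToyFile, b: ToyFile) -> float:
    """Toy comparison (assumption): difflib ratio, 1.0 means identical."""
    return SequenceMatcher(None, a.read(), b.read()).ratio()


# Histories ordered newest first; the oldest versions are identical.
test_history = [
    ToyRepo("test@HEAD", {PATH: "def add(x, y):\n    return x + y\n"}),
    ToyRepo("test@HEAD~1", {PATH: "def add(a, b):\n    return b + a\n"}),
    ToyRepo("test@HEAD~2", {PATH: "def add(a, b):\n    return a + b\n"}),
]
source_history = [
    ToyRepo("source@HEAD", {PATH: "def add(a, b, c=0):\n    return a + b + c\n"}),
    ToyRepo("source@HEAD~1", {PATH: "def add(a, b):\n    return a + b\n"}),
    ToyRepo("source@HEAD~2", {PATH: "def add(a, b):\n    return a + b\n"}),
]

results = []  # (score, test label, source label)
for test_repo, source_repo in zip(test_history, source_history):
    # Mirror of the recommended fix: build fresh file objects bound to
    # the repository versions under examination in this iteration.
    test_file = ToyFile(PATH, test_repo)
    source_file = ToyFile(PATH, source_repo)
    results.append((similarity(test_file, source_file), test_repo.label, source_repo.label))

fixed_scores = [score for score, _, _ in results]
best = max(results, key=lambda r: r[0])  # assumption: higher is better here

print(best[1], best[2])
```

With the files rebound per iteration, the scores vary across the history and the best match lands on the oldest pair, which is where the two toy files are identical.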

Impact

  • Severity: High - This bug completely breaks the history analysis feature, making vendetect unable to accurately identify when code was copied
  • Affected versions: Current main branch and likely all recent versions
  • Consequences:
    • Cannot determine the actual commit where code was vendored
    • May report incorrect similarity scores
    • Wastes computational resources comparing the same files repeatedly

Test Case

To verify this bug:

  1. Create two repositories with similar code
  2. Modify the similar code in both repositories over several commits
  3. Run vendetect with history traversal enabled
  4. Observe that the tool doesn't correctly identify when the code was most similar (the probable copy point), and that the result is the same as comparing only the HEAD versions
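
Steps 1 and 2 could be scripted roughly as follows. This is a sketch, not part of vendetect: the helper names (git, build_fixture_repo, build_fixtures), the file name util.py, and the file contents are all made up for illustration, and it assumes a git executable is on PATH. Step 3 is left out because the exact vendetect invocation for history traversal is not given in this issue.

```python
import subprocess
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its trimmed stdout."""
    result = subprocess.run(
        ["git", "-C", str(repo), *args],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def build_fixture_repo(root: Path, name: str, versions: list[str],
                       filename: str = "util.py") -> Path:
    """Create a repository with one commit per entry in `versions` (oldest first)."""
    repo = root / name
    repo.mkdir(parents=True)
    git(repo, "init", "-q")
    # Local identity so commits work on machines without global git config.
    git(repo, "config", "user.email", "fixture@example.com")
    git(repo, "config", "user.name", "Fixture")
    git(repo, "config", "commit.gpgsign", "false")
    for index, content in enumerate(versions):
        (repo / filename).write_text(content)
        git(repo, "add", filename)
        git(repo, "commit", "-q", "-m", f"version {index}")
    return repo


# Hypothetical contents: both repositories start identical, then diverge.
SOURCE_VERSIONS = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    # sum two numbers\n    return a + b\n",
    "def add(a, b, c=0):\n    return a + b + c\n",
]
TEST_VERSIONS = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return b + a\n",
    "def add(x, y):\n    return x + y\n",
]


def build_fixtures(root: Path) -> tuple[Path, Path]:
    """Build the source and test repositories under `root`."""
    source = build_fixture_repo(root, "source", SOURCE_VERSIONS)
    test = build_fixture_repo(root, "test", TEST_VERSIONS)
    return source, test
```

Calling build_fixtures on a temporary directory yields two three-commit repositories whose oldest commits contain identical code, so a working history traversal should report that pair as the probable copy point rather than the HEAD pair.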

Additional Notes

This bug compounds issue #1 (RepositoryCommit checkout failures). Once issue #1 is fixed, this issue becomes apparent: the history traversal runs but doesn't produce meaningful results. Both issues should be fixed together for the history traversal feature to work correctly.
