Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactoring the file copy operations - reflinks with fallback #3581

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

martinhoyer
Copy link
Collaborator

@martinhoyer martinhoyer commented Mar 5, 2025

More or less follow-up on #2791
The way tmt is copying files appears to be quite a performance bottleneck in some scenarios and can cause issues like #3558 (well, reflinks wouldn't help with running out of inodes there).

What makes reflinks interesting, is that it behaves the same as normal cp operation, but doesn't take additional storage space, unless the file is changed.

They do need to be supported by the underlying filesystem. Notably:
Btrfs - Fedora default since F33
✅ XFS - reflinks supported since CentOS Stream 8

❌ ext4 - Ubuntu GitHub action runners
❌ EFS - AWS

Pull Request Checklist

  • implement the feature
  • write the documentation
  • extend the test coverage
  • update the specification
  • mention the version
  • include a release note

@martinhoyer martinhoyer added code | utils Various utility functions and classes used across the code ci | full test Pull request is ready for the full test execution labels Mar 5, 2025
@martinhoyer martinhoyer self-assigned this Mar 5, 2025
cache_path = cache_dir / cache_key

if item.is_dir():
dst_item.mkdir(parents=True, exist_ok=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand this bit: what happens to the content of item, being a directory?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be honest I'm not entirely confident about the whole "fallback" code. Hoping someone more knowledgeable can validate it. Too bad testing-farm is running on aws filesystem and it's kinda needed.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't blame just AWS. I think we shouldn't lock us with supporting just two very modern filesystems. ext4 is by no means rare.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blame at all. I meant it more like we could keep the existing shutil.copytree as fallback instead if of this custom cache that could be hard to maintain.

# significantly higher inode consumption and potential performance issues.
# Use a persistent cache for hardlinks
cache_dir = Path(tempfile.gettempdir()) / 'tmt_file_cache'
cache_dir.mkdir(exist_ok=True)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When is cache_dir going to be removed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no good solution at this point, other than it being in tmpfs? Ideas welcome :)


# Try reflink copy first (supported by btrfs, xfs with reflink, and some other filesystems)
try:
if logger:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logger should be mandatory, if there are debug messages, let's get them logged.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, leftover from a change. Thanks.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

Args:
src: Source directory path
dst: Destination directory path
logger: Logger to use for debug messages (optional)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The more spread format is :param foo: description..., with no Args.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed


def copy_tree(
src: Union[str, Path],
dst: Union[str, Path],
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a path-like string that is not Path, we should fix that callsite. I for one would not like to allow path-like strings.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sense, I was thinking about shutil.copytree, which is dumb.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed


# Create a path in cache using a hash of the relative path to avoid path length issues
cache_key = relative_path_to_cache_key(relative_path)
cache_path = cache_dir / cache_key
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cache_path = cache_dir / relative_path_to_cache_key(relative_path) ? cache_key does not seem to be used anywhere else.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, still a draft - especially the hardlink, cache.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addressed

return


def relative_path_to_cache_key(relative_path: Union[str, Path]) -> str:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hashlib is full of hashing functions, why not using one from this standard library package?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I don't think we need hash for this. This is checksum based, which fits the purpose imho. zlib is also built-in (double checked 3.9 as well).

@martinhoyer
Copy link
Collaborator Author

@happz hold on, it's still a draft :)

@martinhoyer martinhoyer force-pushed the feature/efficient-directory-copy branch 4 times, most recently from 146d232 to f30d78b Compare March 19, 2025 16:48
copy_tree function in utils/filesystem.py optimizes directory copies by:
1. Using reflinks when available (btrfs, xfs) as primary strategy
2. Falling back to hardlinks with persistent caching for unchanged files
3. Using standard copy for files that changed

This should significantly reduce inode consumption during tmt runs,
especially beneficial for CI/CD environments and large repositories.
@martinhoyer martinhoyer force-pushed the feature/efficient-directory-copy branch from f30d78b to f76ab76 Compare March 20, 2025 15:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci | full test Pull request is ready for the full test execution code | utils Various utility functions and classes used across the code
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants