Refactoring the file copy operations - reflinks with fallback #3581

martinhoyer · 2025-03-05T15:53:03Z

More or less a follow-up on #2791.
The way tmt copies files appears to be a performance bottleneck in some scenarios and can cause issues like #3558 (although reflinks wouldn't help with running out of inodes there).

What makes reflinks interesting is that a reflink copy behaves the same as a normal cp operation, but takes no additional storage space until the file is changed.

They do need to be supported by the underlying filesystem. Notably:
✅ Btrfs - Fedora default since F33
✅ XFS - reflinks supported since CentOS Stream 8

❌ ext4 - Ubuntu GitHub action runners
❌ EFS - AWS
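
Since support depends on the filesystem, one way to find out is to simply try a clone in the target directory. The following is only a sketch of such a probe, not code from this PR: the helper name `supports_reflink` is hypothetical, and the numeric `FICLONE` value is the Linux `_IOW(0x94, 9, int)` encoding as it appears on x86 and ARM, which is an assumption for other architectures.

```python
import fcntl
import tempfile
from pathlib import Path

# FICLONE ioctl request from <linux/fs.h>: _IOW(0x94, 9, int).
# Assumption: this numeric value matches the x86/ARM encoding; Linux only.
FICLONE = 0x40049409


def supports_reflink(directory: Path) -> bool:
    """Return True if cloning one file into another works inside ``directory``."""
    with tempfile.NamedTemporaryFile(dir=directory) as src, \
            tempfile.NamedTemporaryFile(dir=directory) as dst:
        src.write(b"probe")
        src.flush()
        try:
            # Ask the filesystem to make dst share the data blocks of src.
            fcntl.ioctl(dst.fileno(), FICLONE, src.fileno())
        except OSError:
            # The filesystem (e.g. ext4) refused the clone request.
            return False
        return True
```

A caller could run the probe once per destination and pick the copy strategy based on the result, instead of handling a failure for every single file.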

Pull Request Checklist

happz · 2025-03-05T16:07:20Z

tmt/utils/filesystem.py

+        cache_path = cache_dir / cache_key
+
+        if item.is_dir():
+            dst_item.mkdir(parents=True, exist_ok=True)


I don't understand this bit: what happens to the content of item, being a directory?

To be honest I'm not entirely confident about the whole "fallback" code. Hoping someone more knowledgeable can validate it. Too bad testing-farm is running on an AWS filesystem, so it's kinda needed.

I wouldn't blame just AWS. I think we shouldn't lock ourselves into supporting just two very modern filesystems. ext4 is by no means rare.

No blame at all. I meant it more like we could keep the existing shutil.copytree as the fallback instead of this custom cache, which could be hard to maintain.
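
To make the "keep shutil.copytree as fallback" idea concrete, here is a rough sketch that plugs a reflink attempt into `shutil.copytree` through its `copy_function` parameter and falls back to `shutil.copy2` per file. This is not the code under review: `reflink_or_copy` is a hypothetical helper, the `copy_tree` signature is simplified (no logger), and the `FICLONE` value assumes the Linux x86/ARM ioctl encoding.

```python
import fcntl
import shutil
from pathlib import Path

# FICLONE ioctl request from <linux/fs.h>: _IOW(0x94, 9, int).
# Assumption: x86/ARM encoding; Linux only.
FICLONE = 0x40049409


def reflink_or_copy(src: str, dst: str) -> str:
    """Clone src to dst when the filesystem allows it, else do a regular copy."""
    try:
        with open(src, "rb") as source, open(dst, "wb") as target:
            fcntl.ioctl(target.fileno(), FICLONE, source.fileno())
        # The clone shares data only; copy permissions and timestamps too.
        shutil.copystat(src, dst)
    except OSError:
        # Reflink not supported here (e.g. ext4): overwrite dst with a real copy.
        shutil.copy2(src, dst)
    return dst


def copy_tree(src: Path, dst: Path) -> None:
    """Copy a directory tree, letting copytree handle directories and recursion."""
    shutil.copytree(src, dst, copy_function=reflink_or_copy, dirs_exist_ok=True)
```

With this shape the directory handling stays in the standard library, and no cache directory has to be created or cleaned up.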

happz · 2025-03-05T16:07:21Z

tmt/utils/filesystem.py

+    # significantly higher inode consumption and potential performance issues.
+    # Use a persistent cache for hardlinks
+    cache_dir = Path(tempfile.gettempdir()) / 'tmt_file_cache'
+    cache_dir.mkdir(exist_ok=True)


When is cache_dir going to be removed?

I have no good solution at this point, other than it being in tmpfs? Ideas welcome :)

happz · 2025-03-05T16:07:24Z

tmt/utils/filesystem.py

+
+    # Try reflink copy first (supported by btrfs, xfs with reflink, and some other filesystems)
+    try:
+        if logger:


Logger should be mandatory, if there are debug messages, let's get them logged.

Right, leftover from a change. Thanks.

happz · 2025-03-05T16:07:25Z

tmt/utils/filesystem.py

+    Args:
+        src: Source directory path
+        dst: Destination directory path
+        logger: Logger to use for debug messages (optional)


The more widespread format is :param foo: description..., with no Args.
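
For illustration, a docstring in the `:param foo:` style could look roughly like the sketch below, with the logger as a mandatory argument as requested in the earlier comment. The function body is a placeholder, and `logging.Logger` is only a stand-in assumption for whatever logger class the project actually passes around.

```python
import logging
import shutil
from pathlib import Path


def copy_tree(src: Path, dst: Path, logger: logging.Logger) -> None:
    """
    Copy a directory tree from ``src`` to ``dst``.

    :param src: source directory path.
    :param dst: destination directory path.
    :param logger: logger to use for debug messages.
    """
    logger.debug(f"Copying '{src}' to '{dst}'.")
    # Placeholder body: a plain recursive copy, not the strategy from this PR.
    shutil.copytree(src, dst, dirs_exist_ok=True)
```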

happz · 2025-03-05T16:07:27Z

tmt/utils/filesystem.py

+
+def copy_tree(
+    src: Union[str, Path],
+    dst: Union[str, Path],


If there is a path-like string that is not Path, we should fix that callsite. I for one would not like to allow path-like strings.

Makes sense, I was thinking about shutil.copytree, which is dumb.

happz · 2025-03-05T16:08:22Z

tmt/utils/filesystem.py

+
+        # Create a path in cache using a hash of the relative path to avoid path length issues
+        cache_key = relative_path_to_cache_key(relative_path)
+        cache_path = cache_dir / cache_key


cache_path = cache_dir / relative_path_to_cache_key(relative_path) ? cache_key does not seem to be used anywhere else.

Again, still a draft, especially the hardlink cache.

happz · 2025-03-05T16:10:54Z

tmt/utils/filesystem.py

+    return
+
+
+def relative_path_to_cache_key(relative_path: Union[str, Path]) -> str:


hashlib is full of hashing functions, why not use one from this standard library package?

Because I don't think we need a hash for this. This is checksum based, which fits the purpose imho. zlib is also built-in (double checked 3.9 as well).
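
To compare the two options side by side, here is a small sketch of both key styles. The diff does not show which zlib function the PR uses, so `zlib.crc32` is an assumption, and both helper names are hypothetical.

```python
import hashlib
import zlib
from pathlib import Path


def checksum_key(relative_path: Path) -> str:
    """Checksum-based key: 32-bit CRC of the path, formatted as 8 hex digits."""
    return f"{zlib.crc32(str(relative_path).encode('utf-8')):08x}"


def hash_key(relative_path: Path) -> str:
    """Hash-based key: SHA-256 of the path, formatted as 64 hex digits."""
    return hashlib.sha256(str(relative_path).encode("utf-8")).hexdigest()
```

Both produce short, fixed-length names that sidestep path length limits. The practical difference is the key space: a 32-bit checksum leaves much less room before two different paths map to the same cache entry, so a checksum-based cache would probably want to handle such collisions explicitly.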

martinhoyer · 2025-03-05T16:32:27Z

@happz hold on, it's still a draft :)

The copy_tree function in utils/filesystem.py optimizes directory copies by:

1. Using reflinks when available (btrfs, xfs) as the primary strategy
2. Falling back to hardlinks with persistent caching for unchanged files
3. Using standard copy for files that changed

This should significantly reduce inode consumption during tmt runs, which is especially beneficial for CI/CD environments and large repositories.

martinhoyer added the "code | utils" and "ci | full test" labels Mar 5, 2025

martinhoyer self-assigned this Mar 5, 2025

happz reviewed Mar 5, 2025


martinhoyer force-pushed the feature/efficient-directory-copy branch 4 times, most recently from 146d232 to f30d78b Compare March 19, 2025 16:48

martinhoyer mentioned this pull request Mar 20, 2025

Simplify workdir pruning implementation #3616

Open

1 task

martinhoyer force-pushed the feature/efficient-directory-copy branch from f30d78b to f76ab76 Compare March 20, 2025 15:03


