-
Notifications
You must be signed in to change notification settings - Fork 145
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactoring the file copy operations - reflinks with fallback #3581
base: main
Are you sure you want to change the base?
Conversation
cache_path = cache_dir / cache_key | ||
|
||
if item.is_dir(): | ||
dst_item.mkdir(parents=True, exist_ok=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this bit: what happens to the content of item
, being a directory?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest I'm not entirely confident about the whole "fallback" code. Hoping someone more knowledgeable can validate it. Too bad testing-farm is running on aws filesystem and it's kinda needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't blame just AWS. I think we shouldn't lock us with supporting just two very modern filesystems. ext4 is by no means rare.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No blame at all. I meant it more like we could keep the existing shutil.copytree as fallback instead if of this custom cache that could be hard to maintain.
# significantly higher inode consumption and potential performance issues. | ||
# Use a persistent cache for hardlinks | ||
cache_dir = Path(tempfile.gettempdir()) / 'tmt_file_cache' | ||
cache_dir.mkdir(exist_ok=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When is cache_dir
going to be removed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have no good solution at this point, other than it being in tmpfs? Ideas welcome :)
tmt/utils/filesystem.py
Outdated
|
||
# Try reflink copy first (supported by btrfs, xfs with reflink, and some other filesystems) | ||
try: | ||
if logger: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logger should be mandatory, if there are debug messages, let's get them logged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, leftover from a change. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
tmt/utils/filesystem.py
Outdated
Args: | ||
src: Source directory path | ||
dst: Destination directory path | ||
logger: Logger to use for debug messages (optional) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more spread format is :param foo: description...
, with no Args
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
tmt/utils/filesystem.py
Outdated
|
||
def copy_tree( | ||
src: Union[str, Path], | ||
dst: Union[str, Path], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is a path-like string that is not Path
, we should fix that callsite. I for one would not like to allow path-like strings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sense, I was thinking about shutil.copytree, which is dumb.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
tmt/utils/filesystem.py
Outdated
|
||
# Create a path in cache using a hash of the relative path to avoid path length issues | ||
cache_key = relative_path_to_cache_key(relative_path) | ||
cache_path = cache_dir / cache_key |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cache_path = cache_dir / relative_path_to_cache_key(relative_path)
? cache_key
does not seem to be used anywhere else.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
again, still a draft - especially the hardlink, cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
addressed
tmt/utils/filesystem.py
Outdated
return | ||
|
||
|
||
def relative_path_to_cache_key(relative_path: Union[str, Path]) -> str: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hashlib
is full of hashing functions, why not using one from this standard library package?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because I don't think we need hash for this. This is checksum based, which fits the purpose imho. zlib is also built-in (double checked 3.9 as well).
@happz hold on, it's still a draft :) |
146d232
to
f30d78b
Compare
copy_tree function in utils/filesystem.py optimizes directory copies by: 1. Using reflinks when available (btrfs, xfs) as primary strategy 2. Falling back to hardlinks with persistent caching for unchanged files 3. Using standard copy for files that changed This should significantly reduce inode consumption during tmt runs, especially beneficial for CI/CD environments and large repositories.
f30d78b
to
f76ab76
Compare
More or less follow-up on #2791
The way tmt is copying files appears to be quite a performance bottleneck in some scenarios and can cause issues like #3558 (well, reflinks wouldn't help with running out of inodes there).
What makes
reflinks
interesting, is that it behaves the same as normalcp
operation, but doesn't take additional storage space, unless the file is changed.They do need to be supported by the underlying filesystem. Notably:
✅ Btrfs - Fedora default since F33
✅ XFS - reflinks supported since CentOS Stream 8
❌ ext4 - Ubuntu GitHub action runners
❌ EFS - AWS
Pull Request Checklist