copy_files should have an option to skip files if they already exist #17

Open
JohannesWiesner opened this issue Nov 30, 2022 · 3 comments

Comments

@JohannesWiesner
Owner

Use case: I mostly use copy_files to copy files from a server to my local PC. Because the connection often gets lost, I have to restart the copy several times. It would be nice if copy_files had an option to check whether a file already exists at the destination (and, optionally, that it is not corrupted) and to copy only the files that have not been copied yet.
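
A minimal sketch of what such a check could look like in pure Python, independent of how copy_files is actually implemented. The function name copy_if_missing and its verify parameter are hypothetical and not part of nisupply; the size comparison is only a cheap heuristic for spotting an interrupted copy.

```python
import filecmp
import shutil
from pathlib import Path


def copy_if_missing(src, dst, verify=False):
    """Copy src to dst unless dst already exists and looks intact.

    Returns True if a copy was made, False if the file was skipped.
    Hypothetical helper for illustration, not part of nisupply.
    """
    src, dst = Path(src), Path(dst)
    if dst.is_file():
        # Cheap check: an interrupted copy usually leaves a file of a different size.
        same_size = dst.stat().st_size == src.stat().st_size
        # Optional thorough check: compare the contents byte by byte.
        if same_size and (not verify or filecmp.cmp(src, dst, shallow=False)):
            return False
    # Create missing destination folders, then copy data and metadata.
    dst.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dst)
    return True
```

With verify=True every existing file is read in full on both sides, which can be slow over a network mount, so the size-only check is probably the more practical default.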

@JohannesWiesner
Owner Author

Would it make sense to call the rsync command-line tool here?

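import subprocess

# Copy with rsync so that files already present at the destination are
# skipped instead of being overwritten, which saves time on restarts.
# The arguments are passed as a list so paths containing spaces stay intact.
rsync_command = ["rsync", "-av", "--exclude", ".*", "--copy-links",
                 f"{src_dataset_path}/", str(dst_dataset_path)]
# capture_output collects stdout and stderr; check=True raises if rsync fails.
result = subprocess.run(rsync_command, capture_output=True, text=True, check=True)
output, error = result.stdout, result.stderr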

@JohannesWiesner
Owner Author

JohannesWiesner commented Jan 29, 2024

It should be possible to sync files with sysrsync. Specify option='copy' or option='sync' in nisupply.io.copy_files (the latter is only possible on Linux systems and only if rsync is preinstalled).

This would be the code for sysrsync:

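import sysrsync

for file, dst_dir in zip(dti_df['filepath'], dti_df['dst_dir']):
    # sysrsync removes a trailing slash from the destination, so '//' is
    # appended to keep the slash rsync needs to treat it as a folder.
    dst_dir = dst_dir + '//'
    sysrsync.run(source=file,
                 destination=dst_dir,
                 options=['-a', '--mkpath'],
                 sync_source_contents=False)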

Important: by default, sysrsync removes the trailing slash from dst_dir, but we need that slash in order to sync a file into a folder, which is why the loop above appends '//'.

The resulting rsync call would look like: rsync /foo/bar.txt /dst/bar/

Basic rules
Syncs source contents by default, so it adds a trailing slash to the end of source, unless sync_source_contents=False is specified
Removes trailing slash from destination
Extra arguments are put right after rsync
Breaks if source_ssh and destination_ssh are both set
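
For comparison, the same file-to-folder call can be sketched with plain subprocess, which sidesteps the trailing-slash handling described above. This is only an illustration under assumptions: build_rsync_command and sync_file are hypothetical helpers, not part of sysrsync or nisupply, and --mkpath requires rsync 3.2.3 or newer.

```python
import shutil
import subprocess


def build_rsync_command(src_file, dst_dir):
    """Return the rsync argument list for syncing one file into a folder."""
    # Exactly one trailing slash marks the destination as a directory.
    dst_dir = str(dst_dir).rstrip('/') + '/'
    # --mkpath (rsync >= 3.2.3) creates missing destination directories.
    return ['rsync', '-a', '--mkpath', str(src_file), dst_dir]


def sync_file(src_file, dst_dir):
    """Run rsync for a single file; raise if rsync is not installed."""
    if shutil.which('rsync') is None:
        raise RuntimeError('rsync is not available on this system')
    # check=True raises CalledProcessError if rsync exits with an error.
    subprocess.run(build_rsync_command(src_file, dst_dir), check=True)
```

Checking for rsync with shutil.which up front would also give copy_files a clean way to reject option='sync' on systems where rsync is missing.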
