A macOS utility to replace duplicate file data with a copy-on-write clone.
dedup [-PVnvx] [-t threads] [-d depth] [file ...]
dedup finds files with identical content using the provided file arguments.
Duplicates are replaced with a clone of another matching file (using clonefile(2)).
If no file is specified, the current directory is used.
Cloned files share data blocks with the file they were cloned from, saving space on disk. Unlike a hardlinked file, any future modification to either the clone or the original file will remain private to that file (copy-on-write).
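For illustration, creating a clone with clonefile(2) looks roughly like this (a minimal sketch with made-up file names; dedup's actual calls and flags may differ):

    #include <stdio.h>
    #include <sys/clonefile.h>

    int main(void) {
        // Create "copy.bin" as a copy-on-write clone of "orig.bin".
        // Both names now share the same data blocks; writes through
        // either name get private copies of the modified blocks.
        if (clonefile("orig.bin", "copy.bin", CLONE_NOFOLLOW) != 0) {
            perror("clonefile");
            return 1;
        }
        return 0;
    }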
dedup works in two phases. First, it evaluates all of the provided paths recursively, looking for duplicates. Once all duplicates are found, any files that are not already clones of the "best" clone source are replaced with clones.
There are limits on which files can be cloned (a rough check is sketched after this list):
- the file must be a regular file
- the file must have only one link
- the file and its directory must be writable by the user
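Under those rules, eligibility might look like this hypothetical helper (a sketch, not dedup's actual code):

    #include <libgen.h>
    #include <limits.h>
    #include <stdbool.h>
    #include <string.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Hypothetical sketch of the eligibility rules above.
    static bool cloneable(const char *path) {
        struct stat st;
        if (lstat(path, &st) != 0) return false;
        if (!S_ISREG(st.st_mode)) return false;     // regular files only
        if (st.st_nlink != 1) return false;         // exactly one link
        if (access(path, W_OK) != 0) return false;  // file writable
        char dir[PATH_MAX];
        strlcpy(dir, path, sizeof(dir));            // dirname(3) may modify its argument
        return access(dirname(dir), W_OK) == 0;     // parent directory writable
    }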
The "best" source is chosen by first finding the file with the most hard links. Files with multiple hard links will not be replaced, so using them as the source of other clones allows their blocks to be shared without modifying the data to which they point. If all files have a single link, a file which shares the most clones with others is chosen. This ensures that files which have been previously processed will not need to be replaced during subsequent evaluations of the same directory. If none of the files have multiple links or clones, the first file encountered will be chosen.
Files with multiple hard links are not replaced because it is not possible to guarantee that all other links to that inode exist within the tree(s) being evaluated. Replacing a link with a clone changes the semantics from two links pointing at the same mutable shared storage to two links pointing at the same copy-on-write storage. For scenarios where hard links were previously being used because clones were not available, future versions may provide a flag to destructively replace hard links with clones. Future versions may also consider cloning files with multiple hard links if all links are within the space being evaluated and two or more hard link clusters reference duplicated data.
If all files in a matched set are compressed with HFS transparent compression, none of the files will be deduplicated. Future versions of dedup may select one file from the set to decompress in place and then use that file as a clone source.
dedup will only work on volumes that have the VOL_CAP_INT_CLONE capability. Currently that is limited to APFS.
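A sketch of how a program might test for that capability with getattrlist(2) (the helper name is hypothetical, and dedup's actual check may differ):

    #include <stdbool.h>
    #include <sys/attr.h>
    #include <unistd.h>

    // Returns true if the volume mounted at mount_point advertises
    // VOL_CAP_INT_CLONE (i.e. clonefile(2) will work there).
    static bool volume_supports_clone(const char *mount_point) {
        struct attrlist req = {
            .bitmapcount = ATTR_BIT_MAP_COUNT,
            .volattr = ATTR_VOL_INFO | ATTR_VOL_CAPABILITIES,
        };
        struct {
            u_int32_t               length;
            vol_capabilities_attr_t caps;
        } __attribute__((aligned(4), packed)) buf;

        if (getattrlist(mount_point, &req, &buf, sizeof(buf), 0) != 0)
            return false;
        return (buf.caps.valid[VOL_CAPABILITIES_INTERFACES] & VOL_CAP_INT_CLONE)
            && (buf.caps.capabilities[VOL_CAPABILITIES_INTERFACES] & VOL_CAP_INT_CLONE);
    }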
While dedup is primarily intended to save storage by using clones, it also provides -l and -s flags to replace duplicates with hard links or symbolic links respectively. Care should be taken when using these options, however. Unlike clones, the replaced files share the metadata of one of the matched files, and which one may not be predictable. If these options are used with automation where all files have default ownership and permissions, there should be little issue. The created files are also not copy-on-write and will share any modifications made. These options should only be used if the consequences of each choice are understood.
The following options are available:
-d depth, --depth depth
Only traverse depth directories deep into each provided path.
-h
Display sizes using SI suffixes with 2-4 digits of precision.
-n, --dry-run
Evaluate all files and find all duplicates but only print what would be done and do not modify any files.
-l, --link
Replace duplicate files with hard links instead of clones. Replaced files will not retain their metadata.
-s, --symlink
Replace duplicate files with symbolic links instead of clones. Replaced files will not retain their metadata.
-P, --no-progress
Do not display a progress bar.
-t threads
The number of threads to use for evaluating files. By default this is the same as the number of CPUs on the host, as described by the hw.ncpu value returned by sysctl(8). If the value 0 is provided, all evaluation will be done serially in the main thread.
-V, --version
Print the version and exit.
-v, --verbose
Increase verbosity. May be specified multiple times.
-x, --one-file-system
Prevent dedup from descending into directories that have a device number different than that of the file from which the descent began.
-?, --help
Print a summary of options and exit.
The dedup utility exits 0 on success and >0 if an error occurs.
It is possible that during abnormal termination a temporary clone will be left before it is moved to the path it is replacing. In such cases a file with the prefix .~. followed by the name of the file that was to be replaced will exist in the same directory as the target file.
find(1) can be used to find these temporary files if necessary.
$ find . -name '.~.*'
Find all duplicates and display which files would be replaced by clones along with the estimated space that would be saved:
$ dedup -n
Limit execution to a single thread, disable progress, and display human-readable output while deduplicating files in ~/Downloads and /tmp:
$ dedup -t1 -Ph ~/Downloads /tmp
Sometimes.
dedup was written by Jonathan Hohle <jon@ttkb.co>.
dedup makes a best-effort attempt to replace original files with clones that have the same permissions, metadata, and ACLs. Support for copying metadata comes from copyfile(3).
dedup shouldn't be used on directories or files that are actively being modified. In its current implementation, dedup doesn't lock any files, which causes several race conditions if underlying files are being modified. If a file acting as a clone source or target is modified between any of the following events, a file may be replaced by something that resembles its previous state.
- File metadata retrieval and comparison to previously seen files
- Creation of a clone from a source
- Application of file metadata from the target
- Replacing the target with the clone
For example, if file a is seen and later file b is found to be a match, a may be changed, causing b to be replaced with a clone of a's new content. Likewise, file b may be changed and then overwritten by a clone with its previous content.
It may be reasonable for future versions to include additional checks and locks to ensure modifications are detected prior to clone replacement (#5).
If the author were more clever, he might have named this program avril.
Download the latest release and extract the files in the $PREFIX of your choice.
or
Build a copy locally:
git clone https://github.com/ttkb-oss/dedup.git
cd dedup
make && sudo make install
Packages for package managers will be provided in the future.
Up to this point development has only been done on macOS 13, but all APIs used were available in macOS 12, so it may work there as well.
For building the app, a recent version of Xcode is required along with Command Line Tools (xcode-select --install).
Check is used for testing.
Aspell is used for spell checking.
LCOV is used for reporting test coverage.
MacPorts users can install these with:
sudo port install lcov aspell check
Homebrew users should install things the way they choose to. They may need to add header and library search paths when running the check target (or any of the targets that compile things in test/Makefile):
CFLAGS='-I/usr/local/include' LDFLAGS='-L/usr/local/lib' make check
Feel free to send a PR for build, code, test, or documentation changes. If the PR is approved, you will be asked to assign copyright of the submitted code.
A directory contains files "a", "b", and "c". "a" and "b" are links to the same inode which points to a data block containing "hello". "c" is a unique link to a different inode pointing to a different data block which also contains "hello".
dir ╔═══════╗ ╔═════════╗
⎺⎺⎺┌──▶║ ino 1 ║──▶║ "hello" ║
a ─┘┌─▶╚═══════╝ ╚═════════╝
b ──┘ ╔═══════╗ ╔═════════╗
c ────▶║ ino 2 ║──▶║ "hello" ║
╚═══════╝ ╚═════════╝
When dedup is run on dir, "c" is updated to point to a new inode that shares its data blocks with the inode to which "a" and "b" are linked.
dir ╔═══════╗ ╔═════════╗
⎺⎺⎺┌──▶║ ino 1 ║──▶║ "hello" ║
a ─┘┌─▶╚═══════╝ ╚═════════╝
b ──┘ ╔═══════╗ ▲
c ────▶║ ino 3 ║────────┘
╚═══════╝
If the data is modified using the "a" or "b" link (file name), the data block shared with "c" will be disconnected from their inode and a new block (or blocks) will be created.
╔═════════╗
┌─▶║ "world" ║
│ ╚═════════╝
dir ╔═══════╗ │ ╔═════════╗
⎺⎺⎺┌──▶║ ino 1 ║─┘ ║ "hello" ║
a ─┘┌─▶╚═══════╝ ╚═════════╝
b ──┘ ╔═══════╗ ▲
c ────▶║ ino 3 ║────────┘
╚═══════╝
Likewise, if the data is modified using the "c" link, "a" and "b" will continue to point to the original data block and "c" will now point to a newly created data block.
dir ╔═══════╗ ╔═════════╗
⎺⎺⎺┌──▶║ ino 1 ║──▶║ "hello" ║
a ─┘┌─▶╚═══════╝ ╚═════════╝
b ──┘ ╔═══════╗ ╔═════════╗
c ────▶║ ino 3 ║──▶║ "world" ║
╚═══════╝ ╚═════════╝
If you run dedup on the same directory tree multiple times, it will output the amount of space that is already saved by clones and hard links.
A patched version of du(1) will also ignore clones it encounters multiple times, just like hard links to the same inode are ignored. The patched du will display smaller block usage if data can be deduplicated.
If you want to ensure a directory contains the same content before and after running dedup, you can create a checksum file and validate it immediately after cloning:
TARGET_TREE=some/path
find "$TARGET_TREE" -type f -print0 | xargs -0 -n1 shasum -a 256 > original.sha256
dedup "$TARGET_TREE"
shasum -c original.sha256 | grep -v ': OK$'
The final command should output nothing ($? will be 1 because grep selects no lines).
dedup leverages both file system support for creating clones and the appropriate system calls. OpenZFS support for sharing blocks may make FreeBSD support possible in the future.
APFS supports HFS compression using xattrs and flags, just like HFS+. However, HFS compression does not store data in a file's data fork, but in a file's xattrs. Since APFS clones data blocks, not file metadata, there's nothing to share between clones of HFS-compressed files.
HFS transparent compression uses a file's resource fork (the com.apple.ResourceFork xattr) along with a com.apple.decmpfs xattr describing compression details (algorithm, original file size), and the UF_COMPRESSED flag. There is no data in any data blocks (né data fork).
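On releases before Sequoia, that state is visible through stat(2); a minimal sketch:

    #include <stdbool.h>
    #include <sys/stat.h>

    // Returns true if path carries the UF_COMPRESSED flag, i.e. its data
    // lives in the decmpfs xattr / resource fork rather than in data blocks.
    // (Sketch only; as the update below notes, macOS 15 no longer exposes the flag.)
    static bool is_transparently_compressed(const char *path) {
        struct stat st;
        return lstat(path, &st) == 0 && (st.st_flags & UF_COMPRESSED);
    }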
UPDATE: In macOS 15 Sequoia, HFS-compressed files now appear to be transparently handled and no longer expose UF_COMPRESSED.