
dedup(1)

A macOS utility to replace duplicate file data with a copy-on-write clone.

SYNOPSIS

dedup [-hlnPsVvx] [-t threads] [-d depth] [file ...]

DESCRIPTION

dedup finds files with identical content among the provided file arguments. Duplicates are replaced with a clone of one of the matched files (using clonefile(2)). If no file is specified, the current directory is used.

Cloned files share data blocks with the file they were cloned from, saving space on disk. Unlike a hardlinked file, any future modification to either the clone or the original file will remain private to that file (copy-on-write).

dedup works in two phases. First it recursively evaluates all of the provided paths, looking for duplicates. Once all duplicates are found, any files that are not already clones of the "best" clone source are replaced with clones.
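The first phase can be sketched as grouping files by a digest of their contents. The following Python is only an illustration, not the project's C implementation; the exhaustive SHA-256 pass is an assumption made for clarity (the real tool may prescreen candidates, e.g. by size, before comparing content):

```python
import hashlib
import os

def find_duplicate_groups(paths):
    """Group regular files under the given paths by a digest of
    their full contents. Simplified stand-in for dedup's first phase."""
    groups = {}
    for root in paths:
        for dirpath, _, filenames in os.walk(root):
            for name in filenames:
                path = os.path.join(dirpath, name)
                # Only hash regular files; skip symlinks and specials.
                if os.path.islink(path) or not os.path.isfile(path):
                    continue
                h = hashlib.sha256()
                with open(path, 'rb') as f:
                    for chunk in iter(lambda: f.read(1 << 20), b''):
                        h.update(chunk)
                groups.setdefault(h.hexdigest(), []).append(path)
    # Only sets with two or more members are duplicates.
    return {k: v for k, v in groups.items() if len(v) > 1}
```

In the second phase, each such group would have one member selected as the clone source and the rest replaced with clones of it.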

There are limits on which files can be replaced with clones:

  1. the file must be a regular file
  2. the file must have only one link
  3. the file and its directory must be writable by the user
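These eligibility rules can be expressed directly in terms of stat(2) results. A minimal Python sketch of the documented checks (an illustration, not the project's C code):

```python
import os
import stat

def can_replace(path):
    """Apply dedup's documented limits for clone replacement:
    regular file, exactly one link, and a writable file and parent."""
    st = os.lstat(path)
    if not stat.S_ISREG(st.st_mode):
        return False
    if st.st_nlink != 1:
        return False
    parent = os.path.dirname(path) or '.'
    return os.access(path, os.W_OK) and os.access(parent, os.W_OK)
```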

The "best" source is chosen by first finding the file with the most hard links. Files with multiple hard links will not be replaced, so using them as the source of other clones allows their blocks to be shared without modifying the data to which they point. If all files have a single link, a file which shares the most clones with others is chosen. This ensures that files which have been previously processed will not need to be replaced during subsequent evaluations of the same directory. If none of the files have multiple links or clones, the first file encountered will be chosen.
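The selection order described above (most hard links, then most shared clones, then first encountered) can be sketched as follows. The tuple layout is a hypothetical representation chosen for illustration; dedup's C implementation tracks this state differently:

```python
def choose_source(files):
    """Pick the "best" clone source from a set of duplicates.

    files: list of (path, nlink, shared_clones) tuples in the order
    encountered, where shared_clones counts how many files in the set
    already share clone blocks with this one (hypothetical field).
    """
    best = files[0]
    for f in files[1:]:
        # Prefer more hard links; break ties on shared clones.
        # Equal candidates keep the earlier (first-encountered) file.
        if f[1] > best[1] or (f[1] == best[1] and f[2] > best[2]):
            best = f
    return best[0]
```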

Files with multiple hard links are not replaced because it is not possible to guarantee all other links to that inode exist within the tree(s) being evaluated. Replacing a link with a clone changes the semantics from two links pointing at the same mutable shared storage to two links pointing at the same copy-on-write storage. For scenarios where hard links were previously being used because clones were not available, future versions may provide a flag to destructively replace hard links with clones. Future versions may also consider cloning files with multiple hard links if all links are within the space being evaluated and two or more hard link clusters reference duplicated data.

If all files in a matched set are compressed with HFS transparent compression, none of the files will be deduplicated. Future versions of dedup may select one file from the set to decompress in place and then use that file as a clone source.

dedup will only work on volumes that have the VOL_CAP_INT_CLONE capability. Currently that is limited to APFS.

While dedup is primarily intended to save storage by using clones, it also provides -l and -s flags to replace duplicates with hard links or symbolic links respectively. Care should be taken when using these options, however. Unlike clones, the replaced files share the metadata of one of the matched files, and which file is chosen may not be deterministic. If these options are used with automation where all files have default ownership and permissions, there should be little issue. The created files are also not copy-on-write and will share any modifications made. These options should only be used if the consequences of each choice are understood.

OPTIONS

The following options are available:

-d depth, --depth depth

Only traverse depth directories deep into each provided path.

-h

Display sizes using SI suffixes with 2-4 digits of precision.

-n, --dry-run

Evaluate all files and find all duplicates but only print what would be done and do not modify any files.

-l, --link

Replace duplicate files with hard links instead of clones. Replaced files will not retain their metadata.

-s, --symlink

Replace duplicate files with symbolic links instead of clones. Replaced files will not retain their metadata.

-P, --no-progress

Do not display a progress bar.

-t threads

The number of threads to use for evaluating files. By default this is the same as the number of CPUs on the host as described by the hw.ncpu value returned by sysctl(8). If the value 0 is provided all evaluation will be done serially in the main thread.
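As an analogy for this default, the sketch below hashes files in parallel with a worker count defaulting to the CPU count, falling back to serial evaluation when 0 is given. It is only an illustration of the threading model, not dedup's C implementation:

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

def digest_files(paths, threads=None):
    """Map each path to a content digest, using `threads` workers.
    threads=None mimics dedup's default (one worker per CPU);
    threads=0 evaluates serially in the calling thread."""
    def digest(path):
        h = hashlib.sha256()
        with open(path, 'rb') as f:
            for chunk in iter(lambda: f.read(1 << 20), b''):
                h.update(chunk)
        return path, h.hexdigest()

    if threads == 0:
        return dict(map(digest, paths))
    with ThreadPoolExecutor(max_workers=threads or os.cpu_count()) as pool:
        return dict(pool.map(digest, paths))
```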

-V, --version

Print the version and exit.

-v, --verbose

Increase verbosity. May be specified multiple times.

-x, --one-file-system

Prevent dedup from descending into directories that have a device number different than that of the file from which the descent began.
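The device-number check behind -x can be modeled as a traversal that refuses to descend when st_dev differs from that of the starting directory. A simplified sketch, not dedup's implementation:

```python
import os

def walk_one_file_system(root):
    """Yield directories under root, never crossing into a directory
    whose device number differs from root's (like dedup -x)."""
    root_dev = os.lstat(root).st_dev
    stack = [root]
    while stack:
        d = stack.pop()
        yield d
        for entry in os.scandir(d):
            if entry.is_dir(follow_symlinks=False):
                # Skip mount points: different st_dev, different volume.
                if os.lstat(entry.path).st_dev == root_dev:
                    stack.append(entry.path)
```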

-?, --help

Print a summary of options and exit.

EXIT STATUS

The dedup utility exits 0 on success and >0 if an error occurs.

It is possible that during abnormal termination a temporary clone will be left behind before it is moved to the path it is replacing. In such cases a file with the prefix .~. followed by the name of the file that was to be replaced will exist in the same directory as the target file.

find(1) can be used to find these temporary files if necessary.

$ find . -name '.~.*'

EXAMPLES

Find all duplicates and display which files would be replaced by clones along with the estimated space that would be saved:

$ dedup -n

Limit execution to a single thread, disable progress, and display human readable output while deduplicating files in ~/Downloads and /tmp:

$ dedup -t1 -Ph ~/Downloads /tmp

SEE ALSO

clonefile(2)

STANDARDS

Sometimes.

AUTHORS

dedup was written by Jonathan Hohle <jon@ttkb.co>.

SECURITY CONSIDERATIONS

dedup makes a best-effort attempt to replace original files with clones that have the same permissions, metadata, and ACLs. Support for copying metadata comes from copyfile(3).

dedup shouldn't be used on directories or files that are actively being modified. In its current implementation dedup doesn't lock any files, which allows several race conditions if the underlying files are being modified. If a file acting as a clone source or target is modified between any of the following events, a file may be replaced by something that merely resembles its previous state.

  1. File metadata retrieval and comparison to previously seen files
  2. Creation of a clone from a source
  3. Application of file metadata from the target
  4. Replacing the target with the clone

For example, if file a is seen and later file b is found to be a match, a may be changed, causing b to be replaced with a clone of a's new content. Likewise, file b may be changed and then overwritten by a clone with its previous content.

It may be reasonable for future versions to include additional checks and locks to ensure modifications are detected prior to clone replacement (#5).

HISTORY

If the author was more clever, he might have named this program avril.

INSTALLATION

Download the latest release and extract the files in the $PREFIX of your choice.

or

Build a copy locally

git clone https://github.com/ttkb-oss/dedup.git
cd dedup
make && sudo make install

Packages for package managers will be provided in the future.

DEVELOPMENT

Up to this point development has only been done on macOS 13, but all APIs used were available in macOS 12, so it may work there as well.

For building the app, a recent version of Xcode is required along with Command Line Tools (xcode-select --install).

Check is used for testing.

Aspell is used for spell checking.

LCOV is used for reporting test coverage.

MacPorts users can

sudo port install lcov aspell check

Homebrew users should install things the way they choose to. They may need to add header and library search paths when running the check target (or any of the targets that compile things in test/Makefile).

CFLAGS='-I/usr/local/include' LDFLAGS='-L/usr/local/lib' make check

CONTRIBUTING

Feel free to send a PR for build, code, test, or documentation changes. If the PR is approved, you will be asked to assign copyright of the submitted code.

FAQ

How does it work?

A directory contains files "a", "b", and "c". "a" and "b" are links to the same inode which points to a data block containing "hello". "c" is a unique link to a different inode pointing to a different data block which also contains "hello".

dir    ╔═══════╗   ╔═════════╗
⎺⎺⎺┌──▶║ ino 1 ║──▶║ "hello" ║
a ─┘┌─▶╚═══════╝   ╚═════════╝
b ──┘  ╔═══════╗   ╔═════════╗
c ────▶║ ino 2 ║──▶║ "hello" ║
       ╚═══════╝   ╚═════════╝

When dedup is run on dir, "c" is updated to point to a new inode that shares its data blocks with the inode to which "a" and "b" are linked.

dir    ╔═══════╗   ╔═════════╗
⎺⎺⎺┌──▶║ ino 1 ║──▶║ "hello" ║
a ─┘┌─▶╚═══════╝   ╚═════════╝
b ──┘  ╔═══════╗        ▲
c ────▶║ ino 3 ║────────┘
       ╚═══════╝

If the data is modified using the "a" or "b" link (file name), the data block shared with "c" will be disconnected from their inode and a new block (or blocks) will be created.

                    ╔═════════╗
                 ┌─▶║ "world" ║
                 │  ╚═════════╝
dir    ╔═══════╗ │ ╔═════════╗
⎺⎺⎺┌──▶║ ino 1 ║─┘ ║ "hello" ║
a ─┘┌─▶╚═══════╝   ╚═════════╝
b ──┘  ╔═══════╗        ▲
c ────▶║ ino 3 ║────────┘
       ╚═══════╝

Likewise, if the data is modified using the "c" link, "a" and "b" will continue to point to the original data block and "c" will now point to a newly created data block.

dir    ╔═══════╗   ╔═════════╗
⎺⎺⎺┌──▶║ ino 1 ║──▶║ "hello" ║
a ─┘┌─▶╚═══════╝   ╚═════════╝
b ──┘  ╔═══════╗   ╔═════════╗
c ────▶║ ino 3 ║──▶║ "world" ║
       ╚═══════╝   ╚═════════╝

How Can It Be Checked?

If you run dedup on the same directory tree multiple times, subsequent runs will output the amount of space that is already saved by clones and hard links.

A patched version of du(1) will also ignore clones it encounters multiple times just like hard links to the same inode are ignored. The patched du will display smaller block usage if data can be deduplicated.

If you want to ensure a directory contains the same content before and after running dedup, you can create a checksum file and validate it immediately after cloning:

TARGET_TREE=some/path
find "$TARGET_TREE" -type f -print0 | xargs -0 -n1 shasum -a 256 > original.sha256
dedup "$TARGET_TREE"
shasum -c original.sha256 | grep -v ': OK$'

The final command should output nothing ($? will be 1 because grep selects no lines).

Are Any Other Operating Systems Supported?

dedup leverages both file system support for creating clones as well as the appropriate system calls. OpenZFS support for sharing blocks may make FreeBSD support possible in the future.

Why Aren't HFS Compressed Files Cloned?

APFS supports HFS compression using xattrs and flags, just like HFS+. However, HFS compression does not store data in a file's data fork, but in a file's xattrs. Since APFS clones data blocks, not file metadata, there's nothing to share between clones of HFS compressed files.

HFS transparent compression uses a file's resource fork (the com.apple.ResourceFork xattr) along with a com.apple.decmpfs xattr describing compression details (algorithm, original file size), and the UF_COMPRESSED flag. There is no data in any data blocks (né data fork).

UPDATE: In macOS 15 Sequoia, HFS compressed files now appear to be handled transparently and no longer expose UF_COMPRESSED.