# Content Defined Chunking playground

This repository provides implementations of a small number of Content Defined Chunking algorithms, written in the Go programming language. These implementations are provided for testing and benchmarking purposes.

This repository was created in response to Bazel remote-apis PR #282, where support for Content Defined Chunking is being considered for inclusion in the remote execution protocol that is used by tools like Bazel and Buildbarn.

## MaxCDC: content defined chunking with lookahead

Algorithms like Rabin fingerprinting and FastCDC work by performing a simple forward scan through the input data: the decision to introduce a cutting point between chunks is made as soon as the bytes directly preceding it have been hashed. This simple design has some limitations. For example:

- If no cutting point is found before the maximum chunk size is reached, the algorithm is forced to make a cut at an undesirable offset. It will not be able to select a "close to optimal" cutting point.

- When implemented trivially, the size of chunks is expected to follow a geometric distribution. This means that there is a relatively large spread in size between the smallest and largest chunks. For example, for FastCDC8KB the largest chunks can be 32 times as large as the smallest ones (2 KB vs 64 KB).

The MaxCDC algorithm attempts to address this by performing a lookahead. Instead of selecting cutting points on the fly, it always scans the input up to the maximum limit and only afterwards chooses the most desirable cutting point. It considers the most desirable cutting point to be the one at which the Gear hash has the highest value, hence the name MaxCDC.
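
As a rough illustration of that idea, the sketch below scans an in-memory buffer up to the maximum chunk size with the Gear rolling hash and returns the offset at which the hash value peaks. It is not the API of the chunkers in this repository; the package, function name, and arbitrarily seeded table are made up for this example.

```go
// Package cdcsketch contains illustrative sketches only.
package cdcsketch

import "math/rand"

// gearTable maps every possible byte value to a pseudo-random 64-bit
// constant, as required by the Gear rolling hash. The seeding here is
// arbitrary and purely for illustration.
var gearTable [256]uint64

func init() {
	r := rand.New(rand.NewSource(1))
	for i := range gearTable {
		gearTable[i] = r.Uint64()
	}
}

// cutPoint scans data up to maxSize bytes and returns the offset at which
// the Gear hash reaches its highest value, while respecting the minimum
// chunk size. If no more than minSize bytes remain, everything becomes the
// final chunk.
func cutPoint(data []byte, minSize, maxSize int) int {
	if len(data) <= minSize {
		return len(data)
	}
	limit := len(data)
	if limit > maxSize {
		limit = maxSize
	}
	var hash, bestHash uint64
	bestOffset := limit
	for i := 0; i < limit; i++ {
		hash = hash<<1 + gearTable[data[i]]
		// Only offsets that yield a chunk of at least minSize bytes are
		// candidate cutting points.
		if i+1 >= minSize && hash >= bestHash {
			bestHash, bestOffset = hash, i+1
		}
	}
	return bestOffset
}
```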

### Runtime performance

Implementing MaxCDC trivially (as done in `simple_max_content_defined_chunker.go`) has the disadvantage that input data is hashed redundantly. It may process input up to the maximum limit and select a cutting point close to the minimum limit. Any data in between those two limits would be hashed again during the next iteration. To eliminate this overhead, we provide an optimized implementation (in `max_content_defined_chunker.go`) that preserves potential future cutting points on a stack, allowing subsequent calls to reuse this information. Performance of this optimized implementation is nearly identical to that of plain FastCDC.
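
The fragment below is only a conceptual sketch of that bookkeeping, not the code in `max_content_defined_chunker.go`; the type and function names are made up. It keeps the stack strictly decreasing in hash value, so that the bottom entry is the chunk's eventual cutting point, while the entries above it all lie after that cut and can seed the scan of the next chunk instead of being rehashed.

```go
package cdcsketch

// candidate records a position that could still become the highest-valued
// cutting point of the current or a later chunk.
type candidate struct {
	offset int
	hash   uint64
}

// pushCandidate maintains a stack that is strictly decreasing in hash value
// from bottom to top: any entry that the new position dominates (equal or
// lower hash, earlier offset) can never be selected again and is discarded.
func pushCandidate(stack []candidate, offset int, hash uint64) []candidate {
	for len(stack) > 0 && stack[len(stack)-1].hash <= hash {
		stack = stack[:len(stack)-1]
	}
	return append(stack, candidate{offset: offset, hash: hash})
}
```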

### Deduplication performance

In order to validate the quality of the chunking performed by this algorithm, we have created uncompressed tarballs of 80 different versions of the Linux kernel (from v6.0 to v6.8, including all release candidates). Each of these tarballs is approximately 1.4 GB in size.

When chunking all of these tarballs with FastCDC8KB, we see that each tarball is split into about 145k chunks. When deduplicating chunks across all 80 versions, 383,093 chunks remain that have a total size of 3,872,754,501 bytes. Chunks thus have an average size of 10,109 bytes.

We then chunked the same tarballs using MaxCDC, with a minimum chunk size of 4,096 bytes and a maximum chunk size of 14,785 bytes. After deduplicating, this yielded 374,833 chunks with a total size of 3,790,013,152 bytes. The minimum and maximum chunk sizes were intentionally chosen so that the average chunk size was almost identical to that of FastCDC8KB, namely 10,111 bytes.

We therefore conclude that for this specific benchmark the MaxCDC-generated output consumes 2.14% less space than that of FastCDC8KB. Furthermore, the spread in chunk size is also far smaller when using MaxCDC (14,785 B / 4,096 B ≈ 3.61) than with FastCDC8KB (64 KB / 2 KB = 32).
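
One way to obtain numbers like the ones above is to hash every chunk and sum the sizes of the chunks that have not been seen before, along the lines of the sketch below. The `Chunker` interface shown here is hypothetical and merely stands in for the chunkers in this repository.

```go
package cdcsketch

import (
	"crypto/sha256"
	"io"
)

// Chunker is a stand-in for the chunkers in this repository; it yields the
// next chunk of input, or io.EOF once the input is exhausted.
type Chunker interface {
	ReadChunk() ([]byte, error)
}

// measureDeduplicatedSize chunks every input stream and sums the sizes of
// the chunks that were not seen before, keyed by their SHA-256 digest.
func measureDeduplicatedSize(chunkers []Chunker) (uniqueChunks int, totalBytes int64, err error) {
	seen := map[[sha256.Size]byte]struct{}{}
	for _, c := range chunkers {
		for {
			chunk, err := c.ReadChunk()
			if err == io.EOF {
				break
			}
			if err != nil {
				return 0, 0, err
			}
			key := sha256.Sum256(chunk)
			if _, ok := seen[key]; !ok {
				seen[key] = struct{}{}
				uniqueChunks++
				totalBytes += int64(len(chunk))
			}
		}
	}
	return uniqueChunks, totalBytes, nil
}
```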

### Tuning recommendations

Assume you use MaxCDC to chunk two non-identical, but similar files. Making the ratio between the minimum and maximum permitted chunk size too small leads to poor deduplication, because it causes the two streams of chunks to take longer to converge again after a differing part has been processed.

Conversely, making the ratio between the minimum and maximum permitted chunk size too large is also suboptimal. This is because large chunks have a lower probability of being deduplicated against others, which causes the average size of the chunks stored in a deduplicating data store to become higher than the average size produced by the chunking algorithm itself.

When chunking and deduplicating the Linux kernel source tarballs, we observed that for that specific data set the optimal ratio between the minimum and maximum chunk size was somewhere close to 4x. We therefore recommend using this ratio as a starting point.
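
As a purely hypothetical starting point (these exact values were not used in the benchmark above, which had a maximum of 14,785 bytes), the 4x ratio could translate into limits such as the following:

```go
package cdcsketch

const (
	minChunkSizeBytes = 4 * 1024              // 4 KiB minimum chunk size.
	maxChunkSizeBytes = 4 * minChunkSizeBytes // 16 KiB maximum: a 4x ratio.
)
```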

## Relationship to RDC FilterMax

Microsoft's Remote Differential Compression algorithm uses a content defined chunking algorithm named FilterMax. Just like MaxCDC, it attempts to insert cutting points at positions where the hash value of a rolling hash function is a local maximum. The main difference is that this is only checked within a small region that the algorithm calls the horizon. This results in a chunk size distribution that is geometric, similar to traditional Rabin fingerprinting implementations.
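
A simplified version of that local-maximum test could look as follows; the function name and the `horizon` parameter are made up for this sketch, and the hash values are assumed to have been computed with the same Gear hash as before.

```go
package cdcsketch

// isLocalMaximum reports whether the hash at position i is strictly greater
// than every other hash within `horizon` positions on either side of it.
func isLocalMaximum(hashes []uint64, i, horizon int) bool {
	lo, hi := i-horizon, i+horizon
	if lo < 0 {
		lo = 0
	}
	if hi > len(hashes)-1 {
		hi = len(hashes) - 1
	}
	for j := lo; j <= hi; j++ {
		if j != i && hashes[j] >= hashes[i] {
			return false
		}
	}
	return true
}
```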

Some testing of this construct in combination with the Gear hash function was performed, using the same methodology as described above. Deduplicating yielded 398,967 unique chunks with a combined size of 4,031,959,354 bytes. This is 4.11% worse than FastCDC8KB and 6.38% worse than MaxCDC. The average chunk size was 10,105 bytes, which is comparable to that of the previous tests.