Content Defined Chunking (CDC)

Defining Content Defined Chunking

What is CDC?

Content-Defined Chunking (CDC) is a method of splitting a data stream into variable-sized blocks (“chunks”) based on the content of the data, rather than fixed byte offsets. In CDC, a rolling hash is computed over a sliding window of the data, and a cut-point (chunk boundary) is declared when the hash meets a simple condition (for example, when certain bits of the fingerprint equal a predefined pattern). This approach overcomes the “boundary-shift” problem of fixed-size chunking: if data is inserted or deleted, CDC can still find matching boundaries for the unchanged portions, greatly increasing deduplication effectiveness (FastCDC v2016, v2020).

CDC can detect on the order of 10–20% more redundancy than fixed‐size chunking and is widely used in backup, storage, and content-addressable systems to identify duplicate or similar data at fine granularity.

Unlike fixed-size chunking (FSC), CDC adapts chunk boundaries to the data. A simple example is using a polynomial rolling hash (Rabin fingerprint) on each window of $w$ bytes: if $\mathrm{fp} \bmod D = r$ for some constants $D$ and $r$, then a boundary is marked. In practice, Rabin-based CDC reliably finds boundaries that depend on the data content, but it can be CPU-intensive since it updates and checks the fingerprint at every byte. As a faster alternative, many systems use a Gear rolling hash (introduced in the Ddelta paper and used in FastCDC). A Gear hash maintains a 64-bit fingerprint $\mathrm{fp}$ and a static 256-entry table $G$ of random 64-bit values. Each new byte $b$ updates the fingerprint as:

$$\mathrm{fp} \leftarrow (\mathrm{fp} \ll 1) + G[b]$$

In other words, the previous fingerprint is left-shifted by 1 bit and a precomputed random value (based on the new byte) is added. Because it shifts out the oldest bit and adds the new byte’s contribution, the Gear hash is a true rolling hash. It uses only three simple operations per byte (one shift, one addition, and a table lookup), so it is extremely fast. By contrast, Rabin requires several shifts, XORs, and multiplications per byte (see Josh Leeb's blog about the 2016 paper). Empirical comparisons note that Gear “uses far fewer operations than Rabin”, making it a good high-speed CDC hash. The Gear table must be filled with random 64-bit values to ensure a uniform fingerprint distribution, but otherwise the algorithm is simple to implement.
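
As a concrete sketch (not taken from any particular library), a Gear table and the per-byte update might look like this in Python; the seed and the 64-bit truncation are arbitrary implementation choices:

import random

# 256 random 64-bit values; fixing the seed keeps chunk boundaries
# reproducible across runs (an implementation choice, not a requirement).
random.seed(0x5EED)
GEAR = [random.getrandbits(64) for _ in range(256)]

MASK64 = (1 << 64) - 1  # keep the fingerprint within 64 bits

def gear_update(fp: int, byte: int) -> int:
    # One Gear step: shift the fingerprint left by one bit and add G[b].
    return ((fp << 1) + GEAR[byte]) & MASK64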

Chunk Boundary Detection

To use CDC, one chooses an average target chunk size (for example, 8 KiB) and computes a corresponding bit mask or pattern. A common scheme is: choose $N = \log_2(\text{target size})$, then use a mask with $N$ bits set (e.g. $\text{mask} = \texttt{0x1FFF}$ for 13 bits, giving approximately 8 KiB). In practice, one often uses an "all-zero" check: declare a boundary when $(\mathrm{fp} \,\&\, \text{mask}) = 0$. Under the uniform-hash assumption, each bit of $\mathrm{fp}$ is independently 0 or 1 with probability $1/2$, so a mask with $N$ one-bits yields a $1/2^N$ probability of a boundary at each byte. Thus, on average, chunks are about $2^N$ bytes long, with an exponential-like distribution of sizes. In code, after updating $\mathrm{fp}$ with each new byte, one checks whether $(\mathrm{fp} \,\&\, \text{mask}) = 0$ and, to avoid tiny chunks, also enforces a minimum chunk size, skipping cut-points until at least $\text{min\_size}$ bytes have accumulated in the current chunk. One also typically enforces a maximum chunk size to cap runaway lengths. Typical parameters might be $\text{min} \approx 32\text{–}128\,\text{KiB}$ and $\text{max} \approx 256\text{–}1024\,\text{KiB}$ for large-file use (e.g. in PyArrow).
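
Concretely, deriving the mask and size bounds from a target average might look like the following sketch (the specific min/max values here are just illustrative):

TARGET_SIZE = 8 * 1024                # desired average chunk size: 8 KiB
N = TARGET_SIZE.bit_length() - 1      # N = log2(target) = 13
MASK = (1 << N) - 1                   # 13 one-bits -> 0x1FFF

MIN_SIZE = TARGET_SIZE // 4           # skip cut-points in the first 2 KiB
MAX_SIZE = TARGET_SIZE * 8            # force a cut at 64 KiB at the latest

# With a uniform hash, (fp & MASK) == 0 holds with probability 1 / 2**N,
# so chunks average roughly 2**N = 8192 bytes.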

When a cut-point is reached, the current chunk is output and the fingerprint resets (often to $0$) to begin a new chunk. In streaming implementations, one can also "roll" out the oldest byte from the fingerprint as the window advances, but in Gear's simple update formula no explicit removal step is required, since the left shift naturally discards the high-order bit. In effect, Gear continuously "rolls" over a sliding window whose size is determined by the mask. With a mask covering the low $N$ bits of the fingerprint, the effective sliding window is about $N$ bytes: a byte's contribution is shifted out of the judged bits after $N$ further bytes. For example, using a 13-bit mask in Gear yields an effective window of about 13 bytes, much smaller than the 48-byte sliding window typically used with Rabin for an 8 KiB target.
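
This window property is easy to check empirically: two streams that end with the same $N$ bytes produce identical values of fp & mask. A small demo, reusing gear_update, N, and MASK from the sketches above:

def gear_fp(data: bytes) -> int:
    # Fold an entire byte string through the Gear update.
    fp = 0
    for b in data:
        fp = gear_update(fp, b)
    return fp

tail = b"identical final bytes..."[-N:]                    # the shared last N bytes
x = b"some completely different prefix material" + tail
y = b"another, unrelated and longer, prefix......" + tail

# Bytes older than N positions have been shifted out of the low N bits,
# so the masked fingerprints agree:
assert (gear_fp(x) & MASK) == (gear_fp(y) & MASK)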

Implementation notes: If building your own CDC library, you need to generate a Gear table (for example, 256 random 64-bit values), choose appropriate chunk-size parameters, and implement the update-and-test logic described above. Many libraries (in C, Rust, and other languages) already provide CDC or FastCDC implementations; for instance, the fastcdc crate supports both the original (2016) and updated (2020) algorithms. The mask should be derived deterministically from the target size, for example $\text{mask} = (1 \ll N) - 1$ for $N$ bits. For better consistency, one typically pads or distributes the $N$ ones evenly (as in FastCDC) to enlarge the effective window. The chunking loop then conceptually becomes: update $\mathrm{fp}$, check that the current size is at least $\text{min\_size}$, and then test whether $(\mathrm{fp} \,\&\, \text{mask}) = 0$.

In summary, a basic Gear-based CDC algorithm can be described as follows: start with $\mathrm{fp} = 0$, and for each input byte $b$,

$$\mathrm{fp} \leftarrow (\mathrm{fp} \ll 1) + G[b]$$

and then apply the following logic:

if current_size >= min_size and (fp & mask) == 0:
    emit_chunk()   # content-defined cut-point found
    fp = 0         # reset the fingerprint for the next chunk
elif current_size >= max_size:
    emit_chunk()   # forced cut: the chunk reached the maximum size
    fp = 0

This simple scheme already yields content-aligned chunks, though it can produce a wide range of sizes if left without further adjustments.
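
Putting the pieces together, a complete (if unoptimised) chunker fits in a couple of dozen lines. The sketch below reuses the hypothetical GEAR table and MASK64 from earlier and picks illustrative size bounds; it implements the basic scheme above, not FastCDC:

from typing import Iterator, Tuple

def gear_chunks(data: bytes,
                min_size: int = 2 * 1024,
                avg_size: int = 8 * 1024,
                max_size: int = 64 * 1024) -> Iterator[Tuple[int, int]]:
    # Yield (start, end) byte offsets of content-defined chunks.
    n_bits = avg_size.bit_length() - 1
    mask = (1 << n_bits) - 1

    start = 0
    fp = 0
    for i, byte in enumerate(data):
        fp = ((fp << 1) + GEAR[byte]) & MASK64
        size = i - start + 1
        if (size >= min_size and (fp & mask) == 0) or size >= max_size:
            yield (start, i + 1)   # cut just after this byte
            start = i + 1
            fp = 0                 # reset the fingerprint for the next chunk
    if start < len(data):
        yield (start, len(data))   # trailing partial chunk

Running this over a buffer and hashing each (start, end) slice gives the content-addressed chunk list that a deduplicating store would compare against what it already holds.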

FastCDC (2016) Enhancements

The original FastCDC paper (Xia et al., 2016) built on the basic Gear CDC scheme with three key optimisations that boost speed while retaining a high deduplication ratio:

  1. Enhanced Hash Judgment (Padded Mask): Instead of using a plain mask of $N$ consecutive ones, FastCDC pads that mask with additional zeros to effectively enlarge the sliding window. In practice, one generates a larger bit-width mask (e.g. 48 bits) that still has only $\log_2(\text{target})$ effective ones scattered among them, so the expected cut probability is unchanged while each judged bit depends on a longer history of input bytes. Padding the mask in this way lets Gear achieve almost the same deduplication ratio as Rabin-based CDC while preserving Gear's speed.

  2. Sub-Minimum Cut-Point Skipping: FastCDC observes that once Gear makes hashing cheap, the speed bottleneck shifts to the judgment phase. To speed up processing, FastCDC explicitly skips checking the hash for a chunk boundary until a larger minimum size is reached. In other words, rather than checking $(\mathrm{fp} \,\&\, \text{mask}) = 0$ on every byte after the standard 2 KiB or 4 KiB prefix, it waits until the chunk has grown significantly. This "skip" is trivial to implement (just start judging after $\text{min\_size}$ bytes) but it can double the speed of chunk finding. The trade-off is a slight drop in deduplication ratio (since small natural cut-points are missed), on the order of ~10–15% in the worst case.

  3. Normalised Chunking: To recover the deduplication lost by skipping and to avoid extremely small or large chunks, FastCDC introduced "normalised chunking." This means using two masks (called $Mask_{Small}$ and $Mask_{Large}$) instead of one. While a chunk is still small (its current length is below the desired average), FastCDC applies $Mask_{Small}$, which has more 1 bits, so boundaries are less likely and the chunk is pushed to grow. Once the chunk exceeds the target size, it switches to $Mask_{Large}$, which has fewer 1 bits, making a cut more likely. By selectively biasing the cut probability based on current length, the algorithm concentrates chunk sizes around the target range. In effect, normalised chunking "normalises" the distribution: most chunks end up between, say, 0.5× and 1.5× the target size. (See the figure below.) This not only improves deduplication (by avoiding too many tiny or huge chunks) but also allows an even larger minimum-size skip without harming the ratio.

Figure: Chunk-size distribution for un-normalised CDC (a heavy-tailed spread) vs. FastCDC with normalisation (peaked near the target size), from Xia et al. 2016; reproduced in https://joshleeb.com/posts/fastcdc.html. Normalisation centres the sizes near the mean.

These three techniques made FastCDC much faster than classical Gear CDC or Rabin CDC. The 2016 study showed FastCDC to be roughly an order of magnitude faster than Rabin-based CDC (up to ~10×) and several times faster than earlier Gear- and Adler-based CDCs, while maintaining a near-Rabin deduplication ratio.
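
To give a flavour of the padded-mask idea, one hypothetical way to spread $N$ effective one-bits over a wider field is to place them at evenly spaced positions. The paper's actual padded masks are fixed constants, so treat the helper below purely as an illustration:

def spread_mask(effective_bits: int, width: int = 48) -> int:
    # Build a mask with `effective_bits` ones spread evenly across `width` bits.
    step = width / effective_bits
    mask = 0
    for k in range(effective_bits):
        mask |= 1 << int(k * step)
    return mask

# 13 effective ones spread over 48 bits: the cut probability is still 1/2**13,
# but the judged bits now depend on a longer history of input bytes.
print(hex(spread_mask(13)))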

FastCDC (2020) and Further Optimisations

A follow-up work in 2020 revisited FastCDC and added further refinements. The core algorithm remains Gear-based with padded masks and normalisation, but the inner loop is reorganised for speed, most notably by rolling the hash over two bytes per iteration instead of one.

In short, the "FastCDC 2020" refinements aim to make Gear CDC even faster with minimal change to the chunk-boundary semantics. The key principles (Gear hashing, padded mask, sub-minimum skip, normalised split) are the same; the updates simply allow chunking with fewer operations per byte and give more consistent output across library versions.
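
A rough, loop-unrolled sketch of the two-bytes-per-iteration idea is shown below, reusing the hypothetical GEAR table and MASK64 from earlier and a single mask for brevity; the paper's actual pseudocode merges the two updates more aggressively, so this only conveys the shape of the optimisation:

def gear_cut(data: bytes, mask: int, min_size: int, max_size: int) -> int:
    # Return the end offset of the first chunk, judging two bytes per loop pass.
    end = min(len(data), max_size)
    fp = 0
    i = min_size                  # sub-minimum skipping: no judgment before min_size
    while i + 1 < end:
        fp = ((fp << 1) + GEAR[data[i]]) & MASK64
        if (fp & mask) == 0:
            return i + 1
        fp = ((fp << 1) + GEAR[data[i + 1]]) & MASK64
        if (fp & mask) == 0:
            return i + 2
        i += 2
    if i < end:                   # odd trailing byte
        fp = ((fp << 1) + GEAR[data[i]]) & MASK64
        if (fp & mask) == 0:
            return i + 1
    return end                    # forced cut at max_size (or at end of data)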

Implementation Considerations

For anyone building a CDC library, the implementation notes above already cover the basics (Gear table, chunk-size parameters, update-and-test loop). For normalised chunking, the two masks can be derived from the base mask, for example:

  1. $Mask_{Small} = \text{mask} \,|\, (1 \ll (N + level))$ (one or more extra ones)
  2. $Mask_{Large} = \text{mask} \gg level$ (remove bits)

for some $level \ge 1$. The code then uses $Mask_{Small}$ while the current chunk length is $< M$ (the target average size), switching to $Mask_{Large}$ once the length is $\ge M$.
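
Translated into Python, the mask derivation and the length-dependent switch might look like this sketch (level = 1, reusing the hypothetical GEAR table and MASK64 from earlier):

def normalised_masks(n_bits: int, level: int = 1) -> tuple[int, int]:
    # Derive (mask_small, mask_large) from the base mask of `n_bits` ones.
    mask = (1 << n_bits) - 1
    mask_small = mask | (1 << (n_bits + level))   # extra one(s): cuts less likely
    mask_large = mask >> level                    # fewer ones: cuts more likely
    return mask_small, mask_large

def find_cut(data: bytes, n_bits: int = 13, min_size: int = 2 * 1024,
             max_size: int = 64 * 1024, level: int = 1) -> int:
    # Return the end offset of the first normalised chunk in `data`.
    target = 1 << n_bits                          # M, the desired average size
    mask_small, mask_large = normalised_masks(n_bits, level)
    end = min(len(data), max_size)
    fp = 0
    for i in range(min_size, end):                # skip judging the first min_size bytes
        fp = ((fp << 1) + GEAR[data[i]]) & MASK64
        mask = mask_small if (i + 1) < target else mask_large
        if (fp & mask) == 0:
            return i + 1
    return end                                    # forced cut at max_size (or end of data)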

CDC for Parquet Files

In the context of Parquet and data files, CDC is used to allow partial rewrites and deduplication of columns. By default, Parquet writers create fixed row-group boundaries and page sizes. With CDC enabled (use_content_defined_chunking=True in PyArrow), the writer instead splits each column into data pages at content-defined boundaries of the logical column values, before any encoding or compression. This ensures that small edits (like inserting rows or adding columns) only change nearby pages. For example, inserting a row in a column shifts all fixed-page boundaries and would invalidate many pages, but with CDC the chunking adjusts so that most data remains aligned. The Hugging Face Xet layer then sees identical byte chunks and avoids re-uploading unchanged pages.

In practice, using Parquet CDC on a dataset dramatically reduces the volume of changed data. The Hugging Face blog shows that with CDC enabled, adding or deleting rows leads to only ~6–8 MB of new data out of a 96 MB table (versus tens of MB without CDC). The Arrow docs put it like so:

When enabled, data pages are written according to content-defined chunk boundaries, determined by a rolling hash… When data in a column is modified, this approach minimises the number of changed data pages.

To implement something similar, a Parquet writer must collect each column’s values, run the CDC algorithm on their serialised representation, and cut pages accordingly. PyArrow automates this when the CDC flag is set.
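
For example, with a recent PyArrow release in which the experimental flag named above is available, enabling Parquet CDC when writing a table looks roughly like this (the table contents are, of course, just an illustration):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "id": list(range(100_000)),
    "value": [f"row-{i}" for i in range(100_000)],
})

# Data pages are cut at content-defined boundaries, so inserting or deleting
# a few rows later only rewrites the nearby pages rather than the whole column.
pq.write_table(table, "data.parquet", use_content_defined_chunking=True)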

Comparison to Other Methods

CDC (especially Gear/FastCDC) should be seen alongside other chunking strategies. Fixed-size chunking is simpler and cheaper per byte, but as discussed above it suffers from the boundary-shift problem and detects noticeably less redundancy; CDC spends a little more CPU per byte in exchange for markedly better deduplication and update behaviour.

Conclusion

So to recap, Content-Defined Chunking (CDC) is a powerful technique for splitting data based on its content rather than at fixed byte offsets, enabling fine-grained deduplication and efficient updates (e.g. adding or removing data from a dataset). The Gear hash offers a very fast rolling hash for CDC, and FastCDC's optimisations (padded masks, sub-minimum skipping, normalisation, double-byte rolling) make it practical for high-speed, large-scale use.

Libraries like PyArrow now expose CDC for Parquet writers, allowing developers to achieve much better data-transfer efficiency. Whether you implement CDC yourself or simply use it through a library, the practical concerns discussed here (tuning mask bits, min/max sizes, and normalisation) determine the balance between deduplication ratio and performance.