Improvements from fork #11

Merged
merged 3 commits into from
Dec 11, 2024
27 changes: 22 additions & 5 deletions src/lib.rs
@@ -135,6 +135,9 @@ use std::num::Wrapping;
 /// hash, etc).
 pub trait ChunkerImpl {
     /// Look at the new bytes to maybe find a boundary.
+    /// The boundary is an index within `data`, after which the cut-point is set.
+    /// I.e., a return value of `Some(0)` indicates that only the first byte of this block should be
+    /// included in the current chunk.
     fn find_boundary(&mut self, data: &[u8]) -> Option<usize>;

     /// Reset the internal state after a chunk has been emitted
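To make the index convention in the new doc comment concrete, here is a minimal sketch of a delimiter-based implementation. The trait is re-declared locally so the example stands alone. The `reset` signature is an assumption (only its doc comment is visible in this hunk), and `NewlineChunker` and `split_chunks` are hypothetical names that are not part of this PR.

```rust
/// Local mirror of the trait from the diff, so the sketch compiles on its own.
/// The `reset` signature is an assumption; only its doc comment is visible above.
pub trait ChunkerImpl {
    fn find_boundary(&mut self, data: &[u8]) -> Option<usize>;
    fn reset(&mut self);
}

/// Hypothetical chunker that cuts after every newline byte.
pub struct NewlineChunker;

impl ChunkerImpl for NewlineChunker {
    fn find_boundary(&mut self, data: &[u8]) -> Option<usize> {
        // The returned index is the last byte that belongs to the current chunk.
        data.iter().position(|&b| b == b'\n')
    }

    fn reset(&mut self) {
        // Stateless, so there is nothing to reset.
    }
}

/// Hypothetical helper: splits `data` using the documented convention, i.e.
/// `Some(i)` means bytes `0..=i` of the block belong to the current chunk.
pub fn split_chunks<I: ChunkerImpl>(imp: &mut I, mut data: &[u8]) -> Vec<Vec<u8>> {
    let mut chunks = Vec::new();
    while let Some(i) = imp.find_boundary(data) {
        chunks.push(data[..=i].to_vec());
        imp.reset();
        data = &data[i + 1..];
    }
    if !data.is_empty() {
        // Trailing bytes for which no boundary was found.
        chunks.push(data.to_vec());
    }
    chunks
}
```

With this convention, a block that starts with a newline yields `Some(0)`, and the resulting chunk is exactly one byte long.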
@@ -252,11 +255,7 @@ impl<I: ChunkerImpl> Chunker<I> {
     pub fn max_size(self, max: usize) -> Chunker<SizeLimited<I>> {
         assert!(max > 0);
         Chunker {
-            inner: SizeLimited {
-                inner: self.inner,
-                pos: 0,
-                max_size: max,
-            },
+            inner: SizeLimited::new(self.inner, max),
         }
     }
 }
@@ -288,6 +287,7 @@ impl<R: Read, I: ChunkerImpl> Iterator for WholeChunks<R, I> {
 /// Objects returned from the ChunkStream iterator.
 ///
 /// This is either more data in the current chunk, or a chunk boundary.
+#[derive(Debug)]
 pub enum ChunkInput<'a> {
     Data(&'a [u8]),
     End,
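Deriving `Debug` lets `ChunkInput` values be printed with `{:?}`, for example in log lines or `assert_eq!` failure messages. The sketch below re-declares the enum locally so it compiles on its own; `describe` is a hypothetical helper, not part of this PR.

```rust
/// Local copy of the enum from the diff, so the sketch compiles on its own.
#[derive(Debug)]
pub enum ChunkInput<'a> {
    Data(&'a [u8]),
    End,
}

/// Hypothetical helper: renders a sequence of items as one space-separated line.
pub fn describe(items: &[ChunkInput<'_>]) -> String {
    items
        .iter()
        .map(|item| format!("{:?}", item))
        .collect::<Vec<_>>()
        .join(" ")
}
```

For instance, `describe(&[ChunkInput::Data(&[1, 2]), ChunkInput::End])` yields `Data([1, 2]) End`.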
@@ -429,12 +429,29 @@ impl<'a, I: ChunkerImpl> Iterator for Slices<'a, I> {
     }
 }

+/// A wrapper that limits the size of produced chunks.
+///
+/// Note that the inner chunking implementation is reset when a chunk boundary is
+/// emitted because of the size limit. This will generally reduce content-dependence,
+/// and thus deduplication ratio, because the boundary is set by size rather than by
+/// content.
 pub struct SizeLimited<I: ChunkerImpl> {
     inner: I,
     pos: usize,
     max_size: usize,
 }

+impl<I: ChunkerImpl> SizeLimited<I> {
+    /// Wraps the given chunker implementation to limit the size of produced chunks.
+    pub fn new(inner: I, max_size: usize) -> Self {
+        SizeLimited {
+            inner,
+            pos: 0,
+            max_size,
+        }
+    }
+}
+
 impl<I: ChunkerImpl> ChunkerImpl for SizeLimited<I> {
     fn find_boundary(&mut self, data: &[u8]) -> Option<usize> {
         assert!(self.max_size > self.pos);
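The hunk is cut off before the body of `find_boundary`, so the following is only a sketch of how a size-limiting wrapper of this shape could behave, not the crate's actual implementation. The struct and `new` follow the diff. The body of `find_boundary`, the `reset` signature, and the `NeverCut` helper are assumptions made for illustration.

```rust
/// Local mirror of the trait; the `reset` signature is an assumption.
pub trait ChunkerImpl {
    fn find_boundary(&mut self, data: &[u8]) -> Option<usize>;
    fn reset(&mut self);
}

pub struct SizeLimited<I: ChunkerImpl> {
    inner: I,
    pos: usize,
    max_size: usize,
}

impl<I: ChunkerImpl> SizeLimited<I> {
    pub fn new(inner: I, max_size: usize) -> Self {
        SizeLimited { inner, pos: 0, max_size }
    }
}

impl<I: ChunkerImpl> ChunkerImpl for SizeLimited<I> {
    fn find_boundary(&mut self, data: &[u8]) -> Option<usize> {
        assert!(self.max_size > self.pos);
        // Assumed logic: the inner chunker only sees the bytes that still fit.
        let remaining = self.max_size - self.pos;
        let window = &data[..data.len().min(remaining)];
        if let Some(i) = self.inner.find_boundary(window) {
            return Some(i); // content-defined cut inside the limit
        }
        if data.len() >= remaining {
            Some(remaining - 1) // forced cut: the chunk reaches max_size here
        } else {
            self.pos += data.len();
            None
        }
    }

    fn reset(&mut self) {
        // Resetting the inner chunker as well matches the note in the doc comment.
        self.pos = 0;
        self.inner.reset();
    }
}

/// Hypothetical inner chunker that never reports a boundary.
pub struct NeverCut;

impl ChunkerImpl for NeverCut {
    fn find_boundary(&mut self, _data: &[u8]) -> Option<usize> {
        None
    }
    fn reset(&mut self) {}
}
```

With `SizeLimited::new(NeverCut, 4)`, feeding `b"ab"` and then `b"cdefgh"` produces `None` followed by `Some(1)`: the forced cut lands after the fourth byte of the chunk. The sketch relies on the caller invoking `reset` after every emitted boundary.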