Why is e.g. `decode_into` unsafe? Why `UnsafeCellSlice`? #133

flying-sheep · 2025-01-17T08:17:08Z

I wonder why the API takes UnsafeCellSlice<'_, u8> instead of just &mut [u8], and why it’s unsafe.

Assuming it’s eliding bound checks for performance: shouldn’t decode_into be safe and if really necessary for performance, we could add a decode_into_unchecked?

The text was updated successfully, but these errors were encountered:

LDeakin · 2025-01-17T08:59:56Z

I wonder why the API takes UnsafeCellSlice<'_, u8> instead of just &mut [u8], and why it’s unsafe.

I don't think I can take a &mut [u8] in these APIs without violating Rust's strict aliasing rules. I used to do something like this, but miri was not pleased.

Assuming it’s eliding bound checks for performance: shouldn’t decode_into be safe and if really necessary for performance, we could add a decode_into_unchecked?

~~I think the unsafe is just a holdover from my original intention for these methods to be internal. Yeah, I think you are right, these methods could have safe and _unchecked versions.~~ #133 (comment). Also, I really should be checking those invariants with debug assertions.

flying-sheep · 2025-01-17T12:12:45Z

I don't think I can take a &mut [u8] in these APIs without violating Rust's strict aliasing rules. I used to do something like this, but miri was not pleased.

I didn’t dive into this enough to understand: what prevents this from being written in 100% safe Rust, taking a &mut [u8] that isn’t aliased by anything else? What’s the reason for UnsafeCellSlice’s existence?

E.g. you can use slice.split_mut(pred) or slice.split_at_mut(usize) to get multiple unaliased mut slices from a slice.

LDeakin · 2025-01-17T12:46:58Z

Ok these have to stay unsafe actually, I'm missing writing the most important invariant that output subsets must be non-overlapping.

These methods have this signature so that threads can write to non overlapping multidimensional array subsets in parallel. If you can find a way to do that in safe Rust, I'd be interested.

flying-sheep · 2025-01-17T13:05:00Z

E.g. you can use slice.split_mut(pred) or slice.split_at_mut(usize) to get multiple unaliased mut slices from a slice.

these (and of course slice.chunks and other similarly named APIs) are the primary way I know how to make a long slice into multiple subslices, but I bet there are other APIs for other types.

so that threads can write to non overlapping multidimensional array subsets in parallel

Non-overlapping? Then there is no problem! doing that safely is pretty much one of Rust’s prime promises, I’m very convinced that you don‘t need any unsafe code here. E.g.: https://docs.rs/rayon/latest/rayon/slice/trait.ParallelSliceMut.html#method.par_chunks_mut

flying-sheep · 2025-01-17T15:19:43Z

OK I took a deeper look. With “multidimensional”, you mean that the output slices belonging to a chunk are interleaved right? So when retrieving e.g. the first chunk of 2D data, we have to write its first row to the start of the output array, then skip chunk_width*(n_chunk_h-1) elements, then write the chunk’s second row …

It doesn’t matter. My point is that you wrote a huge amount of unsafe code, when in reality, the only unsafe part is how arguments are passed, i.e. all functions have access to the whole output array and not just the part(s) they’re actually allowed to access. This is what I’d like to see changed.

The idea would be to write a helper parallel iterator which spits out tuples of (chunk_retrieval_info, writeable_memory)¹. Its implementation might be safe, or not, point is even if it’s unsafe, it would be the only unsafe spot in the code base that deals with keeping input and output aligned.

The result would probably have some nice code deduplication because the closures passed to rayon would no longer have to do their own offset calculations.

To solve the overlapping problem, writeable_memory could e.g. be an IntoIterator<Item=&mut [u8]>. Then when rayon gets to it, we can zip it together with the input data and write row-by-row without ever having created an array of slices. ↩

LDeakin · 2025-01-17T22:22:19Z

This is the crux of all this: allocate an output, write directly into it parallelised over chunks. Array::retrieve_array_subset_opt:

zarrs/zarrs/src/array/array_sync_readable.rs

Lines 710 to 755 in 2f5a665

    
           // Allocate the output 
        
           let size_output = array_subset.num_elements_usize() * data_type_size; 
        
           if size_output == 0 { 
        
               return Ok(ArrayBytes::new_flen(vec![])); 
        
           } 
        
           let mut output = Vec::with_capacity(size_output); 
        
           { 
        
               let output = 
        
                   UnsafeCellSlice::new_from_vec_with_spare_capacity(&mut output); 
        
               let retrieve_chunk = |chunk_indices: Vec<u64>| { 
        
                   let chunk_subset = self.chunk_subset(&chunk_indices)?; 
        
                   let chunk_subset_overlap = chunk_subset.overlap(array_subset)?; 
        
                   unsafe { 
        
                       self.retrieve_chunk_subset_into( 
        
                           &chunk_indices, 
        
                           &chunk_subset_overlap.relative_to(chunk_subset.start())?, 
        
                           &output, 
        
                           array_subset.shape(), 
        
                           &chunk_subset_overlap.relative_to(array_subset.start())?, 
        
                           &options, 
        
                       )?; 
        
                   } 
        
                   // let chunk_subset_bytes = self.retrieve_chunk_subset_opt( 
        
                   //     &chunk_indices, 
        
                   //     &chunk_subset_overlap.relative_to(chunk_subset.start())?, 
        
                   //     &options, 
        
                   // )?; 
        
                   // update_bytes_flen( 
        
                   //     &output, 
        
                   //     array_subset.shape(), 
        
                   //     &chunk_subset_bytes.into_fixed()?, 
        
                   //     &chunk_subset_overlap.relative_to(array_subset.start())?, 
        
                   //     data_type_size, 
        
                   // ); 
        
                   Ok::<_, ArrayError>(()) 
        
               }; 
        
               let indices = chunks.indices(); 
        
               iter_concurrent_limit!( 
        
                   chunk_concurrent_limit, 
        
                   indices, 
        
                   try_for_each, 
        
                   retrieve_chunk 
        
               )?; 
        
           } 
        
           unsafe { output.set_len(size_output) };

With “multidimensional”, you mean that the output slices belonging to a chunk are interleaved right? So when retrieving e.g. the first chunk of 2D data, we have to write its first row to the start of the output array, then skip chunk_width*(n_chunk_h-1) elements

Yes, but not quite. The iteration is not so simple for an arbitrary subset of a multidimensional array. E.g. with a 2(planes)*4(columns)*4(rows) array chunked as 2x2x2:

Plane 0        Plane 1
|a0 a1|b0 b1|  |a4 a5|b4 b5|
|a2 a3|b2 b3|  |a6 a7|b6 b7|
|c0 c1|d0 d1|  |c4 c5|d4 d5|
|c2 c3|d2 d3|  |c6 c7|d6 d7|

The offsets of chunk a elements are [0, 1, 4, 5, 16, 17, 20, 21], and iteration of course needs to be over the contiguous intervals within each chunk to memcpy into them.

The idea would be to write a helper parallel iterator which spits out tuples of (chunk_retrieval_info, writeable_memory)1. Its implementation might be safe, or not, point is even if it’s unsafe, it would be the only unsafe spot in the code base that deals with keeping input and output aligned.

I cannot picture a parallel iterator that does not use unsafe. But a safe serial iterator + ParallelBridge might be feasible. That would hurt performance, but would be easier to write and interesting for a POC of such an API. Relevant: writing parallel iterators for rayon. Such an iterator needs to support a few things you may not have considered:

Arbitrary chunk grids, not just regular
Further subdivision (e.g. the sharding codec is internally doing it's own decode_into)

I agree with your sentiment that it would be much better if the decode_into methods were safe and could only mutate the part they are allowed to, and I'd certainly like to see a POC if you are willing. But, I don't think it will be easy, and achieving performance parity with the current abstraction may not be feasible.

flying-sheep · 2025-01-18T12:48:37Z

the focus of my idea is this:

The idea would be to write a helper parallel iterator which spits out tuples of (chunk_retrieval_info, writeable_memory)1. Its implementation might be safe, or not, point is even if it’s unsafe, it would be the only unsafe spot in the code base that deals with keeping input and output aligned.

all other functions downstream from there (like decode_into) could be safe by being handed only the memory section(s) they’re allowed to write to.

achieving performance parity with the current abstraction may not be feasible.

I’m pretty sure the performance penalty will be neglegible or non-existent. We have to calculate the offsets anyway. It makes no difference if we

pass down the output reference and the information on which continuous slices in there we’re allowed to write to, or if we
pass down a parallel iterator yielding structs holding both of these things while having safe APIs that only allow to access the parts of memory that correspond to the chunk representing it.

2. would look something like

fn chunks_mut(&[ChunkInfo], output: &mut [u8]) -> impl IntoParallelIterator<Item=(
	&ChunkInfo,                   // for retrieval
	impl IntoIterator<&mut [u8]>, // for storage, none of these overlap
)> {
	// almost all of the unsafe code of the crate here, super well tested
}

(but of course as actual traits or stucts that allow accessing more relevant data. And of course we could also do things like pairing fetch offsets with output slices in the iterator item)

the amount of calculation and the amount of information stored in memory at any point is exactly the same, while at no point we have to pass on output as a whole without a safe API wrapping it.

LDeakin · 2025-01-19T07:25:44Z

Thanks for raising this @flying-sheep, I think I am on to something with #136.

I've made the decode_into() and related methods safe. The unsafe part is the construction of ArrayBytesFixedDisjointView structs, and new iterator/s could slot in to construct those via a safe API. This is still the same UnsafeCellSlice approach, but with constrained writes through ArrayBytesFixedDisjointView::{copy_from_slice,fill}.

flying-sheep · 2025-01-19T10:30:02Z

First off: I’m sorry that I’m all talk and no contribution here until now, if you want me to take a swing, I’d be happy to!

That being said: awesome! I think the gradual approach makes much more sense than to try to make as much safe as possible in one big patch.

After what you described, I think it would be possible to make an abstraction around patterns like the following, i.e. make it possible to safely iterate over all sub-chunks of a ArrayBytesFixedDisjointView or array, right?

Both of these patterns should always be safe, since “iterate over all possible chunks” always creates disjoint chunks.

let decode_chunk = |chunk_index: usize| {
    let output_subset_chunk = … chunk_index …;
    let mut output_view_inner_chunk =
        unsafe { output_view.subdivide_unchecked(output_subset_chunk) };
   …
};

iter_concurrent_limit!(
    shard_concurrent_limit,
    (0..num_chunks),
    try_for_each,
    decode_chunk
)?;

and

let retrieve_shard_into_slice = |shard_indices: Vec<u64>| {
    let mut output_view = unsafe {
        ArrayBytesFixedDisjointView::new_unchecked(…)
    };
    …
};
let indices = shards.indices();
iter_concurrent_limit!(
    chunk_concurrent_limit,
    indices,
    try_for_each,
    retrieve_shard_into_slice
)?;

flying-sheep mentioned this issue Jan 17, 2025

More validation for pointer casts? Use buffer protocol for importing data? ilan-gold/zarrs-python#78

Closed

LDeakin added this to the zarrs 0.20 milestone Jan 19, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Why is e.g. `decode_into` unsafe? Why `UnsafeCellSlice`? #133

Why is e.g. `decode_into` unsafe? Why `UnsafeCellSlice`? #133

flying-sheep commented Jan 17, 2025

LDeakin commented Jan 17, 2025 •

edited

Loading

flying-sheep commented Jan 17, 2025 •

edited

Loading

LDeakin commented Jan 17, 2025

flying-sheep commented Jan 17, 2025 •

edited

Loading

flying-sheep commented Jan 17, 2025 •

edited

Loading

LDeakin commented Jan 17, 2025

flying-sheep commented Jan 18, 2025 •

edited

Loading

LDeakin commented Jan 19, 2025

flying-sheep commented Jan 19, 2025 •

edited

Loading

Why is e.g. decode_into unsafe? Why UnsafeCellSlice? #133

Why is e.g. decode_into unsafe? Why UnsafeCellSlice? #133

Comments

flying-sheep commented Jan 17, 2025

LDeakin commented Jan 17, 2025 • edited Loading

flying-sheep commented Jan 17, 2025 • edited Loading

LDeakin commented Jan 17, 2025

flying-sheep commented Jan 17, 2025 • edited Loading

flying-sheep commented Jan 17, 2025 • edited Loading

Footnotes

LDeakin commented Jan 17, 2025

flying-sheep commented Jan 18, 2025 • edited Loading

LDeakin commented Jan 19, 2025

flying-sheep commented Jan 19, 2025 • edited Loading

Why is e.g. `decode_into` unsafe? Why `UnsafeCellSlice`? #133

Why is e.g. `decode_into` unsafe? Why `UnsafeCellSlice`? #133

LDeakin commented Jan 17, 2025 •

edited

Loading

flying-sheep commented Jan 17, 2025 •

edited

Loading

flying-sheep commented Jan 17, 2025 •

edited

Loading

flying-sheep commented Jan 17, 2025 •

edited

Loading

flying-sheep commented Jan 18, 2025 •

edited

Loading

flying-sheep commented Jan 19, 2025 •

edited

Loading