Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
44 changes: 44 additions & 0 deletions arrow-array/src/builder/generic_bytes_builder.rs
Original file line number Diff line number Diff line change
Expand Up @@ -348,6 +348,50 @@ impl<O: OffsetSizeTrait> std::fmt::Write for GenericStringBuilder<O> {
}
}

/// A byte size value representing the number of bytes to allocate per string in [`GenericStringBuilder`]
///
/// To create a [`GenericStringBuilder`] using `.with_capacity` we are required to provide: \
/// - `item_capacity` - the row count \
/// - `data_capacity` - total string byte count \
///
/// We will use the `AVERAGE_STRING_LENGTH` * row_count for `data_capacity`. \
///
/// These capacities are preallocation hints used to improve performance,
/// but consuquences of passing a hint too large or too small should be negligible.
const AVERAGE_STRING_LENGTH: usize = 16;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add a comment about this magic number?

Maybe something related to: #8600 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My only real concern with this hint is that it will work for some users and not for others (e.g. it will over allocate memory for short strings). Short of forcing the caller to pass in the variable capacity I can't see any way around it that doesn't have other tradeoffs

if the need lower level control they can always use the underlying builder, so I think this default is ok

/// Trait for string-like array builders
///
/// This trait provides unified interface for builders that append string-like data
/// such as [`GenericStringBuilder<O>`] and [`crate::builder::StringViewBuilder`]
pub trait StringLikeArrayBuilder: ArrayBuilder {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: does it actually need to be pub?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: I just realized -- this is making a public API change to arrow-array (not isolated to variant crate).

I'm fine with that, in principle, but we should make sure it's a very intentional change?
In particular, it's a one-way door to make this public, but a two-way door to make it variant-only at first.

CC @alamb

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was not sure where to put it at first, but I don't think it has to be pub

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is ok to make it pub -- this seems like a reasonable API to me. We actually have something like this in DataFusion already so it makes sense

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW this will go into arrow 57.1.0 (not in 57.0.0, which is due out tomorrow).

/// Returns a human-readable type name for the builder.
fn type_name() -> &'static str;

/// Creates a new builder with the given row capacity.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pedantically, this also allocates 16x the capacity for the variable payload in StringArray/ LargeStringArray too (it isn't just the row capacity)

fn with_capacity(capacity: usize) -> Self;

/// Appends a non-null string value to the builder.
fn append_value(&mut self, value: &str);

/// Appends a null value to the builder.
fn append_null(&mut self);
}

impl<O: OffsetSizeTrait> StringLikeArrayBuilder for GenericStringBuilder<O> {
fn type_name() -> &'static str {
std::any::type_name::<Self>()
}
fn with_capacity(capacity: usize) -> Self {
Self::with_capacity(capacity, capacity * AVERAGE_STRING_LENGTH)
}
fn append_value(&mut self, value: &str) {
Self::append_value(self, value);
}
fn append_null(&mut self) {
Self::append_null(self);
}
}

/// Array builder for [`GenericBinaryArray`][crate::GenericBinaryArray]
///
/// Values can be appended using [`GenericByteBuilder::append_value`], and nulls with
Expand Down
17 changes: 16 additions & 1 deletion arrow-array/src/builder/generic_bytes_view_builder.rs
Original file line number Diff line number Diff line change
Expand Up @@ -25,7 +25,7 @@ use arrow_schema::ArrowError;
use hashbrown::HashTable;
use hashbrown::hash_table::Entry;

use crate::builder::ArrayBuilder;
use crate::builder::{ArrayBuilder, StringLikeArrayBuilder};
use crate::types::bytes::ByteArrayNativeType;
use crate::types::{BinaryViewType, ByteViewType, StringViewType};
use crate::{Array, ArrayRef, GenericByteViewArray};
Expand Down Expand Up @@ -533,6 +533,21 @@ impl<T: ByteViewType + ?Sized, V: AsRef<T::Native>> Extend<Option<V>>
/// ```
pub type StringViewBuilder = GenericByteViewBuilder<StringViewType>;

impl StringLikeArrayBuilder for StringViewBuilder {
fn type_name() -> &'static str {
std::any::type_name::<StringViewBuilder>()
}
fn with_capacity(capacity: usize) -> Self {
Self::with_capacity(capacity)
}
fn append_value(&mut self, value: &str) {
Self::append_value(self, value);
}
fn append_null(&mut self) {
Self::append_null(self);
}
}

/// Array builder for [`BinaryViewArray`][crate::BinaryViewArray]
///
/// Values can be appended using [`GenericByteViewBuilder::append_value`], and nulls with
Expand Down
14 changes: 12 additions & 2 deletions parquet-variant-compute/src/variant_array.rs
Original file line number Diff line number Diff line change
Expand Up @@ -947,6 +947,16 @@ fn typed_value_to_variant<'a>(
let value = array.value(index);
Variant::from(value)
}
DataType::LargeUtf8 => {
let array = typed_value.as_string::<i64>();
let value = array.value(index);
Variant::from(value)
}
DataType::Utf8View => {
let array = typed_value.as_string_view();
let value = array.value(index);
Variant::from(value)
}
Comment on lines 947 to +959
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

aside: there's a lot of duplication here (both new and existing code). Should we consider tracking a follow-up item to introduce either a trait or a macro that abstracts away the boilerplate?

DataType::Int8 => {
primitive_conversion_single_value!(Int8Type, typed_value, index)
}
Expand Down Expand Up @@ -1098,14 +1108,14 @@ fn canonicalize_and_verify_data_type(
// Binary and string are allowed. Force Binary to BinaryView because that's what the parquet
// reader returns and what the rest of the variant code expects.
Binary => Cow::Owned(DataType::BinaryView),
BinaryView | Utf8 => borrow!(),
BinaryView | Utf8 | LargeUtf8 | Utf8View => borrow!(),

// UUID maps to 16-byte fixed-size binary; no other width is allowed
FixedSizeBinary(16) => borrow!(),
FixedSizeBinary(_) | FixedSizeList(..) => fail!(),

// We can _possibly_ allow (some of) these some day?
LargeBinary | LargeUtf8 | Utf8View | ListView(_) | LargeList(_) | LargeListView(_) => {
LargeBinary | ListView(_) | LargeList(_) | LargeListView(_) => {
fail!()
}

Expand Down
36 changes: 35 additions & 1 deletion parquet-variant-compute/src/variant_get.rs
Original file line number Diff line number Diff line change
Expand Up @@ -299,7 +299,8 @@ mod test {
use arrow::array::{
Array, ArrayRef, AsArray, BinaryViewArray, BooleanArray, Date32Array, Decimal32Array,
Decimal64Array, Decimal128Array, Decimal256Array, Float32Array, Float64Array, Int8Array,
Int16Array, Int32Array, Int64Array, StringArray, StructArray,
Int16Array, Int32Array, Int64Array, LargeStringArray, StringArray, StringViewArray,
StructArray,
};
use arrow::buffer::NullBuffer;
use arrow::compute::CastOptions;
Expand Down Expand Up @@ -766,6 +767,27 @@ mod test {
BooleanArray::from(vec![Some(true), Some(false), Some(true)])
);

perfectly_shredded_to_arrow_primitive_test!(
get_variant_perfectly_shredded_utf8_as_utf8,
DataType::Utf8,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to add tests for other types(LargeUtf8/Utf8View) here?

The test here wants to cover the variant_get logic, and the tests added in variant_to_arrow.rs were to cover the logic of the builder?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shredding is not supported for LargeUtf8/Utf8View' per specification.

I originally added the tests for them inside variant_get but got the error saying these types do not support shredding.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be from the VariantArray constructor, which invokes this code:

fn canonicalize_and_verify_data_type(
    data_type: &DataType,
) -> Result<Cow<'_, DataType>, ArrowError> {
      ...
    let new_data_type = match data_type {
          ...
        // We can _possibly_ allow (some of) these some day?                                                                                                                                                                                                                   
        LargeBinary | LargeUtf8 | Utf8View | ListView(_) | LargeList(_) | LargeListView(_) => {
            fail!()
        }

I originally added that code because I was not confident I knew what the correct behavior should be. The shredding spec says:

Shredded values must use the following Parquet types:

Variant Type Parquet Physical Type Parquet Logical Type
...
binary BINARY
string BINARY STRING
...
array GROUP; see Arrays below LIST

But I'm pretty sure that doesn't need to constrain the use of DataType::Utf8 vs. DataType::LargeUtf8 vs DataType::Utf8Vew? (similar story for the various in-memory layouts of lists and binary values)?

A similar dilemma is that the metadata column is supposed to be parquet BINARY type, but arrow-parquet produces BinaryViewArray by default. Right now we replace DataType::Binary with DataType::BinaryView and force a cast as needed.

If we think the shredding spec forbids LargeUtf8 or Utf8View then we probably need to cast binary views back to normal binary as well.

If we don't think the shredding spec forbids those types, then we should probably support metadata: LargeBinaryArray (tho the narrowing cast to BinaryArray might fail if the offsets really don't fit in 32 bits).

@alamb @cashmand, any advice here?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is perfectly reasonable to call variant_get and ask for the output to be LargeUtf8 or Utf8View

In terms of the Shredding Spec, https://github.com/apache/parquet-format/blob/master/VariantShredding.md is in terms of the Parquet type system which doesn't distinguish between string types like Utf8/LargeUtf8/Utf8View

So my opinion is that we should (eventually) support those different string types, though it doesn't have to be in this PR

Also, maybe it could be something simple such as variant_get internally knows how to extract strings as Utf8 and then calls the cast kernel to cast to one of the other string types. We can build specialized codepaths for the other types if/when someone needs more performnace

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb -- I don't think it's difficult to support one vs. all of the string and binary types (with help from a small trait) -- it was just a question of whether it was legal. I tend to agree with you that it is legal and we should just move forward with it (possibly in this PR, since it's easy, or as a follow-up, if preferred).

Would be nice if this made the arrow-57 cut.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@scovich I would do it in one PR as you've already made some suggestion I can commit here.

perfectly_shredded_utf8_variant_array,
StringArray::from(vec![Some("foo"), Some("bar"), Some("baz")])
);

perfectly_shredded_to_arrow_primitive_test!(
get_variant_perfectly_shredded_large_utf8_as_utf8,
DataType::Utf8,
perfectly_shredded_large_utf8_variant_array,
StringArray::from(vec![Some("foo"), Some("bar"), Some("baz")])
);

perfectly_shredded_to_arrow_primitive_test!(
get_variant_perfectly_shredded_utf8_view_as_utf8,
DataType::Utf8,
perfectly_shredded_utf8_view_variant_array,
StringArray::from(vec![Some("foo"), Some("bar"), Some("baz")])
);

macro_rules! perfectly_shredded_variant_array_fn {
($func:ident, $typed_value_gen:expr) => {
fn $func() -> ArrayRef {
Expand All @@ -789,6 +811,18 @@ mod test {
};
}

perfectly_shredded_variant_array_fn!(perfectly_shredded_utf8_variant_array, || {
StringArray::from(vec![Some("foo"), Some("bar"), Some("baz")])
});

perfectly_shredded_variant_array_fn!(perfectly_shredded_large_utf8_variant_array, || {
LargeStringArray::from(vec![Some("foo"), Some("bar"), Some("baz")])
});

perfectly_shredded_variant_array_fn!(perfectly_shredded_utf8_view_variant_array, || {
StringViewArray::from(vec![Some("foo"), Some("bar"), Some("baz")])
});

perfectly_shredded_variant_array_fn!(perfectly_shredded_bool_variant_array, || {
BooleanArray::from(vec![Some(true), Some(false), Some(true)])
});
Expand Down
29 changes: 28 additions & 1 deletion parquet-variant-compute/src/variant_to_arrow.rs
Original file line number Diff line number Diff line change
Expand Up @@ -16,7 +16,8 @@
// under the License.

use arrow::array::{
ArrayRef, BinaryViewArray, BooleanBuilder, NullBufferBuilder, PrimitiveBuilder,
ArrayRef, BinaryViewArray, LargeStringBuilder, NullBufferBuilder, PrimitiveBuilder,
StringBuilder, StringLikeArrayBuilder, StringViewBuilder, builder::BooleanBuilder,
};
use arrow::compute::{CastOptions, DecimalCast};
use arrow::datatypes::{self, DataType, DecimalType};
Expand Down Expand Up @@ -58,6 +59,9 @@ pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> {
TimestampNano(VariantToTimestampArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
TimestampNanoNtz(VariantToTimestampNtzArrowRowBuilder<'a, datatypes::TimestampNanosecondType>),
Date(VariantToPrimitiveArrowRowBuilder<'a, datatypes::Date32Type>),
String(VariantToStringArrowBuilder<'a, StringBuilder>),
LargeString(VariantToStringArrowBuilder<'a, LargeStringBuilder>),
StringView(VariantToStringArrowBuilder<'a, StringViewBuilder>),
}

/// Builder for converting variant values into strongly typed Arrow arrays.
Expand Down Expand Up @@ -97,6 +101,9 @@ impl<'a> PrimitiveVariantToArrowRowBuilder<'a> {
TimestampNano(b) => b.append_null(),
TimestampNanoNtz(b) => b.append_null(),
Date(b) => b.append_null(),
String(b) => b.append_null(),
LargeString(b) => b.append_null(),
StringView(b) => b.append_null(),
}
}

Expand Down Expand Up @@ -124,6 +131,9 @@ impl<'a> PrimitiveVariantToArrowRowBuilder<'a> {
TimestampNano(b) => b.append_value(value),
TimestampNanoNtz(b) => b.append_value(value),
Date(b) => b.append_value(value),
String(b) => b.append_value(value),
LargeString(b) => b.append_value(value),
StringView(b) => b.append_value(value),
}
}

Expand Down Expand Up @@ -151,6 +161,9 @@ impl<'a> PrimitiveVariantToArrowRowBuilder<'a> {
TimestampNano(b) => b.finish(),
TimestampNanoNtz(b) => b.finish(),
Date(b) => b.finish(),
String(b) => b.finish(),
LargeString(b) => b.finish(),
StringView(b) => b.finish(),
}
}
}
Expand Down Expand Up @@ -269,6 +282,13 @@ pub(crate) fn make_primitive_variant_to_arrow_row_builder<'a>(
cast_options,
capacity,
)),
DataType::Utf8 => String(VariantToStringArrowBuilder::new(cast_options, capacity)),
DataType::LargeUtf8 => {
LargeString(VariantToStringArrowBuilder::new(cast_options, capacity))
}
DataType::Utf8View => {
StringView(VariantToStringArrowBuilder::new(cast_options, capacity))
}
_ if data_type.is_primitive() => {
return Err(ArrowError::NotYetImplemented(format!(
"Primitive data_type {data_type:?} not yet implemented"
Expand Down Expand Up @@ -413,6 +433,13 @@ macro_rules! define_variant_to_primitive_builder {
}
}

define_variant_to_primitive_builder!(
struct VariantToStringArrowBuilder<'a, B: StringLikeArrayBuilder>
|capacity| -> B { B::with_capacity(capacity) },
|value| value.as_string(),
type_name: B::type_name()
);
Comment on lines +436 to +441
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎉

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is pretty nice


define_variant_to_primitive_builder!(
struct VariantToBooleanArrowRowBuilder<'a>
|capacity| -> BooleanBuilder { BooleanBuilder::with_capacity(capacity) },
Expand Down
Loading