-
Notifications
You must be signed in to change notification settings - Fork 1k
Support variant_to_arrow
for utf8
#8600
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
f6e88ef
61ed178
1fb612d
2b6d280
398b52d
defa07b
5022acd
ed66007
196b5d4
642d192
76b3c80
35785d6
5914218
216d401
95e94d0
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -348,6 +348,50 @@ impl<O: OffsetSizeTrait> std::fmt::Write for GenericStringBuilder<O> { | |
} | ||
} | ||
|
||
/// A byte size value representing the number of bytes to allocate per string in [`GenericStringBuilder`] | ||
/// | ||
/// To create a [`GenericStringBuilder`] using `.with_capacity` we are required to provide: \ | ||
/// - `item_capacity` - the row count \ | ||
/// - `data_capacity` - total string byte count \ | ||
/// | ||
/// We will use the `AVERAGE_STRING_LENGTH` * row_count for `data_capacity`. \ | ||
/// | ||
/// These capacities are preallocation hints used to improve performance, | ||
/// but consuquences of passing a hint too large or too small should be negligible. | ||
const AVERAGE_STRING_LENGTH: usize = 16; | ||
/// Trait for string-like array builders | ||
/// | ||
/// This trait provides unified interface for builders that append string-like data | ||
/// such as [`GenericStringBuilder<O>`] and [`crate::builder::StringViewBuilder`] | ||
pub trait StringLikeArrayBuilder: ArrayBuilder { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. nit: does it actually need to be pub? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. update: I just realized -- this is making a public API change to arrow-array (not isolated to variant crate). I'm fine with that, in principle, but we should make sure it's a very intentional change? CC @alamb There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I was not sure where to put it at first, but I don't think it has to be pub There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it is ok to make it pub -- this seems like a reasonable API to me. We actually have something like this in DataFusion already so it makes sense There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. FWIW this will go into arrow 57.1.0 (not in 57.0.0, which is due out tomorrow). |
||
/// Returns a human-readable type name for the builder. | ||
fn type_name() -> &'static str; | ||
|
||
/// Creates a new builder with the given row capacity. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. pedantically, this also allocates 16x the capacity for the variable payload in StringArray/ LargeStringArray too (it isn't just the row capacity) |
||
fn with_capacity(capacity: usize) -> Self; | ||
|
||
/// Appends a non-null string value to the builder. | ||
fn append_value(&mut self, value: &str); | ||
|
||
/// Appends a null value to the builder. | ||
fn append_null(&mut self); | ||
} | ||
|
||
impl<O: OffsetSizeTrait> StringLikeArrayBuilder for GenericStringBuilder<O> { | ||
fn type_name() -> &'static str { | ||
std::any::type_name::<Self>() | ||
} | ||
fn with_capacity(capacity: usize) -> Self { | ||
Self::with_capacity(capacity, capacity * AVERAGE_STRING_LENGTH) | ||
} | ||
fn append_value(&mut self, value: &str) { | ||
Self::append_value(self, value); | ||
} | ||
fn append_null(&mut self) { | ||
Self::append_null(self); | ||
} | ||
} | ||
|
||
/// Array builder for [`GenericBinaryArray`][crate::GenericBinaryArray] | ||
/// | ||
/// Values can be appended using [`GenericByteBuilder::append_value`], and nulls with | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -947,6 +947,16 @@ fn typed_value_to_variant<'a>( | |
let value = array.value(index); | ||
Variant::from(value) | ||
} | ||
DataType::LargeUtf8 => { | ||
let array = typed_value.as_string::<i64>(); | ||
let value = array.value(index); | ||
Variant::from(value) | ||
} | ||
DataType::Utf8View => { | ||
let array = typed_value.as_string_view(); | ||
let value = array.value(index); | ||
Variant::from(value) | ||
} | ||
Comment on lines
947
to
+959
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. aside: there's a lot of duplication here (both new and existing code). Should we consider tracking a follow-up item to introduce either a trait or a macro that abstracts away the boilerplate? |
||
DataType::Int8 => { | ||
primitive_conversion_single_value!(Int8Type, typed_value, index) | ||
} | ||
|
@@ -1098,14 +1108,14 @@ fn canonicalize_and_verify_data_type( | |
// Binary and string are allowed. Force Binary to BinaryView because that's what the parquet | ||
// reader returns and what the rest of the variant code expects. | ||
Binary => Cow::Owned(DataType::BinaryView), | ||
BinaryView | Utf8 => borrow!(), | ||
BinaryView | Utf8 | LargeUtf8 | Utf8View => borrow!(), | ||
|
||
// UUID maps to 16-byte fixed-size binary; no other width is allowed | ||
FixedSizeBinary(16) => borrow!(), | ||
FixedSizeBinary(_) | FixedSizeList(..) => fail!(), | ||
|
||
// We can _possibly_ allow (some of) these some day? | ||
LargeBinary | LargeUtf8 | Utf8View | ListView(_) | LargeList(_) | LargeListView(_) => { | ||
LargeBinary | ListView(_) | LargeList(_) | LargeListView(_) => { | ||
fail!() | ||
} | ||
|
||
|
Original file line number | Diff line number | Diff line change | ||||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|
@@ -299,7 +299,8 @@ mod test { | |||||||||||||||||||
use arrow::array::{ | ||||||||||||||||||||
Array, ArrayRef, AsArray, BinaryViewArray, BooleanArray, Date32Array, Decimal32Array, | ||||||||||||||||||||
Decimal64Array, Decimal128Array, Decimal256Array, Float32Array, Float64Array, Int8Array, | ||||||||||||||||||||
Int16Array, Int32Array, Int64Array, StringArray, StructArray, | ||||||||||||||||||||
Int16Array, Int32Array, Int64Array, LargeStringArray, StringArray, StringViewArray, | ||||||||||||||||||||
StructArray, | ||||||||||||||||||||
}; | ||||||||||||||||||||
use arrow::buffer::NullBuffer; | ||||||||||||||||||||
use arrow::compute::CastOptions; | ||||||||||||||||||||
|
@@ -766,6 +767,27 @@ mod test { | |||||||||||||||||||
BooleanArray::from(vec![Some(true), Some(false), Some(true)]) | ||||||||||||||||||||
); | ||||||||||||||||||||
|
||||||||||||||||||||
perfectly_shredded_to_arrow_primitive_test!( | ||||||||||||||||||||
get_variant_perfectly_shredded_utf8_as_utf8, | ||||||||||||||||||||
DataType::Utf8, | ||||||||||||||||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Do we need to add tests for other types( The test here wants to cover the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Shredding is not supported for I originally added the tests for them inside There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. That would be from the fn canonicalize_and_verify_data_type(
data_type: &DataType,
) -> Result<Cow<'_, DataType>, ArrowError> {
...
let new_data_type = match data_type {
...
// We can _possibly_ allow (some of) these some day?
LargeBinary | LargeUtf8 | Utf8View | ListView(_) | LargeList(_) | LargeListView(_) => {
fail!()
} I originally added that code because I was not confident I knew what the correct behavior should be. The shredding spec says:
But I'm pretty sure that doesn't need to constrain the use of A similar dilemma is that the If we think the shredding spec forbids If we don't think the shredding spec forbids those types, then we should probably support There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think it is perfectly reasonable to call In terms of the Shredding Spec, https://github.com/apache/parquet-format/blob/master/VariantShredding.md is in terms of the Parquet type system which doesn't distinguish between string types like Utf8/LargeUtf8/Utf8View So my opinion is that we should (eventually) support those different string types, though it doesn't have to be in this PR Also, maybe it could be something simple such as There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @alamb -- I don't think it's difficult to support one vs. all of the string and binary types (with help from a small trait) -- it was just a question of whether it was legal. I tend to agree with you that it is legal and we should just move forward with it (possibly in this PR, since it's easy, or as a follow-up, if preferred). Would be nice if this made the arrow-57 cut. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @scovich I would do it in one PR as you've already made some suggestion I can commit here. |
||||||||||||||||||||
perfectly_shredded_utf8_variant_array, | ||||||||||||||||||||
StringArray::from(vec![Some("foo"), Some("bar"), Some("baz")]) | ||||||||||||||||||||
); | ||||||||||||||||||||
|
||||||||||||||||||||
perfectly_shredded_to_arrow_primitive_test!( | ||||||||||||||||||||
get_variant_perfectly_shredded_large_utf8_as_utf8, | ||||||||||||||||||||
DataType::Utf8, | ||||||||||||||||||||
perfectly_shredded_large_utf8_variant_array, | ||||||||||||||||||||
StringArray::from(vec![Some("foo"), Some("bar"), Some("baz")]) | ||||||||||||||||||||
); | ||||||||||||||||||||
|
||||||||||||||||||||
perfectly_shredded_to_arrow_primitive_test!( | ||||||||||||||||||||
get_variant_perfectly_shredded_utf8_view_as_utf8, | ||||||||||||||||||||
DataType::Utf8, | ||||||||||||||||||||
perfectly_shredded_utf8_view_variant_array, | ||||||||||||||||||||
StringArray::from(vec![Some("foo"), Some("bar"), Some("baz")]) | ||||||||||||||||||||
); | ||||||||||||||||||||
|
||||||||||||||||||||
macro_rules! perfectly_shredded_variant_array_fn { | ||||||||||||||||||||
($func:ident, $typed_value_gen:expr) => { | ||||||||||||||||||||
fn $func() -> ArrayRef { | ||||||||||||||||||||
|
@@ -789,6 +811,18 @@ mod test { | |||||||||||||||||||
}; | ||||||||||||||||||||
} | ||||||||||||||||||||
|
||||||||||||||||||||
perfectly_shredded_variant_array_fn!(perfectly_shredded_utf8_variant_array, || { | ||||||||||||||||||||
StringArray::from(vec![Some("foo"), Some("bar"), Some("baz")]) | ||||||||||||||||||||
}); | ||||||||||||||||||||
|
||||||||||||||||||||
perfectly_shredded_variant_array_fn!(perfectly_shredded_large_utf8_variant_array, || { | ||||||||||||||||||||
LargeStringArray::from(vec![Some("foo"), Some("bar"), Some("baz")]) | ||||||||||||||||||||
}); | ||||||||||||||||||||
|
||||||||||||||||||||
perfectly_shredded_variant_array_fn!(perfectly_shredded_utf8_view_variant_array, || { | ||||||||||||||||||||
StringViewArray::from(vec![Some("foo"), Some("bar"), Some("baz")]) | ||||||||||||||||||||
}); | ||||||||||||||||||||
|
||||||||||||||||||||
perfectly_shredded_variant_array_fn!(perfectly_shredded_bool_variant_array, || { | ||||||||||||||||||||
BooleanArray::from(vec![Some(true), Some(false), Some(true)]) | ||||||||||||||||||||
}); | ||||||||||||||||||||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -16,7 +16,8 @@ | |
// under the License. | ||
|
||
use arrow::array::{ | ||
ArrayRef, BinaryViewArray, BooleanBuilder, NullBufferBuilder, PrimitiveBuilder, | ||
ArrayRef, BinaryViewArray, LargeStringBuilder, NullBufferBuilder, PrimitiveBuilder, | ||
StringBuilder, StringLikeArrayBuilder, StringViewBuilder, builder::BooleanBuilder, | ||
}; | ||
use arrow::compute::{CastOptions, DecimalCast}; | ||
use arrow::datatypes::{self, DataType, DecimalType}; | ||
|
@@ -58,6 +59,9 @@ pub(crate) enum PrimitiveVariantToArrowRowBuilder<'a> { | |
TimestampNano(VariantToTimestampArrowRowBuilder<'a, datatypes::TimestampNanosecondType>), | ||
TimestampNanoNtz(VariantToTimestampNtzArrowRowBuilder<'a, datatypes::TimestampNanosecondType>), | ||
Date(VariantToPrimitiveArrowRowBuilder<'a, datatypes::Date32Type>), | ||
String(VariantToStringArrowBuilder<'a, StringBuilder>), | ||
LargeString(VariantToStringArrowBuilder<'a, LargeStringBuilder>), | ||
StringView(VariantToStringArrowBuilder<'a, StringViewBuilder>), | ||
} | ||
|
||
/// Builder for converting variant values into strongly typed Arrow arrays. | ||
|
@@ -97,6 +101,9 @@ impl<'a> PrimitiveVariantToArrowRowBuilder<'a> { | |
TimestampNano(b) => b.append_null(), | ||
TimestampNanoNtz(b) => b.append_null(), | ||
Date(b) => b.append_null(), | ||
String(b) => b.append_null(), | ||
LargeString(b) => b.append_null(), | ||
StringView(b) => b.append_null(), | ||
} | ||
} | ||
|
||
|
@@ -124,6 +131,9 @@ impl<'a> PrimitiveVariantToArrowRowBuilder<'a> { | |
TimestampNano(b) => b.append_value(value), | ||
TimestampNanoNtz(b) => b.append_value(value), | ||
Date(b) => b.append_value(value), | ||
String(b) => b.append_value(value), | ||
LargeString(b) => b.append_value(value), | ||
StringView(b) => b.append_value(value), | ||
} | ||
} | ||
|
||
|
@@ -151,6 +161,9 @@ impl<'a> PrimitiveVariantToArrowRowBuilder<'a> { | |
TimestampNano(b) => b.finish(), | ||
TimestampNanoNtz(b) => b.finish(), | ||
Date(b) => b.finish(), | ||
String(b) => b.finish(), | ||
LargeString(b) => b.finish(), | ||
StringView(b) => b.finish(), | ||
} | ||
} | ||
} | ||
|
@@ -269,6 +282,13 @@ pub(crate) fn make_primitive_variant_to_arrow_row_builder<'a>( | |
cast_options, | ||
capacity, | ||
)), | ||
DataType::Utf8 => String(VariantToStringArrowBuilder::new(cast_options, capacity)), | ||
DataType::LargeUtf8 => { | ||
LargeString(VariantToStringArrowBuilder::new(cast_options, capacity)) | ||
} | ||
DataType::Utf8View => { | ||
StringView(VariantToStringArrowBuilder::new(cast_options, capacity)) | ||
} | ||
_ if data_type.is_primitive() => { | ||
return Err(ArrowError::NotYetImplemented(format!( | ||
"Primitive data_type {data_type:?} not yet implemented" | ||
|
@@ -413,6 +433,13 @@ macro_rules! define_variant_to_primitive_builder { | |
} | ||
} | ||
|
||
define_variant_to_primitive_builder!( | ||
struct VariantToStringArrowBuilder<'a, B: StringLikeArrayBuilder> | ||
|capacity| -> B { B::with_capacity(capacity) }, | ||
|value| value.as_string(), | ||
type_name: B::type_name() | ||
); | ||
Comment on lines
+436
to
+441
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🎉 There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. that is pretty nice |
||
|
||
define_variant_to_primitive_builder!( | ||
struct VariantToBooleanArrowRowBuilder<'a> | ||
|capacity| -> BooleanBuilder { BooleanBuilder::with_capacity(capacity) }, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add a comment about this magic number?
Maybe something related to: #8600 (comment)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My only real concern with this hint is that it will work for some users and not for others (e.g. it will over allocate memory for short strings). Short of forcing the caller to pass in the variable capacity I can't see any way around it that doesn't have other tradeoffs
if the need lower level control they can always use the underlying builder, so I think this default is ok