Skip to content

Conversation

@klion26
Copy link
Member

@klion26 klion26 commented Oct 17, 2025

Which issue does this PR close?

What changes are included in this PR?

  • Add logic in typed_value_to_variant/PrimitiveVariantToArrowRowBuilder for Null/Time/Decimal{4,8,16}
  • Implement PrimitiveFromVariant for Time64MicrosecondType
  • Add tests to cover the added logic

Are these changes tested?

Added some tests

Are there any user-facing changes?

No

@github-actions github-actions bot added the parquet-variant parquet-variant* crates label Oct 17, 2025
@klion26 klion26 force-pushed the primitive-null/time/decimal branch 3 times, most recently from 1d7e5e8 to b5bfaed Compare October 17, 2025 07:53
Copy link
Member Author

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb @scovich Please help review this when you're free, thanks.

let value = boolean_array.value(index);
Variant::from(value)
}
DataType::Date32 => {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has been adjusted to be grouped together with the Timestamp/Time.

fn append_value(&mut self, v: ());
}

impl AppendValueTrait for NullBuilder {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The function is needed by the define_variant_to_primitive_builder macro.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NullArray is kind of strange -- maybe rather than creating a full on builder for it, we could just keep count and then create the output array with https://docs.rs/arrow/latest/arrow/array/fn.new_null_array.html ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NullArray (with DataType::Null) is not the same as an array (of some arbitrary other type) full of nulls tho? But if we know the intended type, then I agree that new_null_array is better (since NullArray::is_null always returns false = weird)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now that I'm actually looking at the code (and not just reacting to the comments) --

It seems like we don't actually need a NullBuilder at all? We know the capacity, so just use NullArray::new:

struct FakeNullBuilder(NullArray);

impl FakeNullBuilder {
    fn new(capacity: usize) -> Self {
        Self(NullArray::new(capacity)
    }
    fn append_value(&mut self, _: ()) {}
    fn append_null(&mut self) {}
    fn finish(self) -> NullArray {
        self.0
    }
}

... and then the macro invocation us just:

|capacity| -> FakeNullBuilder { FakeNullBuilder::new(capacity) },

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NullArray (with DataType::Null) is not the same as an array (of some arbitrary other type) full of nulls tho?
new_null_array with different types are different, the following unit test will fail

        let first = new_null_array(&DataType::Int32, 3);
        let second = new_null_array(&DataType::Timestamp(TimeUnit::Nanosecond, None), 3);

        assert_eq!(*first, second.clone());

Use NullArray instead of new_null_array in the changed code because the compiler will panic with the trait bound Result<Arc<dyn Array>, ArrowError>: Array is not satisfied if I use new_null_array in FakeNullBuilder::finish

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @klion26 -- this looks very neat. I left some comments, let me know what you think

cc @scovich

index
)
}
DataType::Time64(TimeUnit::Microsecond) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another possible type is DataType::Time64(TimeUnit::Nanosecond)

There is also a Time32 type in arrow: https://docs.rs/arrow/latest/arrow/datatypes/enum.DataType.html#variant.Time32

I am not sure if we want to handle that too

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only support the Time64(TimeUnit::Microsecond) here because that variant spec said that Time is TimeUnit::Microsecond.

Image

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supernit: perhaps we can move this branch to just below Date32 so that the order aligns the spec.

fn append_value(&mut self, v: ());
}

impl AppendValueTrait for NullBuilder {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The NullArray is kind of strange -- maybe rather than creating a full on builder for it, we could just keep count and then create the output array with https://docs.rs/arrow/latest/arrow/array/fn.new_null_array.html ?

datatypes::Date32Type::from_naive_date
);
impl_primitive_from_variant!(datatypes::Time64MicrosecondType, as_time_utc, |v| {
(v.num_seconds_from_midnight() * 1_000_000 + v.nanosecond() / 1_000) as i64
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it normal to take the floor instead of rounding?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is ok because nanosecond() will return u32, and the result will be truncated to zero.

generic_conversion_single_value!(
Decimal128Type,
as_primitive,
|v| VariantDecimal16::try_new(v, *s as u8).unwrap(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we sure these try_new calls will never fail?

I mean, we are unshredding shredded data, but it's also... data... which is not trustworthy.

For safety, suggest to do something like:

Suggested change
|v| VariantDecimal16::try_new(v, *s as u8).unwrap(),
|v| VariantDecimal16::try_new(v, *s as u8).map_or(Variant::Null, Variant::from),

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this suggestion, fixed.

Ah, found that we can pass in different types here(VariantDecimal16 vs Variant::Decimal16), it is cool.

Copy link
Contributor

@scovich scovich Oct 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it accepts any V: IntoVariant

(v / 1_000_000) as u32,
(v % 1_000_000) as u32 * 1000
)
.unwrap(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As above -- are we absolutely sure the conversion will never fail? Or do we need a Variant::Null fallback?

fn append_value(&mut self, v: ());
}

impl AppendValueTrait for NullBuilder {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, now that I'm actually looking at the code (and not just reacting to the comments) --

It seems like we don't actually need a NullBuilder at all? We know the capacity, so just use NullArray::new:

struct FakeNullBuilder(NullArray);

impl FakeNullBuilder {
    fn new(capacity: usize) -> Self {
        Self(NullArray::new(capacity)
    }
    fn append_value(&mut self, _: ()) {}
    fn append_null(&mut self) {}
    fn finish(self) -> NullArray {
        self.0
    }
}

... and then the macro invocation us just:

|capacity| -> FakeNullBuilder { FakeNullBuilder::new(capacity) },

define_variant_to_primitive_builder!(
struct VariantToNullArrowRowBuilder<'a>
|_capacity| -> NullBuilder { NullBuilder::new() },
| v | v.as_null(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: odd spacing there?

Suggested change
| v | v.as_null(),
|v| v.as_null(),

@klion26 klion26 force-pushed the primitive-null/time/decimal branch from 7658b17 to 78ebf7b Compare October 20, 2025 08:38
Copy link
Member Author

@klion26 klion26 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alamb @scovich Thanks for the review! I've updated the code. Please take another look.

CI failed because

Using default tag: latest
  Error response from daemon: Head "https://registry-1.docker.io/v2/amd64/rust/manifests/latest": received unexpected HTTP status: 503 Service Unavailable
  Warning: Docker pull failed with exit code 1, back off 1.328 seconds before retry.

datatypes::Date32Type::from_naive_date
);
impl_primitive_from_variant!(datatypes::Time64MicrosecondType, as_time_utc, |v| {
(v.num_seconds_from_midnight() * 1_000_000 + v.nanosecond() / 1_000) as i64
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is ok because nanosecond() will return u32, and the result will be truncated to zero.

generic_conversion_single_value!(
Decimal128Type,
as_primitive,
|v| VariantDecimal16::try_new(v, *s as u8).unwrap(),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this suggestion, fixed.

Ah, found that we can pass in different types here(VariantDecimal16 vs Variant::Decimal16), it is cool.

index
)
}
DataType::Time64(TimeUnit::Microsecond) => {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only support the Time64(TimeUnit::Microsecond) here because that variant spec said that Time is TimeUnit::Microsecond.

Image

fn append_value(&mut self, v: ());
}

impl AppendValueTrait for NullBuilder {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NullArray (with DataType::Null) is not the same as an array (of some arbitrary other type) full of nulls tho?
new_null_array with different types are different, the following unit test will fail

        let first = new_null_array(&DataType::Int32, 3);
        let second = new_null_array(&DataType::Timestamp(TimeUnit::Nanosecond, None), 3);

        assert_eq!(*first, second.clone());

Use NullArray instead of new_null_array in the changed code because the compiler will panic with the trait bound Result<Arc<dyn Array>, ArrowError>: Array is not satisfied if I use new_null_array in FakeNullBuilder::finish

Copy link
Contributor

@scovich scovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. One nit and one follow-up item to potentially track.

generic_conversion_single_value!(
Decimal128Type,
as_primitive,
|v| VariantDecimal16::try_new(v, *s as u8).map_or(Variant::Null, Variant::from),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need a follow-up item to track turning this into a proper Result instead of silently converting to NULL on failure? That would require VariantArray::value to return a Result<Variant>, which is a biggish change but honestly seems appropriate given that we can't guarantee the input shredding is even physically valid, let alone logically valid? It seems too sharp an edge to just panic.

@alamb -- thoughts?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed an issue #8672 to track this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need a follow-up item to track turning this into a proper Result instead of silently converting to NULL on failure?

I think this makes a lot of sense.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need a follow-up item to track turning this into a proper Result instead of silently converting to NULL on failure? That would require VariantArray::value to return a Result, which is a biggish change but honestly seems appropriate given that we can't guarantee the input shredding is even physically valid, let alone logically valid? It seems too sharp an edge to just panic.

@alamb -- thoughts?

How about we add a try_value() method that returns a Result and just have value() unwrap the result?

define_variant_to_primitive_builder!(
struct VariantToNullArrowRowBuilder<'a>
|capacity| -> FakeNullBuilder { FakeNullBuilder::new(capacity) },
|value| value.as_null(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we just hard-wire |_| Some(Variant::Null) here?
All input variant subtypes ultimately produce the same outcome because both append_null and append_value are no-ops.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more obvious: the result is always Variant::Null; updated the code.
As macro can't receive _ for ident (need change to pat and more), changed to _value

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

generic_conversion_single_value!(
Decimal128Type,
as_primitive,
|v| VariantDecimal16::try_new(v, *s as u8).map_or(Variant::Null, Variant::from),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may need a follow-up item to track turning this into a proper Result instead of silently converting to NULL on failure? That would require VariantArray::value to return a Result, which is a biggish change but honestly seems appropriate given that we can't guarantee the input shredding is even physically valid, let alone logically valid? It seems too sharp an edge to just panic.

@alamb -- thoughts?

How about we add a try_value() method that returns a Result and just have value() unwrap the result?

Copy link
Contributor

@liamzwbao liamzwbao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

index
)
}
DataType::Time64(TimeUnit::Microsecond) => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

supernit: perhaps we can move this branch to just below Date32 so that the order aligns the spec.

@alamb alamb merged commit a7572eb into apache:main Oct 22, 2025
17 checks passed
@alamb
Copy link
Contributor

alamb commented Oct 22, 2025

🚀 thanks @liamzwbao @friendlymatthew @klion26 and @scovich

@klion26 klion26 deleted the primitive-null/time/decimal branch October 23, 2025 13:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet-variant parquet-variant* crates

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Variant] Support variant to arrow primitive support null/time/decimal_*

5 participants