Add Variant type support for semi-structured JSON columns #13
base: master
- Add parquet-variant dependencies (parquet-variant, parquet-variant-compute, parquet-variant-json) for proper Variant binary encoding per Parquet spec
- Convert JSON columns (body, context, events, links, attributes, resource, errors) from Utf8 to Variant type in schema
- Create variant_utils.rs with:
  - JSON to Variant conversion using parquet-variant-compute
  - Variant to JSON conversion for query results
  - Variant-aware wrapper UDFs (json_get, json_get_str, json_length, json_contains) that transparently handle both Variant and UTF8 inputs
- Update schema_loader.rs:
  - Add variant_arrow_type() using BinaryView fields
  - Add variant_delta_type() using delta-kernel's unshredded_variant()
  - Add has_variant_columns() helper method
- Update test_utils.rs to convert JSON columns to Variant on insert
- Prepare Protocol with variantType feature (ready for when delta-rs adds support)

Note: delta-rs ProtocolChecker doesn't yet support the variantType feature, so Variant data is stored as Struct<metadata: BinaryView, value: BinaryView> without the protocol marker. The binary representation is correct per the Parquet Variant spec.
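To make the `metadata` / `value` split concrete, here is a minimal std-only sketch of how two very simple values could be laid out as Variant bytes. It reflects my reading of the Parquet Variant encoding (one header byte with the basic type in the low two bits) and is not the PR's code, which delegates encoding to parquet-variant-compute. The names `VariantBytes`, `empty_metadata`, `encode_short_string`, and `encode_bool` are hypothetical and exist only for illustration.

```rust
/// Hypothetical holder mirroring the two fields of the storage struct
/// `Struct<metadata, value>`; illustrative only, not a type from the PR.
#[derive(Debug, PartialEq)]
struct VariantBytes {
    metadata: Vec<u8>,
    value: Vec<u8>,
}

/// Metadata for a value that references no dictionary keys (assumed layout):
/// header byte with version 1, dictionary size 0, and a single offset of 0.
fn empty_metadata() -> Vec<u8> {
    vec![0x01, 0x00, 0x00]
}

/// Encode a string of at most 63 bytes as a Variant "short string".
/// Returns `None` for longer strings, which need a different encoding.
fn encode_short_string(s: &str) -> Option<VariantBytes> {
    let len = s.len();
    if len > 63 {
        return None;
    }
    // Header: length in the upper six bits, basic type 1 (short string) in the lower two.
    let header = ((len as u8) << 2) | 0b01;
    let mut value = Vec::with_capacity(1 + len);
    value.push(header);
    value.extend_from_slice(s.as_bytes());
    Some(VariantBytes { metadata: empty_metadata(), value })
}

/// Encode a boolean as a Variant primitive (basic type 0).
fn encode_bool(b: bool) -> VariantBytes {
    // Assumed primitive type ids: 1 = true, 2 = false, stored in the upper six bits.
    let type_id: u8 = if b { 1 } else { 2 };
    VariantBytes { metadata: empty_metadata(), value: vec![type_id << 2] }
}

fn main() {
    let s = encode_short_string("ok").expect("fits in a short string");
    let t = encode_bool(true);
    println!("metadata={:02x?} value={:02x?}", s.metadata, s.value);
    println!("metadata={:02x?} value={:02x?}", t.metadata, t.value);
}
```

In the stored column, each of these byte vectors would occupy one of the two BinaryView children of the struct. Objects and arrays also populate the metadata dictionary with field names, which this sketch does not attempt.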
Pull Request Review: Add Variant type support for semi-structured JSON columns

Summary

This PR adds proper Parquet Variant type support for semi-structured JSON columns, which is a significant improvement for handling JSON data efficiently. The implementation is well-structured and follows best practices overall.

✅ Strengths

Code Quality

Architecture

🔍 Issues & Recommendations

1. Potential Performance Concern - Cloning in
Summary

- Add `parquet-variant` crates
- Convert JSON columns (`body`, `context`, `events`, `links`, `attributes`, `resource`, `errors`) from `Utf8` to `Variant` type

Changes

New Dependencies

- `parquet-variant` v0.2.0 - Core Variant type
- `parquet-variant-compute` v0.2.0 - JSON to Variant conversion
- `parquet-variant-json` v0.2.0 - Variant to JSON conversion

New Module: `variant_utils.rs`

- `json_to_variant_array()` - Convert JSON strings to Variant binary format
- `variant_to_json_array()` - Convert Variant back to JSON for queries
- Variant-aware wrapper UDFs (`json_get`, `json_get_str`, `json_length`, `json_contains`) that handle both Variant and UTF8 inputs

Schema Changes

- `schemas/otel_logs_and_spans.yaml` - Changed 7 columns from `Utf8` to `Variant`
- `schema_loader.rs` - Added `variant_arrow_type()` and `variant_delta_type()` functions

Protocol Support (Prepared)

- `create_variant_protocol()` function ready for when delta-rs adds variantType support
- delta-rs `ProtocolChecker` doesn't include `variantType` in supported features

Technical Notes

- Variant data is stored as `Struct<metadata: BinaryView, value: BinaryView>` per Parquet Variant spec

Test plan