Conversation

@tonyalaribe
Contributor
Summary

  • Add proper Parquet Variant type support for semi-structured JSON columns using parquet-variant crates
  • Convert JSON columns (body, context, events, links, attributes, resource, errors) from Utf8 to Variant type
  • Create transparent wrapper UDFs so existing SQL queries continue to work unchanged

Changes

New Dependencies

  • parquet-variant v0.2.0 - Core Variant type
  • parquet-variant-compute v0.2.0 - JSON to Variant conversion
  • parquet-variant-json v0.2.0 - Variant to JSON conversion

New Module: variant_utils.rs

  • json_to_variant_array() - Convert JSON strings to Variant binary format
  • variant_to_json_array() - Convert Variant back to JSON for queries
  • Variant-aware wrapper UDFs (json_get, json_get_str, json_length, json_contains) that handle both Variant and UTF8 inputs

Schema Changes

  • schemas/otel_logs_and_spans.yaml - Changed 7 columns from Utf8 to Variant
  • schema_loader.rs - Added variant_arrow_type() and variant_delta_type() functions

Protocol Support (Prepared)

  • Added create_variant_protocol() function ready for when delta-rs adds variantType support
  • Currently disabled because delta-rs ProtocolChecker doesn't include variantType in supported features

Technical Notes

  • Variant data is stored as Struct<metadata: BinaryView, value: BinaryView> per Parquet Variant spec
  • The binary encoding is correct; only the protocol feature marker is missing until delta-rs adds support
  • All existing SQL queries work unchanged thanks to the transparent wrapper UDFs
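The transparent-wrapper idea can be illustrated with a dependency-free sketch. All names here (`ColumnKind`, `variant_to_json`, `json_get_impl`) are illustrative stand-ins for the actual Arrow types and UDF plumbing in `variant_utils.rs`, not the real APIs:

```rust
// Illustrative sketch of transparent-wrapper dispatch: normalize either
// input kind to JSON text, then delegate to the string-based UDF impl.
// ColumnKind and the helper fns are stand-ins, not arrow/datafusion APIs.

#[derive(Debug, PartialEq)]
enum ColumnKind {
    Utf8(String),              // raw JSON text
    Variant(Vec<u8>, Vec<u8>), // (metadata, value) binary pair per the spec
}

// Stand-in for the Variant -> JSON decoding done by parquet-variant-json.
fn variant_to_json(_metadata: &[u8], value: &[u8]) -> String {
    String::from_utf8_lossy(value).into_owned()
}

// Stand-in for the underlying string-based json_get implementation.
fn json_get_impl(json: &str, key: &str) -> Option<String> {
    // Naive extraction for illustration only: find "key":"value".
    let needle = format!("\"{}\":\"", key);
    let start = json.find(&needle)? + needle.len();
    let end = json[start..].find('"')? + start;
    Some(json[start..end].to_string())
}

// The wrapper accepts both kinds, so existing SQL keeps working.
fn json_get(col: &ColumnKind, key: &str) -> Option<String> {
    let json = match col {
        ColumnKind::Utf8(s) => s.clone(),
        ColumnKind::Variant(meta, val) => variant_to_json(meta, val),
    };
    json_get_impl(&json, key)
}

fn main() {
    let utf8 = ColumnKind::Utf8(r#"{"level":"error"}"#.to_string());
    let variant = ColumnKind::Variant(vec![], br#"{"level":"warn"}"#.to_vec());
    println!("{:?}", json_get(&utf8, "level"));    // Some("error")
    println!("{:?}", json_get(&variant, "level")); // Some("warn")
}
```

The real wrappers do the same normalize-then-delegate step, only over Arrow arrays rather than single values.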

Test plan

  • Unit tests for variant_utils roundtrip conversion pass
  • batch_queue tests pass with Variant columns
  • Build compiles without warnings
  • Integration tests with MinIO (not yet run; requires a running MinIO instance)

- Add parquet-variant dependencies (parquet-variant, parquet-variant-compute, parquet-variant-json) for proper Variant binary encoding per Parquet spec
- Convert JSON columns (body, context, events, links, attributes, resource, errors) from Utf8 to Variant type in schema
- Create variant_utils.rs with:
  - JSON to Variant conversion using parquet-variant-compute
  - Variant to JSON conversion for query results
  - Variant-aware wrapper UDFs (json_get, json_get_str, json_length, json_contains) that transparently handle both Variant and UTF8 inputs
- Update schema_loader.rs:
  - Add variant_arrow_type() using BinaryView fields
  - Add variant_delta_type() using delta-kernel's unshredded_variant()
  - Add has_variant_columns() helper method
- Update test_utils.rs to convert JSON columns to Variant on insert
- Prepare Protocol with variantType feature (ready for when delta-rs adds support)

Note: delta-rs ProtocolChecker doesn't yet support variantType feature, so Variant data is stored as Struct<metadata: BinaryView, value: BinaryView> without the protocol marker. The binary representation is correct per Parquet Variant spec.
@claude
claude bot commented Dec 29, 2025

Pull Request Review: Add Variant type support for semi-structured JSON columns

Summary

This PR adds proper Parquet Variant type support for semi-structured JSON columns, which is a significant improvement for handling JSON data efficiently. The implementation is well-structured and follows best practices overall.

✅ Strengths

Code Quality

  • Excellent documentation: Clear module-level docs and inline comments explaining design decisions
  • Clean separation of concerns: New variant_utils.rs module is well-organized with logical grouping
  • Proper error handling: Comprehensive error messages with context
  • Good test coverage: Unit tests cover core functionality (roundtrip conversion, type checking)
  • Transparent wrapper pattern: The UDF wrappers elegantly handle both Variant and UTF8 inputs

Architecture

  • Future-proof design: Prepared for delta-rs variantType protocol support with clear documentation about current limitations
  • Backward compatibility: Existing SQL queries continue to work unchanged thanks to wrapper UDFs
  • Proper spec compliance: Uses official parquet-variant crates for correct binary encoding

🔍 Issues & Recommendations

1. Potential Performance Concern - Cloning in json_to_variant_array (src/variant_utils.rs:54)

let array_ref: ArrayRef = Arc::new(json_array.clone());

Issue: Cloning the entire StringArray before conversion could impact performance for large batches.
Recommendation: Consider whether the clone is necessary, or whether an &ArrayRef can be passed to json_to_variant() directly.

2. Memory Allocation Pattern (src/variant_utils.rs:129-130)

let mut new_columns: Vec<ArrayRef> = Vec::with_capacity(batch.num_columns());
let mut new_fields: Vec<Arc<Field>> = Vec::with_capacity(batch.num_columns());

Good: Pre-allocating with capacity is excellent for performance.
Minor suggestion: Consider adding a comment explaining this optimization for maintainability.

3. Error Messages Could Be More Specific (src/variant_utils.rs:85-87)

let metadata = get_bytes(metadata_col, i)
    .ok_or_else(|| DataFusionError::Execution("Missing metadata bytes".to_string()))?;

Issue: Error doesn't indicate which row index failed.
Recommendation: Include row index in error message:

.ok_or_else(|| DataFusionError::Execution(format!("Missing metadata bytes at row {}", i)))?;

4. Potential Edge Case in UDF Argument Validation

All UDFs validate argument count but not argument types. DataFusion's signature checking may catch mismatches, but explicit validation would produce clearer error messages.
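Explicit validation could look like the following std-only sketch; `ArgKind` and the `String` error are illustrative stand-ins for arrow's `DataType` and `DataFusionError`:

```rust
// Illustrative argument validation for a json_get-style UDF.
// ArgKind stands in for arrow's DataType; String for DataFusionError::Plan.

#[derive(Debug, PartialEq, Clone, Copy)]
enum ArgKind {
    Utf8,
    VariantStruct, // Struct<metadata: BinaryView, value: BinaryView>
    Int64,
}

fn validate_json_get_args(args: &[ArgKind]) -> Result<(), String> {
    if args.len() != 2 {
        return Err(format!("json_get expects 2 arguments, got {}", args.len()));
    }
    match args[0] {
        ArgKind::Utf8 | ArgKind::VariantStruct => {}
        other => {
            return Err(format!(
                "json_get: first argument must be Utf8 or Variant, got {:?}",
                other
            ))
        }
    }
    if args[1] != ArgKind::Utf8 {
        return Err(format!("json_get: key must be Utf8, got {:?}", args[1]));
    }
    Ok(())
}

fn main() {
    assert!(validate_json_get_args(&[ArgKind::VariantStruct, ArgKind::Utf8]).is_ok());
    let err = validate_json_get_args(&[ArgKind::Int64, ArgKind::Utf8]).unwrap_err();
    println!("{}", err);
}
```

In the real UDFs this check would live in `return_type`/`invoke`, where the actual `DataType` of each argument is available.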

5. Test Coverage Gaps

Missing tests for:

  • Empty arrays/batches
  • Very large JSON documents (performance regression testing)
  • Malformed JSON handling
  • UDF behavior with null inputs
  • UDF behavior when mixing Variant and UTF8 inputs
  • convert_batch_json_to_variant function
  • Error conditions (e.g., invalid Variant structures)

Recommendation: Add integration tests that verify end-to-end behavior with actual Delta table writes/reads.

6. Commented-out Protocol Configuration (src/database.rs:1017)

// Note: v2 checkpoint policy requires v2Checkpoint feature which delta-rs doesn't support yet
// config.insert("delta.checkpointPolicy".to_string(), Some("v2".to_string()));

Issue: Commented code can cause confusion.
Recommendation: Move this to a GitHub issue or document in a separate TODO file, then remove the commented code.

7. Code Duplication in UDF Implementations

The four UDF implementations (JsonGetUDF, JsonGetStrUDF, etc.) follow an identical pattern with only minor variations. Consider:

  • A macro to reduce boilerplate
  • A generic wrapper struct that takes the underlying UDF as a parameter

Example approach:

struct VariantAwareUDF {
    name: &'static str,
    underlying: ScalarUDF,
    // ...
}
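The macro route can be sketched without any DataFusion dependency; here a plain `fn` pointer stands in for the real `ScalarUDF`, and the struct/field names are illustrative:

```rust
// Sketch: stamp out one wrapper struct per UDF with a declarative macro.
// In the real code `underlying` would be a datafusion ScalarUDF; a plain
// fn pointer stands in here so the pattern compiles on its own.

macro_rules! variant_aware_udf {
    ($struct_name:ident, $udf_name:expr) => {
        struct $struct_name {
            name: &'static str,
            underlying: fn(&str, &str) -> Option<String>,
        }

        impl $struct_name {
            fn new(underlying: fn(&str, &str) -> Option<String>) -> Self {
                Self { name: $udf_name, underlying }
            }

            // Delegate to the underlying implementation (the Variant ->
            // JSON normalization step is elided in this sketch).
            fn invoke(&self, json: &str, key: &str) -> Option<String> {
                (self.underlying)(json, key)
            }
        }
    };
}

variant_aware_udf!(JsonGetUdf, "json_get");
variant_aware_udf!(JsonGetStrUdf, "json_get_str");

fn naive_get(_json: &str, _key: &str) -> Option<String> {
    None // stand-in for the real implementation
}

fn main() {
    let udf = JsonGetUdf::new(naive_get);
    println!("{} -> {:?}", udf.name, udf.invoke("{}", "k"));
}
```

Either the macro or the generic struct would collapse the four near-identical implementations into one definition plus four one-line instantiations.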

8. Schema Migration Strategy

Question: How will existing tables with UTF8 JSON columns migrate to Variant?
Recommendation: Document the migration path or add tooling to help users upgrade existing data.

9. Hardcoded Column Names (src/variant_utils.rs:21)

pub const VARIANT_COLUMNS: &[&str] = &["body", "context", "events", "links", "attributes", "resource", "errors"];

Issue: Hardcoded list creates tight coupling between variant_utils and schema definition.
Recommendation: Consider deriving this from schema metadata or making it configurable.
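Deriving the list from field metadata could look like this sketch; the local `Field` struct and the `"logical_type" = "variant"` marker are hypothetical stand-ins for arrow's `Field` metadata and whatever marker the schema YAML would carry:

```rust
use std::collections::HashMap;

// Sketch: derive the variant column list from per-field metadata instead
// of a hardcoded const. Field and the "logical_type" key are illustrative
// stand-ins, not the actual schema_loader.rs types.

struct Field {
    name: String,
    metadata: HashMap<String, String>,
}

fn variant_columns(schema: &[Field]) -> Vec<&str> {
    schema
        .iter()
        .filter(|f| f.metadata.get("logical_type").map(String::as_str) == Some("variant"))
        .map(|f| f.name.as_str())
        .collect()
}

fn main() {
    let mk = |name: &str, variant: bool| Field {
        name: name.to_string(),
        metadata: if variant {
            HashMap::from([("logical_type".to_string(), "variant".to_string())])
        } else {
            HashMap::new()
        },
    };
    let schema = vec![mk("timestamp", false), mk("body", true), mk("attributes", true)];
    println!("{:?}", variant_columns(&schema)); // ["body", "attributes"]
}
```

This keeps the schema YAML as the single source of truth, so adding a variant column no longer requires touching variant_utils.rs.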

10. Missing Benchmarks

Given this changes data storage format, performance benchmarks would be valuable:

  • JSON → Variant conversion overhead
  • Query performance comparison (UTF8 vs Variant)
  • Storage size comparison
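Until criterion-style benchmarks exist, a minimal std-only timing harness can give rough numbers; `convert_stub` here is a placeholder for the real `json_to_variant_array` conversion, not its actual implementation:

```rust
use std::time::Instant;

// Rough micro-benchmark sketch using only std. `convert_stub` is a
// placeholder for the real JSON -> Variant conversion under test.

fn convert_stub(json: &str) -> usize {
    // Stand-in "work": byte length of the would-be encoding.
    json.len()
}

fn bench<F: Fn() -> usize>(label: &str, iters: u32, f: F) -> u128 {
    let start = Instant::now();
    let mut sink = 0usize; // accumulate results so the work isn't optimized away
    for _ in 0..iters {
        sink = sink.wrapping_add(f());
    }
    let elapsed = start.elapsed().as_micros();
    println!("{label}: {iters} iters in {elapsed} us (sink={sink})");
    elapsed
}

fn main() {
    let doc = r#"{"attributes":{"service":"api","retries":3}}"#;
    bench("json->variant (stub)", 10_000, || convert_stub(doc));
}
```

A proper criterion benchmark with the real conversion would supersede this, but even a harness like this catches order-of-magnitude regressions.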

🔒 Security Considerations

  • ✅ No obvious security vulnerabilities
  • ✅ Input validation appears adequate
  • ✅ No SQL injection vectors introduced
  • ⚠️ Consideration: Very large or deeply nested JSON could potentially cause memory issues. Consider adding size limits or documenting maximum recommended JSON complexity.
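A pre-ingestion depth guard is cheap to add; this sketch scans the raw text before conversion, and the function names and the limit value are illustrative:

```rust
// Sketch of a pre-ingestion guard against pathologically deep JSON.
// Would run before json_to_variant conversion; names and limits are
// illustrative, not part of the PR.

fn max_nesting_depth(json: &str) -> usize {
    let mut depth = 0usize;
    let mut max = 0usize;
    let mut in_string = false;
    let mut escaped = false;
    for b in json.bytes() {
        if in_string {
            // Ignore braces/brackets inside string literals.
            if escaped {
                escaped = false;
            } else if b == b'\\' {
                escaped = true;
            } else if b == b'"' {
                in_string = false;
            }
            continue;
        }
        match b {
            b'"' => in_string = true,
            b'{' | b'[' => {
                depth += 1;
                max = max.max(depth);
            }
            b'}' | b']' => depth = depth.saturating_sub(1),
            _ => {}
        }
    }
    max
}

fn check_json_complexity(json: &str, max_depth: usize) -> Result<(), String> {
    let depth = max_nesting_depth(json);
    if depth > max_depth {
        return Err(format!("JSON nesting depth {depth} exceeds limit {max_depth}"));
    }
    Ok(())
}

fn main() {
    assert_eq!(max_nesting_depth(r#"{"a":[{"b":1}]}"#), 3);
    assert!(check_json_complexity(r#"{"a":1}"#, 64).is_ok());
    println!("depth checks passed");
}
```

A byte-length cap on the raw JSON would pair naturally with this depth check.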

📊 Performance Considerations

Positive

  • ✅ Binary encoding should reduce storage size
  • ✅ Pre-allocation of vectors prevents reallocation
  • ✅ Leverages official parquet-variant implementation

Concerns

  • ⚠️ Conversion overhead during ingestion (JSON → Variant → storage)
  • ⚠️ Query-time conversion overhead (Variant → JSON for UDFs)
  • ⚠️ The wrapper UDF pattern adds an extra function call layer

Recommendation: Add performance benchmarks comparing:

  1. Insert throughput (UTF8 vs Variant)
  2. Query performance for JSON operations
  3. Storage efficiency gains

🧪 Test Plan Verification

From the PR description:

  • ✅ Unit tests for variant_utils roundtrip conversion pass
  • ✅ batch_queue tests pass with Variant columns
  • ✅ Build compiles without warnings
  • ❌ Integration tests with MinIO not yet run (requires a running MinIO instance)

Recommendation: The integration tests should be run before merge. Consider adding them to the CI pipeline.

📝 Documentation

Missing documentation:

  1. Migration guide for existing deployments
  2. Performance characteristics and trade-offs
  3. When to use Variant vs UTF8 (decision matrix)
  4. Troubleshooting guide for common issues
  5. Update README.md with Variant type examples

🎯 Action Items Before Merge

High Priority:

  1. Run integration tests with MinIO
  2. Add tests for error conditions and edge cases
  3. Verify performance characteristics meet expectations
  4. Document migration path for existing data

Medium Priority:

  5. Improve error messages with row indices
  6. Add benchmarks for performance comparison
  7. Consider reducing UDF boilerplate with macros/generics
  8. Add an integration test to CI

Low Priority (can be follow-up PRs):

  9. Add comprehensive documentation
  10. Make VARIANT_COLUMNS configurable
  11. Add size limits for JSON documents

🎬 Conclusion

This is a well-implemented feature that adds valuable functionality. The code quality is high, and the design shows good foresight (preparing for delta-rs protocol support). The main concerns are around test coverage and performance validation.

Recommendation: ✅ Approve with suggestions - Address the high-priority items, and consider the others for follow-up work.

Great work on implementing proper Variant support! The transparent wrapper pattern is particularly elegant. 🎉
