-
Notifications
You must be signed in to change notification settings - Fork 6
Add Variant type support and fix SLT tests #15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
- Add Parquet Variant binary encoding support via datafusion-variant crate - Implement VariantAwareExprPlanner for -> and ->> operators on Variant columns - Add jsonb_path_exists UDF for JSONPath queries on Variant/JSON columns - Register variant functions: json_to_variant, variant_to_json, variant_get, etc. - Add is_variant_type helper in schema_loader for Variant type detection - Update schema with Variant columns: context, events, links, attributes, resource - Fix time_bucket UDF to handle Utf8/Utf8View/LargeUtf8 string types - Fix SLT tests: correct status_message type (string not array) - Fix json_functions.slt to use json_to_variant for Variant column inserts - Add variant_functions.slt tests for round-trip, path extraction, arrow operators
Pull Request Review: Variant Type SupportSummaryThis PR adds comprehensive Parquet Variant binary encoding support to TimeFusion, enabling efficient storage and querying of semi-structured JSON data. The implementation is well-structured with extensive test coverage. ✅ StrengthsCode Quality
Design Decisions
|
- Replace local path dependencies with git dependencies: - deltalake: Use fork with VariantType support (tonyalaribe/delta-rs) - datafusion-variant: Use git dependency with specific rev - Add warning log when schema has more fields than batch columns (helps debug schema evolution issues)
Code Review for PR #15: Add Variant type support and fix SLT testsSummaryThis PR adds comprehensive Parquet Variant binary encoding support to TimeFusion. The implementation is well-structured with good test coverage. Overall quality is high, but there are a few areas that need attention. 🟢 Strengths1. Excellent Test Coverage
2. Clean Architecture
3. Proper PostgreSQL Compatibility
4. Documentation
🟡 Issues & RecommendationsHIGH PRIORITY1. Error Handling in
|
PR Review: Add Variant Type Support and Fix SLT TestsSummaryThis PR adds comprehensive Parquet Variant binary encoding support to TimeFusion, enabling efficient storage and querying of semi-structured JSON data. The implementation is well-architected and includes thorough testing. ✅ Strengths1. Architecture & Design
2. Code Quality
3. Test Coverage
🔍 Issues & Concerns🔴 Critical Issues1. Unvalidated Dependency Source (src/database.rs:518)convert_variant_columns(batch, &target_schema)?Issue: Using forked dependencies ( Risk:
Recommendation:
2. JSON Parsing Error Handling Swallows Data Silently (src/database.rs:583)if let Err(e) = builder.append_json(json_str) {
warn!("Failed to parse JSON '{}': {}, inserting as null", json_str, e);
builder.append_null();
}Issue: Invalid JSON is silently converted to NULL with only a warning log. Risk:
Recommendation: // Option 1: Fail fast (preferred for data integrity)
builder.append_json(json_str)
.map_err(|e| DataFusionError::Execution(
format!("Invalid JSON in Variant column: {}", e)
))?;
// Option 2: Add metrics/counter for monitoring
if let Err(e) = builder.append_json(json_str) {
VARIANT_PARSE_ERRORS.inc(); // Prometheus counter
error!("Failed to parse JSON: {}", e); // Upgrade to error
builder.append_null();
}🟡 Major Issues3. Unbounded Memory Allocation (src/functions.rs:1029)let mut builder = BooleanArray::builder(struct_array.len());
for i in 0..struct_array.len() {
// ...decode entire Variant to JSON in memory
let json_value = variant_to_serde_json(&variant);
}Issue: Large Variant columns could cause OOM when converting all rows to Recommendation:
4. Arrow Version Bump Without Changelog (Cargo.lock:9-10)-version = "57.1.0"
+version = "57.2.0"Issue: Arrow updated from 57.1.0 → 57.2.0 across 12 crates, but no mention in PR description. Recommendation:
5. Base64 Encoding for Binary Data (src/functions.rs:984)Variant::Binary(bytes) => json!(base64::engine::general_purpose::STANDARD.encode(bytes))Issue: Binary data in Variants is Base64 encoded when converting to JSON, but there's no corresponding decode path. Risk: Round-trip inconsistency if users insert Base64 strings expecting binary storage. Recommendation:
🟢 Minor Issues6. Inconsistent String Type Handling (src/functions.rs:871-873)datafusion::scalar::ScalarValue::Utf8(Some(s))
| datafusion::scalar::ScalarValue::Utf8View(Some(s))
| datafusion::scalar::ScalarValue::LargeUtf8(Some(s)) => s.clone()Good: Handles all string variants, but repeated in 3+ places. Recommendation: Extract to helper function: fn extract_string_value(scalar: &ScalarValue) -> Option<String> {
match scalar {
ScalarValue::Utf8(Some(s))
| ScalarValue::Utf8View(Some(s))
| ScalarValue::LargeUtf8(Some(s)) => Some(s.clone()),
_ => None
}
}7. Missing Index Bounds Check (src/database.rs:530-537)if idx >= columns.len() {
warn!("Schema mismatch: ...");
continue;
}Good: Defensive check exists, but warns and continues instead of failing. Recommendation: This should be an error - schema mismatch indicates a programming bug, not a runtime condition. 8. Overly Broad Variant Type Check (src/functions.rs:773-792)fn is_variant_column(expr: &Expr, schema: &DFSchema) -> bool {
// Checks many function names...
matches!(name, "json_to_variant" | "variant_get" | ...)
}Issue: Hardcoded function name list could get out of sync. Recommendation: Check return type's data structure instead: expr.get_type(schema)
.map(|dt| is_variant_type(&dt))
.unwrap_or(false)(Already done as fallback, consider making primary approach) ⚡ Performance ConsiderationsPositive
Concerns
🔒 Security Assessment✅ Good Practices
|
- Replace .unwrap() with proper error handling in convert_variant_columns - json_strings_to_variant now fails fast on invalid JSON instead of silently inserting NULL - Upgrade schema mismatch logging from warn to error level - Expand registration order comment for clarity - Add scalar_to_string() helper to DRY string extraction from ScalarValue
PR Review: Add Variant Type Support and Fix SLT TestsSummaryThis PR adds comprehensive Variant type support to TimeFusion, enabling efficient storage and querying of JSON data using the Parquet Variant binary encoding format. The implementation is well-structured with excellent test coverage. ✅ Strengths1. Excellent Architecture & Design
2. Robust Implementation
3. Comprehensive Testing
4. Dependencies
🔍 Code Quality ObservationsPositive
Minor Concerns1. Schema Mismatch Handling (src/database.rs:99-107)if idx >= columns.len() {
error!("Schema mismatch: target expects '{}' at index {} but batch has only {} columns", ...);
continue; // Silently skips the column
}Issue: Schema mismatches are logged but processing continues. This could lead to data inconsistencies. 2. JSONPath Error Handling (src/functions.rs:1103-1106)let result = match serde_json::from_str::<JsonValue>(json_str) {
Ok(json_value) => !json_path.query(&json_value).is_empty(),
Err(_) => false, // Invalid JSON returns false
};Issue: Invalid JSON silently returns 3. Path Component Extraction (src/functions.rs:766-776)The function handles many integer types (Int32, Int64, UInt32, UInt64) but could potentially overflow when casting larger unsigned values to 🚀 Performance ConsiderationsExcellent
Question
🔒 Security AssessmentGood Practices
Considerations
📊 Test CoverageExcellent Coverage ✅
The test coverage is comprehensive and demonstrates real-world usage patterns. 📝 Documentation & Code StylePositive
Suggestions
🎯 RecommendationsHigh Priority
Medium Priority
Low Priority
✅ Approval RecommendationThis is a well-implemented feature with:
The concerns raised are mostly about edge cases and operational considerations rather than fundamental issues. Recommendation: APPROVE ✅ The PR is ready to merge after addressing the schema mismatch logging concern (consider making it more visible or actionable). Great work on this feature! The Variant type support will significantly improve performance for JSON-heavy workloads while maintaining PostgreSQL compatibility. Review generated by Claude Code |
Pull Request Review: Add Variant type support and fix SLT testsSummaryThis PR adds comprehensive Parquet Variant binary encoding support to TimeFusion, enabling efficient storage and querying of semi-structured JSON data. The implementation is well-architected with proper integration into DataFusion's type system. ✅ StrengthsArchitecture & Design
Code Quality
Testing
|
- Add MAX_VARIANT_DEPTH (100) limit to prevent JSON bomb attacks via deeply nested Variant data - Remove noisy error log for expected schema evolution case in convert_variant_columns - Simplify ShortString handling to use to_string() directly
PR Review: Add Variant Type Support and Fix SLT TestsSummaryThis PR introduces Parquet Variant binary encoding support via the ✅ Strengths1. Excellent Architecture & Design
2. Robust Error Handling
3. Good Code Quality
4. Integration & Testing
🔍 Issues & Concerns1. Security ConcernsCritical: Stack Overflow Risk in Recursive ConversionLocation: While there's a Variant::Object(obj) => {
let mut map = serde_json::Map::new();
for (key, value) in obj.iter() {
map.insert(key.to_string(), variant_to_serde_json(&value, depth + 1)?); // Recursive call
}
JsonValue::Object(map)
}Recommendations:
Memory Concerns in
|
Pull Request Review: Add Variant Type SupportSummaryThis PR adds comprehensive support for the Parquet Variant binary encoding format to TimeFusion, enabling efficient storage and querying of semi-structured JSON data. The implementation is well-architected with proper error handling, extensive test coverage, and thoughtful integration with existing systems. ✅ Strengths1. Code Quality & Architecture
2. Error Handling
3. Security
4. Test Coverage
5. Performance
🔍 Critical Issues1. Memory Exhaustion via Wide Objects (HIGH SEVERITY)Location: functions.rs:1035-1042 variant_to_serde_json() limits depth but NOT width. An attacker could send: Recommendation: Add MAX_VARIANT_WIDTH limit (e.g., 10000 fields/items) 2. Delta-rs Fork Dependency (HIGH LONG-TERM RISK)Location: Cargo.toml:428 Using personal fork blocks security patches and upstream improvements. Recommendation: Create upstream PR to delta-rs for Variant support 3. Stack Overflow Risk (MEDIUM)Location: functions.rs:1004-1047 Direct recursion with depth=100 could still overflow on some platforms. Recommendation: Use iterative conversion with explicit stack 🔧 Major Issues4. Missing Column Validation (MEDIUM)Location: database.rs:532 Skips conversion when idx >= columns.len() without validating nullability. Recommendation: Verify skipped columns are nullable or have defaults 5. Hot Path Performance (MEDIUM)Location: database.rs:1988-1990 Variant conversion on every insert could bottleneck high-throughput workloads. Recommendation: Cache schema analysis results 6. JSONPath Complexity (MEDIUM)Location: functions.rs:986 No limits on JSONPath expression complexity (potential ReDoS/CPU exhaustion). Recommendation: Add complexity limits or evaluation timeout 📝 Minor Issues
📊 Final AssessmentOverall Score: 8.5/10 Recommendation: Approve with Minor Changes Must Fix Before Merge:
Should Fix Soon:
This is high-quality work that adds significant value to TimeFusion. The critical issues are defensive improvements rather than showstopper bugs. Great job on the comprehensive test coverage and clean architecture! 🎉 |
Summary
datafusion-variantcrateVariantAwareExprPlannerfor->and->>operators on Variant columnsjsonb_path_existsUDF for JSONPath queries on Variant/JSON columnsjson_to_variant,variant_to_json,variant_get, etc.context,events,links,attributes,resourcetime_bucketUDF to handleUtf8/Utf8View/LargeUtf8string typesstatus_messagetype (string not array)variant_functions.slttests for round-trip, path extraction, arrow operatorsTest plan
🤖 Generated with Claude Code