-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Kernel] Extended schema JSON serde to support collations #3628
[Kernel] Extended schema JSON serde to support collations #3628
Conversation
@vkorukanti can you please review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great!
Add bit more comments on how the collation property is stored. You can look at the method docs in ColumnMapping
(where similar nested level field ids are stored in metadata) for an example docs.
|
||
import java.util.Optional; | ||
|
||
public class CollationIdentifier { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Javadoc, @since
version and @evolving
tag
import java.util.Optional; | ||
|
||
public class CollationIdentifier { | ||
public static final String PROVIDER_SPARK = "SPARK"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do these constant mean? add some comment?
if (parts.length == 1) { | ||
throw new IllegalArgumentException( | ||
String.format("Invalid collation identifier: %s", identifier)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkArgument(parts.length != 1, String.format("Invalid collation identifier: %s", identifier));
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or switch(parts.length) {
case 2:
case 3:
default: throw error
}
return String.format("%s.%s", provider, name); | ||
} | ||
|
||
public static CollationIdentifier fromString(String identifier) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All public APIs are going to show up in API docs. Please add proper javadoc.
this.provider = provider; | ||
this.name = name; | ||
this.version = version; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
null checks. Objects.requireNonNull
@@ -16,6 +16,7 @@ | |||
package io.delta.kernel.types; | |||
|
|||
import io.delta.kernel.annotation.Evolving; | |||
import io.delta.kernel.expressions.CollationIdentifier; | |||
|
|||
/** | |||
* The data type representing {@code string} type values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update the doc to include collation info.
@@ -130,6 +132,14 @@ class DataTypeJsonSerDeSuite extends AnyFunSuite { | |||
} | |||
} | |||
|
|||
test("parseDataType: types with collated strings") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add some negative tests which cause the parser to fail.
kernel/kernel-api/src/test/scala/io/delta/kernel/internal/types/DataTypeJsonSerDeSuite.scala
Show resolved
Hide resolved
|
||
import org.scalatest.funsuite.AnyFunSuite | ||
|
||
class StructTypeSuite extends AnyFunSuite { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I see. Why not just add the tests in DataTypeJsonSerDeSuite
itself?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, you can roundtrip and test both serialize and deserialize works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right. thanks
@@ -89,7 +89,8 @@ public static String serializeDataType(DataType dataType) { | |||
*/ | |||
public static StructType deserializeStructType(String structTypeJson) { | |||
try { | |||
DataType parsedType = parseDataType(OBJECT_MAPPER.reader().readTree(structTypeJson)); | |||
DataType parsedType = | |||
parseDataType(OBJECT_MAPPER.reader().readTree(structTypeJson), "", new HashMap<>()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
parseDataType(OBJECT_MAPPER.reader().readTree(structTypeJson), "", new HashMap<>()); | |
parseDataType( | |
OBJECT_MAPPER.reader().readTree(structTypeJson), | |
"" /* fieldPath */, | |
new HashMap<>() /* collationMap*/); |
@@ -131,22 +132,29 @@ public static StructType deserializeStructType(String structTypeJson) { | |||
* "metadata" : { } | |||
* } | |||
* </pre> | |||
* | |||
* @param fieldPath Path from the nearest ancestor that is of the {@link StructField} type. | |||
* @param collationMap Maps the path of a {@link StringType} to its collation. Only maps non-UTF8_BINARY collated {@link StringType}. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mention why it is needed and how it used. Basically for the element types of map or array, have no fieldMetadata -> can't contain the collation -> Use the nearest structfield which has the field metadata to store the collation for map/array elements. Mention lookup key (path) format.
@@ -0,0 +1,43 @@ | |||
package io.delta.kernel.types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add header? this didn't fail the CI job?
|
||
class StringTypeSuite extends AnyFunSuite { | ||
test("check equals") { | ||
Seq( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seq( | |
// Testcase: (instance1, instance2, expected value for `instance1 == instance2`) | |
Seq( |
*/ | ||
static DataType parseDataType(JsonNode json) { | ||
static DataType parseDataType( | ||
JsonNode json, String fieldPath, HashMap<String, String> collationMap) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can just use Map
in arguments def instead of HashMap
. Same in other places.
private static HashMap<String, String> getCollationsMap(JsonNode fieldMetadata) { | ||
if (fieldMetadata == null || !fieldMetadata.has(DataType.COLLATIONS_METADATA_KEY)) { | ||
return new HashMap<>(); | ||
} | ||
HashMap<String, String> collationsMap = new HashMap<>(); | ||
FieldMetadata collationFieldMetadata = | ||
parseFieldMetadata(fieldMetadata.get(DataType.COLLATIONS_METADATA_KEY)); | ||
for (Map.Entry<String, Object> collationField : | ||
collationFieldMetadata.getEntries().entrySet()) { | ||
String fieldPath = collationField.getKey(); | ||
Object collationName = collationField.getValue(); | ||
if (!(collationName instanceof String)) { | ||
throw new IllegalArgumentException( | ||
String.format("Invalid collation name: %s.", collationName)); | ||
} | ||
collationsMap.put(fieldPath, (String) collationName); | ||
} | ||
return collationsMap; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see a need for this. Basically instead of creating the Map, why not just use the fieldMetadata
? And the fieldMetadata
has a getString
which already has type checks.
This will improve the code in this file as well. I think we do similarly in ColumnMapping.java for nested field ids.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're right, thanks!
@@ -102,6 +105,47 @@ public String toString() { | |||
"StructField(name=%s,type=%s,nullable=%s,metadata=%s)", name, dataType, nullable, metadata); | |||
} | |||
|
|||
public FieldMetadata getSerializationMetadata() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need this and how is it different from the getMetadata
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand correctly, this is capturing the nested field collation types and returning in FieldMetadata. Why is this not already the case when this StructField is created?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stefankandic is this how Spark does? This seems not clear. What is the difference between getMetadata
vs this method? I understand this has the additional metadata, but for developers I see this causing ambiguity.
bfb2ee5
to
908750d
Compare
DataTypeJsonSerDe.parseDataType(objectMapper.readTree(json), | ||
"", | ||
new FieldMetadata.Builder().build()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DataTypeJsonSerDe.parseDataType(objectMapper.readTree(json), | |
"", | |
new FieldMetadata.Builder().build()) | |
DataTypeJsonSerDe.parseDataType(objectMapper.readTree(json), | |
"" /* fieldPath */, | |
new FieldMetadata.Builder().build() /* collation field metadata */) |
.add("b1", new ArrayType(new StringType("SPARK.UTF8_LCASE"), false)) | ||
.add("b2", new MapType( | ||
new StringType("ICU.UNICODE_CI"), new StringType("SPARK.UTF8_LCASE"), true), false) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what about the case where Map/array element is a struct which has a string column with collation.
), | ||
( | ||
structTypeJson(Seq( | ||
structFieldJson("a1", structTypeJson(Seq( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: add a comment just above the structFieldJson
on what test this field is covering. Easy to see the tests and understand.
) | ||
) | ||
|
||
val SAMPLE_JSON_TO_TYPES_WITH_COLLATION_DIFFERING = Seq( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does DIFFERING mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
idea of this test was to have just difference in StringType collation, i will make some better name
.add("b1", new StringType("ICU.UNICODE")), true) | ||
.add("a2", new StructType() | ||
// In json, this field has "SPARK.UTF8_LCASE" collation | ||
.add("b1", new ArrayType(StringType.STRING, false)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need these negative tests? If I understand correctly, as long as the StructType.equals
is implemented correctly, then this should work. May be just add one specific test where it is not matching?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
okay, make sense since we have StringTypeSuite where we test this
@@ -102,6 +105,47 @@ public String toString() { | |||
"StructField(name=%s,type=%s,nullable=%s,metadata=%s)", name, dataType, nullable, metadata); | |||
} | |||
|
|||
public FieldMetadata getSerializationMetadata() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@stefankandic is this how Spark does? This seems not clear. What is the difference between getMetadata
vs this method? I understand this has the additional metadata, but for developers I see this causing ambiguity.
nestedCollatedFields.addAll( | ||
getNestedCollatedFields(((ArrayType) parent).getElementType(), path + ".element")); | ||
} | ||
// We didn't check for StructType because we store the StringType's |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we still need to go through the fields within the StructType and check if any of them contains a Map/Array type.
@@ -79,9 +82,37 @@ public DataType getDataType() { | |||
|
|||
/** @return the metadata for this field */ | |||
public FieldMetadata getMetadata() { | |||
fetchCollationMetadata(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am wondering if this can be handled simply by adding the code in the StructField
constructor. It can go through the type and figure out if it needs collation data to be stored in its metadata. If yes, just add them there. Given StructField is immutable, we don't need to do dynamic computation of the collations for nested fields like here which is prone to bugs.
Which Delta project/connector is this regarding?
Description
Extended serialization and deserialization to support collations in metadata.
How was this patch tested?
Tests added to
DataTypeJsonSerDe.java
andStructTypeSuite.scala
.Does this PR introduce any user-facing changes?
No.