[Kernel] Extended schema JSON serde to support collations #3628

ilicmarkodb · 2024-08-30T17:11:50Z

Which Delta project/connector is this regarding?

Description

Extended serialization and deserialization to support collations in metadata.

How was this patch tested?

Tests added to DataTypeJsonSerDe.java and StructTypeSuite.scala.

Does this PR introduce any user-facing changes?

No.

ilicmarkodb · 2024-08-30T17:23:25Z

@vkorukanti can you please review?

vkorukanti

Looks great!

Please add a few more comments on how the collation property is stored. The method docs in ColumnMapping (where nested-level field IDs are similarly stored in metadata) are a good example.

vkorukanti · 2024-09-06T05:55:11Z

kernel/kernel-api/src/main/java/io/delta/kernel/expressions/CollationIdentifier.java

+
+import java.util.Optional;
+
+public class CollationIdentifier {


Please add Javadoc, an @since version, and the @Evolving tag.

vkorukanti · 2024-09-06T05:55:31Z

kernel/kernel-api/src/main/java/io/delta/kernel/expressions/CollationIdentifier.java

+import java.util.Optional;
+
+public class CollationIdentifier {
+  public static final String PROVIDER_SPARK = "SPARK";


What do these constants mean? Please add a comment.

vkorukanti · 2024-09-06T05:57:20Z

kernel/kernel-api/src/main/java/io/delta/kernel/expressions/CollationIdentifier.java

+    if (parts.length == 1) {
+      throw new IllegalArgumentException(
+          String.format("Invalid collation identifier: %s", identifier));


checkArgument(parts.length != 1, String.format("Invalid collation identifier: %s", identifier));

or

switch (parts.length) {
  case 2:
  case 3:
  default: throw error
}
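The switch variant suggested above could look roughly like the following. This is only a sketch: `CollationIdSketch` is a hypothetical stand-in for the class under review, and splitting with a limit of 3 (so that dots inside a version string stay together) is an assumption, not something this thread confirms.

```java
import java.util.Optional;

// Hypothetical stand-in for the class under review; not the Kernel's actual code.
final class CollationIdSketch {
  final String provider;
  final String name;
  final Optional<String> version;

  CollationIdSketch(String provider, String name, Optional<String> version) {
    this.provider = provider;
    this.name = name;
    this.version = version;
  }

  /** Parses "PROVIDER.NAME" or "PROVIDER.NAME.VERSION"; a single part is rejected. */
  static CollationIdSketch fromString(String identifier) {
    // Assumption: a limit of 3 keeps any further dots inside the version part.
    String[] parts = identifier.split("\\.", 3);
    switch (parts.length) {
      case 2:
        return new CollationIdSketch(parts[0], parts[1], Optional.empty());
      case 3:
        return new CollationIdSketch(parts[0], parts[1], Optional.of(parts[2]));
      default:
        throw new IllegalArgumentException(
            String.format("Invalid collation identifier: %s", identifier));
    }
  }
}
```

Compared with the single `parts.length == 1` check, the switch makes each accepted shape explicit and routes everything else to one error path.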

vkorukanti · 2024-09-06T05:58:46Z

kernel/kernel-api/src/main/java/io/delta/kernel/expressions/CollationIdentifier.java

+    return String.format("%s.%s", provider, name);
+  }
+
+  public static CollationIdentifier fromString(String identifier) {


All public APIs are going to show up in API docs. Please add proper javadoc.

vkorukanti · 2024-09-06T05:59:33Z

kernel/kernel-api/src/main/java/io/delta/kernel/expressions/CollationIdentifier.java

+    this.provider = provider;
+    this.name = name;
+    this.version = version;


Add null checks, e.g. with Objects.requireNonNull.
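As a rough illustration of the null checks requested here, a constructor can validate and assign in one step. `NullCheckedCollationId` and the message strings are hypothetical; only the three field names come from the diff above.

```java
import java.util.Objects;
import java.util.Optional;

// Hypothetical stand-in; not the Kernel class itself.
final class NullCheckedCollationId {
  private final String provider;
  private final String name;
  private final Optional<String> version;

  NullCheckedCollationId(String provider, String name, Optional<String> version) {
    // requireNonNull returns its argument, so the check and the assignment share a line.
    this.provider = Objects.requireNonNull(provider, "provider is null");
    this.name = Objects.requireNonNull(name, "name is null");
    this.version = Objects.requireNonNull(version, "version is null");
  }

  String getProvider() {
    return provider;
  }

  String getName() {
    return name;
  }

  Optional<String> getVersion() {
    return version;
  }
}
```

Failing in the constructor means a bad argument surfaces where the object is created rather than later, when a field is first dereferenced.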

vkorukanti · 2024-09-06T06:11:43Z

kernel/kernel-api/src/main/java/io/delta/kernel/types/StringType.java

@@ -16,6 +16,7 @@
 package io.delta.kernel.types;

 import io.delta.kernel.annotation.Evolving;
+import io.delta.kernel.expressions.CollationIdentifier;

 /**
 * The data type representing {@code string} type values.


Update the doc to include collation info.


vkorukanti · 2024-09-06T06:15:50Z

kernel/kernel-api/src/test/scala/io/delta/kernel/internal/types/DataTypeJsonSerDeSuite.scala

@@ -130,6 +132,14 @@ class DataTypeJsonSerDeSuite extends AnyFunSuite {
    }
  }

+  test("parseDataType: types with collated strings") {


Add some negative tests that cause the parser to fail.


vkorukanti · 2024-09-06T06:17:11Z

kernel/kernel-api/src/test/scala/io/delta/kernel/types/StructTypeSuite.scala

+
+import org.scalatest.funsuite.AnyFunSuite
+
+class StructTypeSuite extends AnyFunSuite {


Oh, I see. Why not just add the tests in DataTypeJsonSerDeSuite itself?

Basically, you can round-trip and test that both serialization and deserialization work.

You're right, thanks.

vkorukanti · 2024-09-24T10:48:25Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/types/DataTypeJsonSerDe.java

@@ -89,7 +89,8 @@ public static String serializeDataType(DataType dataType) {
   */
  public static StructType deserializeStructType(String structTypeJson) {
    try {
-      DataType parsedType = parseDataType(OBJECT_MAPPER.reader().readTree(structTypeJson));
+      DataType parsedType =
+          parseDataType(OBJECT_MAPPER.reader().readTree(structTypeJson), "", new HashMap<>());


Suggested change

-          parseDataType(OBJECT_MAPPER.reader().readTree(structTypeJson), "", new HashMap<>());
+          parseDataType(
+              OBJECT_MAPPER.reader().readTree(structTypeJson),
+              "" /* fieldPath */,
+              new HashMap<>() /* collationMap */);

vkorukanti · 2024-09-24T10:51:15Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/types/DataTypeJsonSerDe.java

@@ -131,22 +132,29 @@ public static StructType deserializeStructType(String structTypeJson) {
   *   "metadata" : { }
   * }
   * </pre>
+   *
+   * @param fieldPath Path from the nearest ancestor that is of the {@link StructField} type.
+   * @param collationMap Maps the path of a {@link StringType} to its collation. Only maps non-UTF8_BINARY collated {@link StringType}.


Mention why it is needed and how it is used. Basically, the element types of a map or array have no field metadata of their own, so they can't hold the collation; instead, the nearest ancestor StructField, which does have field metadata, stores the collation for the map/array elements. Also mention the format of the lookup key (path).
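To make the path idea concrete, here is a minimal sketch of how collations nested inside arrays and maps could be collected under the owning field. Everything in it is a toy model, not the Kernel's type hierarchy: the `*Sketch` classes are hypothetical, the `.key` and `.value` suffixes and the `SPARK.UTF8_BINARY` default are assumptions, and only the `.element` suffix appears in the diff under review.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Toy type model, hypothetical and far smaller than the Kernel's DataType hierarchy.
abstract class TypeSketch {}

final class StringSketch extends TypeSketch {
  final String collation; // e.g. "SPARK.UTF8_BINARY" or "ICU.UNICODE_CI"

  StringSketch(String collation) {
    this.collation = collation;
  }
}

final class ArraySketch extends TypeSketch {
  final TypeSketch element;

  ArraySketch(TypeSketch element) {
    this.element = element;
  }
}

final class MapSketch extends TypeSketch {
  final TypeSketch key;
  final TypeSketch value;

  MapSketch(TypeSketch key, TypeSketch value) {
    this.key = key;
    this.value = value;
  }
}

final class CollationPaths {
  // Assumed name of the default collation, which is never written to the map.
  static final String DEFAULT_COLLATION = "SPARK.UTF8_BINARY";

  /** Collects path -> collation for non-default strings reachable through arrays and maps. */
  static Map<String, String> collect(String fieldName, TypeSketch type) {
    Map<String, String> out = new LinkedHashMap<>();
    walk(fieldName, type, out);
    return out;
  }

  private static void walk(String path, TypeSketch type, Map<String, String> out) {
    if (type instanceof StringSketch) {
      String collation = ((StringSketch) type).collation;
      if (!DEFAULT_COLLATION.equals(collation)) {
        out.put(path, collation);
      }
    } else if (type instanceof ArraySketch) {
      walk(path + ".element", ((ArraySketch) type).element, out);
    } else if (type instanceof MapSketch) {
      walk(path + ".key", ((MapSketch) type).key, out);
      walk(path + ".value", ((MapSketch) type).value, out);
    }
    // Any other type contributes nothing in this sketch.
  }
}
```

Under these assumptions, a field `b2` holding a map from `ICU.UNICODE_CI` strings to `SPARK.UTF8_LCASE` strings would yield the keys `b2.key` and `b2.value`, and the deserializer would look up the same keys while rebuilding the nested types.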

vkorukanti · 2024-09-24T10:53:27Z

kernel/kernel-api/src/test/scala/io/delta/kernel/types/StringTypeSuite.scala

@@ -0,0 +1,43 @@
+package io.delta.kernel.types


Add the header? Didn't this fail the CI job?

vkorukanti · 2024-09-24T10:54:32Z

kernel/kernel-api/src/test/scala/io/delta/kernel/types/StringTypeSuite.scala

+
+class StringTypeSuite extends AnyFunSuite {
+  test("check equals") {
+    Seq(


Suggested change

-    Seq(
+    // Testcase: (instance1, instance2, expected value for `instance1 == instance2`)
+    Seq(

vkorukanti · 2024-09-24T11:23:35Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/types/DataTypeJsonSerDe.java

   */
-  static DataType parseDataType(JsonNode json) {
+  static DataType parseDataType(
+      JsonNode json, String fieldPath, HashMap<String, String> collationMap) {


You can just use Map instead of HashMap in the argument definitions. Same in other places.
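A small sketch of what declaring the parameter against the interface buys. `MapParamSketch`, `collationFor`, and the default value are hypothetical names chosen for illustration only.

```java
import java.util.Map;

final class MapParamSketch {
  // Assumed fallback for this sketch when a path has no entry.
  static final String DEFAULT_COLLATION = "SPARK.UTF8_BINARY";

  // Declared against the Map interface, so callers may pass any implementation.
  static String collationFor(Map<String, String> collationMap, String fieldPath) {
    return collationMap.getOrDefault(fieldPath, DEFAULT_COLLATION);
  }
}
```

With this signature a caller can pass a `HashMap`, a `TreeMap`, or `Collections.emptyMap()` without any copying, whereas a `HashMap` parameter would force every caller to build that one concrete type.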

vkorukanti · 2024-09-24T11:31:14Z

kernel/kernel-api/src/main/java/io/delta/kernel/internal/types/DataTypeJsonSerDe.java

+  private static HashMap<String, String> getCollationsMap(JsonNode fieldMetadata) {
+    if (fieldMetadata == null || !fieldMetadata.has(DataType.COLLATIONS_METADATA_KEY)) {
+      return new HashMap<>();
+    }
+    HashMap<String, String> collationsMap = new HashMap<>();
+    FieldMetadata collationFieldMetadata =
+        parseFieldMetadata(fieldMetadata.get(DataType.COLLATIONS_METADATA_KEY));
+    for (Map.Entry<String, Object> collationField :
+        collationFieldMetadata.getEntries().entrySet()) {
+      String fieldPath = collationField.getKey();
+      Object collationName = collationField.getValue();
+      if (!(collationName instanceof String)) {
+        throw new IllegalArgumentException(
+            String.format("Invalid collation name: %s.", collationName));
+      }
+      collationsMap.put(fieldPath, (String) collationName);
+    }
+    return collationsMap;
+  }


I don't see a need for this. Instead of creating the Map, why not just use the fieldMetadata directly? FieldMetadata has a getString that already does the type checks.

This will also simplify the code in this file. I think we do something similar in ColumnMapping.java for nested field IDs.

You're right, thanks!

vkorukanti · 2024-09-24T11:33:49Z

kernel/kernel-api/src/main/java/io/delta/kernel/types/StructField.java

@@ -102,6 +105,47 @@ public String toString() {
        "StructField(name=%s,type=%s,nullable=%s,metadata=%s)", name, dataType, nullable, metadata);
  }

+  public FieldMetadata getSerializationMetadata() {


Why do we need this, and how is it different from getMetadata?

If I understand correctly, this captures the collations of nested fields and returns them in a FieldMetadata. Why is this not already the case when the StructField is created?

@stefankandic is this how Spark does it? This is not clear. What is the difference between getMetadata and this method? I understand that this one has the additional metadata, but I can see it causing ambiguity for developers.

vkorukanti · 2024-09-25T16:00:39Z

kernel/kernel-api/src/test/scala/io/delta/kernel/internal/types/DataTypeJsonSerDeSuite.scala

+    DataTypeJsonSerDe.parseDataType(objectMapper.readTree(json),
+      "",
+      new FieldMetadata.Builder().build())


Suggested change

-    DataTypeJsonSerDe.parseDataType(objectMapper.readTree(json),
-      "",
-      new FieldMetadata.Builder().build())
+    DataTypeJsonSerDe.parseDataType(objectMapper.readTree(json),
+      "" /* fieldPath */,
+      new FieldMetadata.Builder().build() /* collation field metadata */)

vkorukanti · 2024-09-25T16:08:56Z

kernel/kernel-api/src/test/scala/io/delta/kernel/internal/types/DataTypeJsonSerDeSuite.scala

+          .add("b1", new ArrayType(new StringType("SPARK.UTF8_LCASE"), false))
+          .add("b2", new MapType(
+            new StringType("ICU.UNICODE_CI"), new StringType("SPARK.UTF8_LCASE"), true), false)


What about the case where the map/array element is a struct that has a string column with a collation?

vkorukanti · 2024-09-25T16:11:21Z

kernel/kernel-api/src/test/scala/io/delta/kernel/internal/types/DataTypeJsonSerDeSuite.scala

+    ),
+    (
+      structTypeJson(Seq(
+        structFieldJson("a1", structTypeJson(Seq(


nit: add a comment just above the structFieldJson saying which case this field covers. That makes the tests easier to read and understand.

vkorukanti · 2024-09-25T16:11:46Z

kernel/kernel-api/src/test/scala/io/delta/kernel/internal/types/DataTypeJsonSerDeSuite.scala

+    )
+  )
+
+  val SAMPLE_JSON_TO_TYPES_WITH_COLLATION_DIFFERING = Seq(


What does DIFFERING mean?

The idea of this test was to have a difference only in the StringType collation. I will come up with a better name.

vkorukanti · 2024-09-25T16:16:14Z

kernel/kernel-api/src/test/scala/io/delta/kernel/internal/types/DataTypeJsonSerDeSuite.scala

+          .add("b1", new StringType("ICU.UNICODE")), true)
+        .add("a2", new StructType()
+          // In json, this field has "SPARK.UTF8_LCASE" collation
+          .add("b1", new ArrayType(StringType.STRING, false))


Why do we need these negative tests? If I understand correctly, as long as StructType.equals is implemented correctly, this should work. Maybe just add one specific test where the types do not match?

Okay, makes sense, since we have StringTypeSuite where we test this.

vkorukanti · 2024-09-25T16:18:56Z

kernel/kernel-api/src/main/java/io/delta/kernel/types/StructField.java

@@ -102,6 +105,47 @@ public String toString() {
        "StructField(name=%s,type=%s,nullable=%s,metadata=%s)", name, dataType, nullable, metadata);
  }

+  public FieldMetadata getSerializationMetadata() {


@stefankandic is this how Spark does it? This is not clear. What is the difference between getMetadata and this method? I understand that this one has the additional metadata, but I can see it causing ambiguity for developers.

vkorukanti · 2024-09-26T23:01:07Z

kernel/kernel-api/src/main/java/io/delta/kernel/types/StructField.java

+      nestedCollatedFields.addAll(
+          getNestedCollatedFields(((ArrayType) parent).getElementType(), path + ".element"));
+    }
+    // We didn't check for StructType because we store the StringType's


I think we still need to go through the fields within the StructType and check if any of them contains a Map/Array type.

vkorukanti · 2024-09-26T23:04:31Z

kernel/kernel-api/src/main/java/io/delta/kernel/types/StructField.java

@@ -79,9 +82,37 @@ public DataType getDataType() {

  /** @return the metadata for this field */
  public FieldMetadata getMetadata() {
+    fetchCollationMetadata();


I am wondering if this can be handled simply by adding the code to the StructField constructor. It can go through the type and figure out whether collation data needs to be stored in its metadata. If so, just add it there. Since StructField is immutable, we don't need to compute the collations of nested fields dynamically as is done here, which is prone to bugs.
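The constructor-time approach can be sketched as follows. This is a hypothetical miniature, not the Kernel's StructField: the class name, the plain `Map` used for metadata, and the `"collations"` key are all made up for illustration, and the nested collations are passed in rather than derived from a type.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical miniature of an immutable field; not the Kernel's StructField.
final class ImmutableFieldSketch {
  // Made-up key name for this sketch.
  static final String COLLATIONS_KEY = "collations";

  private final String name;
  private final Map<String, Object> metadata;

  ImmutableFieldSketch(
      String name, Map<String, Object> userMetadata, Map<String, String> nestedCollations) {
    this.name = name;
    // Derive the full metadata exactly once, while the object is being built.
    Map<String, Object> merged = new LinkedHashMap<>(userMetadata);
    if (!nestedCollations.isEmpty()) {
      merged.put(
          COLLATIONS_KEY, Collections.unmodifiableMap(new LinkedHashMap<>(nestedCollations)));
    }
    this.metadata = Collections.unmodifiableMap(merged);
  }

  String getName() {
    return name;
  }

  /** Plain accessor: nothing is recomputed, so repeated calls always agree. */
  Map<String, Object> getMetadata() {
    return metadata;
  }
}
```

Because the merged map is built once and wrapped as unmodifiable, the getter has no side effects and there is no second accessor whose result could drift from the first.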

ilicmarkodb added 8 commits August 30, 2024 18:23

extended StringType to have CollationIdentifier

fb588c1

reordered attributes

db4e7b2

changed PROVIDER_KERNEL to PROVIDER_SPARK

49059ff

extended serialization and deserialization to support collation

d9279ef

style fix

916049e

style fix

54b59f4

style fix

9715cf5

added CollationIdentifier equals

55c5191

ilicmarkodb added 9 commits August 30, 2024 19:25

style fix

51162f0

style fix

712e081

fix

36571ab

tests added for CollationIdentifier

86602c6

style fix

76cdbd5

style fix

d8fc611

changed toString and fromString

9c9684a

changed CollationIdentifier

c6bd336

changed CollationIdentifier

5e0e43e

vkorukanti changed the title ~~Extended serialization and deserialization to support collations in metadata.~~ [Kernel] Extended schema JSON serde to support collations Sep 6, 2024

vkorukanti requested changes Sep 6, 2024


vkorukanti added the kernel label Sep 6, 2024

vkorukanti mentioned this pull request Sep 6, 2024

[KERNEL] Extended StringType to have CollationIdentifier #3627

Merged

5 tasks

ilicmarkodb added 8 commits September 9, 2024 14:23

merged with extend_string_type_to_have_collation

2d9465d

suggestions applied

20a1081

suggestions applied

6469ba1

merged with extend_string_type_to_have_collation

daa2f66

javadoc updated

37c3617

merged with extend_string_type_to_have_collation

9b2835f

temp

8e0fb82

temp

164edcc

stringtype equals updated

14b7327

ilicmarkodb requested a review from vkorukanti September 10, 2024 14:17

ilicmarkodb added 5 commits September 12, 2024 11:10

removed DEFAULT values

e914e6f

since tag added

1feea71

merged with extend_string_type_to_have_collation

4c7d72f

changed CollationIdentifier constructor

d95ebc0

java doc added

a7e435b

vkorukanti requested changes Sep 24, 2024


ilicmarkodb added 4 commits September 24, 2024 16:13

temp

fa836a4

suggestion applied

eff0abf

test fixed

99ce5ae

style fix

bd62e3d

ilicmarkodb requested a review from vkorukanti September 24, 2024 16:22

merged with master

908750d

ilicmarkodb force-pushed the extend_SerDe_to_support_collations branch from bfb2ee5 to 908750d Compare September 24, 2024 17:02

vkorukanti requested changes Sep 25, 2024


ilicmarkodb added 4 commits September 25, 2024 22:45

temp

9b2001b

suggestions applied

52280ce

style fix

6a45b46

added fetchCollationMetadata method

555d49e

ilicmarkodb requested a review from vkorukanti September 26, 2024 21:47

vkorukanti requested changes Sep 26, 2024


ilicmarkodb added 4 commits September 27, 2024 01:14

moved fetchCollationMetadata to constructor

d359ed4

style fix

67854c9

fix

c6f1c97

fix

c5f41b9

vkorukanti approved these changes Sep 26, 2024


vkorukanti added 2 commits September 26, 2024 16:23

Update StructField.java

7b8c844

minor change

ae0b189

vkorukanti merged commit b1e4a03 into delta-io:master Sep 27, 2024
16 of 17 checks passed
