
[Spark] Make delta.dataSkippingStatsColumns more lenient for nested columns #2850

Open · wants to merge 18 commits into master

Conversation

Kimahriman
Contributor

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

Resolves #2822

Make delta.dataSkippingStatsColumns more lenient for nested columns by not throwing an exception when a nested column doesn't support gathering stats. This more closely matches the behavior of dataSkippingNumIndexedCols, which allows unsupported types in those columns (and still appears to gather null counts for them). It also enables use cases where you have a wide variety of types inside a top-level struct and simply want to gather stats on whatever columns inside that struct you can.
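For illustration, a hypothetical sketch (the table and column names are made up): a configuration like this previously threw an exception and is now accepted.

```scala
// Hypothetical example: pointing delta.dataSkippingStatsColumns at a struct
// that mixes eligible and ineligible types no longer throws.
spark.sql("""
  CREATE TABLE events (
    id BIGINT,
    payload STRUCT<ts: TIMESTAMP, tags: ARRAY<STRING>, score: DOUBLE>
  ) USING delta
  TBLPROPERTIES ('delta.dataSkippingStatsColumns' = 'payload')
""")
// min/max stats are gathered for payload.ts and payload.score; payload.tags
// (an array) only gets a null count and cannot drive min/max skipping.
```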

I kept the duplicate column checking in place to keep the change small, but I'm not sure how necessary that really is beyond letting users know they are doing something dumb.

How was this patch tested?

A couple of tests that asserted an exception for the now-allowed behavior were removed, and a new test was added to verify that the new behavior works.

Does this PR introduce any user-facing changes?

Yes, specifying a struct with unsupported stats gathering types in delta.dataSkippingStatsColumns is now allowed instead of throwing an exception.

@Kimahriman
Contributor Author

@kamcheungting-db since you added this originally

@Kimahriman
Contributor Author

@longvu-db since you're reviewing my other PR 😊. This would be a great quality of life improvement for complex nested schemas

@longvu-db
Contributor

@Kimahriman Will take a look!

…csCollection.scala

Co-authored-by: Thang Long Vu <107926660+longvu-db@users.noreply.github.com>
@@ -599,6 +599,27 @@ trait DataSkippingDeltaTestsBase extends DeltaExcludedBySparkVersionTestMixinShi
deltaStatsColNamesOpt = Some("b.c")
)

testSkipping(
Contributor

Should we consider a double struct test?

Contributor Author

Added a double nested struct as well

},
"i": 10
}""".replace("\n", ""),
hits = Seq(
Contributor

Could we have hit tests for elements inside the struct and the double struct as well?

Btw click on re-request review so that I get notified.

Contributor Author

Added more hit tests

Contributor

My understanding is that when we are inside a struct, we gather stats for the valid data types inside the struct, but we still don't for the invalid ones, right? Would we want to test that we still don't skip on the invalid data types inside the struct?

Contributor

Also, if we could check another skipping-eligible type besides integer, like timestamp, string, etc., that would be great.

Contributor Author

Technically null counts are collected, but they can't be used for skipping. We could add that to the test, but it seems out of scope, because you can't do min/max things with arrays/maps anyway, so it would have to be something like an element-contains check.

Contributor

I think there are other non-skipping-eligible data types besides list types.

Contributor Author

I'm not sure how to specify a Binary type in the inferred JSON. Is that the only non-complex type?

@longvu-db
Contributor

@Kimahriman Just out of curiosity, why are you making this change? Do you have some complex nested schemas that you want to skip data on?

@Kimahriman
Contributor Author

@Kimahriman Just out of curiosity, why are you making this change? Do you have some complex nested schemas that you want to skip data on?

Yes, we have structs with mixed arrays and primitive types. Currently I have to recursively count the fields to set the number of indexed columns to the right value, as in the sketch below.
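A rough sketch of that workaround (the schema here is made up, and this assumes delta.dataSkippingNumIndexedCols counts leaf fields in schema order):

```scala
import org.apache.spark.sql.types._

// Count leaf fields so dataSkippingNumIndexedCols covers every nested column.
def countLeafFields(dt: DataType): Int = dt match {
  case s: StructType => s.fields.map(f => countLeafFields(f.dataType)).sum
  case _             => 1 // arrays, maps, and primitives each count as one leaf
}

val schema = StructType(Seq(
  StructField("a", StructType(Seq(
    StructField("ts", TimestampType),
    StructField("tags", ArrayType(StringType))))),
  StructField("i", IntegerType)))

countLeafFields(schema) // 3, the value to set for delta.dataSkippingNumIndexedCols
```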

}
case _ if insideStruct => columnPaths.append(name)
Contributor

When we are inside the struct, we keep appending columns regardless of the data type, right?

So is the description of "@param columnPaths" no longer correct?

This seems outdated as well

* 2. Delta statistics column must exist in delta table's schema.

Contributor Author

Technically it is "valid": null counts are actually collected on all fields, regardless of whether min/max is supported.

@@ -458,15 +458,20 @@ object StatisticsCollection extends DeltaCommand {
* @param name The name of the data skipping column for validating data type.
* @param dataType The data type of the data skipping column.
* @param columnPaths The column paths of all valid fields.
* @param insideStruct Whether the datatype is inside a struct already, in which case we don't
Contributor

Instead of saying "we don't clear", we can make it clearer what we mean by that, I think.

Contributor Author

Changed, though I'm not sure it's any clearer.

@@ -608,7 +590,8 @@ class StatsCollectionSuite

Seq(
"BIGINT", "DATE", "DECIMAL(3, 2)", "DOUBLE", "FLOAT", "INT", "SMALLINT", "STRING",
"TIMESTAMP", "TIMESTAMP_NTZ", "TINYINT"
"TIMESTAMP", "TIMESTAMP_NTZ", "TINYINT", "STRUCT<c3: BIGINT>",
Contributor

longvu-db commented Aug 18, 2024

Looks like there are relevant tests in DataSkippingDeltaTests as well; could you take a look at that whole file too?

Contributor Author

Those tests are for specifying the number of indexed columns, not columns by name

@kamcheungting-db
Contributor

Hi Adam,

I intentionally throw an error when the column doesn't support delta stats.
The reason behind this is: if we don't throw an error, the user doesn't know why the delta stats are not collected. It is hard for the user to differentiate an unsupported data type from a bug.

@kamcheungting-db
Contributor

If we want to swallow the unsupported-column error, there should be a way to tell the user that the column is unsupported.

@Kimahriman
Contributor Author

Kimahriman commented Aug 18, 2024

Happy to log a warning, but it's not actually unsupported. Null counts are still collected for all columns. If users are confused why their arrays and maps aren't collecting mins and maxes, they have bigger problems. This behavior matches how the number of indexed columns work.

@kamcheungting-db
Contributor

kamcheungting-db commented Aug 18, 2024

Warning sounds like a good indicator.

@Kimahriman
Contributor Author

Warning sounds like a good indicator.

Added a warning log; let me know if you have thoughts on the exact verbiage, I just copied the error text.

}
case SkippingEligibleDataType(_) => columnPaths.append(name)
case d if insideStruct =>
Contributor

Should we also make non-struct fields have the same behavior?

Contributor Author

Up to you. The idea was: if you directly specify an ineligible type, throw an exception, because you did something wrong. If you specify a struct, just work with whichever subfields are supported. A minimal sketch of that rule is below.
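A minimal sketch of that rule, paraphrasing the validation logic with simplified signatures (the logging, error, and eligibility helpers here are stand-ins, not the real Delta ones):

```scala
import scala.collection.mutable
import org.apache.spark.sql.types._

// Stand-in helpers; the real code uses Delta's logging, error classes, and
// SkippingEligibleDataType extractor, which has a different eligibility list.
def logWarning(msg: String): Unit = println(s"WARN: $msg")
def statsTypeError(name: String, dt: DataType): Exception =
  new IllegalArgumentException(s"Column $name of type $dt is not valid for data skipping stats")

object SkippingEligibleDataType {
  def unapply(dt: DataType): Option[DataType] = dt match {
    case _: ArrayType | _: MapType | _: BinaryType => None // approximation only
    case other                                     => Some(other)
  }
}

def validateColumn(name: String, dataType: DataType,
                   columnPaths: mutable.ArrayBuffer[String],
                   insideStruct: Boolean): Unit = dataType match {
  case s: StructType =>
    // Recurse; everything below this point counts as "inside a struct".
    s.fields.foreach { f =>
      validateColumn(s"$name.${f.name}", f.dataType, columnPaths, insideStruct = true)
    }
  case SkippingEligibleDataType(_) =>
    columnPaths.append(name)
  case _ if insideStruct =>
    // Nested ineligible type: warn and keep it (null counts are still collected).
    logWarning(s"Data skipping is not supported for column $name")
    columnPaths.append(name)
  case _ =>
    // A directly specified ineligible type is still an error.
    throw statsTypeError(name, dataType)
}
```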

Contributor

Then I am good with it.
Thank you!
