
SNOW-1570734: Reduce describe query when there is no schema change #2126

Merged
merged 2 commits into main from jdu-SNOW-1570734-reduce-desc-no-schmea-change
Oct 15, 2024

Conversation

@sfc-gh-jdu (Collaborator) commented Aug 19, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1570734

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
  3. Please describe how your code solves the related issue.

    This PR now covers only one part: caching attributes (metadata) when 1) the DataFrame operation doesn't change the attributes (e.g., filter, sort), and 2) the SQL simplifier is disabled, or the SQL simplifier is enabled but a SelectStatement is created directly from a non-SelectStatement (e.g., session.sql(...).schema can cache attributes; see the tests for more). I will have another PR that tries to cache attributes on SelectStatement, which will address 2).
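
For illustration, a minimal usage sketch of the intended behavior (hypothetical helper name; assumes an existing Snowpark session and the experimental flag named in this PR's CHANGELOG draft; the comments describe the intent of the change, not a guaranteed contract):

    from snowflake.snowpark import Session
    from snowflake.snowpark.functions import col

    def demo_reduced_describe_queries(session: Session) -> None:
        # Experimental flag from this PR's CHANGELOG draft; it is off by default.
        session.reduce_describe_query_enabled = True

        df = session.sql("SELECT 1 AS a, 2 AS b")
        _ = df.schema                       # may issue one describe query and cache the attributes
        filtered = df.filter(col("a") > 0)  # filter/sort/limit do not change the schema
        _ = filtered.schema                 # expected to reuse the cached attributes instead of
                                            # sending another describe query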

@sfc-gh-jdu requested a review from a team as a code owner on August 19, 2024 23:08
@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from 108c337 to 9da3614 on August 19, 2024 23:14
@github-actions bot added the local testing (Local Testing issues/PRs) label on Aug 20, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

CHANGELOG.md (outdated; resolved)
tests/integ/test_reduce_describe_query.py (outdated; resolved)
CHANGELOG.md (outdated; resolved)
@sfc-gh-jdu marked this pull request as draft on August 26, 2024 20:55
@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from 6f99186 to 465c937 on August 26, 2024 20:57

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from 465c937 to f916556 on October 8, 2024 20:25

github-actions bot commented Oct 8, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from f916556 to f2e642c on October 8, 2024 20:27

github-actions bot commented Oct 8, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

1 similar comment

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from ab55255 to 4b069e4 on October 9, 2024 20:28

github-actions bot commented Oct 9, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from 4b069e4 to 4127d27 on October 9, 2024 20:59

github-actions bot commented Oct 9, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from 4127d27 to ed5eb81 on October 9, 2024 21:01

github-actions bot commented Oct 9, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu marked this pull request as ready for review on October 9, 2024 21:02
# Create from SQL
create_from_sql_funcs = [
    lambda session: session.sql("SELECT 1 AS a, 2 AS b"),
    # lambda session: session.sql("SELECT 1 AS a, 2 AS b").select("b"),
Contributor:

remove commented code?

Collaborator:

I think @sfc-gh-jdu wants to enable this once the inferring also works on SelectStatement. But @sfc-gh-jdu, I thought this works when the SQL simplifier is enabled; maybe we can only disable those cases when the SQL simplifier is enabled, to be more clear.

Collaborator (author):

no worries, I think I can remove them for now and add them back in the next PR (very soon)


# Create from Values
create_from_values_funcs = [
    # lambda session: session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"]),
Contributor:

same (although I see all entries commented out in this case?)

    lambda session: session.create_dataframe(
        [[1, 2], [3, 4]], schema=["a", "b"]
    ).cache_result(),
    # lambda session: session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"]).cache_result().select("b"),
Contributor:

same

    lambda session: session.create_dataframe(
        [[1, 2], [3, 4]], schema=["a", "b"]
    ).rename({"b": "c"}),
    # lambda session: session.range(10).to_df("a")
Contributor:

same

    lambda df: df.sort(-col("a")),
    lambda df: df.limit(2),
    # TODO SNOW-1728988: enable this test case (no flatten) after caching attributes on SelectStatement
    # lambda df: df.sort(col("a").desc()).limit(2).filter(col("a") > 2),
Contributor:

same

Collaborator (author):

sure, I can remove them and add them back later in the next PR

CHANGELOG.md Outdated

#### Improvements

- Reduced the number of additional [describe queries](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-example#retrieving-column-metadata) sent to the server to fetch the metadata of a DataFrame. It is still an experimental feature not enabled by default, and can be enabled by setting `session.reduce_describe_query_enabled` to `True`.
Collaborator:

@sfc-gh-jdu do you want to separate reducing the attribute call via inferring from reducing the describe query for SelectStatement? It might be good to introduce a different parameter name to separate the two changes.

Collaborator (author):

I think they are related (in terms of code change) and both target reducing describe queries, so one parameter is enough?

Collaborator:

If we want to bundle those together, then let's remove it from the release log for now; we should add the release log entry when the change is ready, or handle it the same way as other improvement features we have added that are not user visible.
Also, we should have a ticket to track adding the server-side parameter.

Collaborator (author):

Oh yeah, that's true, sorry I missed that. Let me remove it first

CHANGELOG.md Outdated
#### Improvements

- Reduced the number of additional [describe queries](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-example#retrieving-column-metadata) sent to the server to fetch the metadata of a DataFrame. It is still an experimental feature not enabled by default, and can be enabled by setting `session.reduce_describe_query_enabled` to `True`.

## 1.23.0 (TBD)
Collaborator:

I think you need to rebase this; 1.23.0 has been released.

Collaborator (author):

yes

    elif isinstance(source_plan, SelectStatement):
        # When source_plan._snowflake_plan is not None, `get_snowflake_plan` is called
        # to create a new SnowflakePlan and `infer_metadata` is already called on the new plan.
        if source_plan._snowflake_plan is not None:
Collaborator:

I am a little bit confused here. I thought we need to cache the attributes at SelectStatement? Or maybe I am confusing myself; would you mind reminding me in which case we need to cache the attributes on SelectStatement?

Collaborator (author):

Yes, we will cache the attributes at SelectStatement, but we will do it in the next PR (or do you think it's OK to include it in this PR?)

Collaborator (@sfc-gh-yzou) commented Oct 14, 2024:

No, I don't think we want to do it in that PR, but could you remind me of the example of why we need to cache it on SelectStatement (I recall the cache would be lost in some case, but I couldn't recall it exactly)?

Also, if we need to cache it on SelectStatement, does the branch here do anything? In other words, what case does it handle?

Collaborator (author):

Yes, it's needed for any dataframe after select, e.g., df = session.sql("SELECT 1 AS a, 2 AS b").select("b").

This is because after select, df._plan is resolved from df._select_statement (it is also df._select_statement._snowflake_plan). When we call df.schema, the attributes are cached on df._plan. Then if we call df.filter(...) when the SQL simplifier is enabled, df._select_statement._snowflake_plan is not copied to the new SelectStatement:

        if can_be_flattened:
            new = copy(self)
            new.from_ = self.from_.to_subqueryable()
            new.pre_actions = new.from_.pre_actions
            new.post_actions = new.from_.post_actions
            new.column_states = self.column_states
            new.where = And(self.where, col) if self.where is not None else col
            new._merge_projection_complexity_with_subquery = False
        else:
            new = SelectStatement(
                from_=self.to_subqueryable(), where=col, analyzer=self.analyzer
            )

and df._plan is also no longer used in the new df._select_statement because of:

        return self._with_plan(
            self._select_statement.filter(
                _to_col_if_sql_expr(expr, "filter/where")._expression
            )
        )

This means we will lose the attributes cached on df._plan, and we need to cache them on SelectStatement.

Collaborator:

I went back to double-check our conversation before, and I think I now recall what the problem is. This could still handle the SelectStatement when flatten is not triggered; the cache on SelectStatement is used to handle the case when flatten is triggered.

Collaborator (author):

To be precise,

        if can_be_flattened:
            new = copy(self)
            new.from_ = self.from_.to_subqueryable()
            new.pre_actions = new.from_.pre_actions
            new.post_actions = new.from_.post_actions
            new.column_states = self.column_states
            new.where = And(self.where, col) if self.where is not None else col
            new._merge_projection_complexity_with_subquery = False
        else:
            new = SelectStatement(
                from_=self.to_subqueryable(), where=col, analyzer=self.analyzer
            )

When flatten is triggered, we can still handle it, because from_ is carried over to the next SelectStatement; when flatten is not triggered, we have to cache it on SelectStatement.
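
As a hedged illustration of the two cases at the DataFrame level (assuming an existing session; the flatten/no-flatten behavior is the intent described above and in the TODO SNOW-1728988 test case, not a guaranteed contract):

    from snowflake.snowpark.functions import col

    df = session.sql("SELECT 1 AS a, 2 AS b")
    _ = df.schema                        # attributes cached on df._plan

    # Flatten is triggered: from_ (and its resolved plan) is carried over,
    # so the cached attributes remain reachable.
    flattened = df.filter(col("a") > 0)

    # Flatten is not triggered (the chain from the TODO SNOW-1728988 test case):
    # a new SelectStatement wraps a subquery, so the cache on df._plan is lost
    # until attributes are also cached on SelectStatement (planned follow-up PR).
    not_flattened = df.sort(col("a").desc()).limit(2).filter(col("a") > 2)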

        # If _attributes is None, retrieve quoted_identifiers from _quoted_identifiers.
        # If _quoted_identifiers is None, retrieve quoted_identifiers from attributes
        # (which triggers describe query).
        if self._attributes is not None:
Collaborator:

What we can do here is first check whether self._quoted_identifiers is not None, and otherwise just return [attr.name for attr in self.attributes]; self.attributes should take care of the attribute check. Roughly like the sketch below.
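
A minimal sketch of that suggestion, assuming the attribute names from the PR diff (_attributes, _quoted_identifiers); the class is a toy stand-in for SnowflakePlan, only to show the suggested fallback order, not the real implementation:

    from typing import List, Optional

    class PlanMetadataSketch:
        """Toy stand-in for SnowflakePlan, only to show the suggested fallback order."""

        def __init__(self, attributes=None, quoted_identifiers: Optional[List[str]] = None) -> None:
            self._attributes = attributes
            self._quoted_identifiers = quoted_identifiers

        @property
        def attributes(self):
            # In Snowpark this is where a describe query would run when nothing is cached.
            if self._attributes is None:
                self._attributes = self._describe()
            return self._attributes

        @property
        def quoted_identifiers(self) -> List[str]:
            # Reviewer's suggestion: prefer the cached identifiers, otherwise derive them
            # from attributes (which may trigger the describe query via self.attributes).
            if self._quoted_identifiers is not None:
                return self._quoted_identifiers
            return [attr.name for attr in self.attributes]

        def _describe(self):
            raise NotImplementedError("placeholder for the real describe-query path")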

Collaborator (author):

Yes, good idea; we can use it in the next PR.

src/snowflake/snowpark/mock/_plan.py (resolved)
@@ -266,6 +267,14 @@ def __init__(
        # UUID for the plan to uniquely identify the SnowflakePlan object. We also use this
        # to UUID track queries that are generated from the same plan.
        self._uuid = str(uuid.uuid4())
        self._attributes = None
        self._quoted_identifiers = None
Collaborator:

no one is actually using quoted identifiers yet, right?

        # to create a new SnowflakePlan and `infer_metadata` is already called on the new plan.
        if source_plan._snowflake_plan is not None:
            attributes = source_plan._snowflake_plan._attributes
            quoted_identifiers = source_plan._snowflake_plan._quoted_identifiers
Collaborator:

It seems the quoted identifiers are currently not used anywhere, and there is no actual inferring yet (by inferring I mean extracting the names from the projection). Maybe we can do it along with actually using it, in a different PR.

Collaborator (author):

sure, let's add it in another PR

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from ed5eb81 to 58a46d5 on October 14, 2024 18:58

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

1 similar comment

    def attributes(self) -> List[Attribute]:
        if self._attributes is not None:
Collaborator:

I see we have an `output` below, which seems to be just a copy of the attributes. I am really confused about why we have both `output` and `attributes`; do you know?

Collaborator (author):

I think it's from Scala Snowpark, from when we copied some of its code at the time... but I don't know why we have it in Scala Snowpark.

@sfc-gh-yzou (Collaborator):

@sfc-gh-jdu overall looks good; I had a couple of clarification questions and a suggestion about the release parameters.

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from cdec0f1 to 9854781 on October 14, 2024 23:44

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from 9854781 to f0efd3b on October 14, 2024 23:48

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu added the NO-CHANGELOG-UPDATES (This pull request does not need to update CHANGELOG.md) label on Oct 14, 2024
        if (
            source_plan._snowflake_plan is not None
            and source_plan._snowflake_plan._attributes is not None
        ):
Collaborator:

Typically for a SelectStatement, if _snowflake_plan is not None, that would be the resolved SnowflakePlan directly, so this check and update seem unnecessary here.

Collaborator (author):

hmm you're right, actually it can be removed.

    lambda df: df.limit(2),
    lambda df: df.filter(col("a") > 2).sort(col("a").desc()).limit(2),
    lambda df: df.sample(0.5),
    lambda df: df.sample(0.5).filter(col("a") > 2),
Collaborator:

I think you can still have some SelectStatement cases where flatten is not triggered.

Collaborator (author):

Yes, I have some like `lambda df: df.sort(col("a").desc()).limit(2).filter(col("a") > 2)` where flatten is not triggered; they will be added in the next PR.


Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu merged commit 536cba9 into main on Oct 15, 2024
33 of 34 checks passed
@sfc-gh-jdu deleted the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch on October 15, 2024 18:53
@github-actions bot locked and limited conversation to collaborators on Oct 15, 2024
Labels: local testing (Local Testing issues/PRs), NO-CHANGELOG-UPDATES (This pull request does not need to update CHANGELOG.md)