SNOW-1570734: Reduce describe query when there is no schema change #2126
Conversation
Force-pushed from 108c337 to 9da3614
Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing
# Create from SQL
create_from_sql_funcs = [
    lambda session: session.sql("SELECT 1 AS a, 2 AS b"),
    # lambda session: session.sql("SELECT 1 AS a, 2 AS b").select("b"),
remove commented code?
I think @sfc-gh-jdu wants to enable this once the inference also works on SelectStatement. But @sfc-gh-jdu, I thought this works when the SQL simplifier is enabled; maybe we should only disable those cases when the SQL simplifier is enabled, to be clearer.
No worries, I think I can remove them for now and add them back in the next PR (very soon).
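As an aside, the gating discussed in this thread could look roughly like the following sketch. The boolean `sql_simplifier_enabled` here is a plain stand-in for the session flag, and the list mirrors the test snippet quoted above; this is an illustration, not the test file's actual code.

```python
# Sketch: only register the flattening-dependent cases when the SQL
# simplifier is enabled, so the skipped cases are explicit.
sql_simplifier_enabled = True  # stand-in for the real session flag

create_from_sql_funcs = [
    lambda session: session.sql("SELECT 1 AS a, 2 AS b"),
]
if sql_simplifier_enabled:
    # This case relies on flattening `select` into the underlying query.
    create_from_sql_funcs.append(
        lambda session: session.sql("SELECT 1 AS a, 2 AS b").select("b")
    )
```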
# Create from Values
create_from_values_funcs = [
    # lambda session: session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"]),
same (although I see all entries commented in this case?)
    lambda session: session.create_dataframe(
        [[1, 2], [3, 4]], schema=["a", "b"]
    ).cache_result(),
    # lambda session: session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"]).cache_result().select("b"),
same
    lambda session: session.create_dataframe(
        [[1, 2], [3, 4]], schema=["a", "b"]
    ).rename({"b": "c"}),
    # lambda session: session.range(10).to_df("a")
same
    lambda df: df.sort(-col("a")),
    lambda df: df.limit(2),
    # TODO SNOW-1728988: enable this test case (no flatten) after caching attributes on SelectStatement
    # lambda df: df.sort(col("a").desc()).limit(2).filter(col("a") > 2),
same
Sure, I can remove them and add them later in the next PR.
CHANGELOG.md
Outdated
#### Improvements

- Reduced the number of additional [describe queries](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-example#retrieving-column-metadata) sent to the server to fetch the metadata of a DataFrame. This is still an experimental feature that is not enabled by default; it can be enabled by setting `session.reduce_describe_query_enabled` to `True`.
@sfc-gh-jdu do you want to separate reducing the attribute calls via inference from reducing the describe queries for SelectStatement? It might be good to introduce a different parameter name to separate the two changes.
I think they are related (in terms of code change) and both target reducing describe queries, so one parameter is enough?
If we want to bundle those together, then let's remove it from the release log for now; we should add this release-log entry when the change is ready, or handle it the same way as other improvement features we have added that are not user visible.
Also, we should have a ticket to track adding the server-side parameters.
Oh yeah, that's true, sorry I missed that. Let me remove it first
CHANGELOG.md
Outdated
#### Improvements

- Reduced the number of additional [describe queries](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-example#retrieving-column-metadata) sent to the server to fetch the metadata of a DataFrame. It is still an experimental feature not enabled by default, and can be enabled by setting `session.reduce_describe_query_enabled` to `True`.

## 1.23.0 (TBD)
I think you need to rebase this; 1.23.0 has been released.
yes
elif isinstance(source_plan, SelectStatement):
    # When source_plan._snowflake_plan is not None, `get_snowflake_plan` is called
    # to create a new SnowflakePlan and `infer_metadata` is already called on the new plan.
    if source_plan._snowflake_plan is not None:
I am a little bit confused here. I thought we needed to cache the attributes on SelectStatement? Or maybe I am confusing myself; would you mind reminding me of the case where we need to cache the attributes on SelectStatement?
Yes, we will cache the attributes on SelectStatement here, but we will do it in the next PR (or do you think it's OK to include it in this PR?).
No, I don't think we want to do it in this PR, but could you remind me of the example showing why we need to cache it on the SelectStatement (I recall the cache would be lost in some case, but couldn't recall exactly which)?
Also, if we need to cache it on the SelectStatement, does the branch here do anything? In other words, what case does it handle?
Yes, it's needed for any DataFrame after select, e.g., df = session.sql("SELECT 1 AS a, 2 AS b").select("b").
This is because after select, df._plan is resolved from df._select_statement (it is also df._select_statement._snowflake_plan). When we call df.schema, the attributes are cached on df._plan. Then if we call df.filter(...) with the SQL simplifier enabled, df._select_statement._snowflake_plan is not copied to the new SelectStatement:
snowpark-python/src/snowflake/snowpark/_internal/analyzer/select_statement.py
Lines 1172 to 1183 in 2d61a4b
if can_be_flattened:
    new = copy(self)
    new.from_ = self.from_.to_subqueryable()
    new.pre_actions = new.from_.pre_actions
    new.post_actions = new.from_.post_actions
    new.column_states = self.column_states
    new.where = And(self.where, col) if self.where is not None else col
    new._merge_projection_complexity_with_subquery = False
else:
    new = SelectStatement(
        from_=self.to_subqueryable(), where=col, analyzer=self.analyzer
    )
snowpark-python/src/snowflake/snowpark/dataframe.py
Lines 1336 to 1340 in 2d61a4b
return self._with_plan(
    self._select_statement.filter(
        _to_col_if_sql_expr(expr, "filter/where")._expression
    )
)
To be precise,
if can_be_flattened:
new = copy(self)
new.from_ = self.from_.to_subqueryable()
new.pre_actions = new.from_.pre_actions
new.post_actions = new.from_.post_actions
new.column_states = self.column_states
new.where = And(self.where, col) if self.where is not None else col
new._merge_projection_complexity_with_subquery = False
else:
new = SelectStatement(
from_=self.to_subqueryable(), where=col, analyzer=self.analyzer
)
When flatten is triggered, we can still handle it, because from_ is carried over to the next SelectStatement; when flatten is not triggered, we have to cache it on SelectStatement.
# If _attributes is None, retrieve quoted_identifiers from _quoted_identifiers.
# If _quoted_identifiers is None, retrieve quoted_identifiers from attributes
# (which triggers describe query).
if self._attributes is not None:
What we can do here is first check whether self._quoted_identifiers is not None, and otherwise just return [attr.name for attr in self.attributes]; self.attributes should take care of the attribute check.
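The suggested fallback order could be sketched as follows, using simplified stand-in classes rather than the actual snowpark-python code:

```python
from typing import List, Optional

class Attribute:
    def __init__(self, name: str) -> None:
        self.name = name

class PlanSketch:
    """Stand-in for SnowflakePlan with two metadata caches."""
    def __init__(
        self,
        attributes: Optional[List[Attribute]] = None,
        quoted_identifiers: Optional[List[str]] = None,
    ) -> None:
        self._attributes = attributes
        self._quoted_identifiers = quoted_identifiers

    @property
    def attributes(self) -> List[Attribute]:
        # In the real plan this would issue a describe query on a cache
        # miss; here we just return the cached value.
        return self._attributes or []

    @property
    def quoted_identifiers(self) -> List[str]:
        # Prefer the cheap cached identifiers; otherwise fall back to
        # attributes, which may trigger a describe query.
        if self._quoted_identifiers is not None:
            return self._quoted_identifiers
        return [attr.name for attr in self.attributes]
```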
Yes, good idea; we can use it in the next PR.
@@ -266,6 +267,14 @@ def __init__(
    # UUID for the plan to uniquely identify the SnowflakePlan object. We also use this
    # UUID to track queries that are generated from the same plan.
    self._uuid = str(uuid.uuid4())
    self._attributes = None
    self._quoted_identifiers = None
No one is actually using the quoted identifiers yet, right?
# to create a new SnowflakePlan and `infer_metadata` is already called on the new plan.
if source_plan._snowflake_plan is not None:
    attributes = source_plan._snowflake_plan._attributes
    quoted_identifiers = source_plan._snowflake_plan._quoted_identifiers
It seems the quoted identifiers are currently not used anywhere, and there is no actual inferring yet (by inferring I mean extracting the names from the projection). Maybe we can do this alongside the code that actually uses it, in a different PR.
Sure, let's add it in another PR.
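For illustration only, the kind of inference mentioned here (extracting identifiers from a projection list without a describe query) might look like this deliberately naive sketch; it is not the planned implementation and handles only simple `expr AS alias` items:

```python
import re
from typing import List

def infer_quoted_identifiers(projection: str) -> List[str]:
    # Very naive: take the alias after AS in each comma-separated item and
    # upper-case it, mimicking how Snowflake normalizes unquoted identifiers.
    names = []
    for item in projection.split(","):
        match = re.search(r"\bAS\s+(\w+)\s*$", item.strip(), re.IGNORECASE)
        if match:
            names.append(f'"{match.group(1).upper()}"')
    return names

infer_quoted_identifiers("1 AS a, 2 AS b")
```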
def attributes(self) -> List[Attribute]:
    if self._attributes is not None:
I see we have an output below, which seems to be just a copy of the attributes. I am really confused about why we have both output and attributes; do you know?
I think it's from Scala Snowpark, from when we tried to copy some of that code over... but I don't know why we have it in Scala Snowpark either.
@sfc-gh-jdu overall looks good; I had a couple of clarification questions and a suggestion about the release parameters.
if (
    source_plan._snowflake_plan is not None
    and source_plan._snowflake_plan._attributes is not None
):
Typically for a select statement, if _snowflake_plan is not None, it would be the resolved SnowflakePlan directly, so this check and update seem unnecessary here.
Hmm, you're right; it can actually be removed.
    lambda df: df.limit(2),
    lambda df: df.filter(col("a") > 2).sort(col("a").desc()).limit(2),
    lambda df: df.sample(0.5),
    lambda df: df.sample(0.5).filter(col("a") > 2),
I think you can still have some SelectStatement cases where flatten is not triggered.
Yes, I have some, like lambda df: df.sort(col("a").desc()).limit(2).filter(col("a") > 2); the cases where flatten is not triggered will be added in the next PR.
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-1570734
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
This PR now actually only captures one part: caching attributes (metadata) when 1) the DataFrame operation doesn't change the attributes (e.g., filter, sort), and 2) the SQL simplifier is disabled, or the SQL simplifier is enabled but a SelectStatement is created directly from a non-SelectStatement (e.g., session.sql(...).schema can cache attributes; see more in the tests). I will open another PR that caches attributes on SelectStatement, which will address 2).
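The rule in part 1) can be summarized with a small model, a sketch with assumed names rather than the real code: schema-preserving operations carry cached attributes forward, while schema-changing ones drop the cache and would fall back to a describe query.

```python
from typing import List, Optional

# Operations that cannot change the output columns of a DataFrame.
SCHEMA_PRESERVING_OPS = {"filter", "sort", "limit", "sample"}

class PlanModel:
    def __init__(self, attributes: Optional[List[str]] = None) -> None:
        # None means metadata is unknown and a describe query would be needed.
        self.attributes = attributes

def apply_op(plan: PlanModel, op: str) -> PlanModel:
    if op in SCHEMA_PRESERVING_OPS:
        return PlanModel(attributes=plan.attributes)  # reuse cached metadata
    return PlanModel(attributes=None)                 # e.g. select, rename

base = PlanModel(attributes=["A", "B"])
after_filter = apply_op(base, "filter")   # cache survives
after_select = apply_op(base, "select")   # cache dropped
```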