
SNOW-1570734: Reduce describe query when there is no schema change #2126

Merged
merged 2 commits into main from jdu-SNOW-1570734-reduce-desc-no-schmea-change
Oct 15, 2024

Conversation

@sfc-gh-jdu (Collaborator) commented Aug 19, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1570734

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
  3. Please describe how your code solves the related issue.

    This PR now covers only one part: caching attributes (metadata) when 1) the DataFrame operation doesn't change the attributes (e.g., filter, sort), and 2) the SQL simplifier is disabled, or the SQL simplifier is enabled but a SelectStatement is created directly from a non-SelectStatement (e.g., session.sql(...).schema can cache attributes; see the tests for more). I will have another PR that tries to cache attributes on SelectStatement, which will address 2).
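
For illustration, a minimal usage sketch of the intended behavior (hypothetical helper name; assumes an existing Snowpark session and the experimental flag named in this PR's CHANGELOG draft; the comments describe the intent of the change, not a guaranteed contract):

    from snowflake.snowpark import Session
    from snowflake.snowpark.functions import col

    def demo_reduced_describe_queries(session: Session) -> None:
        # Experimental flag from this PR's CHANGELOG draft; it is off by default.
        session.reduce_describe_query_enabled = True

        df = session.sql("SELECT 1 AS a, 2 AS b")
        _ = df.schema                       # may issue one describe query and cache the attributes
        filtered = df.filter(col("a") > 0)  # filter/sort/limit do not change the schema
        _ = filtered.schema                 # expected to reuse the cached attributes instead of
                                            # sending another describe query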

@sfc-gh-jdu requested a review from a team as a code owner on August 19, 2024 23:08
@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from 108c337 to 9da3614 on August 19, 2024 23:14
@github-actions bot added the local testing (Local Testing issues/PRs) label on Aug 20, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

CHANGELOG.md (outdated; resolved)
tests/integ/test_reduce_describe_query.py (outdated; resolved)
CHANGELOG.md (outdated; resolved)
@sfc-gh-jdu marked this pull request as draft on August 26, 2024 20:55
@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from 6f99186 to 465c937 on August 26, 2024 20:57

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from 465c937 to f916556 on October 8, 2024 20:25

github-actions bot commented Oct 8, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from f916556 to f2e642c on October 8, 2024 20:27

github-actions bot commented Oct 8, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

1 similar comment

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from ab55255 to 4b069e4 on October 9, 2024 20:28

github-actions bot commented Oct 9, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from 4b069e4 to 4127d27 on October 9, 2024 20:59

github-actions bot commented Oct 9, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from 4127d27 to ed5eb81 on October 9, 2024 21:01

github-actions bot commented Oct 9, 2024

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu marked this pull request as ready for review on October 9, 2024 21:02
# Create from SQL
create_from_sql_funcs = [
    lambda session: session.sql("SELECT 1 AS a, 2 AS b"),
    # lambda session: session.sql("SELECT 1 AS a, 2 AS b").select("b"),
Contributor:

remove commented code?

Collaborator:

I think @sfc-gh-jdu wants to enable this once the inferring also works on SelectStatement. But @sfc-gh-jdu, I thought this works when the SQL simplifier is enabled; maybe we can only disable those cases when the SQL simplifier is enabled, to be more clear.

Collaborator (author):

no worries, I think I can remove them for now and add them back in the next PR (very soon)


# Create from Values
create_from_values_funcs = [
    # lambda session: session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"]),
Contributor:

same (although I see all entries commented out in this case?)

    lambda session: session.create_dataframe(
        [[1, 2], [3, 4]], schema=["a", "b"]
    ).cache_result(),
    # lambda session: session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"]).cache_result().select("b"),
Contributor:

same

    lambda session: session.create_dataframe(
        [[1, 2], [3, 4]], schema=["a", "b"]
    ).rename({"b": "c"}),
    # lambda session: session.range(10).to_df("a")
Contributor:

same

    lambda df: df.sort(-col("a")),
    lambda df: df.limit(2),
    # TODO SNOW-1728988: enable this test case (no flatten) after caching attributes on SelectStatement
    # lambda df: df.sort(col("a").desc()).limit(2).filter(col("a") > 2),
Contributor:

same

Collaborator (author):

sure, I can remove them and add them back later in the next PR

CHANGELOG.md Outdated

#### Improvements

- Reduced the number of additional [describe queries](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-example#retrieving-column-metadata) sent to the server to fetch the metadata of a DataFrame. It is still an experimental feature not enabled by default, and can be enabled by setting `session.reduce_describe_query_enabled` to `True`.
Collaborator:

@sfc-gh-jdu do you want to separate reducing the attribute call via inferring from reducing the describe query for SelectStatement? It might be good to introduce a different parameter name to separate the two changes.

Collaborator (author):

I think they are related (in terms of code change) and both target reducing describe queries, so one parameter is enough?

Collaborator:

If we want to bundle those together, then let's remove it from the release log for now; we should add the release log entry when the change is ready, or handle it the same way as other improvement features we have added that are not user visible.
Also, we should have a ticket to track adding the server-side parameter.

Collaborator (author):

Oh yeah, that's true, sorry I missed that. Let me remove it first

CHANGELOG.md Outdated
#### Improvements

- Reduced the number of additional [describe queries](https://docs.snowflake.com/en/developer-guide/python-connector/python-connector-example#retrieving-column-metadata) sent to the server to fetch the metadata of a DataFrame. It is still an experimental feature not enabled by default, and can be enabled by setting `session.reduce_describe_query_enabled` to `True`.

## 1.23.0 (TBD)
Collaborator:

I think you need to rebase this; 1.23.0 has been released.

Collaborator (author):

yes

    elif isinstance(source_plan, SelectStatement):
        # When source_plan._snowflake_plan is not None, `get_snowflake_plan` is called
        # to create a new SnowflakePlan and `infer_metadata` is already called on the new plan.
        if source_plan._snowflake_plan is not None:
Collaborator:

I am a little bit confused here. I thought we need to cache the attributes at SelectStatement? Or maybe I am confusing myself; would you mind reminding me in which case we need to cache the attributes on SelectStatement?

Collaborator (author):

Yes, we will cache the attributes at SelectStatement, but we will do it in the next PR (or do you think it's OK to include it in this PR?)

Collaborator (@sfc-gh-yzou) commented Oct 14, 2024:

No, I don't think we want to do it in that PR, but could you remind me of the example of why we need to cache it on SelectStatement (I recall the cache would be lost in some case, but I couldn't recall it exactly)?

Also, if we need to cache it on SelectStatement, does the branch here do anything? In other words, what case does it handle?

Collaborator (author):

Yes, it's needed for any dataframe after select, e.g., df = session.sql("SELECT 1 AS a, 2 AS b").select("b").

This is because after select, df._plan is resolved from df._select_statement (it is also df._select_statement._snowflake_plan). When we call df.schema, the attributes are cached on df._plan. Then if we call df.filter(...) when the SQL simplifier is enabled, df._select_statement._snowflake_plan is not copied to the new SelectStatement:

        if can_be_flattened:
            new = copy(self)
            new.from_ = self.from_.to_subqueryable()
            new.pre_actions = new.from_.pre_actions
            new.post_actions = new.from_.post_actions
            new.column_states = self.column_states
            new.where = And(self.where, col) if self.where is not None else col
            new._merge_projection_complexity_with_subquery = False
        else:
            new = SelectStatement(
                from_=self.to_subqueryable(), where=col, analyzer=self.analyzer
            )

and df._plan is also no longer used in the new df._select_statement because of:

        return self._with_plan(
            self._select_statement.filter(
                _to_col_if_sql_expr(expr, "filter/where")._expression
            )
        )

This means we will lose the attributes cached on df._plan, and we need to cache them on SelectStatement.

Collaborator:

I went back to double-check our conversation before, and I think I now recall what the problem is. This could still handle the SelectStatement when flatten is not triggered; the cache on SelectStatement is used to handle the case when flatten is triggered.

Collaborator (author):

To be precise,

        if can_be_flattened:
            new = copy(self)
            new.from_ = self.from_.to_subqueryable()
            new.pre_actions = new.from_.pre_actions
            new.post_actions = new.from_.post_actions
            new.column_states = self.column_states
            new.where = And(self.where, col) if self.where is not None else col
            new._merge_projection_complexity_with_subquery = False
        else:
            new = SelectStatement(
                from_=self.to_subqueryable(), where=col, analyzer=self.analyzer
            )

When flatten is triggered, we can still handle it, because from_ is carried over to the next SelectStatement; when flatten is not triggered, we have to cache it on SelectStatement.
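
As a hedged illustration of the two cases at the DataFrame level (assuming an existing session; the flatten/no-flatten behavior is the intent described above and in the TODO SNOW-1728988 test case, not a guaranteed contract):

    from snowflake.snowpark.functions import col

    df = session.sql("SELECT 1 AS a, 2 AS b")
    _ = df.schema                        # attributes cached on df._plan

    # Flatten is triggered: from_ (and its resolved plan) is carried over,
    # so the cached attributes remain reachable.
    flattened = df.filter(col("a") > 0)

    # Flatten is not triggered (the chain from the TODO SNOW-1728988 test case):
    # a new SelectStatement wraps a subquery, so the cache on df._plan is lost
    # until attributes are also cached on SelectStatement (planned follow-up PR).
    not_flattened = df.sort(col("a").desc()).limit(2).filter(col("a") > 2)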

        # If _attributes is None, retrieve quoted_identifiers from _quoted_identifiers.
        # If _quoted_identifiers is None, retrieve quoted_identifiers from attributes
        # (which triggers describe query).
        if self._attributes is not None:
Collaborator:

What we can do here is first check whether self._quoted_identifiers is not None, and otherwise just return [attr.name for attr in self.attributes]; self.attributes should take care of the attribute check. Roughly like the sketch below.
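
A minimal sketch of that suggestion, assuming the attribute names from the PR diff (_attributes, _quoted_identifiers); the class is a toy stand-in for SnowflakePlan, only to show the suggested fallback order, not the real implementation:

    from typing import List, Optional

    class PlanMetadataSketch:
        """Toy stand-in for SnowflakePlan, only to show the suggested fallback order."""

        def __init__(self, attributes=None, quoted_identifiers: Optional[List[str]] = None) -> None:
            self._attributes = attributes
            self._quoted_identifiers = quoted_identifiers

        @property
        def attributes(self):
            # In Snowpark this is where a describe query would run when nothing is cached.
            if self._attributes is None:
                self._attributes = self._describe()
            return self._attributes

        @property
        def quoted_identifiers(self) -> List[str]:
            # Reviewer's suggestion: prefer the cached identifiers, otherwise derive them
            # from attributes (which may trigger the describe query via self.attributes).
            if self._quoted_identifiers is not None:
                return self._quoted_identifiers
            return [attr.name for attr in self.attributes]

        def _describe(self):
            raise NotImplementedError("placeholder for the real describe-query path")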

Collaborator (author):

Yes, good idea; we can use it in the next PR.

src/snowflake/snowpark/mock/_plan.py (resolved)
@@ -266,6 +267,14 @@ def __init__(
        # UUID for the plan to uniquely identify the SnowflakePlan object. We also use this
        # to UUID track queries that are generated from the same plan.
        self._uuid = str(uuid.uuid4())
        self._attributes = None
        self._quoted_identifiers = None
Collaborator:

no one is actually using quoted identifiers yet, right?

        # to create a new SnowflakePlan and `infer_metadata` is already called on the new plan.
        if source_plan._snowflake_plan is not None:
            attributes = source_plan._snowflake_plan._attributes
            quoted_identifiers = source_plan._snowflake_plan._quoted_identifiers
Collaborator:

It seems the quoted identifiers are currently not used anywhere, and there is no actual inferring yet (by inferring I mean extracting the names from the projection). Maybe we can do it along with actually using it, in a different PR.

Collaborator (author):

sure, let's add it in another PR

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from ed5eb81 to 58a46d5 on October 14, 2024 18:58

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

1 similar comment

    def attributes(self) -> List[Attribute]:
        if self._attributes is not None:
Collaborator:

I see we have an `output` below, which seems to be just a copy of the attributes. I am really confused about why we have both `output` and `attributes`; do you know?

Collaborator (author):

I think it's from Scala Snowpark, from when we copied some of its code at the time... but I don't know why we have it in Scala Snowpark.

@sfc-gh-yzou (Collaborator):

@sfc-gh-jdu overall looks good; I had a couple of clarification questions and a suggestion about the release parameters.

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from cdec0f1 to 9854781 on October 14, 2024 23:44

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu force-pushed the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch from 9854781 to f0efd3b on October 14, 2024 23:48

Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu added the NO-CHANGELOG-UPDATES (This pull request does not need to update CHANGELOG.md) label on Oct 14, 2024
        if (
            source_plan._snowflake_plan is not None
            and source_plan._snowflake_plan._attributes is not None
        ):
Collaborator:

Typically for a SelectStatement, if _snowflake_plan is not None, that would be the resolved SnowflakePlan directly, so this check and update seem unnecessary here.

Collaborator (author):

hmm you're right, actually it can be removed.

    lambda df: df.limit(2),
    lambda df: df.filter(col("a") > 2).sort(col("a").desc()).limit(2),
    lambda df: df.sample(0.5),
    lambda df: df.sample(0.5).filter(col("a") > 2),
Collaborator:

I think you can still have some SelectStatement cases where flatten is not triggered.

Collaborator (author):

Yes, I have some like `lambda df: df.sort(col("a").desc()).limit(2).filter(col("a") > 2)` where flatten is not triggered; they will be added in the next PR.


Seems like your changes contain some Local Testing changes, please request review from @snowflakedb/local-testing

@sfc-gh-jdu merged commit 536cba9 into main on Oct 15, 2024
33 of 34 checks passed
@sfc-gh-jdu deleted the jdu-SNOW-1570734-reduce-desc-no-schmea-change branch on October 15, 2024 18:53
@github-actions bot locked and limited conversation to collaborators on Oct 15, 2024
Labels: local testing (Local Testing issues/PRs), NO-CHANGELOG-UPDATES (This pull request does not need to update CHANGELOG.md)