SNOW-1728988: Cache attributes on SelectStatement to reduce describe query #2462

Merged
merged 2 commits into from
Oct 17, 2024

Conversation

Collaborator

@sfc-gh-jdu sfc-gh-jdu commented Oct 15, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

    Fixes SNOW-1728988

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

    When the SQL simplifier is on, caching attributes on SelectStatement covers two scenarios that would otherwise trigger a describe query: 1) select() is called, in which case df._plan is not carried over, and 2) the SQL simplifier can't flatten the query.
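The effect can be sketched with a toy model. This is not the Snowpark implementation: `ToySelectStatement`, `describe`, and the call counter are hypothetical stand-ins, and only the `_attributes` hand-off mirrors the change in this PR. A filter does not change column names or types, so the derived statement can reuse the parent's cached attributes instead of issuing another describe query.

```python
from typing import List, Optional, Tuple

Attribute = Tuple[str, str]  # (column name, data type)

describe_calls = 0  # counts simulated round trips to the server


def describe(sql: str) -> List[Attribute]:
    """Hypothetical stand-in for a describe query against the server."""
    global describe_calls
    describe_calls += 1
    return [("A", "NUMBER"), ("B", "NUMBER")]


class ToySelectStatement:
    def __init__(self, sql: str) -> None:
        self.sql = sql
        self._attributes: Optional[List[Attribute]] = None

    @property
    def attributes(self) -> List[Attribute]:
        # Only ask the server when nothing is cached yet.
        if self._attributes is None:
            self._attributes = describe(self.sql)
        return self._attributes

    def filter(self, condition: str) -> "ToySelectStatement":
        new = ToySelectStatement(f"SELECT * FROM ({self.sql}) WHERE {condition}")
        # A WHERE clause keeps the output columns unchanged, so reuse the cache.
        new._attributes = self._attributes
        return new


base = ToySelectStatement("SELECT 1 AS a, 2 AS b")
base.attributes                  # first access issues one simulated describe
filtered = base.filter("a > 0")
filtered.attributes              # served from the cache carried over by filter()
```

In this toy, `describe_calls` stays at 1 after both accesses; without the hand-off in `filter()` it would be 2.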

@sfc-gh-jdu sfc-gh-jdu requested a review from a team as a code owner October 15, 2024 23:08
@sfc-gh-jdu sfc-gh-jdu added the NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md label Oct 15, 2024
@@ -1181,6 +1183,7 @@ def filter(self, col: Expression) -> "SelectStatement":
        new = SelectStatement(
            from_=self.to_subqueryable(), where=col, analyzer=self.analyzer
        )
        new._attributes = self._attributes
Contributor

Wouldn't this also apply to set operations like union and intersection?

Collaborator Author

@sfc-gh-jdu sfc-gh-jdu Oct 16, 2024

Nah, set operations possibly change the data type, e.g., (select '1' as a) union (select 2 as a)

Contributor

@sfc-gh-helmeleegy sfc-gh-helmeleegy Oct 16, 2024

Can we first check if the input data types are the same, then they can be propagated to the output? I believe this would be the more common case, and we should be able to recognize it.

Collaborator Author

@sfc-gh-jdu sfc-gh-jdu Oct 16, 2024

Actually it's data coercion in Snowflake, and there are many complicated rules on the server side. Even if they are the same type, there may still be data coercion happening (such as two varchar values with different lengths). So I don't think it's a good idea to have such rules on the client side, and we should only rely on the server side for data coercion. This type of operation is not in the scope of reducing describe queries.

Contributor

@sfc-gh-helmeleegy sfc-gh-helmeleegy Oct 16, 2024

Hmm, I think this will reduce describe queries. We're just worried that the rules we use on the client side may end up being different than the server side, right? Would we still be worried if the data types for both operands are exactly the same (including lengths for varchar, etc)? Or is it that we don't have this information on the client side so we cannot be 100% sure? I'm just trying to understand the concern.

Collaborator Author

My original concern was just about data coercion. But as you said, if the two types are exactly the same, we can try to infer the output type too. We can investigate that in the future if this is a common pattern worth optimizing.

Contributor

Yes, I agree it doesn't have to be in this PR. For Snowpark pandas, we do use union operations pretty commonly. And I think that in most cases, the data types are identical.

Collaborator Author

got it, yeah if it's pretty common, we can definitely do it.
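The rule discussed in this thread could look roughly like the following. It is a sketch only and is not part of this PR: `Column` and `infer_union_attributes` are hypothetical names, and comparing type name plus length is an assumption about what "exactly the same" would need to cover (a real check might also need precision, scale, nullability, and more). Anything short of an exact match returns `None`, which leaves coercion to the server via a describe query.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Column:
    name: str
    type_name: str
    length: Optional[int] = None  # e.g. VARCHAR length; None when not applicable


def infer_union_attributes(
    left: Optional[List[Column]], right: Optional[List[Column]]
) -> Optional[List[Column]]:
    """Return the output columns if they can be inferred safely, else None."""
    if left is None or right is None:
        return None  # nothing cached on one side, so nothing to infer from
    if len(left) != len(right):
        return None
    for l_col, r_col in zip(left, right):
        # Assumption: the output takes its column names from the left operand,
        # so only type and length have to match exactly.
        if (l_col.type_name, l_col.length) != (r_col.type_name, r_col.length):
            return None
    return list(left)


same = infer_union_attributes(
    [Column("A", "VARCHAR", 10)], [Column("A", "VARCHAR", 10)]
)
mixed = infer_union_attributes([Column("A", "VARCHAR", 1)], [Column("A", "NUMBER")])
lengths = infer_union_attributes(
    [Column("A", "VARCHAR", 1)], [Column("A", "VARCHAR", 5)]
)
```

Here `same` carries the left-hand columns forward, while `mixed` (the `'1'` union `2` case) and `lengths` (varchar values of different lengths) both fall back to `None`.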

Contributor

@sfc-gh-helmeleegy sfc-gh-helmeleegy left a comment

LGTM. Good progress with reducing describe queries. Thanks, Jianzhun.

Contributor

@sfc-gh-aalam sfc-gh-aalam left a comment

Would you have to update some tests as a consequence, since this change will reduce the number of describe queries?

@@ -1181,6 +1183,7 @@ def filter(self, col: Expression) -> "SelectStatement":
        new = SelectStatement(
            from_=self.to_subqueryable(), where=col, analyzer=self.analyzer
        )
        new._attributes = self._attributes
Collaborator

flag control

Collaborator Author

maybe we don't need this one because if the parameter is off, _attributes is always None. But we can also add

Collaborator

You can have the parameter turned off partway through. Let's protect the code for extra safety. Also, when we do a copy, don't we need to copy the attributes over as well?

Collaborator Author

Sure, let me add the parameter protection. I'd prefer not to copy the attributes for now, because select() uses copy to create a new select_statement, whose attributes may not be the same. Rather than copying the attributes and then resetting them outside of copy, not copying them for now is safer.
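A sketch of the two decisions in this thread, under stated assumptions: `ToySession.cache_enabled` is a hypothetical stand-in for the real parameter, and the classes are toys rather than Snowpark's. The hand-off in `filter()` is guarded by the flag, so turning the parameter off partway through stops propagation, and `copy()` deliberately leaves `_attributes` unset because its caller may go on to change the projection.

```python
from typing import List, Optional


class ToySession:
    def __init__(self, cache_enabled: bool) -> None:
        self.cache_enabled = cache_enabled  # hypothetical flag name


class ToySelectStatement:
    def __init__(self, session: ToySession, projection: List[str]) -> None:
        self.session = session
        self.projection = projection
        self._attributes: Optional[List[str]] = None

    def copy(self) -> "ToySelectStatement":
        # Deliberately do not carry _attributes: callers such as select()
        # may change the projection right after copying.
        return ToySelectStatement(self.session, list(self.projection))

    def select(self, cols: List[str]) -> "ToySelectStatement":
        new = self.copy()
        new.projection = cols
        return new

    def filter(self, condition: str) -> "ToySelectStatement":
        # The condition is ignored in this toy; only the hand-off matters.
        new = ToySelectStatement(self.session, list(self.projection))
        # Check the flag at call time, since it can change between calls.
        if self.session.cache_enabled:
            new._attributes = self._attributes
        return new


session = ToySession(cache_enabled=True)
stmt = ToySelectStatement(session, ["A", "B"])
stmt._attributes = ["A", "B"]  # pretend a describe query already ran

kept = stmt.filter("A > 0")._attributes     # ["A", "B"]
session.cache_enabled = False               # parameter turned off later
dropped = stmt.filter("A > 0")._attributes  # None
selected = stmt.select(["A"])._attributes   # None, copy() carries nothing
```

Leaving `_attributes` unset after `copy()` costs at most one extra describe query, whereas carrying a stale value into `select()` would report the wrong schema.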

if self._attributes is not None:
    return self._attributes
assert (
    self.schema_query is not None
), "No schema query is available for the SnowflakePlan"
self._attributes = analyze_attributes(self.schema_query, self.session)
# We need to cache attributes on SelectStatement too because df._plan is not
# carried over to next SelectStatement (e.g., check the implementation of df.filter()).
if isinstance(self.source_plan, SelectStatement):
Collaborator

flag_control here

Collaborator Author

added

]

# Create from Values
create_from_values_funcs = []
create_from_values_funcs = [
    lambda session: session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"]),
Collaborator

Can we set up this test suite with the control both on and off, to make sure everything works as expected in either case?

Collaborator Author

sure
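One way to set that up, sketched with the standard library only. The names are hypothetical (`run_pipeline`, `cache_enabled`), and a real suite would more likely parametrize a pytest fixture over the actual parameter. The point is that the resulting schema must be identical in both modes, while only the number of describe queries differs.

```python
from typing import Dict, List, Tuple


def run_pipeline(cache_enabled: bool) -> Tuple[List[str], int]:
    """Toy pipeline that reads a schema twice; returns (schema, describe count)."""
    describes = 0
    cache: Dict[str, List[str]] = {}

    def schema_of(key: str) -> List[str]:
        nonlocal describes
        if cache_enabled and key in cache:
            return cache[key]
        describes += 1  # simulated describe query
        result = ["A", "B"]
        if cache_enabled:
            cache[key] = result
        return result

    schema_of("df")
    schema = schema_of("df")
    return schema, describes


# Run the same pipeline with the flag on and off and compare.
results = {flag: run_pipeline(flag) for flag in (True, False)}
schema_on, describes_on = results[True]
schema_off, describes_off = results[False]
```

With the flag on the toy issues one describe query and with it off it issues two, yet both runs return the same schema.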

Collaborator Author

@sfc-gh-jdu sfc-gh-jdu commented Oct 17, 2024

Would you have to update some tests as a consequence, since this change will reduce the number of describe queries?

Yes, it does reduce some describe queries (though our tests don't really include many describe-query checks), but because the parameter is not enabled now, it won't affect other tests.

@sfc-gh-jdu sfc-gh-jdu merged commit 0b47a67 into main Oct 17, 2024
34 checks passed
@sfc-gh-jdu sfc-gh-jdu deleted the jdu-SNOW-1728988-cache-selectstatement branch October 17, 2024 19:40
@github-actions github-actions bot locked and limited conversation to collaborators Oct 17, 2024
Labels
NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md