SNOW-1728988: Cache attributes on SelectStatement to reduce describe query #2462

sfc-gh-jdu · 2024-10-15T23:08:07Z

Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

Fixes SNOW-1728988
Fill out the following pre-review checklist:
- I am adding a new automated test(s) to verify correctness of my new code
  - If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
- I am adding new logging messages
- I am adding a new telemetry message
- I am adding new credentials
- I am adding a new dependency
- If this is a new feature/behavior, I'm adding the Local Testing parity changes.
- I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
Please describe how your code solves the related issue.

When the SQL simplifier is on, this change handles the scenarios where 1) select() is called, since df._plan is not carried over, and 2) the SQL simplifier can't flatten the query.
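To illustrate the idea in the description, here is a minimal sketch of caching attributes on the select statement so that a later filter() can reuse them. It is a toy model, not the Snowpark implementation: `ToySelectStatement`, `ToyPlan`, `describe_calls`, and the stand-in `analyze_attributes` are hypothetical names introduced only for this example.

```python
from typing import List, Optional

describe_calls = 0  # counts simulated describe round trips


def analyze_attributes(schema_query: str) -> List[str]:
    """Hypothetical stand-in for the server-side describe call."""
    global describe_calls
    describe_calls += 1
    return ["A", "B"]


class ToySelectStatement:
    def __init__(self, schema_query: str, from_: Optional["ToySelectStatement"] = None):
        self.schema_query = schema_query
        self.from_ = from_
        self._attributes: Optional[List[str]] = None

    def filter(self, condition: str) -> "ToySelectStatement":
        new = ToySelectStatement(f"{self.schema_query} WHERE {condition}", from_=self)
        # A filter keeps the output columns, so cached attributes can be reused.
        new._attributes = self._attributes
        return new


class ToyPlan:
    def __init__(self, source_plan: ToySelectStatement):
        self.source_plan = source_plan
        self.schema_query = source_plan.schema_query
        # Assumption for this sketch: a new plan starts from the statement's cache.
        self._attributes = source_plan._attributes

    @property
    def attributes(self) -> List[str]:
        if self._attributes is not None:
            return self._attributes
        self._attributes = analyze_attributes(self.schema_query)
        # Cache on the statement too, because the plan object itself is not
        # carried over to the next statement.
        if isinstance(self.source_plan, ToySelectStatement):
            self.source_plan._attributes = self._attributes
        return self._attributes


base = ToySelectStatement("SELECT a, b FROM t")
first = ToyPlan(base).attributes  # triggers one simulated describe
second = ToyPlan(base.filter("a > 1")).attributes  # served from the cache
```

In this toy model only the first access issues a describe; the filtered statement inherits the cached attributes.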

sfc-gh-helmeleegy · 2024-10-16T20:35:21Z

src/snowflake/snowpark/_internal/analyzer/select_statement.py

@@ -1181,6 +1183,7 @@ def filter(self, col: Expression) -> "SelectStatement":
            new = SelectStatement(
                from_=self.to_subqueryable(), where=col, analyzer=self.analyzer
            )
+        new._attributes = self._attributes


Wouldn't this also apply to set operations like union and intersection?

Nah, set operations possibly change the data type, e.g., (select '1' as a) union (select 2 as a)

Can we first check if the input data types are the same, then they can be propagated to the output? I believe this would be the more common case, and we should be able to recognize it.

Actually, this is data coercion in Snowflake, and there are many complicated rules on the server side. Even if the operands have the same type, coercion may still happen (for example, two varchar values with different lengths). So I don't think it's a good idea to have such rules on the client side; we should rely only on the server side for data coercion. This type of operation is not in the scope of reducing describe queries.

Hmm, I think this will reduce describe queries. We're just worried that the rules we use on the client side may end up being different than the server side, right? Would we still be worried if the data types for both operands are exactly the same (including lengths for varchar, etc)? Or is it that we don't have this information on the client side so we cannot be 100% sure? I'm just trying to understand the concern.

My original concern is just about data coercion, but as you said, if two types are exactly the same, we can try to infer too and we can investigate it in the future, if this is a common pattern that we should optimize.

Yes, I agree it doesn't have to be in this PR. For Snowpark pandas, we do use union operations pretty commonly. And I think that in most cases, the data types are identical.

got it, yeah if it's pretty common, we can definitely do it.
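As a rough sketch of the follow-up idea discussed above (not part of this PR), a client-side check could reuse attributes for a set operation only when both operands have exactly identical types, and otherwise fall back to a describe query. `ToyAttribute` and `infer_union_attributes` are hypothetical names, and the type strings are illustrative only.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ToyAttribute:
    name: str
    datatype: str  # full type string, including length/precision
    nullable: bool = True


def infer_union_attributes(
    left: Optional[List[ToyAttribute]],
    right: Optional[List[ToyAttribute]],
) -> Optional[List[ToyAttribute]]:
    """Return reusable attributes, or None when a describe query is still needed."""
    if left is None or right is None or len(left) != len(right):
        return None
    for l_attr, r_attr in zip(left, right):
        # Any difference (even VARCHAR length) may involve server-side coercion,
        # so this sketch refuses to guess.
        if l_attr.datatype != r_attr.datatype or l_attr.nullable != r_attr.nullable:
            return None
    # Assumption: output column names follow the left operand.
    return list(left)


same = [ToyAttribute("A", "VARCHAR(10)")]
longer = [ToyAttribute("A", "VARCHAR(20)")]
reused = infer_union_attributes(same, same)
not_reused = infer_union_attributes(same, longer)
```

Returning None rather than a guessed type keeps the server as the only authority on coercion, which was the concern raised in this thread.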

sfc-gh-helmeleegy

LGTM. Good progress with reducing describe queries. Thanks, Jianzhun.

sfc-gh-aalam

Would you have to update some tests as a consequence, since this change will reduce the number of describe queries?

sfc-gh-yzou · 2024-10-17T00:30:31Z

src/snowflake/snowpark/_internal/analyzer/select_statement.py

@@ -1181,6 +1183,7 @@ def filter(self, col: Expression) -> "SelectStatement":
            new = SelectStatement(
                from_=self.to_subqueryable(), where=col, analyzer=self.analyzer
            )
+        new._attributes = self._attributes


flag control

Maybe we don't need this one, because if the parameter is off, _attributes is always None. But we can also add it.

The parameter can be turned off midway through a session. Let's protect the code for extra safety. Also, when we do a copy, don't we need to copy the attributes over as well?

Sure, let me add the parameter protection. I'd prefer not to copy the attributes for now, because select() uses copy to create a new SelectStatement, whose attributes may not be the same:

snowpark-python/src/snowflake/snowpark/_internal/analyzer/select_statement.py
Line 1114 in 016b063

    new = copy(self)

Rather than copying the attributes and then resetting them outside of copy, not copying them for now is safer.
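The concern about copy can be sketched with a toy class: if a shallow copy carried the cached attributes along, a select() built on that copy could report columns it no longer projects. `ToySelect` and its methods are hypothetical and only model the behavior described in this thread.

```python
from copy import copy
from typing import List, Optional


class ToySelect:
    def __init__(self, projection: List[str]):
        self.projection = list(projection)
        self._attributes: Optional[List[str]] = None

    def __copy__(self) -> "ToySelect":
        new = ToySelect(self.projection)
        # Deliberately not carried over: callers such as select() may change
        # the projection right after copying.
        new._attributes = None
        return new

    def select(self, cols: List[str]) -> "ToySelect":
        new = copy(self)
        new.projection = list(cols)
        return new

    def filter(self, condition: str) -> "ToySelect":
        new = ToySelect(self.projection)
        # A filter does not change the output columns, so reuse is safe here.
        new._attributes = self._attributes
        return new


stmt = ToySelect(["A", "B"])
stmt._attributes = ["A", "B"]  # pretend a describe already ran
narrowed = stmt.select(["A"])
filtered = stmt.filter("A > 1")
```

Here `narrowed` starts with no cached attributes and would trigger a fresh describe, while `filtered` reuses the cache.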

sfc-gh-yzou · 2024-10-17T00:30:49Z

src/snowflake/snowpark/_internal/analyzer/snowflake_plan.py

        if self._attributes is not None:
            return self._attributes
        assert (
            self.schema_query is not None
        ), "No schema query is available for the SnowflakePlan"
        self._attributes = analyze_attributes(self.schema_query, self.session)
+        # We need to cache attributes on SelectStatement too because df._plan is not
+        # carried over to next SelectStatement (e.g., check the implementation of df.filter()).
+        if isinstance(self.source_plan, SelectStatement):


flag_control here

sfc-gh-yzou · 2024-10-17T00:31:36Z

tests/integ/test_reduce_describe_query.py

 ]

 # Create from Values
-create_from_values_funcs = []
+create_from_values_funcs = [
+    lambda session: session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"]),


Can we set up this test suite with the control both on and off, to make sure everything works as expected in either case?
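One way to picture this request is to run the same scenario under both flag values and compare the number of describe calls. The sketch below uses toy classes rather than a real session; `ToySession`, `ToyFrame`, `count_describes`, and the attribute name `reduce_describe_query_enabled` are assumptions made for illustration.

```python
from typing import List, Optional


class ToySession:
    def __init__(self, reduce_describe_query_enabled: bool):
        self.reduce_describe_query_enabled = reduce_describe_query_enabled
        self.describe_count = 0

    def describe(self, query: str) -> List[str]:
        self.describe_count += 1
        return ["A", "B"]


class ToyFrame:
    def __init__(self, session: ToySession, query: str, cached: Optional[List[str]] = None):
        self.session = session
        self.query = query
        self._attributes = cached

    @property
    def columns(self) -> List[str]:
        if self._attributes is not None:
            return self._attributes
        attrs = self.session.describe(self.query)
        # Flag protection: only cache when the parameter is on.
        if self.session.reduce_describe_query_enabled:
            self._attributes = attrs
        return attrs

    def filter(self, condition: str) -> "ToyFrame":
        # Flag protection on propagation as well, in case the parameter
        # was switched off after the cache was filled.
        enabled = self.session.reduce_describe_query_enabled
        cached = self._attributes if enabled else None
        return ToyFrame(self.session, f"{self.query} WHERE {condition}", cached)


def count_describes(flag_enabled: bool) -> int:
    session = ToySession(flag_enabled)
    df = ToyFrame(session, "SELECT a, b FROM t")
    df.columns
    df.filter("a > 1").columns
    return session.describe_count


results = {flag: count_describes(flag) for flag in (True, False)}
```

In a real suite the same effect would more likely come from a parametrized fixture, but the comparison is the same: fewer describe calls with the flag on, unchanged behavior with it off.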

sfc-gh-jdu · 2024-10-17T18:41:19Z

Would you have to update some tests as a consequence, since this change will reduce the number of describe queries?

Yes, it does reduce some describe queries (though our tests haven't really added many describe-query checks), but because the parameter is not enabled now, it won't affect other tests.

add

79ddd1c

sfc-gh-jdu requested a review from a team as a code owner October 15, 2024 23:08

sfc-gh-jdu requested review from sfc-gh-yixie, sfc-gh-aling, sfc-gh-yuwang, sfc-gh-yzou, sfc-gh-helmeleegy and sfc-gh-aalam October 15, 2024 23:08

sfc-gh-jdu added the NO-CHANGELOG-UPDATES label (This pull request does not need to update CHANGELOG.md) Oct 15, 2024

sfc-gh-helmeleegy reviewed Oct 16, 2024

sfc-gh-helmeleegy approved these changes Oct 17, 2024

sfc-gh-aalam approved these changes Oct 17, 2024

sfc-gh-yzou reviewed Oct 17, 2024

address comment

24478b5

sfc-gh-yzou approved these changes Oct 17, 2024

sfc-gh-jdu merged commit 0b47a67 into main Oct 17, 2024
34 checks passed

sfc-gh-jdu deleted the jdu-SNOW-1728988-cache-selectstatement branch October 17, 2024 19:40

github-actions bot locked and limited conversation to collaborators Oct 17, 2024
