[SNOW-1731783] Refactor node query comparison for Repeated subquery elimination #2437

Merged

sfc-gh-yzou merged 9 commits into main from yzou-SNOW-1731783-cte-id-refactor

Oct 16, 2024

Collaborator

sfc-gh-yzou commented Oct 11, 2024

Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

Fill out the following pre-review checklist:
- I am adding a new automated test(s) to verify correctness of my new code
  - If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
- I am adding new logging messages
- I am adding a new telemetry message
- I am adding new credentials
- I am adding a new dependency
- If this is a new feature/behavior, I'm adding the Local Testing parity changes.
Please describe how your code solves the related issue.

In the previous repeated subquery elimination work, we updated the node comparison for SnowflakePlan and Selectable to

    def __eq__(self, other: "SnowflakePlan") -> bool:
        if not isinstance(other, SnowflakePlan):
            return False
        if self._id is not None and other._id is not None:
            return isinstance(other, SnowflakePlan) and self._id == other._id
        else:
            return super().__eq__(other)

    def __hash__(self) -> int:
        return hash(self._id) if self._id else super().__hash__()

where the id is generated based on the query and query parameters. This means two nodes are treated as the same if they have the same type and the same query. This makes sense when we do repeated subquery elimination, but it is not expected by other transformations.

Refactor the comparison to make sure that we only use the id comparison for repeated subquery elimination, not for other transformations.
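
To illustrate the difference, here is a minimal, self-contained sketch. `OldNode`, `NewNode`, `query_digest` and the `encoded_id` attribute are hypothetical stand-ins for illustration, not the actual Snowpark classes or the code in this PR.

```python
import hashlib


def query_digest(query: str) -> str:
    # Hypothetical helper: a short, stable digest of the query text.
    return hashlib.sha256(query.encode()).hexdigest()[:10]


class OldNode:
    """Stand-in for the previous behavior: equality and hash follow the query id."""

    def __init__(self, query: str) -> None:
        self.query = query
        self._id = query_digest(query)

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, OldNode):
            return False
        return self._id == other._id

    def __hash__(self) -> int:
        return hash(self._id)


class NewNode:
    """Stand-in for the refactored behavior: default identity-based equality,
    with the query-based id exposed separately for callers that ask for it."""

    def __init__(self, query: str) -> None:
        self.query = query
        self.encoded_id = query_digest(query)


old_a, old_b = OldNode("SELECT 1"), OldNode("SELECT 1")
new_a, new_b = NewNode("SELECT 1"), NewNode("SELECT 1")

# Old: two distinct objects collapse into a single set entry.
old_distinct = len({old_a, old_b})
# New: the objects stay distinct, but can still be matched by query on demand.
new_distinct = len({new_a, new_b})
same_query = new_a.encoded_id == new_b.encoded_id
```

With this shape, only code that explicitly compares `encoded_id` (such as repeated subquery elimination) treats the two nodes as duplicates.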

sfc-gh-yzou changed the title ~~[SNOW-1731783] Refactor id comparison for CTE~~ [SNOW-1731783] Refactor node query comparison for Repeated subquery elimination

sfc-gh-yzou marked this pull request as ready for review

October 11, 2024 22:49

sfc-gh-yzou requested a review from a team as a code owner

October 11, 2024 22:49

sfc-gh-yzou requested review from sfc-gh-yixie, sfc-gh-yuwang, sfc-gh-jrose, sfc-gh-jdu and sfc-gh-aalam and removed request for sfc-gh-yixie, sfc-gh-yuwang and sfc-gh-jrose

October 11, 2024 22:49

sfc-gh-yzou added the NO-CHANGELOG-UPDATES label

sfc-gh-yzou force-pushed the yzou-SNOW-1731783-cte-id-refactor branch from fb74319 to a6b696f Compare

October 14, 2024 16:52

Contributor

sfc-gh-aalam commented Oct 14, 2024

This makes sense when we do repeated subquery elimination, but it is not expected by other transformations.

Can you give an example of where this does not make sense?

Collaborator Author

sfc-gh-yzou commented Oct 14, 2024

@sfc-gh-aalam
"""
Can you give an example of where this does not make sense?
"""
I am not sure which transformation this would make sense for, unless the transformation wants to treat two nodes with the same query as the same node. For example, for your large query breakdown, I don't think you want to treat two nodes with the same query as the same node during transformation. Even for the CTE transformation, during the actual plan transformation we don't want to treat two nodes as the same node, because they are different nodes. Doing so could cause us to miss applying the transformation to some nodes and lead to problems, because we could incorrectly think some nodes have already been handled. It only makes sense when we are doing query comparison.
In general, we should never change node comparison unless it is by design, and for nodes in a plan tree that should not be the case.

This has been causing problems for our ctc workloads, where we missed handling some nodes. I have tried it with our benchmark workloads, and the problem is gone with this refactoring.
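
The failure mode described above can be reproduced with a toy traversal. This is only a sketch: `IdentityNode`, `QueryEqNode`, `mark_all_handled` and `build` are hypothetical names, and the real transformations are more involved.

```python
from typing import List, Optional


class IdentityNode:
    """Hypothetical plan node using Python's default identity-based equality."""

    def __init__(self, query: str, children: Optional[List["IdentityNode"]] = None) -> None:
        self.query = query
        self.children = children or []
        self.handled = False


class QueryEqNode(IdentityNode):
    """Same node, but equal to (and hashing like) any node with the same query."""

    def __eq__(self, other: object) -> bool:
        return isinstance(other, QueryEqNode) and self.query == other.query

    def __hash__(self) -> int:
        return hash(self.query)


def mark_all_handled(root: IdentityNode) -> None:
    """Toy transformation: visit each node once, tracking visits in a set."""
    visited = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in visited:
            continue  # believed to be handled already
        visited.add(node)
        node.handled = True
        stack.extend(node.children)


def build(cls):
    # Two separate child nodes that happen to carry the same query.
    left, right = cls("SELECT * FROM t"), cls("SELECT * FROM t")
    root = cls("SELECT * FROM (l) UNION ALL SELECT * FROM (r)", [left, right])
    return root, left, right


id_root, id_left, id_right = build(IdentityNode)
mark_all_handled(id_root)

eq_root, eq_left, eq_right = build(QueryEqNode)
mark_all_handled(eq_root)
```

With identity-based equality both children are handled. With query-based equality the second child to be popped is found "in" the visited set and is skipped, even though it is a different node.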

sfc-gh-aalam approved these changes

Contributor

sfc-gh-aalam left a comment

LGTM. a few comments.

src/snowflake/snowpark/_internal/analyzer/cte_utils.py Outdated

Comment on lines 185 to 187

+                  query_id = encoded_query_id(query, query_params)
+                  if query_id is not None:
+                      return query_id + node_type_name

Contributor

sfc-gh-aalam Oct 11, 2024

we can do f"{query_id}_{node_type_name}"

Collaborator

sfc-gh-jdu Oct 14, 2024

+1

Collaborator Author

sfc-gh-yzou Oct 15, 2024

updated

src/snowflake/snowpark/_internal/analyzer/cte_utils.py Outdated

Comment on lines 175 to 177

+              def encode_id(
+                  node_type_name: str, query: str, query_params: Optional[Sequence[Any]] = None
+              ) -> str:

Contributor

sfc-gh-aalam Oct 14, 2024

can we call this encode_node_id?

Collaborator Author

sfc-gh-yzou Oct 15, 2024

I actually named this encode_node_id_with_query, to make it clearer what the encoded id is and reduce the chance of people using it incorrectly.

src/snowflake/snowpark/_internal/analyzer/cte_utils.py Outdated

Comment on lines 188 to 189

                  else:
                      return str(uuid.uuid4())

Contributor

sfc-gh-aalam Oct 14, 2024

Do we want the same node to have the same encode_id? If so, we can rely on id(node), which gives a deterministic result, instead of uuid4().

Contributor

sfc-gh-helmeleegy Oct 14, 2024

+1

Collaborator Author

sfc-gh-yzou Oct 15, 2024

This is used in a cached property, so with uuid4 generation it will also be deterministic, but we can use id as well.
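
As a small illustration of why a cached property makes the random fallback stable per node, here is a sketch. The `Node` class and `fallback_id` name are hypothetical, not taken from the PR.

```python
import uuid
from functools import cached_property


class Node:
    """Hypothetical node whose fallback id is a random UUID."""

    @cached_property
    def fallback_id(self) -> str:
        # Computed on first access, then stored on the instance.
        return str(uuid.uuid4())


a, b = Node(), Node()
stable = a.fallback_id == a.fallback_id    # same instance, same value every time
distinct = a.fallback_id != b.fallback_id  # different instances get different ids
```

One difference between the two options: `id(node)` is only unique among objects that are alive at the same time, so a value can be reused after a node is garbage collected, whereas a cached UUID is not tied to the object's memory address.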

sfc-gh-jdu reviewed

src/snowflake/snowpark/_internal/analyzer/cte_utils.py

+                  Encode given query, query parameters and the node type into an id.
+                  If query and query parameters can be encoded successfully using sha256,
+                  return the encoded query id + node_type_name.

Collaborator

sfc-gh-jdu Oct 14, 2024

why do we need node_type_name?

Collaborator Author

sfc-gh-yzou Oct 15, 2024

That follows the previous equivalence check, where we require the node type to be the same. For example, a SnowflakePlan node and a SelectStatement node with the same query are not counted as the same.
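
A rough sketch of how the node type can be folded into the id follows. The function names match the ones discussed in this thread, but the bodies are assumptions: how the parameters are stringified, the digest truncation, and the error handling are illustrative choices, not the merged implementation.

```python
import hashlib
import uuid
from typing import Any, Optional, Sequence


def encoded_query_id(
    query: str, query_params: Optional[Sequence[Any]] = None
) -> Optional[str]:
    # Assumption: hash the query text plus a string form of the parameters,
    # and return None if the text cannot be encoded.
    text = f"{query}#{query_params}" if query_params else query
    try:
        return hashlib.sha256(text.encode()).hexdigest()[:10]
    except UnicodeEncodeError:
        return None


def encode_node_id_with_query(
    node_type_name: str, query: str, query_params: Optional[Sequence[Any]] = None
) -> str:
    query_id = encoded_query_id(query, query_params)
    if query_id is not None:
        # Same query but different node type gives a different id.
        return f"{query_id}_{node_type_name}"
    # Fallback when the query cannot be hashed: a random, unique id.
    return str(uuid.uuid4())


plan_id = encode_node_id_with_query("SnowflakePlan", "SELECT 1")
select_id = encode_node_id_with_query("SelectStatement", "SELECT 1")
```

Here `plan_id` and `select_id` share the same query digest prefix but differ in the type suffix, so the two nodes are not reported as duplicates of each other.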

src/snowflake/snowpark/_internal/analyzer/select_statement.py Outdated

+                      return encode_id(type(self).__name__, self.original_sql, self.query_params)
+                  @cached_property
+                  def encoded_query_id(self) -> Optional[str]:

Collaborator

sfc-gh-jdu Oct 14, 2024

These two properties seem to have the same docstring?

Collaborator Author

sfc-gh-yzou Oct 15, 2024

They are slightly different, but I updated the comments further to make the difference clearer.

Contributor

sfc-gh-helmeleegy commented Oct 14, 2024

This makes sense when we do repeated subquery elimination, but it is not expected by other transformations.

Can you give an example of where this does not make sense?

Also, it sounds like this is a behavioral change. Are there production workloads that can be impacted? Which ones?

sfc-gh-yzou force-pushed the yzou-SNOW-1731783-cte-id-refactor branch from a6b696f to 87db802 Compare

October 15, 2024 02:19

Collaborator Author

sfc-gh-yzou commented Oct 15, 2024

@sfc-gh-helmeleegy There should be no user-facing behavior change. This equivalence check was introduced by the CTE transformation and is only used internally for plan node comparison in that transformation; nothing else should rely on it. If something does, that would count as a bug. This is a refactor done before others misuse this property, so no existing workload should be impacted.

sfc-gh-yzou commented

tests/integ/test_deepcopy.py

                   def traverse_plan(plan, plan_id_map):
-                      plan_id = plan._id
-                      plan_type = type(plan)

Collaborator Author

sfc-gh-yzou Oct 15, 2024

@sfc-gh-aalam I am not sure we are doing the right test here. If two nodes have the same query and the same type, there should still be two nodes after deepcopy, because they are two different nodes.
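
The expectation can be sketched as follows. `PlanNode` and `count_nodes` are hypothetical and this is not the code in tests/integ/test_deepcopy.py; the point is only that nodes are counted by object identity rather than by query.

```python
import copy
from typing import List, Optional


class PlanNode:
    """Hypothetical plan node with default identity-based equality."""

    def __init__(self, query: str, children: Optional[List["PlanNode"]] = None) -> None:
        self.query = query
        self.children = children or []


def count_nodes(root: PlanNode) -> int:
    """Count distinct node objects by identity, not by query."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.children)
    return len(seen)


# Two separate nodes with the same query and the same type.
left = PlanNode("SELECT * FROM t")
right = PlanNode("SELECT * FROM t")
root = PlanNode("UNION ALL", [left, right])

copied = copy.deepcopy(root)
```

Both `count_nodes(root)` and `count_nodes(copied)` see three nodes, and the two copied children remain separate objects.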

sfc-gh-yzou added 9 commits

October 15, 2024 09:29

- refactor (fe75544)
- refactor (85d20a8)
- fix error (4dc4136)
- fix error (4dbd23f)
- updat code (de4bba0)
- add comment (cfea33d)
- fix error
- address feedabck (dcf5100)
- fix test failure (f42030d)

sfc-gh-yzou force-pushed the yzou-SNOW-1731783-cte-id-refactor branch from 87db802 to f42030d Compare

October 15, 2024 16:31

sfc-gh-yzou requested review from sfc-gh-jdu, sfc-gh-aalam and sfc-gh-helmeleegy

October 15, 2024 18:04

sfc-gh-jdu approved these changes

sfc-gh-aalam approved these changes

sfc-gh-yzou merged commit 952599b into main

33 of 34 checks passed

sfc-gh-yzou deleted the yzou-SNOW-1731783-cte-id-refactor branch

October 16, 2024 19:08

github-actions bot locked and limited conversation to collaborators

Labels

NO-CHANGELOG-UPDATES