Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SNOW-1731783] Refactor node query comparison for Repeated subquery elimination #2437

Merged
merged 9 commits into from
Oct 16, 2024

Conversation

sfc-gh-yzou
Copy link
Collaborator

@sfc-gh-yzou sfc-gh-yzou commented Oct 11, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

SNOW-1731783

  1. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
  2. Please describe how your code solves the related issue.

In the previous repeated subquery elimination, we updated the node comparison for SnowflakePlan for Selectable to

    def __eq__(self, other: "SnowflakePlan") -> bool:
        if not isinstance(other, SnowflakePlan):
            return False
        if self._id is not None and other._id is not None:
            return isinstance(other, SnowflakePlan) and self._id == other._id
        else:
            return super().__eq__(other)

    def __hash__(self) -> int:
        return hash(self._id) if self._id else super().__hash__()

where the id is generated based on the query and query parameter, this means two node are treated as the same if they have same type and same query. This make sense when we do repeated subquery elimination, but not expected by other transformations.

Refactor the comparison to make sure that we only use the id comparison for repeated subquery eliminations, not for others.

@sfc-gh-yzou sfc-gh-yzou changed the title [SNOW-1731783] Refactor id comparison for CTE [SNOW-1731783] Refactor node query comparison for Repeated subquery elimination Oct 11, 2024
@sfc-gh-yzou sfc-gh-yzou marked this pull request as ready for review October 11, 2024 22:49
@sfc-gh-yzou sfc-gh-yzou requested a review from a team as a code owner October 11, 2024 22:49
@sfc-gh-yzou sfc-gh-yzou added the NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md label Oct 12, 2024
@sfc-gh-yzou sfc-gh-yzou force-pushed the yzou-SNOW-1731783-cte-id-refactor branch from fb74319 to a6b696f Compare October 14, 2024 16:52
@sfc-gh-aalam
Copy link
Contributor

This make sense when we do repeated subquery elimination, but not expected by other transformations.

Can you give an example of where this does not make sense?

@sfc-gh-yzou
Copy link
Collaborator Author

sfc-gh-yzou commented Oct 14, 2024

@sfc-gh-aalam
"""
Can you give an example of where this does not make sense?
"""
i am not sure for which transformation this could make sense, unless the transformation want to treat two nodes with the same query as the same node, for example, for your large query breakdown, i don't think you want to treat two nodes with the same query as the same node during transformation. Or even cte transformation, during the actual node transformation, we don't want to treat two nodes as the same node during the plan transformation, because the are different nodes. This could cause us to missing apply transformation on some nodes and cause potential problem, because we could incorrectly thought some nodes have been handled. This only make sense when we are doing query comparison
In general, we should never change node comparison unless it is by design, however, for node in plan tree, that should not be the case.

This has been causing problem to our ctc workloads, where we missed some handling of some nodes, I have tried it with our benchmark workloads, the problem has been gone with this refactoring.

Copy link
Contributor

@sfc-gh-aalam sfc-gh-aalam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. a few comments.

Comment on lines 185 to 187
query_id = encoded_query_id(query, query_params)
if query_id is not None:
return query_id + node_type_name
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we can do f"{query_id}_{node_type_name}"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

Comment on lines 175 to 177
def encode_id(
node_type_name: str, query: str, query_params: Optional[Sequence[Any]] = None
) -> str:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we call this encode_node_id?

Copy link
Collaborator Author

@sfc-gh-yzou sfc-gh-yzou Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i actually call this encode_node_id_with_query to be more clear of the encoded id to reducing the chance for people to use it incorrectly

Comment on lines 188 to 189
else:
return str(uuid.uuid4())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want the same node to have the same encode_id. If so, we can rely on id(node) which gives you deterministic result, instead of uuid4()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is used for cached property, so with uuid4 generation it will also be deterministitic, but we can use id also.

Comment on lines 185 to 187
query_id = encoded_query_id(query, query_params)
if query_id is not None:
return query_id + node_type_name
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Encode given query, query parameters and the node type into an id.

If query and query parameters can be encoded successfully using sha256,
return the encoded query id + node_type_name.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need node_type_name?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that is following the previous equivalence check, we require the node type to be the same, for example, a snowflake plan node and selectstatement node with the same query is not counted as the same

return encode_id(type(self).__name__, self.original_sql, self.query_params)

@cached_property
def encoded_query_id(self) -> Optional[str]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these two properties seem having the same docstring?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they are slightly different, but i further updated the comment to make it more clear

@sfc-gh-helmeleegy
Copy link
Contributor

sfc-gh-helmeleegy commented Oct 14, 2024

This make sense when we do repeated subquery elimination, but not expected by other transformations.

Can you give an example of where this does not make sense?

Also, it sounds like this is a behavioral change. Are there production workloads that can be impacted? Which ones?

@sfc-gh-yzou sfc-gh-yzou force-pushed the yzou-SNOW-1731783-cte-id-refactor branch from a6b696f to 87db802 Compare October 15, 2024 02:19
@sfc-gh-yzou
Copy link
Collaborator Author

sfc-gh-yzou commented Oct 15, 2024

@sfc-gh-helmeleegy there should be no user facing behavior change, this equivalence check is introduced by the cte transformation, and only used internally for plan node comparison for cte transformation, and no others should rely on this fact. If yes, that would be counted as a bug. This is counted as a refactor work before others misuse this property, so no existing workload should be impacted

def traverse_plan(plan, plan_id_map):
plan_id = plan._id
plan_type = type(plan)
Copy link
Collaborator Author

@sfc-gh-yzou sfc-gh-yzou Oct 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sfc-gh-aalam i am not sure if we are doing the write test here, if two nodes have the same query and same type, there should still be two nodes after deepcopy, because they are two different nodes.

@sfc-gh-yzou sfc-gh-yzou merged commit 952599b into main Oct 16, 2024
33 of 34 checks passed
@sfc-gh-yzou sfc-gh-yzou deleted the yzou-SNOW-1731783-cte-id-refactor branch October 16, 2024 19:08
@github-actions github-actions bot locked and limited conversation to collaborators Oct 16, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants