
[SNOW-1758768] Add repeated node complexity distribution telemetry #2494

Merged
merged 5 commits into main from yzou-SNOW-1758768-query-complexity on Oct 24, 2024

Conversation

sfc-gh-yzou
Collaborator

@sfc-gh-yzou sfc-gh-yzou commented Oct 23, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

SNOW-1758768

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

Add telemetry to record the distribution of repeated node complexity. Note that we are recording the number of repeated nodes, not just the CTE nodes. For example, if n2 occurred twice during the visit of the whole tree, it is counted twice instead of once.

The complexity buckets used today are as follows (a sketch of the binning follows the list):

1) low complexity
   bin 0: <= 10,000; bin 1: > 10,000, <= 100,000; bin 2: > 100,000, <= 500,000
2) medium complexity
   bin 3: > 500,000, <= 1,000,000; bin 4: > 1,000,000, <= 5,000,000
3) large complexity
   bin 5: > 5,000,000, <= 10,000,000; bin 6: > 10,000,000
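
A minimal sketch of this bucketing in Python (the helper name get_complexity_bin is hypothetical; only the bin boundaries come from the list above):

BIN_UPPER_BOUNDS = [10_000, 100_000, 500_000, 1_000_000, 5_000_000, 10_000_000]

def get_complexity_bin(complexity_score: int) -> int:
    # Map a complexity score to bins 0-6; bin 6 holds everything above
    # 10,000,000, the default exhaust bound mentioned later in this thread.
    for bin_index, upper_bound in enumerate(BIN_UPPER_BOUNDS):
        if complexity_score <= upper_bound:
            return bin_index
    return len(BIN_UPPER_BOUNDS)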

@sfc-gh-yzou sfc-gh-yzou added the NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md label Oct 23, 2024
@github-actions github-actions bot added the local testing Local Testing issues/PRs label Oct 23, 2024
@sfc-gh-yzou sfc-gh-yzou marked this pull request as ready for review October 23, 2024 18:41
@sfc-gh-yzou sfc-gh-yzou requested review from a team as code owners October 23, 2024 18:41
return {
    PlanState.PLAN_HEIGHT: height,
    PlanState.NUM_SELECTS_WITH_COMPLEXITY_MERGED: num_selects_with_complexity_merged,
    PlanState.NUM_CTE_NODES: len(cte_nodes),
Collaborator

It's the same as "num_duplicate_nodes" before, right? I'm not sure whether we should change the name, since it will break the existing queries in the dashboard.

Collaborator Author

There is no change to the telemetry itself; it is just a change to our internal code to be clearer, at least from our internal developers' point of view. If you look at the telemetry sent here (https://github.com/snowflakedb/snowpark-python/pull/2494/files#diff-c46aa591b19bb02d4b80a69dda3c5b4ff9e2ff4ad18cf1170779aa678ccbdafaR194), it is still using the same name.

) -> List[int]:
    """
    Calculate the complexity distribution for the detected repeated nodes. The complexities are categorized as follows:
    1) low complexity
Collaborator

just curious, where are these ranges from?

Collaborator Author

The upper bound 10,000,000 comes from the default exhaust bound; the rest is really just by feel. If you have other suggestions about the bins, please feel free to share them. I just don't want to send a big list for the complexity when we have a lot of duplications.

@@ -607,6 +607,7 @@ def test_execute_queries_api_calls(session, sql_simplifier_enabled):
"query_plan_height": query_plan_height,
"query_plan_num_duplicate_nodes": 0,
Collaborator

you might want to remove this

Collaborator Author

Oh, why is that? We are still sending this. query_plan_num_duplicate_nodes is the number of CTE nodes, while query_plan_duplicated_node_complexity_distribution is the actual distribution of repeated nodes, so they are different. If we sum all the distribution buckets together, we get the actual number of repeated nodes, not the number of CTE nodes.
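
To illustrate the difference with hypothetical numbers (not from this PR): suppose node A repeats three times and node B repeats twice. The plan then has two CTE nodes but five repeated-node occurrences:

query_plan_num_duplicate_nodes = 2  # distinct repeated (CTE) nodes: A and B
# Hypothetical 7-bin distribution: A's three occurrences fall into bin 0 and
# B's two occurrences into bin 2, based on their complexity scores.
query_plan_duplicated_node_complexity_distribution = [3, 0, 2, 0, 0, 0, 0]
assert sum(query_plan_duplicated_node_complexity_distribution) == 5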

Collaborator

yeah I missed that

Comment on lines 121 to 122
for complexity_score in id_complexity_map[node_id]:
    if complexity_score <= 10000:
Contributor

can't we do

complexity_score = id_complexity_map[node_id]
repetition_count = id_count_map[node_id]
node_complexity_dist[..] += repetition_count

Collaborator Author

answered below

Comment on lines 61 to 63
id_complexity_map[node.encoded_node_id_with_query].append(
get_complexity_score(node)
)
Contributor

Shouldn't two nodes with the same encoded_node_id_with_query have the same complexity? We are simply going to make a list of [complexity] * id_count_map[node_id].

Collaborator Author

So I was recording the complexity for each node because, theoretically, different nodes can end up with the same query. However, if they are the same query, I would in general expect the complexities to be the same or close, since this information is mainly used for estimation. I think we can just use the repetition_count to make things simpler.
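
A sketch of the simplification agreed on here, assuming the id_complexity_map and id_count_map names from this thread and the hypothetical get_complexity_bin helper sketched earlier (one complexity score stored per encoded node id):

node_complexity_dist = [0] * 7  # bins 0-6
for node_id, complexity_score in id_complexity_map.items():
    # Weight by the repetition count so each repeated node is counted once
    # per occurrence, not once per distinct id.
    repetition_count = id_count_map[node_id]
    node_complexity_dist[get_complexity_bin(complexity_score)] += repetition_count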

@sfc-gh-yzou sfc-gh-yzou force-pushed the yzou-SNOW-1758768-query-complexity branch from 4889ffd to c3156e3 October 24, 2024 06:09
@sfc-gh-yzou sfc-gh-yzou merged commit 78b3c63 into main Oct 24, 2024
37 checks passed
@sfc-gh-yzou sfc-gh-yzou deleted the yzou-SNOW-1758768-query-complexity branch October 24, 2024 17:44
@github-actions github-actions bot locked and limited conversation to collaborators Oct 24, 2024