
[SNOW-1758768] Add repeated node complexity distribution telemetry #2494

Merged
merged 5 commits into main from yzou-SNOW-1758768-query-complexity on Oct 24, 2024

Conversation

sfc-gh-yzou
Collaborator

@sfc-gh-yzou sfc-gh-yzou commented Oct 23, 2024

  1. Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.

SNOW-1758768

  2. Fill out the following pre-review checklist:

    • I am adding a new automated test(s) to verify correctness of my new code
      • If this test skips Local Testing mode, I'm requesting review from @snowflakedb/local-testing
    • I am adding new logging messages
    • I am adding a new telemetry message
    • I am adding new credentials
    • I am adding a new dependency
    • If this is a new feature/behavior, I'm adding the Local Testing parity changes.
    • I acknowledge that I have ensured my changes to be thread-safe. Follow the link for more information: Thread-safe Developer Guidelines
  3. Please describe how your code solves the related issue.

Add telemetry to record the distribution of repeated node complexity. Note that we are recording the number of repeated nodes, not just the CTE nodes. For example, if n2 occurred twice during the visit of the whole tree, it is counted twice instead of once.

The complexity buckets used today are as follows (a sketch of the binning follows the list):

1) low complexity
   bin 0: <= 10,000; bin 1: > 10,000, <= 100,000; bin 2: > 100,000, <= 500,000
2) medium complexity
   bin 3: > 500,000, <= 1,000,000; bin 4: > 1,000,000, <= 5,000,000
3) large complexity
   bin 5: > 5,000,000, <= 10,000,000; bin 6: > 10,000,000
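
A minimal sketch of this bucketing in Python (the helper name get_complexity_bin is hypothetical; only the bin boundaries come from the list above):

BIN_UPPER_BOUNDS = [10_000, 100_000, 500_000, 1_000_000, 5_000_000, 10_000_000]

def get_complexity_bin(complexity_score: int) -> int:
    # Map a complexity score to bins 0-6; bin 6 holds everything above
    # 10,000,000, the default exhaust bound mentioned later in this thread.
    for bin_index, upper_bound in enumerate(BIN_UPPER_BOUNDS):
        if complexity_score <= upper_bound:
            return bin_index
    return len(BIN_UPPER_BOUNDS)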

@sfc-gh-yzou sfc-gh-yzou added the NO-CHANGELOG-UPDATES This pull request does not need to update CHANGELOG.md label Oct 23, 2024
@github-actions github-actions bot added the local testing Local Testing issues/PRs label Oct 23, 2024
@sfc-gh-yzou sfc-gh-yzou marked this pull request as ready for review October 23, 2024 18:41
@sfc-gh-yzou sfc-gh-yzou requested review from a team as code owners October 23, 2024 18:41
return {
    PlanState.PLAN_HEIGHT: height,
    PlanState.NUM_SELECTS_WITH_COMPLEXITY_MERGED: num_selects_with_complexity_merged,
    PlanState.NUM_CTE_NODES: len(cte_nodes),
Collaborator

It's the same as "num_duplicate_nodes" before, right? I'm not sure whether we should change the name, since it will break the existing queries in the dashboard.

Collaborator Author

There is no change to the telemetry itself; it is just a change to our internal code to be clearer, at least from our internal developers' point of view. If you look at the telemetry sent here (https://github.com/snowflakedb/snowpark-python/pull/2494/files#diff-c46aa591b19bb02d4b80a69dda3c5b4ff9e2ff4ad18cf1170779aa678ccbdafaR194), it is still using the same name.

) -> List[int]:
    """
    Calculate the complexity distribution for the detected repeated nodes. The complexities are categorized as follows:
    1) low complexity
Collaborator

just curious, where are these ranges from?

Collaborator Author

The upper bound 10,000,000 comes from the default exhaust bound; the rest is really just by feel. If you have other suggestions about the bins, please feel free to share them. I just don't want to send a big list for the complexity when we have a lot of duplications.

@@ -607,6 +607,7 @@ def test_execute_queries_api_calls(session, sql_simplifier_enabled):
"query_plan_height": query_plan_height,
"query_plan_num_duplicate_nodes": 0,
Collaborator

you might want to remove this

Collaborator Author

Oh, why is that? We are still sending this. query_plan_num_duplicate_nodes is the number of CTE nodes, while query_plan_duplicated_node_complexity_distribution is the actual distribution of repeated nodes, so they are different. If we sum all the distribution buckets together, we get the actual number of repeated nodes, not the number of CTE nodes.
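
To illustrate the difference with hypothetical numbers (not from this PR): suppose node A repeats three times and node B repeats twice. The plan then has two CTE nodes but five repeated-node occurrences:

query_plan_num_duplicate_nodes = 2  # distinct repeated (CTE) nodes: A and B
# Hypothetical 7-bin distribution: A's three occurrences fall into bin 0 and
# B's two occurrences into bin 2, based on their complexity scores.
query_plan_duplicated_node_complexity_distribution = [3, 0, 2, 0, 0, 0, 0]
assert sum(query_plan_duplicated_node_complexity_distribution) == 5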

Collaborator

yeah I missed that

Comment on lines 121 to 122
for complexity_score in id_complexity_map[node_id]:
    if complexity_score <= 10000:
Contributor

can't we do

complexity_score = id_complexity_map[node_id]
repetition_count = id_count_map[node_id]
node_complexity_dist[..] += repetition_count

Collaborator Author

answered below

Comment on lines 61 to 63
id_complexity_map[node.encoded_node_id_with_query].append(
get_complexity_score(node)
)
Contributor

Shouldn't two nodes with the same encoded_node_id_with_query have the same complexity? We are simply going to make a list of [complexity] * id_count_map[node_id].

Collaborator Author

So I was recording the complexity for each node because, theoretically, different nodes can end up with the same query. However, if they are the same query, I would in general expect the complexities to be the same or close, since this information is mainly used for estimation. I think we can just use the repetition_count to make things simpler.
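
A sketch of the simplification agreed on here, assuming the id_complexity_map and id_count_map names from this thread and the hypothetical get_complexity_bin helper sketched earlier (one complexity score stored per encoded node id):

node_complexity_dist = [0] * 7  # bins 0-6
for node_id, complexity_score in id_complexity_map.items():
    # Weight by the repetition count so each repeated node is counted once
    # per occurrence, not once per distinct id.
    repetition_count = id_count_map[node_id]
    node_complexity_dist[get_complexity_bin(complexity_score)] += repetition_count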

@sfc-gh-yzou sfc-gh-yzou force-pushed the yzou-SNOW-1758768-query-complexity branch from 4889ffd to c3156e3 October 24, 2024 06:09
@sfc-gh-yzou sfc-gh-yzou merged commit 78b3c63 into main Oct 24, 2024
37 checks passed
@sfc-gh-yzou sfc-gh-yzou deleted the yzou-SNOW-1758768-query-complexity branch October 24, 2024 17:44
@github-actions github-actions bot locked and limited conversation to collaborators Oct 24, 2024