[SNOW-1758768] Add repeated node complexity distribution telemetry #2494
Conversation
return {
    PlanState.PLAN_HEIGHT: height,
    PlanState.NUM_SELECTS_WITH_COMPLEXITY_MERGED: num_selects_with_complexity_merged,
    PlanState.NUM_CTE_NODES: len(cte_nodes),
It's the same as "num_duplicate_nodes" before, right? I'm not sure whether we should change the name, since it will break the existing queries in the dashboard.
There is no change to the telemetry field name; it is only an internal code change to make things clearer, at least from our internal developers' point of view. If you look at the telemetry sent here https://github.com/snowflakedb/snowpark-python/pull/2494/files#diff-c46aa591b19bb02d4b80a69dda3c5b4ff9e2ff4ad18cf1170779aa678ccbdafaR194, it still uses the same name.
) -> List[int]:
    """
    Calculate the complexity distribution for the detected repeated nodes. The complexities are categorized as follows:
    1) low complexity
just curious, where are these ranges from?
The upper bound 10,000,000 comes from the default exhaust bound; the rest are really just chosen by intuition. If you have other suggestions about the bins, please feel free to share them. I just don't want to send a big list of complexities if we have a lot of duplications.
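For context, a bucketing helper along these lines would map a score to its bin. The edges 10,000 and 10,000,000 come from this thread; the intermediate edges here are illustrative assumptions, not the PR's actual values:

```python
import bisect

# Upper bounds of the complexity buckets. 10_000 and 10_000_000 are
# mentioned in the thread; 100_000 and 1_000_000 are assumed fillers.
COMPLEXITY_BOUNDS = [10_000, 100_000, 1_000_000, 10_000_000]

def bucket_index(score: int) -> int:
    # bisect_left finds the first bound >= score, so a score equal to a
    # bound stays in the lower bucket; scores above the last bound land
    # in the overflow bucket (index 4).
    return bisect.bisect_left(COMPLEXITY_BOUNDS, score)
```

With these edges, `bucket_index(10_000)` is 0 and anything above 10,000,000 falls into the final overflow bucket.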
@@ -607,6 +607,7 @@ def test_execute_queries_api_calls(session, sql_simplifier_enabled):
    "query_plan_height": query_plan_height,
    "query_plan_num_duplicate_nodes": 0,
you might want to remove this
Oh, why is that? We are still sending this. query_plan_num_duplicate_nodes is the number of CTE nodes, while query_plan_duplicated_node_complexity_distribution is the actual distribution of repeated nodes, so they are different: if we sum all of the distribution entries together, we get the total number of repeated nodes, not the number of CTE nodes.
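The distinction above can be illustrated with a toy example (node names and counts here are made up, not from the PR):

```python
# Hypothetical example: two distinct duplicated nodes, where n2 is
# repeated twice and n3 three times during the tree traversal.
repetitions = {"n2": 2, "n3": 3}

# Number of distinct duplicated (CTE) nodes.
num_cte_nodes = len(repetitions)
# Total repeated occurrences, i.e. the sum over the distribution.
num_repeated_nodes = sum(repetitions.values())
```

Here `num_cte_nodes` is 2 while the distribution entries sum to 5, which is why the two metrics are not interchangeable.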
yeah I missed that
for complexity_score in id_complexity_map[node_id]:
    if complexity_score <= 10000:
can't we do
complexity_score = id_complexity_map[node_id]
repetition_count = id_count_map[node_id]
node_complexity_dist[..] += repetition_count
answered below
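The simplification suggested in the thread, storing one complexity score per node id and weighting it by the repetition count, could be sketched roughly as follows. The bin edges and helper names are assumptions for illustration, not the PR's actual code:

```python
def bucket_of(score):
    # Illustrative bin edges; 10_000 and 10_000_000 come from the thread,
    # the intermediate ones are assumed.
    for i, bound in enumerate([10_000, 100_000, 1_000_000, 10_000_000]):
        if score <= bound:
            return i
    return 4  # overflow bucket

def build_distribution(id_complexity_map, id_count_map):
    # One complexity score per node id, weighted by how many times the
    # node repeats, instead of appending the score once per occurrence.
    dist = [0] * 5
    for node_id, complexity_score in id_complexity_map.items():
        dist[bucket_of(complexity_score)] += id_count_map[node_id]
    return dist
```

This relies on the assumption discussed below: nodes sharing an `encoded_node_id_with_query` have the same (or very close) complexity, so one score per id suffices.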
id_complexity_map[node.encoded_node_id_with_query].append(
    get_complexity_score(node)
)
Shouldn't two nodes with the same encoded_node_id_with_query have the same complexity? We are simply going to build a list of [complexity] * id_count_map[node_id].
So I was recording the complexity for each node because, theoretically, different nodes can end with the same query. However, if they are the same query, I would in general expect the complexity to be the same or close, since that information is mainly used for estimation. I think we can just use the repetition_count to make things simpler.
4889ffd to c3156e3 (compare)
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
SNOW-1758768
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
Add telemetry to record the distribution of repeated node complexity. Note that we are recording the number of repeated nodes, not just the CTE nodes. For example, if n2 occurred twice during the visit of the whole tree, it is counted twice, not once.
The complexity buckets used today are: