-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SNOW-1728988: Cache attributes on SelectStatement to reduce describe query #2462
Conversation
@@ -1181,6 +1183,7 @@ def filter(self, col: Expression) -> "SelectStatement": | |||
new = SelectStatement( | |||
from_=self.to_subqueryable(), where=col, analyzer=self.analyzer | |||
) | |||
new._attributes = self._attributes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this also apply to set operations like union and intersection?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nah, set operations possibly change the data type, e.g., (select '1' as a) union (select 2 as a)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we first check if the input data types are the same, then they can be propagated to the output? I believe this would be the more common case, and we should be able to recognize it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually it's data coercion in Snowflake, and there are many complicated rules on the server side. Even if they are the same type, there may still be data coercion happening (such as two varchar values with different length). So I don't think it's a good idea to have such rules on the client side, and we should only rely on the server side for data coercion. This type of operation is not in the scope of reducing describe queries.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think this will reduce describe queries. We're just worried that the rules we use on the client side may end up being different than the server side, right? Would we still be worried if the data types for both operands are exactly the same (including lengths for varchar, etc)? Or is it that we don't have this information on the client side so we cannot be 100% sure? I'm just trying to understand the concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My original concern is just about data coercion, but as you said, if two types are exactly the same, we can try to infer too and we can investigate it in the future, if this is a common pattern that we should optimize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I agree it doesn't have to be in this PR. For Snowpark pandas, we do use union operations pretty commonly. And I think that in most cases, the data types are identical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it, yeah if it's pretty common, we can definitely do it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Good progress with reducing describe queries. Thanks, Jianzhun.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would you have to update some tests as a consequence because this change will reduce num describe queries?
@@ -1181,6 +1183,7 @@ def filter(self, col: Expression) -> "SelectStatement": | |||
new = SelectStatement( | |||
from_=self.to_subqueryable(), where=col, analyzer=self.analyzer | |||
) | |||
new._attributes = self._attributes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flag control
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe we don't need this one because if the parameter is off, _attributes is always None. But we can also add
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can have parameter turned off during middle. Let's protect the code for extra safety, and when we do copy, don't we also need to copy the attribute over?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure, let me add the parameter protection. I'd prefer not to copy the attributes for now, because select()
use copy to create a new select_statement, where attributes may not be the same
new = copy(self) |
if self._attributes is not None: | ||
return self._attributes | ||
assert ( | ||
self.schema_query is not None | ||
), "No schema query is available for the SnowflakePlan" | ||
self._attributes = analyze_attributes(self.schema_query, self.session) | ||
# We need to cache attributes on SelectStatement too because df._plan is not | ||
# carried over to next SelectStatement (e.g., check the implementation of df.filter()). | ||
if isinstance(self.source_plan, SelectStatement): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
flag_control here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added
] | ||
|
||
# Create from Values | ||
create_from_values_funcs = [] | ||
create_from_values_funcs = [ | ||
lambda session: session.create_dataframe([[1, 2], [3, 4]], schema=["a", "b"]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we setup this test suite with the control on and off to make sure things all works as expected when the flag is on or off
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure
yes, it does reduce some describe queries (though our tests haven't really add many describe queries check), but because the parameter is not enabled now, it won't affect other tests. |
Which Jira issue is this PR addressing? Make sure that there is an accompanying issue to your PR.
Fixes SNOW-1728988
Fill out the following pre-review checklist:
Please describe how your code solves the related issue.
When sql simplifier is on, it can resolve the scenarios where 1) select() is called because df._plan is not carried over, 2) sql simplifier can't flatten the query