Feat/parameterized sql queries #964
base: main
Conversation
…ys register the dataframes so they can be used with sql queries.
This user interface looks nice 😎
I see that whenever a file is queried now, it'll also be registered as a table. This does have some impact, as it means these tables are now also returned whenever a context is queried for all registered tables, so any sort of visualization or automation based on that list would behave differently depending on whether certain queries were run.
I personally find this very surprising. I would not expect `read_parquet` to secretly call `register_parquet`, as that is not what this library did before, nor is it what the Rust library does.
Are you sure this is fine and won't affect people? At the very least, shouldn't there be a way to filter these out easily, to cleanly differentiate between auto-registered and explicitly registered tables?
I very much see the value of the parameterized SQL feature, but this seems like a very crude way of doing it.
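To make the concern concrete, here is a minimal sketch; the file path is hypothetical, and the catalog calls assume the usual datafusion-python API for listing registered tables:

```python
from datafusion import SessionContext

ctx = SessionContext()

# Hypothetical file; previously read_parquet only returned a DataFrame
# and left the table registry untouched.
df = ctx.read_parquet("events.parquet")

# Assumed catalog API for enumerating registered tables. With this PR,
# a table named after the file (or a generated UUID) now shows up here,
# indistinguishable from tables the user registered explicitly.
print(ctx.catalog().database().names())
```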
@@ -282,6 +283,16 @@ fn find_window_fn(name: &str, ctx: Option<PySessionContext>) -> PyResult<WindowF
        return Ok(agg_fn);
    }

    // search default window functions
It is not clear to me how this relates to the rest of the change.
if named_dfs:
    for alias, df in named_dfs.items():
        df_sql = f"({df.logical_plan().to_sql()})"
        query = query.replace(f"{{{alias}}}", df_sql)
There are some annoying unintended side effects to this approach. Imagine the following query:

SELECT * FROM {alias} WHERE val="a string that happens to contain {alias} in it"

Since this code just replaces all occurrences of `{alias}` with an SQL query, it will do so in the WHERE part as well. As far as I can tell, there would be no way to escape `{alias}` in such a way that the replacement does not occur.
This is obviously a contrived example, and it might be that this is acceptable.
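A short, self-contained illustration of that side effect (the query text and the stand-in SQL string are made up for the example):

```python
# Naive placeholder substitution, mirroring the approach in the diff above.
query = "SELECT * FROM {alias} WHERE val='a string that happens to contain {alias} in it'"
df_sql = "(SELECT a, b FROM some_table)"  # stand-in for df.logical_plan().to_sql()

print(query.replace("{alias}", df_sql))
# Both occurrences of {alias} are replaced, including the one inside the
# string literal, which silently corrupts the query.
```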
Which issue does this PR close?
Closes #513
Rationale for this change
Users would like to use DataFrames as a parameter inside an SQL query. With this change, you can do the following:
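A hypothetical sketch of the intended usage, based on this description and the diff above; the file path is made up, and the exact way DataFrames are passed to `sql()` may differ from the final API:

```python
from datafusion import SessionContext

ctx = SessionContext()
df = ctx.read_parquet("orders.parquet")  # hypothetical file

# Assumed calling convention: the DataFrame is passed by name and the
# matching {df} placeholder in the query is replaced with the SQL form
# of its logical plan.
result = ctx.sql("SELECT a, b FROM {df} LIMIT 10", df=df)
```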
The string `{df}` in the query will be replaced with the SQL equivalent of the logical plan of the DataFrame.

What changes are included in this PR?

All of the `read_parquet`, `read_avro`, `read_json`, and `read_csv` functions have been changed to call the corresponding `register_` function with a generated table name. This table name is the file name. If a table already exists with that file name, a generated UUID is used instead.

One unit test is included.
Are there any user-facing changes?
There is an addition of an optional table name to each of the `read_` functions above, but it is a non-breaking change for users.
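A minimal sketch of that user-facing addition, assuming the option is exposed as a keyword argument; the parameter name table_name and the file path are illustrative, not taken from the PR:

```python
from datafusion import SessionContext

ctx = SessionContext()

# Hypothetical keyword; the PR adds an optional table name so the
# DataFrame returned by read_csv is also registered under that name.
df = ctx.read_csv("sales.csv", table_name="sales")

# The registered name can then be referenced directly from SQL.
ctx.sql("SELECT COUNT(*) FROM sales").show()
```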