Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat/parameterized sql queries #964

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

timsaucer
Copy link
Contributor

Which issue does this PR close?

Closes #513

Rationale for this change

Users would like to use DataFrames as a parameter inside an SQL query. With this change, you can do the following:

from datafusion import SessionContext
ctx = SessionContext()
df_customer = ctx.read_parquet("examples/tpch/data/customer.parquet")
ctx.sql("select c_custkey, c_name from {df}", df=df_customer)

The string {df} in the query will be replaced with the SQL equivalent of the logical plan of the DataFrame.

What changes are included in this PR?

All of the read_parquet, read_avro, read_json, and read_csv have been changed to call register_ with a generated table name. This table name is the file name. If a table already exists with that file name, a generated UUID is used instead.

One unit test is included.

Are there any user-facing changes?

There is an addition of an optional table name to each of the read_ functions above, but it is a non breaking change for the users.

@MrPowers
Copy link
Contributor

MrPowers commented Dec 6, 2024

This user interface looks nice 😎

Copy link

@matko matko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see that whenever a file is queried now, it'll also be registered as a table. This does have some impact, as it means these tables are now also returned whenever a context is queried for all registered tables. Meaning any sort of visualization or automation based on that would behave differently depending on whether certain queries were run.

I personally find this very surprising. I would not expect read_parquet to secretly register_parquet as that is not what this library did before, nor is it what the rust library does.

Are you sure this is fine and won't affect people? At the very least, shouldn't there be a way to filter these out easily, to cleanly differentiate between auto-registered and explicitly registered tables?

I very much see the value of the parameterized sql feature, but this seems like a very crude way of doing it.

@@ -282,6 +283,16 @@ fn find_window_fn(name: &str, ctx: Option<PySessionContext>) -> PyResult<WindowF
return Ok(agg_fn);
}

// search default window functions
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is not clear to me how this relates to the rest of the change.

if named_dfs:
for alias, df in named_dfs.items():
df_sql = f"({df.logical_plan().to_sql()})"
query = query.replace(f"{{{alias}}}", df_sql)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some annoying unintended side effects to this approach. Imagine the following query

SELECT * FROM {alias} WHERE val="a string that happens to contain {alias} in it"

Since this code just replaces all occurences of {alias} with an sql query it'll do so in the WHERE part as well.
As far as I can tell, there would be no way to escape {alias} in such a way that the replacement does not occur.

This is obviously a contrived example, and it might be that this is acceptable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Is it possible to pass query parameters? (:param or ?)
3 participants