Feat/parameterized sql queries #964

timsaucer · 2024-12-06T02:00:21Z

Which issue does this PR close?

Closes #513

Rationale for this change

Users would like to use DataFrames as a parameter inside an SQL query. With this change, you can do the following:

from datafusion import SessionContext
ctx = SessionContext()
df_customer = ctx.read_parquet("examples/tpch/data/customer.parquet")
ctx.sql("select c_custkey, c_name from {df}", df=df_customer)

The string {df} in the query will be replaced with the SQL equivalent of the logical plan of the DataFrame.

What changes are included in this PR?

The read_parquet, read_avro, read_json, and read_csv methods now call the corresponding register_ method with a generated table name. The table name defaults to the file name. If a table with that name already exists, a generated UUID is used instead.
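
The naming rule described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: `generate_table_name` and its `existing_tables` argument are hypothetical names, taking the file stem (the name without its extension) as "the file name" is an assumption, and so is the `t_` prefix on the UUID fallback.

```python
from __future__ import annotations

import uuid
from pathlib import Path


def generate_table_name(path: str, existing_tables: set[str]) -> str:
    """Pick a table name for a file being read (hypothetical helper)."""
    # Assumption: "file name" means the stem, e.g. "customer" for customer.parquet.
    candidate = Path(path).stem
    if candidate not in existing_tables:
        return candidate
    # The name is already taken, so fall back to a generated UUID.
    # Assumption: hex form with a letter prefix, so the result is usable
    # as an unquoted SQL identifier.
    return f"t_{uuid.uuid4().hex}"
```

With this sketch, reading `examples/tpch/data/customer.parquet` twice would give the first table the name `customer` and the second a UUID-based name.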

One unit test is included.

Are there any user-facing changes?

Each of the read_ functions above gains an optional table name argument, but this is a non-breaking change for users.


MrPowers · 2024-12-06T14:45:05Z

This user interface looks nice 😎

matko

I see that whenever a file is queried now, it'll also be registered as a table. This does have some impact: these tables are now also returned whenever a context is queried for all registered tables, so any sort of visualization or automation based on that list would behave differently depending on whether certain queries were run.

I personally find this very surprising. I would not expect read_parquet to secretly call register_parquet, as that is not what this library did before, nor is it what the Rust library does.

Are you sure this is fine and won't affect people? At the very least, shouldn't there be a way to filter these out easily, to cleanly differentiate between auto-registered and explicitly registered tables?

I very much see the value of the parameterized sql feature, but this seems like a very crude way of doing it.
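
One way the filtering asked about above might look, as a minimal sketch: keep a flag per table recording whether it was registered automatically. `TableRegistry`, the `auto` flag, and `include_auto` are all hypothetical and are not part of datafusion-python; the class is a toy stand-in for a session context's catalog.

```python
from __future__ import annotations


class TableRegistry:
    """Toy stand-in for a session context's table catalog (hypothetical)."""

    def __init__(self) -> None:
        # Maps table name -> True if the table was auto-registered by a read call.
        self._tables: dict[str, bool] = {}

    def register(self, name: str, *, auto: bool = False) -> None:
        """Record a table; auto=True marks it as implicitly registered."""
        self._tables[name] = auto

    def tables(self, include_auto: bool = True) -> list[str]:
        """List table names, optionally hiding auto-registered ones."""
        return sorted(
            name
            for name, auto in self._tables.items()
            if include_auto or not auto
        )
```

Under this assumption, tooling that lists tables could pass `include_auto=False` and see the same result regardless of which files had been read.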

matko · 2025-01-28T12:17:43Z

src/functions.rs

@@ -282,6 +283,16 @@ fn find_window_fn(name: &str, ctx: Option<PySessionContext>) -> PyResult<WindowF
        return Ok(agg_fn);
    }

+    // search default window functions


It is not clear to me how this relates to the rest of the change.

matko · 2025-01-28T12:18:41Z

python/datafusion/context.py

+        if named_dfs:
+            for alias, df in named_dfs.items():
+                df_sql = f"({df.logical_plan().to_sql()})"
+                query = query.replace(f"{{{alias}}}", df_sql)


There are some annoying unintended side effects to this approach. Imagine the following query:

SELECT * FROM {alias} WHERE val="a string that happens to contain {alias} in it"

Since this code just replaces all occurrences of {alias} with a SQL query, it'll do so in the WHERE clause as well.
As far as I can tell, there would be no way to escape {alias} so that the replacement does not occur.

This is obviously a contrived example, and it might be that this is acceptable.
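
The behavior described above can be reproduced without DataFrames. The sketch below first mirrors the plain `str.replace` approach from the diff, then shows one possible escape convention, doubled braces as in `str.format`, handled in a single regex pass. Both function names are hypothetical, the SQL fragments are passed in as plain strings instead of being derived from a logical plan, and the escape convention is an assumption, not something this PR implements.

```python
import re


def naive_substitute(query: str, fragments: dict) -> str:
    """Mirror the diff: replace every occurrence of {alias}, wherever it appears."""
    for alias, sql in fragments.items():
        query = query.replace(f"{{{alias}}}", f"({sql})")
    return query


# Matches an escaped brace pair first, then a {name} placeholder.
_PLACEHOLDER = re.compile(r"\{\{|\}\}|\{(\w+)\}")


def escaped_substitute(query: str, fragments: dict) -> str:
    """Replace {alias}, but treat {{ and }} as literal braces (assumed convention)."""

    def repl(match: re.Match) -> str:
        token = match.group(0)
        if token == "{{":
            return "{"
        if token == "}}":
            return "}"
        name = match.group(1)
        if name not in fragments:
            return token  # leave unknown placeholders untouched
        return f"({fragments[name]})"

    return _PLACEHOLDER.sub(repl, query)
```

With `fragments = {"alias": "SELECT 1"}`, the naive version rewrites `{alias}` inside a string literal too, while the second version lets the query author write `{{alias}}` to keep the literal text `{alias}`.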

timsaucer added 3 commits December 3, 2024 09:05

Search default window functions if no session context was provided

3bd30a0

When calling read_x to create dataframes in the session context, alwa…

fd5977f

…ys register the dataframes so they can be used with sql queries.

add unit test for parameterized sql statement

0f2dccf

timsaucer mentioned this pull request Jan 26, 2025

Why uuid is only assigned for create_dataframe, not assigned for read_xxx #996

Open

matko reviewed Jan 28, 2025
