bench: implement TPC-H queries using Narwhals #282
Comments
I'll implement Q9 and Q10
@MarcoGorelli I noticed that Narwhals doesn't yet have 'contains' in the ExprStringNamespace, which is used in query 9.
ah, thanks @ugohuche ! looks like we've found the next issue :) would you like to work on implementing it? once that's done, we can get back to this
Yeah, I'd love to give it a try
Just for context: the idea of implementing these queries is both performance benchmarking and feature availability. Meaning that if we are able to run all (or most) of them, that is a proxy for Narwhals supporting a large enough feature set. Is this the case @MarcoGorelli ?
yes, exactly!
I'll take Q11. Edit:
There is a user warning:
I haven't tried to rewrite the query. I'm just reporting my findings so far. 😇
thanks for reporting these! I think we can live with the UserWarning for now and not worry about it, as long as it executes. fancy opening a separate issue about
Findings from Q21:
Findings from Q20:
yeah, great idea!
does this also happen for pyarrow-backed pandas?
pyarrow backend has no issues!
Hi @MarcoGorelli, query 13 has a method "not_" which isn't implemented in Narwhals yet.
I think we might as well just use
I'll use that instead.
Thanks! I think there's already one open for that
Oh okay, I didn't see it
thanks @brentomagic for reporting - your comment is very hard to read, I'd suggest using triple-backticks to format code https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks
here's a simpler reproducer for the issue:
import narwhals as nw
import pandas as pd

def func(df_any):
    df = nw.from_native(df_any, eager_only=True)
    # filter, then aggregate with sum().round(2) inside agg
    df = df.filter(nw.col('a')<1).group_by('a').agg(nw.col('b').sum().round(2).alias('c'))
    return nw.to_native(df)

print(func(pd.DataFrame({'a': [1,2,3], 'b': [4,3,2]})))
I'll implement Q17 and Q18 now that the blockers are resolved
I'll look at Q22
Closing in favor of #863
In https://github.com/narwhals-dev/narwhals/tree/main/tpch/notebooks, we implement several TPC-H queries and run them via Narwhals.
We currently have the first 7 there, and should work on adding more.
To track progress:
You can find reference implementations for these queries here: https://github.com/pola-rs/tpch/tree/main/queries/polars
The task here is:
- q5_pandas_native: we don't currently have reference implementations for them (https://github.com/pola-rs/tpch/tree/main/queries/pandas), so we skip them
- replace def q5 with something like the query you find in https://github.com/pola-rs/tpch/tree/main/queries/polars . You will need to modify it a bit: check the current notebooks and try to follow how they do it. You'll also need to get it to work for Narwhals, so that means:
  - nw.col instead of pl.col
  - nw.from_native at the start of the function for each input, and nw.to_native at the end

Once you have a notebook which executes, please share it here!
Note: if you see q5_pandas or q5_ibis, you can ignore them for the purpose of this issue, no need to copy them. Let's only focus on making sure that the Narwhals API is extensive enough for these queries.