Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bench: implement TPC-H queries using Narwhals #282

Closed
9 of 15 tasks
MarcoGorelli opened this issue Jun 11, 2024 · 22 comments
Closed
9 of 15 tasks

bench: implement TPC-H queries using Narwhals #282

MarcoGorelli opened this issue Jun 11, 2024 · 22 comments

Comments

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Jun 11, 2024

In https://github.com/narwhals-dev/narwhals/tree/main/tpch/notebooks, we implement several TPC-H queries, and run them via Narwhals

We currently have the first 7 there, and should work on adding more

To track progress:

You can find reference implementations for these queries here: https://github.com/pola-rs/tpch/tree/main/queries/polars

The task here is:

  1. Make an account on kaggle.com
  2. choose one of the above queries (e.g. Q9 - don't
  3. Make a fork of https://www.kaggle.com/code/marcogorelli/narwhals-tpch-q5-s2 (you should see a "copy and edit" button on Kaggle)
  4. Remove q5_pandas_native. We don't currently have reference implementations for them https://github.com/pola-rs/tpch/tree/main/queries/pandas, so we skip them
  5. replace def q5 with something like the query you find in https://github.com/pola-rs/tpch/tree/main/queries/polars . You will need to modify it a bit, check the current notebooks and try to follow how they do it. You'll also need to get it to work for Narwhals, so that means:
  • using nw.col instead of pl.col
  • using nw.from_native at the start of the function for each input, and nw.to_native at the end

Once you have a notebook which executes, please share it here!


Note: if you see q5_pandas or q5_ibis, you can ignore them for the purpose of this issue, no need to copy them. Let's only focus on making sure that the Narwhals API is extensive enough for these queries

@MarcoGorelli MarcoGorelli changed the title feat: implement TPC-H queries using Narwhals bench: implement TPC-H queries using Narwhals Jun 11, 2024
@ugohuche
Copy link
Contributor

I'll implement Q9 and Q10

@ugohuche
Copy link
Contributor

@MarcoGorelli I noticed that Narwhals doesn't yet have 'contains' in the ExprStringNamespace, which is used in query 9.

@MarcoGorelli
Copy link
Member Author

ah, thanks @ugohuche ! looks like we've found the next issue :) would you like to work on implementing str.contains? If you look at str.ends_with, I think it might be kind of similar, feel free to ask if you get stuck

once that's done, we can get back to this

@ugohuche
Copy link
Contributor

Yeah, I'd love to give it a try

@FBruzzesi
Copy link
Member

Just for context: the idea of implementing these queries is both for performance benchmarking and features availability. Meaning that if we are able to run all (most) of them, then that is a proxy for a large enough feature set we support. Is this the case @MarcoGorelli ?

@MarcoGorelli
Copy link
Member Author

yes, exactly!

@DeaMariaLeon
Copy link
Member

DeaMariaLeon commented Jun 12, 2024

I'll take Q11

edit:
Narwhals doesn't have:

  • how="cross" for DataFrame.join()
  • round

There is a user warning :

UserWarning: Found complex group-by expression, which can't be expressed efficiently with the pandas API. If you can, please rewrite your query such that group-by aggregations are simple (e.g. mean, std, min, max, ...).
  return agg_pandas(

I haven't tried to rewrite the query. I'm just reporting my findings up to now. 😇

@MarcoGorelli @FBruzzesi

@MarcoGorelli
Copy link
Member Author

thanks for reporting these! I think we can live with the userwarning for now and not worry about it, as long as it executes

fancy opening a separate issue about how='cross' and round? they definitely seem generically useful enough that we should implement them

@FBruzzesi
Copy link
Member

FBruzzesi commented Jun 19, 2024

Findings from Q21:

  • DataFrame.join doesn't have on keyword (I believe by design), I replaced it with left_on=key, right_on=key. Annoying but doable 😂
  • Expr does not support .len() (opened a feature request already [Enh]: Support Expr.len() #322): while nw.len() can work in some situations, there is a broad spectrum which does not cover.

@FBruzzesi
Copy link
Member

FBruzzesi commented Jun 19, 2024

Findings from Q20

  • There is a group by expression which is tagged as "complex", while in reality is a simple expression multiplied by a constant, namely .agg((nw.col("l_quantity").sum() * 0.5).alias("sum_quantity")). Therefore the apply step is fired (which takes long long time - total query time 15+ mins). Breaking it down in two steps:
    .agg((nw.col("l_quantity").sum()).alias("sum_quantity"))
    .with_columns(sum_quantity = nw.col("sum_quantity") * 0.5)
    brings the total time down to ~3-4.5 secs. Maybe we can add examples of how rewrite your query such that group-by aggregations are simple (e.g. mean, std, min, max, ...). can be done. Also, related to [Enh]: support fast, non-simple pandas grouped operations #299.
  • If using date instead of datetime object, pandas backend ends up with an error when filtering on a datetime column. Polars has no issues

@MarcoGorelli
Copy link
Member Author

Maybe we can add examples of how rewrite your query such that group-by aggregations are simple (e.g. mean, std, min, max, ...). can be done

yeah, great idea!

If using date instead of datetime object, pandas backend ends up with an error when filtering on a datetime column. Polars has no issues

does this also happen for pyarrow-backed pandas?

@FBruzzesi
Copy link
Member

does this also happen for pyarrow-backed pandas?

pyarrow backend has no issues!

@ugohuche
Copy link
Contributor

Hi @MarcoGorelli, query 13 has a method "not_" which isn't implemented in Narwhals yet.
Is it something we can add ?

@MarcoGorelli
Copy link
Member Author

I think we might as well just use ~ for that (e.g.. , instead of nw.col('a').not_(), use (~nw.col('a')))

@ugohuche
Copy link
Contributor

ugohuche commented Jun 25, 2024

I think we might as well just use ~ for that (e.g.. , instead of nw.col('a').not_(), use (~nw.col('a')))

I'll use that instead.
Should I open a new issue for the when.then.otherwise() methods ?

@MarcoGorelli
Copy link
Member Author

Thanks! I think there's already one open for that

@ugohuche
Copy link
Contributor

Oh okay, I didn't see it

@MarcoGorelli
Copy link
Member Author

thanks @brentomagic for reporting - your comment is very hard to read, I'd suggest using triple-backticks to format code https://docs.github.com/en/get-started/writing-on-github/working-with-advanced-formatting/creating-and-highlighting-code-blocks

@MarcoGorelli
Copy link
Member Author

here's a simpler reproducer for the issue:

def func(df_any):
    df = nw.from_native(df_any, eager_only=True)
    df = df.filter(nw.col('a')<1).group_by('a').agg(nw.col('b').sum().round(2).alias('c'))
    return nw.to_native(df)


print(func(pd.DataFrame({'a': [1,2,3], 'b': [4,3,2]})))

@ugohuche
Copy link
Contributor

ugohuche commented Jul 9, 2024

I'll implement Q17 and Q18 now that the blockers are resolved

@mistShard
Copy link
Contributor

I'll look at Q22

@FBruzzesi
Copy link
Member

Closing in favor of #863

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants