API: collect for lazy-only libraries #1479
Comments
I really like the idea of allowing collecting to different types of dataframes! What about having

    def collect(self, eager_backend: Literal['pandas', 'polars', 'pyarrow'] | None = None) -> DataFrame[Any]:
        eager_backend = eager_backend or self._default_eager_backend
        if eager_backend == 'pandas':
            return self._collect_pandas()
        elif eager_backend == 'polars':
            ...

In this case I think we would not need to worry about any extra kwargs to be backward compatible or for "extensions" dataframes. If we use the

    @property
    def _default_eager_backend(self) -> str:
        if self._implementation is Implementation.PYSPARK:
            return 'pandas'
        elif self._implementation is Implementation.DUCK_DB:
            return 'pyarrow'

    def _collect_polars(self) -> DataFrame[Any]:
        if self._implementation is Implementation.PYSPARK:
            raise NotImplementedError("Cannot collect a Spark DataFrame to Polars.")
        elif self._implementation is Implementation.DUCK_DB:
            ...

Thinking out loud here, I may have missed something :D what do you think?
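To make the dispatch idea in this comment concrete, here is a minimal, self-contained sketch. It is not Narwhals code: `ToyLazyFrame`, the toy `Implementation` enum, and the collectors that return a label instead of a real dataframe are all assumptions made for illustration. Only the default mapping (PySpark to pandas, DuckDB to pyarrow) and the error message come from the comment.

```python
from __future__ import annotations

from enum import Enum, auto
from typing import Callable, Literal

EagerBackend = Literal["pandas", "polars", "pyarrow"]


class Implementation(Enum):
    # Toy stand-in for the library's implementation enum (assumption).
    PYSPARK = auto()
    DUCK_DB = auto()


class ToyLazyFrame:
    """Hypothetical lazy frame; collectors return a label, not real data."""

    def __init__(self, implementation: Implementation) -> None:
        self._implementation = implementation

    @property
    def _default_eager_backend(self) -> EagerBackend:
        # Defaults as proposed in the comment above.
        if self._implementation is Implementation.PYSPARK:
            return "pandas"
        return "pyarrow"

    def collect(self, eager_backend: EagerBackend | None = None) -> str:
        # Fall back to the per-implementation default when nothing is requested.
        backend = eager_backend or self._default_eager_backend
        collectors: dict[str, Callable[[], str]] = {
            "pandas": self._collect_pandas,
            "polars": self._collect_polars,
            "pyarrow": self._collect_pyarrow,
        }
        if backend not in collectors:
            raise ValueError(f"Unknown eager backend: {backend!r}")
        return collectors[backend]()

    def _collect_pandas(self) -> str:
        return "pandas"

    def _collect_pyarrow(self) -> str:
        return "pyarrow"

    def _collect_polars(self) -> str:
        if self._implementation is Implementation.PYSPARK:
            raise NotImplementedError("Cannot collect a Spark DataFrame to Polars.")
        return "polars"
```

With this shape, `ToyLazyFrame(Implementation.DUCK_DB).collect()` uses the pyarrow default, while `collect("polars")` overrides it; unsupported combinations fail inside the private collector rather than in `collect` itself.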
I think someone may not want to specify the exact eager backend, but just say "at this point, we need to switch to eager computation" and, for each lazy backend, switch to the corresponding eager backend. For Polars and Dask, the corresponding eager backend is obvious. But for DuckDB, not so much.
Totally agree!👌 Maybe I didn't explain it very well in the message above. My point was just about the number of kwargs and the methods that would be required for an 'extension' LazyFrame implementation. I think with only 1 kwarg (that defines the eager dataframe type to collect to) we could cover everything. In other words, I don't see why we would need more kwargs 😅
I'm just thinking that we could have:
how would they write or do we just not worry about it and make an opinionated choice about what the DuckDB default is until someone complains?
Jumping in the conversation :) I still believe that solution 2 in #1042 is the most flexible, since it lets us pass any other argument along to the specific lazy backend. We haven't needed this yet, but Polars has 2 engines for collecting, plus a streaming mode. I envision this as someone writing agnostic code and being ready to receive whatever dataframe as input: they would want to specify all possible backend-specific collect arguments.
For Spark, DuckDB (and others), I really like the idea of having
Another alternative could be to try-except which eager backend is available (with a custom prioritization/order). I am not a fan of this approach, but maybe someone is.
One issue I would like to bring up is the following:
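For reference, the availability-based alternative mentioned in this comment (probe the eager backends in a priority order and take the first one that is installed) could be sketched roughly as below. The function name, the default priority order, and the injectable `is_installed` hook are assumptions for illustration; the check itself relies only on the standard library's `importlib.util.find_spec`.

```python
from __future__ import annotations

from importlib.util import find_spec
from typing import Callable, Sequence

# Assumed priority order; the thread does not settle on one.
DEFAULT_PRIORITY: tuple[str, ...] = ("polars", "pyarrow", "pandas")


def first_available_backend(
    priority: Sequence[str] = DEFAULT_PRIORITY,
    is_installed: Callable[[str], bool] | None = None,
) -> str:
    """Return the first backend in `priority` that can be imported."""
    # By default, ask the import system whether the top-level package exists.
    check = is_installed or (lambda name: find_spec(name) is not None)
    for name in priority:
        if check(name):
            return name
    raise ModuleNotFoundError(
        f"None of the eager backends {tuple(priority)} are installed."
    )
```

The downside raised in the comment still applies: the result depends on the environment, so the same agnostic code may return different dataframe types on different machines.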
Aaah now I get it! Assuming we had the below (and the private methods I mentioned above)

    class LazyFrame:
        ...
        def collect(self, eager_backend: Literal['pandas', 'polars', 'pyarrow'] | None = None, **kwargs) -> DataFrame[Any]:
            ...

We would be able to give a choice on the backend and pass any available kwargs.
If users want to choose the eager backend based on the type of LazyFrame they get, we could suggest something like:

    # skbeer
    if is_duckdb_lazyframe(df):
        some_kwargs = {...}
        eager_df = df.collect(eager_backend="pyarrow", **some_kwargs)
    else:
        eager_df = df.collect()
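A runnable toy version of this caller-side pattern is sketched below. Everything here is a stand-in: `ToyLazyFrame` only records what `collect` was asked to do, `is_duckdb_lazyframe` is a hypothetical helper written against the toy class, and `hypothetical_option` is a placeholder kwarg, not a real DuckDB or Narwhals argument.

```python
from __future__ import annotations

from typing import Any


class ToyLazyFrame:
    """Hypothetical lazy frame that records how `collect` was called."""

    def __init__(self, backend_name: str, default_eager: str) -> None:
        self.backend_name = backend_name
        self._default_eager = default_eager

    def collect(self, eager_backend: str | None = None, **kwargs: Any) -> dict[str, Any]:
        # A real implementation would forward **kwargs to the backend's own
        # collect/materialise call; here we only echo them back.
        return {
            "eager_backend": eager_backend or self._default_eager,
            "kwargs": kwargs,
        }


def is_duckdb_lazyframe(df: ToyLazyFrame) -> bool:
    # Hypothetical helper for the toy class above.
    return df.backend_name == "duckdb"


def to_eager(df: ToyLazyFrame) -> dict[str, Any]:
    # Library-author code: branch only where a backend needs special handling.
    if is_duckdb_lazyframe(df):
        some_kwargs = {"hypothetical_option": True}  # placeholder kwarg
        return df.collect(eager_backend="pyarrow", **some_kwargs)
    return df.collect()
```

Calling `to_eager(ToyLazyFrame("duckdb", "pyarrow"))` takes the explicit branch, while any other frame falls through to its default eager backend with no extra arguments.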
That's true, they could if/then their way out of it. Could we even say that they should do that for the kwargs? Like, instead of

    if is_polars:
        df.to_native().collect(streaming=True, cse=False)

?
We currently have LazyFrame.collect:
I think one solution could be to add extra keyword arguments to LazyFrame.collect to specify such ambiguous cases. Adding extra (non-required) keyword arguments would still be backwards-compatible.
The signature could look like this:
This still leaves us with the question of what to do for "extension" dataframes, i.e. if someone implements __narwhals_dataframe__/__narwhals_lazyframe__ and extends Narwhals themselves. We could require them to also implement a dunder method specifying which eager frame they want their lazy frame collected to, or which ones they support collecting into, and then have an extra kwarg in collect for that.
@EdAbati any thoughts?
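One way the extension-dataframe idea could look is sketched below. The dunder name `__narwhals_supported_eager_backends__` is purely hypothetical (no such protocol is confirmed anywhere in this thread), as are `resolve_eager_backend` and `MyExtensionLazyFrame`. The sketch assumes the hook returns backends in order of preference, so the first entry doubles as the default.

```python
from __future__ import annotations

from typing import Sequence

# Hypothetical hook name, invented for this sketch.
HOOK = "__narwhals_supported_eager_backends__"


def resolve_eager_backend(lazy_frame: object, requested: str | None = None) -> str:
    """Pick the eager backend for an extension lazy frame."""
    hook = getattr(lazy_frame, HOOK, None)
    if hook is None:
        raise TypeError(f"{type(lazy_frame).__name__} does not implement {HOOK}")
    supported = tuple(hook())
    if not supported:
        raise ValueError("Extension declares no supported eager backends.")
    if requested is None:
        # Assumption: the first declared backend is the extension's default.
        return supported[0]
    if requested not in supported:
        raise ValueError(
            f"Cannot collect into {requested!r}; supported: {supported}"
        )
    return requested


class MyExtensionLazyFrame:
    """Toy extension frame declaring what it can be collected into."""

    def __narwhals_supported_eager_backends__(self) -> Sequence[str]:
        return ("pyarrow", "pandas")
```

A `collect` implementation could then call `resolve_eager_backend(self, eager_backend)` before dispatching, which keeps the extra kwarg optional for built-in backends and explicit for extensions.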