API: `collect` for lazy-only libraries #1479

MarcoGorelli · 2024-12-01T11:57:39Z

We currently have LazyFrame.collect:

for Polars.LazyFrame, this goes to Polars.DataFrames
for dask.DataFrames, it goes to pandas.DataFrame
when we get to PySpark / DuckDB / etc ... where will it go?

I think one solution could be to add extra keyword arguments to LazyFrame.collect to specify such ambiguous cases. Adding extra (non-required) keyword arguments would still be backwards-compatible

The signature could look like this:

def collect(self, duckdb_eager_frame: Literal['pandas', 'polars', 'pyarrow'] = 'pyarrow', pyspark_eager_frame: Literal['pandas'] = 'pandas') -> DataFrame[Any]:

This still leave us with the question about what to do for "extension" dataframes, i.e. if someone implements __narwhals_dataframe__ / __narwhals_lazyframe__ and extends Narwhals themself. We could require them to also implement a dunder method specifying which eager frame they want their lazy frame collected to, or which ones they support collecting into, and then have an extra kwarg in collect for that

@EdAbati any thoughts?

The text was updated successfully, but these errors were encountered:

EdAbati · 2024-12-02T08:03:32Z

I really like the idea of allowing collecting to different types of dataframes!

What about having collect to do something like this:

    def collect(self, eager_backend: Literal['pandas', 'polars', 'pyarrow'] | None = None) -> DataFrame[Any]:
        eager_backend = eager_backend or self._default_eager_backend
        if eager_backend == 'pandas':
            return self._collect_pandas()
        elif eager_backend == 'polars':
            ...

in this case I think we would not need to worry about any extra kwargs to be backward compatible or for "extensions" dataframes.
The "extensions" will need to implement the private method _default_eager_backend and at least one of the methods _collect_<backend>(). Any of the _collect_<backend>() will default to NotImplementedError

If we use the SparkLikeDataframe for both DuckDB and Spark, we could use the Implementation Enum to decide what to do. e.g.

    @property
    def _default_eager_backend(self) -> str:
        if self._implementation is Implementation.PYSPARK:
            return 'pandas'
        elif self._implementation is Implementation.DUCK_DB:
            return 'pyarrow'

    def _collect_polars(self) -> DataFrame[Any]:
        if self._implementation is Implementation.SPARK:
            raise NotImplementedError("Cannot collect a Spark DataFrame to Polars.")
        elif self._implementation is Implementation.DUCK_DB:
            ...

Thinking out loud here, I may have missed something :D what do you think?

MarcoGorelli · 2024-12-02T08:24:37Z

I think someone may now want to specify the exact eager backend, but just say "at this point, we need to switch to eager computation" and, for each lazy backend, to switch to the corresponding eager backend

For Polars and Dask, the corresponding eager backend is obvious. But DuckDB, not so much

EdAbati · 2024-12-02T18:44:08Z

Totally agree!👌

Maybe I didn't explain very well in the message above. My point was just about the number of kwargs and the methods that would be required for an 'extension' LazyDataframe implementation.

I think with only 1 kwarg (that define the eager dataframe type to collect to) we could cover everything. In other words, I miss the reason why we need more kwargs 😅

MarcoGorelli · 2024-12-02T21:57:46Z

I'm just thinking that we could have:

skbeer wants to do df.collect() and have duckdb go to pyarrow, and everything else goes to its native eager backend
skwine wants to do df.collect() and have duckdb go to pandas, and everything else goes to its native eager backend

how would they write collect such that this occurs?

or do we just not worry about it and make an opinionated choice about what the duckdb default is until someone complains?

FBruzzesi · 2024-12-03T04:07:01Z

Jumping in the conversation :)

I still believe that the solution 2 in #1042 is the most flexible to be able to pass along any other argument to the specific lazy backend. We didn't have yet, the need yet, but polars has 2 engines for collecting, plus a streaming mode. I envision this as someone writing agnostic code and being ready to receive whatever dataframe as input, they would want to specify all possible specific collect arguments.

For Spark, DuckDB (and others), I really like the idea of having eager_backend argument in the .collect(..) spec.

Another alternative could be to try-except which eager backend is available (with a custom prioritization/order). I am not a fan of this approach, but maybe someone is.

One issue I would like to bring up is the following: df.collect().<insert_eager_operation>.lazy() in expectation would result in the same type of the original one, yet if we start from X and collect to pandas, then .lazy() will return PandasLikeDataFrame. Currently we don't have a way to determine and achieve this

EdAbati · 2024-12-03T07:15:53Z

Aaah now I get it!

Assuming we had the below (and the private methods I mentioned above)

class LazyFrame:
    ...
    def collect(self, eager_backend: Literal['pandas', 'polars', 'pyarrow'] | None = None, **kwargs) -> DataFrame[Any]:
        ...

We would be able to give a choice on the backend and pass any available kwargs.

how would they write collect such that this occurs?

If users wants to choose the eager backend based on the type of LazyFrame they get, we could suggest something like:

# skbeer
if is_duckdb_lazyframe(df):
    some_kwargs = {...}
    eager_df = df.collect(eager_backend="pyarrow", **some_kwargs)
else:
    eager_df = df.collect()

MarcoGorelli · 2024-12-03T13:07:16Z

If users wants to choose the eager backend based on the type of LazyFrame they get, we could suggest something like:

that's true, they could if/then their way of it

could we even say that they should do that for the kwargs? like, instead of polars_kwargs / duckdb_kwargs etc, people could do

if is_polars:
    df.to_native().collect(streaming=True, cse=False)

?
And the collect we have in Narwhals is just the default one (with eager_backend an optional argument)?

MarcoGorelli added api design needs discussion labels Dec 1, 2024

FBruzzesi mentioned this issue Dec 1, 2024

[Enh]: Allow for kwargs in LazyFrame.collect #1042

Closed

This was referenced Jan 5, 2025

feat: Implement partial "lazy" support for DuckDB (even with this PR, DuckDB support is work-in-progress!) #1725

Merged

RFC, feat: LazyFrame.collect kwargs #1734

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API: `collect` for lazy-only libraries #1479

API: `collect` for lazy-only libraries #1479

MarcoGorelli commented Dec 1, 2024 •

edited

Loading

EdAbati commented Dec 2, 2024 •

edited

Loading

MarcoGorelli commented Dec 2, 2024

EdAbati commented Dec 2, 2024

MarcoGorelli commented Dec 2, 2024 •

edited

Loading

FBruzzesi commented Dec 3, 2024 •

edited

Loading

EdAbati commented Dec 3, 2024 •

edited

Loading

MarcoGorelli commented Dec 3, 2024 •

edited

Loading

API: collect for lazy-only libraries #1479

API: collect for lazy-only libraries #1479

Comments

MarcoGorelli commented Dec 1, 2024 • edited Loading

EdAbati commented Dec 2, 2024 • edited Loading

MarcoGorelli commented Dec 2, 2024

EdAbati commented Dec 2, 2024

MarcoGorelli commented Dec 2, 2024 • edited Loading

FBruzzesi commented Dec 3, 2024 • edited Loading

EdAbati commented Dec 3, 2024 • edited Loading

MarcoGorelli commented Dec 3, 2024 • edited Loading

API: `collect` for lazy-only libraries #1479

API: `collect` for lazy-only libraries #1479

MarcoGorelli commented Dec 1, 2024 •

edited

Loading

EdAbati commented Dec 2, 2024 •

edited

Loading

MarcoGorelli commented Dec 2, 2024 •

edited

Loading

FBruzzesi commented Dec 3, 2024 •

edited

Loading

EdAbati commented Dec 3, 2024 •

edited

Loading

MarcoGorelli commented Dec 3, 2024 •

edited

Loading