Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: collect for lazy-only libraries #1479

Open
MarcoGorelli opened this issue Dec 1, 2024 · 7 comments · May be fixed by #1734
Open

API: collect for lazy-only libraries #1479

MarcoGorelli opened this issue Dec 1, 2024 · 7 comments · May be fixed by #1734

Comments

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Dec 1, 2024

We currently have LazyFrame.collect:

  • for Polars.LazyFrame, this goes to Polars.DataFrames
  • for dask.DataFrames, it goes to pandas.DataFrame
  • when we get to PySpark / DuckDB / etc ... where will it go?

I think one solution could be to add extra keyword arguments to LazyFrame.collect to specify such ambiguous cases. Adding extra (non-required) keyword arguments would still be backwards-compatible

The signature could look like this:

def collect(self, duckdb_eager_frame: Literal['pandas', 'polars', 'pyarrow'] = 'pyarrow', pyspark_eager_frame: Literal['pandas'] = 'pandas') -> DataFrame[Any]:

This still leave us with the question about what to do for "extension" dataframes, i.e. if someone implements __narwhals_dataframe__ / __narwhals_lazyframe__ and extends Narwhals themself. We could require them to also implement a dunder method specifying which eager frame they want their lazy frame collected to, or which ones they support collecting into, and then have an extra kwarg in collect for that

@EdAbati any thoughts?

@EdAbati
Copy link
Collaborator

EdAbati commented Dec 2, 2024

I really like the idea of allowing collecting to different types of dataframes!

What about having collect to do something like this:

    def collect(self, eager_backend: Literal['pandas', 'polars', 'pyarrow'] | None = None) -> DataFrame[Any]:
        eager_backend = eager_backend or self._default_eager_backend
        if eager_backend == 'pandas':
            return self._collect_pandas()
        elif eager_backend == 'polars':
            ...

in this case I think we would not need to worry about any extra kwargs to be backward compatible or for "extensions" dataframes.
The "extensions" will need to implement the private method _default_eager_backend and at least one of the methods _collect_<backend>(). Any of the _collect_<backend>() will default to NotImplementedError

If we use the SparkLikeDataframe for both DuckDB and Spark, we could use the Implementation Enum to decide what to do. e.g.

    @property
    def _default_eager_backend(self) -> str:
        if self._implementation is Implementation.PYSPARK:
            return 'pandas'
        elif self._implementation is Implementation.DUCK_DB:
            return 'pyarrow'

    def _collect_polars(self) -> DataFrame[Any]:
        if self._implementation is Implementation.SPARK:
            raise NotImplementedError("Cannot collect a Spark DataFrame to Polars.")
        elif self._implementation is Implementation.DUCK_DB:
            ...

Thinking out loud here, I may have missed something :D what do you think?

@MarcoGorelli
Copy link
Member Author

I think someone may now want to specify the exact eager backend, but just say "at this point, we need to switch to eager computation" and, for each lazy backend, to switch to the corresponding eager backend

For Polars and Dask, the corresponding eager backend is obvious. But DuckDB, not so much

@EdAbati
Copy link
Collaborator

EdAbati commented Dec 2, 2024

Totally agree!👌

Maybe I didn't explain very well in the message above. My point was just about the number of kwargs and the methods that would be required for an 'extension' LazyDataframe implementation.

I think with only 1 kwarg (that define the eager dataframe type to collect to) we could cover everything. In other words, I miss the reason why we need more kwargs 😅

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Dec 2, 2024

I'm just thinking that we could have:

  • skbeer wants to do df.collect() and have duckdb go to pyarrow, and everything else goes to its native eager backend
  • skwine wants to do df.collect() and have duckdb go to pandas, and everything else goes to its native eager backend

how would they write collect such that this occurs?

or do we just not worry about it and make an opinionated choice about what the duckdb default is until someone complains?

@FBruzzesi
Copy link
Member

FBruzzesi commented Dec 3, 2024

Jumping in the conversation :)

I still believe that the solution 2 in #1042 is the most flexible to be able to pass along any other argument to the specific lazy backend. We didn't have yet, the need yet, but polars has 2 engines for collecting, plus a streaming mode. I envision this as someone writing agnostic code and being ready to receive whatever dataframe as input, they would want to specify all possible specific collect arguments.

For Spark, DuckDB (and others), I really like the idea of having eager_backend argument in the .collect(..) spec.

Another alternative could be to try-except which eager backend is available (with a custom prioritization/order). I am not a fan of this approach, but maybe someone is.

One issue I would like to bring up is the following: df.collect().<insert_eager_operation>.lazy() in expectation would result in the same type of the original one, yet if we start from X and collect to pandas, then .lazy() will return PandasLikeDataFrame. Currently we don't have a way to determine and achieve this

@EdAbati
Copy link
Collaborator

EdAbati commented Dec 3, 2024

Aaah now I get it!

Assuming we had the below (and the private methods I mentioned above)

class LazyFrame:
    ...
    def collect(self, eager_backend: Literal['pandas', 'polars', 'pyarrow'] | None = None, **kwargs) -> DataFrame[Any]:
        ...

We would be able to give a choice on the backend and pass any available kwargs.

how would they write collect such that this occurs?

If users wants to choose the eager backend based on the type of LazyFrame they get, we could suggest something like:

# skbeer
if is_duckdb_lazyframe(df):
    some_kwargs = {...}
    eager_df = df.collect(eager_backend="pyarrow", **some_kwargs)
else:
    eager_df = df.collect()

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Dec 3, 2024

If users wants to choose the eager backend based on the type of LazyFrame they get, we could suggest something like:

that's true, they could if/then their way of it

could we even say that they should do that for the kwargs? like, instead of polars_kwargs / duckdb_kwargs etc, people could do

if is_polars:
    df.to_native().collect(streaming=True, cse=False)

?
And the collect we have in Narwhals is just the default one (with eager_backend an optional argument)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants