Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

enh: let Enum take arguments, allow it in construction #1541

Open
MarcoGorelli opened this issue Dec 8, 2024 · 5 comments
Open

enh: let Enum take arguments, allow it in construction #1541

MarcoGorelli opened this issue Dec 8, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@MarcoGorelli
Copy link
Member

We should be able to do

s = nw.new_series('foo', ['a', 'b', 'c'], dtype=nw.Enum(['a', 'b', 'c', ''d]), native_namespace=pd)

and have it create a Series with dtype Categorical and categories 'a', 'b', 'c', 'd'

@MarcoGorelli MarcoGorelli added the enhancement New feature or request label Dec 8, 2024
@camriddell
Copy link
Contributor

@MarcoGorelli I can take this one on

@MarcoGorelli
Copy link
Member Author

awesome! please do let us know if anything trips you up

@camriddell
Copy link
Contributor

@MarcoGorelli let me know if the following question is a bit out of scope and I should focus only on adding some extra code to patch narwhals/functions.py:_new_series_impl given the native dtype, instead of the broader handling of Categoricals/Enums
internally.


In the following code snippet what should the expected dtype of column "a" be?

 dfpd = pd.DataFrame({"a": ["a", "b"]}, dtype="category")
 df = nw.from_native(dfpd).select(nw.col("a").cast(nw.Enum(["a", "b"])))

The ambiguous portion is that internally pd.Category maps to nw.Category, then (to my understanding) the conversion is applied internally such that when we specify casting from nw.Categorynw.Enum(...), internally pandas would move from pd.CategoricalDtype to another pd.CategoricalDtype, but then narwhals reinterprets the new pd.CategoricalDtype to become a nw.Category.

Given the above consideration, would it be best to:

a. Converting to an nw.Enum from pandas native space is not allowed and will raise an error (current behavior). However we can accommodate a construction with nw.new_series via a conditional branch.
b. Converting to an nw.Enum from pandas native space creates pd.CategoricalDtype(..., ordered=True), then narwhals uses that ordered=True flag to signify whether a pd.CategoricalDtype should result in nw.Categroical or nw.Enum. The ordered flag here actually works quite nicely as polars.Enum respects the ordering of the values inputted.
c. The code snippet above returns a schema of {'a': nw.Categorical} which may surprise end users since there was an explicit request to convert to nw.Enum(...).

I have a working codebase for option B and can open a PR if you want to take a look for some more fine-grained discussion.

@MarcoGorelli
Copy link
Member Author

thanks for looking at this so closely - i'm gonna have to think about this one, not 100% sure atm 😄

@stanmart
Copy link

Just chiming in here as I'd find this features super useful. My main use case would be having a narwhals-land alternative to pandas's Series.cat.set_categories (and, by extension, a way to implement the functionality of remove_categories, add_categories, reorder_categories, etc.). For that purpose it would be super convenient if

 df = nw.from_native(dfpd).select(nw.col("a").cast(nw.Enum(["a", "b"])))

would work. I don't have a preference between options b. and c., but I'd if we could simply cast pandas series without manually creating a new one via nw.new_series :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants