Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC Perfect backwards compatibility policy #259

Closed
MarcoGorelli opened this issue Jun 6, 2024 · 17 comments · Fixed by #410
Closed

RFC Perfect backwards compatibility policy #259

MarcoGorelli opened this issue Jun 6, 2024 · 17 comments · Fixed by #410

Comments

@MarcoGorelli
Copy link
Member

Curious to hear thoughts on this one - especially from @baggiponte

Narwhals follows a strict subset of the Polars API. What happens if/when some part of this subset gets changed in Polars itself?

For example:

  • pre 1.0, Polars coalesces overlapping columns in left joins by default
  • post 1.0, it won't by default

So, what do we do in Narwhals? I'd suggest allowing people to specify which API version they want to follow in pyproject.toml, e.g.:

[tool.narwhals]
api_version = "0.20"

Then:

  • if somebody specifies api_version = "0.20" in pyproject.toml, then we promise to keep giving them the pre 1.0 behaviour indefinitely (coalesce by default in left joins)
  • if somebody specifies api_version = "1.0" in pyproject.toml (or if they don't specify it), then we give them the post 1.0 Polars behaviour (don't coalesce by default in left joins)

The idea is that if you implement your library using Narwhals, then so long as you specify an api_version in pyproject.toml, then your code should preserve the same behaviour indefinitely. You may occasionally need to bump your minimum Narwhals package version (so we can deal with working around upstream changes in pandas/Polars/etc), but you should never need to rewrite your dataframe logic, it'll be perfectly backwards-compatible.


Very keen to also get @FBruzzesi and @koaning 's take on this, as consumers. Would you appreciate / be open to specifying a Narwhals api_version in your pyproject.toml, in exchange for the promise that the Narwhals API you use won't change?


So long as we stick to fundamental dataframe operations, and a strict subset of Polars (rather than trying to reimplement everything...) then I think this should be doable!

@koaning
Copy link
Contributor

koaning commented Jun 6, 2024

That can totally work for me, but I am not 100% sure if my tools would really suffer that much from the problems that you describe specifically. A lot of my transformations are stateless things.

@MarcoGorelli
Copy link
Member Author

I think it might still affect even stateless things - suppose for example that Polars renamed shift to lag. scikit-lego uses nw.col(...).shift, so if we were to follow Polars in Narwhals and also rename it, we'd break your code

If, however, you pinned your Narwhals API version, then we'd take care of the renamed method within Narwhals, and you code would just keep working without you needing to update

@FBruzzesi
Copy link
Member

From a narwhals user perspective I think this is ideal scenario, leading to minimal/no changes.

I am mostly thinking and concerned about all the issues with narwhals internals rather than the benefits of this multiple polars api version support. Here a few that comes to mind:

  • Behaviour changes: it may be some manual effort but fairly straightforward to deal with when one know which version to support.
  • Methods renaming and/or deprecation: same as above
  • Keyword renaming and/or deprecation: I noticed that it is happening quite a bit now, before the 1.0 release. This feels significantly harder to deal with in my opinion.

On a similar consideration, how are pandas-like changes dealt with internally to be mapped to the same result? Is there any such situation?

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Jun 6, 2024

Yeah we already have to do this for some older pandas versions:

if parse_version(pd.__version__) < parse_version("2.2.0"): # pragma: no cover
result_complex = grouped.apply(func)
else:
result_complex = grouped.apply(func, include_groups=False)

Most of the Polars renamings deprecations happened for methods a bit on the fringe rather than the fundamental ones - but still, even then, we have to do deal with it sometimes:

def with_row_index(self, name: str = "index") -> Self:
if self._is_polars and parse_version(get_polars().__version__) < parse_version(
"0.20.4"
): # pragma: no cover
return self._from_dataframe(
self._dataframe.with_row_count(name),
)
return self._from_dataframe(
self._dataframe.with_row_index(name),
)

This feels significantly harder to deal with in my opinion.

true, but I think we can manage

From a narwhals user perspective I think this is ideal scenario, leading to minimal/no changes.

glad to read this 😄 going to see how hard it is to make this happen then, because API changes in pandas/polars do seem to be a major source of headaches for people trying to build tools on top of them


note on the # pragma: no cover: these lines are still covered by tests, but the CI job that reports coverage only has the most recent version and so doesn't go there. my experience with using codecov to combine coverage reports across CI jobs is that it's pretty flaky unfortunately..

@koaning
Copy link
Contributor

koaning commented Jun 6, 2024

I think it might still affect even stateless things - suppose for example that Polars renamed shift to lag. scikit-lego uses nw.col(...).shift, so if we were to follow Polars in Narwhals and also rename it, we'd break your code

Fair! But still, the proposed solution sounds totally fine to me.

The only awkward thing that comes to mind is if there is a conflicting dependency. Say, scikit-playtime uses v20 and scikit-lego uses v21. I don't know how realistic this scenario is, but I might imagine a headacke happening here. I don't think there's a way to have two narwhals versions in a single venv.

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Jun 6, 2024

I think that could still be fine - for code called from scikit-playtime, Narwhals would use the v20 API, and for code called from scikit-lego, it would use the v21 API

(side-note, and I know it's not the point of your comment, but there won't be a v0.20, because the next Polars version is 1.0! 🥳 🍾 🌶️ )


for code called from scikit-playtime, Narwhals would use the v20 API, and for code called from scikit-lego, it would use the v21 API

After having tried this out, I'm not even sure it's possible, at least not without extra perf overhead...


an alternative would be to let people specify the api version when they call nw.from_native. I.e. a consumer could choose to call:

df = nw.from_native(df_any, api_version='0.20')
answer = df.with_columns(nw.col(col).shift(-lag).alias(col + str(lag)) for col in cols for lag in lags)
return nw.to_native(answer)

🤔 not very user-friendly to have to pass api_version to every from_native call... @baggiponte perhaps might have a better suggestion?


Another suggestion: consumers could do:

from narwhals import StableNarwhals

nw = StableNarwhals('0.20.0')

def func_1(df_any):
    df = nw.from_native(df_any)
    df = df.group_by(...).agg(...)
    return nw.to_native(df)

def func_2(df_any):
    df = nw.from_native(df_any)
    df = df.with_column(...)
    return nw.to_native(df)

Like this, they can choose to only define their Narwhals API version once, and then they reuse the StableNarwhals instance in each function, and they know that their API is going to be perfectly backwards-compatible

I think this addresses all the issues noted above?

Well, other than the maintenance complexity in Narwhals, but that's exactly what this projects exists to address 🤠

@glemaitre
Copy link

Just jumping in the discussion ;).

Does having the info only in the pyproject.toml and thus metadata of the package kind of brittle: is there some settings that this info can be completely overlooked depending on the way the package is installed?

nw = StableNarwhals('0.20.0')

So here the solution is more internal to the source code and thus more explicit. In scikit-learn, it makes me think about the config_context or the set_config that will set some global setting.

with nw.config_context(api_version="0.20"):
    def func_1(df_any):
        ...

or without a context manager:

nw.set_config(api_version="0.20")

def func_1(df_any):
    ...

I can imagine a third-party library to call the set_config into the my_project/__init__.py to set the api version.

@MarcoGorelli
Copy link
Member Author

Thanks @glemaitre for your input! Agree that pyproject.toml is too brittle, once I tried it out I realised it couldn't work

But, I think we're getting there 💪 :

  • from_native could take a non-required api_version argument

  • feat: add narwhalify decorator #266 introduces a narwhalify decorator, which would also get an api_version argument

  • we then introduce narwhals.StableAPI, so that a consumer (suppose, for example, a package named scikit-beer 🍺 ) could put the following in their skbeer/__init__.py:

    from narwhals import StableAPI
    
    nw = StableAPI('0.20')

    and then, in their other modules (say, skbeer/brew.py):

    from skbeer import nw
    
    @nw.narwhalify
    def func(df):
        return df.filter((nw.col('a') + nw.col('b')) % 3 == 0)

    Like this, because nw was created using StableAPI, then all calls to narwhalify / from_native will happen with the api version which they pinned

    Like this, they can develop with confidence, knowing that the code they write in 2024 will still run in 2030

I'm liking the look for this more-and-more..


BTW, if anyone here's interested, we have our first Narwhals community call today #264 📆

@baggiponte
Copy link

I'm late to the party! I basically agree with Guillaume. While it's cool to have the StableAPI (though I'd like to put forward the motion to call it Narwhalificator), I think a global config that can be used as a decorator/context manager makes sense. It's low effort and can be used locally in an efficient manner. Also it does not force you to import the nw object you created somewhere in your library.

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Jun 8, 2024

Thanks @baggiponte ! I also like the look of just putting

nw.set_config(api_version='0.20')

in skbeer/__init__.py, I just don't know how that works in practice. How would you implement this, so that the api_version that's set in skbeer/__init__.py only takes effect for calls to Narwhals made from within skbeer?

It needs to be possible that:

  • skbeer uses Narwhals with api_version 0.20
  • skwine uses Narwhals with api_version 1.0
  • a user has both installed, and nothing breaks

If both skbeer and skwine pin their own Narwhals version using StableAPI / Narwhals, then this works. If they set a global config, then I'm not sure I see how it works any more, but I might be missing something

@baggiponte
Copy link

It needs to be possible that:

  • skbeer uses Narwhals with api_version 0.20
  • skwine uses Narwhals with api_version 1.0
  • a user has both installed, and nothing breaks

That's something I haven't considered! @glemaitre how does it work with sklearn?

@glemaitre
Copy link

@glemaitre how does it work with sklearn?

We don't implement such thing so I did not think about the problem :)

@glemaitre
Copy link

a user has both installed, and nothing breaks

So somehow, you want to register the api_version to a package level. The only thing that I can think of is to have a global dict, where you either use the default or then register a my_package_api_version to not overwrite.

But devil is in the details, maybe implementing this is just to much of a spaghetti code.

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Jun 23, 2024

Thanks everyone for your inputs 🙏

I've tried implementing this in #331 . It's not finished, but it works well enough that I can test it out in scikit-lego and see what the impact is

TL;DR: If you use narwhals.StableAPI, then we promise that any public method you use will never intentionally break (and if it breaks unintentionally, we'll fix it with upmost priority)

Here is what the diff looks like (all tests pass, and this is backwards-compatible - you have the option to use narwhals.StableAPI, but you don't have to, import narwhals as nw will continue to work as it does today), it's just:

  1. add a sklego/_narwhals.py file which instantiates narwhals.StableAPI
  2. import that StableAPI object, instead of import narwhals as nw, in the rest of the codebase
diff --git a/sklego/_narwhals.py b/sklego/_narwhals.py
new file mode 100644
index 0000000..5286f14
--- /dev/null
+++ b/sklego/_narwhals.py
@@ -0,0 +1,3 @@
+from narwhals import StableAPI
+
+nw = StableAPI('0.20')
diff --git a/sklego/linear_model.py b/sklego/linear_model.py
index 99f689b..cc896cc 100644
--- a/sklego/linear_model.py
+++ b/sklego/linear_model.py
@@ -9,7 +9,7 @@ from abc import ABC, abstractmethod
 from inspect import signature
 from warnings import warn
 
-import narwhals as nw
+from sklego._narwhals import nw
 import numpy as np
 from scipy.optimize import minimize
 from scipy.special._ufuncs import expit

[...]

I think this is fairly minimal, and gives a way to promise consumers: code you write today will keep working x years from now.

This is inspired by Rust editions, so there is a working precedent for doing this, just not (as far as I'm aware) in the Python space

Any feedback on the design? Would you use this? Does any part of it risk being confusing? 🙏

@MarcoGorelli MarcoGorelli changed the title Perfect backwards compatibility policy RFC Perfect backwards compatibility policy Jun 23, 2024
@jcheng5
Copy link

jcheng5 commented Jun 26, 2024

@MarcoGorelli Have you considered having different classes for the different editions? StableAPIv1, StableAPIv2...

Without that, I don't know how static types can reflect things like Expr.cum_sum() in 1.0 becoming Expr.cumulative_sum() in 2.0.

(Also, hello from the Shiny team--we're thinking about using Narwhals for our various dataframe related features!)

@MarcoGorelli
Copy link
Member Author

Thanks @jcheng5 for your input! Indeed, static types are in issue in the proposal I'd written previously

Something I'm trying out is

import narwhals.stable.v0_20 as nw
# import narwhals.stable.v1_0 as nw

in which things seem to be behaving a lot more nicely

Also, that's amazing that you're considering Narwhals for use in Shiny 🙏 ! Please do let me know if there's anything I can do on the Narwhals side that would make it easier for (e.g. any missing features you need?), and feel free to book time on https://calendly.com/marcogorelli if that would be easier over a call

@MarcoGorelli
Copy link
Member Author

MarcoGorelli commented Jul 6, 2024

Have you considered having different classes for the different editions? StableAPIv1, StableAPIv2...

Alright, this took some effort, but narwhals.stable.v1 is now implemented in #410

There's several check to ensure it stays up-to-date with the main namespace

I've also sneakily added Expr._taxicab_norm, so we can make as-realistic a test as possible that, in narwhals.stable.v1_0, we can rename it to Expr._l1_norm with this affecting the main namespace

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants