[Bug]: Different behaviour on polars and pandas with dates: AttributeError: 'datetime.datetime' object has no attribute 'floor' #1660

Open
sergiocalde94 opened this issue Dec 26, 2024 · 3 comments

Comments

@sergiocalde94

Describe the bug

There is an inconsistency in behavior when using narwhals.stable.v1 with Polars and Pandas DataFrames for the max() operation followed by floor("D") on a datetime column.

As far as I understand, it should be backend-agnostic, but maybe I'm doing something wrong, since it's my first day using narwhals 😬

Steps or code to reproduce the bug

Setup:

import polars as pl
import narwhals.stable.v1 as nw
from datetime import datetime

start_train_date = "2024-01-01"
end_train_date = "2024-03-05"

df_polars = (
    pl.DataFrame({
        "application_started_at": [
            datetime.strptime("2024-01-01", "%Y-%m-%d"),
            datetime.strptime("2024-02-01", "%Y-%m-%d"),
            datetime.strptime("2024-03-01", "%Y-%m-%d"),
            datetime.strptime("2024-04-01", "%Y-%m-%d"),
            datetime.strptime("2024-04-06", "%Y-%m-%d")
        ]
    })
)

df_native = nw.from_native(df_polars)

Error:

df_native["application_started_at"].max().floor("D")

No error if using pandas:

df_pandas = df_polars.to_pandas()

df_native = nw.from_native(df_pandas)

df_native["application_started_at"].max().floor("D")

Expected results

Timestamp('2024-04-06 00:00:00')?

Actual results

AttributeError: 'datetime.datetime' object has no attribute 'floor'

Please run narwhals.show_versions() and enter the output below.

System:
    python: 3.11.0rc1 (main, Aug 12 2022, 10:02:14) [GCC 11.2.0]
executable: /local_disk0/.ephemeral_nfs/envs/pythonEnv-9aa03ec4-6801-4d0a-80cc-2457b7007489/bin/python
   machine: Linux-5.15.0-1075-azure-x86_64-with-glibc2.35

Python dependencies:
     narwhals: 1.19.1
       pandas: 1.5.3
       polars: 1.17.1
         cudf: 
        modin: 
      pyarrow: 14.0.1
        numpy: 1.24.4

Relevant log output

No response

@FBruzzesi
Member

Hey @sergiocalde94 thanks for reporting the issue.
The output from the .max() operation is:

  • pd.Timestamp instance for pandas case
  • datetime.datetime for polars case

As we aim to replicate polars behaviour, you should not expect to be able to use .floor() on the output.

As a temporary workaround, you can use nw.to_py_scalar(df_native["application_started_at"].max()), which converts the result to a plain Python scalar (here a datetime.datetime) for both backends.

@MarcoGorelli
Member

MarcoGorelli commented Dec 26, 2024

Thanks for the report!

I'd say this is expected, but that we need a documentation page about scalars.

In this case, as Francesco noted:

  • Polars is returning a datetime.datetime object
  • pandas is returning a subclass of it (pd.Timestamp)

.floor is a pandas-specific operation and lies outside the Narwhals API.

However, datetime.datetime.date is part of the Python standard library, so you could just use that. Then, your code really will be library-agnostic:

import polars as pl
import pandas as pd
import narwhals.stable.v1 as nw
from datetime import datetime

start_train_date = "2024-01-01"
end_train_date = "2024-03-05"

df_polars = pl.DataFrame(
    {
        "application_started_at": [
            datetime.strptime("2024-01-01", "%Y-%m-%d"),
            datetime.strptime("2024-02-01", "%Y-%m-%d"),
            datetime.strptime("2024-03-01", "%Y-%m-%d"),
            datetime.strptime("2024-04-01", "%Y-%m-%d"),
            datetime.strptime("2024-04-06", "%Y-%m-%d"),
        ]
    }
)
df = nw.from_native(df_polars)
print(df["application_started_at"].max().date())
df_pandas = df_polars.to_pandas()
df = nw.from_native(df_pandas)
print(df["application_started_at"].max().date())

This outputs

2024-04-06
2024-04-06

It also works for PyArrow:

df_pyarrow = df_polars.to_arrow()
df = nw.from_native(df_pyarrow)
print(df["application_started_at"].max().date())
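
If a datetime truncated to midnight is needed rather than a date, one possible standard-library approach is datetime.replace. The sketch below is an assumption about what the reporter wanted from floor("D"), and floor_to_day is a hypothetical helper name, not part of Narwhals:

```python
from datetime import datetime


def floor_to_day(value: datetime) -> datetime:
    """Truncate a datetime to midnight of the same day (hypothetical helper)."""
    # replace() is a standard-library datetime method, so it does not depend
    # on any dataframe backend.
    return value.replace(hour=0, minute=0, second=0, microsecond=0)


latest = datetime(2024, 4, 6, 13, 45, 30)
floored = floor_to_day(latest)
print(floored)  # 2024-04-06 00:00:00
```

Because pd.Timestamp subclasses datetime.datetime, the same call is expected to work on the pandas result as well, though that is not verified here.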

@sergiocalde94
Author

Excellent solutions, thanks 🙏 I hadn't noticed nw.to_py_scalar!

Also, if .floor is not part of the narwhals API, it feels a bit strange to me that I can call it at all. Once I convert back to native, it is perfectly fine in my head to get one type with polars and another with pandas, since that is the reality (pd.Timestamp vs datetime.datetime). But inside a function where I am using narwhals, I expect everything to be agnostic as long as I stay within narwhals and don't cast anything back to native.

It's a bit confusing that I can use .date for both but not .floor, and that this is something you need to know beforehand. I don't know whether narwhals could manage this so that the user can only apply the methods narwhals accepts as agnostic, which would make it easier to write this kind of library or script.

I don't know if I am saying stupid things but that was my first thought when using this library 😃.
