[Enh]: add `cov` to `Expr` and `Series` #1607

e10v · 2024-12-17T10:53:15Z

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

Please describe the purpose of the new feature or describe the problem to solve.

In tea-tasting, there is a need to calculate variance and covariance. There is a workaround: here and here. But, apparently, it's not optimal for pandas dataframes. Currently this warning is suppressed in tea-tasting.

I see that variance is already ~~in progress~~ added in #1603 ❤️ Could you please consider adding covariance as well? It's not critical: tea-tasting works without it. But it would make calculations faster for users who prefer pandas-like dataframes.

Not all dataframes have a covariance function. For example, pyarrow doesn't have it (though it's still possible to calculate it the same way as tea-tasting does). In this case it would be very useful to have a method similar to Ibis has_operation (example usage). Could you consider adding it as well? I can create a separate feature request if needed.

Suggest a solution if possible.

No response

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

No response

MarcoGorelli · 2024-12-17T11:09:18Z

thanks @e10v for the request!

and sure, a separate has_operation issue might be a good idea

FBruzzesi · 2024-12-17T11:11:16Z

Thanks for your request @e10v !

At first sight, it seems quite hard to support the pandas-native .cov operation in group-bys so that it matches the Polars cov expression.
Possibly there is a way to stay out of the "complex aggregation space" even without implementing such a function. I will take a deeper look and come back to this.

e10v · 2024-12-17T11:16:28Z

@FBruzzesi thank you for the quick response! Just in case: there is a Series.cov in pandas: https://pandas.pydata.org/docs/reference/api/pandas.Series.cov.html

e10v · 2024-12-17T11:24:29Z

@MarcoGorelli thank you for accepting the request!

It's surely not urgent, at least for me. So even low priority would be good, considering the possible complexity of a solution.

I've also created a separate feature request, #1610, for a has_operation-like method.

FBruzzesi · 2024-12-17T12:05:05Z

It's not too pleasing, but the following solution does not incur complex aggregations, so it will work for pyarrow as well:

_GRP_SIZE = "_group_size__"

data = data.with_columns(**{
    _DEMEAN.format(col): _demean_nw_col(col, group_col)
    for col in covar_cols
})

# Precompute the product of the demeaned left and right columns
data = data.with_columns(**{
    _COV.format(left, right): (
        nw.col(_DEMEAN.format(left)) * nw.col(_DEMEAN.format(right))
    )
    for left, right in cov_cols
})

# Covariance: numerator only
cov_expr = {
    _COV.format(left, right): nw.col(_COV.format(left, right)).sum()
    for left, right in cov_cols
}

# Group size: needed for covariance denominator
group_size_expr = {_GRP_SIZE: nw.len()}

all_expr = count_expr | mean_expr | var_expr | cov_expr | group_size_expr

result = (
    data
    .group_by(group_col)
    .agg(**all_expr)
    .with_columns(**{
        _COV.format(left, right): nw.col(_COV.format(left, right)) / (nw.col(_GRP_SIZE) - 1)
        for left, right in cov_cols
    })
)
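
For readers who want to check the arithmetic behind the snippet above, here is a minimal pure-Python sketch of the same decomposition: demean each column within its group, sum the products per group, then divide by the group size minus one. It deliberately uses only the standard library rather than Narwhals, and the function name `grouped_sample_cov`, the column names, and the sample rows are all made up for illustration. Returning NaN for single-row groups is also an assumption of this sketch, not something the snippet above does.

```python
from collections import defaultdict


def grouped_sample_cov(rows, group_key, left, right):
    """Per-group sample covariance built from sums, means and counts only."""
    # Pass 1: per-group sums and sizes, which give the group means.
    sums = defaultdict(lambda: [0.0, 0.0])
    sizes = defaultdict(int)
    for row in rows:
        g = row[group_key]
        sums[g][0] += row[left]
        sums[g][1] += row[right]
        sizes[g] += 1
    means = {g: (s[0] / sizes[g], s[1] / sizes[g]) for g, s in sums.items()}

    # Pass 2: sum the products of the demeaned values (covariance numerator).
    numerators = defaultdict(float)
    for row in rows:
        g = row[group_key]
        mean_left, mean_right = means[g]
        numerators[g] += (row[left] - mean_left) * (row[right] - mean_right)

    # Divide by (group size - 1) to get the sample covariance.
    # Assumption of this sketch: single-row groups yield NaN.
    return {
        g: numerators[g] / (sizes[g] - 1) if sizes[g] > 1 else float("nan")
        for g in sizes
    }


# Hypothetical sample data with two groups.
rows = [
    {"variant": "a", "x": 1.0, "y": 2.0},
    {"variant": "a", "x": 2.0, "y": 4.0},
    {"variant": "a", "x": 3.0, "y": 6.0},
    {"variant": "b", "x": 1.0, "y": 3.0},
    {"variant": "b", "x": 3.0, "y": 1.0},
]
result = grouped_sample_cov(rows, "variant", "x", "y")
print(result)
```

Every step here is either a simple aggregation (sum, count) or an element-wise operation, which is why the same shape of computation avoids complex aggregations in a group-by.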

e10v · 2024-12-17T13:50:56Z

@FBruzzesi thanks! In my case, count and group size are the same. So, I can just reuse it. Will fix it in the next version of tea-tasting.

FBruzzesi · 2024-12-17T13:53:43Z

Awesome! I cut some parts of the code and focused only on the covariance while implementing this workaround. I apologize if I missed other parts that can be reused.

Feel free to ping me/us in the tea-tasting PR if needed 🚀

e10v · 2024-12-24T17:38:04Z

Looks like it works: e10v/tea-tasting#110
@FBruzzesi thank you!

FBruzzesi · 2024-12-26T18:44:55Z

Great to hear that it worked out 🚀

MarcoGorelli added enhancement New feature or request Medium priority labels Dec 17, 2024

e10v mentioned this issue Dec 17, 2024

[Enh]: a method similar to has_operation in Ibis #1610

Open
