Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Enh]: add cov to Expr and Series #1607

Open
e10v opened this issue Dec 17, 2024 · 9 comments
Open

[Enh]: add cov to Expr and Series #1607

e10v opened this issue Dec 17, 2024 · 9 comments
Labels
enhancement New feature or request Medium priority

Comments

@e10v
Copy link
Contributor

e10v commented Dec 17, 2024

We would like to learn about your use case. For example, if this feature is needed to adopt Narwhals in an open source project, could you please enter the link to it below?

https://github.com/e10v/tea-tasting

Please describe the purpose of the new feature or describe the problem to solve.

In tea-tasting, there is a need to calculate variance and covariance. There is a workaround: here and here. But, apparently, it's not optimal for pandas dataframes. Currently this warning is suppressed in tea-tasting.

I see that variance is already in progress added in #1603 ❤️ Could you please also consider adding covariance as well? It's not critical: tea-tasting works without it. But it would make calculations faster for users who prefer pandas-like dataframes.

Not all dataframes have a covariance function. For example, pyarrow doesn't have it (though it's still possible to calculate it the same way as tea-tasting does). In this case it would be very useful to have a method similar to Ibis has_operation (example usage). Could you consider adding it as well? I can create a separate feature request if needed.

Suggest a solution if possible.

No response

If you have tried alternatives, please describe them below.

No response

Additional information that may help us understand your needs.

No response

@MarcoGorelli
Copy link
Member

thanks @e10v for the request!

and sure, a separate has_operation issue might be a good idea

@MarcoGorelli MarcoGorelli added enhancement New feature or request Medium priority labels Dec 17, 2024
@FBruzzesi
Copy link
Member

Thanks for your request @e10v !

At first sight, it seems quite hard to support pandas native .cov operation in group by's to match polars.cov expression.
Possibly there is a way to not enter the "complex aggregation space" yet without implementing such function. I will take a deeper look and come back to this

@e10v
Copy link
Contributor Author

e10v commented Dec 17, 2024

@FBruzzesi thank you for quick response! Just in case there is a Series.cov in pandas: https://pandas.pydata.org/docs/reference/api/pandas.Series.cov.html

@e10v
Copy link
Contributor Author

e10v commented Dec 17, 2024

@MarcoGorelli thank you for accepting the request!

It's surely not urgent, at least for me. So even low priority would be good, considering the possible complexity of a solution.

I've also created a separate feature request #1610 for has_operation-like.

@FBruzzesi
Copy link
Member

FBruzzesi commented Dec 17, 2024

Not too pleasing, yet the following solution does not incur in complex aggregations hence it will work for pyarrow as well:

_GRP_SIZE = "_group_size__"

data = data.with_columns(**{
    _DEMEAN.format(col): _demean_nw_col(col, group_col)
    for col in covar_cols
})

# Pre computes left * right
data = data.with_columns(**{
    _COV.format(left, right): (
        nw.col(_DEMEAN.format(left)) * nw.col(_DEMEAN.format(right))
     )
     for left, right in cov_cols
})

# Covariance: numerator only
cov_expr = {
    _COV.format(left, right): nw.col(_COV.format(left, right)).sum()
    for left, right in cov_cols
}

# Group size: needed for covariance denominator
group_size_expr = {_GRP_SIZE: nw.len()}

all_expr = count_expr | mean_expr | var_expr | cov_expr | group_size_expr

result = (
    data
    .group_by(group_col)
    .agg(**all_expr)
    .with_columns(**{
        _COV.format(left, right): nw.col(_COV.format(left, right))/(nw.col(_GRP_SIZE) -1)
        for left, right in cov_cols
    })
)

@e10v
Copy link
Contributor Author

e10v commented Dec 17, 2024

@FBruzzesi thanks! In my case, count and group size are the same. So, I can just reuse it. Will fix it in the next version of tea-tasting.

@FBruzzesi
Copy link
Member

Awesome! I cut some part of the code and focused only on the covariance while implementing this workaround. I apologize if I missed other parts that can be re-used.

Feel free to ping me/us in the tea-tasting PR if needed 🚀

@e10v
Copy link
Contributor Author

e10v commented Dec 24, 2024

Looks like it works: e10v/tea-tasting#110
@FBruzzesi thank you!

@FBruzzesi
Copy link
Member

Great to hear that it worked out 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request Medium priority
Projects
None yet
Development

No branches or pull requests

3 participants