Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Ensure ordering is passed through model_matrix #213

Closed
wants to merge 2 commits into from

Conversation

bashtage
Copy link
Contributor

model_matrix silently loses Formula ordering information and changes it to degree

@bashtage
Copy link
Contributor Author

This is needed to allow formulas to be reordered, and for the order to be preserved when creating a model matrix. If you pass a non-structured Formula, it works. It fails when the Formula has structure.

Not 100% sure that this is the best way to achieve this, but gets the jobs done.

@matthewwardrop
Copy link
Owner

Thanks for catching this, and for the patch! I'll take a closer look tomorrow!

@bashtage bashtage force-pushed the preserve-formula-order branch from 5b6292a to e3f1012 Compare November 26, 2024 09:21
@matthewwardrop
Copy link
Owner

matthewwardrop commented Nov 27, 2024

I haven't forgotten about this! There are lots of things at play at the moment (e.g. #216) as I try to dash off more of the open issues... and those changes may impact how we think about this too.

I think that this patch may be masking a deeper problem around how the Formula constructor is reshuffling formulae... so having this unmerged is a forcing function for me to really think about it. I'll definitely make sure this is resolved before the 1.1 release which will be released soon-ish (once you are convinced it is ready to cater to statsmodels, and once the other PRs land); but let me know if having this on a branch is problematic to you, and we can merge it in and I'll just make a note to revisit it before the 1.1.0 release.

@bashtage
Copy link
Contributor Author

I'm happy to wait. I'm just using a personal branch as the target for development that has this patch and has in fact some of the other patches that haven't merged too.

@bashtage
Copy link
Contributor Author

bashtage commented Nov 27, 2024

I'm happy to wait. I'm just using a personal branch as the target for development that has this patch and has in fact some of the other patches that haven't merged too.

One thing I noticed when writing this patch was that even if you build formulae recursively where you attach a formula to another formula all with ordering, that inside the recursive calls in this function, the sub formulae will be passed as a lists of terms and not a Formula. This is why I ended up with this internal solution using partial since only the outer Formula seemed to retain its ordering information.

model_matrix silently loses Formula ordering information and
changes it to degree
Ensure that nested formulas have their order preserved
@bashtage bashtage force-pushed the preserve-formula-order branch from e3f1012 to 8fb8ca7 Compare November 28, 2024 09:14
@bashtage
Copy link
Contributor Author

When I pull #213 and #217 onto main, I can get statsmodels to pass all tests.

There is one small issue that I think could be classified as "won't (can't) fix". patsy can sometimes suppress the T. in categorical variables. This happens when a categorical variable spans the intercept. An example is something like

import pandas as pd
import numpy as np
from patsy import dmatrices
from formulaic import model_matrix

data = pd.DataFrame(
    {
        "y": np.random.standard_normal(30),
        "a": np.random.choice(["a", "b", "c"], size=30),
        "b": np.random.choice(["x", "y", "z"], size=30),
    }
)
formula = "y ~ C(a) + C(b) - 1"
print(dmatrices(formula, data, return_type="dataframe")[1].head(5))
print(model_matrix(formula, data).rhs.head(5))

This outputs

   C(a)[a]  C(a)[b]  C(a)[c]  C(b)[T.y]  C(b)[T.z]
0      0.0      0.0      1.0        0.0        0.0
1      1.0      0.0      0.0        1.0        0.0
2      0.0      0.0      1.0        1.0        0.0
3      0.0      0.0      1.0        0.0        1.0
4      1.0      0.0      0.0        1.0        0.0
   C(a)[T.a]  C(a)[T.b]  C(a)[T.c]  C(b)[T.y]  C(b)[T.z]
0          0          0          1          0          0
1          1          0          0          1          0
2          0          0          1          1          0
3          0          0          1          0          1
4          1          0          0          1          0

@bashtage
Copy link
Contributor Author

I have poked around in Formula a bit and I noticed a (perhaps) strange behavior. When I make deeply nested formulae, the first layer will still be a formula, while the bottom layer will not be. Perhaps there is some special casing for root? If this could be removed, it might fix the problem in the sense that one could directly construct the formula as
Formula(lhs=Formula("y"), rhs=Formula("1+x+w+C(d)", order="none")). Right now this doesn't work because rhs is stored as list[Term].

    lhs = Formula("x:y + y - 1", _ordering="none")
    rhs = Formula("1 + w + d:x +x + e ", _ordering="sort")
    formula = Formula(lhs=lhs, rhs=rhs, _ordering="degree")
    larger_formula = Formula(left=formula, right=formula)
    print(type(larger_formula))
    for key, value in larger_formula._structure.items():
        print(key, type(value))
        for inner_key, inner_value in value._structure.items():
            print((key, inner_key), type(inner_value))

returns

<class 'formulaic.formula.Formula'>
left <class 'formulaic.formula.Formula'>
('left', 'lhs') <class 'list'>
('left', 'rhs') <class 'list'>
right <class 'formulaic.formula.Formula'>
('right', 'lhs') <class 'list'>
('right', 'rhs') <class 'list'>

@matthewwardrop
Copy link
Owner

matthewwardrop commented Nov 29, 2024

There is one small issue that I think could be classified as "won't (can't) fix". patsy can sometimes suppress the T. in categorical variables.

I don't know why I didn't think about this before, but I've patched formulaic to mimic this behaviour in #220 . Thanks for highlighting.

I have poked around in Formula a bit and I noticed a (perhaps) strange behavior. When I make deeply nested formulae, the first layer will still be a formula, while the bottom layer will not be. Perhaps there is some special casing for root? If this could be removed, it might fix the problem in the sense that one could directly construct the formula as
Formula(lhs=Formula("y"), rhs=Formula("1+x+w+C(d)", order="none")). Right now this doesn't work because rhs is stored as list[Term].

Technically a Formula is just a Structured[List[Term]] instance with some extra utility methods on the Formula class. The nested structure is always stored as list of terms. To keep the Formula helper methods, when you access the structure via the top-level formula class attributes/indexing, we wrap the sub-structure in a new Formula class. We special case .root so that you can access the raw list more easily. (A Formula instance with just a .root attribute can be enumerated just a like the list, fwiw).

The issue with the example you provided above is that the Formula class re-interprets (and hence re-orders) the incoming formulae in order to guarantee consistency within the new formula. This is a sensible behaviour, but is being caught up in the structured currying.

My plan is to rework the base Formula class a bit, which should land as early as tomorrow.

Thanks again @bashtage !

@matthewwardrop
Copy link
Owner

matthewwardrop commented Dec 1, 2024

@bashtage Thanks again for raising these issues. #222 is ready to go in, and so shortly this PR will be closed... but likely it would never have been done if not for your poking around, so again... many thanks! Let me know if you find any remaining issues.

@bashtage bashtage deleted the preserve-formula-order branch December 2, 2024 13:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants