Replace all occurrences of Pandas' get_dummies() with skLearn OneHotEncoder #1134

drawlinson · 2024-01-16T03:51:38Z

An earlier issue, #1111, observed inconsistent behaviour from RegressionEstimator subclasses when the new data passed to the do() method had different rows from the originally fitted data, which caused categorical variables to be encoded inconsistently. The problem surfaces there because the do() operator allows unseen data to be processed with an existing Estimator.

The issue occurs because categorical encoding was using Pandas' get_dummies(), which derives its output columns from whatever data it is given and so cannot encode additional data with an existing encoding. An alternative, skLearn OneHotEncoder, provides an Encoder object which can be used to encode additional data consistently. skLearn is also already a DoWhy dependency. For these reasons skLearn OneHotEncoder is preferred over get_dummies.
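To illustrate the difference, here is a minimal sketch. The column name and data are made up for the example and are not taken from DoWhy.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: the fitted data contains three categories,
# the new data only two of them.
fit_df = pd.DataFrame({"colour": ["red", "green", "blue"]})
new_df = pd.DataFrame({"colour": ["green", "red"]})

# get_dummies builds its columns from the data it is handed,
# so the two frames end up with different column layouts.
dummies_fit = pd.get_dummies(fit_df, columns=["colour"])
dummies_new = pd.get_dummies(new_df, columns=["colour"])
print(list(dummies_fit.columns))
print(list(dummies_new.columns))

# OneHotEncoder learns the categories once and reuses them,
# so both outputs share the same column layout.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(fit_df[["colour"]])
encoded_fit = encoder.transform(fit_df[["colour"]]).toarray()
encoded_new = encoder.transform(new_df[["colour"]]).toarray()
print(encoded_fit.shape, encoded_new.shape)
```

With get_dummies the new data loses the column for the category it does not contain, whereas the fitted encoder keeps one column per category seen during fitting.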

This change goes further and replaces all occurrences of get_dummies with OneHotEncoder, so that if functionality to process additional data is added to other classes in future (e.g. do operator), the consistency bug won't happen again.

Features added to RegressionEstimator which remember a set of Encoders are pushed down to the base class CausalEstimator.
All CausalEstimator subclasses call reset_encoders() on each fit(), implementing the lifecycle assumption that fit() implies entirely new data, so any existing encoders are forgotten.
get_dummies was also used by the UnobservedCommonCause Refuter, but this usage has no side-effects and references to the encoded data are not retained. It was replaced simply for consistency in using skLearn.
get_dummies was also used by the do-sampler's propensity score utility function binarize_discrete. Elsewhere in these utility functions skLearn LabelEncoder is used. So, for consistency, this occurrence is also replaced by skLearn OneHotEncoder.
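The encoder lifecycle described above could look roughly like the sketch below. Only reset_encoders(), fit() and do() are names from this PR; SketchEstimator, _encode, the "common_causes" key and the data are hypothetical, and this is not the actual DoWhy implementation.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder


class SketchEstimator:
    """Hypothetical stand-in for an estimator that remembers its encoders."""

    def __init__(self):
        self._encoders = {}

    def reset_encoders(self):
        # Forget every encoder learned from earlier data.
        self._encoders = {}

    def _encode(self, data, name):
        # Reuse the named encoder if it exists; otherwise fit a new one.
        encoder = self._encoders.get(name)
        if encoder is None:
            encoder = OneHotEncoder(handle_unknown="ignore")
            encoder.fit(data)
            self._encoders[name] = encoder
        return encoder.transform(data).toarray()

    def fit(self, data):
        # fit() implies entirely new data, so existing encoders are dropped.
        self.reset_encoders()
        return self._encode(data, "common_causes")

    def do(self, new_data):
        # Unseen data is encoded with the encoders learned during fit().
        return self._encode(new_data, "common_causes")


estimator = SketchEstimator()
fitted = estimator.fit(pd.DataFrame({"region": ["north", "south", "east"]}))
later = estimator.do(pd.DataFrame({"region": ["south"]}))
refit = estimator.fit(pd.DataFrame({"region": ["south", "west"]}))
print(fitted.shape, later.shape, refit.shape)
```

In this sketch the single-row frame passed to do() is still encoded into three columns, and the second fit() starts over with only the categories in its own data.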

All of these changes are heavily covered by existing tests.

…es of Pandas' get_dummies with skLearn's OneHotEncoder. Encoder lifespan: Reuses encoders for new estimate_effect() calls, and replaces existing encoders on CausalEstimator.fit(). Additional uses of get_dummies without side-effects or consistent encoding issues in do-Sampler Propensity Scores utilities also replaced for consistency. Signed-off-by: DAVID RAWLINSON <dave@causalwizard.app>

…ccurrences-of-get_dummies

drawlinson · 2024-01-16T03:54:03Z

PR has a spurious commit from main in it, will try again

DAVID RAWLINSON added 2 commits January 16, 2024 14:37

Merge branch 'main' of github.com:drawlinson/dowhy into replace-all-o…

967f0ea

…ccurrences-of-get_dummies

drawlinson closed this Jan 16, 2024
