nn.Embedding to avoid OneHotEncoding all categorical columns #425

ravinkohli · 2022-03-31T12:41:56Z

This PR replaces the LearnedEntityEmbedding with PyTorch's nn.Embedding, which implicitly one-hot encodes categorical columns. This reduces memory usage compared to the old version.
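
To make "implicitly one-hot encodes" concrete: an embedding lookup returns the same vector as multiplying a one-hot row by the embedding weight matrix, without ever materialising the one-hot row. The sketch below is plain Python rather than PyTorch, and the weight values and helper names are made up for illustration; it is not code from this PR.

```python
from typing import List

Matrix = List[List[float]]


def one_hot(index: int, num_categories: int) -> List[float]:
    """Dense one-hot row: 1.0 at `index`, 0.0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(num_categories)]


def one_hot_times_weights(index: int, weights: Matrix) -> List[float]:
    """Explicit route: build the one-hot row, then multiply it by the weights."""
    row = one_hot(index, len(weights))
    embedding_dim = len(weights[0])
    return [
        sum(row[c] * weights[c][d] for c in range(len(weights)))
        for d in range(embedding_dim)
    ]


def embedding_lookup(index: int, weights: Matrix) -> List[float]:
    """Lookup route: pick row `index` directly, as an embedding table does."""
    return list(weights[index])


# Hypothetical table: 4 categories, embedding dimension 2.
weights = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]

# Both routes agree for every category, but the lookup never allocates
# a row of length num_categories.
for category in range(len(weights)):
    assert one_hot_times_weights(category, weights) == embedding_lookup(category, weights)
```

The lookup route only needs one integer per categorical value as input, which is where the memory saving comes from.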

Types of changes

New feature (non-breaking change which adds functionality)

Motivation and Context

One-hot encoding can cause memory usage to explode when the number of categories per column is high. Using nn.Embedding for such categorical columns significantly reduces memory usage. Moreover, it is a simpler and more robust implementation of the embedding module. To do this, I have introduced a new pipeline step called ColumnSplitter (I am open to better name suggestions), which has min_values_for_embedding as a hyperparameter.
This PR also makes minor changes that optimise some parts of the library. These include:

- We were loading the data from the datamanager and preprocessing part of it whenever we needed the shape of the data after preprocessing. Loading data from disk is time-consuming, and preprocessing even a small part of it is unnecessary. Now the shape after preprocessing is passed from the EarlyPreprocessing node, which is more efficient.
- Remove self.categories from the tabular feature validator, which according to "[memo] High memory consumption and the places of doubts #180" takes a lot of memory. We don't really need to store all the categories anyway; we only need num_categories_per_col.
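
As a rough illustration of the splitting idea and its effect on input width, here is a minimal sketch. It is not the ColumnSplitter from this PR: the helper names `split_columns` and `width_after_preprocessing` are hypothetical, the category counts are invented, and it assumes a column is embedded when it has at least `min_values_for_embedding` categories (the exact comparison used in the PR is not shown here).

```python
from typing import Dict, List


def split_columns(
    num_categories_per_col: List[int], min_values_for_embedding: int
) -> Dict[str, List[int]]:
    """Assign each categorical column index to 'embed' or 'encode'.

    Assumption: a column is embedded when it has at least
    `min_values_for_embedding` categories; otherwise it is one-hot encoded.
    """
    split: Dict[str, List[int]] = {"embed": [], "encode": []}
    for col, num_categories in enumerate(num_categories_per_col):
        key = "embed" if num_categories >= min_values_for_embedding else "encode"
        split[key].append(col)
    return split


def width_after_preprocessing(
    num_categories_per_col: List[int], split: Dict[str, List[int]]
) -> int:
    """One column per category for encoded columns, one index column per embedded column."""
    encoded = sum(num_categories_per_col[col] for col in split["encode"])
    return encoded + len(split["embed"])


# Hypothetical dataset: two low-cardinality columns and one with 5000 categories.
num_categories_per_col = [2, 3, 5000]
split = split_columns(num_categories_per_col, min_values_for_embedding=4)

all_one_hot_width = sum(num_categories_per_col)
mixed_width = width_after_preprocessing(num_categories_per_col, split)
print(split, all_one_hot_width, mixed_width)
```

With these made-up numbers, one-hot encoding everything yields 5005 input columns, while embedding only the high-cardinality column yields 6.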

How has this been tested?

I have successfully run example_tabular_classification on the Australian dataset, where the default configuration allows us to verify the features introduced in this PR.

ravinkohli · 2022-03-31T13:06:25Z

autoPyTorch/pipeline/components/setup/network_embedding/LearnedEntityEmbedding.py

+                # allows us to pass embed_columns to the dataset properties.
+                # TODO: test the trade off
+                # Another solution is to combine `OneHotEncoding`, `Embedding` and `NoEncoding` in one custom transformer.
+                # this will also allow users to use this transformer outside the pipeline


Suggested change

# this will also allow users to use this transformer outside the pipeline

# this will also allow users to use this transformer outside the pipeline, see [this](https://github.com/manujosephv/pytorch_tabular/blob/main/pytorch_tabular/categorical_encoders.py#L132)

theodorju

As discussed in the meeting, I reviewed the changes.

theodorju · 2022-07-14T15:38:09Z

autoPyTorch/api/base_task.py

@@ -111,7 +111,7 @@ def send_warnings_to_log(
    return prediction


-def get_search_updates(categorical_indicator: List[bool]):
+def get_search_updates(categorical_indicator: List[bool]) -> HyperparameterSearchSpaceUpdates:


The method argument is not used; I believe it could be removed.

theodorju · 2022-07-14T15:39:26Z

autoPyTorch/api/base_task.py

@@ -267,7 +267,8 @@ def __init__(

        self.input_validator: Optional[BaseInputValidator] = None

-        self.search_space_updates = search_space_updates if search_space_updates is not None else get_search_updates(categorical_indicator)
+        # if search_space_updates is not None else get_search_updates(categorical_indicator)


I think this could also be removed.

theodorju · 2022-07-14T15:52:10Z

autoPyTorch/evaluation/train_evaluator.py

-                    self.logger.debug(f"run_summary_dict {json.dumps(run_summary_dict)}")
-                    with open(os.path.join(self.backend.temporary_directory, 'run_summary.txt'), 'a') as file:
-                        file.write(f"{json.dumps(run_summary_dict)}\n")
+            # self._write_run_summary(pipeline)


Based on the functionality that was encapsulated in the function, I think this should be called here, right?
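
For reference, the removed lines append one JSON object per line to run_summary.txt. A standalone helper with that behaviour might look like the sketch below. The body of `_write_run_summary` is not shown in this hunk, so the function name, signature, and dictionary keys here are hypothetical.

```python
import json
import os
import tempfile
from typing import Any, Dict


def write_run_summary(temporary_directory: str, run_summary_dict: Dict[str, Any]) -> str:
    """Append one JSON-encoded summary per line to run_summary.txt and return its path."""
    path = os.path.join(temporary_directory, "run_summary.txt")
    with open(path, "a") as file:
        file.write(f"{json.dumps(run_summary_dict)}\n")
    return path


# Usage: two evaluations append two lines to the same file.
with tempfile.TemporaryDirectory() as tmp_dir:
    write_run_summary(tmp_dir, {"budget": 1, "loss": 0.25})
    summary_path = write_run_summary(tmp_dir, {"budget": 2, "loss": 0.2})
    with open(summary_path) as file:
        lines = [json.loads(line) for line in file]
```

If the call stays commented out, nothing is appended to the file at all, which is the concern raised in this comment.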

theodorju · 2022-07-14T16:22:41Z

autoPyTorch/pipeline/base_pipeline.py

@@ -297,7 +296,7 @@ def _get_hyperparameter_search_space(self,
        """
        raise NotImplementedError()

-    def _add_forbidden_conditions(self, cs):
+    def _add_forbidden_conditions(self, cs: ConfigurationSpace) -> ConfigurationSpace:
        """
        Add forbidden conditions to ensure valid configurations.
        Currently, Learned Entity Embedding is only valid when encoder is one hot encoder


Based on the changes introduced in the PR, I think the first condition mentioned in the docstring regarding Learned Entity Embedding should be removed.

theodorju · 2022-07-14T16:43:02Z

...h/pipeline/components/preprocessing/tabular_preprocessing/column_splitting/ColumnSplitter.py

+        return self
+
+    def transform(self, X: Dict[str, Any]) -> Dict[str, Any]:
+        if self.num_categories_per_col is not None:


self.num_categories_per_col is initialized as an empty list, which means it will not be None for the encoded columns either. Maybe this condition should be changed to:

if self.num_categories_per_col: ...

It will be None when there are no categorical columns; see line 38.

Hm, but line 38 initializes self.num_categories_per_col to an empty list if there are categorical columns, and [] is not None returns True.

I'm mentioning this because I thought that in line 53 we check whether there are columns to be embedded; currently, the if condition evaluates to True for both embedded and encoded columns.
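
The difference between the two checks discussed in this thread can be verified in isolation. This is a small sketch with invented values; `describe` is a hypothetical helper, not part of ColumnSplitter.

```python
from typing import Dict, List, Optional


def describe(num_categories_per_col: Optional[List[int]]) -> Dict[str, bool]:
    """Evaluate both candidate conditions for a given attribute value."""
    return {
        "is_not_none": num_categories_per_col is not None,  # current check
        "is_truthy": bool(num_categories_per_col),          # suggested check
    }


none_case = describe(None)         # no categorical columns at all
empty_case = describe([])          # categorical columns exist, none collected for embedding
filled_case = describe([12, 40])   # two columns to embed (made-up counts)

print(none_case, empty_case, filled_case)
```

Only the empty-list case separates the two checks: `is not None` passes for it, while the truthiness check does not.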

theodorju · 2022-07-16T09:31:58Z

autoPyTorch/api/base_task.py

+    # has_cat_features = any(categorical_indicator)
+    # has_numerical_features = not all(categorical_indicator)


I think this should be removed.

theodorju · 2022-07-16T09:37:16Z

autoPyTorch/pipeline/components/setup/network_embedding/LearnedEntityEmbedding.py

@@ -19,69 +19,59 @@
 class _LearnedEntityEmbedding(nn.Module):
    """ Learned entity embedding module for categorical features"""

-    def __init__(self, config: Dict[str, Any], num_input_features: np.ndarray, num_numerical_features: int):
+    def __init__(self, config: Dict[str, Any], num_categories_per_col: np.ndarray, num_features_excl_embed: int):
        """
        Args:
            config (Dict[str, Any]): The configuration sampled by the hyperparameter optimizer
            num_input_features (np.ndarray): column wise information of number of output columns after transformation


I think num_input_features in the docstring should be replaced with: num_categories_per_col (np.ndarray): number of categories for categorical columns that will be embedded

theodorju · 2022-07-16T09:47:22Z

autoPyTorch/pipeline/tabular_classification.py

@@ -289,6 +277,7 @@ def _get_pipeline_steps(
            ("imputer", SimpleImputer(random_state=self.random_state)),
            # ("variance_threshold", VarianceThreshold(random_state=self.random_state)),
            # ("coalescer", CoalescerChoice(default_dataset_properties, random_state=self.random_state)),
+            ("column_splitter", ColumnSplitter(random_state=self.random_state)),


I think the docstring of the class should be updated to also include column_splitter as a step.

theodorju · 2022-07-16T09:48:47Z

autoPyTorch/pipeline/tabular_regression.py

-            ("coalescer", CoalescerChoice(default_dataset_properties, random_state=self.random_state)),
+            # ("variance_threshold", VarianceThreshold(random_state=self.random_state)),
+            # ("coalescer", CoalescerChoice(default_dataset_properties, random_state=self.random_state)),
+            ("column_splitter", ColumnSplitter(random_state=self.random_state)),


Same as tabular_classification.py, it would be nice to add this step in the docstring as well.

…edding) (#437)

* add updates for apt1.0+reg_cocktails
* debug loggers for checking data and network memory usage
* add support for pandas, test for data passing, remove debug loggers
* remove unwanted changes
* :
* Adjust formula to account for embedding columns
* Apply suggestions from code review
  Co-authored-by: nabenabe0928 <47781922+nabenabe0928@users.noreply.github.com>
* remove unwanted additions
* Update autoPyTorch/pipeline/components/preprocessing/tabular_preprocessing/TabularColumnTransformer.py
  Co-authored-by: nabenabe0928 <47781922+nabenabe0928@users.noreply.github.com>

ravinkohli · 2022-08-09T14:21:14Z

This branch will be merged to reg_cocktails. Therefore, this PR has been shifted to #451

ravinkohli added 3 commits March 23, 2022 12:19

have working embedding from pytroch

05d187c

divide columns to encode and embed based on threshold

769b51e

cleanup unwanted changes

0d9beae

ravinkohli requested review from nabenabe0928 and ArlindKadra March 31, 2022 12:59

ravinkohli added the enhancement New feature or request label Mar 31, 2022

use shape after preprocessing in base network backbone

a2d84e5

ravinkohli linked an issue Apr 5, 2022 that may be closed by this pull request

Replace Embedding to use nn.Embedding from pytorch #428

Open

ravinkohli added 4 commits April 5, 2022 19:21

remove redundant call to load datamanager

6b188a4

add init file for column splitting

539fdba

fix tests

adc26d5

fix precommit and add test changes

9573358

theodorju reviewed Jul 16, 2022

ravinkohli and others added 2 commits July 16, 2022 17:31

suggestions from review

b2c0ecc

ravinkohli commented Jul 18, 2022

ravinkohli added 2 commits July 18, 2022 20:42

Update autoPyTorch/api/base_task.py

6830116

Reg cocktails apt1.0+reg cocktails pytorch embedding reduced (#454)

3761b53

* reduce number of hyperparameters for pytorch embedding
* remove todos for the preprocessing PR, and apply suggestion from code review
* remove unwanted exclude in test

ravinkohli closed this Aug 9, 2022
