Cache embedding_weights_by_table for EmbeddingFusedOptimizer

Summary: `emb_module.split_embedding_weights()` is an expensive call. It currently sits in the `EmbeddingFusedOptimizer` constructor, so it runs every time an instance is created. Because `_gen_named_parameters_by_table_fused` creates `EmbeddingFusedOptimizer` instances **thousands of times in a loop**, a significant amount of time goes to repeating this call. This change calls `split_embedding_weights()` once, before the loop, and passes the result to each constructor through a new optional `embedding_weights_by_table` parameter. When the parameter is not supplied, the constructor still calls `split_embedding_weights()` itself. With this caching, **CREATE_TRAIN_MODULE.SHARD_MODEL** drops from approximately **22 seconds** to around **15 seconds**.

Reviewed By: dstaay-fb

Differential Revision: D68578829
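The effect of hoisting the call out of the loop can be sketched with stand-in classes. This is a minimal illustration, not the TorchRec code: `FakeEmbModule`, `FakeFusedOptimizer`, `build_uncached`, and `build_cached` are hypothetical names, and the "expensive" call only counts how often it runs.

```python
from typing import List, Optional


class FakeEmbModule:
    """Hypothetical stand-in for the embedding module; counts expensive calls."""

    def __init__(self, num_tables: int) -> None:
        self.num_tables = num_tables
        self.split_calls = 0

    def split_embedding_weights(self) -> List[List[float]]:
        # Stands in for the expensive call; here it only records that it ran.
        self.split_calls += 1
        return [[float(t)] for t in range(self.num_tables)]


class FakeFusedOptimizer:
    """Hypothetical stand-in for EmbeddingFusedOptimizer."""

    def __init__(
        self,
        emb_module: FakeEmbModule,
        embedding_weights_by_table: Optional[List[List[float]]] = None,
    ) -> None:
        # Same fallback shape as the diff: reuse the caller's value if given,
        # otherwise compute it here.
        self.weights = (
            embedding_weights_by_table or emb_module.split_embedding_weights()
        )


def build_uncached(emb_module: FakeEmbModule, n: int) -> List[FakeFusedOptimizer]:
    # Before the change: every constructor call splits the weights again.
    return [FakeFusedOptimizer(emb_module) for _ in range(n)]


def build_cached(emb_module: FakeEmbModule, n: int) -> List[FakeFusedOptimizer]:
    # After the change: split once, then hand the same list to every instance.
    cached = emb_module.split_embedding_weights()
    return [FakeFusedOptimizer(emb_module, cached) for _ in range(n)]


uncached_module = FakeEmbModule(num_tables=4)
build_uncached(uncached_module, 1000)

cached_module = FakeEmbModule(num_tables=4)
build_cached(cached_module, 1000)

print(uncached_module.split_calls, cached_module.split_calls)  # prints "1000 1"
```

The number of split calls falls from one per constructed instance to one per loop. The sketch only counts calls, so it says nothing about how much wall-clock time that saves in practice.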
pytorch · Jan 29, 2025 · e9d2e0f
1 parent c6f41aa
Showing 1 changed file with 7 additions and 2 deletions.
diff --git a/torchrec/distributed/batched_embedding_kernel.py b/torchrec/distributed/batched_embedding_kernel.py
@@ -13,7 +13,6 @@
 import itertools
 import logging
 import tempfile
-from collections import OrderedDict
 from dataclasses import dataclass
 from typing import (
     Any,
@@ -216,6 +215,7 @@ def __init__(  # noqa C901
         pg: Optional[dist.ProcessGroup] = None,
         create_for_table: Optional[str] = None,
         param_weight_for_table: Optional[nn.Parameter] = None,
+        embedding_weights_by_table: Optional[List[torch.Tensor]] = None,
     ) -> None:
         """
         Implementation of a FusedOptimizer. Designed as a base class Embedding kernels
@@ -391,7 +391,9 @@ def get_optimizer_pointwise_shard_metadata_and_global_metadata(
         # that state_dict look identical to no-fused version.
         table_to_shard_params: Dict[str, ShardParams] = {}
 
-        embedding_weights_by_table = emb_module.split_embedding_weights()
+        embedding_weights_by_table = (
+            embedding_weights_by_table or emb_module.split_embedding_weights()
+        )
 
         all_optimizer_states = emb_module.get_optimizer_state()
         optimizer_states_keys_by_table: Dict[str, List[torch.Tensor]] = {}
@@ -674,6 +676,8 @@ def _gen_named_parameters_by_table_fused(
     pg: Optional[dist.ProcessGroup] = None,
 ) -> Iterator[Tuple[str, TableBatchedEmbeddingSlice]]:
     # TODO: move logic to FBGEMM to avoid accessing fbgemm internals
+    # Cache embedding_weights_by_table
+    embedding_weights_by_table = emb_module.split_embedding_weights()
     for t_idx, (rows, dim, location, _) in enumerate(emb_module.embedding_specs):
         table_name = config.embedding_tables[t_idx].name
         if table_name not in table_name_to_count:
@@ -709,6 +713,7 @@ def _gen_named_parameters_by_table_fused(
                 pg=pg,
                 create_for_table=table_name,
                 param_weight_for_table=weight,
+                embedding_weights_by_table=embedding_weights_by_table,
             )
         ]
         yield (table_name, weight)
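
A note on the `embedding_weights_by_table or emb_module.split_embedding_weights()` fallback in the constructor hunk: `or` tests the truthiness of the list, so both `None` and an empty list trigger recomputation. The sketch below contrasts that with an explicit `is not None` check. The helper names (`resolve_with_or`, `resolve_with_is_none`, `compute`) are hypothetical, and whether an empty list can actually reach this code path is not stated in the commit.

```python
from typing import Callable, List, Optional

calls: List[str] = []


def compute() -> List[str]:
    # Stands in for emb_module.split_embedding_weights(); records each call.
    calls.append("split")
    return ["computed"]


def resolve_with_or(
    cached: Optional[List[str]], fallback: Callable[[], List[str]]
) -> List[str]:
    # The form used in the diff: falls back on None *and* on an empty list.
    return cached or fallback()


def resolve_with_is_none(
    cached: Optional[List[str]], fallback: Callable[[], List[str]]
) -> List[str]:
    # Alternative form: falls back only when no value was passed at all.
    return cached if cached is not None else fallback()


from_none = resolve_with_or(None, compute)        # recomputes
from_empty = resolve_with_or([], compute)         # also recomputes
from_value = resolve_with_or(["w0"], compute)     # reuses the cached list
kept_empty = resolve_with_is_none([], compute)    # keeps the empty list
```

For a non-empty list of weights the two forms behave identically, which is the case the loop in `_gen_named_parameters_by_table_fused` exercises.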