Engine Caching #2957

zewenli98 · 2024-06-26T01:19:28Z

zewenli98
Jun 26, 2024
Collaborator

Engine Caching

Goal(s)

Boost performance while calling torch.compile() via reusing previously compiled TensorRT Engines rather than recompiling it every time, thereby avoiding recompilation time.

Proposed APIs

The API would be invoked via argument to torch_tensorrt.compile, as so:

torch_tensorrt.compile(..., ignore_engine_cache=False, ...)

If ignore_engine_cache=False (by default), the backend would attempt to retrieve previously saved TensorRT Engines on the disk. If there is a hit, reuse it rather than recompiling the model.
If ignore_engine_cache=True, the backend would ignore saved TensorRT Engines anyway, instead, recompile the model and then save the new engine to the disk.
This argument provides a layer of abstraction to the user, where the engine caching is handled by Torch-TensorRT and the acceleration benefits are immediate.

Design

Basically, there are four functions: get_hash, query, save, and load. Their functionalities are described in the code as follows.

import tempfile

ENGINE_CACHE_DIR = os.path.join(tempfile.gettempdir(), "torch_tensorrt_engine_cache")

class EngineCache:
    hash_to_serialized_engine_map: Dict[str, bytes] = {}
    
    @staticmethod
    def get_hash(gm: torch.fx.GraphModule) -> str:
        """ Get the hash value of the GraphModule

        Args:
            gm (torch.fx.GraphModule): GraphModule to hash

        Returns:
            str: hash value of the GraphModule
        """
        # parameters are set to 0
        for name, param in gm.named_parameters():
            param.data.zero_()

        hash_val = FxGraphCachePickler.get_hash(gm)
        return hash_val
    

    def query(self, hash: str) -> Optional[bytes]:
        """ Query the serialized engine from the cache or `.../{hash}/engine.bin`

        Args:
            hash (str): hash value of the GraphModule

        Returns:
            Optional[bytes]: serialized TRT engine
        """
        serialized_engine = self.hash_to_serialized_engine_map.get(hash)
        if serialized_engine is not None:
            return serialized_engine
        else:
            serialized_engine = EngineCache.load(hash)
            if serialized_engine is not None:
                self.hash_to_serialized_engine_map[hash] = serialized_engine
            return serialized_engine

    
    @staticmethod
    def save(hash: str, serialized_engine: bytes) -> None:
        """ Save the serialized engine to `.../{hash}/engine.bin`

        Args:
            hash (str): hash value of the GraphModule
            serialized_engine (bytes): serialized TRT engine

        Returns:
            None
        """
        path = os.path.join(ENGINE_CACHE_DIR, f"/{hash}/engine.bin")
        with open(path, "wb") as f:
            f.write(serialized_engine)


    @staticmethod
    def load(hash: str) -> Optional[bytes]:
        """ Load the serialized engine from `.../{hash}/engine.bin`

        Args:
            hash (str): hash value of the GraphModule

        Returns:
            Optional[bytes]: serialized TRT engine
        """
        path = os.path.join(ENGINE_CACHE_DIR, f"/{hash}/engine.bin")
        if os.path.exists(path):
            with open(path, "rb") as f:
                serialized_engine = f.read()
                return serialized_engine
        else:
            return None

The pipeline is as below:

We do the engine caching for sub-modules after the partition phase.

When querying whether there is a hit in the engine cache.

If hit, reuse the architecture and then call the refitting module.
If miss, compile as normal

Implementation

Isomorphic graph

If we want to reuse a compiled graph, the first question comes to mind is how to determine two graphs are isomorphic, since we only reuse the old engines if they are same as the new one.

Considering that refit is used to reassign a new GraphModule's weights to old TRT engine, in the engine cache, we can reuse refit in this feature. Hence, we only care about the architecture of the GraphModule, ignoring its weights. This means whatever the weights are, if two GraphModules have the same architecture, they are considered the same GraphModules.

Hash graph

Since we only need to hash the architecture of GraphModule, we get rid of the weights from GraphModule. In the implementation, all weights will be replaced by 0. Then, we reuse PyTorch Inductor's FxGraphCachePickler to hash the GraphModule.

example code

from torch._inductor.codecache import FxGraphCachePickler

module = MyModule()
gm = torch.fx.symbolic_trace(module)

# parameters are set to 0
for name, param in gm.named_parameters():
    param.data.zero_()

hash_val = FxGraphCachePickler.get_hash(gm)

Cache eviction

The Least Recently Used (LRU) algorithm will be used as the cache eviction strategy. We will preset a hard-disk size for the storage of TRT Engines. Users are able to change the size.

Cache structure

narendasan · 2024-06-26T14:00:00Z

narendasan
Jun 26, 2024
Collaborator

Couple notes:

API Name: I think something like ignore_engine_cache or something to that effect is more direct, since recompile can be interpreted a number of ways.
Do you have more information on the fx.graph hash function? how would this work? What does it mean for two graphs to be isomorphic. For example do weights distinguish two fx graphs? Are there cases where two graphs may have different structures but could be equivalent still (for example if they lower to the same graph)?
Theoretically we can leverage this feature for export or really any call to compile_module as well. Can you graph out where in the compilation pipeline this feature would live?
Can you describe the structure of the cache?, is it just a flat list of directories keyed to their hash value? Also what controls are available to users
Possible other data should live in our cache directory including the timing cache and if possible network definitions. Can we reuse your cache structure for these data too?
Is there a way for people to define their own cache scheme? Say they want to do a remote cache where engines are save in a database on another server?

0 replies

narendasan · 2024-07-03T23:31:07Z

narendasan
Jul 3, 2024
Collaborator

There may be additional information like the weight map cache that we should save with the TRT engine that would save the EngineCache time. See here: https://github.com/pytorch/TensorRT/pull/2983/files

0 replies

zewenli98 · 2024-07-18T00:33:50Z

zewenli98
Jul 18, 2024
Collaborator Author

Possible other data should live in our cache directory including the timing cache and if possible network definitions. Can we reuse your cache structure for these data too?

@narendasan Currently, global timing cache (#2898) is used by default. Users seem unable to disable it. It is separated with engine caching and used during building TRT engines. Do you want to add controls or anything else? A simple use case would be perfect!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Engine Caching #2957

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 3 comments

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Engine Caching #2957

zewenli98 Jun 26, 2024 Collaborator

Engine Caching

Goal(s)

Proposed APIs

Design

Implementation

Isomorphic graph

Hash graph

example code

Cache eviction

Cache structure

Replies: 3 comments

narendasan Jun 26, 2024 Collaborator

narendasan Jul 3, 2024 Collaborator

zewenli98 Jul 18, 2024 Collaborator Author

zewenli98
Jun 26, 2024
Collaborator

narendasan
Jun 26, 2024
Collaborator

narendasan
Jul 3, 2024
Collaborator

zewenli98
Jul 18, 2024
Collaborator Author