feat: Support weight streaming #3111
Conversation
@@ -109,10 +111,119 @@ def __init__(
        if self.serialized_engine is not None and not self.settings.lazy_engine_init:
            self.setup_engine()

    def set_weight_streaming_budget(self) -> None:
@keehyuna do you need to add something similar to the C++ API?
Sorry for the confusion. This is dead code; everything was moved to py/torch_tensorrt/runtime/_weight_streaming.py. The C++ APIs are updated in execute_engine.cpp.
core/runtime/execute_engine.cpp
Outdated
@@ -95,11 +95,13 @@ bool _cudagraphs_validate_shapes(std::vector<at::Tensor> inputs, c10::intrusive_
 }

 std::vector<at::Tensor> execute_engine(std::vector<at::Tensor> inputs, c10::intrusive_ptr<TRTEngine> compiled_engine) {
   compiled_engine->init_context();
This will add first-run latency. Why can't it run in the constructor?
Thanks for the advice. I added it in the constructor, so the latency is incurred at compile time; context creation in forward() is skipped when weight streaming is not used.
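A minimal sketch of that constructor-time flow, assuming a simplified module shape (TRTModuleSketch, TRT_LOGGER, and the lazy_engine_init flag are illustrative names, not the module's exact fields):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

class TRTModuleSketch:
    def __init__(self, serialized_engine: bytes, lazy_engine_init: bool = False):
        self.serialized_engine = serialized_engine
        self.engine = None
        self.context = None
        if self.serialized_engine is not None and not lazy_engine_init:
            # Pay the deserialization/context cost once, at compile time.
            self.setup_engine()

    def setup_engine(self) -> None:
        runtime = trt.Runtime(TRT_LOGGER)
        self.engine = runtime.deserialize_cuda_engine(self.serialized_engine)
        # Creating the context here keeps it out of the hot forward() path.
        self.context = self.engine.create_execution_context()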
@@ -218,13 +219,25 @@ def set_weight_streaming_budget_v1(
            self.engine.minimum_weight_streaming_budget
        )

    def reset_context(self):
Do these context resets atomically with whatever runtime setting change triggers them. Keep as much out of the forward function as we can.
I came across two ideas and tried the first; a minimal sketch of it appears after this list. Please let me know if there is a better way to handle it automatically.
- reset_context() (delete the context), then apply the set_weight_streaming_budget() API; the context is created at forward()
- Enqueue the runtime setting change (like set_weight or profile enable), then in forward(): delete the context -> apply the pending API -> create the context
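A minimal sketch of option 1, assuming a single-engine wrapper (_WeightStreamingSketch is an illustrative name):

import tensorrt as trt

class _WeightStreamingSketch:
    """Option 1, sketched: drop the context, apply the budget, and let
    forward() lazily recreate the context."""

    def __init__(self, engine: trt.ICudaEngine):
        self.engine = engine
        self.context = None

    def reset_context(self) -> None:
        # TRT rejects budget changes while an IExecutionContext is alive,
        # so the context has to go first.
        self.context = None

    def set_weight_streaming_budget(self, budget_bytes: int) -> int:
        self.reset_context()
        self.engine.weight_streaming_budget_v2 = budget_bytes
        return self.engine.weight_streaming_budget_v2

    def forward(self, *inputs):
        if self.context is None:
            # Recreated here only when a budget change dropped it.
            self.context = self.engine.create_execution_context()
        # ... enqueue execution with self.context ...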
        assert self.engine, f"Context is used before setting up the engine"

        if self.context is None:
            self.context = self.engine.create_execution_context()
We already have a setup_engine function; not sure why we need to handle this at exec time?
Weight streaming needs to be set before the context is created, or TRT throws an error. Engine setup was completed during compile of the runtime TRT module, so the context needs to be recreated.
# fine
engine = runtime.deserialize_cuda_engine()
engine.weight_streaming_budget_v2 = budget_bytes
engine.create_execution_context()

# error
engine = runtime.deserialize_cuda_engine()
engine.create_execution_context()
engine.weight_streaming_budget_v2 = budget_bytes
# ERROR:torch_tensorrt [TensorRT Conversion Context]:ICudaEngine::setWeightStreamingBudgetV2: Error Code 3: API Usage Error (Parameter check failed, condition: mExecutionContextCounter.use_count() == 1. The weight streaming budget cannot be modified while there are active IExecutionContexts.)
    def get_weight_streaming_budget(self):
        return self.engine.streamable_weights_size

    def set_weight_streaming_budget(self, budget_bytes):
        self.reset_context()
        self.engine.weight_streaming_budget_v2 = budget_bytes
        if self.engine.weight_streaming_budget_v2 != budget_bytes:
            logger.error(f"Failed to set weight streaming budget to {budget_bytes}")
            budget_bytes = self.engine.weight_streaming_budget_v2
        if self.engine.streamable_weights_size == budget_bytes:
            logger.warning("Weight streaming is disabled")

        return budget_bytes

    def set_automatic_streaming_budget(self):
        budget_bytes = self.engine.get_weight_streaming_automatic_budget()
        return self.set_weight_streaming_budget(budget_bytes)
This API is the same as in the TorchTensorRTModule class. If this interface is good to go, a parent class could be used to share it and some other methods; a sketch follows below.
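A hedged sketch of that refactor (WeightStreamingMixin is an illustrative name; set_weight_streaming_budget is assumed to be provided by the subclass, as in the diff above):

class WeightStreamingMixin:
    """Illustrative shared parent: both the Python and C++-backed runtime
    modules could inherit these helpers instead of duplicating them."""

    def get_weight_streaming_budget(self):
        return self.engine.streamable_weights_size

    def set_automatic_streaming_budget(self):
        budget_bytes = self.engine.get_weight_streaming_automatic_budget()
        # Delegates to the subclass implementation shown in the diff above.
        return self.set_weight_streaming_budget(budget_bytes)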
We probably need to think about what the user flow is here, both at compile time and at runtime.
        return budget_bytes

    def set_automatic_streaming_budget(self):
        budget_bytes = self.engine.get_weight_streaming_automatic_budget()
Seems like a good default we can use in setup_engine
I agree. I set automatic weight streaming when the compiler option is set; a sketch of the idea follows below.
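A sketch of that default, assuming the flag travels on the module's settings (settings.enable_weight_streaming and TRT_LOGGER are illustrative names):

import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def setup_engine(self) -> None:
    runtime = trt.Runtime(TRT_LOGGER)
    self.engine = runtime.deserialize_cuda_engine(self.serialized_engine)
    if self.settings.enable_weight_streaming:
        # Apply the automatic budget before any execution context exists,
        # since TRT rejects budget changes once a context is alive.
        budget = self.engine.get_weight_streaming_automatic_budget()
        self.engine.weight_streaming_budget_v2 = budget
    self.context = self.engine.create_execution_context()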
@@ -191,6 +221,7 @@ def __del__(self) -> None:
            self.cudagraph.reset()

    def forward(self, *inputs: torch.Tensor) -> torch.Tensor | Tuple[torch.Tensor, ...]:
        self.init_context()
I really want to pull these calls out. It should assume that the engine is set up and error if not.
Understood. Recreation of the context happens only when set_weight_streaming_budget is called.
Hi @narendasan
    enable_weight_streaming=True,
)
# Weight streaming budget is applied manually.
ws_context = torchtrt.runtime.weight_streaming_context(optimized_model)
Can we use the context manager syntax to use this?
with torch_tensorrt.runtime.weight_streaming(model) as weight_streaming_ctx:
    current_budget = weight_streaming_ctx.device_budget
    weight_streaming_ctx.device_budget = current_budget * 0.7  # Can add listeners to __setattr__ to trigger functions
    optimized_model(*input)
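A sketch of how such a __setattr__ listener could work for a single engine (illustrative only, not the shipped implementation; _WeightStreamingCtx is a hypothetical name):

import tensorrt as trt

class _WeightStreamingCtx:
    """Illustrative sketch: assigning device_budget immediately re-applies
    the TRT weight streaming budget on the wrapped engine."""

    def __init__(self, engine: trt.ICudaEngine):
        object.__setattr__(self, "engine", engine)
        # Bypass the listener for the initial value (full weights on device).
        object.__setattr__(self, "device_budget", engine.streamable_weights_size)

    def __setattr__(self, name, value):
        if name == "device_budget":
            # Listener: the assignment itself triggers the TRT API call.
            # (The real module would also drop/recreate execution contexts.)
            self.engine.weight_streaming_budget_v2 = int(value)
            value = self.engine.weight_streaming_budget_v2  # what TRT accepted
        object.__setattr__(self, name, value)

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Restore the default (no streaming) when the context exits.
        self.device_budget = self.engine.streamable_weights_size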
Thoughts:
- If we use weight streaming as the default, is there any problem with perf? Assuming we don't allocate any budget, or automatic is chosen, and the model can fit in GPU memory completely.
    cast_layer = ctx.net.add_cast(input_val, trt_dtype)
    cast_layer.name = f"Cast ITensor {input_val.name} from {input_val.dtype} to {trt_dtype} - [{target_name}]-[{name}]"

    return cast_layer.get_output(0)
These are currently in the llm_examples_main PR.
Rebase with main, as the llm_examples PR is merged.
    if ctx.net.get_flag(trt.NetworkDefinitionCreationFlag.STRONGLY_TYPED):
        promoted_type = trt_inputs[0].dtype
        for each_input in trt_inputs[1:]:
            promoted_type = _enums.dtype._from(
                torch.promote_types(
                    _enums.dtype._from(promoted_type).to(torch.dtype),
                    _enums.dtype._from(each_input.dtype).to(torch.dtype),
                )
            )

        trt_promoted_type = promoted_type.to(trt.DataType)
        trt_casted_inputs = []
        for i, each_input in enumerate(trt_inputs):
            casted_input = cast_trt_tensor(
                ctx, each_input, trt_promoted_type, f"{name}_input_casted_{i}"
The type promotion is fine, but does it need to happen only when strong typing is enabled? Why not do this in the general case as well?
I thought TRT could optimize perf for relaxed precision, but it seems multiple inputs to ops are eventually cast to the same type anyway. I tested the SD UNet model with and without promoted types, and there was no difference. I will generalize; a sketch of the promotion rule follows below.
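A self-contained illustration of the promotion rule in the diff above, run unconditionally rather than only under STRONGLY_TYPED (promote_input_dtypes is an illustrative helper, not the converter's API):

import torch

def promote_input_dtypes(dtypes: list[torch.dtype]) -> torch.dtype:
    # Fold torch.promote_types across all input dtypes; every input
    # would then be cast to the winner, strongly typed or not.
    promoted = dtypes[0]
    for d in dtypes[1:]:
        promoted = torch.promote_types(promoted, d)
    return promoted

print(promote_input_dtypes([torch.float16, torch.int32]))    # torch.float16
print(promote_input_dtypes([torch.float16, torch.float32]))  # torch.float32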
    dtype = input.dtype if strongly_typed else None
    bias = to_numpy(bias, dtype=dtype)
Shouldn't the type of bias always be input.dtype?
If we test the fp16 variant of the sd_unet model, the bias data type is float16. It needs to be cast to run with the weight streaming option; a sketch of the conversion follows below.
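A minimal sketch of why the dtype pass-through matters for the fp16 variant (to_numpy_sketch is an illustrative stand-in, not the converter-utils implementation):

import numpy as np
import torch

def to_numpy_sketch(t: torch.Tensor, dtype: torch.dtype | None = None) -> np.ndarray:
    # Honor an explicit dtype instead of silently promoting, so an fp16
    # bias stays fp16 in a strongly typed network.
    if dtype is not None:
        t = t.to(dtype)
    return t.detach().cpu().numpy()

bias = torch.randn(16, dtype=torch.float16)     # fp16 variant weights
arr = to_numpy_sketch(bias, dtype=torch.float16)
assert arr.dtype == np.float16                  # matches input.dtype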
@@ -85,6 +85,12 @@ def __init__(
        EXPLICIT_BATCH = 1 << (int)(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
        flag |= EXPLICIT_BATCH

        if compilation_settings.enable_weight_streaming:
            STRONGLY_TYPED = 1 << (int)(
We should at least log this, since it affects the graph being created.
Waiting for a separate compiler option to use a strongly typed network:
https://github.com/pytorch/TensorRT/pull/3110/files#diff-4396607120a22430fe9fdb7d00b094ae5d55f28d0d2e3543a878ac48583ebd21R83
I will incorporate it.
Super minor stuff at this point; I think it's almost ready to go.
I think this mostly looks good to me; anything outstanding?
No pending items. I think this PR can be merged.
Minor comments:
- Added a comment in the example.
- Also update this example reference in docsrc/index.rst so it gets rendered.
- Rebase with main to resolve conflicts.
Overall, changes LGTM.