Facing issue in config search with Batch Size, Dynamic Batching, and Sequence Length #957
If you want exhaustive profiling then you should run in brute search mode.
I tried running it with brute search mode using this command:

model-analyzer -v profile \
    -f config.yaml \
    --triton-launch-mode=docker \
    --output-model-repository-path /path/to/output \
    --run-config-search-mode brute \
    --profile-models reranker \
    --override-output-model-repository \
    --model-repository /path/to/models_repo/ \
    --triton-output-path brute_search.log

config.yaml:

perf_analyzer_flags:
  shape:
    - input_ids:128
    - attention_mask:128
    - token_type_ids:128

LOGS:

model-analyzer logs:
Triton Server Logs:

=============================
== Triton Inference Server ==
=============================
NVIDIA Release 23.01 (build 52277748)
Triton Server Version 2.30.0
Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
I0115 09:23:22.748025 7 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fe718000000' with size 268435456
I0115 09:23:22.750046 7 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0115 09:23:22.752359 7 server.cc:563]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0115 09:23:22.752379 7 server.cc:590]
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+
I0115 09:23:22.752394 7 server.cc:633]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+
I0115 09:23:22.778548 7 metrics.cc:864] Collecting metrics for GPU 0: Tesla T4
I0115 09:23:22.778751 7 metrics.cc:757] Collecting CPU metrics
I0115 09:23:22.778938 7 tritonserver.cc:2264]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.30.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace logging |
| model_repository_path[0] | /home/ec2-user/workspace/model_out/brute |
| model_control_mode | MODE_EXPLICIT |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0115 09:23:22.780195 7 grpc_server.cc:4819] Started GRPCInferenceService at 0.0.0.0:8001
I0115 09:23:22.780459 7 http_server.cc:3477] Started HTTPService at 0.0.0.0:8000
I0115 09:23:22.821468 7 http_server.cc:184] Started Metrics Service at 0.0.0.0:8002
I0115 09:23:26.366687 7 model_lifecycle.cc:459] loading: reranker_config_default:1
I0115 09:23:26.368093 7 onnxruntime.cc:2459] TRITONBACKEND_Initialize: onnxruntime
I0115 09:23:26.368119 7 onnxruntime.cc:2469] Triton TRITONBACKEND API version: 1.11
I0115 09:23:26.368126 7 onnxruntime.cc:2475] 'onnxruntime' TRITONBACKEND API version: 1.11
I0115 09:23:26.368132 7 onnxruntime.cc:2505] backend configuration:
{"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I0115 09:23:26.383661 7 onnxruntime.cc:2563] TRITONBACKEND_ModelInitialize: reranker_config_default (version 1)
I0115 09:23:26.384239 7 onnxruntime.cc:666] skipping model configuration auto-complete for 'reranker_config_default': inputs and outputs already specified
I0115 09:23:26.388541 7 onnxruntime.cc:2606] TRITONBACKEND_ModelInstanceInitialize: reranker (GPU device 0)
2025-01-15 09:23:26.584355482 [W:onnxruntime:, session_state.cc:1030 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-01-15 09:23:26.584378936 [W:onnxruntime:, session_state.cc:1032 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I0115 09:23:26.689601 7 model_lifecycle.cc:694] successfully loaded 'reranker_config_default' version 1
=============================
== Triton Inference Server ==
=============================
NVIDIA Release 23.01 (build 52277748)
Triton Server Version 2.30.0
Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
I0115 09:24:02.823235 7 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7f4e54000000' with size 268435456
I0115 09:24:02.825230 7 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0115 09:24:02.827562 7 server.cc:563]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0115 09:24:02.827588 7 server.cc:590]
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+
I0115 09:24:02.827605 7 server.cc:633]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+
I0115 09:24:02.854186 7 metrics.cc:864] Collecting metrics for GPU 0: Tesla T4
I0115 09:24:02.854429 7 metrics.cc:757] Collecting CPU metrics
I0115 09:24:02.854604 7 tritonserver.cc:2264]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.30.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace logging |
| model_repository_path[0] | /home/ec2-user/workspace/model_out/brute |
| model_control_mode | MODE_EXPLICIT |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0115 09:24:02.855883 7 grpc_server.cc:4819] Started GRPCInferenceService at 0.0.0.0:8001
I0115 09:24:02.856151 7 http_server.cc:3477] Started HTTPService at 0.0.0.0:8000
I0115 09:24:02.899441 7 http_server.cc:184] Started Metrics Service at 0.0.0.0:8002
E0115 09:24:06.436241 7 model_repository_manager.cc:1004] Poll failed for model directory 'reranker_config_0': dynamic batching preferred size must be <= max batch size for reranker_config_0
=============================
== Triton Inference Server ==
=============================
NVIDIA Release 23.01 (build 52277748)
Triton Server Version 2.30.0
Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
I0115 09:24:12.757457 7 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fedbe000000' with size 268435456
I0115 09:24:12.759488 7 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0115 09:24:12.761792 7 server.cc:563]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+
I0115 09:24:12.761813 7 server.cc:590]
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+
I0115 09:24:12.761828 7 server.cc:633]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+
I0115 09:24:12.788033 7 metrics.cc:864] Collecting metrics for GPU 0: Tesla T4
I0115 09:24:12.788255 7 metrics.cc:757] Collecting CPU metrics
I0115 09:24:12.788449 7 tritonserver.cc:2264]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option | Value |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id | triton |
| server_version | 2.30.0 |
| server_extensions | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace logging |
| model_repository_path[0] | /home/ec2-user/workspace/model_out/brute |
| model_control_mode | MODE_EXPLICIT |
| strict_model_config | 0 |
| rate_limit | OFF |
| pinned_memory_pool_byte_size | 268435456 |
| cuda_memory_pool_byte_size{0} | 67108864 |
| response_cache_byte_size | 0 |
| min_supported_compute_capability | 6.0 |
| strict_readiness | 1 |
| exit_timeout | 30 |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
I0115 09:24:12.789778 7 grpc_server.cc:4819] Started GRPCInferenceService at 0.0.0.0:8001
I0115 09:24:12.790045 7 http_server.cc:3477] Started HTTPService at 0.0.0.0:8000
I0115 09:24:12.831095 7 http_server.cc:184] Started Metrics Service at 0.0.0.0:8002
E0115 09:24:16.376338 7 model_repository_manager.cc:1004] Poll failed for model directory 'reranker_config_1': dynamic batching preferred size must be <= max batch size for reranker_config_1

For all subsequent configs (reranker_config_2, reranker_config_3, etc.) the same error keeps coming up in the server.
Can you also share the model config? Curious what the max batch size of the model is.
I have not added any default config, but this is the config.pbtxt generated for the default run by model-analyzer:

name: "reranker_config_default"
platform: "onnxruntime_onnx"
version_policy {
latest {
num_versions: 1
}
}
max_batch_size: 4
input {
name: "token_type_ids"
data_type: TYPE_INT64
dims: -1
}
input {
name: "attention_mask"
data_type: TYPE_INT64
dims: -1
}
input {
name: "input_ids"
data_type: TYPE_INT64
dims: -1
}
output {
name: "logits"
data_type: TYPE_FP32
dims: 1
}
instance_group {
name: "reranker"
count: 1
gpus: 0
kind: KIND_GPU
}
default_model_filename: "model.onnx"
dynamic_batching {
preferred_batch_size: 4
}
optimization {
input_pinned_memory {
enable: true
}
output_pinned_memory {
enable: true
}
}
backend: "onnxruntime"
model-analyzer -v profile \
-f config.yaml \
-b 4,8,16,32,64,128 \
--triton-launch-mode=docker \
--output-model-repository-path /path/to/output \
--run-config-search-mode brute \
--run-config-search-min-model-batch-size 4 \
--client-protocol http \
--profile-models reranker \
--override-output-model-repository \
--model-repository /path/to/models_repo/ \
--triton-output-path brute_search.log
dynamic_batching {
  preferred_batch_size: 4
}

If I want to sweep over different dynamic batching configurations, such as preferred batch sizes as a list [4, 8, 16, ...], and also experiment with queue delay, how can I do that?
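A minimal sketch of how such a sweep might be expressed in the Model Analyzer config.yaml, assuming model_config_parameters accepts lists of candidate values for max_batch_size and the dynamic_batching fields (field names and nesting should be verified against the Model Analyzer configuration docs for your version):

profile_models:
  reranker:
    model_config_parameters:
      # each value in these lists becomes a separate generated model config
      max_batch_size: [4, 8, 16, 32]
      dynamic_batching:
        # list of lists: each inner list is one preferred_batch_size setting to try
        preferred_batch_size: [[4], [8], [16]]
        # queue delay values (in microseconds) to sweep
        max_queue_delay_microseconds: [100, 200, 500]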
I am running into some issues when running the model analyzer.
But this also results in another issue. For configs other than the default config, when model-analyzer loaded the Triton server I could see from the logs that the server was setting the max batch size to 1 while the dynamic batching preferred size stayed at 4, which resulted in the following error: "dynamic batching preferred size must be <= max batch size"
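A hypothetical sketch of what a generated config such as reranker_config_0 presumably contained, illustrating the mismatch the server rejects (the exact generated values are an assumption based on the error message and the logs above):

max_batch_size: 1
dynamic_batching {
  preferred_batch_size: 4   # larger than max_batch_size, so the repository poll fails
}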
It would be really helpful if you could answer these questions. And if it is possible to check all these various configs (batch size / dynamic batching / sequence length) with a single config file or CLI command, that would be ideal.
Thank you in advance!
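A minimal sketch of a single config.yaml covering all three dimensions at once, assuming batch_sizes, model_config_parameters, and perf_analyzer_flags can be combined per model like this (again, an assumption to verify against the Model Analyzer docs rather than a confirmed recipe):

batch_sizes: [4, 8, 16, 32, 64, 128]   # client-side request batch sizes swept by perf_analyzer
profile_models:
  reranker:
    model_config_parameters:
      max_batch_size: [4, 8, 16, 32]   # swept into the generated config.pbtxt files
      dynamic_batching:
        max_queue_delay_microseconds: [100, 500]
    perf_analyzer_flags:
      shape:
        - input_ids:128        # fixed sequence length for the variable-length inputs
        - attention_mask:128
        - token_type_ids:128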