
Facing issue in config search with Batch Size, Dynamic Batching, and Sequence Length #957

harsh-boloai opened this issue Jan 13, 2025 · 4 comments


@harsh-boloai

I am running into some issues when running the model analyzer.

  1. Running it on an ONNX model with the config.yaml below, model-analyzer only checks batch size 4 and never goes beyond it (screenshot attached under question 3); a possible cause and fix are sketched after the config:
model_repository: /path/to/models_repo/

# Disable run config search
run_config_search_disable: True

# Model profiling configuration
profile_models:
  reranker:
    parameters:
      concurrency:
        start: 5
        stop: 20
        step: 5
      batch_sizes: [4, 8, 16, 32, 64]
    model_config_parameters:
      dynamic_batching:
        max_queue_delay_microseconds: [200, 400, 600]
      instance_group:
        - kind: KIND_GPU
          count: [1, 2]
perf_analyzer_flags:
  shape:
  - input_ids:128
  - attention_mask:128
  - token_type_ids:128
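
A likely cause: the batch_sizes list under parameters sets the client-side batch size that perf_analyzer sends, and perf_analyzer cannot send batches larger than the model's max_batch_size, which stays at 4 here. Sweeping max_batch_size under model_config_parameters (supported per the Model Analyzer config docs) should let the search go higher; an untested sketch:

model_config_parameters:
  # candidate max_batch_size values for the generated model configs
  max_batch_size: [4, 8, 16, 32, 64]
  dynamic_batching:
    max_queue_delay_microseconds: [200, 400, 600]
  instance_group:
    - kind: KIND_GPU
      count: [1, 2]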
  2. I tried running it using the CLI, keeping only the input shapes in the config file, with the command below:
model-analyzer profile \
  -f config.yaml \
  --triton-launch-mode=docker \
  --output-model-repository-path /path/to/output \
  --run-config-search-max-instance-count 2 \
  --profile-models reranker \
  --run-config-search-max-concurrency 2 \
  --run-config-search-max-model-batch-size 2 \
  --override-output-model-repository \
  --model-repository /path/to/models_repo/

But this also results in another issue. For configs other than the default one, when model-analyzer loads the Triton server, the logs show the server setting the max batch size to 1 and the dynamic batching preferred batch size to 4, which results in the following error: "dynamic batching preferred size must be <= max batch size".

  3. Is there any way to have model-analyzer also run all configurations with dynamic batching turned off, to compare how it affects throughput? Below is the report generated using the config mentioned in 1); how can I get rows with dynamic batching disabled into it? (A workaround is sketched below the screenshot.)
[Screenshot: generated summary report, 2025-01-13]
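
One workaround for this comparison: keep two config files, one with the dynamic_batching block and one without, and profile each into its own output repository. An untested sketch (config_db.yaml and config_nodb.yaml are assumed file names):

# config_db.yaml has the dynamic_batching block; config_nodb.yaml omits it.
for cfg in config_db config_nodb; do
  model-analyzer profile \
    -f "${cfg}.yaml" \
    --triton-launch-mode=docker \
    --profile-models reranker \
    --model-repository /path/to/models_repo/ \
    --output-model-repository-path "/path/to/output_${cfg}" \
    --override-output-model-repository
done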
  4. Is there any way to include varying sequence lengths in the analysis? I tried the config below, but it did not work (a workaround sketch follows it):
perf_analyzer_flags:
  shape:
  - input_ids:[128,256]
  - attention_mask:[128,256]
  - token_type_ids:[128,256]
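
perf_analyzer's --shape flag takes a single concrete shape per tensor, so the list syntax above is not valid there. A workaround is one profiling pass per sequence length with a generated config file; an untested sketch:

# One model-analyzer run per sequence length; results land in separate
# output repositories so the runs can be compared afterwards.
for seq_len in 128 256; do
  cat > "config_seq${seq_len}.yaml" <<EOF
model_repository: /path/to/models_repo/
perf_analyzer_flags:
  shape:
  - input_ids:${seq_len}
  - attention_mask:${seq_len}
  - token_type_ids:${seq_len}
EOF
  model-analyzer profile \
    -f "config_seq${seq_len}.yaml" \
    --triton-launch-mode=docker \
    --profile-models reranker \
    --output-model-repository-path "/path/to/output_seq${seq_len}" \
    --override-output-model-repository
done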

It would be really helpful if you could answer these questions. And if it is possible to sweep all of these dimensions (batch size / dynamic batching / sequence length) with a single config or CLI command, that would be ideal.

Thank you in advance!

@nv-braf
Contributor

nv-braf commented Jan 13, 2025

If you want exhaustive profiling then you should run in brute search mode.
Without seeing the logs it sounds to me like quick search couldn't find any valid configurations with a bs > 4.
I would recommend you run with `-v` to turn on the debug/verbose prints to get a better idea of what is being profiled.

@harsh-boloai
Author

I tried running it with brute search mode using this command:

model-analyzer -v profile \
  -f config.yaml \
  --triton-launch-mode=docker \
  --output-model-repository-path /path/to/output \
  --run-config-search-mode brute \
  --profile-models reranker \
  --override-output-model-repository \
  --model-repository /path/to/models_repo/ \
  --triton-output-path brute_search.log

config.yaml

perf_analyzer_flags:
  shape:
  - input_ids:128
  - attention_mask:128
  - token_type_ids:128

LOGS:

model-analyzer logs:

 'triton_server_flags': {},
 'triton_server_path': 'tritonserver',
 'weighting': None}
09:22:53 [Model Analyzer] Initializing GPUDevice handles
09:22:54 [Model Analyzer] Using GPU 0 Tesla T4 with UUID GPU-4d72aabf-001d-3307-fd66-a250532bdf5b
09:22:54 [Model Analyzer] Starting a Triton Server using docker
09:22:54 [Model Analyzer] Pulling docker image nvcr.io/nvidia/tritonserver:23.01-py3
09:22:55 [Model Analyzer] No checkpoint file found, starting a fresh run.
09:22:55 [Model Analyzer] Profiling server only metrics...
09:22:56 [Model Analyzer] DEBUG: Triton Server started.
09:23:11 [Model Analyzer] DEBUG: Stopped Triton Server.
09:23:11 [Model Analyzer] Starting a Triton Server using docker
09:23:11 [Model Analyzer] Pulling docker image nvcr.io/nvidia/tritonserver:23.01-py3
09:23:13 [Model Analyzer] DEBUG: Triton Server started.
09:23:16 [Model Analyzer] DEBUG: Model reranker loaded
09:23:21 [Model Analyzer] DEBUG: Stopped Triton Server.
09:23:21 [Model Analyzer]
09:23:21 [Model Analyzer] Creating model config: reranker_config_default
09:23:21 [Model Analyzer]
09:23:22 [Model Analyzer] DEBUG: Triton Server started.
09:23:26 [Model Analyzer] DEBUG: Model reranker_config_default loaded
09:23:26 [Model Analyzer] Profiling reranker_config_default: client batch size=1, concurrency=1
09:23:26 [Model Analyzer] DEBUG: Running ['perf_analyzer', '-m', 'reranker_config_default', '-b', '1', '-u', 'localhost:8001', '-i', 'grpc', '-f', 'reranker_config_default-results.csv', '--verbose-csv', '--concurrency-range', '1', '--shape', 'input_ids:128', '--shape', 'attention_mask:128', '--shape', 'token_type_ids:128', '--measurement-mode', 'count_windows', '--collect-metrics', '--metrics-url', 'http://localhost:8002/metrics', '--metrics-interval', '1000.0']
09:23:30 [Model Analyzer] DEBUG: Reading PA results from reranker_config_default-results.csv
09:23:30 [Model Analyzer] Profiling reranker_config_default: client batch size=1, concurrency=2
09:23:30 [Model Analyzer] DEBUG: Running ['perf_analyzer', '-m', 'reranker_config_default', '-b', '1', '-u', 'localhost:8001', '-i', 'grpc', '-f', 'reranker_config_default-results.csv', '--verbose-csv', '--concurrency-range', '2', '--shape', 'input_ids:128', '--shape', 'attention_mask:128', '--shape', 'token_type_ids:128', '--measurement-mode', 'count_windows', '--collect-metrics', '--metrics-url', 'http://localhost:8002/metrics', '--metrics-interval', '1000.0']
09:23:34 [Model Analyzer] DEBUG: Reading PA results from reranker_config_default-results.csv
09:23:34 [Model Analyzer] Profiling reranker_config_default: client batch size=1, concurrency=4
09:23:34 [Model Analyzer] DEBUG: Running ['perf_analyzer', '-m', 'reranker_config_default', '-b', '1', '-u', 'localhost:8001', '-i', 'grpc', '-f', 'reranker_config_default-results.csv', '--verbose-csv', '--concurrency-range', '4', '--shape', 'input_ids:128', '--shape', 'attention_mask:128', '--shape', 'token_type_ids:128', '--measurement-mode', 'count_windows', '--collect-metrics', '--metrics-url', 'http://localhost:8002/metrics', '--metrics-interval', '1000.0']
09:23:38 [Model Analyzer] DEBUG: Reading PA results from reranker_config_default-results.csv
09:23:38 [Model Analyzer] Profiling reranker_config_default: client batch size=1, concurrency=8
09:23:38 [Model Analyzer] DEBUG: Running ['perf_analyzer', '-m', 'reranker_config_default', '-b', '1', '-u', 'localhost:8001', '-i', 'grpc', '-f', 'reranker_config_default-results.csv', '--verbose-csv', '--concurrency-range', '8', '--shape', 'input_ids:128', '--shape', 'attention_mask:128', '--shape', 'token_type_ids:128', '--measurement-mode', 'count_windows', '--collect-metrics', '--metrics-url', 'http://localhost:8002/metrics', '--metrics-interval', '1000.0']
09:23:42 [Model Analyzer] DEBUG: Reading PA results from reranker_config_default-results.csv
09:23:42 [Model Analyzer] Profiling reranker_config_default: client batch size=1, concurrency=16
09:23:42 [Model Analyzer] DEBUG: Running ['perf_analyzer', '-m', 'reranker_config_default', '-b', '1', '-u', 'localhost:8001', '-i', 'grpc', '-f', 'reranker_config_default-results.csv', '--verbose-csv', '--concurrency-range', '16', '--shape', 'input_ids:128', '--shape', 'attention_mask:128', '--shape', 'token_type_ids:128', '--measurement-mode', 'count_windows', '--collect-metrics', '--metrics-url', 'http://localhost:8002/metrics', '--metrics-interval', '1000.0']
09:23:46 [Model Analyzer] DEBUG: Reading PA results from reranker_config_default-results.csv
09:23:46 [Model Analyzer] Profiling reranker_config_default: client batch size=1, concurrency=32
09:23:46 [Model Analyzer] DEBUG: Running ['perf_analyzer', '-m', 'reranker_config_default', '-b', '1', '-u', 'localhost:8001', '-i', 'grpc', '-f', 'reranker_config_default-results.csv', '--verbose-csv', '--concurrency-range', '32', '--shape', 'input_ids:128', '--shape', 'attention_mask:128', '--shape', 'token_type_ids:128', '--measurement-mode', 'count_windows', '--collect-metrics', '--metrics-url', 'http://localhost:8002/metrics', '--metrics-interval', '1000.0']
09:23:51 [Model Analyzer] DEBUG: Reading PA results from reranker_config_default-results.csv
09:23:51 [Model Analyzer] Profiling reranker_config_default: client batch size=1, concurrency=64
09:23:51 [Model Analyzer] DEBUG: Running ['perf_analyzer', '-m', 'reranker_config_default', '-b', '1', '-u', 'localhost:8001', '-i', 'grpc', '-f', 'reranker_config_default-results.csv', '--verbose-csv', '--concurrency-range', '64', '--shape', 'input_ids:128', '--shape', 'attention_mask:128', '--shape', 'token_type_ids:128', '--measurement-mode', 'count_windows', '--collect-metrics', '--metrics-url', 'http://localhost:8002/metrics', '--metrics-interval', '1000.0']
09:23:55 [Model Analyzer] DEBUG: Reading PA results from reranker_config_default-results.csv
09:23:55 [Model Analyzer] No longer increasing concurrency as throughput has plateaued
09:23:55 [Model Analyzer]
09:23:55 [Model Analyzer] Creating model config: reranker_config_0
09:23:55 [Model Analyzer]   Setting instance_group to [{'count': 1, 'kind': 'KIND_GPU'}]
09:23:55 [Model Analyzer]   Setting max_batch_size to 1
09:23:55 [Model Analyzer]   Enabling dynamic_batching
09:23:55 [Model Analyzer]
09:24:01 [Model Analyzer] DEBUG: Stopped Triton Server.
09:24:02 [Model Analyzer] DEBUG: Triton Server started.
09:24:06 [Model Analyzer] Model reranker_config_0 load failed: [StatusCode.INTERNAL] failed to load 'reranker_config_0', failed to poll from model repository
09:24:11 [Model Analyzer] DEBUG: Stopped Triton Server.
09:24:11 [Model Analyzer]
09:24:11 [Model Analyzer] Creating model config: reranker_config_1
09:24:11 [Model Analyzer]   Setting instance_group to [{'count': 2, 'kind': 'KIND_GPU'}]
09:24:11 [Model Analyzer]   Setting max_batch_size to 1
09:24:11 [Model Analyzer]   Enabling dynamic_batching
09:24:11 [Model Analyzer]
09:24:12 [Model Analyzer] DEBUG: Triton Server started.
09:24:16 [Model Analyzer] Model reranker_config_1 load failed: [StatusCode.INTERNAL] failed to load 'reranker_config_1', failed to poll from model repository
09:24:21 [Model Analyzer] DEBUG: Stopped Triton Server.
09:24:21 [Model Analyzer]
09:24:21 [Model Analyzer] Creating model config: reranker_config_2
09:24:21 [Model Analyzer]   Setting instance_group to [{'count': 3, 'kind': 'KIND_GPU'}]
09:24:21 [Model Analyzer]   Setting max_batch_size to 1
09:24:21 [Model Analyzer]   Enabling dynamic_batching
09:24:21 [Model Analyzer]
09:24:22 [Model Analyzer] DEBUG: Triton Server started.
09:24:26 [Model Analyzer] Model reranker_config_2 load failed: [StatusCode.INTERNAL] failed to load 'reranker_config_2', failed to poll from model repository
09:24:31 [Model Analyzer] DEBUG: Stopped Triton Server.
09:24:31 [Model Analyzer]
09:24:31 [Model Analyzer] Creating model config: reranker_config_3
09:24:31 [Model Analyzer]   Setting instance_group to [{'count': 4, 'kind': 'KIND_GPU'}]
09:24:31 [Model Analyzer]   Setting max_batch_size to 1
09:24:31 [Model Analyzer]   Enabling dynamic_batching
09:24:31 [Model Analyzer]
09:24:32 [Model Analyzer] DEBUG: Triton Server started.
09:24:36 [Model Analyzer] Model reranker_config_3 load failed: [StatusCode.INTERNAL] failed to load 'reranker_config_3', failed to poll from model repository
09:24:41 [Model Analyzer] DEBUG: Stopped Triton Server.
09:24:41 [Model Analyzer]
09:24:41 [Model Analyzer] Creating model config: reranker_config_4
09:24:41 [Model Analyzer]   Setting instance_group to [{'count': 5, 'kind': 'KIND_GPU'}]
09:24:41 [Model Analyzer]   Setting max_batch_size to 1
09:24:41 [Model Analyzer]   Enabling dynamic_batching
09:24:41 [Model Analyzer]
09:24:42 [Model Analyzer] DEBUG: Triton Server started.
09:24:46 [Model Analyzer] Model reranker_config_4 load failed: [StatusCode.INTERNAL] failed to load 'reranker_config_4', failed to poll from model repository
09:24:51 [Model Analyzer] DEBUG: Stopped Triton Server.
09:24:51 [Model Analyzer] Saved checkpoint to /workspace/checkpoints/0.ckpt
09:24:51 [Model Analyzer] Profile complete. Profiled 1 configurations for models: ['reranker']
09:24:51 [Model Analyzer]
09:24:51 [Model Analyzer] Exporting server only metrics to /workspace/results/metrics-server-only.csv
09:24:51 [Model Analyzer] Exporting inference metrics to /workspace/results/metrics-model-inference.csv
09:24:51 [Model Analyzer] Exporting GPU metrics to /workspace/results/metrics-model-gpu.csv
Models (Inference):
Model      Batch   Concurrency   Model Config Path         Instance Group   Max Batch Size   Satisfies Constraints   Throughput (infer/sec)   p99 Latency (ms)
reranker   1       8             reranker_config_default   1:GPU            4                Yes                     846.0                    10.1
reranker   1       16            reranker_config_default   1:GPU            4                Yes                     841.8                    19.9
reranker   1       32            reranker_config_default   1:GPU            4                Yes                     839.9                    39.4
reranker   1       64            reranker_config_default   1:GPU            4                Yes                     837.8                    78.9
reranker   1       4             reranker_config_default   1:GPU            4                Yes                     716.9                    5.8
reranker   1       2             reranker_config_default   1:GPU            4                Yes                     557.3                    3.8
reranker   1       1             reranker_config_default   1:GPU            4                Yes                     506.7                    2.0

Models (GPU Metrics):
Model      GPU UUID                                   Batch   Concurrency   Model Config Path         Instance Group   Satisfies Constraints   GPU Memory Usage (MB)   GPU Utilization (%)   GPU Power Usage (W)
reranker   GPU-4d72aabf-001d-3307-fd66-a250532bdf5b   1       8             reranker_config_default   1:GPU            Yes                     972.0                   69.0                  61.0
reranker   GPU-4d72aabf-001d-3307-fd66-a250532bdf5b   1       16            reranker_config_default   1:GPU            Yes                     972.0                   61.3                  57.4
reranker   GPU-4d72aabf-001d-3307-fd66-a250532bdf5b   1       32            reranker_config_default   1:GPU            Yes                     972.0                   61.0                  56.3
reranker   GPU-4d72aabf-001d-3307-fd66-a250532bdf5b   1       64            reranker_config_default   1:GPU            Yes                     972.0                   68.2                  58.9
reranker   GPU-4d72aabf-001d-3307-fd66-a250532bdf5b   1       4             reranker_config_default   1:GPU            Yes                     955.3                   57.3                  57.6
reranker   GPU-4d72aabf-001d-3307-fd66-a250532bdf5b   1       2             reranker_config_default   1:GPU            Yes                     955.3                   52.7                  57.1
reranker   GPU-4d72aabf-001d-3307-fd66-a250532bdf5b   1       1             reranker_config_default   1:GPU            Yes                     946.9                   47.3                  57.9

Server Only:
Model           GPU UUID                                   GPU Memory Usage (MB)   GPU Utilization (%)   GPU Power Usage (W)
triton-server   GPU-4d72aabf-001d-3307-fd66-a250532bdf5b   745.0                   0.0                   33.6

09:24:51 [Model Analyzer] WARNING: Requested top 3 configs, but found only 1. Showing all available configs for this model.
09:24:51 [Model Analyzer] WARNING: Requested top 3 configs, but found only 1. Showing all available configs for this model.
09:24:51 [Model Analyzer] WARNING: Requested top 3 configs, but found only 1. Showing all available configs for this model.
09:24:52 [Model Analyzer] Exporting Summary Report to /workspace/reports/summaries/reranker/result_summary.pdf
09:24:52 [Model Analyzer] WARNING: Requested top 3 configs, but found only 1. Showing all available configs for this model.
09:24:52 [Model Analyzer] WARNING: Requested top 3 configs, but found only 1. Showing all available configs for this model.
09:24:52 [Model Analyzer] To generate detailed reports for the 1 best configurations, run `model-analyzer report --report-model-configs reranker_config_default --export-path /workspace --config-file config.yaml`

Triton Server Logs:

=============================
== Triton Inference Server ==
=============================

NVIDIA Release 23.01 (build 52277748)
Triton Server Version 2.30.0

Copyright (c) 2018-2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

Various files include modifications (c) NVIDIA CORPORATION & AFFILIATES.  All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

I0115 09:23:22.748025 7 pinned_memory_manager.cc:240] Pinned memory pool is created at '0x7fe718000000' with size 268435456
I0115 09:23:22.750046 7 cuda_memory_manager.cc:105] CUDA memory pool is created on device 0 with size 67108864
I0115 09:23:22.752359 7 server.cc:563]
+------------------+------+
| Repository Agent | Path |
+------------------+------+
+------------------+------+

I0115 09:23:22.752379 7 server.cc:590]
+---------+------+--------+
| Backend | Path | Config |
+---------+------+--------+
+---------+------+--------+

I0115 09:23:22.752394 7 server.cc:633]
+-------+---------+--------+
| Model | Version | Status |
+-------+---------+--------+
+-------+---------+--------+

I0115 09:23:22.778548 7 metrics.cc:864] Collecting metrics for GPU 0: Tesla T4
I0115 09:23:22.778751 7 metrics.cc:757] Collecting CPU metrics
I0115 09:23:22.778938 7 tritonserver.cc:2264]
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Option                           | Value                                                                                                                                                                                                |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| server_id                        | triton                                                                                                                                                                                               |
| server_version                   | 2.30.0                                                                                                                                                                                               |
| server_extensions                | classification sequence model_repository model_repository(unload_dependents) schedule_policy model_configuration system_shared_memory cuda_shared_memory binary_tensor_data statistics trace logging |
| model_repository_path[0]         | /home/ec2-user/workspace/model_out/brute                                                                                                                                                             |
| model_control_mode               | MODE_EXPLICIT                                                                                                                                                                                        |
| strict_model_config              | 0                                                                                                                                                                                                    |
| rate_limit                       | OFF                                                                                                                                                                                                  |
| pinned_memory_pool_byte_size     | 268435456                                                                                                                                                                                            |
| cuda_memory_pool_byte_size{0}    | 67108864                                                                                                                                                                                             |
| response_cache_byte_size         | 0                                                                                                                                                                                                    |
| min_supported_compute_capability | 6.0                                                                                                                                                                                                  |
| strict_readiness                 | 1                                                                                                                                                                                                    |
| exit_timeout                     | 30                                                                                                                                                                                                   |
+----------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

I0115 09:23:22.780195 7 grpc_server.cc:4819] Started GRPCInferenceService at 0.0.0.0:8001
I0115 09:23:22.780459 7 http_server.cc:3477] Started HTTPService at 0.0.0.0:8000
I0115 09:23:22.821468 7 http_server.cc:184] Started Metrics Service at 0.0.0.0:8002
I0115 09:23:26.366687 7 model_lifecycle.cc:459] loading: reranker_config_default:1
I0115 09:23:26.368093 7 onnxruntime.cc:2459] TRITONBACKEND_Initialize: onnxruntime
I0115 09:23:26.368119 7 onnxruntime.cc:2469] Triton TRITONBACKEND API version: 1.11
I0115 09:23:26.368126 7 onnxruntime.cc:2475] 'onnxruntime' TRITONBACKEND API version: 1.11
I0115 09:23:26.368132 7 onnxruntime.cc:2505] backend configuration:
{"cmdline":{"auto-complete-config":"true","min-compute-capability":"6.000000","backend-directory":"/opt/tritonserver/backends","default-max-batch-size":"4"}}
I0115 09:23:26.383661 7 onnxruntime.cc:2563] TRITONBACKEND_ModelInitialize: reranker_config_default (version 1)
I0115 09:23:26.384239 7 onnxruntime.cc:666] skipping model configuration auto-complete for 'reranker_config_default': inputs and outputs already specified
I0115 09:23:26.388541 7 onnxruntime.cc:2606] TRITONBACKEND_ModelInstanceInitialize: reranker (GPU device 0)
2025-01-15 09:23:26.584355482 [W:onnxruntime:, session_state.cc:1030 VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2025-01-15 09:23:26.584378936 [W:onnxruntime:, session_state.cc:1032 VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
I0115 09:23:26.689601 7 model_lifecycle.cc:694] successfully loaded 'reranker_config_default' version 1

[Identical Triton Server startup banner and option table omitted.]
E0115 09:24:06.436241 7 model_repository_manager.cc:1004] Poll failed for model directory 'reranker_config_0': dynamic batching preferred size must be <= max batch size for reranker_config_0

[Identical Triton Server startup banner and option table omitted.]
E0115 09:24:16.376338 7 model_repository_manager.cc:1004] Poll failed for model directory 'reranker_config_1': dynamic batching preferred size must be <= max batch size for reranker_config_1

and for all subsequent configs (reranker_config_2, reranker_config_3, etc.) the same error keeps coming up in the server logs.

@nv-braf
Contributor

nv-braf commented Jan 15, 2025

Can you also share the model config? I'm curious what the max batch size of the model is.

@harsh-boloai
Author

I have not added any config myself, but this is the config.pbtxt that model-analyzer generated for the default run:

name: "reranker_config_default"
platform: "onnxruntime_onnx"
version_policy {
  latest {
    num_versions: 1
  }
}
max_batch_size: 4
input {
  name: "token_type_ids"
  data_type: TYPE_INT64
  dims: -1
}
input {
  name: "attention_mask"
  data_type: TYPE_INT64
  dims: -1
}
input {
  name: "input_ids"
  data_type: TYPE_INT64
  dims: -1
}
output {
  name: "logits"
  data_type: TYPE_FP32
  dims: 1
}
instance_group {
  name: "reranker"
  count: 1
  gpus: 0
  kind: KIND_GPU
}
default_model_filename: "model.onnx"
dynamic_batching {
  preferred_batch_size: 4
}
optimization {
  input_pinned_memory {
    enable: true
  }
  output_pinned_memory {
    enable: true
  }
}
backend: "onnxruntime"
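
For reference, the max_batch_size: 4 above appears to come from the onnxruntime backend's default-max-batch-size=4 applied during auto-complete (visible in the backend configuration line of the server log), and the auto-generated preferred_batch_size: 4 is exactly what conflicts with the swept max_batch_size of 1 seen earlier. One way to lift the cap, assuming an explicit config.pbtxt is added for the source model in the repository (untested sketch):

# Hypothetical fragment of an explicit config.pbtxt for the source model;
# a larger max_batch_size removes the auto-completed cap of 4.
max_batch_size: 64
dynamic_batching {
}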
  • I was able to make some progress towards having it run with batch sizes greater than 1 by modifying the model-analyzer command to this:
model-analyzer -v profile \
  -f config.yaml \
  -b 4,8,16,32,64,128 \
  --triton-launch-mode=docker \
  --output-model-repository-path  /path/to/output \
  --run-config-search-mode brute \
  --run-config-search-min-model-batch-size 4 \
  --client-protocol http \
  --profile-models reranker \
  --override-output-model-repository \
  --model-repository /path/to/models_repo/  \
  --triton-output-path brute_search.log
  1. But I am still not able to figure out how to include dynamic batching enabled/disabled and varying sequence lengths in the config search.

  2. Another issue I observed is when I looked at config.pbtxt for reranker_config_1 and subsequent ones generated by model-analyzer, the dynamic batching param is always fixed to this:

dynamic_batching {
  preferred_batch_size: 4
}

If I want to sweep over different dynamic batching configurations, such as preferred batch sizes given as a list [4, 8, 16, ...], and also experiment with the queue delay, how can I do that? (A possible format is sketched below.)
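
Based on the model_config_parameters format in the Model Analyzer config docs, a sweep over both might look like the following untested sketch (preferred_batch_size is itself a list in the model config, so the alternatives are written as a list of lists):

model_config_parameters:
  dynamic_batching:
    # each inner list is one candidate preferred_batch_size setting
    preferred_batch_size: [[4], [8], [16]]
    max_queue_delay_microseconds: [100, 200, 300]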
