

Support Matrix

The table below shows which search modes are available for each model type:

|        | Single/Multi-Model | BLS | Ensemble | LLM |
| ------ | ------------------ | --- | -------- | --- |
| Brute  | o                  | -   | -        | -   |
| Quick  | o                  | o   | o        | o   |
| Optuna | o                  | o   | o        | o   |

Multi-Model

This mode has the following limitations:

  • Does not support detailed reporting, only summary reports

Multi-model concurrent search mode can be enabled by adding the parameter `--run-config-profile-models-concurrently-enable` to the CLI.

It uses Quick Search mode's hill climbing algorithm to search all models' configuration spaces in parallel, looking for the maximal objective value within the specified constraints. Model Analyzer has observed positive outcomes when searching for the maximum objective value, with typical runtimes of around 20-30 minutes for a two to three model run (compared to the days a brute-force run would take to complete).

After it has found the best config(s), it will sweep the top-N configurations found (specified by `--num-configs-per-model`) over the default concurrency range before generating the summary reports.
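The top-N sweep can be tuned in the YAML config. A minimal sketch (assuming `num_configs_per_model` is set in the config file rather than on the CLI):

```yaml
model_repository: /path/to/model/repository/

run_config_profile_models_concurrently_enable: true

# Sweep the top 3 configurations found for each model
# over the default concurrency range before reporting.
num_configs_per_model: 3

profile_models:
  - model_A
  - model_B
```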

Note: The algorithm attempts to find the most fair and optimal result for all models by evaluating each model's objective gain/loss. In many cases it will rank a configuration with a lower total combined throughput (if that was the objective) higher, if doing so better balances the throughputs of all the models.


An example model analyzer YAML config that performs a Multi-model search:

```yaml
model_repository: /path/to/model/repository/

run_config_profile_models_concurrently_enable: true

profile_models:
  - model_A
  - model_B
```

Model Weighting

In addition to setting a model's objectives or constraints, in multi-model search mode you can set a model's weighting. By default each model has equal weighting (a value of 1), but in the YAML you can specify `weighting: <int>`, which will bias that model's objectives when evaluating for an optimal result.


An example where model A's objective gains (towards minimizing latency) will have 3 times the importance versus maximizing model B's throughput gains:

```yaml
model_repository: /path/to/model/repository/

run_config_profile_models_concurrently_enable: true

profile_models:
  model_A:
    weighting: 3
    objectives:
      perf_latency_p99: 1
  model_B:
    weighting: 1
    objectives:
      perf_throughput: 1
```

Ensemble

Profiling this model type has the following limitations:

  • Only supports up to four composing models
  • Composing models cannot be ensemble or BLS models

Ensemble models can be optimized using the Quick Search mode's hill climbing algorithm to search the composing models' configuration spaces in parallel, looking for the maximal objective value within the specified constraints. Model Analyzer has observed positive outcomes when searching for the maximum objective value, with runtimes under one hour (compared to the days a brute-force run would take to complete) for ensembles that contain up to four composing models.

After Model Analyzer has found the best config(s), it will sweep the top-N configurations found (specified by `--num-configs-per-model`) over the concurrency range before generating the summary reports.
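A minimal config sketch for profiling an ensemble (`ensemble_model` is a hypothetical name; only the top-level ensemble is listed, and its composing models come from the ensemble's own configuration in the repository):

```yaml
model_repository: /path/to/model/repository/

# ensemble_model is a hypothetical ensemble defined in the
# repository; it may have at most four composing models, and
# none of them may themselves be ensemble or BLS models.
profile_models:
  - ensemble_model
```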


BLS

Profiling this model type has the following limitations:

  • Only supports up to four composing models
  • Composing models cannot be ensemble or BLS models

BLS models can be optimized using the Quick Search mode's hill climbing algorithm to search the BLS composing models' configuration spaces, as well as the BLS model's instance count, in parallel, looking for the maximal objective value within the specified constraints. Model Analyzer has observed positive outcomes when searching for the maximum objective value, with runtimes under one hour (compared to the days a brute-force run would take to complete) for BLS models that contain up to four composing models.

After Model Analyzer has found the best config(s), it will sweep the top-N configurations found (specified by `--num-configs-per-model`) over the concurrency range before generating the summary reports.
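A minimal config sketch for a BLS model (all model names here are hypothetical, and the `bls_composing_models` option is an assumption; unlike an ensemble, the models a BLS script calls cannot be read from its configuration, so they would need to be listed explicitly):

```yaml
model_repository: /path/to/model/repository/

# bls_model is a hypothetical BLS model; the models its
# scripting logic calls (at most four) are listed explicitly.
bls_composing_models:
  - model_A
  - model_B

profile_models:
  - bls_model
```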


LLM

Profiling this model type has the following limitations:

  • Summary/Detailed reports do not include the new metrics

To profile LLMs you must tell Model Analyzer that the model type is LLM by setting `--model-type LLM` in the CLI/config file. You can pass CLI options through to the GenAI-Perf tool using `genai_perf_flags`. See the GenAI-Perf CLI documentation for a list of the flags that can be specified.

LLMs can be optimized using either Quick or Brute search mode.

An example model analyzer YAML config for an LLM:

```yaml
model_repository: /path/to/model/repository/

model_type: LLM
client_protocol: grpc

genai_perf_flags:
  backend: vllm
  streaming: true
```

For LLMs, three new metrics are reported: Inter-token Latency, Time to First Token Latency, and Output Token Throughput.

These new metrics can be specified as either objectives or constraints.

NOTE: To enable these new metrics you must enable streaming in `genai_perf_flags`, and the client protocol must be set to gRPC.
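A sketch of using one of the new metrics as an objective (the metric key name `inter_token_latency_p99` is an assumption for illustration; consult the Model Analyzer metrics documentation for the exact identifiers):

```yaml
model_repository: /path/to/model/repository/

model_type: LLM
client_protocol: grpc   # gRPC is required for the new LLM metrics

genai_perf_flags:
  backend: vllm
  streaming: true       # streaming is required for the new LLM metrics

profile_models:
  model_A:
    objectives:
      # Assumed metric key: minimize p99 inter-token latency.
      inter_token_latency_p99: 1
```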