We aim to run these benchmarks and share them with the OPEA community for three primary reasons:
- To offer insights on inference throughput in real-world scenarios, helping you choose the best service or deployment for your needs.
- To establish a baseline for validating optimization solutions across different implementations, providing clear guidance on which methods are most effective for your use case.
- To inspire the community to build upon our benchmarks, allowing us to better quantify new solutions in conjunction with current leading LLMs, serving frameworks, etc.
Before running the benchmarks, ensure you have:
- **Kubernetes Environment**
  - Kubernetes installation: Use kubespray or other official Kubernetes installation guides
  - (Optional) Kubernetes set up guide on Intel Gaudi product
- **Configuration YAML**

  The configuration file (e.g., `./ChatQnA/benchmark_chatqna.yaml`) consists of two main sections: deployment and benchmarking. Required fields marked with a `# mandatory` comment must be filled with valid values, such as `HUGGINGFACEHUB_API_TOKEN`. For all other fields, you can either customize them according to your needs or leave them empty (`""`) to use the default values from the Helm charts.

  **Default Models:**

  - LLM: `meta-llama/Meta-Llama-3-8B-Instruct` (Required: must be specified, as it is shared between the deployment and benchmarking phases)
  - Embedding: `BAAI/bge-base-en-v1.5`
  - Reranking: `BAAI/bge-reranker-base`

  You can customize which models to use by setting the `model_id` field in the corresponding service section. Note that the LLM model must be specified in the configuration because it is used by both the deployment and benchmarking processes.

  **Important Notes:**
  - For Gaudi deployments:
    - The LLM service runs on Gaudi devices
    - If enabled, the reranking service (teirerank) also runs on Gaudi devices
  - Llama Model Access:
    - Downloading Llama models requires both:
      - A HuggingFace API token
      - Special authorization from Meta
    - Please visit meta-llama/Meta-Llama-3-8B-Instruct to request access
    - Deployment will fail if the model download is unsuccessful due to missing authorization
  **Node and Replica Configuration:**

  ```yaml
  node: [1, 2, 4, 8]          # Number of nodes to deploy
  replicaCount: [1, 2, 4, 8]  # Must align with node configuration
  ```
  The `replicaCount` values must align with the `node` configuration by index:

  - When deploying on 1 node → uses replicaCount[0] = 1
  - When deploying on 2 nodes → uses replicaCount[1] = 2
  - When deploying on 4 nodes → uses replicaCount[2] = 4
  - When deploying on 8 nodes → uses replicaCount[3] = 8
  Note: Model parameters that accept lists (e.g., `max_batch_size`, `max_num_seqs`) are deployment parameters that affect model service behavior but not the number of service instances. When these parameters are lists, each value triggers a service upgrade followed by a new round of testing, while the number of service instances stays the same.
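  For orientation, here is a minimal sketch of how these fields can fit together. The field names are the ones mentioned above, but the exact nesting (e.g., the `services` grouping) and the values are illustrative assumptions; consult your example's YAML, such as `./ChatQnA/benchmark_chatqna.yaml`, for the authoritative layout:

  ```yaml
  deploy:
    HUGGINGFACEHUB_API_TOKEN: "<your-token>"   # mandatory
    node: [1, 2, 4, 8]                         # number of nodes to deploy
    services:                                  # assumed grouping of per-service settings
      llm:
        model_id: "meta-llama/Meta-Llama-3-8B-Instruct"  # required; shared with benchmarking
        replicaCount: [1, 2, 4, 8]             # aligned with the node list by index
        max_batch_size: [1, 8]                 # a list here means one upgrade + test round per value
      teirerank:
        enabled: True
        model_id: "BAAI/bge-reranker-base"
      embedding:
        model_id: "BAAI/bge-base-en-v1.5"
  benchmark: {}                                # benchmarking settings, left empty in this sketch
  ```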
- **Install required Python packages**

  Run the following command to install all necessary dependencies:

  ```bash
  pip install -r requirements.txt
  ```
Before running benchmarks, you need to:
- **Prepare Test Data**

  - Testing for the general benchmark target:

    Download the retrieval file using the command below for data ingestion in RAG:

    ```bash
    wget https://github.com/opea-project/GenAIEval/tree/main/evals/benchmark/data/upload_file.txt
    ```
  - Testing for the pubmed benchmark target:

    For the `chatqna_qlist_pubmed` test case, prepare `pubmed_${max_lines}.txt` by following this README.

  After the data is prepared, update the absolute path of this file in the benchmark YAML file. For example, in the `ChatQnA/benchmark_chatqna.yaml` file, `/home/sdp/upload_file.txt` should be replaced by your file path.
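  As a sketch only, the entry to edit would look something like the line below. The key name is hypothetical (locate the field in your YAML that currently holds `/home/sdp/upload_file.txt`); only the path value is yours to change:

  ```yaml
  # Hypothetical key name; find the entry that currently points to /home/sdp/upload_file.txt
  upload_file: "/absolute/path/to/upload_file.txt"
  ```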
- **Prepare Model Files (Recommended)**

  ```bash
  pip install -U "huggingface_hub[cli]"
  sudo mkdir -p /mnt/models
  sudo chmod 777 /mnt/models
  huggingface-cli download --cache-dir /mnt/models meta-llama/Meta-Llama-3-8B-Instruct
  ```
The benchmarking process consists of two main components: deployment and benchmarking. We provide `deploy_and_benchmark.py` as a unified entry point that combines both steps.

The script `deploy_and_benchmark.py` serves as the main entry point and accepts any example's configuration YAML file. Here are examples using the ChatQnA configuration:
- For a specific number of nodes:

  ```bash
  # Default OOB (Out of Box) mode
  python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --target-node 1

  # Or specify test mode explicitly
  python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --target-node 1 --test-mode [oob|tune]
  ```
- For all node configurations:

  ```bash
  # Default OOB (Out of Box) mode
  python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml

  # Or specify test mode explicitly
  python deploy_and_benchmark.py ./ChatQnA/benchmark_chatqna.yaml --test-mode [oob|tune]
  ```

  This will process all node configurations defined in your YAML file.
The script provides two test modes, controlled by the `--test-mode` parameter:
- **OOB (Out of Box) Mode - Default**

  ```bash
  --test-mode oob   # or omit the parameter
  ```

  - Uses enabled configurations only:
    - Resources: only uses resources when `resources.enabled` is True
    - Model parameters:
      - Uses batch parameters when `batch_params.enabled` is True
      - Uses token parameters when `token_params.enabled` is True
  - Suitable for basic functionality testing with selected optimizations
- **Tune Mode**

  ```bash
  --test-mode tune
  ```

  - Applies all configurations regardless of enabled status:
    - Resource-related parameters:
      - `resources.cores_per_instance`: CPU cores allocation
      - `resources.memory_capacity`: Memory allocation
      - `resources.cards_per_instance`: GPU/Accelerator cards allocation
    - Model parameters:
      - Batch parameters:
        - `max_batch_size`: Maximum batch size (TGI engine)
        - `max_num_seqs`: Maximum number of sequences (vLLM engine)
      - Token parameters:
        - `max_input_length`: Maximum input sequence length
        - `max_total_tokens`: Maximum total tokens per request
        - `max_batch_total_tokens`: Maximum total tokens in a batch
        - `max_batch_prefill_tokens`: Maximum tokens in the prefill phase
Choose "oob" mode when you want to selectively enable optimizations, or "tune" mode when you want to apply all available optimizations regardless of their enabled status.
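For reference, here is a hedged sketch of how these knobs could look inside a service section of the configuration YAML. The parameter names are the ones listed above; the surrounding structure and the example values are assumptions, so check your example's YAML for the exact layout:

```yaml
llm:
  model_id: "meta-llama/Meta-Llama-3-8B-Instruct"
  resources:
    enabled: False                 # honored in oob mode; always applied in tune mode
    cores_per_instance: "16"       # CPU cores allocation (example value)
    memory_capacity: "8000Mi"      # memory allocation (example value)
    cards_per_instance: 1          # GPU/Accelerator cards allocation (example value)
  batch_params:
    enabled: True
    max_batch_size: [1, 8]         # TGI engine
    max_num_seqs: [256]            # vLLM engine
  token_params:
    enabled: False
    max_input_length: "1280"
    max_total_tokens: "2048"
    max_batch_total_tokens: "65536"
    max_batch_prefill_tokens: "4096"
```

In oob mode, only the sections whose `enabled` flag is True are applied; in tune mode, every section above is applied regardless of the flag.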
**Helm Chart Directory Issues**

- During execution, the script downloads and extracts the Helm chart to a directory named after your example.
- The directory name is derived from your input YAML file path.
  - For example: if your input is `./ChatQnA/benchmark_chatqna.yaml`, the extracted directory will be `chatqna/`.
- In some error cases, this directory might not be properly cleaned up.
- If you encounter deployment issues, check if there's a leftover Helm chart directory:

  ```bash
  # Example: for ./ChatQnA/benchmark_chatqna.yaml
  ls -la chatqna/

  # Clean up if needed
  rm -rf chatqna/
  ```
- After cleaning up the directory, try running the deployment again.
Note: Always ensure there are no leftover Helm chart directories from previous failed runs before starting a new deployment.