Hi, I have a CPU-only environment, and to my understanding neither TRT-LLM nor vLLM will work there (both require a GPU).
I therefore deployed a tinyllama-1b model using onnxruntime and am now wondering whether there is any information on how to run inference with it given the inputs.
Any information/similar work is much appreciated :)
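For context, something along these lines is what I'm trying to get working. This is only a minimal sketch using Hugging Face Optimum's ONNX Runtime wrapper (the local model path is illustrative, and it assumes the model was exported beforehand with `optimum-cli export onnx`):

```python
# Minimal sketch: CPU-only text generation with an ONNX-exported TinyLlama
# via Hugging Face Optimum's ONNX Runtime integration.
# Assumed prior export step (hypothetical paths):
#   optimum-cli export onnx --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 tinyllama-onnx/
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForCausalLM

model_dir = "tinyllama-onnx"  # illustrative path to the exported model

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = ORTModelForCausalLM.from_pretrained(
    model_dir,
    provider="CPUExecutionProvider",  # CPU-only environment
)

# Tokenize the prompt and run autoregressive generation on CPU.
inputs = tokenizer("What is ONNX Runtime?", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Going through Optimum here is a convenience choice: its `generate()` handles the decoding loop and past-key-value plumbing that would otherwise have to be wired up by hand against a raw `onnxruntime.InferenceSession`.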
Logs of the container: