When I run the vLLM framework with the FLASHINFER attention backend, I get the error "RuntimeError: Qwen2-VL does not support _Backend.FLASHINFER backend now". How can I fix this error, and when will Qwen2-VL support the FLASHINFER backend?
VLLM_ATTENTION_BACKEND=FLASHINFER vllm serve /root/autodl-tmp/models/Qwen2-VL-7B-Instruct-FP8-Dynamic --max-model-len 3840
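For reference, the traceback in the log below points at the Qwen2-VL vision attention module rejecting the FLASHINFER backend, so the likely workaround until that support lands is to drop the override or to pin a backend the vision encoder accepts. A minimal workaround sketch, assuming XFORMERS is one of the backends accepted by Qwen2-VL's vision attention in this vLLM build:

# Let vLLM's backend selector pick a supported attention backend:
vllm serve /root/autodl-tmp/models/Qwen2-VL-7B-Instruct-FP8-Dynamic --max-model-len 3840

# Or pin an explicit backend (assumption: XFORMERS is accepted by Qwen2-VL's vision attention here):
VLLM_ATTENTION_BACKEND=XFORMERS vllm serve /root/autodl-tmp/models/Qwen2-VL-7B-Instruct-FP8-Dynamic --max-model-len 3840

The full log of the failing FLASHINFER run follows.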
WARNING 01-07 11:20:01 cuda.py:22] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
INFO 01-07 11:20:04 api_server.py:528] vLLM API server version 0.6.3.post1
INFO 01-07 11:20:04 api_server.py:529] args: Namespace(subparser='serve', model_tag='/root/autodl-tmp/models/Qwen2-VL-7B-Instruct-FP8-Dynamic', config='', host=None, port=8000, uvicorn_log_level='info', allow_credentials=False, allowed_origins=[''], allowed_methods=[''], allowed_headers=['*'], api_key=None, lora_modules=None, prompt_adapters=None, chat_template=None, response_role='assistant', ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_cert_reqs=0, root_path=None, middleware=[], return_tokens_as_token_ids=False, disable_frontend_multiprocessing=False, enable_auto_tool_choice=False, tool_call_parser=None, tool_parser_plugin='', model='/root/autodl-tmp/models/Qwen2-VL-7B-Instruct-FP8-Dynamic', tokenizer=None, skip_tokenizer_init=False, revision=None, code_revision=None, tokenizer_revision=None, tokenizer_mode='auto', trust_remote_code=False, download_dir=None, load_format='auto', config_format=<ConfigFormat.AUTO: 'auto'>, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, max_model_len=3840, guided_decoding_backend='outlines', distributed_executor_backend=None, worker_use_ray=False, pipeline_parallel_size=1, tensor_parallel_size=1, max_parallel_loading_workers=None, ray_workers_use_nsight=False, block_size=16, enable_prefix_caching=False, disable_sliding_window=False, use_v2_block_manager=False, num_lookahead_slots=0, seed=0, swap_space=4, cpu_offload_gb=0, gpu_memory_utilization=0.9, num_gpu_blocks_override=None, max_num_batched_tokens=None, max_num_seqs=256, max_logprobs=20, disable_log_stats=False, quantization=None, rope_scaling=None, rope_theta=None, enforce_eager=False, max_context_len_to_capture=None, max_seq_len_to_capture=8192, disable_custom_all_reduce=False, tokenizer_pool_size=0, tokenizer_pool_type='ray', tokenizer_pool_extra_config=None, limit_mm_per_prompt=None, mm_processor_kwargs=None, enable_lora=False, max_loras=1, max_lora_rank=16, lora_extra_vocab_size=256, lora_dtype='auto', long_lora_scaling_factors=None, max_cpu_loras=None, fully_sharded_loras=False, enable_prompt_adapter=False, max_prompt_adapters=1, max_prompt_adapter_token=0, device='auto', num_scheduler_steps=1, multi_step_stream_outputs=True, scheduler_delay_factor=0.0, enable_chunked_prefill=None, speculative_model=None, speculative_model_quantization=None, num_speculative_tokens=None, speculative_disable_mqa_scorer=False, speculative_draft_tensor_parallel_size=None, speculative_max_model_len=None, speculative_disable_by_batch_size=None, ngram_prompt_lookup_max=None, ngram_prompt_lookup_min=None, spec_decoding_acceptance_method='rejection_sampler', typical_acceptance_sampler_posterior_threshold=None, typical_acceptance_sampler_posterior_alpha=None, disable_logprobs_during_spec_decoding=None, model_loader_extra_config=None, ignore_patterns=[], preemption_mode=None, served_model_name=None, qlora_adapter_name_or_path=None, otlp_traces_endpoint=None, collect_detailed_traces=None, disable_async_output_proc=False, override_neuron_config=None, scheduling_policy='fcfs', disable_log_requests=False, max_log_len=None, disable_fastapi_docs=False, dispatch_function=<function serve at 0x7f60a04ce8e0>)
INFO 01-07 11:20:04 api_server.py:166] Multiprocessing frontend to use ipc:///tmp/13974d98-e800-44f9-b4c4-0f45e0dae986 for IPC Path.
INFO 01-07 11:20:04 api_server.py:179] Started engine process with PID 3205
WARNING 01-07 11:20:05 cuda.py:22] You are using a deprecated pynvml package. Please install nvidia-ml-py instead, and make sure to uninstall pynvml. When both of them are installed, pynvml will take precedence and cause errors. See https://pypi.org/project/pynvml for more information.
WARNING 01-07 11:20:08 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
WARNING 01-07 11:20:12 arg_utils.py:1019] [DEPRECATED] Block manager v1 has been removed, and setting --use-v2-block-manager to True or False has no effect on vLLM behavior. Please remove --use-v2-block-manager in your engine argument. If your use case is not supported by SelfAttnBlockSpaceManager (i.e. block manager v2), please file an issue with detailed information.
INFO 01-07 11:20:12 llm_engine.py:237] Initializing an LLM engine (v0.6.3.post1) with config: model='/root/autodl-tmp/models/Qwen2-VL-7B-Instruct-FP8-Dynamic', speculative_config=None, tokenizer='/root/autodl-tmp/models/Qwen2-VL-7B-Instruct-FP8-Dynamic', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=3840, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=compressed-tensors, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=/root/autodl-tmp/models/Qwen2-VL-7B-Instruct-FP8-Dynamic, num_scheduler_steps=1, chunked_prefill_enabled=False multi_step_stream_outputs=True, enable_prefix_caching=False, use_async_output_proc=True, use_cached_outputs=True, mm_processor_kwargs=None)
INFO 01-07 11:20:12 selector.py:141] Using Flashinfer backend.
INFO 01-07 11:20:13 model_runner.py:1056] Starting to load model /root/autodl-tmp/models/Qwen2-VL-7B-Instruct-FP8-Dynamic...
Process SpawnProcess-1:
Traceback (most recent call last):
File "/root/miniconda3/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/root/miniconda3/lib/python3.12/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/root/miniconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 390, in run_mp_engine
engine = MQLLMEngine.from_engine_args(engine_args=engine_args,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 139, in from_engine_args
return cls(
^^^^
File "/root/miniconda3/lib/python3.12/site-packages/vllm/engine/multiprocessing/engine.py", line 78, in init
self.engine = LLMEngine(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/vllm/engine/llm_engine.py", line 334, in init
self.model_executor = executor_class(
^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/vllm/executor/executor_base.py", line 47, in init
self._init_executor()
File "/root/miniconda3/lib/python3.12/site-packages/vllm/executor/gpu_executor.py", line 40, in _init_executor
self.driver_worker.load_model()
File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/worker.py", line 183, in load_model
self.model_runner.load_model()
File "/root/miniconda3/lib/python3.12/site-packages/vllm/worker/model_runner.py", line 1058, in load_model
self.model = get_model(model_config=self.model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/model_loader/init.py", line 19, in get_model
return loader.load_model(model_config=model_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 398, in load_model
model = _initialize_model(model_config, self.load_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 175, in _initialize_model
return build_model(
^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/model_loader/loader.py", line 160, in build_model
return model_class(config=hf_config,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_vl.py", line 878, in init
self.visual = Qwen2VisionTransformer(
^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_vl.py", line 469, in init
Qwen2VisionBlock(
File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_vl.py", line 320, in init
self.attn = Qwen2VisionAttention(embed_dim=dim,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/vllm/model_executor/models/qwen2_vl.py", line 219, in init
raise RuntimeError(
RuntimeError: Qwen2-VL does not support _Backend.FLASHINFER backend now.
Traceback (most recent call last):
File "/root/miniconda3/bin/vllm", line 8, in
sys.exit(main())
^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/vllm/scripts.py", line 195, in main
args.dispatch_function(args)
File "/root/miniconda3/lib/python3.12/site-packages/vllm/scripts.py", line 41, in serve
uvloop.run(run_server(args))
File "/root/miniconda3/lib/python3.12/site-packages/uvloop/init.py", line 109, in run
return __asyncio.run(
^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/asyncio/runners.py", line 194, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "uvloop/loop.pyx", line 1518, in uvloop.loop.Loop.run_until_complete
File "/root/miniconda3/lib/python3.12/site-packages/uvloop/init.py", line 61, in wrapper
return await main
^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 552, in run_server
async with build_async_engine_client(args) as engine_client:
File "/root/miniconda3/lib/python3.12/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 107, in build_async_engine_client
async with build_async_engine_client_from_engine_args(
File "/root/miniconda3/lib/python3.12/contextlib.py", line 210, in aenter
return await anext(self.gen)
^^^^^^^^^^^^^^^^^^^^^
File "/root/miniconda3/lib/python3.12/site-packages/vllm/entrypoints/openai/api_server.py", line 194, in build_async_engine_client_from_engine_args
raise RuntimeError(
RuntimeError: Engine process failed to start
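Separately, the deprecated-pynvml warning near the top of the log is unrelated to this crash; assuming a standard pip-managed environment, it can probably be cleared by swapping the packages exactly as the warning suggests:

# Replace the deprecated pynvml package with nvidia-ml-py (assumption: pip manages this environment):
pip uninstall -y pynvml
pip install nvidia-ml-py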