The traceback of test_bd_serving #32

@AIxyz

I am using dInfer at commit a8b4a06. Running test_bd_serving.py fails with the traceback below. @zheng-da

# dllm-dinfer:v-260106 (de602d2bc8f9) (26.4GB)
docker run -it --gpus='"device=4"' --entrypoint=/bin/bash -v /bigdata/shared/models/huggingface/LLaDA2.0-mini--572899f-C8:/model de602d2bc8f9

sed -i 's#/mnt/infra/dulun.dl/models/dllm-mini/block-diffusion-sft-2k-v2-full-bd/LLaDA2-mini-preview-ep4-v0#/model#g' /code/dInfer/tests/test_bd_serving.py # replace the model_path with the mounted /model
sed -i 's#import pytest##g' /code/dInfer/tests/test_bd_serving.py # pytest is not used
sed -i 's#  model = init_sglang_dist()#  #g' /code/dInfer/tests/test_bd_serving.py # the global model is already initialized at line 97, so the call at line 196 should be removed

date && python3 /code/dInfer/tests/test_bd_serving.py && date
INFO 01-08 01:23:35 [__init__.py:216] Automatically detected platform cuda.
WARNING:sglang.srt.layers.moe.utils:MOE_RUNNER_BACKEND is not initialized, the backend will be automatically selected
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 122, in spawn_main
    exitcode = _main(fd, parent_sentinel)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 131, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 246, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.12/multiprocessing/spawn.py", line 297, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen runpy>", line 287, in run_path
  File "<frozen runpy>", line 98, in _run_module_code
  File "<frozen runpy>", line 88, in _run_code
  File "/code/dInfer/tests/test_bd_serving.py", line 97, in <module>
    model = init_sglang_dist()
            ^^^^^^^^^^^^^^^^^^
  File "/code/dInfer/tests/test_bd_serving.py", line 69, in init_sglang_dist
    distributed.init_distributed_environment(1, 0, 'env://', 0, 'nccl')
  File "/usr/local/lib/python3.12/dist-packages/sglang/srt/distributed/parallel_state.py", line 1408, in init_distributed_environment
    torch.distributed.init_process_group(
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/c10d_logger.py", line 95, in wrapper
    func_return = func(*args, **kwargs)
                  ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py", line 1757, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
                              ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/rendezvous.py", line 278, in _env_rendezvous_handler
    store = _create_c10d_store(
            ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/distributed/rendezvous.py", line 198, in _create_c10d_store
    return TCPStore(
           ^^^^^^^^^
torch.distributed.DistNetworkError: The server socket has failed to listen on any local network address. port: 40399, useIpv6: false, code: -98, name: EADDRINUSE, message: address already in use
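
The frames above pass through multiprocessing/spawn.py, which suggests that a spawned child process re-ran the module-level `model = init_sglang_dist()` at line 97 and tried to bind the rendezvous port that the parent process already holds. Below is a minimal sketch of one possible workaround, assuming that reading is correct: keep initialization behind a `__main__` guard and let the OS pick a free port. `find_free_port` and `setup_rendezvous_env` are hypothetical helpers, not part of dInfer or sglang, and how test_bd_serving.py actually chooses its port is not shown in this report.

```python
import os
import socket


def find_free_port() -> int:
    """Ask the OS for a TCP port that is currently unused on localhost."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))  # port 0 lets the OS choose a free port
        return s.getsockname()[1]


def setup_rendezvous_env() -> int:
    """Set the address and port that the env:// rendezvous reads.

    Returns the chosen port. Hypothetical helper for illustration only.
    """
    port = find_free_port()
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = str(port)
    return port


if __name__ == "__main__":
    # Only the parent process runs this block. A child started with the
    # "spawn" method re-imports this file under a different module name,
    # so code placed here is not executed again in the child.
    port = setup_rendezvous_env()
    print(f"rendezvous port: {port}")
    # model = init_sglang_dist()  # assumed placement of the test's own call
```

The port is released before the process group binds it, so another process could still take it in between. The guard is the part that would prevent a second initialization in the child.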
