hardware tag matching fails with UCX backend #116

@angainor

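Running ghexbench with UCX_DC_MLX5_TM_ENABLE=y (hardware tag matching) fails with the UCX backend: UCX reports an error from ibv_exp_create_srq, and the process segfaults inside ucp_worker_get_address(), called from the gridtools::ghex::tl::ucx::worker_t constructor (log and backtrace below). The same setting works with the MPI backend when using OpenMPI on InfiniBand networks. Could the problem be in how we create the UCX context or the worker?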

[1615309041.680074] [b2237:256544:0] rc_mlx5_common.c:827  UCX  ERROR ibv_exp_create_srq(device=mlx5_0) failed: Cannot allocate memory

==== backtrace (tid: 110170) ====
 0 0x0000000000052e95 ucs_debug_print_backtrace()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucs/debug/debug.c:656
 1 0x000000000003e54c ucp_address_pack()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucp/wireup/address.c:832
 2 0x000000000003e54c ucp_address_pack()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucp/wireup/address.c:844
 3 0x00000000000246bd ucp_worker_get_address()  /build-result/src/hpcx-v2.7.0-gcc-MLNX_OFED_LINUX-4.7-1.0.0.1-redhat7.7-x86_64/ucx-v1.9.x/src/ucp/core/ucp_worker.c:2241
 4 0x00000000004327a8 gridtools::ghex::tl::ucx::worker_t::worker_t()  ???:0
 5 0x000000000042b646 cartex::runtime::impl::init()  ???:0
 6 0x000000000041da99 cartex::runtime::exchange()  ???:0
 7 0x000000000040afe5 main()  ???:0
 8 0x0000000000022545 __libc_start_main()  ???:0
 9 0x000000000040ca8d _start()  ???:0
=================================
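
When comparing logs from runs with the UCX and MPI backends, it can help to pull the source location, level, and message out of UCX log lines automatically. The following is only a minimal sketch, assuming every UCX log line follows the layout of the single error line above. The function name parse_ucx_log_line and the field names are hypothetical, and the meaning of the third bracketed number is an assumption.

```python
import re

# Layout assumed from the one log line in this report:
# [timestamp] [host:pid:N] file.c:line  UCX  LEVEL message
# The meaning of N (the third bracketed number) is an assumption.
UCX_LOG_RE = re.compile(
    r"^\[(?P<timestamp>\d+\.\d+)\]\s+"
    r"\[(?P<host>[^:\]]+):(?P<pid>\d+):(?P<index>\d+)\]\s+"
    r"(?P<source_file>[^:\s]+):(?P<source_line>\d+)\s+"
    r"UCX\s+(?P<level>[A-Z]+)\s+"
    r"(?P<message>.*)$"
)


def parse_ucx_log_line(line):
    """Return a dict of fields for a UCX log line, or None if it does not match."""
    match = UCX_LOG_RE.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    # Convert the numeric fields so they can be compared and sorted.
    record["timestamp"] = float(record["timestamp"])
    for key in ("pid", "index", "source_line"):
        record[key] = int(record[key])
    return record


# The error line from this report, copied verbatim.
SAMPLE = (
    "[1615309041.680074] [b2237:256544:0] rc_mlx5_common.c:827  UCX  ERROR "
    "ibv_exp_create_srq(device=mlx5_0) failed: Cannot allocate memory"
)

if __name__ == "__main__":
    record = parse_ucx_log_line(SAMPLE)
    print(record["level"], record["source_file"], record["source_line"])
    print(record["message"])
```

Lines that do not match the assumed layout, such as the backtrace frames, return None, so the helper can be run over a complete log to keep only the UCX messages.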
