@ovidiusm ovidiusm commented Oct 2, 2025

What?

The socket metadata exchange helper functions throw exceptions, and some of them are not caught, which crashes the application when a peer closes the connection.

This differs from ETCD metadata exchange, which handles errors gracefully: they are logged without throwing exceptions to the application.

This PR adds handling code for the thrown exceptions.

Why?

A crash was seen in CI in a flaky test:

[2025-10-02T08:56:24.615Z] ++ get_next_tcp_port
[2025-10-02T08:56:24.615Z] ++ local port_file=/tmp/nixl_tcp_port_0
[2025-10-02T08:56:24.615Z] ++ '[' '!' -f /tmp/nixl_tcp_port_0 ']'
[2025-10-02T08:56:24.615Z] ++ local current_port
[2025-10-02T08:56:24.615Z] +++ cat /tmp/nixl_tcp_port_0
[2025-10-02T08:56:24.615Z] ++ current_port=10505
[2025-10-02T08:56:24.615Z] ++ local next_port=10506
[2025-10-02T08:56:24.615Z] ++ ss -tuln
[2025-10-02T08:56:24.615Z] ++ grep -q :10506
[2025-10-02T08:56:24.615Z] ++ '[' 10506 -ge 11000 ']'
[2025-10-02T08:56:24.615Z] ++ echo 10506
[2025-10-02T08:56:24.615Z] ++ echo 10506
[2025-10-02T08:56:24.615Z] + blocking_send_recv_port=10506
[2025-10-02T08:56:24.615Z] + mkdir -p /tmp/telemetry_test
[2025-10-02T08:56:24.615Z] + sleep 5
[2025-10-02T08:56:24.615Z] + python3 blocking_send_recv_example.py --mode=target --ip=127.0.0.1 --port=10506
[2025-10-02T08:56:27.129Z] 2025-10-02 08:56:26 NIXL INFO    _api.py:361 Backend UCX was instantiated
[2025-10-02T08:56:27.129Z] 2025-10-02 08:56:26 NIXL INFO    _api.py:251 Initialized NIXL agent: target
[2025-10-02T08:56:27.129Z] 2025-10-02 08:56:26 NIXL INFO    blocking_send_recv_example.py:64 Running test with [tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]), tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])] tensors in mode target
[2025-10-02T08:56:29.646Z] + NIXL_TELEMETRY_ENABLE=y
[2025-10-02T08:56:29.646Z] + NIXL_TELEMETRY_DIR=/tmp/telemetry_test
[2025-10-02T08:56:29.646Z] + python3 blocking_send_recv_example.py --mode=initiator --ip=127.0.0.1 --port=10506
[2025-10-02T08:56:32.162Z] 2025-10-02 08:56:31 NIXL INFO    _api.py:361 Backend UCX was instantiated
[2025-10-02T08:56:32.162Z] 2025-10-02 08:56:31 NIXL INFO    _api.py:251 Initialized NIXL agent: initiator
[2025-10-02T08:56:32.162Z] 2025-10-02 08:56:31 NIXL INFO    blocking_send_recv_example.py:64 Running test with [tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]), tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])] tensors in mode initiator
[2025-10-02T08:56:32.162Z] 2025-10-02 08:56:31 NIXL INFO    blocking_send_recv_example.py:94 Initiator sending to 127.0.0.1
[2025-10-02T08:56:32.162Z] 2025-10-02 08:56:32 NIXL INFO    blocking_send_recv_example.py:84 Waiting for transfer
[2025-10-02T08:56:32.162Z] 2025-10-02 08:56:32 NIXL INFO    blocking_send_recv_example.py:111 Ready for transfer
[2025-10-02T08:56:32.162Z] 2025-10-02 08:56:32 NIXL INFO    blocking_send_recv_example.py:147 Test Complete.
[2025-10-02T08:56:32.162Z] 2025-10-02 08:56:32 NIXL INFO    blocking_send_recv_example.py:138 initiator Data verification passed
[2025-10-02T08:56:32.162Z] 2025-10-02 08:56:32 NIXL INFO    blocking_send_recv_example.py:147 Test Complete.
[2025-10-02T08:56:32.162Z] terminate called after throwing an instance of 'std::runtime_error'
[2025-10-02T08:56:32.162Z]   what():  sendCommMessage(fd=28) 8/30 bytes failed, errno=32
[2025-10-02T08:56:33.529Z] .gitlab/test_python.sh: line 78: 133637 Aborted                 (core dumped) NIXL_TELEMETRY_ENABLE=y NIXL_TELEMETRY_DIR=/tmp/telemetry_test python3 blocking_send_recv_example.py --mode="initiator" --ip=127.0.0.1 --port="$blocking_send_recv_port"
Test Python failed with msg: Step Test Python failed with exit code=134
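
For context on the log above: on Linux, errno=32 is EPIPE, which is what a send on a socket reports once the peer has closed its end. The sketch below reproduces that condition in isolation. It is not NIXL code; the function name `errnoAfterPeerClose` is hypothetical, and it assumes a Linux/POSIX system where `MSG_NOSIGNAL` is available.

```cpp
#include <cerrno>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper (not part of NIXL): returns the errno seen when sending
// on a socket whose peer has already closed, 0 if the send unexpectedly
// succeeded, or -1 if the socket pair could not be created.
int errnoAfterPeerClose() {
    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return -1;
    }
    close(fds[1]); // simulate the peer closing the connection

    const char msg[] = "hello";
    // MSG_NOSIGNAL suppresses SIGPIPE so the failure surfaces through errno.
    ssize_t n = send(fds[0], msg, sizeof(msg), MSG_NOSIGNAL);
    int err = (n < 0) ? errno : 0;

    close(fds[0]);
    return err;
}
```

Calling `errnoAfterPeerClose()` on Linux should yield EPIPE. A helper that turns this error into an exception will terminate the process if no caller catches it, which matches the abort in the log.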

Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>

ovidiusm commented Oct 2, 2025

/build

github-actions bot commented Oct 2, 2025

👋 Hi ovidiusm! Thank you for contributing to ai-dynamo/nixl.

Your PR reviewers will review your contribution then trigger the CI to test your changes.

🚀

brminich previously approved these changes Oct 2, 2025
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
@ovidiusm ovidiusm changed the title from "Fix error handling for metadata exchange over sockets" to "Core: Fix error handling for metadata exchange over sockets" on Oct 13, 2025
@ovidiusm

/build

Signed-off-by: Ovidiu Mara <ovidium@nvidia.com>
@ovidiusm

/build

@tvegas1 tvegas1 left a comment

Maybe move the repeated try { sendCommMessage() } catch () {} into a separate function and use it like this:

if (sendCommMessageChecked(client->second, "LOAD" + myID)) {
    NIXL_ERR << "Error..";
    break;
}
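
A minimal sketch of what such a wrapper could look like, under stated assumptions: the `sendCommMessage` defined here is a hypothetical stand-in that throws on failure (the real helper's signature and behavior may differ), `std::cerr` stands in for `NIXL_ERR`, and the return convention (0 on success, nonzero on failure) is chosen only so that it matches the `if (...) { error; break; }` usage suggested above.

```cpp
#include <cerrno>
#include <iostream>
#include <stdexcept>
#include <string>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical stand-in for the throwing helper; the real one may differ.
void sendCommMessage(int fd, const std::string &msg) {
    // MSG_NOSIGNAL (Linux) avoids SIGPIPE when the peer has closed.
    ssize_t n = send(fd, msg.data(), msg.size(), MSG_NOSIGNAL);
    if (n != static_cast<ssize_t>(msg.size())) {
        // errno is only meaningful here when send() returned -1.
        throw std::runtime_error("sendCommMessage(fd=" + std::to_string(fd) +
                                 ") failed, errno=" + std::to_string(errno));
    }
}

// Assumed convention: returns 0 on success and nonzero on failure, so the
// caller can write `if (sendCommMessageChecked(...)) { /* handle error */ }`.
int sendCommMessageChecked(int fd, const std::string &msg) {
    try {
        sendCommMessage(fd, msg);
        return 0;
    } catch (const std::exception &e) {
        // Log and report the failure instead of letting it propagate.
        std::cerr << "send failed: " << e.what() << '\n';
        return -1;
    }
}
```

With a wrapper like this, each call site in the listener loop reduces to one status check, and a peer closing its socket produces a logged error rather than an uncaught exception.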

@ovidiusm

/build

@ovidiusm

/build

@ovidiusm ovidiusm changed the title from "Core: Fix error handling for metadata exchange over sockets" to "Listener: Fix error handling for metadata exchange over sockets" on Oct 14, 2025
@ovidiusm

/build

@ovidiusm

/build
