TCP teardown errors with high stream counts (128+) #32

@lance0

Description

With high parallel stream counts (-P 128), teardown errors still occur despite the fixes in #25. The original fix works for normal stream counts (confirmed by @matttbe with 50 iterations of 1-second runs), but 128 streams trigger two distinct problems.

Reported by @matttbe in #25 (comment)

Reproduction

xfr <host> -P 128 --no-tui

Tested in a network namespace with veth pairs at ~19 Mbps.

Problem 1: Post-test RST cascade

When the server's duration timer fires, it cancels all receive handlers, which drop their TcpStream. Dropping a socket with unread data in the kernel buffer sends RST (not FIN) to the client. The client's send tasks are still running — the client doesn't learn the test is over until the Result message arrives on the control channel, which happens after the server has already torn down all data streams.
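The RST-on-drop behaviour described above can be reproduced outside xfr with a loopback socket pair. The following is a minimal sketch using only the Rust standard library; the function name is hypothetical and the sleeps are arbitrary values chosen to let the kernel deliver the data and the close. It is not xfr's code.

```rust
use std::io::{self, Read, Write};
use std::net::{TcpListener, TcpStream};
use std::thread;
use std::time::Duration;

/// Hypothetical helper: sends a few bytes to a loopback peer, lets the peer
/// drop its socket without reading them, and returns what the sender then
/// observes on its next read.
fn read_after_peer_drops_unread() -> io::Result<usize> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    let mut client = TcpStream::connect(addr)?;
    let (server_side, _) = listener.accept()?;

    client.write_all(b"payload the receiver never reads")?;
    // Arbitrary pause so the bytes reach the receiver's kernel buffer.
    thread::sleep(Duration::from_millis(200));

    // Dropping the stream closes the socket while data is still unread.
    drop(server_side);
    // Arbitrary pause so the close is visible to the client.
    thread::sleep(Duration::from_millis(200));

    // A graceful close (FIN) would give Ok(0); a reset gives an error.
    let mut buf = [0u8; 16];
    client.read(&mut buf)
}
```

Calling `read_after_peer_drops_unread()` on Linux is expected to return an error (typically of kind `ConnectionReset`) instead of the `Ok(0)` that a FIN would produce. A client that is still writing at that moment can see `Broken pipe` on a later send instead.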

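The client's error suppression checks three conditions, and none of them holds at this point: cancel is not yet set, deadline_reached is not yet true (because of the timing asymmetry described below), and the 250ms near_deadline grace window is too short to cover the gap. The errors are therefore treated as fatal and logged as ERROR.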
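To make the failure mode concrete, here is a sketch of a suppression check with the three conditions named above. The function, its signature, and the way `near_deadline` is computed are assumptions for illustration; only the three condition names and the 250ms grace value come from this issue.

```rust
use std::time::Duration;

/// Grace window before the local deadline (the 250ms figure from this issue).
const NEAR_DEADLINE_GRACE: Duration = Duration::from_millis(250);

/// Hypothetical stand-in for the client's suppression logic.
/// `elapsed` is the time since this stream connected; `duration` is the
/// requested test length.
fn is_suppressed(cancelled: bool, elapsed: Duration, duration: Duration) -> bool {
    let deadline_reached = elapsed >= duration;
    let near_deadline = elapsed + NEAR_DEADLINE_GRACE >= duration;
    cancelled || deadline_reached || near_deadline
}
```

Under these assumptions, a stream that connected 400ms after the server's timer started has only 9.6s elapsed when the server tears down a 10s test, so `is_suppressed(false, Duration::from_millis(9600), Duration::from_secs(10))` is false and the RST is reported as an error. A stream that connected 100ms late falls inside the grace window and stays quiet.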

Timing asymmetry: The server starts its interval timer before any data streams connect. Each client send_data starts its own deadline from when that stream connects. With 128 sequential TCP connects, late-starting streams have deadlines that lag the server's by hundreds of milliseconds.
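A rough model shows how the lag accumulates. The sketch below assumes connects happen strictly one after another at a fixed cost each; the 3ms figure used in the note after it is an illustrative assumption, not a measurement from this trace, and the function names are hypothetical.

```rust
use std::time::Duration;

/// Lag of stream `index`'s local deadline behind the first stream's,
/// assuming sequential connects with a fixed per-connect cost.
fn deadline_lag(index: u32, per_connect: Duration) -> Duration {
    per_connect * index
}

/// Number of streams (indices 0..streams) whose lag exceeds `grace`.
fn streams_outside_grace(streams: u32, per_connect: Duration, grace: Duration) -> usize {
    (0..streams)
        .filter(|&i| deadline_lag(i, per_connect) > grace)
        .count()
}
```

With an assumed 3ms per connect, the last of 128 streams lags the first by 381ms, and 44 streams lag by more than the 250ms grace window. Because the server's timer starts before even the first stream connects, the real gap between server and client deadlines would be at least this large.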

Join timeout: The 2s hardcoded timeout for join_all on stream handles isn't enough for 128 tasks.

Problem 2: Mid-test broken pipe

In one trace, stream 16 gets Broken pipe at ~7s into a 10s test while the test continues running (intervals 8 and 9 still arrive). This is well before either side's deadline. Cause is unclear — possibly kernel-level resource pressure with 128 TCP connections competing for ~19 Mbps (~150 kbps per stream).

Planned fixes

  • Scale the client join timeout with stream count (max(2s, streams * 50ms)) — v0.9.1
  • Client stops local data streams at local duration expiry (narrows server/client teardown race) — v0.9.1
  • Receive-side cancel drain to reduce RST-on-close bursts — v0.9.1
  • Server gracefully shuts down data sockets (FIN instead of RST): not planned. RST is correct for timed tests; FIN would let bufferbloated send buffers drain past the requested duration
  • Investigate mid-test stream failures under high contention
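The scaled join timeout from the first item, max(2s, streams * 50ms), can be sketched as follows. The function name is hypothetical and this is not necessarily how xfr implements it.

```rust
use std::time::Duration;

/// Join timeout that grows with the number of streams:
/// max(2s, streams * 50ms).
fn join_timeout(streams: u32) -> Duration {
    let floor = Duration::from_secs(2);
    let scaled = Duration::from_millis(50) * streams;
    floor.max(scaled)
}
```

With this formula, up to 40 streams keep the existing 2s timeout, and `join_timeout(128)` yields 6.4s.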
