apollo_network: add reconnection backoff for short-lived connections#13470
apollo_network: add reconnection backoff for short-lived connections#13470sirandreww-starkware wants to merge 1 commit intomainfrom
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, have a team admin enable autofix in the Cursor dashboard.
| } | ||
| } | ||
| _ => {} | ||
| } |
There was a problem hiding this comment.
Short-lived backoff breaks existing reconnection test
High Severity
The new short-lived connection detection in on_swarm_event causes a regression in the existing discovery_redials_when_all_connections_closed test. That test calls connect_to_bootstrap_node (which fires ConnectionEstablished with a Dialer endpoint, so a timestamp is recorded), then immediately fires ConnectionClosed. Since the elapsed time is ~0ms (well under MIN_CONNECTION_DURATION_FOR_BACKOFF_RESET of 1s), is_short_lived evaluates to true, and a reconnect backoff of 1000ms is applied (from the test config base_delay_millis: 1000, factor: 1). The test then expects the re-dial within TIMEOUT of 1s — making it race against the backoff delay. This is at best extremely flaky and likely a CI failure.
| } else { | ||
| self.reconnect_backoffs.remove(&peer_id); | ||
| self.pending_reconnect_delays.remove(&peer_id); | ||
| } |
There was a problem hiding this comment.
Inbound connection close incorrectly clears accumulated backoff state
Medium Severity
When the last connection to a peer closes and it was an inbound connection (not tracked in connection_timestamps), timestamp is None, making is_short_lived evaluate to false. The else branch then clears reconnect_backoffs and pending_reconnect_delays for that peer. This incorrectly resets any accumulated outbound split-brain backoff state, defeating the exponential backoff protection the PR is designed to provide.



Note
Medium Risk
Changes libp2p dialing/connection lifecycle handling and introduces new backoff state, which can impact connectivity and reconnection timing if misclassified or not cleaned up.
Overview
Adds a per-peer reconnection exponential backoff in
DialingBehaviourto avoid rapid connect-then-reject (split-brain) cycles. The behaviour now timestamps outboundConnectionEstablishedevents and, when the last connection to a peer closes withinMIN_CONNECTION_DURATION_FOR_BACKOFF_RESET, delays the nextrequest_dialusing accumulated backoff; longer-lived disconnects reset this state.Tightens
DialPeerStreamdial-failure handling by tracking theConnectionIdof the emitted dial and only applying its internal retry backoff to matchingDialFailureevents, plus addsset_next_dial_timeso the behaviour can inject the reconnection delay. Tests were updated to use the dial’s actualconnection_idwhen simulatingDialFailure.Written by Cursor Bugbot for commit b82209b. This will update automatically on new commits. Configure here.