apollo_network: add reconnection backoff for short-lived connections #13470

Open
sirandreww-starkware wants to merge 1 commit into main from 03-25-apollo_network_fixed_immediate_re-dial_on_dial_failure_issue

sirandreww-starkware (Contributor) commented Mar 25, 2026

Note

Medium Risk
Changes libp2p dialing/connection lifecycle handling and introduces new backoff state, which can impact connectivity and reconnection timing if connections are misclassified or the state is not cleaned up.

Overview
Adds a per-peer reconnection exponential backoff in DialingBehaviour to avoid rapid connect-then-reject (split-brain) cycles. The behaviour now timestamps outbound ConnectionEstablished events and, when the last connection to a peer closes within MIN_CONNECTION_DURATION_FOR_BACKOFF_RESET, delays the next request_dial using accumulated backoff; longer-lived disconnects reset this state.
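
The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ReconnectTracker`, its fields, and the `u64` stand-in for libp2p's `PeerId` are hypothetical, the multiplicative growth rule is an assumption, and the 1s threshold is the value quoted in the review of this PR.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical stand-in for libp2p's PeerId so the sketch builds with std only.
type PeerId = u64;

// Assumed value; the review of this PR quotes a 1s threshold.
const MIN_CONNECTION_DURATION_FOR_BACKOFF_RESET: Duration = Duration::from_secs(1);

/// Hypothetical per-peer reconnection state (not the PR's actual struct).
struct ReconnectTracker {
    base_delay: Duration,
    factor: u32,
    max_delay: Duration,
    /// When the tracked outbound connection to each peer was established.
    established_at: HashMap<PeerId, Instant>,
    /// Delay to apply on the next short-lived disconnect of each peer.
    next_delay: HashMap<PeerId, Duration>,
}

impl ReconnectTracker {
    fn new(base_delay: Duration, factor: u32, max_delay: Duration) -> Self {
        Self {
            base_delay,
            factor,
            max_delay,
            established_at: HashMap::new(),
            next_delay: HashMap::new(),
        }
    }

    /// Record the time of an outbound ConnectionEstablished event.
    fn on_outbound_established(&mut self, peer: PeerId, now: Instant) {
        self.established_at.insert(peer, now);
    }

    /// Called when the last connection to `peer` closes.
    /// Returns the delay to apply before the next dial, if any.
    fn on_last_connection_closed(&mut self, peer: PeerId, now: Instant) -> Option<Duration> {
        let short_lived = self
            .established_at
            .remove(&peer)
            .map(|t| now.duration_since(t) < MIN_CONNECTION_DURATION_FOR_BACKOFF_RESET)
            .unwrap_or(false);
        if short_lived {
            // Use the accumulated delay, then grow it for the next occurrence.
            let delay = *self.next_delay.get(&peer).unwrap_or(&self.base_delay);
            let grown = delay.saturating_mul(self.factor).min(self.max_delay);
            self.next_delay.insert(peer, grown);
            Some(delay)
        } else {
            // Longer-lived (or untracked) disconnects reset the state.
            self.next_delay.remove(&peer);
            None
        }
    }
}
```

In this sketch an untracked connection (no recorded timestamp) is treated the same as a long-lived one, which mirrors the behaviour described in the review rather than a recommended design.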

Tightens DialPeerStream dial-failure handling by tracking the ConnectionId of the emitted dial and only applying its internal retry backoff to matching DialFailure events, plus adds set_next_dial_time so the behaviour can inject the reconnection delay. Tests were updated to use the dial’s actual connection_id when simulating DialFailure.
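
A rough model of the `ConnectionId` matching described above, under stated assumptions: `DialStream` is a hypothetical simplification of `DialPeerStream`, `ConnectionId` is replaced by a `usize` alias, the doubling retry rule is invented for illustration, and `set_next_dial_time` keeping the later of the two times is an assumption about its semantics.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for libp2p's ConnectionId.
type ConnectionId = usize;

/// Hypothetical, simplified model of a per-peer dial stream.
struct DialStream {
    /// ConnectionId of the dial this stream emitted, if one is in flight.
    pending_dial: Option<ConnectionId>,
    retry_delay: Duration,
    max_delay: Duration,
    next_dial_time: Instant,
}

impl DialStream {
    fn new(retry_delay: Duration, max_delay: Duration, now: Instant) -> Self {
        Self { pending_dial: None, retry_delay, max_delay, next_dial_time: now }
    }

    /// Remember which dial this stream is responsible for.
    fn on_dial_emitted(&mut self, id: ConnectionId) {
        self.pending_dial = Some(id);
    }

    /// Apply retry backoff only if the failure belongs to this stream's dial.
    /// Returns true when the failure matched and backoff was applied.
    fn on_dial_failure(&mut self, id: ConnectionId, now: Instant) -> bool {
        if self.pending_dial != Some(id) {
            return false; // Failure of some other dial: ignore it.
        }
        self.pending_dial = None;
        self.next_dial_time = now + self.retry_delay;
        self.retry_delay = self.retry_delay.saturating_mul(2).min(self.max_delay);
        true
    }

    /// Let the owning behaviour inject a reconnection delay.
    /// Assumption: the later of the two times wins.
    fn set_next_dial_time(&mut self, when: Instant) {
        self.next_dial_time = self.next_dial_time.max(when);
    }

    fn can_dial(&self, now: Instant) -> bool {
        self.pending_dial.is_none() && now >= self.next_dial_time
    }
}
```

Matching on the id keeps an unrelated `DialFailure` for the same peer from inflating this stream's retry delay.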

Written by Cursor Bugbot for commit b82209b.

sirandreww-starkware self-assigned this Mar 25, 2026
sirandreww-starkware marked this pull request as ready for review March 25, 2026 15:19
cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 2 potential issues.

}
}
_ => {}
}

Short-lived backoff breaks existing reconnection test

High Severity

The new short-lived connection detection in on_swarm_event causes a regression in the existing discovery_redials_when_all_connections_closed test. That test calls connect_to_bootstrap_node (which fires ConnectionEstablished with a Dialer endpoint, so a timestamp is recorded), then immediately fires ConnectionClosed. Since the elapsed time is ~0ms (well under MIN_CONNECTION_DURATION_FOR_BACKOFF_RESET of 1s), is_short_lived evaluates to true, and a reconnect backoff of 1000ms is applied (from the test config base_delay_millis: 1000, factor: 1). The test then expects the re-dial within TIMEOUT of 1s — making it race against the backoff delay. This is at best extremely flaky and likely a CI failure.
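
The race comes down to simple arithmetic, sketched below. `redial_slack` is a hypothetical helper written for this comment, not something in the codebase; it computes how much of the test's timeout is left once the reconnect backoff has elapsed.

```rust
use std::time::Duration;

/// Hypothetical helper: time left for the re-dial to be observed after the
/// reconnect backoff elapses, or None if the backoff exceeds the timeout.
fn redial_slack(test_timeout: Duration, reconnect_backoff: Duration) -> Option<Duration> {
    test_timeout.checked_sub(reconnect_backoff)
}
```

With the values quoted above (a `TIMEOUT` of 1s and a backoff of 1000ms) the slack is zero, so any scheduling jitter decides whether the test passes.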

} else {
    self.reconnect_backoffs.remove(&peer_id);
    self.pending_reconnect_delays.remove(&peer_id);
}

Inbound connection close incorrectly clears accumulated backoff state

Medium Severity

When the last connection to a peer closes and it was an inbound connection (not tracked in connection_timestamps), timestamp is None, making is_short_lived evaluate to false. The else branch then clears reconnect_backoffs and pending_reconnect_delays for that peer. This incorrectly resets any accumulated outbound split-brain backoff state, defeating the exponential backoff protection the PR is designed to provide.
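
One possible way to avoid this, sketched under assumptions, is to classify the close three ways instead of two, so that a missing timestamp is not treated as a long-lived connection. `CloseKind`, `classify_close`, and `should_reset_backoff` are hypothetical names for illustration, not part of this PR; the 1s threshold is the value quoted earlier in this review.

```rust
use std::time::Duration;

// Threshold value as quoted in the review of this PR (1s).
const MIN_CONNECTION_DURATION_FOR_BACKOFF_RESET: Duration = Duration::from_secs(1);

/// Hypothetical three-way classification of a closed connection.
#[derive(Debug, PartialEq, Eq)]
enum CloseKind {
    /// Tracked outbound connection that closed before the threshold.
    ShortLivedOutbound,
    /// Tracked outbound connection that lasted at least the threshold.
    LongLivedOutbound,
    /// No timestamp recorded, e.g. an inbound connection.
    Untracked,
}

/// `outbound_lifetime` is None when no establishment timestamp was recorded.
fn classify_close(outbound_lifetime: Option<Duration>) -> CloseKind {
    match outbound_lifetime {
        Some(d) if d < MIN_CONNECTION_DURATION_FOR_BACKOFF_RESET => CloseKind::ShortLivedOutbound,
        Some(_) => CloseKind::LongLivedOutbound,
        None => CloseKind::Untracked,
    }
}

/// Only a long-lived outbound connection is evidence that the peer is healthy,
/// so only that case clears the accumulated backoff.
fn should_reset_backoff(kind: &CloseKind) -> bool {
    matches!(kind, CloseKind::LongLivedOutbound)
}
```

Under this scheme an inbound close leaves `reconnect_backoffs` and `pending_reconnect_delays` untouched.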
