apollo_network: add reconnection backoff for short-lived connections by sirandreww-starkware · Pull Request #13470 · starkware-libs/sequencer

sirandreww-starkware · 2026-03-25T15:18:58Z

Note

Medium Risk
Changes libp2p dialing/connection lifecycle handling and introduces new backoff state, which can impact connectivity and reconnection timing if misclassified or not cleaned up.

Overview
Adds a per-peer reconnection exponential backoff in DialingBehaviour to avoid rapid connect-then-reject (split-brain) cycles. The behaviour now timestamps outbound ConnectionEstablished events and, when the last connection to a peer closes within MIN_CONNECTION_DURATION_FOR_BACKOFF_RESET, delays the next request_dial using accumulated backoff; longer-lived disconnects reset this state.
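The bookkeeping described above could look roughly like the sketch below. It is not the PR's code: `ReconnectTracker`, its fields, the `u64` stand-in for `PeerId`, and the exact delay formula (`base_delay * factor^streak`, capped at `max_delay`) are all assumptions made for illustration. Only `MIN_CONNECTION_DURATION_FOR_BACKOFF_RESET` and its 1s value come from this PR and its review.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

// Hypothetical stand-in for libp2p's PeerId so the sketch needs no external crates.
type PeerId = u64;

// Name from the PR; the 1s value is the one quoted in the review below.
const MIN_CONNECTION_DURATION_FOR_BACKOFF_RESET: Duration = Duration::from_secs(1);

/// Hypothetical per-peer reconnection backoff state.
pub struct ReconnectTracker {
    base_delay: Duration,
    factor: u32,
    max_delay: Duration,
    // When the tracked outbound connection to each peer was established.
    connection_timestamps: HashMap<PeerId, Instant>,
    // Number of consecutive short-lived connections per peer.
    short_lived_streak: HashMap<PeerId, u32>,
}

impl ReconnectTracker {
    pub fn new(base_delay: Duration, factor: u32, max_delay: Duration) -> Self {
        Self {
            base_delay,
            factor,
            max_delay,
            connection_timestamps: HashMap::new(),
            short_lived_streak: HashMap::new(),
        }
    }

    /// Records the time of an outbound ConnectionEstablished event.
    pub fn on_outbound_established(&mut self, peer: PeerId, now: Instant) {
        self.connection_timestamps.insert(peer, now);
    }

    /// Called when the last connection to `peer` closes.
    /// Returns the delay to apply before the next dial, or `None` if the
    /// backoff state was reset instead.
    pub fn on_last_connection_closed(&mut self, peer: PeerId, now: Instant) -> Option<Duration> {
        let is_short_lived = self
            .connection_timestamps
            .remove(&peer)
            .is_some_and(|t| now.duration_since(t) < MIN_CONNECTION_DURATION_FOR_BACKOFF_RESET);

        if is_short_lived {
            let streak = self.short_lived_streak.entry(peer).or_insert(0);
            // Assumed formula: base_delay * factor^streak, capped at max_delay.
            let delay = self
                .base_delay
                .saturating_mul(self.factor.saturating_pow(*streak))
                .min(self.max_delay);
            *streak += 1;
            Some(delay)
        } else {
            // Long-lived (or untracked) connection: forget accumulated backoff.
            self.short_lived_streak.remove(&peer);
            None
        }
    }
}
```

Passing `now` explicitly, rather than calling `Instant::now()` inside the tracker, keeps the classification deterministic in tests.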

Tightens DialPeerStream dial-failure handling by tracking the ConnectionId of the emitted dial and only applying its internal retry backoff to matching DialFailure events, plus adds set_next_dial_time so the behaviour can inject the reconnection delay. Tests were updated to use the dial’s actual connection_id when simulating DialFailure.
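A simplified model of the dial-failure matching described above is sketched here. `DialStream`, its fields, and the local `ConnectionId` type are hypothetical stand-ins, not the real `DialPeerStream` or libp2p types. The method name `set_next_dial_time` comes from the PR, but its signature here is an assumption.

```rust
use std::time::{Duration, Instant};

// Hypothetical stand-in for libp2p's ConnectionId.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ConnectionId(pub usize);

/// Hypothetical, simplified model of a per-peer dial stream.
pub struct DialStream {
    retry_delay: Duration,
    // ConnectionId of the dial this stream emitted, if one is in flight.
    pending_dial: Option<ConnectionId>,
    // Earliest time at which the next dial may be emitted.
    next_dial_time: Option<Instant>,
}

impl DialStream {
    pub fn new(retry_delay: Duration) -> Self {
        Self { retry_delay, pending_dial: None, next_dial_time: None }
    }

    /// Remembers the ConnectionId of the dial that was just emitted.
    pub fn record_dial(&mut self, id: ConnectionId) {
        self.pending_dial = Some(id);
    }

    /// Applies the retry delay only if the failure belongs to our own dial.
    /// Returns whether the failure matched.
    pub fn on_dial_failure(&mut self, id: ConnectionId, now: Instant) -> bool {
        if self.pending_dial != Some(id) {
            // Failure of some other dial: leave our retry state untouched.
            return false;
        }
        self.pending_dial = None;
        self.next_dial_time = Some(now + self.retry_delay);
        true
    }

    /// Lets the owning behaviour inject an externally computed delay
    /// (signature assumed for this sketch).
    pub fn set_next_dial_time(&mut self, at: Instant) {
        self.next_dial_time = Some(at);
    }

    /// True when no dial is in flight and any scheduled delay has elapsed.
    pub fn can_dial(&self, now: Instant) -> bool {
        self.pending_dial.is_none() && self.next_dial_time.map_or(true, |t| now >= t)
    }
}
```

In this model, a `DialFailure` carrying an unrelated `ConnectionId` is ignored, which is why tests simulating a failure need to use the id of the dial that was actually emitted.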

Written by Cursor Bugbot for commit b82209b.

cursor

Cursor Bugbot has reviewed your changes and found 2 potential issues.

cursor · 2026-03-25T15:27:32Z

crates/apollo_network/src/discovery/behaviours/dialing/behaviour.rs

+                }
+            }
+            _ => {}
+        }


Short-lived backoff breaks existing reconnection test

High Severity

The new short-lived connection detection in on_swarm_event causes a regression in the existing discovery_redials_when_all_connections_closed test. That test calls connect_to_bootstrap_node (which fires ConnectionEstablished with a Dialer endpoint, so a timestamp is recorded), then immediately fires ConnectionClosed. Since the elapsed time is ~0ms (well under MIN_CONNECTION_DURATION_FOR_BACKOFF_RESET of 1s), is_short_lived evaluates to true, and a reconnect backoff of 1000ms is applied (from the test config base_delay_millis: 1000, factor: 1). The test then expects the re-dial within TIMEOUT of 1s — making it race against the backoff delay. This is at best extremely flaky and likely a CI failure.

cursor · 2026-03-25T15:27:33Z

crates/apollo_network/src/discovery/behaviours/dialing/behaviour.rs

+                } else {
+                    self.reconnect_backoffs.remove(&peer_id);
+                    self.pending_reconnect_delays.remove(&peer_id);
+                }


Inbound connection close incorrectly clears accumulated backoff state

Medium Severity

When the last connection to a peer closes and it was an inbound connection (not tracked in connection_timestamps), timestamp is None, making is_short_lived evaluate to false. The else branch then clears reconnect_backoffs and pending_reconnect_delays for that peer. This incorrectly resets any accumulated outbound split-brain backoff state, defeating the exponential backoff protection the PR is designed to provide.
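One possible way to address this, offered only as a sketch, is to treat a missing timestamp as its own case that leaves the backoff state alone instead of folding it into the "long-lived" branch. The `CloseAction` enum and `classify_close` function are hypothetical names and are not part of the PR.

```rust
use std::time::Duration;

// Name from the PR; 1s is the value quoted in the review above.
const MIN_CONNECTION_DURATION_FOR_BACKOFF_RESET: Duration = Duration::from_secs(1);

/// Hypothetical outcome of the last connection to a peer closing.
#[derive(Debug, PartialEq, Eq)]
pub enum CloseAction {
    /// Tracked outbound connection closed too quickly: delay the next dial.
    ApplyBackoff,
    /// Tracked outbound connection lived long enough: clear backoff state.
    ResetBackoff,
    /// No timestamp was recorded (e.g. an inbound connection): change nothing.
    KeepState,
}

/// `outbound_lifetime` is `None` when the closed connection had no entry
/// in the timestamp map.
pub fn classify_close(outbound_lifetime: Option<Duration>) -> CloseAction {
    match outbound_lifetime {
        None => CloseAction::KeepState,
        Some(d) if d < MIN_CONNECTION_DURATION_FOR_BACKOFF_RESET => CloseAction::ApplyBackoff,
        Some(_) => CloseAction::ResetBackoff,
    }
}
```

With this split, only a tracked outbound connection that outlives the threshold removes entries from `reconnect_backoffs` and `pending_reconnect_delays`. An inbound close leaves any accumulated outbound backoff intact.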

apollo_network: add reconnection backoff for short-lived connections

b82209b

sirandreww-starkware self-assigned this Mar 25, 2026

sirandreww-starkware requested a review from ShahakShama March 25, 2026 15:19

sirandreww-starkware marked this pull request as ready for review March 25, 2026 15:19

cursor bot reviewed Mar 25, 2026

sirandreww-starkware wants to merge 1 commit into main from 03-25-apollo_network_fixed_immediate_re-dial_on_dial_failure_issue
