black-dragon74 (Member) commented Dec 3, 2025

This patch adds the ability to retry connecting to the sidecar up to
`maxRetries` times.

If none of the connection attempts succeeds, the object is considered
obsolete and is deleted.

The retry count is tracked in an annotation (`connRetryAnnotation`) and is
also reflected in the object's status.

These transient artifacts are cleaned up once a connection is
established.
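A minimal sketch of the annotation bookkeeping described above, assuming the counter is stored as a decimal string in the object's annotations map and that the object is treated as obsolete once the counter reaches `maxRetries`. The annotation key, the value of `maxRetries`, and the helper names are hypothetical. The real `connRetryAnnotation` key and helpers are not shown in this thread, and the status update is omitted.

```go
package main

import (
	"fmt"
	"strconv"
)

// connRetryAnnotation is a hypothetical key; the patch's real key is not shown.
const connRetryAnnotation = "example.com/connection-retries"

// maxRetries is an assumed limit for illustration.
const maxRetries = 3

// getRetryCount reads the retry counter from the annotations map.
// A missing, malformed, or negative value is treated as zero retries.
func getRetryCount(annotations map[string]string) int {
	v, ok := annotations[connRetryAnnotation]
	if !ok {
		return 0
	}
	n, err := strconv.Atoi(v)
	if err != nil || n < 0 {
		return 0
	}
	return n
}

// recordFailedAttempt increments the counter and reports whether the object
// should now be considered obsolete (and therefore deleted).
func recordFailedAttempt(annotations map[string]string) (retries int, obsolete bool) {
	retries = getRetryCount(annotations) + 1
	annotations[connRetryAnnotation] = strconv.Itoa(retries)
	return retries, retries >= maxRetries
}

// clearRetryState removes the transient counter once a connection succeeds.
func clearRetryState(annotations map[string]string) {
	delete(annotations, connRetryAnnotation)
}

func main() {
	ann := map[string]string{}
	for i := 0; i < maxRetries; i++ {
		retries, obsolete := recordFailedAttempt(ann)
		fmt.Println("retries:", retries, "obsolete:", obsolete)
	}
	clearRetryState(ann)
	fmt.Println("after success:", getRetryCount(ann))
}
```

Treating a malformed value as zero keeps a hand-edited annotation from wedging the reconciler; whether the real patch does the same is not stated here.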


```go
const (
	CSIAddonsNodeConnectionMaxRetries    = 3
	CSIAddonsNodeConnectionSleepInterval = 2 * time.Second
)
```
Collaborator

This could be a little short, maybe? If there is a temporary network issue, the object may get deleted too early?

Consider doing an exponential backoff like other retries in Kubernetes do.

Member Author

Done

```go
var connErr error
for i := range CSIAddonsNodeConnectionMaxRetries {
	logger.Info("Connecting to sidecar", "attempt", i)
	newConn, connErr = connection.NewConnection(ctx, endPoint, nodeID, driverName, csiAddonsNode.Namespace, csiAddonsNode.Name, r.EnableAuth)
	// ... (rest of the loop body is not shown in this excerpt)
}
```
Collaborator

Have you considered doing the retry in the connection itself? That would make the controller a little simpler; it should not need to care much about reconnecting. Only if a particular error is returned would the controller delete the CSIAddonsNode CR.

Member Author

This sounds great. Components other than this reconciler would also benefit from that change. Thank you!

Member Author

On further consideration, retries inside the connection package would block and might starve the work queue.

I implemented the backoff using `RequeueAfter` instead, and I track the retry state in an annotation and in the object's status.

black-dragon74 force-pushed the `fix-extra-addonsnodeconn` branch from b5acabf to 714a22d on December 9, 2025 13:33
mergify bot added the `api` label (Change to the API, requires extra care) on Dec 9, 2025
mergify bot requested a review from ShyamsundarR on Dec 9, 2025 13:33
Signed-off-by: Niraj Yadav <niryadav@redhat.com>
black-dragon74 force-pushed the `fix-extra-addonsnodeconn` branch from 714a22d to 31bdf4c on December 9, 2025 13:36
black-dragon74 added the `DNM` label (Do Not Merge) on Dec 9, 2025
black-dragon74 changed the title from "csiaddonsnode: delete the object after max connection retries" to "csiaddonsnode: Add retry with exponential backoff for connections" on Dec 9, 2025
```go
backoff := baseRetryDelay * time.Duration(math.Pow(2, float64(currentRetries)))
logger.Info("Requeuing request for attempting the connection again", "backoff", backoff)

return ctrl.Result{RequeueAfter: backoff}, nil
```
Collaborator

The 1st retry is after 0 seconds? That probably isn't what you want. Should `getRetryCountFromAnnotation()` really return 0 if the annotation is not set?

Member Author

The first retry delay is `baseRetryDelay * 2^valFromAnnotation`.

Hence, with no annotation set (value 0), the delay is 3 * 2^0 = 3 * 1, not 0, as intended.

nixpanic (Collaborator) commented Dec 9, 2025

What is the process to get a deleted CSIAddonsNode back in case of a longer network interruption? Can that be automated too?
