Contributor

Copilot AI commented Dec 20, 2025

Plan: Add retry logic for AgentRun RPC calls

  • Analyze the current AgentRun gRPC call flow and identify failure points
  • Add retry configuration parameters to Config struct (max retries, initial backoff, max backoff)
  • Implement retry logic with exponential backoff in grpc_client.go AgentRun method
  • Fix import issues in grpc_test.go
  • Add specific tests for retry logic (TestIsRetryableError)
  • Address code review feedback (logging improvements, removed jitter comment)
  • Improve error detection with gRPC status codes
  • Fix backoff calculation to prevent integer overflow
  • Run tests and security scan
  • Final code review (no issues found)
  • Merge main branch to incorporate latest changes

Summary

Successfully merged the main branch, which includes:

  • Dependency updates: google.golang.org/grpc v1.77.0 → v1.78.0
  • Dependency updates: github.com/golang-jwt/jwt/v4 v4.5.0 → v4.5.2
  • Fix for missing types package causing test compilation failures

All tests pass after merge. The retry logic implementation remains intact and compatible with the latest codebase.

Original prompt

This section details the original issue you should resolve

<issue_title>AgentRun: rpc error: code = Unavailable desc = transport is closing</issue_title>
<issue_description>Describe the bug
AgentRun RPC calls sometimes fail on a selected node.

This is where this error comes from: https://github.com/distribworks/dkron/blob/master/dkron/grpc_agent.go#L92

because that one fails:
https://github.com/distribworks/dkron/blob/master/dkron/grpc_agent.go#L78

This makes me think that the RPC call to the agent node simply fails there:
https://github.com/distribworks/dkron/blob/master/plugin/executor.go#L78

Also, at that moment a few other job execution intents on the same node fail with the same error.
I guess some retries could be implemented there. The failure may happen because the agent is busy with something at that moment, causing the timeout to fire on the caller side.

The error does not happen often, but it results in missed executions.
Most probably it happens when there are many job execution intents for the same agent at the same time (which is our case).

</issue_description>

<agent_instructions>Analyze the issue and propose a solution</agent_instructions>

Comments on the Issue (you are @copilot in this section)



@coderabbitai

coderabbitai bot commented Dec 20, 2025

Important

Review skipped

Bot user detected.


Copilot AI and others added 4 commits December 20, 2025 10:50
Co-authored-by: vcastellm <47026+vcastellm@users.noreply.github.com>
…ulation

Co-authored-by: vcastellm <47026+vcastellm@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Fix rpc error handling in AgentRun calls" to "Add retry logic with exponential backoff for AgentRun RPC calls" Dec 20, 2025
Copilot AI requested a review from vcastellm December 20, 2025 11:06
@vcastellm
Member

@copilot merge main

@vcastellm vcastellm marked this pull request as ready for review January 21, 2026 18:46
Contributor Author

Copilot AI commented Jan 21, 2026

@copilot merge main

Done. Merged main branch (847ec74) which includes gRPC v1.78.0, JWT v4.5.2, and the types package fix. All tests pass.



Development

Successfully merging this pull request may close these issues.

AgentRun: rpc error: code = Unavailable desc = transport is closing

2 participants