Retain pods for failed jobs a longer period #102
Comments
That would still mean changing the elastic agent profile. I wonder if it can be made more dynamic. I'm imagining some kind of metadata which would tell the plugin not to terminate the pod immediately. Related code is here. Of course, with this, the plugin would need to store how long the pod should be retained and clean it up later.
This can be achieved with terminationGracePeriodSeconds or a preStop hook in Kubernetes. We could sleep for a variable amount of time in the preStop hook; people could then choose to keep agents around for a shorter or longer period based on the config.
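A minimal sketch of that idea, assuming the plugin builds pods with the fabric8 kubernetes-client it already uses; the pod name, image, and the 300-second sleep are placeholder values standing in for a configurable retention period, and exact builder method names may differ slightly between client versions:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodBuilder;

public class RetainedAgentPodExample {
    public static Pod buildPod() {
        return new PodBuilder()
            .withNewMetadata()
                .withName("gocd-agent-example") // placeholder pod name
            .endMetadata()
            .withNewSpec()
                // The grace period must be at least as long as the preStop sleep,
                // otherwise the kubelet kills the container before the sleep finishes.
                .withTerminationGracePeriodSeconds(360L)
                .addNewContainer()
                    .withName("gocd-agent")
                    .withImage("gocd/gocd-agent-alpine:latest") // placeholder image
                    .withNewLifecycle()
                        .withNewPreStop()
                            .withNewExec()
                                // Keep the agent pod around for ~5 minutes after deletion is requested.
                                .withCommand("sh", "-c", "sleep 300")
                            .endExec()
                        .endPreStop()
                    .endLifecycle()
                .endContainer()
            .endSpec()
            .build();
    }
}
```

One drawback of this approach is that the delay applies to every pod deletion, whether the job passed or failed, unless the plugin varies the sleep per elastic profile.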
@ckaushik My concern is: the GoCD server will send only one event about job completion. If we miss that event and don't terminate the pod, without keeping track of the fact that the event was sent, we'll leave pods behind. So either we keep track of the event and delay the point at which the plugin terminates the pod, or maybe you're suggesting that we somehow use the preStop hook for this. Whatever the implementation is, my concern is that we should terminate the pods eventually.
Can't we pass along the job status to https://github.com/gocd/kubernetes-elastic-agents/blob/master/src/main/java/cd/go/contrib/elasticagent/requests/JobCompletionRequest.java#L29 at the time of termination to have it not terminate the pod that completed the job? The elastic profile can have a property (pod_retention_time?) that specifies how long after job completion the pod remains.
Yes. We would need to store that information, as you said, and make sure that the pod is terminated later.
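A minimal sketch of what storing and acting on that information could look like, assuming a hypothetical reaper class inside the plugin; the names (DelayedPodReaper, onJobCompletion, terminatePod) and the pod_retention_time-driven duration are illustrative, not the plugin's actual API:

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: when the server reports job completion, record the pod
// and schedule its deletion after a retention period instead of deleting it immediately.
public class DelayedPodReaper {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // 'retention' would come from an elastic profile property such as the
    // proposed pod_retention_time; passed jobs keep the current behaviour.
    public void onJobCompletion(String podName, boolean jobPassed, Duration retention) {
        Duration delay = jobPassed ? Duration.ZERO : retention;
        scheduler.schedule(() -> terminatePod(podName), delay.getSeconds(), TimeUnit.SECONDS);
    }

    private void terminatePod(String podName) {
        // Placeholder: the real plugin would call its Kubernetes client here to
        // delete the pod, tolerating the case where it is already gone.
        System.out.println("Deleting pod " + podName);
    }
}
```

Note that an in-memory scheduler like this forgets pending deletions if the plugin restarts, so a real implementation would also need some periodic sweep of leftover agent pods to satisfy the "terminate them eventually" requirement.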
2.0.0 introduced terminating a pod immediately upon job completion, which has proved useful when running many different pipelines with elastic agents: nodes' resources are freed up more quickly, which reduces how often the cluster has to autoscale.
This has introduced some extra difficulty when troubleshooting failed jobs, though: since the pods are cleaned up immediately, there is nothing left to debug. It could be useful to set an alternate grace period for pods whose assigned jobs have failed, to give the option of looking around inside the pod.