Retain pods for failed jobs a longer period #102
Comments
That would still mean changing the elastic agent profile. I wonder if it can be made more dynamic. I'm imagining some kind of metadata which would tell the plugin not to terminate the pod immediately. Related code is here. Of course, with this, the plugin would need to store how long the pod should be retained and clean it up later.
This can be achieved with terminationGracePeriodSeconds or a preStop hook in Kubernetes. We could sleep for a variable amount of time in the preStop hook; people could then choose to keep agents around for a shorter or longer period based on the config.
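A minimal sketch of that idea, assuming the plugin builds pods with the fabric8 kubernetes-client it already uses; the pod name, image, and the 300-second sleep are placeholder values standing in for a configurable retention period, and exact builder method names may differ slightly between client versions:

```java
import io.fabric8.kubernetes.api.model.Pod;
import io.fabric8.kubernetes.api.model.PodBuilder;

public class RetainedAgentPodExample {
    public static Pod buildPod() {
        return new PodBuilder()
            .withNewMetadata()
                .withName("gocd-agent-example") // placeholder pod name
            .endMetadata()
            .withNewSpec()
                // The grace period must be at least as long as the preStop sleep,
                // otherwise the kubelet kills the container before the sleep finishes.
                .withTerminationGracePeriodSeconds(360L)
                .addNewContainer()
                    .withName("gocd-agent")
                    .withImage("gocd/gocd-agent-alpine:latest") // placeholder image
                    .withNewLifecycle()
                        .withNewPreStop()
                            .withNewExec()
                                // Keep the agent pod around for ~5 minutes after deletion is requested.
                                .withCommand("sh", "-c", "sleep 300")
                            .endExec()
                        .endPreStop()
                    .endLifecycle()
                .endContainer()
            .endSpec()
            .build();
    }
}
```

One drawback of this approach is that the delay applies to every pod deletion, whether the job passed or failed, unless the plugin varies the sleep per elastic profile.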
@ckaushik My concern is: the GoCD server will send only one event about job completion. If we miss that event and don't terminate the pod, without keeping track of the fact that the event was sent, we'll leave pods behind. So either we keep track of the event and delay the point at which the plugin terminates the pod, or maybe you're suggesting that we somehow use the preStop hook for this. Whatever the implementation is, my concern is that we should terminate the pods eventually.
Can't we pass along the job status to https://github.com/gocd/kubernetes-elastic-agents/blob/master/src/main/java/cd/go/contrib/elasticagent/requests/JobCompletionRequest.java#L29 at the time of termination to have it not terminate the pod that completed the job? The elastic profile can have a property (pod_retention_time?) that specifies how long after job completion the pod remains.
Yes. We would need to store that information, as you said, and make sure that the pod is terminated later.
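A minimal sketch of what storing and acting on that information could look like, assuming a hypothetical reaper class inside the plugin; the names (DelayedPodReaper, onJobCompletion, terminatePod) and the pod_retention_time-driven duration are illustrative, not the plugin's actual API:

```java
import java.time.Duration;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: when the server reports job completion, record the pod
// and schedule its deletion after a retention period instead of deleting it immediately.
public class DelayedPodReaper {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor();

    // 'retention' would come from an elastic profile property such as the
    // proposed pod_retention_time; passed jobs keep the current behaviour.
    public void onJobCompletion(String podName, boolean jobPassed, Duration retention) {
        Duration delay = jobPassed ? Duration.ZERO : retention;
        scheduler.schedule(() -> terminatePod(podName), delay.getSeconds(), TimeUnit.SECONDS);
    }

    private void terminatePod(String podName) {
        // Placeholder: the real plugin would call its Kubernetes client here to
        // delete the pod, tolerating the case where it is already gone.
        System.out.println("Deleting pod " + podName);
    }
}
```

Note that an in-memory scheduler like this forgets pending deletions if the plugin restarts, so a real implementation would also need some periodic sweep of leftover agent pods to satisfy the "terminate them eventually" requirement.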
2.0.0 introduced terminating a pod immediately upon job completion, which has proved useful when running many different pipelines with elastic agents: nodes' resources are freed up more quickly, which reduces how often the cluster has to autoscale.
This has introduced some extra difficulty when troubleshooting failed jobs, though: since the pods are cleaned up immediately, there is nothing left to debug. It could be useful to set an alternate grace period for pods whose assigned jobs have failed, to give the option of looking around inside the pod.