Retain pods for failed jobs a longer period #102

Open
Evesy opened this issue Apr 12, 2019 · 5 comments

Comments

@Evesy

Evesy commented Apr 12, 2019

2.0.0 introduced terminating a pod immediately upon job completion. This has proved useful when running many different pipelines with elastic agents: nodes' resources are freed up sooner, which reduces the chance of nodes having to autoscale.

It has also made troubleshooting failed jobs harder, though: since the pods are cleaned up immediately, there is nothing left to debug. It could be useful to set an alternate grace period for pods whose assigned jobs have failed, to give the option to look around the pod.

@arvindsv
Member

That would still mean changing the elastic agent profile. I wonder if it can be made more dynamic. I'm imagining some kind of metadata which would tell it to not terminate immediately.

Related code is here. Of course, with this, the plugin would need to store the information that the pod is to be retained for now and cleaned up later.

@ckaushik

This can be achieved with terminationGracePeriodSeconds or a preStop hook in Kubernetes. We can probably sleep for a variable amount of time in the preStop hook. People can then choose to keep agents around for a shorter or longer time based on the config.
@arvindsv wdyt?
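
For illustration, here is a rough sketch of how the two settings would have to relate. Kubernetes counts the time spent in a preStop hook against the pod's termination grace period, so the grace period needs to be at least as long as the sleep. `PreStopDelay`, its method names and the 30 second margin are hypothetical and not part of the plugin; the sketch also assumes the agent image ships `/bin/sh` and `sleep`. Note that this approach would delay termination for every pod, not only for pods whose job failed.

```java
import java.util.List;

// Hypothetical helper, not part of the plugin: derives the two pod settings
// needed to keep a pod around for a while after termination is requested.
class PreStopDelay {
    // Assumed extra time for the agent to shut down once the sleep has finished.
    static final long SHUTDOWN_MARGIN_SECONDS = 30;

    // Command for the container's preStop exec hook: sleep before shutting down.
    // Assumes the image provides /bin/sh and sleep.
    static List<String> preStopCommand(long retainSeconds) {
        return List.of("/bin/sh", "-c", "sleep " + Math.max(0, retainSeconds));
    }

    // The grace period must cover the preStop sleep, otherwise the container
    // is killed before the sleep finishes.
    static long terminationGracePeriodSeconds(long retainSeconds) {
        return Math.max(0, retainSeconds) + SHUTDOWN_MARGIN_SECONDS;
    }
}
```

With a retention of 600 seconds this would give a preStop command of `sleep 600` and a grace period of 630 seconds.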

@arvindsv
Member

@ckaushik My concern is: The GoCD server will send only one event about job completion. If we miss that and don't terminate the pod, without keeping track of the fact that the event was sent, then it'll leave behind pods. So either we keep track of the event and have the plugin terminate the pod later, or, as you may be suggesting, we somehow use the terminationGracePeriodSeconds option and let k8s delay the termination.

Anyway, however it is implemented, my concern is that the pods must be terminated eventually.
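
To make the "keep track of the event" option concrete, a minimal sketch could look like the following. `RetainedPods` and its methods are hypothetical names, not existing plugin code. The idea is that the single job-completion event records a deadline per pod, and a periodic sweep returns the pods that are due, so nothing depends on a second event arriving.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of the plugin: remembers which pods were kept
// alive after their job completed and when each one must be terminated.
class RetainedPods {
    // pod name -> instant at which the pod becomes due for deletion
    private final Map<String, Instant> deadlines = new ConcurrentHashMap<>();

    // Called once, when the single job-completion event arrives.
    void retain(String podName, Instant completedAt, Duration retention) {
        deadlines.put(podName, completedAt.plus(retention));
    }

    // Called periodically. Returns the pods whose deadline has been reached
    // and forgets them; the caller is expected to delete the returned pods.
    List<String> collectExpired(Instant now) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Instant> e : deadlines.entrySet()) {
            boolean due = !now.isBefore(e.getValue());
            // remove(key, value) only succeeds if the entry was not changed meanwhile
            if (due && deadlines.remove(e.getKey(), e.getValue())) {
                expired.add(e.getKey());
            }
        }
        return expired;
    }

    int size() {
        return deadlines.size();
    }
}
```

This keeps the deadlines in memory only, so they would be lost if the plugin restarts. One option might be to record the deadline on the pod itself (for example as an annotation) so that a sweep can rebuild the state, but that is an assumption and not something the plugin is known to do.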

@varshavaradarajan
Contributor

Can't we pass along the job status to https://github.com/gocd/kubernetes-elastic-agents/blob/master/src/main/java/cd/go/contrib/elasticagent/requests/JobCompletionRequest.java#L29 at the time of termination to have it not terminate the pod that completed the job? The elastic profile can have a property (pod_retention_time?) that specifies how long after job completion the pod remains.
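
As a sketch of what that decision could look like, assuming the job status does reach the plugin and that `pod_retention_time` is given in whole minutes (the unit, the class name `RetentionPolicy` and the method `retentionFor` are all hypothetical):

```java
import java.time.Duration;
import java.util.Map;

// Hypothetical helper, not part of the plugin: decides how long a pod should
// be kept after its job completes, based on the elastic profile properties.
class RetentionPolicy {
    // Property name suggested in this thread; the unit (minutes) is an assumption.
    static final String PROPERTY = "pod_retention_time";

    static Duration retentionFor(Map<String, String> profile, boolean jobFailed) {
        // Pods of successful jobs are terminated immediately, as in 2.0.0.
        if (!jobFailed) {
            return Duration.ZERO;
        }
        String raw = profile.get(PROPERTY);
        if (raw == null || raw.trim().isEmpty()) {
            return Duration.ZERO;
        }
        try {
            long minutes = Long.parseLong(raw.trim());
            return minutes > 0 ? Duration.ofMinutes(minutes) : Duration.ZERO;
        } catch (NumberFormatException e) {
            // An unparsable value falls back to immediate termination.
            return Duration.ZERO;
        }
    }
}
```

A result of `Duration.ZERO` means "terminate now"; anything else would be the delay after which the plugin still has to terminate the pod.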

@arvindsv
Member

Yes. We would need to store that information, as you said, and make sure that the pod is terminated later.


4 participants