Stephen Bailey

06/08/2022, 5:35 PM
I'm getting into some situations where a bunch of hourly load jobs get kicked off, triggering auto-scaling and evicting the pods where my jobs are running. I'll get errors like this:
Step <op> finished without success or failure event. Downstream steps will not execute.
When I look at the job, I find:
Warning  TooManyActivePods  26m   job-controller  Too many active pods running after completion count reached
Still learning a good bit about k8s, but I'm wondering whether there's a way to tag the job pods as "do not destroy", or something to that effect?


06/08/2022, 5:40 PM
I believe what we want is
I don’t see why this shouldn’t be a built-in option for OSS and Cloud.
I believe one of the downsides here is that you cannot drain a node (i.e. remove all the pods from the node) if you set the
for any of those pods to 0 - you’ll have to wait until the job fully terminates before trying to do that manual drain
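
For the "do not destroy" question, one mechanism that fits is the cluster autoscaler's `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` pod annotation. This assumes the evictions come from the Kubernetes cluster autoscaler scaling nodes down; it may not be the same option referred to above. A minimal sketch of building that annotation as Dagster run tags, assuming the run pods are launched through dagster-k8s and pick up the `dagster-k8s/config` tag (`no_evict_tags` is a hypothetical helper name):

```python
import json

# Annotation read by the Kubernetes cluster autoscaler. A pod carrying it with
# the value "false" is not evicted when the autoscaler scales a node down.
SAFE_TO_EVICT = "cluster-autoscaler.kubernetes.io/safe-to-evict"


def no_evict_tags():
    """Build run tags that add the annotation to the pod template metadata.

    Hypothetical helper: it assumes dagster-k8s reads the "dagster-k8s/config"
    tag and copies "pod_template_spec_metadata" onto the pods it creates.
    """
    return {
        "dagster-k8s/config": {
            "pod_template_spec_metadata": {
                # Annotation values must be strings, hence "false" and not False.
                "annotations": {SAFE_TO_EVICT: "false"},
            }
        }
    }


if __name__ == "__main__":
    # Print the tag structure so it can be checked before attaching it to a job.
    print(json.dumps(no_evict_tags(), indent=2))
```

The resulting dict would be passed as `tags=` on the job definition. The trade-off is similar to the one described above: the autoscaler will not remove a node while a pod with this annotation is still running on it.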
Stephen Bailey

06/08/2022, 7:29 PM
thanks rex!