# deployment-kubernetes
I have a Dagster deployment on AWS EKS which works quite well. I run long-running Dagster jobs or steps: several hours to several days. I have been having trouble with the Kubernetes autoscaler evicting nodes on which these long-running jobs are still running, basically ruining my jobs. Does anybody have similar problems, and if so, any mitigation? Discussing the subject with ChatGPT, the options I see are 1) implement some Kubernetes finalizers that would prevent pods running jobs from being deleted, or 2) implement a custom autoscaler or controller.
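For context, the annotation discussed in the replies below is presumably the cluster autoscaler's `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` pod annotation, which asks the autoscaler not to scale down the node a pod is running on. A minimal sketch of attaching it to a Dagster job's pods via the `dagster-k8s/config` tag (job and op names are hypothetical):

```python
from dagster import job, op


@op
def long_running_step():
    # Placeholder for a step that runs for hours or days.
    ...


# The "dagster-k8s/config" tag lets the K8sRunLauncher / k8s_job_executor
# apply extra Kubernetes config to the pods it creates for this job.
@job(
    tags={
        "dagster-k8s/config": {
            "pod_template_spec_metadata": {
                "annotations": {
                    # Ask the cluster autoscaler not to evict this pod
                    # during scale-down.
                    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
                }
            }
        }
    }
)
def long_running_job():
    long_running_step()
```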
Unfortunately, it does not seem to be enough. I already have this annotation.
Are you sure the annotation is applied to the pod and not only to the job?
I am not sure I understand your question. I only see the annotation on the pod, not on the job, which should be correct because this is a pod annotation.
A Job creates a Pod, but the Job's annotations are not necessarily the same as its Pod's annotations; that's why I was asking.
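One way to check all three places the annotation can live, sketched with the official Kubernetes Python client (the job name and namespace are placeholders):

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

batch = client.BatchV1Api()
core = client.CoreV1Api()

namespace = "dagster"         # placeholder
job_name = "dagster-run-abc"  # placeholder

job = batch.read_namespaced_job(job_name, namespace)
print("Job annotations:         ", job.metadata.annotations)
print("Pod template annotations:", job.spec.template.metadata.annotations)

# The pods the Job actually created; these are the objects the
# cluster autoscaler inspects when deciding whether a node is evictable.
pods = core.list_namespaced_pod(namespace, label_selector=f"job-name={job_name}")
for pod in pods.items:
    print(pod.metadata.name, pod.metadata.annotations)
```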
So I do see the annotation on the pod, so it should help. I was surprised myself that it was not enough; ChatGPT says there is no guarantee.
And as a matter of fact, MOST of the time it does not scale down the node, but sometimes it does, and it is usually when the job is quite long, which is quite annoying: I do not save state, so I have to restart from the beginning...
OK, it seems that I am a victim of a known EKS problem: Availability Zone Rebalancing (AZRebalance), which does not take the annotation into account. I have observed that my pods were terminated on nodes which were evicted because of AZRebalance. I will try to suspend Availability Zone Rebalancing and see if it helps.
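Suspending AZRebalance is done on the Auto Scaling Group backing the node group; a sketch with boto3 (the ASG name and region are placeholders):

```python
import boto3

asg = boto3.client("autoscaling", region_name="eu-west-1")  # region is a placeholder

# AZRebalance terminates instances to even out capacity across
# Availability Zones and does not respect the safe-to-evict annotation,
# so suspend that process on the ASG backing the EKS node group.
asg.suspend_processes(
    AutoScalingGroupName="eks-my-nodegroup-asg",  # placeholder name
    ScalingProcesses=["AZRebalance"],
)
```

The same call is available via `resume_processes` to turn rebalancing back on later.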
Thanks @Andrea Giardini for the help!