# deployment-kubernetes
I have a Dagster deployment on AWS EKS which works quite well. I run long-running Dagster jobs or steps: several hours to several days. I have been having trouble with the Kubernetes autoscaler evicting nodes on which these long-running jobs are still running, basically ruining my jobs. Does anybody have similar problems, and if so, any mitigation? Discussing the subject with ChatGPT, the options I see are 1) implement some Kubernetes finalizers that would prevent pods running jobs from being deleted, or 2) implement a custom autoscaler or controller.
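For context, the annotation discussed in the replies below is presumably the cluster autoscaler's `cluster-autoscaler.kubernetes.io/safe-to-evict: "false"` pod annotation, which asks the autoscaler not to scale down the node a pod is running on. A minimal sketch of attaching it to a Dagster job's pods via the `dagster-k8s/config` tag (job and op names are hypothetical):

```python
from dagster import job, op


@op
def long_running_step():
    # Placeholder for a step that runs for hours or days.
    ...


# The "dagster-k8s/config" tag lets the K8sRunLauncher / k8s_job_executor
# apply extra Kubernetes config to the pods it creates for this job.
@job(
    tags={
        "dagster-k8s/config": {
            "pod_template_spec_metadata": {
                "annotations": {
                    # Ask the cluster autoscaler not to evict this pod
                    # during scale-down.
                    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
                }
            }
        }
    }
)
def long_running_job():
    long_running_step()
```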
Unfortunately, it does not seem to be enough. I already have this annotation.
Are you sure the annotation is applied to the pod and not only to the job?
I am not sure I understand your question. I only see the annotation on the pod, not on the job, which should be correct because this is a pod annotation.
A Job creates a Pod, but the Job's annotations are not necessarily the same as its Pod's annotations; that's why I was asking.
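One way to check all three places the annotation can live, sketched with the official Kubernetes Python client (the job name and namespace are placeholders):

```python
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster

batch = client.BatchV1Api()
core = client.CoreV1Api()

namespace = "dagster"         # placeholder
job_name = "dagster-run-abc"  # placeholder

job = batch.read_namespaced_job(job_name, namespace)
print("Job annotations:         ", job.metadata.annotations)
print("Pod template annotations:", job.spec.template.metadata.annotations)

# The pods the Job actually created; these are the objects the
# cluster autoscaler inspects when deciding whether a node is evictable.
pods = core.list_namespaced_pod(namespace, label_selector=f"job-name={job_name}")
for pod in pods.items:
    print(pod.metadata.name, pod.metadata.annotations)
```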
So I do see the annotation on the pod, so it should help. I was surprised myself that it was not enough; ChatGPT says there is no guarantee.
And as a matter of fact, MOST of the time it does not scale down the node, but sometimes it does, and it is usually when the job is quite long, which is quite annoying: I do not save state, so I have to restart from the beginning...
OK, it seems that I am a victim of a known EKS problem: Availability Zone Rebalancing (AZRebalance), which does not take the annotation into account. I have observed that my pods were terminated on nodes which were evicted because of AZRebalance. I will try to suspend Availability Zone Rebalancing and see if it helps.
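Suspending AZRebalance is done on the Auto Scaling Group backing the node group; a sketch with boto3 (the ASG name and region are placeholders):

```python
import boto3

asg = boto3.client("autoscaling", region_name="eu-west-1")  # region is a placeholder

# AZRebalance terminates instances to even out capacity across
# Availability Zones and does not respect the safe-to-evict annotation,
# so suspend that process on the ASG backing the EKS node group.
asg.suspend_processes(
    AutoScalingGroupName="eks-my-nodegroup-asg",  # placeholder name
    ScalingProcesses=["AZRebalance"],
)
```

The same call is available via `resume_processes` to turn rebalancing back on later.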
Thanks @Andrea Giardini for the help!