# dagster-feedback
a
It appears that the k8s launcher does not have a mechanism to detect jobs which have entered a state like `BackoffLimitExceeded`. It is easy for a k8s job to enter this state if the cluster's scaling has reached capacity and a job is unable to be scheduled for a sufficient period of time. This leaves runs hanging indefinitely as kubernetes will not continue to try to schedule the job after capacity is available again.
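To illustrate the kind of check being asked for, here is a minimal sketch, not Dagster's actual implementation. It assumes the Job's status has already been fetched and converted to a plain dict shaped like the Kubernetes API JSON, where a failed Job carries a condition with `type: Failed`, `status: "True"`, and a `reason`. The helper names `job_failed_reason` and `is_backoff_limit_exceeded` are hypothetical.

```python
from typing import Any, Mapping, Optional


def job_failed_reason(job_status: Mapping[str, Any]) -> Optional[str]:
    """Return the reason on the Job's Failed condition, or None if it has not failed.

    `job_status` is assumed to be the `status` field of a Kubernetes Job,
    already converted to a plain dict (hypothetical input shape).
    """
    for cond in job_status.get("conditions") or []:
        # Condition status is the string "True", not a boolean.
        if cond.get("type") == "Failed" and cond.get("status") == "True":
            return cond.get("reason")
    return None


def is_backoff_limit_exceeded(job_status: Mapping[str, Any]) -> bool:
    """True when the Job has given up retrying because its backoff limit was hit."""
    return job_failed_reason(job_status) == "BackoffLimitExceeded"


# Example inputs (hand-written for illustration, not real cluster output).
stuck_job = {
    "failed": 1,
    "conditions": [
        {"type": "Failed", "status": "True", "reason": "BackoffLimitExceeded"}
    ],
}
running_job = {"active": 1}
```

A launcher or monitor could call `is_backoff_limit_exceeded` on each polled Job status and mark the corresponding run as failed instead of leaving it hanging.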
d
Hi Alec - do you have the "run monitoring" feature here enabled? https://docs.dagster.io/deployment/run-monitoring#run-monitoring That's intended to detect and gracefully terminate hanging runs like this (planning to start enabling it by default in the 1.2 release)
a
I don't believe so! Thank you. There really are so many good aspects to dagster that sometimes it's just a challenge discovering what is available.