#dagster-feedback

Alec Koumjian

02/22/2023, 1:45 PM
It appears that the k8s launcher does not have a mechanism to detect jobs which have entered a state like BackoffLimitExceeded. It is easy for a k8s job to enter this state if the cluster's scaling has reached capacity and a job is unable to be scheduled for a sufficient period of time. This leaves runs hanging indefinitely, as Kubernetes will not continue to try to schedule the job after capacity is available again.
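To illustrate the kind of check being asked for, here is a rough sketch, not what the launcher does today. It assumes the Job's status has already been fetched from the Kubernetes API as a plain dict, and it relies on Kubernetes marking such a Job with a condition of type "Failed" and reason "BackoffLimitExceeded". The function names and the example payload are hypothetical.

```python
from typing import Any, Mapping, Optional


def job_failure_reason(job_status: Mapping[str, Any]) -> Optional[str]:
    """Return the reason of the Job's "Failed" condition, or None if absent.

    `job_status` is assumed to be the `status` field of a batch/v1 Job,
    already converted to a plain dict (hypothetical input shape).
    """
    for condition in job_status.get("conditions") or []:
        # Condition status is the string "True", not a boolean.
        if condition.get("type") == "Failed" and condition.get("status") == "True":
            return condition.get("reason")
    return None


def is_backoff_limit_exceeded(job_status: Mapping[str, Any]) -> bool:
    """True if the Job has permanently failed because its retries ran out."""
    return job_failure_reason(job_status) == "BackoffLimitExceeded"


# Hypothetical status payload for a Job that exhausted its backoff limit.
example_status = {
    "failed": 1,
    "conditions": [
        {
            "type": "Failed",
            "status": "True",
            "reason": "BackoffLimitExceeded",
            "message": "Job has reached the specified backoff limit",
        }
    ],
}

# A Job that is still running has no "Failed" condition yet.
running_status = {"active": 1}
```

A monitor could poll each in-progress run's Job with a check like this and mark the run as failed instead of leaving it hanging.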

daniel

02/22/2023, 2:03 PM
Hi Alec - do you have the "run monitoring" feature here enabled? https://docs.dagster.io/deployment/run-monitoring#run-monitoring That's intended to detect and gracefully terminate hanging runs like this (planning to start enabling it by default in the 1.2 release)

Alec Koumjian

02/22/2023, 2:04 PM
I don't believe so! Thank you. There really are so many good aspects to Dagster that sometimes it's just a challenge discovering what is available.