

Hello! I am running Dagster in Kubernetes with the `K8sRunLauncher`. In the past couple of weeks, I’ve noticed that jobs occasionally get stuck in either the “Starting” or “Started” state. These jobs never terminate and clog up the run queue, preventing other jobs from starting.

What could be the cause of jobs getting into this state? I realize I can add run monitoring to prevent jobs from running forever, but I would like to understand the cause. Is this something that has been seen before?

Example event logs from a run that has been stuck in “Starting”

Hi Will, I've seen this before when I had issues with the node selector. My jobs would be created and would spawn the worker pod, but the pods would never be assigned to a node, so the jobs remained in the “Started” state. For me, it was helpful to inspect the events on the cluster with `kubectl get events`.
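If it helps, here is a rough Python sketch of the same check done in bulk: it reads the output of `kubectl get pods -o json` and lists pods that are still `Pending` because the scheduler could not place them (the `PodScheduled` condition is `False`). The function names, the `dagster` namespace, and the `dagster-run-` name prefix are assumptions for illustration, so adjust them to your deployment.

```python
import json
import subprocess


def find_unschedulable_pods(pod_list, name_prefix=""):
    """Return (pod_name, message) pairs for Pending pods that were never scheduled.

    `pod_list` is the parsed JSON from `kubectl get pods -o json`.
    `name_prefix` is an optional filter (hypothetical default: match everything).
    """
    stuck = []
    for pod in pod_list.get("items", []):
        name = pod.get("metadata", {}).get("name", "")
        status = pod.get("status", {})
        if not name.startswith(name_prefix) or status.get("phase") != "Pending":
            continue
        for cond in status.get("conditions", []):
            # PodScheduled=False means no node has been assigned to the pod yet.
            if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
                stuck.append((name, cond.get("message", cond.get("reason", ""))))
    return stuck


def load_pods(namespace):
    """Fetch all pods in a namespace as parsed JSON (requires kubectl access)."""
    result = subprocess.run(
        ["kubectl", "get", "pods", "-n", namespace, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)


def main():
    # Assumed namespace and pod-name prefix; change both to match your cluster.
    for name, message in find_unschedulable_pods(load_pods("dagster"), "dagster-run-"):
        print(f"{name}: {message}")
```

Calling `main()` from a machine with `kubectl` access prints one line per unscheduled pod along with the scheduler's message, which usually names the mismatch (for example, a node selector that no node satisfies).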