Hi team! On Dagster 0.13.7, we are encountering this error sporadically. Could you help us identify some potential root causes?
08/10/2022, 4:16 PM
Hi Bolin. Which run launcher are you using? We've seen this error appear when a k8s container restarts before the Dagster job has completed successfully.
08/11/2022, 2:49 AM
08/11/2022, 2:53 AM
we’re using the K8sRunLauncher
08/11/2022, 5:30 PM
Yeah, further up in the events I imagine you’ll see another pod start and emit the RUN_STARTED event. That pod then died for whatever reason (e.g. the node spun down), and Kubernetes has a known issue where it will still try to restart a pod even if you disable retries on it. We don’t support k8s pods restarting like that, so we guard against it with this status check.
You’ll want to:
• investigate why the pod failed in the first place (e.g. too much resource usage)
• if you decide it’s an ephemeral failure, consider configuring automatic retries at the Dagster level: https://docs.dagster.io/deployment/run-retries
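
For anyone reading along, here is a rough sketch of the idea behind the status check mentioned above. It is not Dagster's actual implementation: `RunStatus`, `RunAlreadyStartedError`, `start_run`, and the in-memory `run_store` dict are all hypothetical names made up for illustration. It only shows why a restarted pod hits an error instead of silently re-executing a run that has already emitted RUN_STARTED.

```python
from enum import Enum


class RunStatus(Enum):
    # Hypothetical statuses, for illustration only.
    NOT_STARTED = "NOT_STARTED"
    STARTED = "STARTED"


class RunAlreadyStartedError(Exception):
    """Raised when a process tries to start a run that was already started."""


def start_run(run_store, run_id):
    # Look up the recorded status; unknown runs count as not started.
    status = run_store.get(run_id, RunStatus.NOT_STARTED)
    if status is not RunStatus.NOT_STARTED:
        # A previous pod already began this run, so refuse to run it again.
        raise RunAlreadyStartedError(
            f"Run {run_id} has status {status.value}; refusing to start it again"
        )
    run_store[run_id] = RunStatus.STARTED


# Assumed stand-in for the run storage shared by both pods.
run_store = {}

# The first pod starts the run normally.
start_run(run_store, "run-1")

# Kubernetes restarts the pod; the second attempt trips the guard.
try:
    start_run(run_store, "run-1")
    restart_error = None
except RunAlreadyStartedError as exc:
    restart_error = str(exc)
```

In this sketch the second `start_run` call stands in for the restarted pod. The guard fails fast, so the original pod failure still needs to be investigated separately.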