Eldan Hamdani

05/16/2023, 7:37 AM
Hi, we’ve got an issue when a Kubernetes pod (the dagster-run pod) is killed unexpectedly. We’ve seen that the run in Dagit still shows as running although the pod was killed. Could you help me understand why that happens?

Roei Jacobovich

05/16/2023, 10:57 AM
What was the “job termination reason”? It can be seen when describing the K8s job itself. My first guess would be some sort of out-of-memory issue for a complex Dagster job.

Eldan Hamdani

05/16/2023, 10:59 AM
It happened for us when using spot instances in our GKE node pool, but even if I just launch a job manually in Dagit and delete the pod manually after a few minutes, I still see the run as running in Dagit…

Roei Jacobovich

05/16/2023, 11:01 AM
Do you have “run monitoring” enabled? https://docs.dagster.io/deployment/run-monitoring Re spot instances: you can look out for eviction events in the cluster.
The “run monitoring” mechanism should solve the issue you described, where Dagit does not detect crashed jobs. Separately, checking the K8s jobs and the relevant events would help in researching why the jobs were killed in the first place.

Eldan Hamdani

05/16/2023, 11:04 AM
But why wasn’t the run in Dagit marked as failed after the pod crashed? (Although I still haven’t enabled run monitoring.)

Roei Jacobovich

05/16/2023, 11:11 AM
What happens exactly after the K8s job has crashed? Do you see logs written in Dagit? Do you see Ops running? What happens after ops finish their execution?

Eldan Hamdani

05/16/2023, 11:13 AM
The ops keep running forever and we don’t see anything in the UI.

Roei Jacobovich

05/16/2023, 11:50 AM
Do they show as “running” in Dagit, or are their steps running forever as well? What do you see in their logs (on the containers themselves)?

Eldan Hamdani

05/16/2023, 11:51 AM
the steps are still running, and I don’t have any container to look at because the pod crashed.

jordan

05/16/2023, 8:05 PM
what you’re seeing is the “hanging runs” described in the run monitoring doc - in short, dagster still sees the steps as running because nothing has communicated to it that the run is no longer running. let us know if you still see this after enabling run monitoring, but i suspect it’ll take care of the issue for you
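
To illustrate why a run can hang, here is a minimal toy sketch of the idea behind run monitoring: a separate loop looks at runs that are still marked as started and fails the ones whose worker no longer exists. This is not Dagster's implementation or API; `Run`, `RunStatus`, `monitor_once`, and `worker_is_alive` are hypothetical names made up for this example, and the "live pods" set stands in for a real health check against Kubernetes.

```python
from dataclasses import dataclass
from enum import Enum


class RunStatus(Enum):
    STARTED = "STARTED"
    FAILURE = "FAILURE"


@dataclass
class Run:
    # Hypothetical run record; not Dagster's actual run model.
    run_id: str
    status: RunStatus


def monitor_once(runs, worker_is_alive):
    """Mark STARTED runs whose worker has disappeared as FAILURE.

    `worker_is_alive` is a caller-supplied callable (assumed here) that
    takes a run id and returns True if its worker still exists.
    Returns the ids of the runs that were marked failed in this pass.
    """
    failed = []
    for run in runs:
        if run.status is RunStatus.STARTED and not worker_is_alive(run.run_id):
            run.status = RunStatus.FAILURE
            failed.append(run.run_id)
    return failed


# Two runs are recorded as started, but the pod for "run-b" was deleted.
runs = [Run("run-a", RunStatus.STARTED), Run("run-b", RunStatus.STARTED)]
live_pods = {"run-a"}

newly_failed = monitor_once(runs, lambda run_id: run_id in live_pods)
print(newly_failed)
```

Without a loop like this, nothing ever updates the record for "run-b": the process that would have reported a failure is the one that was killed, so its status stays at started indefinitely.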

Eldan Hamdani

05/17/2023, 7:10 AM
And where should I add the run_monitoring config? In the Helm chart? Can you send me an example?

Eldan Hamdani

05/22/2023, 8:24 AM
@Saar Amitay