Ripple Khera
02/14/2023, 4:26 PM
• "Ignoring a duplicate run that was started from somewhere other than the run monitor daemon" gets logged in dagit at 03:23
• step pod runs for 11 hrs and terminates at 05:42
• next step pod is not spun up
I've been told Dagster support is pretty responsive, and this is becoming quite a headache for us, so any pointers would be much appreciated.
daniel
02/14/2023, 4:33 PM
If you can share the output of `kubectl describe` from the run worker pod (the name of the pod should be in the event logs from the run), that would help explain exactly what happened here to kill the run pod.
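For reference, a minimal sketch of that check; the namespace and pod-name pattern below are placeholders, and the real pod name comes from the run's event logs:

# placeholder names; substitute your namespace and the pod named in the run's event logs
kubectl -n dagster describe pod dagster-run-<run-id>-<suffix>
# the Events section at the end of the output is where evictions, OOM kills, and node drains show up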
Ripple Khera
02/14/2023, 4:34 PM
daniel
02/14/2023, 4:35 PM
daniel
02/14/2023, 4:35 PM
daniel
02/14/2023, 4:36 PM
(there's a `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation that can be set to False)
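As a rough sketch, this is how that annotation could be applied to an already-running worker pod (pod name is a placeholder; for future runs it would need to go on the run pod's template instead):

# placeholder pod name; mark the worker as not safe for the autoscaler to evict
kubectl -n dagster annotate pod <run-worker-pod> \
  cluster-autoscaler.kubernetes.io/safe-to-evict=false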
Ripple Khera
02/14/2023, 4:40 PM
daniel
02/14/2023, 4:41 PM
Michel Rouly
02/14/2023, 5:02 PM
• the step `Job` and its pod consistently ran from run kickoff and do not appear to have been restarted at any point
• the run `Job`'s pod got re-created / restarted halfway through, at 03:10
Based on that, it does look to us like the run got restarted after it crashed.
Michel Rouly
02/14/2023, 5:03 PM
daniel
02/14/2023, 5:11 PM
Do you have the `max_resume_run_attempts` key set as part of run monitoring? That's a specific feature within run monitoring for restarting crashed run workers (instead of just detecting hanging or crashed run workers and failing the run).
daniel
02/14/2023, 5:11 PM
Michel Rouly
02/14/2023, 5:12 PM
Our `run_monitoring` config:
run_monitoring:
  enabled: true
  start_timeout_seconds: 360     # fail runs whose worker hasn't started within 6 minutes
  max_resume_run_attempts: 5     # resume a crashed run worker up to 5 times
  poll_interval_seconds: 120     # check run worker health every 2 minutes
daniel
02/14/2023, 5:14 PM
Michel Rouly
02/14/2023, 5:55 PM
daniel
02/14/2023, 5:56 PM
Michel Rouly
02/14/2023, 5:57 PM
"Ignoring a duplicate run that was started from somewhere other than the run monitor daemon"
Michel Rouly
02/14/2023, 5:57 PM
daniel
02/14/2023, 5:58 PM
daniel
02/14/2023, 6:11 PM
Michel Rouly
02/14/2023, 6:38 PM
Collected 2 runs for monitoring
Checking run 520b88ec-2d27-4103-b239-b2fbe41d8595
Unfortunately we have a missing user deployment that's causing a bunch of error-log spam, so I might be filtering errors out too aggressively.
Michel Rouly
02/14/2023, 6:42 PM
Ripple Khera
02/14/2023, 6:43 PM
Ripple Khera
02/14/2023, 6:43 PM
daniel
02/14/2023, 6:43 PM
You'd see "Detected run worker status XXX. Resuming run YYY with a new worker." in the event logs if the monitoring daemon restarted it.
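A quick way to look for that message on the daemon side; this sketch assumes the daemon runs as a Deployment named dagster-daemon, which is a guess at the chart's naming:

# search the monitoring daemon's logs for evidence of a resumed run
kubectl -n dagster logs deploy/dagster-daemon | grep "Resuming run"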
daniel
02/14/2023, 6:44 PM
daniel
02/14/2023, 6:44 PM
daniel
02/14/2023, 6:45 PM
Michel Rouly
02/14/2023, 6:46 PM
Michel Rouly
02/14/2023, 6:46 PM
daniel
02/14/2023, 6:46 PM
Michel Rouly
02/14/2023, 6:47 PM
daniel
02/14/2023, 6:47 PM
Michel Rouly
02/14/2023, 6:47 PM
Michel Rouly
02/14/2023, 6:47 PM
daniel
02/14/2023, 6:47 PM
daniel
02/14/2023, 6:47 PM
Michel Rouly
02/14/2023, 6:48 PM
daniel
02/14/2023, 6:49 PM
daniel
02/14/2023, 6:50 PM
Michel Rouly
02/14/2023, 6:52 PM
The `Job` is `Completed`; the only weird piece is its pod has 1 restart on it and is ~9 hours younger than the `Job`. And its pod is `Status: Succeeded`, so according to k8s everything is ✅
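Since the run id is embedded in the Job and pod names (see the bzcbq pod below), one quick cross-check is to list everything k8s still has for this run:

# list the run's Job and any pods it has left behind
kubectl -n dagster get jobs,pods | grep 520b88ec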
daniel
02/14/2023, 6:55 PM
daniel
02/14/2023, 6:55 PM
Pods Statuses: 0 Running / 1 Succeeded / 0 Failed
daniel
02/14/2023, 6:56 PM
(that's in the `kubectl describe` output)
Michel Rouly
02/14/2023, 6:57 PM
Pods Statuses: 0 Active / 1 Succeeded / 0 Failed
Michel Rouly
02/14/2023, 6:57 PM
State:          Terminated
  Reason:       Completed
  Exit Code:    0
  Started:      Tue, 14 Feb 2023 03:22:30 -0500
  Finished:     Tue, 14 Feb 2023 03:23:06 -0500
Ready:          False
Restart Count:  1
daniel
02/14/2023, 6:58 PM
Michel Rouly
02/14/2023, 6:59 PM
k -n dagster logs -p pod/dagster-run-520b88ec-2d27-4103-b239-b2fbe41d8595-bzcbq
Error from server (BadRequest): previous terminated container "dagster" in pod "dagster-run-520b88ec-2d27-4103-b239-b2fbe41d8595-bzcbq" not found
Not in k8s anymore. I'll check our logs again; I initially tried looking for a pod with a different instance ID, but I'm thinking it would be the same since it only restarted.
Michel Rouly
02/14/2023, 7:00 PM
Michel Rouly
02/14/2023, 7:01 PM
daniel
02/14/2023, 7:32 PM
daniel
02/15/2023, 3:45 PM
Michel Rouly
02/15/2023, 9:09 PM
Michel Rouly
02/15/2023, 9:11 PM
"running down why the pods are getting disrupted in the first place"
Yeah, you're totally right. Our clusters are in a pretty chaotic state for a number of reasons right now, so we've definitely been leaning more heavily on the fault-tolerance features of Dagster than we'd like to be. We've been experimenting with node TTLs and resource rebalancing pretty heavily recently, and that's reduced the average node lifespan a fair bit.
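One starting point for running those disruptions down is recent Node events; a sketch, with the caveat that k8s only retains events for about an hour by default:

# surface node-level drains, preemptions, and scale-downs across the cluster
kubectl get events -A --field-selector involvedObject.kind=Node --sort-by=.lastTimestamp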