# ask-community
r
Hi guys, part of Zephyr AI here. We have been using Dagster for a long time and have routine philosophical conversations like “what if we used Dagster to run CI/CD for us”. Jokes aside, I have a support question on Dagster 1.1.17. We have a long-running pipeline (let's say 18 hours) on k8s where we use the pod-per-op (pod-per-step) strategy. Our hypothesis: when the Dagster daemon restarts sometime during the execution of the pipeline (i.e. during any one of the steps), then the next step in the pipeline is not spun up - the Dagster run pod and Dagit just show the job as in progress, and that's it. This is a rough timeline:
• run kicks off at 17:54
• 6 minutes later (18:00) a step pod kicks off
• during execution (03:22) the daemon restarts
• `Ignoring a duplicate run that was started from somewhere other than the run monitor daemon` gets logged in Dagit at 03:23
• the step pod runs for ~11 hrs and terminates at 05:42
• the next step pod is never spun up
I've been told Dagster support is pretty responsive, and this is becoming quite a headache for us, so any pointers would be much appreciated.
d
Hi Ripple - so once the daemon launches a run, the run exists in its own pod that's totally independent of the daemon, so I don't think the daemon restarting is the direct cause here. My guess would be that, instead, something in your cluster is restarting several pods at once (including both the daemon and the pod for that run). One common way this can happen is the k8s cluster autoscaler scaling down a node and killing any pods running on that node. If you have logs or a `kubectl describe` from the run worker pod (the name of the pod should be in the event logs for the run), that would help explain exactly what happened here to kill the run pod.
r
so you think the run pod and the daemon pod restarting around the same time would cause enough chaos for the next step to not be executed?
d
Just the run pod getting killed by the cluster would be enough to disrupt the run, yeah - the daemon is out of the picture at that point
(for that run at least)
In terms of how to deal with this, I think you have two options (rough sketch of both just below):
• Set up run retries so that when this happens, the run will pick up where it left off in a new run: https://docs.dagster.io/deployment/run-retries#run-retries
• Take steps in your cluster to prevent the pod from being killed in the first place. This will depend on the exact reason the pod was killed, but if it was killed by your cluster autoscaler there are often ways to tell it to wait until the pod exits on its own (e.g. there's a `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation that can be set to false).
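To make that concrete, here's a minimal sketch of what those two options can look like in config - treat the exact keys (especially the shape of the `dagster-k8s/config` tag) as assumptions to double-check against the docs for your version:
```yaml
# dagster.yaml (instance config): turn on automatic run retries
run_retries:
  enabled: true
  max_retries: 3   # how many times a failed run gets retried
---
# Tags on the job (via the dagster-k8s/config tag) asking the cluster
# autoscaler not to evict the run pod mid-run; key names are an assumption
dagster-k8s/config:
  pod_template_spec_metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```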
r
how are run monitoring and run retries related?
d
Yeah, I was actually just about to suggest the "Resuming runs after run worker crashes" bullet on that page as well - if you're running each step in its own pod with the k8s_job_executor then that would work too (and it should all stay in a single run that way)
m
ceiling dagster I work with @Ripple Khera on this project. We do have run monitoring turned on, actually! I wonder if that was maybe responsible for this additional observation:
• the step `Job` and its pod ran continuously from run kick-off and do not appear to have been restarted at any point
• the run `Job`'s pod got re-created / restarted halfway through, at 03:10
Based on that, it does look to us like the run got restarted after it crashed, but even though it was restarted (potentially by run monitoring) it still somehow lost track of the ops it was supposed to continue scheduling, or got disconnected from the daemon/Dagit somehow.
d
Michel, to confirm - you specifically have the `max_resume_run_attempts` key set as part of run monitoring? That's a specific feature within run monitoring for restarting crashed run workers (instead of just detecting hanging or crashed run workers and failing the run)
m
yeah, this is our `run_monitoring` config
```yaml
run_monitoring:
  enabled: true
  start_timeout_seconds: 360
  max_resume_run_attempts: 5
  poll_interval_seconds: 120
```
d
Hmmm that is curious - any chance you could send us debug logs from the run in question? (From this UI)
👀 1
m
Blah, unfortunately I'm not sure we can share that without an NDA or a support agreement - there are a lot of internal identifiers in there. Shoot.
d
Hm, OK - are there any clues in the event logs about why it might not be launching the next pod? That's the thing I'd be looking through them for.
m
The only suspicious log we've seen so far is this one, in the new run pod: `Ignoring a duplicate run that was started from somewhere other than the run monitor daemon`
I'll see if I can find the logs for the previous run pod...
d
The expected behavior here is that there's a second pod spun up that resumes the run - `dagster api resume_run` instead of `dagster api execute_run`. That log you mentioned I'm not actually super concerned about, because I think that's k8s trying to be helpful and restarting the original run pod on a new node
👍 1
(whereas the expected behavior is an entirely new pod is launched from the daemon in order to resume the run - you mentioned the daemon is also getting disrupted around this time, but I would still expect it to re-assume its duties once it restarts. Logs from the monitoring daemon during this period might also be illuminating)
m
looks like just a whole lot of
```
Collected 2 runs for monitoring
Checking run 520b88ec-2d27-4103-b239-b2fbe41d8595
```
unfortunately we have a missing user deployment that's causing a bunch of error log spam, so I might be filtering errors out too aggressively
I think we'll look into running the Daemon on dedicated hardware to start with, and see how that helps
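(Roughly what I have in mind in our Helm values - a sketch assuming the chart's usual scheduling knobs for the daemon, and a hypothetical dedicated node pool label/taint on our side:)
```yaml
# Helm values sketch: pin the daemon to a dedicated, non-autoscaled node pool
dagsterDaemon:
  nodeSelector:
    node-pool: dagster-daemon        # hypothetical label on the dedicated nodes
  tolerations:
    - key: dedicated                 # hypothetical taint on that node pool
      operator: Equal
      value: dagster-daemon
      effect: NoSchedule
```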
r
And add a pdb for run and step?
Oh sorry wrong channel haha
d
The expected behavior is the run should have
```
"Detected run worker status XXX. Resuming run YYY with a new worker."
```
in the event logs if the monitoring daemon restarted it
Changing where the daemon runs might help keep the daemon from getting restarted, but is unlikely to prevent the run pod from being restarted, which is what disrupted the run
👍 1
If doing the retry in a new run is an option, the run retries feature is a bit more battle-tested than the "restart crashed run workers" feature
(the latter is still labeled experimental, the former is not)
m
Right - we discussed run retries, but unfortunately these runs can take upwards of 8-12 hrs, so retrying the whole run would be a pretty significant cost.
it's definitely on the table though
d
Oh, it can retry from failure if that helps
👀 1
m
so it would, but according to Dagit there hasn't been a failure yet
d
i.e. it would not do previously completed ops again
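(a minimal sketch of the tag that controls that behavior - FROM_FAILURE should already be the default for run retries, but worth double-checking against the docs for your version:)
```yaml
# Run/job tags (e.g. set on the job definition or in the Launchpad);
# a FROM_FAILURE retry only re-runs steps that didn't succeed in the first run
tags:
  dagster/retry_strategy: FROM_FAILURE
```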
m
oh I guess this didn't get stated earlier: the run is still flagged "in progress"
even though there's no op running
d
oh, that's also significant - neither of these features is going to work if run monitoring isn't able to detect that the run has failed
👍 1
that implies that something about the health check isn't working correctly in this scenario
m
oh interesting, I don't know anything about Dagster's health checks.
d
pretty much all it does is call this method on the run job using the k8s api: https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/BatchV1Api.md#read_namespaced_job_status
👍 1
Which I believe roughly maps to what is returned by `kubectl describe job` on the k8s job for that run
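i.e. it's essentially reading the Job's status subresource - something like this sketch, with illustrative values:
```yaml
# Illustrative k8s Job status stanza that the health check looks at
status:
  active: 0      # pods still running
  succeeded: 1   # pods that exited with code 0
  failed: 0      # pods that exited non-zero or were evicted
  conditions:
    - type: Complete   # would be "Failed" if the Job had failed
      status: "True"
```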
m
The k8s Job for the run is marked `Completed`. The only weird piece is that its pod has 1 restart on it and is ~9 hours younger than the Job, and the pod is `Status: Succeeded` - so according to k8s everything is fine
d
hmmmm, yet a pod definitely failed, no? How odd
We're talking about this line right?
```
Pods Statuses:    0 Running / 1 Succeeded / 0 Failed
```
(in the `kubectl describe` output)
m
```
Pods Statuses:    0 Active / 1 Succeeded / 0 Failed
```
and here's the state of the associated pod
```
State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 14 Feb 2023 03:22:30 -0500
      Finished:     Tue, 14 Feb 2023 03:23:06 -0500
    Ready:          False
    Restart Count:  1
```
d
There's no record of the previous pod that failed anywhere? (it would be the one early in the run logs)
m
```
k -n dagster logs -p pod/dagster-run-520b88ec-2d27-4103-b239-b2fbe41d8595-bzcbq
Error from server (BadRequest): previous terminated container "dagster" in pod "dagster-run-520b88ec-2d27-4103-b239-b2fbe41d8595-bzcbq" not found
```
Not in k8s anymore. I'll check our logs again - I initially tried looking for a pod with a different instance ID, but I'm thinking the pod name would be the same since it only restarted
nope nothing. just the same logs as in the current pod.
and I just verified, none of our other (normal) run jobs have a pod with a restart. just this one.
d
I think we have a PR that will help with this case, I recall something like this coming up once before: https://github.com/dagster-io/dagster/pull/11510 - this thread is a good motivation to get it over the finish line
I think we'll be able to get that run monitoring fix out a week from tomorrow. In the meantime, I think it's still worth running down why the pods are getting disrupted in the first place, since even if Dagster is able to recover, it's still probably pretty disruptive (it will have to start over whatever op it was in the middle of when the disruption happened, for example)
m
Hey that's amazing to hear! We'll be very excited to try it out.
> running down why the pods are getting disrupted in the first place
Yeah you're totally right. Our clusters are in a pretty chaotic state for a number of reasons right now, so we've definitely been leaning more heavily on the fault tolerance features of Dagster than we'd like to be. We've been experimenting with node TTLs and resource rebalancing pretty heavily recently and that's reduced the average node lifespan a fair bit.