# ask-community
r
Hi guys, part of Zephyr AI here. We have been using Dagster for a long time and have routine philosophical conversations like “what if we used Dagster to run CI/CD for us”. Jokes aside, I have a support question on Dagster 1.1.17. We have a long-running pipeline (let's say 18 hours) on k8s where we use the pod-per-op (pod-per-step) strategy. Our hypothesis: when the Dagster daemon restarts sometime during the execution of the pipeline (i.e. during any one of the steps), then the next step in the pipeline is not spun up - the Dagster run pod and Dagit just show the job as in progress, and that's it. This is a rough timeline:
• run kicks off at 17:54
• 6 minutes later (18:00) a step pod kicks off
• during execution (03:22) the daemon restarts
• `Ignoring a duplicate run that was started from somewhere other than the run monitor daemon` gets logged in Dagit at 03:23
• the step pod runs for ~11 hrs and terminates at 05:42
• the next step pod is never spun up
I've been told Dagster support is pretty responsive, and this is becoming quite a headache for us, so any pointers would be much appreciated.
d
Hi Ripple - so once the daemon launches a run, the run exists in its own pod that's totally independent of the daemon, so I don't think the daemon restarting is the direct cause here. My guess would be that, instead, something in your cluster is restarting several pods at once (including both the daemon and the pod for that run). One common way this can happen is the k8s cluster autoscaler scaling down a node and killing any pods running on that node. If you have logs or a `kubectl describe` from the run worker pod (the name of the pod should be in the event logs for the run), that would help explain exactly what happened here to kill the run pod.
r
so you think the run pod and the daemon pod restarting around the same time would cause enough chaos for the next step to not be executed?
d
Just the run pod getting killed by the cluster would be enough to disrupt the run, yeah - the daemon is out of the picture at that point
(for that run at least)
In terms of how to deal with this, I think you have two options (rough sketch of both just below):
• Set up run retries so that when this happens, the run will pick up where it left off in a new run: https://docs.dagster.io/deployment/run-retries#run-retries
• Take steps in your cluster to prevent the pod from being killed in the first place. This will depend on the exact reason the pod was killed, but if it was killed by your cluster autoscaler there are often ways to tell it to wait until the pod exits on its own (e.g. there's a `cluster-autoscaler.kubernetes.io/safe-to-evict` annotation that can be set to false).
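To make that concrete, here's a minimal sketch of what those two options can look like in config - treat the exact keys (especially the shape of the `dagster-k8s/config` tag) as assumptions to double-check against the docs for your version:
```yaml
# dagster.yaml (instance config): turn on automatic run retries
run_retries:
  enabled: true
  max_retries: 3   # how many times a failed run gets retried
---
# Tags on the job (via the dagster-k8s/config tag) asking the cluster
# autoscaler not to evict the run pod mid-run; key names are an assumption
dagster-k8s/config:
  pod_template_spec_metadata:
    annotations:
      cluster-autoscaler.kubernetes.io/safe-to-evict: "false"
```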
r
how are run monitoring and run retries related?
d
Yeah, I was actually just about to suggest the "Resuming runs after run worker crashes" bullet on that page as well - if you're running each step in its own pod with the k8s_job_executor then that would work too (and it should all stay in a single run that way)
m
ceiling dagster I work with @Ripple Khera on this project. We do have run monitoring turned on, actually! I wonder if that was maybe responsible for this additional observation:
• the step `Job` and its pod ran continuously from run kick-off and do not appear to have been restarted at any point
• the run `Job`'s pod got re-created / restarted halfway through, at 03:10
Based on that, it does look to us like the run got restarted after it crashed, but even though it was restarted (potentially by run monitoring) it still somehow lost track of the ops it was supposed to continue scheduling, or got disconnected from the daemon/Dagit somehow.
d
Michel, to confirm - you specifically have the `max_resume_run_attempts` key set as part of run monitoring? That's a specific feature within run monitoring for restarting crashed run workers (instead of just detecting hanging or crashed run workers and failing the run)
m
yeah, this is our `run_monitoring` config
```yaml
run_monitoring:
  enabled: true
  start_timeout_seconds: 360
  max_resume_run_attempts: 5
  poll_interval_seconds: 120
```
d
Hmmm that is curious - any chance you could send us debug logs from the run in question? (From this UI)
👀 1
m
Blah, unfortunately I'm not sure we can share that without an NDA or a support agreement - there are a lot of internal identifiers in there. Shoot.
d
Hm, OK - are there any clues in the event logs about why it might not be launching the next pod? That's the thing I'd be looking through them for.
m
The only suspicious log we've seen so far is this one, in the new run pod: `Ignoring a duplicate run that was started from somewhere other than the run monitor daemon`
I'll see if I can find the logs for the previous run pod...
d
The expected behavior here is that there's a second pod spun up that resumes the run - `dagster api resume_run` instead of `dagster api execute_run`. That log you mentioned I'm not actually super concerned about, because I think that's k8s trying to be helpful and restarting the original run pod on a new node
👍 1
(whereas the expected behavior is an entirely new pod is launched from the daemon in order to resume the run - you mentioned the daemon is also getting disrupted around this time, but I would still expect it to re-assume its duties once it restarts. Logs from the monitoring daemon during this period might also be illuminating)
m
looks like just a whole lot of
```
Collected 2 runs for monitoring
Checking run 520b88ec-2d27-4103-b239-b2fbe41d8595
```
unfortunately we have a missing user deployment that's causing a bunch of error log spam, so I might be filtering errors out too aggressively
I think we'll look into running the Daemon on dedicated hardware to start with, and see how that helps
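(Roughly what I have in mind in our Helm values - a sketch assuming the chart's usual scheduling knobs for the daemon, and a hypothetical dedicated node pool label/taint on our side:)
```yaml
# Helm values sketch: pin the daemon to a dedicated, non-autoscaled node pool
dagsterDaemon:
  nodeSelector:
    node-pool: dagster-daemon        # hypothetical label on the dedicated nodes
  tolerations:
    - key: dedicated                 # hypothetical taint on that node pool
      operator: Equal
      value: dagster-daemon
      effect: NoSchedule
```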
r
And add a pdb for run and step?
Oh sorry wrong channel haha
d
The expected behavior is the run should have
```
"Detected run worker status XXX. Resuming run YYY with a new worker."
```
in the event logs if the monitoring daemon restarted it
Changing where the daemon runs might help keep the daemon from getting restarted, but is unlikely to prevent the run pod from being restarted, which is what disrupted the run
👍 1
If doing the retry in a new run is an option, the run retries feature is a bit more battle-tested than the "restart crashed run workers" feature
(the latter is still labeled experimental, the former is not)
m
Right - we discussed run retries, but unfortunately these runs can take upwards of 8-12 hrs, so retrying the whole run would be a pretty significant cost.
it's definitely on the table though
d
Oh, it can retry from failure if that helps
👀 1
m
so it would, but according to Dagit there hasn't been a failure yet
d
i.e. it would not do previously completed ops again
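(a minimal sketch of the tag that controls that behavior - FROM_FAILURE should already be the default for run retries, but worth double-checking against the docs for your version:)
```yaml
# Run/job tags (e.g. set on the job definition or in the Launchpad);
# a FROM_FAILURE retry only re-runs steps that didn't succeed in the first run
tags:
  dagster/retry_strategy: FROM_FAILURE
```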
m
oh I guess this didn't get stated earlier: the run is still flagged "in progress"
even though there's no op running
d
oh, that's also significant - neither of these features is going to work if run monitoring isn't able to detect that the run has failed
👍 1
that implies that something about the health check isn't working correctly in this scenario
m
oh interesting, I don't know anything about Dagster's health checks.
d
pretty much all it does is call this method on the run job using the k8s api: https://github.com/kubernetes-client/python/blob/master/kubernetes/docs/BatchV1Api.md#read_namespaced_job_status
👍 1
Which I believe roughly maps to what is returned by `kubectl describe job` on the k8s job for that run
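i.e. it's essentially reading the Job's status subresource - something like this sketch, with illustrative values:
```yaml
# Illustrative k8s Job status stanza that the health check looks at
status:
  active: 0      # pods still running
  succeeded: 1   # pods that exited with code 0
  failed: 0      # pods that exited non-zero or were evicted
  conditions:
    - type: Complete   # would be "Failed" if the Job had failed
      status: "True"
```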
m
The k8s Job for the run is marked `Completed`. The only weird piece is that its pod has 1 restart on it and is ~9 hours younger than the Job, and the pod is `Status: Succeeded` - so according to k8s everything is fine
d
hmmmm, yet a pod definitely failed, no? How odd
We're talking about this line right?
```
Pods Statuses:    0 Running / 1 Succeeded / 0 Failed
```
(in the `kubectl describe` output)
m
```
Pods Statuses:    0 Active / 1 Succeeded / 0 Failed
```
and here's the state of the associated pod
```
State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Tue, 14 Feb 2023 03:22:30 -0500
      Finished:     Tue, 14 Feb 2023 03:23:06 -0500
    Ready:          False
    Restart Count:  1
```
d
There's no record of the previous pod that failed anywhere? (it would be the one early in the run logs)
m
```
k -n dagster logs -p pod/dagster-run-520b88ec-2d27-4103-b239-b2fbe41d8595-bzcbq
Error from server (BadRequest): previous terminated container "dagster" in pod "dagster-run-520b88ec-2d27-4103-b239-b2fbe41d8595-bzcbq" not found
```
Not in k8s anymore. I'll check our logs again - I initially tried looking for a pod with a different instance ID, but I'm thinking the pod name would be the same since it only restarted
nope nothing. just the same logs as in the current pod.
and I just verified, none of our other (normal) run jobs have a pod with a restart. just this one.
d
I think we have a PR that will help with this case, I recall something like this coming up once before: https://github.com/dagster-io/dagster/pull/11510 - this thread is a good motivation to get it over the finish line
I think we'll be able to get that run monitoring fix out a week from tomorrow. In the meantime, I think it's still worth running down why the pods are getting disrupted in the first place, since even if Dagster is able to recover, it's still probably pretty disruptive (it will have to start over whatever op it was in the middle of when the disruption happened, for example)
m
Hey that's amazing to hear! We'll be very excited to try it out.
> running down why the pods are getting disrupted in the first place
Yeah you're totally right. Our clusters are in a pretty chaotic state for a number of reasons right now, so we've definitely been leaning more heavily on the fault tolerance features of Dagster than we'd like to be. We've been experimenting with node TTLs and resource rebalancing pretty heavily recently and that's reduced the average node lifespan a fair bit.