# ask-community
s
I have a run (k8s run coordinator, k8s run launcher, multiprocess executor) which is in a broken state and Dagster can’t seem to recover. What should I do with a run that has been in “CANCELING” state for over 9 hours? At 14:54:59 yesterday, the last log line is “Multiprocess executor: received termination signal - forwarding to active child processes”. I believe this is because the node hit a resource limit on ephemeral storage, so the kubelet started evicting pods, and evicted this run’s pod. At 21:42:24 yesterday, I finally canceled the run in dagit. I see
Sending run termination request.
and
[K8sRunLauncher] Run was terminated successfully.
90 milliseconds later. But the run says “Cancelling” for its status, and it is part of a backfill which still says “In progress” (all other runs in the backfill are complete). Now it is 6:45 the next day, and it is still “Cancelling”. What do I do?
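(A quick way to sanity-check the eviction theory is to look for recent “Evicted” events in the namespace the run pods live in. A minimal sketch using the official kubernetes Python client; the dagster namespace is a placeholder assumption, not something from this thread.)

```python
# Sketch: list recent events in the namespace and print the ones where the
# kubelet evicted a pod. NAMESPACE is a placeholder for wherever the Dagster
# run pods are scheduled.
from kubernetes import client, config

NAMESPACE = "dagster"  # hypothetical namespace

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

for event in v1.list_namespaced_event(NAMESPACE).items:
    if event.reason == "Evicted":
        # involved_object is the pod the kubelet evicted; message explains why
        print(event.involved_object.name, event.message)
```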
d
Hi Spencer - if you terminate the run from the UI, there should be a checkbox that allows you to choose to 'force terminate' the run - this should move it into a CANCELED state no matter what (kind of like the "Force Quit" option when killing a process)
This checkbox is what I was referring to. If it's in CANCELING, that might be the only option if you select Terminate, actually
i.e. the box when you press Terminate might look something like this instead
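(For reference, roughly the same effect as that force-terminate checkbox can be had from a Python shell against the instance. This is a sketch, not necessarily the exact code path the UI takes; it assumes DAGSTER_HOME points at the same Dagster instance that owns the run, and the run id is a placeholder.)

```python
# Sketch: mark a stuck run CANCELED directly on the instance, without waiting
# for the pod to acknowledge any termination signal.
from dagster import DagsterInstance

RUN_ID = "<stuck-run-id>"  # hypothetical run id

instance = DagsterInstance.get()  # requires DAGSTER_HOME to point at the instance
run = instance.get_run_by_id(RUN_ID)
if run and not run.is_finished:
    # Writes a run-canceled event and moves the run's status to CANCELED
    instance.report_run_canceled(run)
```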
s
Okay, I do indeed see that. The Warning scares me. What computational resources should I go check?
d
The computational resources here would be the k8s pod
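(One way to check that resource: list pods in the run namespace and see whether any still carry the run id in their name; dagster-k8s names the run job after the run id by default, as far as I know. A sketch with placeholder namespace and run id:)

```python
# Sketch: confirm that no pod for this run is still running before force-
# terminating. NAMESPACE and RUN_ID are placeholders.
from kubernetes import client, config

NAMESPACE = "dagster"      # hypothetical namespace
RUN_ID = "<stuck-run-id>"  # hypothetical run id

config.load_kube_config()
v1 = client.CoreV1Api()

matching = [
    pod for pod in v1.list_namespaced_pod(NAMESPACE).items
    if RUN_ID in pod.metadata.name
]
for pod in matching:
    print(pod.metadata.name, pod.status.phase)
if not matching:
    print("No pods left for this run - nothing should still be computing.")
```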
s
Okay, I’m reasonably confident that the pod is gone because I think it was evicted anyway
So this seems like it was a bug somewhere in… the run coordinator perhaps? I can make an issue if you like
d
If you still have logs from the pod that might help explain why the signal didn’t terminate it the way it typically does
I think logs of that nature would be needed in order for the issue to be actionable
s
I almost certainly don’t have logs but I’ll check
Makes sense that you’d need ’em
I might have node logs, but not pod ones?
d
Although we could keep it around for tracking in case others run into something similar
s
Like, from the kubelet
d
I think we would need the pod logs. What that sequence of events means is that the delete_job() call through the k8s API was successful, but then for whatever reason the k8s pod didn't terminate cleanly
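(If the pod object still exists, its logs can be pulled before it is garbage-collected, for example with the kubernetes Python client. A sketch with placeholder names; previous=True grabs the prior container's logs if the container restarted.)

```python
# Sketch: save a run pod's logs to a file so they can be shared or attached to
# an issue. POD_NAME and NAMESPACE are placeholders for the affected run pod.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

NAMESPACE = "dagster"             # hypothetical namespace
POD_NAME = "dagster-run-<id>-x"   # hypothetical pod name

config.load_kube_config()
v1 = client.CoreV1Api()

try:
    # Logs from the previous container instance, if it restarted
    logs = v1.read_namespaced_pod_log(POD_NAME, NAMESPACE, previous=True)
except ApiException:
    # Fall back to the current container's logs
    logs = v1.read_namespaced_pod_log(POD_NAME, NAMESPACE)

with open(f"{POD_NAME}.log", "w") as f:
    f.write(logs)
```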
s
Wow, I actually do have all the logs.
I am a little hesitant to just post the logs on a public issue because I’m not confident there’s nothing sensitive in them. I don’t mind sharing them with just you (and others at Elementl). Do you have any suggestions there? I can DM them or something; I don’t know what’s useful
d
No problem - DMing me or sending to daniel@elementl.com would both work
👍 1