# ask-community
s
I have a run (k8s run coordinator, k8s run launcher, multiprocess executor) which is in a broken state and Dagster can’t seem to recover. What should I do with a run that has been in “CANCELING” state for over 9 hours? At 14:54:59 yesterday, the last log line is “Multiprocess executor: received termination signal - forwarding to active child processes”. I believe this is because the node hit a resource limit on ephemeral storage, so the kubelet started evicting pods, and evicted this run’s pod. At 21:42:24 yesterday, I finally canceled the run in dagit. I see
Sending run termination request.
and
[K8sRunLauncher] Run was terminated successfully.
90 milliseconds later. But the run says “Cancelling” for its status, and it is part of a backfill which still says “In progress” (all other runs in the backfill are complete). Now it is 6:45 the next day, and it is still “Cancelling”. What do I do?
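(A quick way to sanity-check the eviction theory is to look for recent “Evicted” events in the namespace the run pods live in. A minimal sketch using the official kubernetes Python client; the dagster namespace is a placeholder assumption, not something from this thread.)

```python
# Sketch: list recent events in the namespace and print the ones where the
# kubelet evicted a pod. NAMESPACE is a placeholder for wherever the Dagster
# run pods are scheduled.
from kubernetes import client, config

NAMESPACE = "dagster"  # hypothetical namespace

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
v1 = client.CoreV1Api()

for event in v1.list_namespaced_event(NAMESPACE).items:
    if event.reason == "Evicted":
        # involved_object is the pod the kubelet evicted; message explains why
        print(event.involved_object.name, event.message)
```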
d
Hi Spencer - if you terminate the run from the UI, there should be a checkbox that allows you to choose to 'force terminate' the run - this should move it into a CANCELED state no matter what (kind of like the "Force Quit" option when killing a process)
This checkbox is what I was referring to. If it's in CANCELING, that might be the only option if you select Terminate, actually
i.e. the box when you press Terminate might look something like this instead
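(For reference, roughly the same effect as that force-terminate checkbox can be had from a Python shell against the instance. This is a sketch, not necessarily the exact code path the UI takes; it assumes DAGSTER_HOME points at the same Dagster instance that owns the run, and the run id is a placeholder.)

```python
# Sketch: mark a stuck run CANCELED directly on the instance, without waiting
# for the pod to acknowledge any termination signal.
from dagster import DagsterInstance

RUN_ID = "<stuck-run-id>"  # hypothetical run id

instance = DagsterInstance.get()  # requires DAGSTER_HOME to point at the instance
run = instance.get_run_by_id(RUN_ID)
if run and not run.is_finished:
    # Writes a run-canceled event and moves the run's status to CANCELED
    instance.report_run_canceled(run)
```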
s
Okay, I do indeed see that. The Warning scares me. What computational resources should I go check?
d
The computational resources here would be the k8s pod
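(One way to check that resource: list pods in the run namespace and see whether any still carry the run id in their name; dagster-k8s names the run job after the run id by default, as far as I know. A sketch with placeholder namespace and run id:)

```python
# Sketch: confirm that no pod for this run is still running before force-
# terminating. NAMESPACE and RUN_ID are placeholders.
from kubernetes import client, config

NAMESPACE = "dagster"      # hypothetical namespace
RUN_ID = "<stuck-run-id>"  # hypothetical run id

config.load_kube_config()
v1 = client.CoreV1Api()

matching = [
    pod for pod in v1.list_namespaced_pod(NAMESPACE).items
    if RUN_ID in pod.metadata.name
]
for pod in matching:
    print(pod.metadata.name, pod.status.phase)
if not matching:
    print("No pods left for this run - nothing should still be computing.")
```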
s
Okay, I’m reasonably confident that the pod is gone because I think it was evicted anyway
So this seems like it was a bug somewhere in… the run coordinator perhaps? I can make an issue if you like
d
If you still have logs from the pod that might help explain why the signal didn’t terminate it the way it typically does
I think logs of that nature would be needed in order for the issue to be actionable
s
I almost certainly don’t have logs but I’ll check
Makes sense that you’d need ’em
I might have node logs, but not pod ones?
d
Although we could keep it around for tracking in case others run into something similar
s
Like, from the kubelet
d
I think we would need the pod logs. What that sequence of events means is that the delete_job() call through the k8s API was successful, but then for whatever reason the k8s pod didn't terminate cleanly
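(If the pod object still exists, its logs can be pulled before it is garbage-collected, for example with the kubernetes Python client. A sketch with placeholder names; previous=True grabs the prior container's logs if the container restarted.)

```python
# Sketch: save a run pod's logs to a file so they can be shared or attached to
# an issue. POD_NAME and NAMESPACE are placeholders for the affected run pod.
from kubernetes import client, config
from kubernetes.client.rest import ApiException

NAMESPACE = "dagster"             # hypothetical namespace
POD_NAME = "dagster-run-<id>-x"   # hypothetical pod name

config.load_kube_config()
v1 = client.CoreV1Api()

try:
    # Logs from the previous container instance, if it restarted
    logs = v1.read_namespaced_pod_log(POD_NAME, NAMESPACE, previous=True)
except ApiException:
    # Fall back to the current container's logs
    logs = v1.read_namespaced_pod_log(POD_NAME, NAMESPACE)

with open(f"{POD_NAME}.log", "w") as f:
    f.write(logs)
```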
s
Wow, I actually do have all the logs.
I am a little hesitant to just post the logs on a public issue because I’m not confident there’s nothing sensitive in them. I don’t mind sharing them with just you (and others at Elementl). Do you have any suggestions there? I can DM them or something; I don’t know what’s useful
d
No problem - DMing me or sending to daniel@elementl.com would both work
👍 1