# deployment-kubernetes
m
Highlighting https://github.com/dagster-io/dagster/issues/14248 to delete pods when cancelling a run via the UI. This week I'm working on adjusting how we chunk our work, and ending up with tens of pods hanging (as it happens, on concurrent db writes). When I cancel the job from the UI, all the pods stick around, which clogs up the k8s cluster until I do some `xargs` work to delete all the running pods (and I think I need to clean up pending pods too).
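For reference, a rough version of that manual cleanup, assuming the `dagster-step-` pod prefix seen later in this thread and the current kubectl namespace; the filtering is illustrative, not a Dagster-recommended command:

```bash
# Delete leftover step pods (Running and Pending) after cancelling a run from the UI.
# Assumes the default dagster-step- naming and the current namespace; adjust as needed.
# (-r is GNU xargs; drop it on BSD/macOS, where empty input is already a no-op.)
kubectl get pods --no-headers \
  | awk '/^dagster-step-/ && ($3 == "Running" || $3 == "Pending") {print $1}' \
  | xargs -r kubectl delete pod
```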
m
Usually it does, I'm wondering if there's something about my hanging db calls that's fouling it up. This is a run that was doing it. I:
• started the run
• found that my tasks were stuck (after they ran for about 10 minutes)
• cancelled the job via the UI
• ran `kubectl delete pod` for all the `dagster-step-` pods still running
• came back and found some more running later and deleted them too
If it's helpful I can ping you when there are some more cancelled via UI but still running, though I don't want to leave it in that state too long since it ties up resources / blocks me from trying the next revision.
Here's one I just terminated. I see many "Deleting kubernetes job" logs (and spot checking one, the matching k8s pod is not present), but also many running pods still.
List of still-running pods at the moment.
d
could the steps that were not interrupted be ones that were still starting up when the interrupt happened?
doesn't seem like that's it from the code... when the step is launched, it goes in `running_steps`, then `running_steps` is checked to see which steps to interrupt... I hope it's not launching additional steps after the termination signal comes in
m
I do see some with logs like this, and this one's pod is still running:
```
11:51:52.192 my_step Executing step "my_step" in Kubernetes job dagster-step-a35bca4a5342af02503841e5f7c696cb.
<I terminated the job at 11:48>
11:58:56.992 my_step Step worker started for my_step
```
d
How is this for a theory
m
But I can also find some which did their `STEP_START` and `STEP_INPUT` events before I hit Terminate, and they still have running pods.
d
The deletion is taking long enough that kubernetes is losing patience and hard-killing the job pod before it can finish cleaning up each of the step pods
which is also why the run is stuck in `CANCELING`
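A hedged way to sanity-check that theory on a future run: Kubernetes defaults `terminationGracePeriodSeconds` to 30 seconds, so a run worker that has to delete many step jobs could be SIGKILLed partway through cleanup. The `dagster-run-` prefix comes from the messages below; the column formatting is just for readability.

```bash
# While a run worker pod is still alive, show the grace period it was launched with.
# A value of 30 (the Kubernetes default) would fit the "hard-killed before cleanup
# finished" theory.
kubectl get pods \
  -o custom-columns='NAME:.metadata.name,GRACE:.spec.terminationGracePeriodSeconds' \
  | grep '^dagster-run-'
```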
m
Is this the job pod? Or the code location?
```
tech-data-pipeline-td-dev-912220-654b6d8848-zrdx8                1/1     Running     0          37m
```
d
that's the code location
the job pod (probably should have said run pod) starts with `dagster-run`
at the very top of the logs there's a "Creating Kubernetes run worker job" line that gives its name
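A quick way to do that check from the CLI, assuming the `dagster-run-` prefix above and the current namespace (no output means the run worker pod is already gone):

```bash
# List any run worker pods still present, with their status.
kubectl get pods --no-headers | awk '/^dagster-run-/ {print $1, $3}'
```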
m
Confirmed that there's no pod for that job running.
d
I'm feeling very good about the non-graceful termination theory
m
Is the grace period something I set in my agent Helm chart? (I see the pod yaml in the linked blog post, not sure how to map it to Dagster configs.)
d
I think it would be covered by `pod_spec_config`, which you can set for particular locations, per-job via tags, or in the Helm chart via `workspace.runK8sConfig.podSpecConfig`: https://artifacthub.io/packages/helm/dagster-cloud/dagster-cloud-agent?modal=values&path=workspace.runK8sConfig
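A sketch of what that could look like via the Helm route. The `workspace.runK8sConfig.podSpecConfig` path is taken from the chart values linked above and `terminationGracePeriodSeconds` is the standard Kubernetes PodSpec field, but the release name, namespace, 600-second value, and exact key casing are assumptions to check against the chart docs:

```bash
# Give the run worker pod more time to clean up step pods before Kubernetes hard-kills it.
# Release name "agent", namespace "dagster-cloud", and the 600s value are placeholders.
cat <<'EOF' > grace-period-values.yaml
workspace:
  runK8sConfig:
    podSpecConfig:
      terminationGracePeriodSeconds: 600
EOF

helm upgrade agent dagster-cloud/dagster-cloud-agent \
  --namespace dagster-cloud \
  --reuse-values \
  -f grace-period-values.yaml
```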
m
Thanks! I will try that out and let you know how it goes on the next job.
That seems to have done the trick. Thanks again!