# deployment-kubernetes
m
Highlighting https://github.com/dagster-io/dagster/issues/14248 to delete pods when cancelling a run via the UI. This week I'm working on adjusting how we chunk our work, and ending up with tens of pods hanging (as it happens, on concurrent db writes). When I cancel the job from the UI, all the pods stick around, which clogs up the k8s cluster until I do some `xargs` work to delete all the running pods (and I think I need to clean up pending pods too).
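For reference, a rough version of that manual cleanup, assuming the `dagster-step-` pod prefix seen later in this thread and the current kubectl namespace; the filtering is illustrative, not a Dagster-recommended command:

```bash
# Delete leftover step pods (Running and Pending) after cancelling a run from the UI.
# Assumes the default dagster-step- naming and the current namespace; adjust as needed.
# (-r is GNU xargs; drop it on BSD/macOS, where empty input is already a no-op.)
kubectl get pods --no-headers \
  | awk '/^dagster-step-/ && ($3 == "Running" || $3 == "Pending") {print $1}' \
  | xargs -r kubectl delete pod
```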
m
Usually it does, I'm wondering if there's something about my hanging db calls that's fouling it up. This is a run that was doing it. I:
• started the run
• found that my tasks were stuck (after they ran for about 10 minutes)
• cancelled the job via the UI
• ran `kubectl delete pod` for all the `dagster-step-` pods still running
• came back and found some more running later and deleted them too
If it's helpful I can ping you when there are some more cancelled via UI but still running, though I don't want to leave it in that state too long since it ties up resources / blocks me from trying the next revision.
Here's one I just terminated. I see many "Deleting kubernetes job" logs (and spot checking one, the matching k8s pod is not present), but also many running pods still.
List of still-running pods at the moment.
d
could the steps that were not interrupted be ones that were still starting up when the interrupt happened?
doesn't seem like that's it from the code... when the step is launched, it goes in `running_steps`, then `running_steps` is checked to see which steps to interrupt... I hope it's not launching additional steps after the termination signal comes in
m
I do see some with logs like this, and this one's pod is still running:
```
11:51:52.192 my_step Executing step "my_step" in Kubernetes job dagster-step-a35bca4a5342af02503841e5f7c696cb.
<I terminated the job at 11:48>
11:58:56.992 my_step Step worker started for my_step
```
d
How is this for a theory
m
But I can also find some which did their `STEP_START` and `STEP_INPUT` events before I hit Terminate, and they still have running pods.
d
The deletion is taking long enough that kubernetes is losing patience and hard-killing the job pod before it can finish cleaning up each of the step pods
which is also why the run is stuck in `CANCELING`
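A hedged way to sanity-check that theory on a future run: Kubernetes defaults `terminationGracePeriodSeconds` to 30 seconds, so a run worker that has to delete many step jobs could be SIGKILLed partway through cleanup. The `dagster-run-` prefix comes from the messages below; the column formatting is just for readability.

```bash
# While a run worker pod is still alive, show the grace period it was launched with.
# A value of 30 (the Kubernetes default) would fit the "hard-killed before cleanup
# finished" theory.
kubectl get pods \
  -o custom-columns='NAME:.metadata.name,GRACE:.spec.terminationGracePeriodSeconds' \
  | grep '^dagster-run-'
```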
m
Is this the job pod? Or the code location?
```
tech-data-pipeline-td-dev-912220-654b6d8848-zrdx8                1/1     Running     0          37m
```
d
that's the code location
the job pod (probably should have said run pod) starts with `dagster-run`
at the very top of the logs there's a "Creating Kubernetes run worker job" line that gives its name
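A quick way to do that check from the CLI, assuming the `dagster-run-` prefix above and the current namespace (no output means the run worker pod is already gone):

```bash
# List any run worker pods still present, with their status.
kubectl get pods --no-headers | awk '/^dagster-run-/ {print $1, $3}'
```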
m
Confirmed that there's no pod for that job running.
d
I'm feeling very good about the non-graceful termination theory
m
Is the grace period something I set in my agent Helm chart? (I see the pod yaml in the linked blog post, not sure how to map it to Dagster configs.)
d
I think it would be covered by `pod_spec_config`, which you can set for particular locations, per-job via tags, or in the Helm chart via `workspace.runK8sConfig.podSpecConfig`: https://artifacthub.io/packages/helm/dagster-cloud/dagster-cloud-agent?modal=values&path=workspace.runK8sConfig
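A sketch of what that could look like via the Helm route. The `workspace.runK8sConfig.podSpecConfig` path is taken from the chart values linked above and `terminationGracePeriodSeconds` is the standard Kubernetes PodSpec field, but the release name, namespace, 600-second value, and exact key casing are assumptions to check against the chart docs:

```bash
# Give the run worker pod more time to clean up step pods before Kubernetes hard-kills it.
# Release name "agent", namespace "dagster-cloud", and the 600s value are placeholders.
cat <<'EOF' > grace-period-values.yaml
workspace:
  runK8sConfig:
    podSpecConfig:
      terminationGracePeriodSeconds: 600
EOF

helm upgrade agent dagster-cloud/dagster-cloud-agent \
  --namespace dagster-cloud \
  --reuse-values \
  -f grace-period-values.yaml
```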
m
Thanks! I will try that out and let you know how it goes on the next job.
That seems to have done the trick. Thanks again!