# deployment-kubernetes
m
Hey! Does anyone have experience with the cluster-autoscaler evicting Dagster op pods (`dagster-step`) in order to scale down a node? 🧵
We've started noticing this behavior more due to larger and larger elastic graphs.
We could annotate the Dagster pods with the CA `safe-to-evict: false` annotation, but that kind of feels like a bandaid. Plus we may have other eviction forces beyond just the CA (e.g. the k8s descheduler). So I don't love that option.
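(For reference, a minimal sketch of what that annotation looks like when set through the `dagster-k8s/config` tag, assuming the `pod_template_spec_metadata` key is available in your dagster-k8s version; the op name is a placeholder.)
```python
# Hedged sketch: attach the cluster-autoscaler annotation to Dagster-launched pods
# via the dagster-k8s/config tag. Verify the pod_template_spec_metadata key against
# the dagster-k8s docs for your version.
from dagster import op


@op(
    tags={
        "dagster-k8s/config": {
            "pod_template_spec_metadata": {
                "annotations": {
                    # Tells the cluster-autoscaler not to evict this pod on scale-down.
                    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
                }
            }
        }
    }
)
def my_op():  # hypothetical op name
    ...
```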
We tried setting restrictive PDBs (`maxUnavailable: 0`), but it looks like this doesn't work for Jobs (this message shows up in our PDB's status):
```
message: jobs.batch does not implement the scale subresource
```
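(Side note: per the Kubernetes docs, `maxUnavailable` and percentage values only work for pods owned by a controller that implements the scale subresource, e.g. Deployments, ReplicaSets, StatefulSets. For Job-owned or bare pods only an absolute integer `minAvailable` is accepted, which is why the PDB reports that error, and which is awkward here because it pins a minimum count rather than blocking all evictions. A rough sketch of the shape such a PDB takes; the name and label selector are hypothetical.)
```python
# Hypothetical sketch of a PDB for Job-owned pods, built as a plain dict and dumped
# to YAML (pipe the output into `kubectl apply -f -`). Only an integer minAvailable
# is allowed for pods without a scale-subresource controller.
import yaml

pdb = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "dagster-step-pdb"},  # placeholder name
    "spec": {
        "minAvailable": 1,  # must be an absolute integer for Job-owned pods
        "selector": {
            # placeholder label; match whatever labels your step pods carry
            "matchLabels": {"app.kubernetes.io/name": "dagster-step"}
        },
    },
}

print(yaml.safe_dump(pdb, sort_keys=False))
```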
So... I'm kind of at a loss here. Our Dagster ops happen to use node-local storage, so we could have the CA ignore pods with local storage, but that ends up causing issues with things like `metrics-server` and other pods that we do want to be evictable despite having local storage.
Something the spark-on-k8s-operator does that sidesteps this problem is launching worker pods as plain pods without a `Job` controller, which the CA won't evict.
f
Working with Michel on this - adding some more errors from subsequent runs:
```
clustered_extract_tumor_mutation_profile_workflow (06dfdcb4-ac74-4846-8894-a4f3157151a2) started a new run worker while the run was already in state DagsterRunStatus.STARTED. This most frequently happens when the run worker unexpectedly stops and is restarted by the cluster. Marking the run as failed.
```
We noticed some weirdness with the dagster-run pod for this job: it definitely took a long time to spin up. Once the pod actually did spin up, this error was printed and the run was killed.
The last error on that pod was:
```
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "dagster-run-06dfdcb4-ac74-4846-8894-a4f3157151a2--1-lw9g2": operation timeout: context deadline exceeded
```
Could Dagster be timing out while waiting for the pod to spin up and then trying to spin up a new one entirely?
d
Hi fahad - when we've seen that happen in the past (another run worker being spun up), it's almost always Kubernetes itself doing it rather than Dagster: it evicts the pod and then incorrectly tries to spin up another copy, so we detect that early and stop it so that it doesn't try to re-run steps. There's config you can set that tells Dagster to try to safely resume where it left off when this happens: https://docs.dagster.io/deployment/run-monitoring#resuming-runs-after-run-worker-crashes-experimental
and we're working on a job-level retry feature that would let you tell dagster to retry the job when it crashes like this
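(For reference, a sketch of the run monitoring settings from the linked docs, written here as a Python dict that mirrors the `run_monitoring` section of `dagster.yaml`; treat the exact field names as assumptions to double-check against the docs for your Dagster version.)
```python
# Mirrors the run_monitoring block of dagster.yaml; serialize with yaml.safe_dump
# or copy the structure by hand into your instance config.
import yaml

dagster_instance_config = {
    "run_monitoring": {
        "enabled": True,
        # How many times Dagster tries to resume a run after its run worker crashes
        # (assumption: field name per the linked run-monitoring docs).
        "max_resume_run_attempts": 3,
    }
}

print(yaml.safe_dump(dagster_instance_config, sort_keys=False))
```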
f
Awesome, thank you for linking the feature! I think we have some investigation to do on our cluster - something must be killing the pod running the dagster-run.
But I'm not seeing the k8s-descheduler or the autoscaler mention anything about it in their logs.
a
Mmmm, you've said that your Dagster ops use node-local storage. Are you sure the pods aren't being evicted due to a lack of storage on the node? I've seen that happen before: the ops fill the node's local disk, so Kubernetes evicts them to keep the node stable.
f
That's interesting - I'll definitely look into that as well. We are using ephemeral storage requests and providing each node with expanded storage, but it's possible we're overcommitting by not requesting the correct resource.
Although I don't think the dagster-run pod should be using any storage, and that still gets evicted 🤔
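(If the resource in question is ephemeral storage, one hedged sketch of requesting it explicitly through the `dagster-k8s/config` tag; `ephemeral-storage` is the standard Kubernetes resource name, and the sizes and op name are placeholders.)
```python
# Hedged sketch: give step pods explicit ephemeral-storage requests/limits so the
# scheduler accounts for node-local disk and the kubelet is less likely to evict
# under disk pressure. Sizes are placeholders.
from dagster import op


@op(
    tags={
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    "requests": {"ephemeral-storage": "10Gi"},
                    "limits": {"ephemeral-storage": "20Gi"},
                }
            }
        }
    }
)
def storage_heavy_op():  # hypothetical op name
    ...
```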
a
This is the syntax we use for ephemeral storage:
```python
# size (scratch storage in GiB) and mount_point are defined elsewhere in our code
extra_tags = {
    "dagster-k8s/config": {
        "pod_spec_config": {
            "volumes": [
                {
                    "name": "scratch-data",
                    "ephemeral": {
                        "volumeClaimTemplate": {
                            "spec": {
                                "accessModes": ["ReadWriteOnce"],
                                "storageClassName": "ssd",
                                "resources": {
                                    "requests": {"storage": str(size) + "Gi"}
                                },
                            }
                        }
                    },
                }
            ]
        },
        "container_config": {
            "volume_mounts": [{"mountPath": mount_point, "name": "scratch-data"}]
        },
    }
}
```
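(For context, a dict like that is typically attached via the `tags` argument on the op or job; a minimal usage sketch, with a hypothetical op name.)
```python
# Minimal usage sketch: pass the dict above via tags so the k8s launcher applies the
# pod_spec_config / container_config overrides. Assumes size, mount_point, and
# extra_tags were defined beforehand, as in the snippet above.
from dagster import op


@op(tags=extra_tags)
def scratch_heavy_op():  # hypothetical op name
    ...
```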
But indeed it's an interesting problem... Let me know if you figure out what the cause is.
f
Been looking at this a bit over the last week and have noticed a few things that have helped so far. For one thing, we definitely had to set up some `PodDisruptionBudgets` to prevent the autoscaler and our k8s-descheduler from evicting Dagster pods. The other was related to the disk storage symptom: we had a few pods set up with too-high memory limits that were causing memory pressure on the node. Even after those two fixes we are still seeing the k8s control plane tainting the nodes that the dagster-run pod is on with a `NoSchedule` taint. Eventually k8s considers the node unschedulable as well and the entire node is brought down.
I enabled run_monitoring to work around some of these issues for now and it works pretty well for an experimental feature! One issue I've noticed is that if the node that gets brought down has both a dagster-run and a dagster-step pod on it (meaning both are evicted), the dagster-run pod will restart, but it won't pick up that the dagster-step pod was also killed and does not attempt to retry the step/op using the `OpRetryPolicy`.
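(For anyone following along, the op-level retry policy being referred to is set roughly like this in current Dagster, where the class is `RetryPolicy`; a hedged sketch, with the op name and numbers as placeholders.)
```python
# Hedged sketch of an op-level retry policy. The issue described above is that this
# policy is not applied when the run worker itself is restarted by run monitoring.
from dagster import RetryPolicy, op


@op(retry_policy=RetryPolicy(max_retries=3, delay=30))  # delay is in seconds
def flaky_step():  # hypothetical op name
    ...
```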
d
cc @johann re: run_monitoring feedback above
j
@Dagster Bot issue K8s run monitoring fails when both a run worker and step worker go down on the same node
d
j
@fahad would you be able to share (either here or in a DM) a debug file for the run? I’m curious what the step events look like
f
Can do! Thanks for looking into it!