# deployment-kubernetes
m
Hey! Does anyone have experience with the cluster-autoscaler evicting Dagster op pods (`dagster-step`) in order to scale down a node? 🧵
We've started noticing this behavior more due to larger and larger elastic graphs.
We could annotate the Dagster pods with the CA `safe-to-evict: false` annotation, but that kind of feels like a bandaid. Plus we may have other eviction forces beyond just the CA (e.g. the k8s descheduler). So I don't love that option.
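(For reference, a minimal sketch of what that annotation looks like when set through the `dagster-k8s/config` tag, assuming the `pod_template_spec_metadata` key is available in your dagster-k8s version; the op name is a placeholder.)
```python
# Hedged sketch: attach the cluster-autoscaler annotation to Dagster-launched pods
# via the dagster-k8s/config tag. Verify the pod_template_spec_metadata key against
# the dagster-k8s docs for your version.
from dagster import op


@op(
    tags={
        "dagster-k8s/config": {
            "pod_template_spec_metadata": {
                "annotations": {
                    # Tells the cluster-autoscaler not to evict this pod on scale-down.
                    "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
                }
            }
        }
    }
)
def my_op():  # hypothetical op name
    ...
```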
We tried setting restrictive PDBs (`maxUnavailable: 0`), but it looks like this doesn't work for Jobs (this message shows up in our PDB's status):
```
message: jobs.batch does not implement the scale subresource
```
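(Side note: per the Kubernetes docs, `maxUnavailable` and percentage values only work for pods owned by a controller that implements the scale subresource, e.g. Deployments, ReplicaSets, StatefulSets. For Job-owned or bare pods only an absolute integer `minAvailable` is accepted, which is why the PDB reports that error, and which is awkward here because it pins a minimum count rather than blocking all evictions. A rough sketch of the shape such a PDB takes; the name and label selector are hypothetical.)
```python
# Hypothetical sketch of a PDB for Job-owned pods, built as a plain dict and dumped
# to YAML (pipe the output into `kubectl apply -f -`). Only an integer minAvailable
# is allowed for pods without a scale-subresource controller.
import yaml

pdb = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "dagster-step-pdb"},  # placeholder name
    "spec": {
        "minAvailable": 1,  # must be an absolute integer for Job-owned pods
        "selector": {
            # placeholder label; match whatever labels your step pods carry
            "matchLabels": {"app.kubernetes.io/name": "dagster-step"}
        },
    },
}

print(yaml.safe_dump(pdb, sort_keys=False))
```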
So... I'm kind of at a loss here. Our Dagster ops happen to use node-local storage, so we could have the CA ignore pods with local storage, but that ends up causing issues with things like `metrics-server` and other pods that we do want to be evictable despite having local storage.
Something the spark-on-k8s-operator does that sidesteps this problem is launching worker pods as plain pods without a `Job` controller, which the CA won't evict.
f
Working with Michel on this - adding some more errors from subsequent runs:
```
clustered_extract_tumor_mutation_profile_workflow (06dfdcb4-ac74-4846-8894-a4f3157151a2) started a new run worker while the run was already in state DagsterRunStatus.STARTED. This most frequently happens when the run worker unexpectedly stops and is restarted by the cluster. Marking the run as failed.
```
We noticed some weirdness with the dagster-run pod for this job: it definitely took a long time to spin up. Once the pod actually did spin up, this error was printed and the run was killed.
The last error on that pod was:
```
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "dagster-run-06dfdcb4-ac74-4846-8894-a4f3157151a2--1-lw9g2": operation timeout: context deadline exceeded
```
Could Dagster be timing out while waiting for the pod to spin up and then trying to spin up a new one entirely?
d
Hi fahad - when we've seen that happen in the past (another run worker being spun up), it's almost always Kubernetes itself doing it rather than Dagster: it evicts the pod and then incorrectly tries to spin up another copy, so we detect that early and stop it so that it doesn't try to re-run steps. There's config you can set that tells Dagster to try to safely resume where it left off when this happens: https://docs.dagster.io/deployment/run-monitoring#resuming-runs-after-run-worker-crashes-experimental
and we're working on a job-level retry feature that would let you tell dagster to retry the job when it crashes like this
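(For reference, a sketch of the run monitoring settings from the linked docs, written here as a Python dict that mirrors the `run_monitoring` section of `dagster.yaml`; treat the exact field names as assumptions to double-check against the docs for your Dagster version.)
```python
# Mirrors the run_monitoring block of dagster.yaml; serialize with yaml.safe_dump
# or copy the structure by hand into your instance config.
import yaml

dagster_instance_config = {
    "run_monitoring": {
        "enabled": True,
        # How many times Dagster tries to resume a run after its run worker crashes
        # (assumption: field name per the linked run-monitoring docs).
        "max_resume_run_attempts": 3,
    }
}

print(yaml.safe_dump(dagster_instance_config, sort_keys=False))
```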
f
Awesome, thank you for linking the feature! I think we have some investigation to do on our cluster - something must be killing the pod running the dagster-run.
But I'm not seeing the k8s-descheduler or the autoscaler mention anything about it in their logs.
a
Mmmm, you've said that your Dagster ops use node-local storage. Are you sure the pods aren't being evicted due to a lack of storage on the node? I've seen that happen before: the ops fill the node's local disk, so Kubernetes evicts them to keep the node stable.
f
That's interesting - I'll definitely look into that as well. We are using ephemeral storage requests and providing each node with expanded storage, but it's possible we're overcommitting by not requesting the correct resource.
Although I don't think the dagster-run pod should be using any storage, and that still gets evicted 🤔
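(If the resource in question is ephemeral storage, one hedged sketch of requesting it explicitly through the `dagster-k8s/config` tag; `ephemeral-storage` is the standard Kubernetes resource name, and the sizes and op name are placeholders.)
```python
# Hedged sketch: give step pods explicit ephemeral-storage requests/limits so the
# scheduler accounts for node-local disk and the kubelet is less likely to evict
# under disk pressure. Sizes are placeholders.
from dagster import op


@op(
    tags={
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    "requests": {"ephemeral-storage": "10Gi"},
                    "limits": {"ephemeral-storage": "20Gi"},
                }
            }
        }
    }
)
def storage_heavy_op():  # hypothetical op name
    ...
```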
a
This is the syntax we use for ephemeral storage:
```python
# size (scratch storage in GiB) and mount_point are defined elsewhere in our code
extra_tags = {
    "dagster-k8s/config": {
        "pod_spec_config": {
            "volumes": [
                {
                    "name": "scratch-data",
                    "ephemeral": {
                        "volumeClaimTemplate": {
                            "spec": {
                                "accessModes": ["ReadWriteOnce"],
                                "storageClassName": "ssd",
                                "resources": {
                                    "requests": {"storage": str(size) + "Gi"}
                                },
                            }
                        }
                    },
                }
            ]
        },
        "container_config": {
            "volume_mounts": [{"mountPath": mount_point, "name": "scratch-data"}]
        },
    }
}
```
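(For context, a dict like that is typically attached via the `tags` argument on the op or job; a minimal usage sketch, with a hypothetical op name.)
```python
# Minimal usage sketch: pass the dict above via tags so the k8s launcher applies the
# pod_spec_config / container_config overrides. Assumes size, mount_point, and
# extra_tags were defined beforehand, as in the snippet above.
from dagster import op


@op(tags=extra_tags)
def scratch_heavy_op():  # hypothetical op name
    ...
```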
But indeed it's an interesting problem... Let me know if you figure out what the cause is.
f
Been looking at this a bit over the last week and have noticed a few things that have helped so far. For one thing, we definitely had to set up some `PodDisruptionBudgets` to prevent the autoscaler and our k8s-descheduler from evicting Dagster pods. The other was related to the disk storage symptom: we had a few pods set up with too-high memory limits that were causing memory pressure on the node. Even after those two fixes we are still seeing the k8s control plane tainting the nodes that the dagster-run pod is on with a `NoSchedule` taint. Eventually k8s considers the node unschedulable as well and the entire node is brought down.
I enabled run_monitoring to work around some of these issues for now and it works pretty well for an experimental feature! One issue I've noticed is that if the node that gets brought down has both a dagster-run and a dagster-step pod on it (meaning both are evicted), the dagster-run pod will restart, but it won't pick up that the dagster-step pod was also killed and does not attempt to retry the step/op using the `OpRetryPolicy`.
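(For anyone following along, the op-level retry policy being referred to is set roughly like this in current Dagster, where the class is `RetryPolicy`; a hedged sketch, with the op name and numbers as placeholders.)
```python
# Hedged sketch of an op-level retry policy. The issue described above is that this
# policy is not applied when the run worker itself is restarted by run monitoring.
from dagster import RetryPolicy, op


@op(retry_policy=RetryPolicy(max_retries=3, delay=30))  # delay is in seconds
def flaky_step():  # hypothetical op name
    ...
```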
d
cc @johann re: run_monitoring feedback above
j
@Dagster Bot issue K8s run monitoring fails when both a run worker and step worker go down on the same node
d
j
@fahad would you be able to share (either here or in a DM) a debug file for the run? I’m curious what the step events look like
f
Can do! Thanks for looking into it!