# dagster-kubernetes

Matyas Tamas

03/09/2021, 6:01 PM
Any advice on best practices for autoscaling worker nodes in a k8s-celery deploy setup? So far, I've been relying on GKE autoscaling a worker pool (actually two pools, one each for CPU and GPU workers), and the setup from the docs more or less pushes out job pods that sit unschedulable when there are insufficient worker nodes, which is what triggers autoscaling. This sort of works okay, but larger node clusters (and big swings in the number of jobs) cause a lot of failures among the unschedulable pods, and it doesn't seem like the best way to do this?
As an aside, it took a bit of time to track down where the various queue settings live that affect how many pods are generated (and how many are queued) when a bunch of jobs are submitted. For example, these celery default settings were unexpected:
• `worker_concurrency` is by default set to the number of node CPUs
• `worker_prefetch_multiplier` is set to 4 by default
(these can be changed in `values.yaml:runLauncher.config.celeryK8sRunLauncher.configSource`)
I saw @Noah K's presentation last month using Keda for auto-scaling, which seemed pretty nice.
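For reference, roughly where those overrides go in the Helm values, following the configSource path above (a sketch; the override values themselves are just illustrative):
```yaml
# values.yaml sketch: the configSource path mentioned above
runLauncher:
  config:
    celeryK8sRunLauncher:
      configSource:
        worker_concurrency: 4          # otherwise defaults to the node's CPU count
        worker_prefetch_multiplier: 4  # celery's default
```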

Noah K

03/09/2021, 7:41 PM
Strongly recommend you do not use a forking concurrency model with celery
👍 1
prefetching is mostly up to how slow your tasks are. If they take more than a few seconds, set it to 1 to ensure better balancing between replicas
👍 1
(1 == disable prefetch)

Matyas Tamas

03/09/2021, 7:50 PM
agree on both points!
One thing that's a bit confusing to me: ideally I'd want a metric on the queue size, scale the number of nodes from that, and dequeue only as many jobs as there are available worker nodes. I may not have done enough googling, but it seems like most autoscaling models work the other way around - add the number of pods your tasks require, and let something else scale the number of nodes to satisfy the pods that need scheduling.
oh right, you mentioned the above points in your deck - I somehow missed that these were celery config settings. It looks like I should try `task_acks_late` as well
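Concretely, I think that means adding something like this to the same configSource block (a sketch; these are standard Celery settings rather than anything Dagster-specific, with the prefetch value per the advice above):
```yaml
# values.yaml sketch: same configSource path as above
runLauncher:
  config:
    celeryK8sRunLauncher:
      configSource:
        worker_prefetch_multiplier: 1  # disable prefetch so slow tasks balance across workers
        task_acks_late: true           # ack only after the task finishes, so an interrupted
                                       # worker returns it to the queue instead of losing it
```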

Noah K

03/09/2021, 8:54 PM
Node scaling is generally done by cluster-autoscaler, independently of all this
It will see when there are pods that need hardware and make more hardware
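i.e. it only reacts to pending pods whose resource requests don't fit on any existing node, so make sure the step pods carry explicit requests (and a selector for the GPU pool). A generic sketch of the relevant pod-spec fields - the pool label value, image, and numbers are placeholders, not the chart's defaults:
```yaml
# pod spec fragment (placeholders): what cluster-autoscaler looks at
# when deciding whether a pending pod justifies a new node
spec:
  nodeSelector:
    cloud.google.com/gke-nodepool: gpu-pool   # hypothetical GKE pool name
  containers:
    - name: step
      image: my-registry/my-pipeline:latest   # hypothetical image
      resources:
        requests:
          cpu: "2"
          memory: 8Gi
          nvidia.com/gpu: 1
        limits:
          nvidia.com/gpu: 1
```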

Matyas Tamas

03/09/2021, 9:10 PM
right... for batch jobs where a lot of tasks are created and a bunch are dequeued and waiting, unscheduled, for nodes to spin up, there seems to be a high failure rate where those unscheduled pods somehow end up in a bad state. I need to chase down what specifically causes that, but setting `task_acks_late` for some reason seems to coincide with me not being able to repro that particular problem... 🤔

Noah K

03/09/2021, 9:11 PM
Not sure what you mean, nothing should be dequeued until the worker replicas start, and they can't start until they have a node to run on 🙂

Matyas Tamas

03/09/2021, 9:23 PM
My limited understanding is that something like this is happening:
• jobs (solids) are put on a celery queue by the dagster-k8s-celery execution
• each celery worker pulls tasks off the queue when available
• the celery worker doesn't do much except generate a k8s Job that runs the actual task (solid)
• those generated Jobs try to get scheduled among the available worker nodes
One thing about all this is that it doesn't really matter where the celery worker runs, because it isn't really doing any work - it's just generating the Job tasks that need to be put onto the worker pools. That said, the celery workers are what take the tasks off the queue, so if you have many celery workers running (e.g. equal to the max node pool capacity), they might generate many more Job tasks than can be scheduled on the available nodes (which will sit there until the available nodes eventually get to them, but hopefully also trigger the cluster autoscaler to add more nodes).
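For reference, this is roughly the run config shape I use to pick that executor - the field names here are from memory of the current celery-k8s schema, so treat them as an assumption and check the docs for your version:
```yaml
# run config sketch (field names assumed; verify against your dagster version)
execution:
  celery-k8s:
    config:
      job_namespace: dagster                     # namespace where the per-solid Jobs land
      job_image: my-registry/my-pipeline:latest  # hypothetical image for the step Jobs
```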

Noah K

03/09/2021, 9:28 PM
No, that is not how it works

Matyas Tamas

03/09/2021, 9:28 PM
That's kind of weird though because it's basically just making another queue of jobs out of all the unscheduled jobs waiting around (and it's not even a queue, since order isn't respected). From your comment, it sounds like what I should be doing instead is scaling the celery workers and using them to trigger the node pool to expand?

Noah K

03/09/2021, 9:28 PM
Well, I mean you could do that I guess

Matyas Tamas

03/09/2021, 9:28 PM
lol

Noah K

03/09/2021, 9:29 PM
But I don't know of any launcher for that
So first: there's a bunch of k8s-y options
dagster_k8s, dagster_celery, dagster_k8s_celery, and my weirdo layout
Which are you talking about?

Matyas Tamas

03/09/2021, 9:30 PM
dagster_k8s_celery

Noah K

03/09/2021, 9:30 PM
So that uses a Job for the executor, the thing running the pipeline run itself
But solid evaluations are sent to celery workers
I don't think those launch further jobs? I can check

Matyas Tamas

03/09/2021, 9:33 PM
When I execute a pipeline, I generally see one dagster-run-xxx job created per pipeline execution, and one dagster-job-xxx job created per solid
btw, I very much appreciate your advice on this!

Noah K

03/09/2021, 9:33 PM
Oh maybe they do, weird
Yeah
The executor task itself does also spit out Job objects
Weird
Yeah that would do all kinds of weirdness to autoscaling 🙂

Matyas Tamas

03/09/2021, 9:34 PM
lol
I think this would be manageable if I autoscaled the celery workers to match the node pool resources. One downside is I'm not sure if I can scale a pool down to zero in that case...
So if I'm scaling up, I add more celery workers, but they don't get scheduled until the extra nodes are added, and thus the jobs themselves aren't generated until the nodes are up. Maybe I'll try that...
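Something like this KEDA ScaledObject is what I have in mind for the workers - the Deployment name, queue name, and broker URL are guesses for my setup, but it would also let the workers scale to zero:
```yaml
# KEDA sketch: scale the celery worker Deployment on queue depth
# (Deployment name, queue name, and broker host are assumptions)
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: dagster-celery-worker-scaler
spec:
  scaleTargetRef:
    name: dagster-celery-workers   # hypothetical worker Deployment
  minReplicaCount: 0               # KEDA can take the workers to zero
  maxReplicaCount: 20
  triggers:
    - type: rabbitmq
      metadata:
        queueName: dagster         # assumed celery queue name
        queueLength: "1"           # target messages per replica
        host: amqp://user:pass@rabbitmq.dagster.svc.cluster.local:5672/  # assumed broker URL
```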

Noah K

03/09/2021, 9:39 PM
I would just not use Jobs for anything important but I'm biased 😄

Matyas Tamas

03/09/2021, 9:39 PM
haha
you're saying to get the celery workers to just execute the task?

Noah K

03/09/2021, 9:41 PM
Yes
The queue is always the final arbiter of what still needs processing
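i.e. let the solids run inside the worker processes themselves, via the plain celery executor - roughly this shape of run config, from memory, so double-check against the dagster_celery docs:
```yaml
# run config sketch (shape assumed; verify against dagster_celery docs)
execution:
  celery:
    config:
      broker: pyamqp://user:pass@rabbitmq:5672//   # assumed broker URL
```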

Matyas Tamas

03/09/2021, 9:42 PM
I'm guessing they designed it this way so that you could run different code versions through the same workers, which is nice for dev.
and if you push a bad image, the celery workers can stay up
anyway - thanks for your help - I have a few ideas for next steps at least!
btw, have you liked keda for autoscaling? I was going to check it out

Noah K

03/09/2021, 9:45 PM
Yes, it's the thing we use for basically everything
👍 1