# ask-community
v
Hi Dagster team! We are running a Dagster cluster on a dozen powerful machines. Despite the `max_concurrent_runs` setting currently being set to 36, we never see more than 20 jobs dispatched concurrently by the `celery_executor`, with most celery workers just sitting idle. What could be the bottleneck, and how can we diagnose this? We were hoping to raise this setting to several hundred as we add more hardware to the cluster. Could it be that the gRPC server is not cutting it? We are hosting one instance for the whole cluster. Should we be running more?
Things fail horribly with more than one gRPC server running and jobs never finish, so that doesn't seem to be the answer. 😞
Or maybe we should stop hosting a Dagster gRPC server and move the daemon to a really powerful machine?
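For context, here is roughly how our jobs are wired to the celery executor - the job and op names are made up, but the shape is the same:
```python
from dagster import job, op
from dagster_celery import celery_executor


@op
def do_work():
    # Placeholder for the 45 s - 2 min of real work each of our jobs does.
    ...


# Each job uses the celery executor, so its ops are dispatched to celery workers.
@job(executor_def=celery_executor)
def short_job():
    do_work()
```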
d
Hi, how long is each individual job? Is each one pretty quick to finish? If so, could the bottleneck be the Dagster run queue that pulls runs off of the queue and launches them? That part doesn't go through celery - we may need to add the ability to dequeue runs in parallel rather than serially.
v
The jobs last between 45 seconds and 2 minutes.
It's unlikely the issue is from celery; for historical reasons, we are still dispatching raw celery jobs on the same infrastructure, and can run hundreds in parallel without breaking a sweat.
Dagster jobs that fork out to several ops in parallel are no issue either.
It's really starting the jobs themselves that takes a long time, which seems to confirm your thinking.
d
I think the relevant issue here is https://github.com/dagster-io/dagster/issues/7763 (the last part about the run queue)
There's also an interval you can configure for how frequently the run queue operates; that may help a bit as well.
v
Ah that's exactly what we are doing: all our jobs are started by a sensor.
Ah, sure, we could give that a try. Could you please point me to this parameter?
d
`dequeue_interval_seconds` - that said, I think the default is pretty low (every 5 seconds)
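It lives under `run_coordinator` in your `dagster.yaml`. Roughly this shape - written out here as a Python dict purely to illustrate, and note the exact module path can vary a bit by Dagster version:
```python
# Mirror of the run_coordinator section of dagster.yaml, shown as a Python
# dict only to make the shape explicit; in practice this goes in dagster.yaml.
run_coordinator_settings = {
    "run_coordinator": {
        # Module path may differ by version (e.g. dagster._core.run_coordinator).
        "module": "dagster.core.run_coordinator",
        "class": "QueuedRunCoordinator",
        "config": {
            "max_concurrent_runs": 36,
            # Lowering this from the 5 second default, as discussed above.
            "dequeue_interval_seconds": 1,
        },
    }
}
```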
v
I could lower it to 1s and report back.
d
Other users in a similar situation have had some success in the past using dynamic orchestration, with a dynamic op for each thing that used to be a small, short-lived job. That could be a potential workaround until the run-queue throughput improvement lands.
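Roughly this kind of shape, where each thing that used to be its own short run becomes one mapped branch inside a single run - all names here are made up:
```python
from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut())
def fan_out():
    # One DynamicOutput per thing that used to be its own short job.
    for idx in range(100):
        yield DynamicOutput(idx, mapping_key=str(idx))


@op
def do_one(item: int):
    # The body of what used to be a standalone 45 s - 2 min job.
    return item


@op
def fan_in(results):
    return len(results)


@job
def one_big_job():
    # .map() runs the branches in parallel inside a single run,
    # so only one run has to make it through the run queue.
    fan_in(fan_out().map(do_one).collect())
```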
v
We are pretty familiar with dynamic graphs; I'll see what we can do. Thanks!
Looking into this a bit more, the only reason we are running `RunRequest`s is to work around the fact that Dagster subgraphs cannot branch into parallel operations (see here). As a result, we are unable to use dynamic graphs here and have to rely on a sensor to dispatch our subgraphs as parallel jobs. 😞
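For completeness, this is the shape of what we do today (heavily simplified, names made up): a sensor emits one `RunRequest` per subgraph we want to run in parallel, so every subgraph becomes its own queued run.
```python
from dagster import RunRequest, job, op, sensor


@op(config_schema={"work_id": str})
def root_op(context):
    # Stand-in for the real subgraph we dispatch.
    context.log.info(f"processing {context.op_config['work_id']}")


@job
def subgraph_job():
    root_op()


def find_pending_work():
    # Placeholder for however we discover pending work.
    return ["batch-1", "batch-2"]


@sensor(job=subgraph_job)
def dispatch_subgraphs(context):
    # One RunRequest per subgraph: each becomes its own queued run,
    # which is exactly what piles up behind the serial dequeuer.
    for work_id in find_pending_work():
        yield RunRequest(
            run_key=work_id,
            run_config={"ops": {"root_op": {"config": {"work_id": work_id}}}},
        )
```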
d
Ah :( I wish I had a better answer for you - I'm hoping that we can get to parallelizing the run dequeuer soon, but 'soon' here might mean August/September
v
August/September sounds great! We were planning on expanding our workloads in the coming weeks, which would result in hundreds of jobs being dispatched by a sensor. As it stands, that would likely make our system unusable. Thankfully, though, we have the option to stage our ramp-up, so we'll enable the new workloads little by little so as not to overwhelm the queue.
Once again, thanks for all the work that you do! We are extremely happy with Dagster; it has allowed us to make tremendous progress in our automation.
This is just a small bump in the road. If Dagster cannot be made to dequeue requests in parallel by the end of September, we can always bake our own solution with *gasp* raw celery jobs that run the pipelines with `execute_in_process` or something. 😅
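Something along these lines, sketched on the assumption that we'd reuse our existing celery app - broker URL and names are made up:
```python
from celery import Celery
from dagster import job, op

# Illustrative celery app; in reality we'd point at our existing broker.
app = Celery("pipelines", broker="redis://localhost:6379/0")


@op
def do_work():
    ...


@job
def short_job():
    do_work()


@app.task
def run_pipeline():
    # Bypasses the Dagster run queue entirely: the celery worker executes
    # the whole job in its own process.
    result = short_job.execute_in_process()
    return result.success
```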