# ask-community
v
Hi Dagster team! We are running a Dagster cluster on a dozen powerful machines. Despite the `max_concurrent_runs` setting currently being set to 36, we never see more than 20 jobs dispatched concurrently by the `celery_executor`, with most celery workers just sitting idle. What could be the bottleneck, and how can we diagnose this? We were hoping to raise this setting to several hundred as we add more hardware to the cluster. Could it be that the gRPC server is not cutting it? We are hosting one instance for the whole cluster. Should we be running more?
Things fail horribly with more than one gRPC server running and jobs never finish, so that doesn't seem to be the answer. 😞
Or maybe we should stop hosting a Dagster gRPC server and move the daemon to a really powerful machine?
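For context, here is roughly how our jobs are wired to the celery executor - the job and op names are made up, but the shape is the same:
```python
from dagster import job, op
from dagster_celery import celery_executor


@op
def do_work():
    # Placeholder for the 45 s - 2 min of real work each of our jobs does.
    ...


# Each job uses the celery executor, so its ops are dispatched to celery workers.
@job(executor_def=celery_executor)
def short_job():
    do_work()
```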
d
Hi, how long is each individual job? Is each one pretty quick to finish? If so, could the bottleneck be the Dagster run queue that pulls runs off of the queue and launches them? That part doesn't go through celery - we may need to add the ability to dequeue runs in parallel rather than serially.
v
The jobs last between 45 seconds and 2 minutes.
It's unlikely the issue is from celery; for historical reasons, we are still dispatching raw celery jobs on the same infrastructure, and can run hundreds in parallel without breaking a sweat.
Dagster jobs that fork out to several ops in parallel are no issue either.
It's really starting the jobs themselves that takes a long time, which seems to confirm your thinking.
d
I think the relevant issue here is https://github.com/dagster-io/dagster/issues/7763 (the last part about the run queue)
There's also an interval you can configure for how frequently the run queue operates; that may help a bit as well.
v
Ah that's exactly what we are doing: all our jobs are started by a sensor.
Ah, sure, we could give that a try. Could you please point me to this parameter?
d
`dequeue_interval_seconds` - that said, I think the default is pretty low (every 5 seconds)
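It lives under `run_coordinator` in your `dagster.yaml`. Roughly this shape - written out here as a Python dict purely to illustrate, and note the exact module path can vary a bit by Dagster version:
```python
# Mirror of the run_coordinator section of dagster.yaml, shown as a Python
# dict only to make the shape explicit; in practice this goes in dagster.yaml.
run_coordinator_settings = {
    "run_coordinator": {
        # Module path may differ by version (e.g. dagster._core.run_coordinator).
        "module": "dagster.core.run_coordinator",
        "class": "QueuedRunCoordinator",
        "config": {
            "max_concurrent_runs": 36,
            # Lowering this from the 5 second default, as discussed above.
            "dequeue_interval_seconds": 1,
        },
    }
}
```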
v
I could lower it to 1s and report back.
d
Other users in a similar situation have had some success in the past using dynamic orchestration, with a dynamic op for each thing that used to be a small, short-lived job. That could be a potential workaround until the run-queue throughput improvement lands.
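Roughly this kind of shape, where each thing that used to be its own short run becomes one mapped branch inside a single run - all names here are made up:
```python
from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut())
def fan_out():
    # One DynamicOutput per thing that used to be its own short job.
    for idx in range(100):
        yield DynamicOutput(idx, mapping_key=str(idx))


@op
def do_one(item: int):
    # The body of what used to be a standalone 45 s - 2 min job.
    return item


@op
def fan_in(results):
    return len(results)


@job
def one_big_job():
    # .map() runs the branches in parallel inside a single run,
    # so only one run has to make it through the run queue.
    fan_in(fan_out().map(do_one).collect())
```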
v
We are pretty familiar with dynamic graphs; I'll see what we can do. Thanks!
Looking into this a bit more, the only reason we are running `RunRequest`s is to work around the fact that Dagster subgraphs cannot branch into parallel operations (see here). As a result, we are unable to use dynamic graphs here and have to rely on a sensor to dispatch our subgraphs as parallel jobs. 😞
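For completeness, this is the shape of what we do today (heavily simplified, names made up): a sensor emits one `RunRequest` per subgraph we want to run in parallel, so every subgraph becomes its own queued run.
```python
from dagster import RunRequest, job, op, sensor


@op(config_schema={"work_id": str})
def root_op(context):
    # Stand-in for the real subgraph we dispatch.
    context.log.info(f"processing {context.op_config['work_id']}")


@job
def subgraph_job():
    root_op()


def find_pending_work():
    # Placeholder for however we discover pending work.
    return ["batch-1", "batch-2"]


@sensor(job=subgraph_job)
def dispatch_subgraphs(context):
    # One RunRequest per subgraph: each becomes its own queued run,
    # which is exactly what piles up behind the serial dequeuer.
    for work_id in find_pending_work():
        yield RunRequest(
            run_key=work_id,
            run_config={"ops": {"root_op": {"config": {"work_id": work_id}}}},
        )
```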
d
Ah :( I wish I had a better answer for you - I'm hoping that we can get to parallelizing the run dequeuer soon, but 'soon' here might mean August/September
v
August/September sounds great! We were planning on expanding our workloads in the coming weeks, which would result in hundreds of jobs being dispatched by a sensor. As it stands, that would likely make our system unusable. Thankfully, though, we have the option to stage our ramp-up, so we'll enable the new workloads little by little so as not to overwhelm the queue.
Once again, thanks for all the work that you do! We are extremely happy with Dagster; it has allowed us to make tremendous progress in our automation.
This is just a small bump in the road. If Dagster cannot be made to dequeue requests in parallel by the end of September, we can always bake our own solution with *gasp* raw celery jobs that run the pipelines with `execute_in_process` or something. 😅
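Something along these lines, sketched on the assumption that we'd reuse our existing celery app - broker URL and names are made up:
```python
from celery import Celery
from dagster import job, op

# Illustrative celery app; in reality we'd point at our existing broker.
app = Celery("pipelines", broker="redis://localhost:6379/0")


@op
def do_work():
    ...


@job
def short_job():
    do_work()


@app.task
def run_pipeline():
    # Bypasses the Dagster run queue entirely: the celery worker executes
    # the whole job in its own process.
    result = short_job.execute_in_process()
    return result.success
```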