
Tyler Ellison

10/06/2021, 8:31 PM
Might be a weird question but… What determines how fast the QueuedRunCoordinator can launch runs? I’ve got a couple thousand small runs sitting in queue. They seem to complete faster than the coordinator launches new ones so I only have 2-4 in progress at a time.

daniel

10/06/2021, 9:00 PM
hi Tyler - not a weird question! There's a dequeue_interval_seconds parameter that you can set in your dagster.yaml (it defaults to 30) - that controls how frequently the coordinator checks to pull things off the queue. Something like:
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    dequeue_interval_seconds: 5

johann

10/06/2021, 9:04 PM
This seems like it would be more of a throughput issue, if the thousands of runs are sticking around in the queue for a while
👍 1
@Tyler Ellison there are a few things that could help increase throughput:
• Increasing resources for the Daemon service. If you can get metrics for its CPU/memory usage, it might reveal whether it's getting bottlenecked there.
• Another possible bottleneck is the run launcher's call to the underlying platform. For example, we've seen cases where K8s clusters have jobs that aren't getting garbage collected, eventually leading to a K8s API server that's very slow to respond. What run launcher are you using?
It may also help us to see the logs of the queue daemon, just to confirm it’s pulling things off the queue properly

Tyler Ellison

10/07/2021, 1:51 PM
Thanks for the input @johann and @daniel! CPU/mem usage was sitting around 50-60%. I ended up with about 14k runs that had to process, and they finished 7 hours later that night. We've currently got an Azure VM running our Dagster instance, and I'm just using the DefaultRunLauncher. I'm thinking I should be looking into something that will scale better in the long run. This was an initial extract for a new integration, so it's not our typical amount of runs. I feel Kubernetes might have handled this better by scaling out. What are your thoughts?

johann

10/07/2021, 2:46 PM
Did the number of in progress runs ever get close to your concurrency limit?
Launching that many runs with the DefaultRunLauncher could possibly be causing some issues. Each run launch involves sending a gRPC request to a single user code server, so latency there might be preventing the Daemon from dequeuing runs quickly
Kubernetes or another launcher that spins up new compute per run could be a good option; it totally depends on where you want things deployed
👍 1
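For reference, switching to the K8sRunLauncher is a dagster.yaml change. A minimal sketch, assuming a cluster is already set up (the image, service account, and ConfigMap names below are placeholders for illustration, not values from this thread):
run_launcher:
  module: dagster_k8s
  class: K8sRunLauncher
  config:
    service_account_name: dagster            # placeholder service account allowed to create Jobs
    job_image: my-user-code-image:latest     # placeholder image containing the pipeline code
    instance_config_map: dagster-instance    # placeholder ConfigMap holding the instance config
    dagster_home: /opt/dagster/dagster_home  # DAGSTER_HOME inside the launched pods
With this launcher each run becomes its own Kubernetes Job, so run compute scales out across the cluster instead of everything sharing one VM.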

Tyler Ellison

10/07/2021, 2:59 PM
Just some context… I have a sensor kicking off a run to pull a file over SFTP to Azure blob storage. That’s the initial run. I also have a sensor kicking off a run to pull a file from blob, parse it and push it into Snowflake 🙂 The sftp -> blob run is extremely quick (1-2 sec) so they were just finishing faster than the launcher could launch new ones lol. Once it got to the second pipeline, it floated to my concurrency limit since they take a little longer.

johann

10/07/2021, 3:07 PM
Got it. Yeah, a high number of short pipelines is actually a dimension of stress on the system that we're not as familiar with, compared to a lower number of large pipelines. We're definitely interested in feedback. At some point the queue daemon will need to be able to scale horizontally to avoid being the bottleneck. It's possible you ran into that with your first set of runs, though that may also be solvable with a different launcher like you suggested

Tyler Ellison

10/07/2021, 3:41 PM
Thanks again for all the information! I'll definitely look into using a different launcher and see what that looks like 🙂

johann

10/07/2021, 3:48 PM
Daniel and I chatted about this, and his initial suggestion that I dismissed might actually be a solution haha
You could try a shorter interval and see how that performs
❤️ 1
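As a sketch of what that might look like, paired with a higher coordinator limit: dequeue_interval_seconds and max_concurrent_runs are both QueuedRunCoordinator config fields, and the specific values below are illustrative rather than recommendations from the thread:
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    dequeue_interval_seconds: 2   # poll the queue more often (illustrative value)
    max_concurrent_runs: 50       # cap on in-flight runs (illustrative value)
Raising max_concurrent_runs only helps if the daemon can actually dequeue fast enough, which is why the two settings are shown together here.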

assaf

06/13/2022, 4:47 PM
I'm in a very similar predicament, so if anything's been improved in Dagster since then (with respect to many short jobs in the queue), I'd be happy to learn about it. I have 2-10K runs in the queue, with these jobs usually taking ~10s to complete once scheduled to a pod by K8s. I've configured room for 512 concurrent runs in the QueuedRunCoordinator; however, I only see ~60-80 runs showing up in the in-progress tab in Dagit. So far I've:
1. Decreased minimum_interval_seconds on the sensor to 1s. Can it go to zero?
2. Provided 4X more CPU and memory to the daemon.
3. Scaled up my Dagster DB (initially there was some CPU pressure there, but that's been resolved).
4. Instrumented the latency for IO-sensitive parts of the sensor eval fn, as well as total tick latency. I'm seeing p99 total latency usually under 500ms, with some occasional spikes going to ~5s.
5. Periodically garbage-collected stale pods from the k8s API.
D'oh!
kubectl get jobs | grep dagster-run | wc -l
  243482

daniel

06/13/2022, 5:08 PM
I think there are still some improvements we need to make in both the sensor daemon and the run queue daemon to optimize them for throughput / reduce latency when there are a bunch of runs

assaf

06/13/2022, 5:32 PM
Any open issues we can follow or contribute to?

daniel

06/13/2022, 5:33 PM
Just so you know this isn't relegated to eternal backlog - actively thinking through adding more parallelism to the various daemons here (might still be a few weeks before it lands though): https://github.com/dagster-io/dagster/pull/8265

assaf

07/03/2022, 8:59 AM
Looks like we got it in 0.15.3, though I can't find any documentation of the new config. The source code indicates there are new use_threads and num_workers fields, but it's not immediately obvious where those should be set.

johann

07/05/2022, 2:53 PM
Hi @assaf, that will be configured in your dagster.yaml:
run_coordinator:
  ...
sensors:
  use_threads: true
  num_workers: 10
Or if using the helm chart, it will be configured in your `values.yaml`: https://github.com/dagster-io/dagster/blob/master/helm/dagster/values.yaml#L1006-L1011
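If the helm chart is in use, the equivalent should live under the daemon's values; the key names below are my best reading of the linked values.yaml and are worth double-checking against it:
dagsterDaemon:
  sensors:
    useThreads: true
    numWorkers: 10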
@Dagster Bot docs sensor thread pool

Dagster Bot

07/05/2022, 2:53 PM

prha

07/05/2022, 6:41 PM
I think there are docs/helm values, but that PR unfortunately did not make it into the release. We'll make sure this all gets updated for 0.15.4. The relevant PR is here: https://github.com/dagster-io/dagster/pull/8657