# ask-community

Tom Reilly

02/27/2023, 11:30 PM
I'm using `EcsRunLauncher` and my `QueuedRunCoordinator` is set up with `max_concurrent_runs` set to 750. A sensor requested ~650 job runs, but I never saw more than about 200 in progress at once, even though there were hundreds of runs waiting in the queue. I expected to see a larger number of jobs in progress at once. The database I use for run and event storage, as well as my gRPC service, did hit 100% CPU utilization at times. Any advice for getting runs out of the queue faster so that the number of in-progress runs stays closer to the `max_concurrent_runs` value?
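For reference, a `dagster.yaml` matching the setup described above would look roughly like this. This is a sketch assuming the standard module/class paths for these components; the only value taken from the message itself is `max_concurrent_runs: 750`:

```yaml
run_launcher:
  module: dagster_aws.ecs
  class: EcsRunLauncher

run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 750
```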

rex

02/28/2023, 12:43 AM
Have you configured your instance to use threads in order to evaluate sensors and schedules? This should increase your throughput: https://docs.dagster.io/deployment/dagster-instance#sensor-evaluation
Something like this in your `dagster.yaml`:

```yaml
sensors:
  use_threads: true
  num_workers: 8

schedules:
  use_threads: true
  num_workers: 8
```

daniel

02/28/2023, 2:08 AM
I think for parallelizing the runs coming off the queue you'd want something like this (similar to what rex posted, but for taking runs off the queue instead of putting them on):

```yaml
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    dequeue_use_threads: true
    dequeue_num_workers: 8 # This number can be tuned depending on your parallelism needs
```

Tom Reilly

03/01/2023, 2:22 PM
I enabled both of these and throughput definitely increased when enqueuing and dequeuing. I have run monitoring enabled:

```yaml
run_monitoring:
  enabled: true
  start_timeout_seconds: 300
```

Since enabling threading, I'm seeing some failures due to jobs exceeding `start_timeout_seconds`. Is this a sign to scale up the daemon?

daniel

03/01/2023, 2:39 PM
Do you have a sense of what was happening during those 300 seconds for the run that failed? What was the last line in the event log for that run before run monitoring killed it? I wouldn't actually expect the daemon to be in the critical path there, but I've definitely seen ECS Fargate sometimes take more than 300 seconds to spin up a task.

Tom Reilly

03/01/2023, 2:48 PM
[image.png attached]
The engine event logs out `[EcsRunLauncher] Launching run in ECS task` with a task ARN, and then about 5 minutes later the run failure is triggered. In ECS, I do not see a task with the ARN logged by the `ENGINE_EVENT`.

For additional context, I'm load testing with a set of about 700 files. The sensor runs a query that returns the 700 files and initiates a `RunRequest` for each. Of those 700, since I enabled threading, I'm seeing about 2-5 fail each time due to exceeding `start_timeout_seconds`.

Manually checking a dozen or so successful job runs, the time between the `RUN_STARTING` and `RUN_START` events is usually around 70 seconds.
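As a rough way to quantify that startup gap across all runs instead of spot-checking a dozen, one could diff event timestamps. This is a minimal sketch assuming the event log for each run has been exported as `(unix_timestamp, event_type)` pairs; the helper name is illustrative, not a Dagster API:

```python
from typing import Iterable, Optional, Tuple

def startup_latency(events: Iterable[Tuple[float, str]]) -> Optional[float]:
    """Return seconds between the RUN_STARTING and RUN_START events for one
    run, or None if either event is missing from the sequence."""
    starting = started = None
    for ts, event_type in events:
        if event_type == "RUN_STARTING":
            starting = ts
        elif event_type == "RUN_START":
            started = ts
    if starting is None or started is None:
        return None
    return started - starting

# Example: a run whose task took 70 seconds to come up.
events = [
    (1000.0, "RUN_STARTING"),
    (1070.0, "RUN_START"),
    (1300.0, "RUN_SUCCESS"),
]
print(startup_latency(events))  # 70.0
```

Runs killed by run monitoring would show up as `None` here (no `RUN_START` ever arrived), which matches the failure mode described above.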

daniel

03/01/2023, 3:02 PM
What version of dagster are you using?

Tom Reilly

03/01/2023, 3:03 PM
1.1.18

daniel

03/01/2023, 3:34 PM
It's surprising that the task ARN wouldn't be in ECS, since the only place we would get such an ARN from is the ECS API (it comes from the result of the `run_task` API call): https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster-aws/dagster_aws/ecs/launcher.py#L394-L430 Are you sure the tasks aren't there, but in a STOPPED state?
Tom Reilly

Looks like the task expires after an hour. For now, with sensor threading disabled and dequeue threading enabled, I can consistently process the set of files without failures. I'll leave sensor threading disabled for now, and if I encounter the error again I'll make sure to check the task before it expires.
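One way to check a task before ECS ages it out (stopped tasks are only visible for roughly an hour) is to look it up by ARN with `describe_tasks` and inspect `lastStatus`/`stoppedReason`. The parsing helper below is pure so it can be exercised without AWS; the boto3 call itself is the standard ECS client API, and the cluster name is a placeholder:

```python
def summarize_stopped_task(response: dict) -> str:
    """Summarize an ECS describe_tasks response for a possibly-stopped task.

    `response` is the dict returned by the describe_tasks API: either the
    task appears under "tasks", or "failures" carries a reason such as
    MISSING once a stopped task has aged out.
    """
    if response.get("failures"):
        return "not found: " + response["failures"][0].get("reason", "unknown")
    task = response["tasks"][0]
    return "{}: {}".format(
        task.get("lastStatus", "UNKNOWN"),
        task.get("stoppedReason", "no stopped reason"),
    )

def fetch_task_status(cluster: str, task_arn: str) -> str:
    """Look up one task by ARN. Requires AWS credentials; `cluster` is
    whatever cluster the EcsRunLauncher launches into."""
    import boto3  # imported here so the helper above stays testable offline

    ecs = boto3.client("ecs")
    return summarize_stopped_task(ecs.describe_tasks(cluster=cluster, tasks=[task_arn]))

# Offline examples of the two shapes the API can return:
print(summarize_stopped_task(
    {"tasks": [{"lastStatus": "STOPPED",
                "stoppedReason": "Essential container in task exited"}],
     "failures": []}
))  # STOPPED: Essential container in task exited
print(summarize_stopped_task(
    {"tasks": [], "failures": [{"reason": "MISSING"}]}
))  # not found: MISSING
```

A `stoppedReason` on the STOPPED task (for example an image pull or capacity error) would explain why the run never emitted `RUN_START` before the 300-second timeout.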