# ask-community

Tom Reilly

02/27/2023, 11:30 PM
I'm using `EcsRunLauncher` and my `QueuedRunCoordinator` is set up with `max_concurrent_runs` set to 750. A sensor requested ~650 job runs, but I never saw more than about 200 in progress at once, even though there were hundreds of runs waiting in the queue. I expected to see a larger number of jobs in progress at once. The database I use for run and event storage, as well as my gRPC service, did hit 100% CPU utilization at times. Any advice for getting runs out of the queue faster so that the number of in-progress runs stays closer to the `max_concurrent_runs` value?
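For reference, a `dagster.yaml` matching the setup described above would look roughly like this. This is a sketch assuming the standard module/class paths for these components; the only value taken from the message itself is `max_concurrent_runs: 750`:

```yaml
run_launcher:
  module: dagster_aws.ecs
  class: EcsRunLauncher

run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 750
```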

rex

02/28/2023, 12:43 AM
Have you configured your instance to use threads in order to evaluate sensors and schedules? This should increase your throughput: https://docs.dagster.io/deployment/dagster-instance#sensor-evaluation
Something like this in your `dagster.yaml`:

```yaml
sensors:
  use_threads: true
  num_workers: 8

schedules:
  use_threads: true
  num_workers: 8
```

daniel

02/28/2023, 2:08 AM
I think for parallelizing the runs coming off the queue you'd want something like this (similar to what rex posted, but for taking runs off the queue instead of putting them on):

```yaml
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    dequeue_use_threads: true
    dequeue_num_workers: 8 # This number can be tuned depending on your parallelism needs
```

Tom Reilly

03/01/2023, 2:22 PM
I enabled both of these and throughput definitely increased when enqueuing and dequeuing. I have run monitoring enabled:

```yaml
run_monitoring:
  enabled: true
  start_timeout_seconds: 300
```

Since enabling threading, I'm seeing some failures due to jobs exceeding `start_timeout_seconds`. Is this a sign to scale up the daemon?

daniel

03/01/2023, 2:39 PM
Do you have a sense of what was happening during those 300 seconds for the run that failed? What was the last line in the event log for that run before run monitoring killed it? I wouldn't actually expect the daemon to be in the critical path there, but I've definitely seen ECS Fargate sometimes take more than 300 seconds to spin up a task.

Tom Reilly

03/01/2023, 2:48 PM
[image.png attached]
The engine event logs out `[EcsRunLauncher] Launching run in ECS task` with a task ARN, and then about 5 minutes later the run failure is triggered. In ECS, I do not see a task with the ARN logged by the `ENGINE_EVENT`.

For additional context, I'm load testing with a set of about 700 files. The sensor runs a query that returns the 700 files and initiates a `RunRequest` for each. Of those 700, since I enabled threading, I'm seeing about 2-5 fail each time due to exceeding `start_timeout_seconds`.

Manually checking a dozen or so successful job runs, the time between the `RUN_STARTING` and `RUN_START` events is usually around 70 seconds.
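As a rough way to quantify that startup gap across all runs instead of spot-checking a dozen, one could diff event timestamps. This is a minimal sketch assuming the event log for each run has been exported as `(unix_timestamp, event_type)` pairs; the helper name is illustrative, not a Dagster API:

```python
from typing import Iterable, Optional, Tuple

def startup_latency(events: Iterable[Tuple[float, str]]) -> Optional[float]:
    """Return seconds between the RUN_STARTING and RUN_START events for one
    run, or None if either event is missing from the sequence."""
    starting = started = None
    for ts, event_type in events:
        if event_type == "RUN_STARTING":
            starting = ts
        elif event_type == "RUN_START":
            started = ts
    if starting is None or started is None:
        return None
    return started - starting

# Example: a run whose task took 70 seconds to come up.
events = [
    (1000.0, "RUN_STARTING"),
    (1070.0, "RUN_START"),
    (1300.0, "RUN_SUCCESS"),
]
print(startup_latency(events))  # 70.0
```

Runs killed by run monitoring would show up as `None` here (no `RUN_START` ever arrived), which matches the failure mode described above.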

daniel

03/01/2023, 3:02 PM
What version of dagster are you using?

Tom Reilly

03/01/2023, 3:03 PM
1.1.18

daniel

03/01/2023, 3:34 PM
It's surprising that the task ARN wouldn't be in ECS, since the only place we would get such an ARN from is the ECS API (it comes from the result of the `run_task` API call): https://github.com/dagster-io/dagster/blob/master/python_modules/libraries/dagster-aws/dagster_aws/ecs/launcher.py#L394-L430 Are you sure the tasks aren't there, but in a STOPPED state?
Tom Reilly

Looks like the task expires after an hour. For now, with sensor threading disabled and dequeue threading enabled, I can consistently process the set of files without failures. I'll leave sensor threading disabled for now, and if I encounter the error again I'll make sure to check the task before it expires.
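One way to check a task before ECS ages it out (stopped tasks are only visible for roughly an hour) is to look it up by ARN with `describe_tasks` and inspect `lastStatus`/`stoppedReason`. The parsing helper below is pure so it can be exercised without AWS; the boto3 call itself is the standard ECS client API, and the cluster name is a placeholder:

```python
def summarize_stopped_task(response: dict) -> str:
    """Summarize an ECS describe_tasks response for a possibly-stopped task.

    `response` is the dict returned by the describe_tasks API: either the
    task appears under "tasks", or "failures" carries a reason such as
    MISSING once a stopped task has aged out.
    """
    if response.get("failures"):
        return "not found: " + response["failures"][0].get("reason", "unknown")
    task = response["tasks"][0]
    return "{}: {}".format(
        task.get("lastStatus", "UNKNOWN"),
        task.get("stoppedReason", "no stopped reason"),
    )

def fetch_task_status(cluster: str, task_arn: str) -> str:
    """Look up one task by ARN. Requires AWS credentials; `cluster` is
    whatever cluster the EcsRunLauncher launches into."""
    import boto3  # imported here so the helper above stays testable offline

    ecs = boto3.client("ecs")
    return summarize_stopped_task(ecs.describe_tasks(cluster=cluster, tasks=[task_arn]))

# Offline examples of the two shapes the API can return:
print(summarize_stopped_task(
    {"tasks": [{"lastStatus": "STOPPED",
                "stoppedReason": "Essential container in task exited"}],
     "failures": []}
))  # STOPPED: Essential container in task exited
print(summarize_stopped_task(
    {"tasks": [], "failures": [{"reason": "MISSING"}]}
))  # not found: MISSING
```

A `stoppedReason` on the STOPPED task (for example an image pull or capacity error) would explain why the run never emitted `RUN_START` before the 300-second timeout.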