# deployment-ecs
b
Folks, we have some tasks on Dagster that get stuck on `Starting` in the interface. After some painful troubleshooting, I see that the stoppedReason is `Timeout waiting for network interface provisioning to complete.`, which is documented as possible flakiness in the ECS scheduler and requires a manual attempt to run again, but I don't think `dagster-ecs` is handling this. We are using v`0.15.9`.
❤️ 1
d
Hi Bianca - I believe in the latest version of dagster this will be detected and moved into a failure state, and a run-level retry can be set up to restart the job on failure
b
Great, thanks! Do you happen to know if I can find this change in the dagster-ecs changelog?
Just checking if there is an upgrade path that doesn't require going from 0.15 to v1 right away.
d
I would actually expect this to be available in 0.15.9 if run_monitoring is enabled on your instance: https://docs.dagster.io/0.15.9/deployment/run-monitoring#run-monitoring
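For reference, turning it on in dagster.yaml would look something like this (a minimal sketch based on those docs; the timeout value is only illustrative):
run_monitoring:
  enabled: true
  start_timeout_seconds: 300  # illustrative: how long a run may sit in STARTING before monitoring marks it as failed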
j
Just to clarify, is this with cloud or open source?
b
Open source!
Reading the docs for run monitoring, it seems like it's only available for some run launchers:
Dagster can detect hanging runs and restart crashed run workers. Run monitoring is currently only supported on instances using one of the following run launchers:
• K8sRunLauncher
• CeleryK8sRunLauncher
• DockerRunLauncher
We’re using ECSRunLauncher 🤔
d
ECSRunLauncher should be in that list - we'll get that added
b
Awesome! ty =D
a
Hi @daniel, cool feature! I am not able to activate the run retry in dagster.yaml, as it tells me that this is not supported for the ECSRunLauncher. My environment runs on the following versions:
dagster==1.1.10
dagit==1.1.10
dagster-graphql==1.1.10
dagster-aws==0.17.10
dagster-postgres==0.17.10
And the dagster.yaml looks as follows (only the relevant part is included):
run_launcher:
  module: dagster_aws.ecs
  class: EcsRunLauncher
run_monitoring:
  enabled: true
  start_timeout_seconds: 600
  max_resume_run_attempts: 3
  poll_interval_seconds: 120
Run monitoring works when I remove the "max_resume_run_attempts" and "poll_interval_seconds". Is this not yet implemented for ECSRunLauncher or am I missing something? Thank you in advance!
d
hi Arnoud, it's a bit confusing but there's a separate 'run_retries' key for retries: https://docs.dagster.io/deployment/run-retries#configuration
run_retries:
  enabled: true
  max_retries: 3 # Sets a default for all jobs. 0 if not set
The `max_resume_run_attempts` is a separate, more experimental thing that tries to keep the same run going and recover if the run process crashes, as opposed to run retries, which launch a new run if the original fails.
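To illustrate the difference, here's a sketch of how the two would sit side by side in dagster.yaml (values mirror the snippets above and are only illustrative):
run_monitoring:
  enabled: true
  start_timeout_seconds: 600
  max_resume_run_attempts: 3   # experimental: try to resume the *same* run if the run worker crashes (not supported by every run launcher)
  poll_interval_seconds: 120
run_retries:
  enabled: true
  max_retries: 3               # on failure, launch a *new* run; sets a default for all jobs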
a
Hi @daniel, thanks for the quick response! The `run_retries` unfortunately retries all jobs that are classified as failed, which is not favorable in our setup (a rerun of a failure will not resolve the underlying issue). Does the `max_resume_run_attempts` feature only retry the jobs that were stuck on starting, as that is currently what is missing in our setup? Thank you in advance!
d
We don't currently have support for only retrying runs that failed during starting, unfortunately, although it's a very reasonable feature request.
Is that something you'd be willing to file an issue for?
a
Yes, of course! Where should I do this?
d
Thanks!
a
Hi @daniel, are there any updates regarding this issue? The feature would really help us a lot, as it is almost impossible to track all the runs that are marked as failed by the monitoring daemon (since in production we are going to run 10,000+ jobs...). Thank you!
d
No updates to share at this time on that particular issue unfortunately
👍 1