Bianca Rosa 12/01/2022, 12:59 PM
The run is stuck in Starting on the interface. After some painful troubleshooting, I see that the stoppedReason is "Timeout waiting for network interface provisioning to complete.", which is documented as possible flakiness in the ECS scheduler that requires a manual attempt to run again, but I don't think dagster-ecs is handling this. We are using v`0.15.9`.
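(Note for readers: the stoppedReason can be pulled straight from the ECS API rather than the console. This is only a minimal sketch, assuming boto3 is available; the cluster name and region are hypothetical placeholders.)
import boto3

# Hypothetical cluster name and region; substitute your own.
ecs = boto3.client("ecs", region_name="us-east-1")

# Find recently stopped tasks in the cluster and print why they stopped.
stopped = ecs.list_tasks(cluster="dagster-cluster", desiredStatus="STOPPED")
if stopped["taskArns"]:
    tasks = ecs.describe_tasks(cluster="dagster-cluster", tasks=stopped["taskArns"])
    for task in tasks["tasks"]:
        print(task["taskArn"], task.get("stoppedReason"))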
daniel 12/01/2022, 1:04 PM
Bianca Rosa 12/01/2022, 1:25 PM
daniel 12/01/2022, 4:34 PM
johann 12/02/2022, 11:03 AM
Bianca Rosa 12/05/2022, 8:46 PM
• K8sRunLauncher
• CeleryK8sRunLauncher
• DockerRunLauncher
daniel 12/07/2022, 4:26 PM
Bianca Rosa 12/07/2022, 5:16 PM
Arnoud van Dommelen 01/18/2023, 11:37 AM
dagster==1.1.10 dagit==1.1.10 dagster-graphql==1.1.10 dagster-aws==0.17.10 dagster-postgres==0.17.10
And the dagster.yaml looks as follows (only the relevant part included):
run_launcher:
  module: dagster_aws.ecs
  class: EcsRunLauncher
run_monitoring:
  enabled: true
  start_timeout_seconds: 600
  max_resume_run_attempts: 3
  poll_interval_seconds: 120
Run monitoring works when I remove the "max_resume_run_attempts" and "poll_interval_seconds" settings.
Is this not yet implemented for the EcsRunLauncher, or am I missing something?
Thank you in advance!
daniel 01/18/2023, 3:30 PM
For retrying failed runs you want run_retries in dagster.yaml:
run_retries:
  enabled: true
  max_retries: 3 # Sets a default for all jobs. 0 if not set
The max_resume_run_attempts setting is a separate, more experimental feature that tries to keep the same run going and recover if the run process crashes, as opposed to run retries, which retry in a new run if the first one fails.
Arnoud van Dommelen 01/18/2023, 3:50 PM
run_retries unfortunately retries all jobs that are classified as failed, which is not favorable in our setup (a rerun after a failure will not resolve the underlying issue). Does the max_resume_run_attempts feature only retry the jobs that were stuck on Starting? That is currently what is missing in our setup.
Thank you in advance!
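(Note for readers: one way to avoid retrying every failed job, sketched here rather than confirmed in this thread, is to enable run_retries without a global max_retries and opt individual jobs in with the dagster/max_retries tag, so only the jobs prone to the ECS provisioning timeout are retried. The job and op names below are hypothetical.)
from dagster import job, op

@op
def do_work():
    ...

# Opt only this job into run retries via the dagster/max_retries tag;
# jobs without the tag keep the instance default (0 retries if max_retries is unset).
@job(tags={"dagster/max_retries": 3})
def ecs_sensitive_job():
    do_work()

@job  # no tag: a failure here is not retried
def other_job():
    do_work()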
daniel 01/18/2023, 3:51 PM
Arnoud van Dommelen 01/18/2023, 3:52 PM
daniel 01/18/2023, 3:52 PM
Arnoud van Dommelen 01/18/2023, 4:03 PM
daniel 02/16/2023, 11:04 AM