# deployment-ecs
b
Folks, we have some tasks on Dagster that get stuck on `Starting` in the interface. After some painful troubleshooting, I see that the stoppedReason is `Timeout waiting for network interface provisioning to complete.`, which is documented as possible flakiness in the ECS scheduler and requires a manual attempt to run again, but I don't think `dagster-ecs` is handling this. We are using v`0.15.9`.
❤️ 1
d
Hi Bianca - I believe in the latest version of dagster this will be detected and moved into a failure state, and a run-level retry can be set up to restart the job on failure
b
Great, thanks! Do you happen to know if I can find this change in the dagster-ecs changelog?
Just checking if there is an upgrade path that doesn't require going from 0.15 to v1 right away.
d
I would actually expect this to be available in 0.15.9 if run_monitoring is enabled on your instance: https://docs.dagster.io/0.15.9/deployment/run-monitoring#run-monitoring
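For reference, turning it on in dagster.yaml would look something like this (a minimal sketch based on those docs; the timeout value is only illustrative):
run_monitoring:
  enabled: true
  start_timeout_seconds: 300  # illustrative: how long a run may sit in STARTING before monitoring marks it as failed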
j
Just to clarify, is this with cloud or open source?
b
Open source!
Reading the docs for run monitoring, it seems like it's only available for some run launchers:
Dagster can detect hanging runs and restart crashed run workers. Run monitoring is currently only supported on instances using one of the following run launchers:
• K8sRunLauncher
• CeleryK8sRunLauncher
• DockerRunLauncher
We’re using ECSRunLauncher 🤔
d
ECSRunLauncher should be in that list - we'll get that added
b
Awesome! ty =D
a
Hi @daniel, cool feature! I am not able to activate the run retry in dagster.yaml, as it tells me that this is not supported for the ECSRunLauncher. My environment runs on the following versions:
dagster==1.1.10
dagit==1.1.10
dagster-graphql==1.1.10
dagster-aws==0.17.10
dagster-postgres==0.17.10
And the dagster.yaml looks as follows (only the relevant part is included):
run_launcher:
  module: dagster_aws.ecs
  class: EcsRunLauncher
run_monitoring:
  enabled: true
  start_timeout_seconds: 600
  max_resume_run_attempts: 3
  poll_interval_seconds: 120
Run monitoring works when I remove the "max_resume_run_attempts" and "poll_interval_seconds". Is this not yet implemented for ECSRunLauncher or am I missing something? Thank you in advance!
d
hi Arnoud, it's a bit confusing but there's a separate 'run_retries' key for retries: https://docs.dagster.io/deployment/run-retries#configuration
run_retries:
  enabled: true
  max_retries: 3 # Sets a default for all jobs. 0 if not set
The `max_resume_run_attempts` is a separate, more experimental thing that tries to keep the same run going and recover if the run process crashes, as opposed to run retries, which launch a new run if the original fails.
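To illustrate the difference, here's a sketch of how the two would sit side by side in dagster.yaml (values mirror the snippets above and are only illustrative):
run_monitoring:
  enabled: true
  start_timeout_seconds: 600
  max_resume_run_attempts: 3   # experimental: try to resume the *same* run if the run worker crashes (not supported by every run launcher)
  poll_interval_seconds: 120
run_retries:
  enabled: true
  max_retries: 3               # on failure, launch a *new* run; sets a default for all jobs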
a
Hi @daniel, thanks for the quick response! The `run_retries` unfortunately retries all jobs that are classified as failed, which is not favorable in our setup (a rerun of a failure will not resolve the underlying issue). Does the `max_resume_run_attempts` feature only retry the jobs that were stuck on starting, as that is currently what is missing in our setup? Thank you in advance!
d
We don't currently have support for only retrying runs that failed during starting, unfortunately, although it's a very reasonable feature request.
Is that something you'd be willing to file an issue for?
a
Yes, of course! Where should I do this?
d
Thanks!
a
Hi @daniel, are there any updates regarding this issue? The feature would really help us a lot, as it is almost impossible to track all the runs that are marked as failed by the monitoring daemon (since in production we are going to run 10,000+ jobs...). Thank you!
d
No updates to share at this time on that particular issue unfortunately
👍 1