# deployment-ecs
j
I have an issue in ECS where some tasks fail: I can see in AWS that the task is stopped, and in the logs that it failed, but Dagster still shows the "Starting" label. As a result, all of my runs sit in the Starting state even though the underlying tasks are stopped and failed in AWS. Have you seen a similar problem?
s
You can use run monitoring. That is, add the following to your daemon’s `dagster.yaml`:
```yaml
run_monitoring:
  enabled: true
  start_timeout_seconds: 300 # ECS runs can take a long time to start (~80 seconds is normal)
  max_resume_run_attempts: 0
  poll_interval_seconds: 120
```
p
Jakub, is there anything in the logs indicating why they failed?
j
They failed because I deployed the Dagster repository and hit issues with sqlalchemy 2.0.0. I then updated the services to the newest version, but the runs stayed stuck in the starting phase.
a
Hi, is `poll_interval_seconds` the maximum allowed duration of an op before the monitoring daemon fails the run?
j
It's how frequently the daemon checks. `start_timeout_seconds` is the minimum time after which a job stuck in the starting state is marked as failed. In the worst case, your job will be marked failed after `poll_interval_seconds + start_timeout_seconds`.
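To make the worst-case timing above concrete, here is a minimal sketch using the illustrative values from the example config earlier in the thread:

```python
# Example values from the run_monitoring config shown earlier in the thread.
poll_interval_seconds = 120    # how often the daemon polls run health
start_timeout_seconds = 300    # grace period for a run to leave the starting state

# A run that crosses the start timeout just after a poll is only noticed on
# the next poll, so detection can lag by up to one full poll interval on top
# of the timeout itself.
worst_case_detection_seconds = poll_interval_seconds + start_timeout_seconds
print(worst_case_detection_seconds)  # 420
```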
a
Ah, so it just checks whether the underlying ECS task is still running? I currently only have:
```yaml
run_monitoring:
  enabled: true
  start_timeout_seconds: 600
```
So this setup does not check for hanging "running" jobs, only jobs that are stuck on "starting"?
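For what it's worth, a config sketch that also covers hung "running" runs might look like the following. Note this is an assumption: a `max_runtime_seconds` option for run monitoring exists only in more recent Dagster versions, so verify the field name and availability against your version's deployment docs before relying on it.

```yaml
run_monitoring:
  enabled: true
  start_timeout_seconds: 600   # fail runs stuck in the starting state
  poll_interval_seconds: 120   # how often the daemon checks
  # Assumption: newer Dagster versions can fail runs that stay in the
  # running state too long; check your version's docs for this field.
  max_runtime_seconds: 7200
```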