# deployment-ecs
j
I have an issue in ECS where some tasks fail: I can see in AWS that the task is stopped, and in the logs that it failed, but Dagster still shows the "Starting" label. As a result, all of my runs sit in the Starting state even though the underlying tasks are stopped and failed in AWS. Have you seen a similar problem?
s
You can use run monitoring. That is, add the following to your daemon’s `dagster.yaml`:
```yaml
run_monitoring:
  enabled: true
  start_timeout_seconds: 300 # ECS runs can take a long time to start (~80 seconds is normal)
  max_resume_run_attempts: 0
  poll_interval_seconds: 120
```
p
Jakub, is there anything in the logs indicating why they failed?
j
They failed because I deployed the Dagster repository and hit issues with sqlalchemy 2.0.0. I then updated the services to the newest version, but the runs stayed stuck in the starting phase.
a
Hi, is `poll_interval_seconds` the maximum allowed duration of an op before the monitoring daemon fails the run?
j
It's how frequently the daemon checks. `start_timeout_seconds` is the minimum time after which a job stuck in the starting state is marked as failed. In the worst case, your job will be marked failed after `poll_interval_seconds + start_timeout_seconds`.
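To make the worst-case timing above concrete, here is a minimal sketch using the illustrative values from the example config earlier in the thread:

```python
# Example values from the run_monitoring config shown earlier in the thread.
poll_interval_seconds = 120    # how often the daemon polls run health
start_timeout_seconds = 300    # grace period for a run to leave the starting state

# A run that crosses the start timeout just after a poll is only noticed on
# the next poll, so detection can lag by up to one full poll interval on top
# of the timeout itself.
worst_case_detection_seconds = poll_interval_seconds + start_timeout_seconds
print(worst_case_detection_seconds)  # 420
```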
a
Ah, so it just checks whether the underlying ECS task is still running? I currently only have:
```yaml
run_monitoring:
  enabled: true
  start_timeout_seconds: 600
```
So this setup does not check for hanging "running" jobs, only jobs that are stuck on "starting"?
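For what it's worth, a config sketch that also covers hung "running" runs might look like the following. Note this is an assumption: a `max_runtime_seconds` option for run monitoring exists only in more recent Dagster versions, so verify the field name and availability against your version's deployment docs before relying on it.

```yaml
run_monitoring:
  enabled: true
  start_timeout_seconds: 600   # fail runs stuck in the starting state
  poll_interval_seconds: 120   # how often the daemon checks
  # Assumption: newer Dagster versions can fail runs that stay in the
  # running state too long; check your version's docs for this field.
  max_runtime_seconds: 7200
```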