I am running Dagster using AWS ECS. I have the run...
# ask-community
s
I am running Dagster on AWS ECS. I have the run_monitoring configuration below in my dagster.yaml. The run was marked as failed, but Dagster didn't send out the Slack message. How can I get Dagster to send a Slack message when this happens again?
run_launcher:
  module: dagster_aws.ecs
  class: EcsRunLauncher
  config:
    include_sidecars: true
    secrets_tag: "" 

run_monitoring:
  enabled: true
  start_timeout_seconds: 180
  max_resume_run_attempts: 0 
  poll_interval_seconds: 120
Here is the error message:
Run 49f436f0-5cde-4506-9343-8ee395ba86a3 has been running for 240.16203594207764 seconds, which is longer than the timeout of 180 seconds to start. Marking run failed
Does anyone know how to fix the issue of the job run sometimes hanging at Starting status?
o
hi @Sean Han -- are you using a run_failure_sensor, or something else to send these slack alerts? as for the job getting stuck on Starting, this generally indicates a failure in spinning up a run worker. there's a wide variety of reasons that this might happen, but checking the ecs console might be a good place to start to look for relevant logs
s
I use @slack_on_failure("#{channel}".format(channel=os.getenv("SLACK_CHANNEL")), dagit_base_url=os.getenv("DAGIT_BASE_URL")) on the job.
thanks for your help
If I use @run_failure_sensor, do I need to add it to every job I have?
o
ah I see, yeah those slack_on_failure hooks only execute within the context of an already-running job, so you'll want to use run_failure_sensor. by default, a run failure sensor will monitor all jobs in the code location that it's set up for, so you should be able to just create a single one
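For anyone landing here later, a minimal sketch of what a single run failure sensor might look like. It assumes the dagster and slack_sdk packages are installed and that a Slack bot token is available in a SLACK_BOT_TOKEN environment variable (that variable name, the helper names, and the /runs/<run_id> link format are assumptions, not from the thread). SLACK_CHANNEL and DAGIT_BASE_URL are the variables already used above. The Dagster imports sit inside the factory so the message helper can be exercised on its own.

```python
import os
from typing import Optional


def build_failure_text(job_name: str, run_id: str, error: str,
                       base_url: Optional[str] = None) -> str:
    """Format the Slack message for a failed run (plain Python, no Dagster needed)."""
    text = f'Job "{job_name}" failed (run {run_id}): {error}'
    if base_url:
        # Assumption: Dagit serves run pages at <base_url>/runs/<run_id>.
        text += f"\n{base_url.rstrip('/')}/runs/{run_id}"
    return text


def make_failure_sensor():
    """Build one sensor that watches every job in the code location."""
    # Imported here so the helper above can be used without Dagster installed.
    from dagster import RunFailureSensorContext, run_failure_sensor
    from slack_sdk import WebClient

    @run_failure_sensor(name="slack_on_run_failure")
    def slack_on_run_failure(context: RunFailureSensorContext):
        run = context.dagster_run
        text = build_failure_text(
            run.job_name,
            run.run_id,
            context.failure_event.message or "unknown error",
            os.getenv("DAGIT_BASE_URL"),
        )
        # SLACK_BOT_TOKEN is a hypothetical variable name for the bot token.
        client = WebClient(token=os.environ["SLACK_BOT_TOKEN"])
        client.chat_postMessage(channel="#" + os.environ["SLACK_CHANNEL"], text=text)

    return slack_on_run_failure
```

The object returned by make_failure_sensor() would then be registered once alongside the jobs in the code location (for example in the repository's list of definitions), rather than attached to each job. Unlike a hook, the sensor runs in the daemon, so it should also pick up runs that never got past Starting.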
s
thanks, @owen, let me give it a try.