# ask-community

Stephen Bailey (05/19/2022, 2:54 PM)
I'm having a lot of Failure to Start issues, apparently caused by my Kubernetes cluster getting swamped with requests from a backfill. The challenge is that since these runs never start, they don't show up in the Status page timeline graph, and they also aren't emitting failure hooks to Slack/Datadog. I think one way to address the alerting issue is to add a failure sensor. But do I need to add a sensor to each repo, or is there a way to have the sensor read cross-repo?
Example of the Failure to Start reason in a pod:

Events:
  Type     Reason             Age                 From                Message
  ----     ------             ----                ----                -------
  Normal   NotTriggerScaleUp  17m (x31 over 22m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max node group size reached
  Warning  FailedScheduling   16m (x17 over 22m)  default-scheduler   0/6 nodes are available: 6 Insufficient cpu.
This shows up as a timeout in Dagit.

daniel (05/19/2022, 3:02 PM)
Hey Stephen - for Slack specifically, Dagster Cloud also has a native Slack alerting feature that would fire in this case (and can be cross-repo): https://docs.dagster.cloud/guides/alerts - that doesn't cover Datadog, though. Failure sensors let you write arbitrary hooks on failure and would kick in here, but they do need to be repo-scoped. We're working on reconciling the features and properties of the native alerts and failure sensors, as there's some confusing overlap between them. I'll check separately about the timeline graph question.
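
For reference, a repo-scoped failure sensor of the kind daniel describes can be quite small. A minimal sketch, assuming a recent Dagster API (exact context attribute names vary a little between versions); my_alerting.post_to_slack and the channel name are hypothetical stand-ins for whatever Slack or Datadog client you already use:

from dagster import RunFailureSensorContext, run_failure_sensor

from my_alerting import post_to_slack  # hypothetical Slack/Datadog helper


@run_failure_sensor
def notify_on_run_failure(context: RunFailureSensorContext):
    # Fires for failed runs in the repository this sensor is defined in,
    # so it needs to be added to each repository's definitions.
    run = context.dagster_run
    post_to_slack(
        channel="#data-alerts",
        text=f"Run {run.run_id} of {run.job_name} failed: {context.failure_event.message}",
    )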

Stephen Bailey (05/19/2022, 3:12 PM)
There could be a bug in the Dagster Cloud Slack alerting -- I thought it would fire too, but for some reason it did not in this case. I just added the job failure sensor and that seems to work great, so I'll just add that to the job repos for now.
Would another option be to create a sensor that queries the Dagster GraphQL API and returns instance-level job failures?
Here's an example for the timeline -- the jobs run hourly, but about 4 hours got skipped. Looking at the individual jobs, you can see the ticks have indeed failed.
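
The instance-level sensor Stephen describes could also be built without going through GraphQL, by querying the shared Dagster instance from an ordinary sensor; because the instance is shared, it sees runs from every repository. A rough sketch under the same version caveat as above, with my_alerting.notify again a hypothetical helper and a simple run-id cursor standing in for real bookkeeping:

import json

from dagster import DagsterRunStatus, RunsFilter, SkipReason, sensor

from my_alerting import notify  # hypothetical Slack/Datadog helper


@sensor(minimum_interval_seconds=60)
def instance_failure_sensor(context):
    # Run IDs already alerted on, carried between evaluations via the sensor cursor.
    seen = set(json.loads(context.cursor)) if context.cursor else set()

    # Query the instance directly, so failures from all repositories are visible.
    failed_runs = context.instance.get_runs(
        filters=RunsFilter(statuses=[DagsterRunStatus.FAILURE]),
        limit=25,
    )

    new_failures = [run for run in failed_runs if run.run_id not in seen]
    for run in new_failures:
        notify(f"Run {run.run_id} of {run.job_name} failed")

    context.update_cursor(json.dumps([run.run_id for run in failed_runs]))
    if not new_failures:
        return SkipReason("No new failed runs")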

daniel (05/19/2022, 3:16 PM)
Possible to post or DM a link to the run that didn't fire an alert? We can take a look in the logs and see why it didn't fire
You could create that instance-level sensor, but I think you could make a strong case that we should just provide it out of the box 🙂 The run status sensor implementation specifically filters out runs it picks up that aren't from the current repository, so offering an instance-scoped sensor would be as simple as a toggle that removes that filter.
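
Worth noting for anyone reading this later: newer Dagster releases did add that kind of toggle to the run status and failure sensors. A minimal sketch, assuming a version whose run_failure_sensor accepts a monitor_all_repositories argument (check the API docs for your release) and reusing the hypothetical post_to_slack helper from above:

from dagster import RunFailureSensorContext, run_failure_sensor

from my_alerting import post_to_slack  # hypothetical helper, as above


# monitor_all_repositories=True drops the "current repository only" filter,
# so a single sensor sees failed runs from every repo on the instance.
@run_failure_sensor(monitor_all_repositories=True)
def instance_wide_failure_sensor(context: RunFailureSensorContext):
    run = context.dagster_run
    post_to_slack(channel="#data-alerts", text=f"Run {run.run_id} of {run.job_name} failed")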

Stephen Bailey (05/19/2022, 3:22 PM)
yeah, i mean i'm not saying don't do that 🙂

daniel (05/19/2022, 7:03 PM)
OK, we have a fix for the alert not firing that'll go out in Cloud later today, and a fix for them not showing up in the timeline view that should go out shortly thereafter

rex (05/19/2022, 11:03 PM)
Slack alerts should now fire on these types of failures (the fix just went out in Cloud). The timeline view should be fixed once https://github.com/dagster-io/dagster/pull/7972 is merged

Stephen Bailey (05/19/2022, 11:55 PM)
thanks!