Stephen Bailey05/19/2022, 2:54 PM
issues, apparently caused by my Kubernetes cluster getting swamped with requests from a backfill. The challenge is that, since these runs never actually start, they don't show up in the Status page timeline graph, and they also aren't emitting failure hooks to Slack/Datadog. I think one way to address the alerting issue is to add a failure sensor, but do I need to add a sensor to each repo, or is there a way to have the sensor read cross-repo?
"Failure to Start" shows up as a timeout in dagit.
```
Events:
  Type     Reason             Age                 From                Message
  ----     ------             ----                ----                -------
  Normal   NotTriggerScaleUp  17m (x31 over 22m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max node group size reached
  Warning  FailedScheduling   16m (x17 over 22m)  default-scheduler   0/6 nodes are available: 6 Insufficient cpu.
```
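Separate from the alerting question, one way to keep a backfill from swamping the cluster in the first place is to cap concurrent runs with Dagster's queued run coordinator in `dagster.yaml`. A sketch, assuming `QueuedRunCoordinator` is available in your version; the limit of 10 is an arbitrary example:

```yaml
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 10  # example cap; tune to your cluster's capacity
```

With this in place, backfill runs queue up instead of all requesting pods at once, which avoids the `FailedScheduling` / `Insufficient cpu` events above.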
daniel05/19/2022, 3:02 PM
Stephen Bailey05/19/2022, 3:12 PM
daniel05/19/2022, 3:16 PM
Stephen Bailey05/19/2022, 3:22 PM
daniel05/19/2022, 7:03 PM
rex05/19/2022, 11:03 PM
Stephen Bailey05/19/2022, 11:55 PM