Stephen Bailey05/19/2022, 2:54 PM
issues, apparently caused by my Kubernetes cluster getting swamped with requests from a backfill. The challenge is that, since these runs never actually start, they don't show up in the Status page timeline graph, and they also aren't emitting failure hooks to Slack/Datadog. I think one way to address the alerting issue is to add a failure sensor, but do I need to add a sensor to each repo, or is there a way to have the sensor read cross-repo?
"Failure to Start" shows up as a timeout in dagit.
```
Events:
  Type     Reason             Age                 From                Message
  ----     ------             ----                ----                -------
  Normal   NotTriggerScaleUp  17m (x31 over 22m)  cluster-autoscaler  pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max node group size reached
  Warning  FailedScheduling   16m (x17 over 22m)  default-scheduler   0/6 nodes are available: 6 Insufficient cpu.
```
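Separate from the alerting question, one way to keep a backfill from swamping the cluster in the first place is to cap concurrent runs with Dagster's queued run coordinator in `dagster.yaml`. A sketch, assuming `QueuedRunCoordinator` is available in your version; the limit of 10 is an arbitrary example:

```yaml
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 10  # example cap; tune to your cluster's capacity
```

With this in place, backfill runs queue up instead of all requesting pods at once, which avoids the `FailedScheduling` / `Insufficient cpu` events above.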
daniel05/19/2022, 3:02 PM
Stephen Bailey05/19/2022, 3:12 PM
daniel05/19/2022, 3:16 PM
Stephen Bailey05/19/2022, 3:22 PM
daniel05/19/2022, 7:03 PM
rex05/19/2022, 11:03 PM
Stephen Bailey05/19/2022, 11:55 PM