# deployment-kubernetes
c
any idea what would cause this error? I'm on GKE, and looking at Logs Explorer, there aren't any OOM or storage exceeded errors:
```
2022-05-23 17:26:42 +0000 - dagster - DEBUG - powerschool_resync - 51ab1aae-c307-4df5-aebb-d76373179284 - 1 - ENGINE_EVENT - Multiprocess executor: received termination signal - forwarding to active child processes
```
I'm guessing it could be CPU, if it's not just some freak occurrence, but I'm not sure what to look for. I don't see any logs pointing to CPU problems.
d
Are there any clues if you describe the pod? Sometimes this can happen if k8s is bringing down the whole node that the pod is on.
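A minimal sketch of the kind of inspection being suggested here, with `<run-pod>`, `<node-name>`, and `<namespace>` as placeholders for the actual names in the cluster:

```bash
kubectl get pods -n <namespace>                              # find the pod for the failed run
kubectl describe pod <run-pod> -n <namespace>                # the Events section usually explains terminations/evictions
kubectl get events -n <namespace> --sort-by=.lastTimestamp   # recent cluster events (evictions, node scale-down, etc.)
kubectl describe node <node-name>                            # taints, pressure conditions, whether the node is being drained
```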
c
using kubectl you mean?
d
yeah
c
i'll check
l
i haven’t looked deeply into errors like this, but previously, when I was exceeding CPU and my resource allocation wasn’t configured properly, my pods appeared to get stuck (jobs ran way longer than they should have). That was back when I was using xgboost without a limit on the number of threads.
c
so nothing suspect, but I'm now noticing that the initial run failed to start with:
```
Run b447e7fb-8a16-4d9a-b80d-a582c0bd7f12 has not started execution after 313.06851840019226 seconds, which is longer than the timeout of 300 seconds. Marking run as failed.
```
gonna dig through the logs some more and see what I come up with
d
That timeout is configurable in Dagster Cloud if you'd like it to give more time before deciding that the run is unlikely to ever start
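For reference, a rough sketch of bumping that timeout via the dagster-cloud CLI; the `run_monitoring.start_timeout_seconds` key, the `set-from-file` subcommand, and the 600-second value are assumptions to check against the current Dagster Cloud docs:

```bash
# Hypothetical deployment-settings change; key names assumed, not confirmed in this thread.
cat > deployment_settings.yaml <<'EOF'
run_monitoring:
  start_timeout_seconds: 600   # the failed run above timed out at the 300-second default
EOF
dagster-cloud deployment settings set-from-file deployment_settings.yaml
```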
typically there would be some clue in the pod / k8s job about why it didn't start up (is it possible your cluster doesn't have sufficient resources to schedule the pod?)
c
yeah, looks like something along those lines:
```
0/4 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1653326797}, that the pod didn't tolerate, 2 Insufficient memory, 3 Insufficient cpu.
```
I'm using Autopilot, so it might have been taking longer than usual to find a node
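For anyone hitting the same "Insufficient cpu / Insufficient memory" scheduling message, a few generic checks (placeholders, not specific to this cluster) that can show whether the cluster genuinely has no headroom or is just waiting on Autopilot to provision a node:

```bash
kubectl get events -n <namespace> --field-selector reason=FailedScheduling   # which pods are stuck Pending, and why
kubectl describe nodes | grep -A 8 "Allocated resources"                     # CPU/memory already requested on each node
kubectl top nodes                                                            # actual usage vs. capacity (needs metrics-server)
```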