# deployment-kubernetes
c
any idea what would cause this error? I'm on GKE, and looking at Logs Explorer, there aren't any OOM or storage exceeded errors:
```
2022-05-23 17:26:42 +0000 - dagster - DEBUG - powerschool_resync - 51ab1aae-c307-4df5-aebb-d76373179284 - 1 - ENGINE_EVENT - Multiprocess executor: received termination signal - forwarding to active child processes
```
I'm guessing it could be CPU, if it's not just some freak occurrence, but I'm not sure what to look for. I don't see any logs pointing to CPU problems.
d
Are there any clues if you describe the pod? Sometimes this can happen if k8s is bringing down the whole node that the pod is on.
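A minimal sketch of the kind of inspection being suggested here, with `<run-pod>`, `<node-name>`, and `<namespace>` as placeholders for the actual names in the cluster:

```bash
kubectl get pods -n <namespace>                              # find the pod for the failed run
kubectl describe pod <run-pod> -n <namespace>                # the Events section usually explains terminations/evictions
kubectl get events -n <namespace> --sort-by=.lastTimestamp   # recent cluster events (evictions, node scale-down, etc.)
kubectl describe node <node-name>                            # taints, pressure conditions, whether the node is being drained
```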
c
using kubectl you mean?
d
yeah
c
i'll check
l
i haven’t looked deeply into errors like this, but previously, when I was exceeding CPU and my resource allocation wasn’t configured properly, my pods appeared to get stuck (jobs ran way longer than they should have). That was back when I was using xgboost without a limit on the number of threads.
c
so nothing suspect, but I'm now noticing that the initial run failed to start with:
```
Run b447e7fb-8a16-4d9a-b80d-a582c0bd7f12 has not started execution after 313.06851840019226 seconds, which is longer than the timeout of 300 seconds. Marking run as failed.
```
gonna dig through the logs some more and see what I come up with
d
That timeout is configurable in Dagster Cloud if you'd like it to give more time before deciding that the run is unlikely to ever start
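For reference, a rough sketch of bumping that timeout via the dagster-cloud CLI; the `run_monitoring.start_timeout_seconds` key, the `set-from-file` subcommand, and the 600-second value are assumptions to check against the current Dagster Cloud docs:

```bash
# Hypothetical deployment-settings change; key names assumed, not confirmed in this thread.
cat > deployment_settings.yaml <<'EOF'
run_monitoring:
  start_timeout_seconds: 600   # the failed run above timed out at the 300-second default
EOF
dagster-cloud deployment settings set-from-file deployment_settings.yaml
```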
typically there would be some clue in the pod / k8s job about why it didn't start up (is it possible your cluster doesn't have sufficient resources to schedule the pod?)
c
yeah, looks like something along those lines:
```
0/4 nodes are available: 1 node(s) had taint {ToBeDeletedByClusterAutoscaler: 1653326797}, that the pod didn't tolerate, 2 Insufficient memory, 3 Insufficient cpu.
```
I'm using Autopilot, so it might have been taking longer than usual to find a node
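For anyone hitting the same "Insufficient cpu / Insufficient memory" scheduling message, a few generic checks (placeholders, not specific to this cluster) that can show whether the cluster genuinely has no headroom or is just waiting on Autopilot to provision a node:

```bash
kubectl get events -n <namespace> --field-selector reason=FailedScheduling   # which pods are stuck Pending, and why
kubectl describe nodes | grep -A 8 "Allocated resources"                     # CPU/memory already requested on each node
kubectl top nodes                                                            # actual usage vs. capacity (needs metrics-server)
```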