# ask-community
n
Hello all! I am getting an error when trying to deploy a new repo & job to k8s. It shows up in Dagit, but the job gets stuck at “Starting” and never progresses or throws an error. In kubectl the pod shows as RUNNING. The pod only has one log entry:
2022-08-12 17:02:22 +0000 - dagster.code_server - INFO - Started Dagster code server for file /build/projects/service_price_history/service_price_history/run_job.py on port 4000 in process 1
But running `kubectl describe pod/$POD_NAME` I see the following error under events:
Warning  Unhealthy  8m38s  kubelet            Readiness probe failed: <_InactiveRpcError of RPC that terminated with:
           status = StatusCode.UNAVAILABLE
           details = "failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused"
           debug_error_string = "UNKNOWN:Failed to pick subchannel {created_time:"2022-08-12T17:02:21.15012514+00:00", children:[UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused {created_time:"2022-08-12T17:02:21.150121462+00:00", grpc_status:14}]}"
>
Unable to connect to gRPC server: 0
This message is fairly opaque and I am not sure how else to debug it. (restarting the pod did not help)
When I run the readiness check directly on the pod (`dagster api grpc-health-check -p 4000`) it succeeds… but the jobs still hang at “Starting”
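(For context, that readiness probe appears to just be the Helm chart running the same health check against the code server, so a rough sketch of what such a probe looks like in the user-code deployment spec would be the following; the exact values here are assumed, not copied from our chart:)
readinessProbe:
  exec:
    # same health check the user can run manually inside the pod
    command: ["dagster", "api", "grpc-health-check", "-p", "4000"]
  periodSeconds: 20
  timeoutSeconds: 10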
Turns out that warning was expected (or at least not the problem). Dagster failed to create the job pod but never detected or reported that error in the Dagit UI (which I consider a bug… it makes debugging difficult). Here is the actual error from `kubectl get events`:
Error creating: pods "dagster-run-22e193b0-e3f9-4235-b3c2-f58cdea227bf-" is forbidden: error looking up service account airedale-dagster/airedale-dagster-service-price-history-sa: serviceaccount "airedale-dagster-service-price-history-sa" not found
s
@johann - mind taking a look?
n
More info: the AWS EKS service account did not exist, so the failure to spin up the pod was expected. The “bug” on the Dagster side is that it does not detect and/or report this failure, and instead displays the job in the “Starting” state forever.
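In case it helps anyone else, a minimal sketch of creating the missing service account by hand, using the names from the error above (whether your Helm values should be creating it automatically is a separate question, and if the pod needs AWS access via IRSA it would also need an eks.amazonaws.com/role-arn annotation, omitted here):
apiVersion: v1
kind: ServiceAccount
metadata:
  # name and namespace taken from the "serviceaccount not found" error
  name: airedale-dagster-service-price-history-sa
  namespace: airedale-dagster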
j
Hi Nathan - yeah, ideally Dagster would surface this out of the box. We have a run monitoring feature that’s currently opt-in: https://docs.dagster.io/deployment/run-monitoring#run-monitoring
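For reference, opting in via the instance config looks roughly like this in dagster.yaml (key names per that docs page; if you deploy with the Helm chart I believe the equivalent lives under the dagsterDaemon.runMonitoring values):
run_monitoring:
  enabled: true
  # mark runs as failed if they sit in STARTING longer than this, instead of hanging forever
  start_timeout_seconds: 180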
n
@johann Thank you for the information. We will play with these settings.