# ask-community
n
Hello all! I am getting an error when trying to deploy a new repo & job to k8s. It shows up in Dagit, but the job gets stuck at “Starting” and never progresses or throws an error. In kubectl the pod shows as RUNNING. The pod only has one log entry:
2022-08-12 17:02:22 +0000 - dagster.code_server - INFO - Started Dagster code server for file /build/projects/service_price_history/service_price_history/run_job.py on port 4000 in process 1
But running `kubectl describe pod/$POD_NAME` I see the following error under events:
Warning  Unhealthy  8m38s  kubelet            Readiness probe failed: <_InactiveRpcError of RPC that terminated with:
           status = StatusCode.UNAVAILABLE
           details = "failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused"
           debug_error_string = "UNKNOWN:Failed to pick subchannel {created_time:"2022-08-12T17:02:21.15012514+00:00", children:[UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: Failed to connect to remote host: Connection refused {created_time:"2022-08-12T17:02:21.150121462+00:00", grpc_status:14}]}"
>
Unable to connect to gRPC server: 0
This message is fairly opaque and I am not sure how else to debug it. (restarting the pod did not help)
When I run the readiness check directly on the pod (`dagster api grpc-health-check -p 4000`) it succeeds… but the jobs still hang at “Starting”
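(For context, that readiness probe appears to just be the Helm chart running the same health check against the code server, so a rough sketch of what such a probe looks like in the user-code deployment spec would be the following; the exact values here are assumed, not copied from our chart:)
readinessProbe:
  exec:
    # same health check the user can run manually inside the pod
    command: ["dagster", "api", "grpc-health-check", "-p", "4000"]
  periodSeconds: 20
  timeoutSeconds: 10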
Turns out that warning was expected (or at least not the problem). Dagster failed to create the job pod but never detected or reported that error in the Dagit UI (which I consider a bug… it makes debugging difficult). Here is the actual error from `kubectl get events`:
Error creating: pods "dagster-run-22e193b0-e3f9-4235-b3c2-f58cdea227bf-" is forbidden: error looking up service account airedale-dagster/airedale-dagster-service-price-history-sa: serviceaccount "airedale-dagster-service-price-history-sa" not found
s
@johann - mind taking a look?
n
More info: the AWS EKS service account did not exist, so the failure to spin up the pod was expected. The “bug” on the Dagster side is that it does not detect and/or report this failure, and instead displays the job in the “Starting” state forever.
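In case it helps anyone else, a minimal sketch of creating the missing service account by hand, using the names from the error above (whether your Helm values should be creating it automatically is a separate question, and if the pod needs AWS access via IRSA it would also need an eks.amazonaws.com/role-arn annotation, omitted here):
apiVersion: v1
kind: ServiceAccount
metadata:
  # name and namespace taken from the "serviceaccount not found" error
  name: airedale-dagster-service-price-history-sa
  namespace: airedale-dagster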
j
Hi Nathan - yeah, ideally Dagster would surface this out of the box. We have a run monitoring feature that’s currently opt-in: https://docs.dagster.io/deployment/run-monitoring#run-monitoring
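For reference, opting in via the instance config looks roughly like this in dagster.yaml (key names per that docs page; if you deploy with the Helm chart I believe the equivalent lives under the dagsterDaemon.runMonitoring values):
run_monitoring:
  enabled: true
  # mark runs as failed if they sit in STARTING longer than this, instead of hanging forever
  start_timeout_seconds: 180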
n
@johann Thank you for the information. We will play with these settings.