# dagster-plus
q
We have a hybrid setup, and yesterday an error impacted our schedules. I am seeing another error today:
```
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "DNS resolution failed for analytics-prod-3df4c5.dagster:4000: C-ares status is not ARES_SUCCESS qtype=A name=analytics-prod-3df4c5.dagster is_balancer=0: Could not contact DNS servers"
	debug_error_string = "{"created":"@16822.9984","description":"DNS resolution failed for analytics-prod-3df4c5.dagster:4000: C-ares status is not ARES_SUCCESS qtype=A name=analytics-prod-3df4c5.dagster is_balancer=0: Could not contact DNS servers","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}"
```
I'm not sure what's happening. Any help would be appreciated, because our scheduled jobs are not running and this is impacting our team and the others we serve. Thanks!
j
hey @Qwame, has anything changed in the configuration of your hybrid agent?
specifically, is the VPC the agent runs in still using Route 53 for DNS?
you can try restarting the code location with this in the UI, which might also resolve it
can you also share what version of Dagster your agent is running?
q
My agent is running version 1.2.0
Do you think I should upgrade my agent to the latest version of dagster?
AFAIK, we haven't changed anything in the configuration of our hybrid agent
@Joe Route 53 is an AWS service. Our agent runs on GKE, which has its own kube-dns client that can resolve other hostnames in the cluster and, presumably, public DNS as well.
j
ah sorry, we usually see this issue with ECS agents
did redeploying the code location with the errors resolve it?
q
It did, but it means I have to babysit this and check regularly to ensure the deployment health is okay. I'm not sure I understand what is happening. While I was posting this question, a running job stopped because of a "user code unreachable" error tied to this same DNS issue.
I had to manually re-execute two of the steps in the job.
j
makes sense, this is something you shouldn't have to think about. My guess is that the kube-dns pod that manages internal DNS was down or unavailable for some reason
if you have access to kubectl, you should be able to inspect the cluster and figure that out:
check whether the DNS pods restarted or terminated, and whether there are errors in their logs
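To narrow down whether it's a cluster-wide DNS problem, a quick resolution probe run from inside a pod can help. This is a generic sketch using Python's standard library, not a Dagster or gRPC API; the hostname in the comment is taken from the error above:

```python
import socket

def can_resolve(hostname: str, port: int = 4000) -> bool:
    """Return True if an A/AAAA lookup for hostname succeeds, False otherwise."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        return False

# Run from a pod in the cluster (e.g. via `kubectl exec`) against the
# code server hostname from the error:
#   can_resolve("analytics-prod-3df4c5.dagster")
# If this returns False while the service exists, kube-dns is likely unhealthy.
```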
q
Is there a way to get a notification when an error like this occurs, i.e. an error that impacts the health of the entire deployment?
j
you should be able to set up some monitoring on your k8s DNS service. We're also currently exploring options to make the gRPC servers recover better from these types of issues
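In the meantime, a common client-side stopgap for transient UNAVAILABLE errors is retry with exponential backoff. This is a sketch of the general pattern, not Dagster's implementation; in a real gRPC client you would pass `retryable=(grpc.RpcError,)` and inspect the status code:

```python
import time

def call_with_retries(fn, retryable=(Exception,), attempts=3, base_delay=0.1):
    """Call fn(), retrying on the given exception types with exponential backoff.

    Re-raises the last exception once the attempt budget is exhausted.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise
            # 0.1s, 0.2s, 0.4s, ... between attempts
            time.sleep(base_delay * (2 ** attempt))
```

This rides out brief DNS blips without human intervention, but it won't fix a kube-dns pod that stays down, which is why the monitoring above is still worth having.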
👍 1