# dagster-plus
q
We have a hybrid setup, and yesterday an error impacted our schedules. I am seeing another error today:
```
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "DNS resolution failed for analytics-prod-3df4c5.dagster:4000: C-ares status is not ARES_SUCCESS qtype=A name=analytics-prod-3df4c5.dagster is_balancer=0: Could not contact DNS servers"
	debug_error_string = "{"created":"@16822.9984","description":"DNS resolution failed for analytics-prod-3df4c5.dagster:4000: C-ares status is not ARES_SUCCESS qtype=A name=analytics-prod-3df4c5.dagster is_balancer=0: Could not contact DNS servers","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}"
```
I'm not sure what's happening. Any help would be appreciated, because our scheduled jobs are not running and this is impacting our team and the others we serve. Thanks!
j
hey @Qwame, has anything changed in the configuration of your hybrid agent?
specifically, is the VPC the agent runs in still using Route 53 for DNS?
you can try restarting the code location with this in the UI, which might also resolve it
can you also share what version of Dagster your agent is running?
q
My agent is running version 1.2.0
Do you think I should upgrade my agent to the latest version of dagster?
AFAIK, we haven't changed anything in the configuration of our hybrid agent
@Joe Route 53 is an AWS service. Our agent runs on GKE, which has its own kube-dns client that can resolve other hostnames in the cluster and, presumably, public DNS as well.
j
ah sorry, we usually see this issue with ECS agents
did redeploying the code location with the errors resolve it?
q
It did, but it means I have to babysit this and check regularly to ensure the deployment health is okay. I'm not sure I understand what is happening. While I was posting this question, a running job stopped because of a "user code unreachable" error tied to this same DNS issue.
I had to manually re-execute two of the steps in the job.
j
makes sense, this is something you shouldn't have to think about. My guess is that the kube-dns pod that manages internal DNS was down or unavailable for some reason
if you have access to kubectl, you should be able to inspect the cluster and figure that out:
check whether the DNS pods restarted or terminated, and whether there are errors in their logs
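To narrow down whether it's a cluster-wide DNS problem, a quick resolution probe run from inside a pod can help. This is a generic sketch using Python's standard library, not a Dagster or gRPC API; the hostname in the comment is taken from the error above:

```python
import socket

def can_resolve(hostname: str, port: int = 4000) -> bool:
    """Return True if an A/AAAA lookup for hostname succeeds, False otherwise."""
    try:
        socket.getaddrinfo(hostname, port)
        return True
    except socket.gaierror:
        return False

# Run from a pod in the cluster (e.g. via `kubectl exec`) against the
# code server hostname from the error:
#   can_resolve("analytics-prod-3df4c5.dagster")
# If this returns False while the service exists, kube-dns is likely unhealthy.
```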
q
Is there a way to get a notification when an error like this occurs, i.e. an error that impacts the health of the entire deployment?
j
you should be able to set up some monitoring on your k8s DNS service. We're also currently exploring options to make the gRPC servers recover better from these types of issues
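In the meantime, a common client-side stopgap for transient UNAVAILABLE errors is retry with exponential backoff. This is a sketch of the general pattern, not Dagster's implementation; in a real gRPC client you would pass `retryable=(grpc.RpcError,)` and inspect the status code:

```python
import time

def call_with_retries(fn, retryable=(Exception,), attempts=3, base_delay=0.1):
    """Call fn(), retrying on the given exception types with exponential backoff.

    Re-raises the last exception once the attempt budget is exhausted.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except retryable:
            if attempt == attempts - 1:
                raise
            # 0.1s, 0.2s, 0.4s, ... between attempts
            time.sleep(base_delay * (2 ** attempt))
```

This rides out brief DNS blips without human intervention, but it won't fix a kube-dns pod that stays down, which is why the monitoring above is still worth having.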
👍 1