Hello Team, I’m looking for a guide on how to trou...
# ask-community
a
Hello Team, I’m looking for a guide on how to troubleshoot the connection to a deployment. I have dagster deployed on an AWS K8s (EKS) cluster. Everything worked 2 weeks ago. However, last week the cluster was redeployed, and so were my Helm charts. After that, I got this error message in the Dagit GUI:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with: status = StatusCode.UNAVAILABLE details = "upstream connect error or disconnect/reset before headers. reset reason: protocol error" debug_error_string = "{"created":"@1648544039.239355713","description":"Error received from peer ipv4:172.20.8.123:3030","file":"src/core/lib/surface/call.cc","file_line":903,"grpc_message":"upstream connect error or disconnect/reset before headers. reset reason: protocol error","grpc_status":14}" >
From within the dagit pod, I could telnet to the deployment k8s service (telnet my_user_app_deployment_name 3030 --> connected). May I get help on this issue? Thanks a lot.
d
Hi, if you go to the Workspace tab in dagit and press the reload button, do you still get the error? do you have a full stack trace for the error?
a
Hi Daniel, sorry for the late response (I'm in a bad timezone). Reloading doesn't help. This is all I could see in the logs on the dagit pod:
/usr/local/lib/python3.7/site-packages/dagster/core/workspace/context.py:560: UserWarning: Error loading repository location user-code-example:dagster.core.errors.DagsterUserCodeUnreachableError: Could not reach user code server

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/core/workspace/context.py", line 555, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/workspace/context.py", line 481, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/core/host_representation/origin.py", line 291, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/core/host_representation/repository_location.py", line 526, in __init__
    list_repositories_response = sync_list_repositories_grpc(self.client)
  File "/usr/local/lib/python3.7/site-packages/dagster/api/list_repositories.py", line 19, in sync_list_repositories_grpc
    api_client.list_repositories(),
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 164, in list_repositories
    res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 110, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e

The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "upstream connect error or disconnect/reset before headers. reset reason: protocol error"
	debug_error_string = "{"created":"@1648716422.905634674","description":"Error received from peer ipv4:172.20.39.151:3030","file":"src/core/lib/surface/call.cc","file_line":903,"grpc_message":"upstream connect error or disconnect/reset before headers. reset reason: protocol error","grpc_status":14}"
>

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/grpc/client.py", line 107, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)

  location_name=location_name, error_string=error.to_string()
d
if you redeploy the helm chart, does it still happen?
Would it be possible to share or DM your values.yaml file? This is using the dagster helm chart?
Lastly, what version were you on before, when it was working, and what version are you on now, when it is not?
a
yes, I tried to uninstall both helm charts and reinstall them - no luck
let me DM the values files
d
Are there any clues in the logs for the user code deployment pod?
a
no, there are no clues in that pod's logs, only the startup line:
2022-03-30 01:48:43 +0000 - dagster.code_server - INFO - Started Dagster code server for package analytx on port 3030 in process 1
is there any python code that I can run on the dagit pod to debug the connection?
I'm only using telnet to test
d
There's a dagster api grpc-health-check command that you could run on the dagit pod - e.g. dagster api grpc-health-check -p 4000. It will raise an error if there's an issue connecting to the gRPC server on that port, and return cleanly if it's able to connect
curious what that command outputs on the dagit pod
a
root@dagster-dagit-76d57b9598-vvdpn:/# nslookup
> k8s-swyftx-user-app
Server:		172.20.0.10
Address:	172.20.0.10#53

Name:	k8s-swyftx-user-app.dagster.svc.cluster.local
Address: 172.20.39.151
> 
root@dagster-dagit-76d57b9598-vvdpn:/# telnet k8s-swyftx-user-app 3030
Trying 172.20.39.151...
Connected to k8s-swyftx-user-app.dagster.svc.cluster.local.
Escape character is '^]'.

^C
Connection closed by foreign host.
root@dagster-dagit-76d57b9598-vvdpn:/# ^C
root@dagster-dagit-76d57b9598-vvdpn:/# dagster api grpc-health-check -h k8s-swyftx-user-app -p 3030
<_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "upstream connect error or disconnect/reset before headers. reset reason: protocol error"
	debug_error_string = "{"created":"@1648761065.717335601","description":"Error received from peer ipv4:172.20.39.151:3030","file":"src/core/lib/surface/call.cc","file_line":903,"grpc_message":"upstream connect error or disconnect/reset before headers. reset reason: protocol error","grpc_status":14}"
>
root@dagster-dagit-76d57b9598-vvdpn:/#
d
And what exactly changed between when it was working before and when it stopped working? any details you can provide would help
specific versions, etc.
a
it was 0.14.3 when it was running last week
when I reinstalled the Helm charts, it became 0.14.6
our platform team (who manages the EKS cluster) said they installed istio
d
did any other changes happen in the cluster at the same time as the upgrade - are the user code deployments and dagit installed in the same cluster / expected to have network access to each other?
installing istio seems like it could be related for sure
a
I'm not aware of any other changes. The services are expected to have network access to each other.
d
Here's a thread with some other folks who ran into connection errors related to istio - https://dagster.slack.com/archives/C01U954MEER/p1634042336128100
a
I also suspect that was Istio. However, telnet shows that the networks are connected
d
I'm looking through this issue which seems possibly relevant https://github.com/istio/istio/issues/27513
but doesn't have an obvious resolution
a
thanks Dan
let me come back to talk to the EKS team
if you have some python code that I could use to mimic grpc-health-check, that would be great
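A minimal sketch of such a check, assuming the grpcio and grpcio-health-checking packages are available on the dagit pod and that the code server registers the standard gRPC health service under the name DagsterApi (both are assumptions, not confirmed in this thread). The helper names check_code_server and format_target are hypothetical.

```python
"""Probe a code server over gRPC, similar in spirit to grpc-health-check."""


def format_target(host: str, port: int) -> str:
    """Build the host:port target string that a gRPC channel expects."""
    if not 0 < port < 65536:
        raise ValueError(f"port out of range: {port}")
    return f"{host}:{port}"


def check_code_server(host: str, port: int, timeout: float = 10.0) -> str:
    """Return the health status name (e.g. "SERVING") or raise grpc.RpcError."""
    # Imported lazily so this module still loads where grpcio is missing.
    import grpc
    from grpc_health.v1 import health_pb2, health_pb2_grpc

    with grpc.insecure_channel(format_target(host, port)) as channel:
        stub = health_pb2_grpc.HealthStub(channel)
        # "DagsterApi" is an assumed service name; adjust if your server differs.
        request = health_pb2.HealthCheckRequest(service="DagsterApi")
        response = stub.Check(request, timeout=timeout)
    return health_pb2.HealthCheckResponse.ServingStatus.Name(response.status)
```

Calling check_code_server("k8s-swyftx-user-app", 3030) from a Python shell on the dagit pod exercises a real gRPC (HTTP/2) request. Unlike telnet, which only proves that a TCP connection can be opened, this should fail with the same StatusCode.UNAVAILABLE error when something in the path breaks the protocol.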
d
a
Thanks!
@daniel I made a bit of progress on this issue. According to our K8s team, the service's port name is incorrect (?). The correct name should be grpc instead of http. https://github.com/dagster-io/dagster/blob/3b55c4e864775b7a70ed8ff539629317a120250[…]ter/charts/dagster-user-deployments/templates/service-user.yaml After changing this, my dagit pod could connect to the dagster-app service. I'm not sure whether this service port name is a global convention, or whether it's specific to Istio.
Anyway, after that issue was cleared, I stumbled upon the same issue mentioned in this thread https://dagster.slack.com/archives/C01U954MEER/p1635400200206600
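On the port name question: as far as I know, Kubernetes itself attaches no protocol meaning to a Service port name, while Istio's explicit protocol selection reads it as <protocol>[-<suffix>]. A port named http is therefore treated as HTTP/1.1 by the sidecar, which would explain the protocol error on gRPC (HTTP/2) traffic. The sketch below is a simplified illustration of that convention, not Istio's actual implementation; the function name is hypothetical.

```python
from typing import Optional

# Protocol prefixes recognised by Istio's port naming convention.
ISTIO_PROTOCOLS = (
    "grpc-web", "http2", "https", "mongo", "mysql",
    "redis", "grpc", "http", "tcp", "tls",
)


def istio_protocol_from_port_name(name: str) -> Optional[str]:
    """Return the protocol a port name selects, or None if it selects none."""
    lowered = name.lower()
    # Longest prefixes first, so "grpc-web" is not read as "grpc" plus a suffix.
    for proto in sorted(ISTIO_PROTOCOLS, key=len, reverse=True):
        if lowered == proto or lowered.startswith(proto + "-"):
            return proto
    return None


print(istio_protocol_from_port_name("http"))  # the old port name
print(istio_protocol_from_port_name("grpc"))  # the renamed port
```

A name with no recognised prefix returns None here; in that case Istio falls back to its own protocol detection rather than to an explicit choice.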
d
Ah great! We can include that in the docs for future people who run into this
@Dagster Bot docs include instructions for getting the user code deployments working with istio
d
a
Regarding the issue mentioned by Mohammad above, the solution was to add a custom annotation to dagster-run pods. Since I'm using the software-defined asset approach, is there something similar to pod_template_spec_metadata?
I found it. Thanks.
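For anyone landing here with the same question, a rough sketch of how run pod annotations are usually expressed through the dagster-k8s/config tag. The thread does not say which annotation was used or where the option was found, so the annotation below (sidecar.istio.io/inject) and the helper name are illustrative assumptions only.

```python
from typing import Dict


def run_pod_annotation_tags(annotations: Dict[str, str]) -> dict:
    """Build job tags that ask dagster-k8s to annotate the run pods."""
    return {
        "dagster-k8s/config": {
            "pod_template_spec_metadata": {
                # Copy so later changes by the caller do not leak into the tags.
                "annotations": dict(annotations),
            }
        }
    }


# Hypothetical annotation; replace it with whatever your mesh setup requires.
example_tags = run_pod_annotation_tags({"sidecar.istio.io/inject": "false"})
print(example_tags)
```

The resulting dict would be passed as the tags argument of whichever job definition materializes the assets, assuming that definition accepts tags in your Dagster version.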
r
Hi, I don't see anything in the GitHub issue or the docs. I'm having issues connecting to the code location when using Istio too.