# ask-community
a
Hey Team, we are getting a gRPC timeout error in our OSS deployment.
Copy code
dagster._core.errors.DagsterUserCodeProcessError: Exception: Timed out waiting for gRPC server to start after 180s with arguments: "/usr/local/bin/python -m dagster api grpc --lazy-load-user-code --socket /tmp/tmp7dvqfsym --heartbeat --heartbeat-timeout 30 --fixed-server-id 8a29931e-61e7-409d-b3a3-6dd9f0179f41 --log-level info --location-name spina.py --container-image australia-southeast1/test:latest --container-context {"k8s": {"env": [{"name": "ENV", "value": "staging"}], "env_config_maps": ["dagster-user-deployments-data-os-data-pipeline-user-env", "dat-pipeline-configmap"], "image_pull_policy": "Always", "namespace": "dagster", "service_account_name": "dagster-user-deployments-sa"}} -f spina.py -d /app". Most recent connection error: dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE
It's timing out at 180s even though we are setting the timeout value to 300s using this - are we doing something wrong?
@daniel not tagging you for a quick reply, just to let you know, as it might be related to the 1.3.9 release and this discussion - https://github.com/dagster-io/dagster/discussions/14709
d
Try adding --startup-timeout 300 to your codeServerArgs in the helm chart. Note that after today I will be out on leave for the next several weeks, but others on the team will be able to help out with questions
a
Do I still need the additionalInstanceConfig bit? @owen tagging you as Daniel is away.
Also, the code reload still fails at times and other times it reloads properly. I'm not sure if the --startup-timeout setting is even working or not. The reason is, right now it is taking very close to 180s, so we wanted to increase it to 300s. Any way to confirm that? Does it show up somewhere?
Even after making the changes suggested by Daniel, I am seeing this error at odd times (outside of deployment scenarios). I think Dagit intermittently reloads the code, but why is the timeout value not matching what I have specified in the yaml file? @owen
Copy code
dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 60 seconds to complete.

  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 599, in _load_location
    origin.reload_location(self.instance) if reload else origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 368, in create_location
    return GrpcServerCodeLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/code_location.py", line 626, in __init__
    self,
  File "/usr/local/lib/python3.7/site-packages/dagster/_api/snapshot_repository.py", line 29, in sync_get_streaming_external_repositories_data_grpc
    repository_name,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 351, in streaming_external_repository
    defer_snapshots=defer_snapshots,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 185, in _streaming_query
    e, timeout=timeout, custom_timeout_message=custom_timeout_message
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 138, in _raise_grpc_exception
    ) from e

The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1687399795.751577543","description":"Error received from peer ipv4:10.101.0.212:3030","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>

  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 181, in _streaming_query
    method, request=request_type(**kwargs), timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 169, in _get_streaming_response
    yield from getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
In the k8s logs, I see that the code server was shut down for some reason. I had to hit reload. Do you know why that might happen?
@rex could you help here? The environment is pretty unstable because of this..
a
just for clarification, you are currently using codeServerArgs instead of dagsterApiGrpcArgs? There is an environment variable you can set, DAGSTER_GRPC_TIMEOUT_SECONDS, to increase the timeout for each gRPC request up from 60 seconds
In the k8s logs, I see that code server was shut down for some reason
what precisely are you seeing in the logs
additional useful context would be what you are doing to generate your dagster definitions and the size of the resulting definitions
a
Copy code
- name: "pipeline"
    image:
      repository: "<image path>"
      tag: latest
      pullPolicy: Always
    codeServerArgs:
      - "--python-file"
      - "spina.py"
      - "--startup-timeout"
      - "300"
    port: 3030
    envConfigMaps:
     - name: test-configmap
    env:
    - name: ENV
      value: "__MTX_ENV"
    additionalInstanceConfig:
      code_servers:
        reload_timeout: 300 # value in seconds to wait
This is how the yaml looks. So, should I remove the additionalInstanceConfig bit? I will try adding the environment variable.
additional useful context would be what you are doing to generate your dagster definitions and the size of the resulting definitions
In our code definitions load, we query an API to get metadata for our customers, so it takes around 200 seconds to load the definitions. We are thinking of ways to optimise, but this is what we have right now.
what precisely are you seeing in the logs
I saw a log that the code server was shut down, and the UI was showing a failure to load definitions. I pressed reload and it worked. Thanks for your reply.
a
So, I should remove additionalInstanceConfig bit?
no, I think that's fine to keep. One fundamental trade-off of using codeServerArgs to enable reloads is that it consumes more memory. Have you cross-referenced any information on your k8s cluster to see if you are now getting processes OOM killed?
a
Hmm, I will need to check when that happens next. The dagster-user-deployment pod itself was active, but the last log was that the server had been shut down. What do you suggest to set up some monitoring to check code location status? I think we need to be alerted in scenarios like this.
a
What do you suggest to setup some monitoring to check code location status?
you should be able to figure out a GraphQL query that you can poll. Another angle would be to alert on OOM kills in Kubernetes
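As a starting point, a poller along these lines could work. This is a hedged sketch rather than an official recipe: the endpoint URL is a placeholder, and the workspaceOrError / locationEntries fields should be confirmed in Dagit's GraphiQL explorer for your Dagster version.
Copy code
# Hedged sketch of a code-location status poller.
# Assumptions: Dagit's GraphQL endpoint is reachable at DAGIT_GRAPHQL_URL, and the
# workspaceOrError / locationEntries fields exist in your Dagster version.
import requests

DAGIT_GRAPHQL_URL = "http://dagit:3000/graphql"  # assumption: point this at your Dagit service/port

STATUS_QUERY = """
query CodeLocationStatus {
  workspaceOrError {
    __typename
    ... on Workspace {
      locationEntries {
        name
        loadStatus
        locationOrLoadError {
          __typename
          ... on PythonError {
            message
          }
        }
      }
    }
    ... on PythonError {
      message
    }
  }
}
"""


def unhealthy_locations():
    """Return (location_name, reason) tuples for code locations that failed to load."""
    resp = requests.post(DAGIT_GRAPHQL_URL, json={"query": STATUS_QUERY}, timeout=30)
    resp.raise_for_status()
    workspace = resp.json()["data"]["workspaceOrError"]
    if workspace["__typename"] == "PythonError":
        return [("<workspace>", workspace["message"])]
    problems = []
    for entry in workspace["locationEntries"]:
        error = entry["locationOrLoadError"] or {}
        if error.get("__typename") == "PythonError":
            problems.append((entry["name"], error["message"]))
        elif entry["loadStatus"] != "LOADED":
            problems.append((entry["name"], "loadStatus=" + entry["loadStatus"]))
    return problems


if __name__ == "__main__":
    for name, reason in unhealthy_locations():
        # Hook your alerting (Slack webhook, PagerDuty, etc.) in here.
        print(f"code location {name} unhealthy: {reason}")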
a
Hmm, I see. Any possibility of adding a code location status function to the Python GraphQL client?
a
you can submit an issue requesting it or a PR to add it
👍 1
a
As you may have noticed, with the addition of the env variable DAGSTER_GRPC_TIMEOUT_SECONDS, we now have 3 snippets in our yaml file to increase the timeout. 1.
Copy code
codeServerArgs:
      - "--python-file"
      - "spina.py"
      - "--startup-timeout"
      - "300"
2.
Copy code
additionalInstanceConfig:
      code_servers:
        reload_timeout: 300 # value in seconds to wait
3. (newest addition to the party)
Copy code
- name: DAGSTER_GRPC_TIMEOUT_SECONDS
  value: "300"
Could I ask for some clarity here?
a
--startup-timeout 300
when we initially start a code server process, this controls how long we wait until we time out
reload_timeout
when doing a reload operation in the new proxy server, this controls how long we wait until we time out. Good chance this gets merged with the above in the future since you likely want the same value.
DAGSTER_GRPC_TIMEOUT_SECONDS
what the timeout is set to for the grpc library, controlling how long we wait for any given rpc call to the code server
a
Thanks @alex! Super helpful. Would be great to have 1 and 2 merged.
Hey @alex, I'm still seeing this -
Copy code
dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 60 seconds to complete.

  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 599, in _load_location
    origin.reload_location(self.instance) if reload else origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 368, in create_location
    return GrpcServerCodeLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/code_location.py", line 626, in __init__
    self,
  File "/usr/local/lib/python3.7/site-packages/dagster/_api/snapshot_repository.py", line 29, in sync_get_streaming_external_repositories_data_grpc
    repository_name,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 351, in streaming_external_repository
    defer_snapshots=defer_snapshots,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 185, in _streaming_query
    e, timeout=timeout, custom_timeout_message=custom_timeout_message
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 138, in _raise_grpc_exception
    ) from e

The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1687734643.829500646","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>

  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 181, in _streaming_query
    method, request=request_type(**kwargs), timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 169, in _get_streaming_response
    yield from getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
The dagster-user-deployment k8s pod right now is still showing logs from sensor daemons, but I do see a log saying "Shutting down Dagster code server for file spina.py" from around 20 minutes back. So, it seems the code server went down for some reason. I thought setting the environment variable DAGSTER_GRPC_TIMEOUT_SECONDS to 300 would fix this, but apparently not. Reloading the code from the UI fixes it, but it should not be failing now that we have increased the timeout value. Our environment has become unstable because of this. Could you help?
Actually, I am not even sure if the timeout values I have put there are being honoured or not. I made a GraphQL call via the Python client when the code server was down. I knew the call would fail, but look at the error message -
ERROR:root:Code location reload failed: Repository location reload failed because of a PythonError error: dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 180 seconds to complete.
It is still saying 180 seconds! Could you check if the configuration is correct here and tell me if I need to make any changes?
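For reference, a check like the one described above, made with the dagster-graphql Python client, would look roughly like the sketch below. The hostname and port are placeholders, and the reload_repository_location return fields should be confirmed against the installed client version.
Copy code
# Hedged sketch of checking a code location via the Python GraphQL client.
# Hostname and port are placeholders; "spina.py" is the location name from the error above.
from dagster_graphql import DagsterGraphQLClient, ReloadRepositoryLocationStatus

client = DagsterGraphQLClient("dagit", port_number=3000)  # assumption: adjust to your Dagit service

info = client.reload_repository_location("spina.py")
if info.status != ReloadRepositoryLocationStatus.SUCCESS:
    # info.message carries the server-side error, e.g. the DagsterUserCodeUnreachableError shown above
    print(f"Code location reload failed: {info.message}")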
a
still seeing this … User code server request timed out due to taking longer than 60 seconds to complete
DAGSTER_GRPC_TIMEOUT_SECONDS is a caller-specified timeout, so you’ll need to set that env var on the daemon and dagit pods
not even sure if the timeout values I have put there are being honoured
That is strange. I see an additional local_startup_timeout setting (https://docs.dagster.io/deployment/dagster-instance#grpc-servers) that you can try to set as well
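For illustration, placed next to the existing reload_timeout it might look like the snippet below; the local_startup_timeout key is the one described on the linked docs page, so confirm the name and placement for your Dagster version.
Copy code
additionalInstanceConfig:
  code_servers:
    reload_timeout: 300
    local_startup_timeout: 300 # from the grpc-servers section of the instance docs; confirm for your version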
Our environment has become unstable because of this
When was it last stable? Before moving from dagsterApiGrpcArgs to codeServerArgs? Did something change with the generation process, causing it to take longer?
a
DAGSTER_GRPC_TIMEOUT_SECONDS is a caller-specified timeout, so you’ll need to set that env var on the daemon and dagit pods
For dagit - I think I found the yaml file for dagit. It looks like this; I can just add the environment variable. For the daemon pods, how do I set it?
Copy code
dagit:
  replicaCount: 1
  image:
    # When a tag is not supplied for a Dagster provided image,
    # it will default as the Helm chart version.
    repository: "australia-southeast1-docker.pkg.dev/warchest-develop/dockerhub/dagster/dagster-celery-k8s"
    tag: ~
    pullPolicy: Always
a
in the dagsterDaemon: section
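As a rough sketch, the env var on both pods might look like the snippet below. Note that the exact shape of env (a key/value map here, versus a list of name/value entries) varies between Helm chart versions, so check the chart's values.yaml before applying.
Copy code
dagit:
  env:
    DAGSTER_GRPC_TIMEOUT_SECONDS: "300" # assumption: map form of env; some chart versions use a name/value list

dagsterDaemon:
  env:
    DAGSTER_GRPC_TIMEOUT_SECONDS: "300"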
a
Hey @alex, I added the environment variable to the dagit and daemon pods. In addition to that, I do have the 3 snippets in my user-deployment-values.yml file, but today, again, we got this after a deployment (not a random reload). It does work after a few refreshes, but what I'm not getting is the 180 seconds thing. We have set it to 300 in so many places, but it is still pulling 180s from somewhere.
Copy code
dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 180 seconds to complete.

  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 599, in _load_location
    origin.reload_location(self.instance) if reload else origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 350, in reload_location
    self.create_client().reload_code(timeout=instance.code_server_reload_timeout)
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 300, in reload_code
    return self._query("ReloadCode", api_pb2.ReloadCodeRequest, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 158, in _query
    e, timeout=timeout, custom_timeout_message=custom_timeout_message
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 138, in _raise_grpc_exception
    ) from e

The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.DEADLINE_EXCEEDED
    details = "Deadline Exceeded"
    debug_error_string = "{"created":"@1688089225.995508927","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>

  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 155, in _query
    return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 130, in _get_response
    return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
I haven't tried local_startup_timeout yet, as we already have the below -
Copy code
codeServerArgs:
      - "--python-file"
      - "spina.py"
      - "--startup-timeout"
      - "300
Should I still give it a go?
👌 1
Hey @alex, we are thinking of trying a different approach. Could you have a look at these questions? The 1st question is more important than the 2nd one, as we are thinking of creating a code location per customer.