# ask-community
a
Hey Team, we are getting a gRPC timeout error in our OSS deployment.
Copy code
dagster._core.errors.DagsterUserCodeProcessError: Exception: Timed out waiting for gRPC server to start after 180s with arguments: "/usr/local/bin/python -m dagster api grpc --lazy-load-user-code --socket /tmp/tmp7dvqfsym --heartbeat --heartbeat-timeout 30 --fixed-server-id 8a29931e-61e7-409d-b3a3-6dd9f0179f41 --log-level info --location-name spina.py --container-image australia-southeast1/test:latest --container-context {"k8s": {"env": [{"name": "ENV", "value": "staging"}], "env_config_maps": ["dagster-user-deployments-data-os-data-pipeline-user-env", "dat-pipeline-configmap"], "image_pull_policy": "Always", "namespace": "dagster", "service_account_name": "dagster-user-deployments-sa"}} -f spina.py -d /app". Most recent connection error: dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE
It's timing out at 180s even though we are setting the timeout value to 300s using this - are we doing something wrong?
@daniel not tagging you for a quick reply, just to let you know, as it might be related to the 1.3.9 release and this discussion - https://github.com/dagster-io/dagster/discussions/14709
d
Try adding --startup-timeout 300 to your codeServerArgs in the helm chart. Note that after today I will be out on leave for the next several weeks, but others on the team will be able to help out with questions
a
Do I still need the additionalInstanceConfig bit? @owen tagging you as Daniel is away.
Also, the code reload still fails at times and other times it reloads properly. I'm not sure if the --startup-timeout setting is even working or not. The reason is, right now it is taking very close to 180s, so we wanted to increase it to 300s. Any way to confirm that? Does it show up somewhere?
Even after making the changes suggested by Daniel, I am seeing this error at odd times (outside of deployment scenarios). I think Dagit intermittently reloads the code, but why is the timeout value not matching what I have specified in the yaml file? @owen
Copy code
dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 60 seconds to complete.

  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 599, in _load_location
    origin.reload_location(self.instance) if reload else origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 368, in create_location
    return GrpcServerCodeLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/code_location.py", line 626, in __init__
    self,
  File "/usr/local/lib/python3.7/site-packages/dagster/_api/snapshot_repository.py", line 29, in sync_get_streaming_external_repositories_data_grpc
    repository_name,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 351, in streaming_external_repository
    defer_snapshots=defer_snapshots,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 185, in _streaming_query
    e, timeout=timeout, custom_timeout_message=custom_timeout_message
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 138, in _raise_grpc_exception
    ) from e

The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1687399795.751577543","description":"Error received from peer ipv4:10.101.0.212:3030","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>

  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 181, in _streaming_query
    method, request=request_type(**kwargs), timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 169, in _get_streaming_response
    yield from getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
In the k8s logs, I see that the code server was shut down for some reason. I had to hit reload. Do you know why that might happen?
@rex could you help here? The environment is pretty unstable because of this..
a
just for clarification, you are currently using codeServerArgs instead of dagsterApiGrpcArgs? There is an environment variable you can set, DAGSTER_GRPC_TIMEOUT_SECONDS, to increase the timeout for each gRPC request up from 60 seconds
In the k8s logs, I see that code server was shut down for some reason
what precisely are you seeing in the logs
additional useful context would be what you are doing to generate your dagster definitions and the size of the resulting definitions
a
Copy code
- name: "pipeline"
    image:
      repository: "<image path>"
      tag: latest
      pullPolicy: Always
    codeServerArgs:
      - "--python-file"
      - "spina.py"
      - "--startup-timeout"
      - "300"
    port: 3030
    envConfigMaps:
     - name: test-configmap
    env:
    - name: ENV
      value: "__MTX_ENV"
    additionalInstanceConfig:
      code_servers:
        reload_timeout: 300 # value in seconds to wait
This is how the yaml looks. So, should I remove the additionalInstanceConfig bit? I will try adding the environment variable.
additional useful context would be what you are doing to generate your dagster definitions and the size of the resulting definitions
In our code definitions load, we query an API to get metadata for our customers, so it takes around 200 seconds to load the definitions. We are thinking of ways to optimise, but this is what we have right now.
what precisely are you seeing in the logs
I saw a log that the code server was shut down, and the UI was showing a failure to load definitions. I pressed reload and it worked. Thanks for your reply.
a
So, I should remove additionalInstanceConfig bit?
no, I think that's fine to keep. One fundamental trade-off of using codeServerArgs to enable reloads is that it consumes more memory. Have you cross-referenced any information on your k8s cluster to see if you are now getting processes OOM killed?
a
Hmm, I will need to check when that happens next. The dagster-user-deployment pod itself was active, but the last log was that the server had been shut down. What do you suggest to set up some monitoring to check code location status? I think we need to be alerted in scenarios like this.
a
What do you suggest to setup some monitoring to check code location status?
you should be able to figure out a GraphQL query that you can poll. Another angle would be to alert on OOM kills in Kubernetes
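As a starting point, a poller along these lines could work. This is a hedged sketch rather than an official recipe: the endpoint URL is a placeholder, and the workspaceOrError / locationEntries fields should be confirmed in Dagit's GraphiQL explorer for your Dagster version.
Copy code
# Hedged sketch of a code-location status poller.
# Assumptions: Dagit's GraphQL endpoint is reachable at DAGIT_GRAPHQL_URL, and the
# workspaceOrError / locationEntries fields exist in your Dagster version.
import requests

DAGIT_GRAPHQL_URL = "http://dagit:3000/graphql"  # assumption: point this at your Dagit service/port

STATUS_QUERY = """
query CodeLocationStatus {
  workspaceOrError {
    __typename
    ... on Workspace {
      locationEntries {
        name
        loadStatus
        locationOrLoadError {
          __typename
          ... on PythonError {
            message
          }
        }
      }
    }
    ... on PythonError {
      message
    }
  }
}
"""


def unhealthy_locations():
    """Return (location_name, reason) tuples for code locations that failed to load."""
    resp = requests.post(DAGIT_GRAPHQL_URL, json={"query": STATUS_QUERY}, timeout=30)
    resp.raise_for_status()
    workspace = resp.json()["data"]["workspaceOrError"]
    if workspace["__typename"] == "PythonError":
        return [("<workspace>", workspace["message"])]
    problems = []
    for entry in workspace["locationEntries"]:
        error = entry["locationOrLoadError"] or {}
        if error.get("__typename") == "PythonError":
            problems.append((entry["name"], error["message"]))
        elif entry["loadStatus"] != "LOADED":
            problems.append((entry["name"], "loadStatus=" + entry["loadStatus"]))
    return problems


if __name__ == "__main__":
    for name, reason in unhealthy_locations():
        # Hook your alerting (Slack webhook, PagerDuty, etc.) in here.
        print(f"code location {name} unhealthy: {reason}")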
a
Hmm, I see. Any possibility of adding a code location status function to the Python GraphQL client?
a
you can submit an issue requesting it or a PR to add it
👍 1
a
As you may have noticed, with the addition of the env variable DAGSTER_GRPC_TIMEOUT_SECONDS, we now have 3 snippets in our yaml file to increase the timeout. 1.
Copy code
codeServerArgs:
      - "--python-file"
      - "spina.py"
      - "--startup-timeout"
      - "300"
2.
Copy code
additionalInstanceConfig:
      code_servers:
        reload_timeout: 300 # value in seconds to wait
3. (newest addition to the party)
Copy code
- name: DAGSTER_GRPC_TIMEOUT_SECONDS
  value: "300"
Could I ask for some clarity here?
a
--startup-timeout 300
when we initially start a code server process, this controls how long we wait until we time out
reload_timeout
when doing a reload operation in the new proxy server, this controls how long we wait until we time out. Good chance this gets merged with the above in the future since you likely want the same value.
DAGSTER_GRPC_TIMEOUT_SECONDS
what the timeout is set to for the grpc library, controlling how long we wait for any given rpc call to the code server
a
Thanks @alex! Super helpful. Would be great to have 1 and 2 merged.
Hey @alex, I'm still seeing this -
Copy code
dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 60 seconds to complete.

  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 599, in _load_location
    origin.reload_location(self.instance) if reload else origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 368, in create_location
    return GrpcServerCodeLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/code_location.py", line 626, in __init__
    self,
  File "/usr/local/lib/python3.7/site-packages/dagster/_api/snapshot_repository.py", line 29, in sync_get_streaming_external_repositories_data_grpc
    repository_name,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 351, in streaming_external_repository
    defer_snapshots=defer_snapshots,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 185, in _streaming_query
    e, timeout=timeout, custom_timeout_message=custom_timeout_message
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 138, in _raise_grpc_exception
    ) from e

The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1687734643.829500646","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>

  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 181, in _streaming_query
    method, request=request_type(**kwargs), timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 169, in _get_streaming_response
    yield from getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
The dagster-user-deployment k8s pod right now is still showing logs from sensor daemons, but I do see a log saying "Shutting down Dagster code server for file spina.py" from around 20 minutes back. So, it seems the code server went down for some reason. I thought setting the environment variable DAGSTER_GRPC_TIMEOUT_SECONDS to 300 would fix this, but apparently not. Reloading the code from the UI fixes it, but it should not be failing now that we have increased the timeout value. Our environment has become unstable because of this. Could you help?
Actually, I am not even sure if the timeout values I have put there are being honoured or not. I made a GraphQL call via the Python client when the code server was down. I knew the call would fail, but look at the error message -
ERROR:root:Code location reload failed: Repository location reload failed because of a PythonError error: dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 180 seconds to complete.
It is still saying 180 seconds! Could you check if the configuration is correct here and tell me if I need to make any changes?
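For reference, a check like the one described above, made with the dagster-graphql Python client, would look roughly like the sketch below. The hostname and port are placeholders, and the reload_repository_location return fields should be confirmed against the installed client version.
Copy code
# Hedged sketch of checking a code location via the Python GraphQL client.
# Hostname and port are placeholders; "spina.py" is the location name from the error above.
from dagster_graphql import DagsterGraphQLClient, ReloadRepositoryLocationStatus

client = DagsterGraphQLClient("dagit", port_number=3000)  # assumption: adjust to your Dagit service

info = client.reload_repository_location("spina.py")
if info.status != ReloadRepositoryLocationStatus.SUCCESS:
    # info.message carries the server-side error, e.g. the DagsterUserCodeUnreachableError shown above
    print(f"Code location reload failed: {info.message}")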
a
still seeing this … User code server request timed out due to taking longer than 60 seconds to complete
DAGSTER_GRPC_TIMEOUT_SECONDS is a caller-specified timeout, so you’ll need to set that env var on the daemon and dagit pods
not even sure if the timeout values I have put there are being honoured
That is strange. I see an additional local_startup_timeout setting (https://docs.dagster.io/deployment/dagster-instance#grpc-servers) that you can try to set as well
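For illustration, placed next to the existing reload_timeout it might look like the snippet below; the local_startup_timeout key is the one described on the linked docs page, so confirm the name and placement for your Dagster version.
Copy code
additionalInstanceConfig:
  code_servers:
    reload_timeout: 300
    local_startup_timeout: 300 # from the grpc-servers section of the instance docs; confirm for your version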
Our environment has become unstable because of this
When was it last stable? Before moving from dagsterApiGrpcArgs to codeServerArgs? Did something change with the generation process, causing it to take longer?
a
DAGSTER_GRPC_TIMEOUT_SECONDS is a caller-specified timeout, so you’ll need to set that env var on the daemon and dagit pods
For dagit - I think I found the yaml file for dagit. It looks like this; I can just add the environment variable. For the daemon pods, how do I set it?
Copy code
dagit:
  replicaCount: 1
  image:
    # When a tag is not supplied for a Dagster provided image,
    # it will default as the Helm chart version.
    repository: "australia-southeast1-docker.pkg.dev/warchest-develop/dockerhub/dagster/dagster-celery-k8s"
    tag: ~
    pullPolicy: Always
a
in the dagsterDaemon: section
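As a rough sketch, the env var on both pods might look like the snippet below. Note that the exact shape of env (a key/value map here, versus a list of name/value entries) varies between Helm chart versions, so check the chart's values.yaml before applying.
Copy code
dagit:
  env:
    DAGSTER_GRPC_TIMEOUT_SECONDS: "300" # assumption: map form of env; some chart versions use a name/value list

dagsterDaemon:
  env:
    DAGSTER_GRPC_TIMEOUT_SECONDS: "300"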
a
Hey @alex, I added the environment variable to the dagit and daemon pods. In addition to that, I do have the 3 snippets in my user-deployment-values.yml file, but today, again, we got this after a deployment (not a random reload). It does work after a few refreshes, but what I'm not getting is the 180 seconds thing. We have set it to 300 in so many places, but it is still pulling 180s from somewhere.
Copy code
dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 180 seconds to complete.

  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 599, in _load_location
    origin.reload_location(self.instance) if reload else origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 350, in reload_location
    self.create_client().reload_code(timeout=instance.code_server_reload_timeout)
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 300, in reload_code
    return self._query("ReloadCode", api_pb2.ReloadCodeRequest, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 158, in _query
    e, timeout=timeout, custom_timeout_message=custom_timeout_message
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 138, in _raise_grpc_exception
    ) from e

The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.DEADLINE_EXCEEDED
    details = "Deadline Exceeded"
    debug_error_string = "{"created":"@1688089225.995508927","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>

  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 155, in _query
    return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 130, in _get_response
    return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
I haven't tried local_startup_timeout yet, as we already have the below -
Copy code
codeServerArgs:
      - "--python-file"
      - "spina.py"
      - "--startup-timeout"
      - "300
Should I still give it a go?
👌 1
Hey @alex, we are thinking of trying a different approach. Could you have a look at these questions? The 1st question is more important than the 2nd one, as we are thinking of creating a code location per customer.