Abhishek Agrawal
06/19/2023, 5:18 AM
dagster._core.errors.DagsterUserCodeProcessError: Exception: Timed out waiting for gRPC server to start after 180s with arguments: "/usr/local/bin/python -m dagster api grpc --lazy-load-user-code --socket /tmp/tmp7dvqfsym --heartbeat --heartbeat-timeout 30 --fixed-server-id 8a29931e-61e7-409d-b3a3-6dd9f0179f41 --log-level info --location-name spina.py --container-image australia-southeast1/test:latest --container-context {"k8s": {"env": [{"name": "ENV", "value": "staging"}], "env_config_maps": ["dagster-user-deployments-data-os-data-pipeline-user-env", "dat-pipeline-configmap"], "image_pull_policy": "Always", "namespace": "dagster", "service_account_name": "dagster-user-deployments-sa"}} -f spina.py -d /app". Most recent connection error: dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE
It's timing out at 180s while we are setting the timeout value at 300s using this -
Are we doing something wrong?

Abhishek Agrawal
06/19/2023, 9:15 AM

daniel
06/19/2023, 11:49 AM

Abhishek Agrawal
06/20/2023, 12:27 PM
additionalInstanceConfig bit? @owen tagging you as Daniel is away..

Abhishek Agrawal
06/20/2023, 12:38 PM
Not sure if the --startup-timeout setting is even working or not. The reason is, right now it is taking very close to 180s so we wanted to increase it to 300s.
Any way to confirm that? Does it show up somewhere..?

Abhishek Agrawal
06/22/2023, 2:16 AM
dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 60 seconds to complete.
File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 599, in _load_location
origin.reload_location(self.instance) if reload else origin.create_location()
File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 368, in create_location
return GrpcServerCodeLocation(self)
File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/code_location.py", line 626, in __init__
self,
File "/usr/local/lib/python3.7/site-packages/dagster/_api/snapshot_repository.py", line 29, in sync_get_streaming_external_repositories_data_grpc
repository_name,
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 351, in streaming_external_repository
defer_snapshots=defer_snapshots,
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 185, in _streaming_query
e, timeout=timeout, custom_timeout_message=custom_timeout_message
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 138, in _raise_grpc_exception
) from e
The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"@1687399795.751577543","description":"Error received from peer ipv4:10.101.0.212:3030","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 181, in _streaming_query
method, request=request_type(**kwargs), timeout=timeout
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 169, in _get_streaming_response
yield from getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
return self._next()
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
raise self
Abhishek Agrawal
06/22/2023, 6:33 AM

Abhishek Agrawal
06/22/2023, 2:32 PM

alex
06/22/2023, 3:15 PM
Are you using codeServerArgs instead of dagsterApiGrpcArgs?
There is an environment variable you can set, DAGSTER_GRPC_TIMEOUT_SECONDS, to increase the timeout for each gRPC request up from 60 seconds.

alex
06/22/2023, 3:15 PM
> In the k8s logs, I see that code server was shut down for some reason
What precisely are you seeing in the logs?

alex
06/22/2023, 3:17 PM

Abhishek Agrawal
06/22/2023, 3:38 PM
- name: "pipeline"
  image:
    repository: "<image path>"
    tag: latest
    pullPolicy: Always
  codeServerArgs:
    - "--python-file"
    - "spina.py"
    - "--startup-timeout"
    - "300"
  port: 3030
  envConfigMaps:
    - name: test-configmap
  env:
    - name: ENV
      value: "__MTX_ENV"

additionalInstanceConfig:
  code_servers:
    reload_timeout: 300 # value in seconds to wait
This is how the yaml looks. So, I should remove the additionalInstanceConfig bit? I will try adding the environment variable.

> additional useful context would be what you are doing to generate your dagster definitions and the size of the resulting definitions
In our code definitions load, we query an API to get metadata for our customers. So, it takes around 200 seconds to load the definitions. We are thinking of ways to optimise, but this is what we have right now.

> what precisely are you seeing in the logs
I saw a log that the code server was shut down, and the UI was showing a failure to load definitions. I pressed reload and it worked. Thanks for your reply.
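One common mitigation for a slow definitions load like the one described above, sketched purely as an illustration: cache the API response with a TTL so repeated loads and reloads don't always pay the full cost. fetch_customer_metadata is a hypothetical stand-in for the real metadata API call, and the 15-minute TTL is an arbitrary example value.

```python
import time

# Hypothetical stand-in for the slow customer-metadata API call
# described above; the real call is whatever the definitions load runs.
def fetch_customer_metadata():
    return {"customers": ["acme", "globex"]}

_CACHE = {"value": None, "fetched_at": 0.0}
TTL_SECONDS = 15 * 60  # arbitrary example: refresh at most every 15 minutes

def get_customer_metadata(now=None):
    """Return cached metadata, re-fetching only after the TTL expires."""
    now = time.time() if now is None else now
    stale = _CACHE["value"] is None or (now - _CACHE["fetched_at"]) > TTL_SECONDS
    if stale:
        _CACHE["value"] = fetch_customer_metadata()
        _CACHE["fetched_at"] = now
    return _CACHE["value"]
```

Whether this is safe depends on how fresh the customer metadata must be; it trades staleness for load time.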
alex
06/22/2023, 3:56 PM
> So, I should remove additionalInstanceConfig bit?
No, I think that's fine to keep.
One fundamental trade-off of using codeServerArgs to enable reloads is that it consumes more memory. Have you cross-referenced any information on your k8s cluster to see if you are now getting processes OOM killed?

Abhishek Agrawal
06/22/2023, 4:04 PM

alex
06/22/2023, 4:07 PM
> What do you suggest to set up some monitoring to check code location status?
You should be able to figure out a GraphQL query that you can poll. Another angle would be to alert on OOM kills in Kubernetes.
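A sketch of what such polling could look like. The /graphql endpoint on the dagit service is standard, but the exact query shape below is an assumption and should be verified in dagit's GraphiQL playground; parse_failed_locations is a hypothetical helper that just picks failing entries out of a response, and the URL is a placeholder.

```python
import json
import urllib.request

# Query shape is an assumption; verify it in dagit's GraphiQL playground.
WORKSPACE_QUERY = """
query CodeLocationStatus {
  workspaceOrError {
    ... on Workspace {
      locationEntries {
        name
        locationOrLoadError {
          __typename
          ... on PythonError { message }
        }
      }
    }
  }
}
"""

def parse_failed_locations(response):
    """Hypothetical helper: return names of code locations whose load failed."""
    entries = response["data"]["workspaceOrError"]["locationEntries"]
    return [
        e["name"]
        for e in entries
        if (e.get("locationOrLoadError") or {}).get("__typename") == "PythonError"
    ]

def poll_dagit(url="http://dagit.dagster.svc/graphql"):
    # URL is a placeholder; point it at your dagit service.
    body = json.dumps({"query": WORKSPACE_QUERY}).encode()
    req = urllib.request.Request(url, body, {"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=30) as resp:
        return parse_failed_locations(json.load(resp))
```

Run something like poll_dagit on a schedule and alert when it returns a non-empty list.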
Abhishek Agrawal
06/22/2023, 4:47 PM

alex
06/22/2023, 5:13 PM

Abhishek Agrawal
06/23/2023, 4:22 AM
With DAGSTER_GRPC_TIMEOUT_SECONDS added, we have 3 snippets in our yaml file to increase the timeout.
1.
codeServerArgs:
  - "--python-file"
  - "spina.py"
  - "--startup-timeout"
  - "300"
2.
additionalInstanceConfig:
  code_servers:
    reload_timeout: 300 # value in seconds to wait
3. (newest addition to the party)
env:
  - name: DAGSTER_GRPC_TIMEOUT_SECONDS
    value: "300"
Could I ask for some clarity here?

alex
06/23/2023, 1:58 PM
> --startup-timeout - 300
When we initially start a code server process, how long do we wait until we time out.
> reload_timeout
When doing a reload operation in the new proxy server, how long until we time out. Good chance this gets merged with the above in the future, since you likely want the same value.
> DAGSTER_GRPC_TIMEOUT_SECONDS
What we set the timeout to for the grpc library, to control how long we wait for any given rpc call to the code server.
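For reference, the three settings alex describes, gathered into one annotated sketch using the key names already shown in this thread; exact placement should be verified against your Helm chart version and dagster.yaml schema.

```yaml
# 1. Startup: how long to wait for the code server process to start.
codeServerArgs:
  - "--python-file"
  - "spina.py"
  - "--startup-timeout"
  - "300"

# 2. Reload: how long a reload operation may take (instance config).
additionalInstanceConfig:
  code_servers:
    reload_timeout: 300

# 3. Per-RPC: caller-side timeout for each gRPC call to the code server.
env:
  - name: DAGSTER_GRPC_TIMEOUT_SECONDS
    value: "300"
```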
Abhishek Agrawal
06/23/2023, 11:17 PM

Abhishek Agrawal
06/26/2023, 12:03 AM
dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 60 seconds to complete.
File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 599, in _load_location
origin.reload_location(self.instance) if reload else origin.create_location()
File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 368, in create_location
return GrpcServerCodeLocation(self)
File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/code_location.py", line 626, in __init__
self,
File "/usr/local/lib/python3.7/site-packages/dagster/_api/snapshot_repository.py", line 29, in sync_get_streaming_external_repositories_data_grpc
repository_name,
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 351, in streaming_external_repository
defer_snapshots=defer_snapshots,
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 185, in _streaming_query
e, timeout=timeout, custom_timeout_message=custom_timeout_message
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 138, in _raise_grpc_exception
) from e
The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"@1687734643.829500646","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 181, in _streaming_query
method, request=request_type(**kwargs), timeout=timeout
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 169, in _get_streaming_response
yield from getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
return self._next()
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
raise self
The dagster-user-deployment k8s pod right now is still showing logs from sensor daemons, but I do see a log saying "Shutting down Dagster code server for file spina.py" from around 20 minutes back. So, it seems the code server went down for some reason. I thought setting the environment variable DAGSTER_GRPC_TIMEOUT_SECONDS to 300 would fix this, but apparently not.
Reloading the code from the UI would fix it, but it should not be failing now that we have increased the timeout value.
Our environment has become unstable because of this. Could you help?

Abhishek Agrawal
06/26/2023, 12:21 AM
ERROR:root:Code location reload failed: Repository location reload failed because of a PythonError error: dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 180 seconds to complete.
It is still saying 180 seconds! Could you check if the configuration is correct here and tell me if I need to make any changes?

alex
06/26/2023, 2:46 PM
> still seeing this … User code server request timed out due to taking longer than 60 seconds to complete
DAGSTER_GRPC_TIMEOUT_SECONDS is a caller-specified timeout, so you'll need to set that env var on the daemon and dagit pods.
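In Helm values, that could look roughly like the following. Note that how env is expressed under these sections can differ across chart versions (list vs map), so check your chart's values.yaml; this sketch uses the list form shown elsewhere in this thread.

```yaml
dagit:
  env:
    - name: DAGSTER_GRPC_TIMEOUT_SECONDS
      value: "300"

dagsterDaemon:
  env:
    - name: DAGSTER_GRPC_TIMEOUT_SECONDS
      value: "300"
```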
> not even sure if the timeout values I have put there are being honoured
That is strange. I see an additional local_startup_timeout setting
https://docs.dagster.io/deployment/dagster-instance#grpc-servers
that you can try to set as well.
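Given the thread's existing use of additionalInstanceConfig, that setting would presumably go in the same code_servers block; placement here is an assumption, and the linked docs page is authoritative.

```yaml
additionalInstanceConfig:
  code_servers:
    local_startup_timeout: 300
    reload_timeout: 300
```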
> Our environment has become unstable because of this
When was it last stable? Before moving from dagsterApiGrpcArgs to codeServerArgs? Did something change with the generation process causing it to take longer?

Abhishek Agrawal
06/26/2023, 3:46 PM
> DAGSTER_GRPC_TIMEOUT_SECONDS is a caller specified timeout, so you'll need to set that env var on the daemon and dagit pods
For dagit - I think I found the yaml file for dagit. It looks like this; I can just add the environment variable. For the daemon pods, how do I set it?
dagit:
  replicaCount: 1
  image:
    # When a tag is not supplied for a Dagster provided image,
    # it will default as the Helm chart version.
    repository: "australia-southeast1-docker.pkg.dev/warchest-develop/dockerhub/dagster/dagster-celery-k8s"
    tag: ~
    pullPolicy: Always
alex
06/26/2023, 3:55 PM
You can set it in the dagsterDaemon: section.

Abhishek Agrawal
06/30/2023, 10:20 PM
dagster._core.errors.DagsterUserCodeUnreachableError: User code server request timed out due to taking longer than 180 seconds to complete.
File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 599, in _load_location
origin.reload_location(self.instance) if reload else origin.create_location()
File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 350, in reload_location
self.create_client().reload_code(timeout=instance.code_server_reload_timeout)
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 300, in reload_code
return self._query("ReloadCode", api_pb2.ReloadCodeRequest, timeout=timeout)
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 158, in _query
e, timeout=timeout, custom_timeout_message=custom_timeout_message
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 138, in _raise_grpc_exception
) from e
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.DEADLINE_EXCEEDED
details = "Deadline Exceeded"
debug_error_string = "{"created":"@1688089225.995508927","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 155, in _query
return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 130, in _get_response
return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
I haven't tried local_startup_timeout yet, as we have the below -
codeServerArgs:
  - "--python-file"
  - "spina.py"
  - "--startup-timeout"
  - "300"
Should I still give it a go?

Abhishek Agrawal
07/08/2023, 3:01 AM