# ask-community
m
Any thoughts here on this error? Running the k8s executor, dagit/dagster version 1.2.7:
Operation name: JobMetadataQuery

Message: Failure loading edgeshare: dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 535, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 460, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 329, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/repository_location.py", line 606, in __init__
    self,
  File "/usr/local/lib/python3.7/site-packages/dagster/_api/snapshot_repository.py", line 29, in sync_get_streaming_external_repositories_data_grpc
    repository_name,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 336, in streaming_external_repository
    defer_snapshots=defer_snapshots,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 166, in _streaming_query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e

The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1682543922.819565004","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 163, in _streaming_query
    method, request=request_type(**kwargs), timeout=timeout
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 152, in _get_streaming_response
    yield from getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self


Path: ["assetNodes"]

Locations: [{"line":10,"column":3}]
fwiw: a fresh resolves this but then it keeps on happening.
a
StatusCode.DEADLINE_EXCEEDED means it took longer than 60 seconds for the dagit webserver to fetch the workspace snapshot (the representation of the definitions) from the code server via gRPC. Do you have a very large workspace in one code location? Many, many jobs/ops/assets? Otherwise it's possible limited resources are slowing things down. You can set the env var DAGSTER_GRPC_TIMEOUT_SECONDS to change the timeout.
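For reference, a minimal sketch of what that could look like on the container that makes the gRPC calls (the dagit webserver / daemon side); the exact manifest layout and the value of 120 are assumptions, so adapt it to however your Deployment or Helm values are managed:

  # Hypothetical container spec snippet: raise the gRPC client timeout from the default 60s
  env:
    - name: DAGSTER_GRPC_TIMEOUT_SECONDS
      value: "120"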
m
Okay, got it. I have about 150 jobs and 10 sensors running. I also make calls to the server via a GraphQL query to refresh the repo. Yet I see this frequently:
Dagster Reload Response: {'data': {'reloadRepositoryLocation': {'__typename': 'WorkspaceLocationEntry', 'name': 'edgeshare', 'locationOrLoadError': {'__typename': 'PythonError', 'message': 'dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
'}}}}
Would this be the same thing going on?
a
You may need to fetch more of the error object to see the chained exception and what the gRPC status code is, but I would speculate there's a good chance that it's the same thing.
👍 1
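As a rough sketch of what fetching the full error could look like from Python (the stack and cause fields are assumed from the GraphQL PythonError type, and the URL is a placeholder, so verify against your dagit's schema first):

# Hypothetical sketch: reload the code location and print the chained error,
# not just the top-level message, so the underlying grpc status code is visible.
import requests

DAGIT_URL = "http://localhost:3000/graphql"  # placeholder

RELOAD_MUTATION = """
mutation ReloadLocation($name: String!) {
  reloadRepositoryLocation(repositoryLocationName: $name) {
    __typename
    ... on WorkspaceLocationEntry {
      name
      locationOrLoadError {
        __typename
        ... on PythonError {
          message
          stack
          cause { message stack }
        }
      }
    }
  }
}
"""

resp = requests.post(
    DAGIT_URL,
    json={"query": RELOAD_MUTATION, "variables": {"name": "edgeshare"}},
    timeout=120,
)
resp.raise_for_status()
print(resp.json())  # look at cause.message for the underlying grpc status code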
If you have the right securityContext settings, you can use a profiler like py-spy to see what's taking the user code server so long.
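A rough sketch of that, assuming py-spy is installed in the user code image and the gRPC server runs as PID 1 in the container (the pod name is a placeholder):

# Hypothetical commands: dump the code server's thread stacks while a reload is hanging.
# Requires a securityContext that allows ptrace (e.g. the SYS_PTRACE capability).
kubectl exec -it <user-code-pod> -- py-spy dump --pid 1
# or record a flame graph over 30 seconds to see where the time goes
kubectl exec -it <user-code-pod> -- py-spy record --pid 1 --duration 30 -o profile.svg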
m
Okay, yeah, I'm only looking at response["errors"]. I'll expose the whole body.
a
How many ops/assets are in the 150 jobs? Is there any very large metadata attached to them?
There are some performance improvements in 1.3.2, coming out today/tomorrow, that may help.
m
Eh, ~500 ops, no metadata, and only a description.
I'm going to push a build and look at more of the response body
The DAGSTER_GRPC_TIMEOUT_SECONDS increase may have helped, but I'm still not 100% sure yet.