# ask-community
h
Hi team, could you help with a DEADLINE_EXCEEDED error with “file”: “src/core/lib/surface/call.cc” please? The repo appears to be unreachable from Dagit, but we can still see it running sensors.
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.DEADLINE_EXCEEDED details = "Deadline Exceeded" debug_error_string = "{"created":"@1678381231.466377137","description":"Error received from peer ipv4:172.30.227.74:3030","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}" >
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 122, in _streaming_query
    yield from response_stream
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
d
Hi Hebo - this usually comes from a sensor function that takes more than 60 seconds to run and times out - a full stack trace would help verify that. I think this error has been made less cryptic in more recent versions of Dagster
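To make that failure mode concrete, here is a minimal sketch (the job/sensor names and the sleep are illustrative, not taken from this thread) of a sensor whose evaluation runs past the default ~60-second gRPC deadline, which is the situation that surfaces as DEADLINE_EXCEEDED on the Dagit/daemon side:

```python
import time

from dagster import SkipReason, job, op, sensor


@op
def noop():
    pass


@job
def example_job():
    noop()


@sensor(job=example_job)
def slow_sensor_example(context):
    start = time.monotonic()

    # Everything in this body runs inside the user code gRPC server while the
    # caller (daemon or Dagit) waits on a single gRPC call with a ~60s default
    # deadline. A stand-in for slow work:
    time.sleep(90)

    # The server keeps running past the deadline, which is why the repo pod can
    # look healthy while the caller reports StatusCode.DEADLINE_EXCEEDED.
    return SkipReason(f"evaluation took {time.monotonic() - start:.0f}s")
```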
h
Thanks Daniel.. I only got this:
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 552, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 476, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 329, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/repository_location.py", line 547, in _init_
    list_repositories_response = sync_list_repositories_grpc(self.client)
  File "/usr/local/lib/python3.7/site-packages/dagster/_api/list_repositories.py", line 19, in sync_list_repositories_grpc
    api_client.list_repositories(),
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 169, in list_repositories
    res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 115, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with: status = StatusCode.UNAVAILABLE details = "failed to connect to all addresses" debug_error_string = "{"created":"@1678403306.434323254","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1678403306.434322899","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}" >
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 112, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in _call_
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
It just seems to error out from Dagit. The repo pod itself doesn’t have any error logs and is still running sensors
actually, I just realized this is a slightly different error than before. This time it’s
file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}
d
Are there any logs from your gRPC server / user code deployment that might help explain why it’s timing out? Did anything in particular change between when it was last working and when the problem started happening?
Could the server be overloaded? Does restarting it help?
h
There are no major changes other than adding more and more pipelines and assets
I did a reload from the UI and it helped. The repo pod mem and CPU are also fairly normal
@daniel Following up on this as
grpc_message":"Deadline Exceeded","grpc_status":4
seems to occur more frequently with our repo recently. When this happens, our repo is still processing sensors but just not responding to Dagit. Restarting the repo resolves it. Digging into the stack trace from Dagit, this seems to be due to a SQL query timeout. Would increasing the Postgres DB timeout help?
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1678897749.619204048","description":"Error received from peer ipv4:172.30.227.74:3030","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 122, in _streaming_query
    yield from response_stream
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self

  location_name=location_name, error_string=error.to_string()
An error occurred while resolving field Pipeline.runs
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1820, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.QueryCanceled: canceling statement due to statement timeout


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/graphql/execution/executor.py", line 452, in resolve_or_error
    return executor.execute(resolve_fn, source, info, **args)
  File "/usr/local/lib/python3.7/site-packages/graphql/execution/executors/sync.py", line 16, in execute
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/schema/pipelines/pipeline.py", line 749, in resolve_runs
    self._external_pipeline.name, kwargs.get("limit")
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/loader.py", line 200, in get_run_records_for_job
    return self._get(RepositoryDataType.JOB_RUNS, job_name, limit)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/loader.py", line 59, in _get
    self._fetch(data_type, limit)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/loader.py", line 72, in _fetch
    bucket_by=JobBucket(bucket_limit=limit, job_names=job_names),
  File "/usr/local/lib/python3.7/site-packages/dagster/_utils/__init__.py", line 631, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/instance/__init__.py", line 1315, in get_run_records
    filters, limit, order_by, ascending, cursor, bucket_by
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/storage/runs/sql_run_storage.py", line 452, in get_run_records
    rows = self.fetchall(query)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/storage/runs/sql_run_storage.py", line 82, in fetchall
    result_proxy = conn.execute(query)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1306, in execute
    return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 333, in _execute_on_connection
    self, multiparams, params, execution_options
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1508, in _execute_clauseelement
    cache_hit=cache_hit,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1863, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2044, in _handle_dbapi_exception
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1820, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.QueryCanceled) canceling statement due to statement timeout

[SQL: SELECT subquery.id, subquery.run_body, subquery.status, subquery.create_timestamp, subquery.update_timestamp, subquery.start_time, subquery.end_time 
FROM (SELECT runs.id AS id, runs.run_body AS run_body, runs.status AS status, runs.create_timestamp AS create_timestamp, runs.update_timestamp AS update_timestamp, runs.start_time AS start_time, runs.end_time AS end_time, rank() OVER (PARTITION BY runs.pipeline_name ORDER BY runs.id DESC) AS rank 
FROM runs 
WHERE runs.pipeline_name IN
 …
subquery 
WHERE subquery.rank <= %(rank_1)s ORDER BY subquery.rank ASC]
…
[parameters:
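On the Postgres side of that trace, `psycopg2.errors.QueryCanceled: canceling statement due to statement timeout` means the server-side statement_timeout setting cancelled the query, so that particular limit lives in the database (or the role/connection settings) rather than in Dagster. A rough sketch for checking the effective value, with placeholder connection details:

```python
import psycopg2  # same driver shown in the trace above

# Placeholder connection parameters; point these at the Dagster storage database.
conn = psycopg2.connect(
    host="dagster-postgres.example.internal",
    dbname="dagster",
    user="dagster",
    password="...",
)

with conn, conn.cursor() as cur:
    # statement_timeout is the server/session setting that produced QueryCanceled.
    cur.execute("SHOW statement_timeout;")
    print("statement_timeout =", cur.fetchone()[0])

conn.close()
```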
d
What version are you on again?
h
0.15.8
d
I see - if upgrading is an option, we've made substantial performance improvements across the board since that version
I'm having trouble understanding why that would be timing out a gRPC call though - are you making queries against the runs database while loading your code?
Or maybe those are two different errors that happen next to each other in the logs?
(the grpc timeout and the postgres timeout)
h
I see. Yeah, upgrade is a bit painful for us with our infra setup. We have to do it next quarter.
Hmm.. we are not making queries against the runs database while loading the code. However, we do query it inside the sensor.
d
Ah this is in a sensor? The stack trace you posted earlier was not within a sensor
Can you share the code of the sensor in the part where it queries the runs db?
h
the log is from Dagit. I think you are right, these are probably not related. We have another service that queries Dagit for pipeline runs. Maybe that’s what the second error is about
For the sensor code, we try to check upstream asset job runs
# Inside a project helper class; RunsFilter and PipelineRunStatus come from the
# dagster package, and EVENT_RECORD_PARTITION_KEY is a project-specific tag key.
runs = self.context.instance.get_runs(
    RunsFilter(
        job_name=f"fabricator_{source_name}_job",
        statuses=[PipelineRunStatus.SUCCESS],
        tags={EVENT_RECORD_PARTITION_KEY: partition},
    ),
    limit=1,
)
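For extra context, here is a minimal, self-contained sketch of how a check like that typically sits inside a sensor (the job names, tag key, and partition value below are placeholders, not the actual project code). The relevant point for this thread is that the whole evaluation function, including the get_runs query, executes inside the user code gRPC server on every tick:

```python
from dagster import (
    PipelineRunStatus,  # DagsterRunStatus in newer releases
    RunRequest,
    RunsFilter,
    SkipReason,
    job,
    op,
    sensor,
)


@op
def downstream_op():
    pass


@job
def downstream_job():
    downstream_op()


@sensor(job=downstream_job)
def upstream_success_sensor(context):
    # Placeholder names: look for a successful run of the upstream job for one partition.
    runs = context.instance.get_runs(
        RunsFilter(
            job_name="fabricator_example_source_job",
            statuses=[PipelineRunStatus.SUCCESS],
            tags={"event_record_partition_key": "2023-03-15"},
        ),
        limit=1,
    )
    if not runs:
        return SkipReason("no successful upstream run yet")
    return RunRequest(run_key=runs[0].run_id)
```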
Hi @daniel, I am wondering if you have any pointers on this error please? I checked again and I could confirm that the SQL timeout wasn’t relevant. However, the stack trace from Dagit doesn’t have any other info. We see this error in both Dagit and the Daemon. It seems that a Dagit instance isn’t able to recover from this error once it happens. However, our code repo has been responding to run sensors the whole time.
d
I've kind of lost the thread with all the different errors that have been posted here. Can you confirm which error is causing you problems?
and post a recent full stack trace? I see a couple different ones up there
h
sorry, this one:
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 552, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 476, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 329, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/repository_location.py", line 583, in __init__
    self,
  File "/usr/local/lib/python3.7/site-packages/dagster/_api/snapshot_repository.py", line 29, in sync_get_streaming_external_repositories_data_grpc
    repository_name,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 265, in streaming_external_repository
    external_repository_origin
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 124, in _streaming_query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.DEADLINE_EXCEEDED details = "Deadline Exceeded" debug_error_string = "{"created":"@1679605240.260476809","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}" >
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 122, in _streaming_query
    yield from response_stream
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
d
Ok, from just that error in isolation it's a little hard to tell what's going on. It seems like it's timing out while trying to load your code. Is there any way we can reproduce this problem ourselves to get more information?
There's of course the perennial fact that you all are on pre-1.0 code still, and substantial performance improvements have been made across the board since the version that you're currently using
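One cheap sanity check on the "timing out while loading your code" theory is to time how long the repository module takes to import on its own (the module path below is hypothetical; substitute whatever workspace.yaml points at). It won't capture everything the gRPC server does when serving the repository snapshot, but it quickly rules out an obviously slow import or definition build:

```python
import importlib
import time

# Hypothetical module path; use the module/package referenced in workspace.yaml.
MODULE = "my_user_code.repository"

start = time.monotonic()
importlib.import_module(MODULE)
print(f"imported {MODULE} in {time.monotonic() - start:.1f}s")
```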
h
yeah, we know an upgrade is overdue but it’s getting harder and harder as major upgrades require downtime and also draining the running jobs.
We are hoping for an upgrade in Q2
d
I think what would be helpful is to know what's happening on the gRPC server during those 60 seconds where the call is timing out
either via logs or something like py-spy, or a way to reproduce the timeout ourselves with similar code
h
It is actually hard to repro this.. it only happens to us intermittently as well. It also seems to only happen to certain Dagit pods, but not all of them
d
typically a timeout like that would indicate a problem on the gRPC server, not Dagit
h
The daemon seems to be able to recover from this error on its own. However, the Dagit pod doesn’t seem to.
d
I would expect reloading the code location in dagit to make it try again if this timeout happens
h
We have a few different user code repos, and this only happens to one of them.
d
that's definitely useful information - maybe sensors are exhausting resources on that box somehow - doing some analysis on the pod with a tool like py-spy can sometimes help explain why things are running slow
👍 1
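If getting py-spy into the pod is awkward, a stdlib-only fallback (a different technique from the py-spy suggestion above, sketched under the assumption that you can modify the user code image) is to register a faulthandler signal handler so that a signal dumps every thread's stack to stderr:

```python
# Put this in code that runs at gRPC server startup (e.g. the repository module).
import faulthandler
import signal

# After this, running `kill -USR1 <server pid>` inside the pod writes the current
# traceback of every thread to stderr, showing what the server is busy with while
# a ListRepositories / sensor call is timing out.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```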
h
I see. Does reloading the code location restart the user code repo pod?
d
it doesn't, but if the daemon recovers, I would expect refreshing in Dagit to do something similar
h
Oh.. so the reload button in Dagit essentially tries to re-connect to the existing user repo?
d
yeah
h
Thanks Daniel! This is all really helpful info. Let me look into it!