# ask-community
h
Hi team, could you help with a DEADLINE_EXCEEDED error with “file”: “src/core/lib/surface/call.cc” please? The repo appears to be unreachable from Dagit, but we can still see it running sensors.
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.DEADLINE_EXCEEDED details = "Deadline Exceeded" debug_error_string = "{"created":"@1678381231.466377137","description":"Error received from peer ipv4:172.30.227.74:3030","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}" >
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 122, in _streaming_query
    yield from response_stream
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
d
Hi Hebo - this usually comes from a sensor function that takes more than 60 seconds to run and times out - a full stack trace would help verify that. I think this error has been made less cryptic in more recent versions of Dagster
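To make that failure mode concrete, here is a minimal sketch (the job/sensor names and the sleep are illustrative, not taken from this thread) of a sensor whose evaluation runs past the default ~60-second gRPC deadline, which is the situation that surfaces as DEADLINE_EXCEEDED on the Dagit/daemon side:

```python
import time

from dagster import SkipReason, job, op, sensor


@op
def noop():
    pass


@job
def example_job():
    noop()


@sensor(job=example_job)
def slow_sensor_example(context):
    start = time.monotonic()

    # Everything in this body runs inside the user code gRPC server while the
    # caller (daemon or Dagit) waits on a single gRPC call with a ~60s default
    # deadline. A stand-in for slow work:
    time.sleep(90)

    # The server keeps running past the deadline, which is why the repo pod can
    # look healthy while the caller reports StatusCode.DEADLINE_EXCEEDED.
    return SkipReason(f"evaluation took {time.monotonic() - start:.0f}s")
```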
h
Thanks Daniel.. I only got this:
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 552, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 476, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 329, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/repository_location.py", line 547, in _init_
    list_repositories_response = sync_list_repositories_grpc(self.client)
  File "/usr/local/lib/python3.7/site-packages/dagster/_api/list_repositories.py", line 19, in sync_list_repositories_grpc
    api_client.list_repositories(),
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 169, in list_repositories
    res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 115, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with: status = StatusCode.UNAVAILABLE details = "failed to connect to all addresses" debug_error_string = "{"created":"@1678403306.434323254","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1678403306.434322899","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}" >
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 112, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in _call_
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
It just seems to error out from Dagit. The repo pod itself doesn’t have any error logs and is still running sensors
actually, I just realized this is a slightly different error than before. This time it’s
file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}
d
Are there any logs from your gRPC server / user code deployment that might help explain why it’s timing out? Did anything in particular change between when it was last working and when the problem started happening?
Could the server be overloaded? Does restarting it help?
h
There are no major changes other than adding more and more pipelines and assets
I did a reload from the UI and it helped. The repo pod mem and CPU are also fairly normal
@daniel Following up on this as
grpc_message":"Deadline Exceeded","grpc_status":4
seems to occur more frequently with our repo recently. When this happens, our repo is still processing sensors but just not responding to Dagit. Restarting the repo resolves it. Digging into the stack trace from Dagit, this seems to be due to a SQL query timeout. Would increasing the Postgres DB timeout help?
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1678897749.619204048","description":"Error received from peer ipv4:172.30.227.74:3030","file":"src/core/lib/surface/call.cc","file_line":966,"grpc_message":"Deadline Exceeded","grpc_status":4}"
>

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 122, in _streaming_query
    yield from response_stream
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self

  location_name=location_name, error_string=error.to_string()
An error occurred while resolving field Pipeline.runs
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1820, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.QueryCanceled: canceling statement due to statement timeout


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/graphql/execution/executor.py", line 452, in resolve_or_error
    return executor.execute(resolve_fn, source, info, **args)
  File "/usr/local/lib/python3.7/site-packages/graphql/execution/executors/sync.py", line 16, in execute
    return fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/schema/pipelines/pipeline.py", line 749, in resolve_runs
    self._external_pipeline.name, kwargs.get("limit")
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/loader.py", line 200, in get_run_records_for_job
    return self._get(RepositoryDataType.JOB_RUNS, job_name, limit)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/loader.py", line 59, in _get
    self._fetch(data_type, limit)
  File "/usr/local/lib/python3.7/site-packages/dagster_graphql/implementation/loader.py", line 72, in _fetch
    bucket_by=JobBucket(bucket_limit=limit, job_names=job_names),
  File "/usr/local/lib/python3.7/site-packages/dagster/_utils/__init__.py", line 631, in inner
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/instance/__init__.py", line 1315, in get_run_records
    filters, limit, order_by, ascending, cursor, bucket_by
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/storage/runs/sql_run_storage.py", line 452, in get_run_records
    rows = self.fetchall(query)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/storage/runs/sql_run_storage.py", line 82, in fetchall
    result_proxy = conn.execute(query)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1306, in execute
    return meth(self, multiparams, params, _EMPTY_EXECUTION_OPTS)
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/sql/elements.py", line 333, in _execute_on_connection
    self, multiparams, params, execution_options
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1508, in _execute_clauseelement
    cache_hit=cache_hit,
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1863, in _execute_context
    e, statement, parameters, cursor, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 2044, in _handle_dbapi_exception
    sqlalchemy_exception, with_traceback=exc_info[2], from_=e
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/util/compat.py", line 208, in raise_
    raise exception
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/base.py", line 1820, in _execute_context
    cursor, statement, parameters, context
  File "/usr/local/lib/python3.7/site-packages/sqlalchemy/engine/default.py", line 732, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.QueryCanceled) canceling statement due to statement timeout

[SQL: SELECT subquery.id, subquery.run_body, subquery.status, subquery.create_timestamp, subquery.update_timestamp, subquery.start_time, subquery.end_time 
FROM (SELECT runs.id AS id, runs.run_body AS run_body, runs.status AS status, runs.create_timestamp AS create_timestamp, runs.update_timestamp AS update_timestamp, runs.start_time AS start_time, runs.end_time AS end_time, rank() OVER (PARTITION BY runs.pipeline_name ORDER BY runs.id DESC) AS rank 
FROM runs 
WHERE runs.pipeline_name IN
 …
subquery 
WHERE subquery.rank <= %(rank_1)s ORDER BY subquery.rank ASC]
…
[parameters:
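On the Postgres side of that trace, `psycopg2.errors.QueryCanceled: canceling statement due to statement timeout` means the server-side statement_timeout setting cancelled the query, so that particular limit lives in the database (or the role/connection settings) rather than in Dagster. A rough sketch for checking the effective value, with placeholder connection details:

```python
import psycopg2  # same driver shown in the trace above

# Placeholder connection parameters; point these at the Dagster storage database.
conn = psycopg2.connect(
    host="dagster-postgres.example.internal",
    dbname="dagster",
    user="dagster",
    password="...",
)

with conn, conn.cursor() as cur:
    # statement_timeout is the server/session setting that produced QueryCanceled.
    cur.execute("SHOW statement_timeout;")
    print("statement_timeout =", cur.fetchone()[0])

conn.close()
```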
d
What version are you on again?
h
0.15.8
d
I see - if upgrading is an option, we've made substantial performance improvements across the board since that version
I'm having trouble understanding why that would be timing out a gRPC call though - are you making queries against the runs database while loading your code?
Or maybe those are two different errors that happen next to each other in the logs?
(the grpc timeout and the postgres timeout)
h
I see. Yeah, upgrade is a bit painful for us with our infra setup. We have to do it next quarter.
Hmm.. we are not making queries against the runs database while loading the code. However, we do query it inside the sensor.
d
Ah this is in a sensor? The stack trace you posted earlier was not within a sensor
Can you share the code of the sensor in the part where it queries the runs db?
h
the log is from Dagit. I think you are right, these are probably not related. We have another service that queries Dagit for pipeline runs. Maybe that’s what the second error is about
For the sensor code, we try to check upstream asset job runs
# Inside a project helper class; RunsFilter and PipelineRunStatus come from the
# dagster package, and EVENT_RECORD_PARTITION_KEY is a project-specific tag key.
runs = self.context.instance.get_runs(
    RunsFilter(
        job_name=f"fabricator_{source_name}_job",
        statuses=[PipelineRunStatus.SUCCESS],
        tags={EVENT_RECORD_PARTITION_KEY: partition},
    ),
    limit=1,
)
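For extra context, here is a minimal, self-contained sketch of how a check like that typically sits inside a sensor (the job names, tag key, and partition value below are placeholders, not the actual project code). The relevant point for this thread is that the whole evaluation function, including the get_runs query, executes inside the user code gRPC server on every tick:

```python
from dagster import (
    PipelineRunStatus,  # DagsterRunStatus in newer releases
    RunRequest,
    RunsFilter,
    SkipReason,
    job,
    op,
    sensor,
)


@op
def downstream_op():
    pass


@job
def downstream_job():
    downstream_op()


@sensor(job=downstream_job)
def upstream_success_sensor(context):
    # Placeholder names: look for a successful run of the upstream job for one partition.
    runs = context.instance.get_runs(
        RunsFilter(
            job_name="fabricator_example_source_job",
            statuses=[PipelineRunStatus.SUCCESS],
            tags={"event_record_partition_key": "2023-03-15"},
        ),
        limit=1,
    )
    if not runs:
        return SkipReason("no successful upstream run yet")
    return RunRequest(run_key=runs[0].run_id)
```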
Hi @daniel, I am wondering if you have any pointers on this error please? I checked again and I could confirm that the SQL timeout wasn’t relevant. However, the stack trace from Dagit doesn’t have any other info. We see this error in both Dagit and the Daemon. It seems that a Dagit instance isn’t able to recover from this error once it happens. However, our code repo has been responding to run sensors the whole time.
d
I've kind of lost the thread with all the different errors that have been posted here. Can you confirm which error is causing you problems?
and post a recent full stack trace? I see a couple different ones up there
h
sorry, this one:
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 552, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 476, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 329, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/repository_location.py", line 583, in __init__
    self,
  File "/usr/local/lib/python3.7/site-packages/dagster/_api/snapshot_repository.py", line 29, in sync_get_streaming_external_repositories_data_grpc
    repository_name,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 265, in streaming_external_repository
    external_repository_origin
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 124, in _streaming_query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with: status = StatusCode.DEADLINE_EXCEEDED details = "Deadline Exceeded" debug_error_string = "{"created":"@1679605240.260476809","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}" >
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 122, in _streaming_query
    yield from response_stream
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self
d
Ok, from just that error in isolation it's a little hard to tell what's going on. It seems like it's timing out while trying to load your code. Is there any way we can reproduce this problem ourselves to get more information?
There's of course the perennial fact that you all are on pre-1.0 code still, and substantial performance improvements have been made across the board since the version that you're currently using
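One cheap sanity check on the "timing out while loading your code" theory is to time how long the repository module takes to import on its own (the module path below is hypothetical; substitute whatever workspace.yaml points at). It won't capture everything the gRPC server does when serving the repository snapshot, but it quickly rules out an obviously slow import or definition build:

```python
import importlib
import time

# Hypothetical module path; use the module/package referenced in workspace.yaml.
MODULE = "my_user_code.repository"

start = time.monotonic()
importlib.import_module(MODULE)
print(f"imported {MODULE} in {time.monotonic() - start:.1f}s")
```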
h
yeah, we know an upgrade is overdue but it’s getting harder and harder as major upgrades require downtime and also draining the running jobs.
We are hoping for an upgrade in Q2
d
I think what would be helpful is to know what's happening on the gRPC server during those 60 seconds where the call is timing out
either via logs or something like py-spy, or a way to reproduce the timeout ourselves with similar code
h
It is actually hard to repro this.. it only happens to us intermittently as well. It also seems to only happen to certain Dagit pods, but not all of them
d
typically a timeout like that would indicate a problem on the gRPC server, not Dagit
h
The daemon seems to be able to recover from this error on its own. However, the Dagit pod doesn’t seem to.
d
I would expect reloading the code location in dagit to make it try again if this timeout happens
h
We have a few different user code repos, and this only happens to one of them.
d
that's definitely useful information - maybe sensors are exhausting resources on that box somehow - doing some analysis on the pod with a tool like py-spy can sometimes help explain why things are running slow
👍 1
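If getting py-spy into the pod is awkward, a stdlib-only fallback (a different technique from the py-spy suggestion above, sketched under the assumption that you can modify the user code image) is to register a faulthandler signal handler so that a signal dumps every thread's stack to stderr:

```python
# Put this in code that runs at gRPC server startup (e.g. the repository module).
import faulthandler
import signal

# After this, running `kill -USR1 <server pid>` inside the pod writes the current
# traceback of every thread to stderr, showing what the server is busy with while
# a ListRepositories / sensor call is timing out.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```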
h
I see. Does reloading the code location restart the user code repo pod?
d
it doesn't, but if the daemon recovers, I would expect refreshing in Dagit to do something similar
h
Oh.. so the reload button in Dagit essentially tries to re-connect to the existing user repo?
d
yeah
h
Thanks Daniel! This is all really helpful info. Let me look into it!