Ion Scerbatiuc
02/10/2023, 6:05 PM
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. In order to fix the issue we delete the dagster daemon pod and the new one picks things back up. A few questions: has anyone seen this issue before? This looks like some sort of race condition in those daemons; is there any way to prevent it from happening? (sending some stack traces in thread)
Ion Scerbatiuc
02/10/2023, 6:06 PM
2023-02-01 20:02:12 +0000 - dagster.daemon.SensorDaemon - ERROR - Sensor daemon caught an error for sensor failure_dd : dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
Stack Trace:
File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/sensor.py", line 481, in _process_tick_generator
sensor_debug_crash_flags,
File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/sensor.py", line 546, in _evaluate_sensor
instigator_data.cursor if instigator_data else None,
File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/repository_location.py", line 815, in get_external_sensor_execution_data
cursor,
File "/usr/local/lib/python3.7/site-packages/dagster/_api/snapshot_sensor.py", line 61, in sync_get_external_sensor_execution_data_grpc
cursor=cursor,
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 333, in external_sensor_execution
sensor_execution_args
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 124, in _streaming_query
raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1675281732.303470649","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1675281732.303469909","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>
Stack Trace:
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 122, in _streaming_query
yield from response_stream
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
return self._next()
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
raise self
2023-02-01 20:02:12 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: start_dd
2023-02-01 20:02:26 +0000 - dagster.daemon.SchedulerDaemon - WARNING - Could not load location glue-demo to check for schedules due to the following error: dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server
Stack Trace:
File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 559, in _load_location
location = self._create_location_from_origin(origin)
File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 483, in _create_location_from_origin
return origin.create_location()
File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 333, in create_location
return GrpcServerRepositoryLocation(self)
File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/repository_location.py", line 568, in __init__
list_repositories_response = sync_list_repositories_grpc(self.client)
File "/usr/local/lib/python3.7/site-packages/dagster/_api/list_repositories.py", line 19, in sync_list_repositories_grpc
api_client.list_repositories(),
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 169, in list_repositories
res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 115, in _query
raise DagsterUserCodeUnreachableError("Could not reach user code server") from e
The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
status = StatusCode.UNAVAILABLE
details = "failed to connect to all addresses"
debug_error_string = "{"created":"@1675281745.928212625","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1675281745.928211325","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>
Stack Trace:
File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 112, in _query
response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
return _end_unary_response_blocking(state, call, False, None)
File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
raise _InactiveRpcError(state)
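Both traces end the same way: the gRPC channel reports StatusCode.UNAVAILABLE ("failed to connect to all addresses") and the client wraps it in DagsterUserCodeUnreachableError. As a rough illustration of the retry-with-backoff mitigation discussed later in the thread, here is a generic sketch; the exception class and helper below are hypothetical stand-ins, not Dagster APIs:

```python
import time

class UserCodeUnreachable(Exception):
    """Hypothetical stand-in for DagsterUserCodeUnreachableError."""

def call_with_retries(fn, attempts=4, base_delay=0.01):
    """Retry a flaky RPC-style call with exponential backoff.

    Generic sketch of the idea only; this is NOT Dagster's actual
    retry logic.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except UserCodeUnreachable:
            if attempt == attempts - 1:
                raise  # out of retries, surface the original error
            time.sleep(base_delay * (2 ** attempt))

# Simulate a user code server that is unreachable twice, then recovers.
calls = {"n": 0}

def flaky_rpc():
    calls["n"] += 1
    if calls["n"] < 3:
        raise UserCodeUnreachable("Could not reach user code server")
    return "ok"

print(call_with_retries(flaky_rpc))  # prints "ok" on the third attempt
```

Transient UNAVAILABLE errors (e.g. during a rolling restart of a user code deployment) would be absorbed by a wrapper like this, while a permanently down server still raises after the final attempt.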
johann
02/10/2023, 7:35 PM
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server means that the gRPC connection from the Daemon Pod to the user deployment Pod failed. Do you see any restarts of the user deployments?
Ion Scerbatiuc
02/10/2023, 7:40 PM
helm upgrade on the chart, nothing special
johann
02/10/2023, 7:43 PM
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server?
Ion Scerbatiuc
02/10/2023, 7:44 PM
Error state in the Dagit UI
Ion Scerbatiuc
02/10/2023, 7:47 PM
1.0.12). Do you know if maybe the resiliency behavior you described was implemented in a later version?
johann
02/10/2023, 7:52 PM
> when I looked I noticed those two daemons being down
had they logged anything new?
Ion Scerbatiuc
02/10/2023, 7:53 PM
Error state and didn't log anything else
johann
02/10/2023, 7:54 PM
> It looked like after the errors they showed up in Error state
I'm not quite sure what this means?
Ion Scerbatiuc
02/10/2023, 8:19 PM
Running in green
Ion Scerbatiuc
02/10/2023, 8:19 PM
Error state, or maybe Failed or something like that in red
Ion Scerbatiuc
02/10/2023, 8:24 PM
View errors (3) link
Ion Scerbatiuc
02/17/2023, 4:24 PM
gdb and it looks like it's stuck on an accept syscall
Ion Scerbatiuc
02/17/2023, 4:24 PM(gdb) info program
Using the running image of attached LWP 1.
Program stopped at 0x39172174.
Type "info stack" or "info registers" for more information.
(gdb) info stack
#0 0x00007f0c39172174 in __libc_accept (fd=134297424, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
#1 0x00007f0c08013750 in ?? ()
#2 0x0000000000000000 in ?? ()
(gdb) info threads
Id Target Id Frame
* 1 LWP 1 "dagster-daemon" 0x00007f0c39172174 in __libc_accept (fd=134297424, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
2 LWP 9 "dagster-daemon" 0x00007f0c39172388 in __libc_recvfrom (fd=671094368, buf=0x189, len=0, flags=799579744, addr=..., addrlen=0xffffffff) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
3 LWP 138 "dagster-daemon" 0x00007f0c39172174 in __libc_accept (fd=44549584, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
4 LWP 139 "dagster-daemon" 0x00007f0c392778b3 in hol_append (more=0x7f0c39640c00, hol=0x7f0c396135d0 <_Py_CheckRecursionLimit>) at argp-help.c:866
5 LWP 140 "dagster-daemon" 0x00007f0c39172174 in __libc_accept (fd=44549584, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
6 LWP 21236 "dagster-daemon" 0x00007f0c39172174 in __libc_accept (fd=44549584, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
7 LWP 21237 "dagster-daemon" 0x00007f0c39172174 in __libc_accept (fd=44549584, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
8 LWP 21265 "dagster-daemon" 0x00007f0c39172174 in __libc_accept (fd=44549584, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
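The gdb output above only shows native frames (most threads parked in __libc_accept). When you can still run code inside the stuck process, the standard library can produce the matching Python-level stacks. A small illustrative sketch, not part of Dagster; the helper name is made up:

```python
import sys
import threading
import traceback

def dump_python_stacks():
    """Return the current Python stack of every live thread as one string.

    A Python-level complement to a gdb backtrace: gdb shows where the
    native threads block (e.g. accept()), while this shows which Python
    frames they were executing at the time.
    """
    names = {t.ident: t.name for t in threading.enumerate()}
    out = []
    for ident, frame in sys._current_frames().items():
        out.append("Thread %s (%s):\n" % (ident, names.get(ident, "?")))
        out.extend(traceback.format_stack(frame))
    return "".join(out)

# Demo: a worker blocked on an Event shows up with its wait() frame.
started = threading.Event()
release = threading.Event()

def worker():
    started.set()
    release.wait()  # blocks until the main thread releases it

t = threading.Thread(target=worker, name="blocked-worker")
t.start()
started.wait()
print(dump_python_stacks())
release.set()
t.join()
```

Note that sys._current_frames is documented but underscore-prefixed, and the snapshot it returns may be slightly stale by the time you format it; for a hung daemon that is usually good enough.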
Ion Scerbatiuc
02/17/2023, 4:33 PM
grpcio = ">=1.32.0,<1.48.1"
Ion Scerbatiuc
02/17/2023, 4:36 PM
grpcio-1.47.2
johann
02/17/2023, 4:40 PM
py-spy dump in the deadlocked container may be helpful as well if you're able to get that
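If py-spy cannot be installed in the container, the stdlib faulthandler module can produce a similar all-threads dump from inside the process. This is an alternative I am suggesting here, not part of the advice above; faulthandler writes directly to a file descriptor, so it needs a real file rather than an in-memory buffer:

```python
import faulthandler
import tempfile

# Dump every thread's stack to a temporary file and read it back.
# faulthandler writes via the file's fileno(), so StringIO won't work.
with tempfile.TemporaryFile(mode="w+") as f:
    faulthandler.dump_traceback(file=f, all_threads=True)
    f.seek(0)
    dump = f.read()

print(dump)

# In a long-running daemon you could instead register a signal handler,
# e.g. faulthandler.register(signal.SIGUSR1, all_threads=True), and then
# `kill -USR1 <pid>` makes the process write all thread stacks to stderr
# without attaching any external tool.
```

Unlike py-spy, this requires cooperation from the target process (the handler must be registered before the hang), so it is most useful as a precaution baked into the image.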