# ask-community
i
Hello! I'm investigating an issue with our production dagster implementation. We're running dagster in k8s using the official helm chart (version 1.0.12 currently). Everything works fine, but after a while (I couldn't pin down any specific "triggers") the Sensor and Scheduler daemons crash with this error: dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. To fix the issue we delete the dagster daemon pod and the new one picks things back up. A few questions: has anyone seen this issue before? This looks like some sort of race condition in those daemons; is there any way to prevent it from happening? (sending some stack traces in thread)
2023-02-01 20:02:12 +0000 - dagster.daemon.SensorDaemon - ERROR - Sensor daemon caught an error for sensor failure_dd : dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/sensor.py", line 481, in _process_tick_generator
    sensor_debug_crash_flags,
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/sensor.py", line 546, in _evaluate_sensor
    instigator_data.cursor if instigator_data else None,
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/repository_location.py", line 815, in get_external_sensor_execution_data
    cursor,
  File "/usr/local/lib/python3.7/site-packages/dagster/_api/snapshot_sensor.py", line 61, in sync_get_external_sensor_execution_data_grpc
    cursor=cursor,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 333, in external_sensor_execution
    sensor_execution_args
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 124, in _streaming_query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e

The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1675281732.303470649","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1675281732.303469909","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 122, in _streaming_query
    yield from response_stream
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self

2023-02-01 20:02:12 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: start_dd
2023-02-01 20:02:26 +0000 - dagster.daemon.SchedulerDaemon - WARNING - Could not load location glue-demo to check for schedules due to the following error: dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 559, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 483, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 333, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/repository_location.py", line 568, in __init__
    list_repositories_response = sync_list_repositories_grpc(self.client)
  File "/usr/local/lib/python3.7/site-packages/dagster/_api/list_repositories.py", line 19, in sync_list_repositories_grpc
    api_client.list_repositories(),
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 169, in list_repositories
    res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 115, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e

The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1675281745.928212625","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1675281745.928211325","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 112, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
as I mentioned, as soon as we replace the daemon pod, the sensor and scheduler daemons come back without issues
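for context, the workaround is just recycling the daemon pod, something like this (namespace is a placeholder for ours):
# find the daemon pod and delete it; the Deployment recreates it right away
kubectl get pods -n dagster | grep daemon
kubectl delete pod -n dagster <dagster-daemon-pod-name>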
j
Hi Ion, so dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server means that the gRPC connection from the Daemon Pod to the user deployment Pod failed. Do you see any restarts of the user deployments?
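A quick way to check for restarts, if it helps (namespace and pod names here are placeholders for however the chart is installed):
# the RESTARTS column and "Last State" show whether the user code servers were recycled
kubectl get pods -n dagster
kubectl describe pod -n dagster <user-code-deployment-pod> | grep -A 5 "Last State"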
i
yes, we do frequent deploys, so some of the user deployments can get restarted when newer docker images are available. I couldn't exactly pinpoint the timing of these errors to those deploys, though. There was at least one data point where this happened around a deploy
if this is indeed the problem, how should we handle this issue more gracefully? We just use helm upgrade on the chart, nothing special
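roughly what our deploy runs (release name, namespace, and values file are placeholders):
# bump the user code image tag in values.yaml, then upgrade the release in place
helm repo update
helm upgrade dagster dagster/dagster -n dagster -f values.yaml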
j
Once you’re in this state, do the scheduler and sensor daemons just continually log dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server?
i
no, they just seem to stop doing anything
and show in an Error state in the Dagit UI
j
odd. How long have you waited with them in that state
During a user deployment upgrade, you might see one error if they sent a request at the wrong time, but they should recover on the next iteration
i
more than 16 hours last time. We noticed through some data-age monitoring that some jobs hadn't run the day before, and when I looked I saw those two daemons were down
I see, yeah, that didn't happen. I know we're not on the latest version of dagster (currently on 1.0.12). Do you know if maybe the resiliency behavior you described was implemented in a later version?
Also, by next iteration, do you mean the "next heartbeat"?
j
not to my knowledge but it’s always easier to debug more recent versions.
the daemons have internal loops that they iterate over, every few seconds
when I looked I noticed those two daemons being down
had they logged anything new?
i
yeah, that's not what I saw in the logs. It looked like after the errors they showed up in an Error state and didn't log anything else
they're currently green, but I'll keep an eye on them and if they go down again, I'll try to gather more logs
j
That’d be great
It looked like after the errors they showed up in an Error state
I’m not quite sure what this means?
i
Well now they show a status of Running in green
when they fail, there's an Error state, or maybe Failed or something like that, in red
I didn't take a screenshot and don't remember exactly
it also had a link, and when clicked it showed me the exact same error that I saw in the logs on the daemon pod
Ok, I just simulated a failure in a user deployment, and it looks like it got the error and then recovered
Shows as running, with a View errors (3) link
it was different last time. It showed in an error state with a link to view the error
j
The latest experience is more what I’d expect. We keep old reported errors around for a day or so iirc. Need to add timestamps so that’s more clear.
LMK if you reproduce the previous behavior
i
as it goes, I can't reproduce it anymore 🙂 That makes me wonder if this only starts happening after the pod has been running for a while. I'll let the pod run and when it happens again, I'll try to debug it more deeply
Ok, so this happened again yesterday
It looks like it failed with the same errors, just in a different order, or maybe at the same time (between the SensorDaemon and the SchedulerDaemon) and then it stopped processing ticks
(attachment: logs.txt.py)
After this happened, there are no more logs on the pod (as if the process died)
The UI shows: [screenshot]
I attached to the dagster process using gdb and it looks like it's stuck on an accept syscall
(gdb) info program
	Using the running image of attached LWP 1.
Program stopped at 0x39172174.
Type "info stack" or "info registers" for more information.
(gdb) info stack
#0  0x00007f0c39172174 in __libc_accept (fd=134297424, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
#1  0x00007f0c08013750 in ?? ()
#2  0x0000000000000000 in ?? ()
(gdb) info threads
  Id   Target Id                  Frame
* 1    LWP 1 "dagster-daemon"     0x00007f0c39172174 in __libc_accept (fd=134297424, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
  2    LWP 9 "dagster-daemon"     0x00007f0c39172388 in __libc_recvfrom (fd=671094368, buf=0x189, len=0, flags=799579744, addr=..., addrlen=0xffffffff) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
  3    LWP 138 "dagster-daemon"   0x00007f0c39172174 in __libc_accept (fd=44549584, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
  4    LWP 139 "dagster-daemon"   0x00007f0c392778b3 in hol_append (more=0x7f0c39640c00, hol=0x7f0c396135d0 <_Py_CheckRecursionLimit>) at argp-help.c:866
  5    LWP 140 "dagster-daemon"   0x00007f0c39172174 in __libc_accept (fd=44549584, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
  6    LWP 21236 "dagster-daemon" 0x00007f0c39172174 in __libc_accept (fd=44549584, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
  7    LWP 21237 "dagster-daemon" 0x00007f0c39172174 in __libc_accept (fd=44549584, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
  8    LWP 21265 "dagster-daemon" 0x00007f0c39172174 in __libc_accept (fd=44549584, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
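for reference, attaching like that can be done roughly as follows (pod name is a placeholder; installing gdb ad hoc assumes a Debian-based image and root in the container):
kubectl exec -it -n dagster <dagster-daemon-pod> -- bash
apt-get update && apt-get install -y gdb    # ad-hoc install, for debugging only
gdb -p 1                                    # PID 1 is the dagster-daemon process in the container
(gdb) info threads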
j
Thanks for the detailed report! This is still dagster 1.0.12? What version of grpcio?
i
still dagster 1.0.12
let me check on grpcio
j
We recently pinned grpcio due to seeing hangs https://github.com/dagster-io/dagster/pull/11784
which python version?
i
we're on python 3.8.6
I'm actually not sure how to check for the grpcio version of dagster. In our user deployments, I checked a poetry.lock file and it's set at grpcio = ">=1.32.0,<1.48.1"
I checked the python site-packages on the daemon pod and it looks like it's grpcio-1.47.2
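in case it's useful, a couple of ways to check inside the daemon pod (pod name and namespace are placeholders):
kubectl exec -it -n dagster <dagster-daemon-pod> -- pip show grpcio
kubectl exec -it -n dagster <dagster-daemon-pod> -- python -c "import grpc; print(grpc.__version__)"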
j
You’re using the default image for the daemon?
i
yep, we're using the dagster provided helm chart
j
A py-spy dump in the deadlocked container may be helpful as well, if you’re able to get that
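Something along these lines should work, assuming py-spy can be installed ad hoc in the container (it needs ptrace access, so SYS_PTRACE may have to be enabled on the pod):
kubectl exec -it -n dagster <dagster-daemon-pod> -- pip install py-spy
kubectl exec -it -n dagster <dagster-daemon-pod> -- py-spy dump --pid 1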
i
sorry, I had to kill that pod to restore the daemon because it had been down for 18 hours by then
j
Ok. Is there anything special about your networking between nodes that could be causing the grpc failures?
i
I'm not sure 🙂
j
Or do you see any restarts in the grpc pods, or any error logs?
Where is your k8s cluster running, out of curiosity? (EKS/self hosted/etc.)
i
it's in Amazon EKS
j
I’m at a bit of a loss. Py-spy could be very helpful the next time it hangs. I wonder if you could remove labels from the hung pod so the deployment would replace it without you having to delete it
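The label trick would look roughly like this (the label key shown is the Deployment-managed one, so adjust to your selector):
# removing a selector label orphans the pod from its ReplicaSet, which then starts a
# replacement while the hung pod stays around for py-spy/gdb
kubectl label pod -n dagster <dagster-daemon-pod> pod-template-hash-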
Never a bad idea to upgrade, though I haven’t picked up on any fixes that would be relevant
The steps to repro would be:
• EKS cluster
• Install dagster helm chart 1.0.12
• Run some schedules and do helm upgrades for user deployments
• wait for a hang?
Any details I’m missing?
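The install side of that would look roughly like this (release name and namespace are placeholders):
helm repo add dagster https://dagster-io.github.io/helm
helm install dagster dagster/dagster --version 1.0.12 -n dagster --create-namespace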
i
yeah, that's basically what I noticed
looks like the hangs are preceded by these errors
though it doesn't always hang when these errors are present
it seems to happen after the daemon pod has been running for some time
like maybe several days to more than a week
we added some monitoring for when that happens so we can kill the pod and recreate it
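the recycle itself doesn't have to be a manual pod delete; a rollout restart of the daemon Deployment works too (deployment name depends on the release):
kubectl rollout restart -n dagster deployment/dagster-daemon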
I can try and gather more information with py-spy next time