# ask-community
i
Hello! I'm investigating an issue with our production dagster implementation. We're running dagster in k8s using the official helm chart (version 1.0.12 currently). Everything works fine, but after a while (I couldn't pin down any specific "triggers") the Sensor and Scheduler daemons crash with this error: dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. To fix the issue we delete the dagster daemon pod and the new one picks things back up. A few questions: has anyone seen this issue before? This looks like some sort of race condition in those daemons; is there any way to prevent it from happening? (sending some stack traces in thread)
2023-02-01 20:02:12 +0000 - dagster.daemon.SensorDaemon - ERROR - Sensor daemon caught an error for sensor failure_dd : dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/sensor.py", line 481, in _process_tick_generator
    sensor_debug_crash_flags,
  File "/usr/local/lib/python3.7/site-packages/dagster/_daemon/sensor.py", line 546, in _evaluate_sensor
    instigator_data.cursor if instigator_data else None,
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/repository_location.py", line 815, in get_external_sensor_execution_data
    cursor,
  File "/usr/local/lib/python3.7/site-packages/dagster/_api/snapshot_sensor.py", line 61, in sync_get_external_sensor_execution_data_grpc
    cursor=cursor,
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 333, in external_sensor_execution
    sensor_execution_args
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 124, in _streaming_query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e

The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1675281732.303470649","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1675281732.303469909","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 122, in _streaming_query
    yield from response_stream
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 826, in _next
    raise self

2023-02-01 20:02:12 +0000 - dagster.daemon.SensorDaemon - INFO - Checking for new runs for sensor: start_dd
2023-02-01 20:02:26 +0000 - dagster.daemon.SchedulerDaemon - WARNING - Could not load location glue-demo to check for schedules due to the following error: dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 559, in _load_location
    location = self._create_location_from_origin(origin)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/workspace/context.py", line 483, in _create_location_from_origin
    return origin.create_location()
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/origin.py", line 333, in create_location
    return GrpcServerRepositoryLocation(self)
  File "/usr/local/lib/python3.7/site-packages/dagster/_core/host_representation/repository_location.py", line 568, in __init__
    list_repositories_response = sync_list_repositories_grpc(self.client)
  File "/usr/local/lib/python3.7/site-packages/dagster/_api/list_repositories.py", line 19, in sync_list_repositories_grpc
    api_client.list_repositories(),
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 169, in list_repositories
    res = self._query("ListRepositories", api_pb2.ListRepositoriesRequest)
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 115, in _query
    raise DagsterUserCodeUnreachableError("Could not reach user code server") from e

The above exception was caused by the following exception:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses"
	debug_error_string = "{"created":"@1675281745.928212625","description":"Failed to pick subchannel","file":"src/core/ext/filters/client_channel/client_channel.cc","file_line":3260,"referenced_errors":[{"created":"@1675281745.928211325","description":"failed to connect to all addresses","file":"src/core/lib/transport/error_utils.cc","file_line":167,"grpc_status":14}]}"
>

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster/_grpc/client.py", line 112, in _query
    response = getattr(stub, method)(request_type(**kwargs), timeout=timeout)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 946, in __call__
    return _end_unary_response_blocking(state, call, False, None)
  File "/usr/local/lib/python3.7/site-packages/grpc/_channel.py", line 849, in _end_unary_response_blocking
    raise _InactiveRpcError(state)
as I mentioned, as soon as we replace the daemon pod, the sensor and scheduler daemons come back without issues
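for context, the workaround is just recycling the daemon pod, something like this (namespace is a placeholder for ours):
# find the daemon pod and delete it; the Deployment recreates it right away
kubectl get pods -n dagster | grep daemon
kubectl delete pod -n dagster <dagster-daemon-pod-name>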
j
Hi Ion, so dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server means that the gRPC connection from the Daemon Pod to the user deployment Pod failed. Do you see any restarts of the user deployments?
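A quick way to check for restarts, if it helps (namespace and pod names here are placeholders for however the chart is installed):
# the RESTARTS column and "Last State" show whether the user code servers were recycled
kubectl get pods -n dagster
kubectl describe pod -n dagster <user-code-deployment-pod> | grep -A 5 "Last State"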
i
yes, we do frequent deploys, so some of the user deployments can get restarted when newer docker images are available. I couldn't exactly pinpoint the timing of these errors to those deploys, though. There was at least one data point where this happened around a deploy
if this is indeed the problem, how should we handle this issue more gracefully? We just use helm upgrade on the chart, nothing special
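roughly what our deploy runs (release name, namespace, and values file are placeholders):
# bump the user code image tag in values.yaml, then upgrade the release in place
helm repo update
helm upgrade dagster dagster/dagster -n dagster -f values.yaml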
j
Once you’re in this state, do the scheduler and sensor daemons just continually log dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server?
i
no, they just seem to stop doing anything
and show in an Error state in the Dagit UI
j
odd. How long have you waited with them in that state
During a user deployment upgrade, you might see one error if they sent a request at the wrong time, but they should recover on the next iteration
i
more than 16 hours last time. We noticed through some data-age monitoring that some jobs hadn't run the day before, and when I looked I saw those two daemons were down
I see, yeah, that didn't happen. I know we're not on the latest version of dagster (currently on 1.0.12). Do you know if maybe the resiliency behavior you described was implemented in a later version?
Also, by next iteration, do you mean the "next heartbeat"?
j
not to my knowledge but it’s always easier to debug more recent versions.
the daemons have internal loops that they iterate over, every few seconds
when I looked I noticed those two daemons being down
had they logged anything new?
i
yeah, that's not what I saw in the logs. It looked like after the errors they showed up in an Error state and didn't log anything else
they're currently green, but I'll keep an eye on them and if they go down again, I'll try to gather more logs
j
That’d be great
It looked like after the errors they showed up in an Error state
I’m not quite sure what this means?
i
Well now they show a status of Running in green
when they fail, there's an Error state, or maybe Failed or something like that, in red
I didn't take a screenshot and don't remember exactly
it also had a link, and when clicked it showed me the exact same error that I saw in the logs on the daemon pod
Ok, I just simulated a failure in a user deployment, and it looks like it got the error and then recovered
Shows as running, with a View errors (3) link
it was different last time. It showed in an error state with a link to view the error
j
The latest experience is more what I’d expect. We keep old reported errors around for a day or so iirc. Need to add timestamps so that’s more clear.
LMK if you reproduce the previous behavior
i
as it goes, I can't reproduce it anymore 🙂 That makes me wonder if this only starts happening after the pod has been running for a while. I'll let the pod run and when it happens again, I'll try to debug it more deeply
Ok, so this happened again yesterday
It looks like it failed with the same errors, just in a different order, or maybe at the same time (between the SensorDaemon and the SchedulerDaemon) and then it stopped processing ticks
(attachment: logs.txt.py)
After this happened, there are no more logs on the pod (as if the process died)
The UI shows: [screenshot]
I attached to the dagster process using gdb and it looks like it's stuck on an accept syscall
(gdb) info program
	Using the running image of attached LWP 1.
Program stopped at 0x39172174.
Type "info stack" or "info registers" for more information.
(gdb) info stack
#0  0x00007f0c39172174 in __libc_accept (fd=134297424, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
#1  0x00007f0c08013750 in ?? ()
#2  0x0000000000000000 in ?? ()
(gdb) info threads
  Id   Target Id                  Frame
* 1    LWP 1 "dagster-daemon"     0x00007f0c39172174 in __libc_accept (fd=134297424, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
  2    LWP 9 "dagster-daemon"     0x00007f0c39172388 in __libc_recvfrom (fd=671094368, buf=0x189, len=0, flags=799579744, addr=..., addrlen=0xffffffff) at ../sysdeps/unix/sysv/linux/recvfrom.c:27
  3    LWP 138 "dagster-daemon"   0x00007f0c39172174 in __libc_accept (fd=44549584, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
  4    LWP 139 "dagster-daemon"   0x00007f0c392778b3 in hol_append (more=0x7f0c39640c00, hol=0x7f0c396135d0 <_Py_CheckRecursionLimit>) at argp-help.c:866
  5    LWP 140 "dagster-daemon"   0x00007f0c39172174 in __libc_accept (fd=44549584, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
  6    LWP 21236 "dagster-daemon" 0x00007f0c39172174 in __libc_accept (fd=44549584, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
  7    LWP 21237 "dagster-daemon" 0x00007f0c39172174 in __libc_accept (fd=44549584, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
  8    LWP 21265 "dagster-daemon" 0x00007f0c39172174 in __libc_accept (fd=44549584, addr=..., len=0x0) at ../sysdeps/unix/sysv/linux/accept.c:26
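for reference, attaching like that can be done roughly as follows (pod name is a placeholder; installing gdb ad hoc assumes a Debian-based image and root in the container):
kubectl exec -it -n dagster <dagster-daemon-pod> -- bash
apt-get update && apt-get install -y gdb    # ad-hoc install, for debugging only
gdb -p 1                                    # PID 1 is the dagster-daemon process in the container
(gdb) info threads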
j
Thanks for the detailed report! This is still dagster 1.0.12? What version of grpcio?
i
still dagster 1.0.12
let me check on grpcio
j
We recently pinned grpcio due to seeing hangs https://github.com/dagster-io/dagster/pull/11784
which python version?
i
we're on python 3.8.6
I'm actually not sure how to check for the grpcio version of dagster. In our user deployments, I checked a poetry.lock file and it's set at grpcio = ">=1.32.0,<1.48.1"
I checked the python site-packages on the daemon pod and it looks like it's grpcio-1.47.2
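in case it's useful, a couple of ways to check inside the daemon pod (pod name and namespace are placeholders):
kubectl exec -it -n dagster <dagster-daemon-pod> -- pip show grpcio
kubectl exec -it -n dagster <dagster-daemon-pod> -- python -c "import grpc; print(grpc.__version__)"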
j
You’re using the default image for the daemon?
i
yep, we're using the dagster provided helm chart
j
A py-spy dump in the deadlocked container may be helpful as well, if you’re able to get that
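Something along these lines should work, assuming py-spy can be installed ad hoc in the container (it needs ptrace access, so SYS_PTRACE may have to be enabled on the pod):
kubectl exec -it -n dagster <dagster-daemon-pod> -- pip install py-spy
kubectl exec -it -n dagster <dagster-daemon-pod> -- py-spy dump --pid 1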
i
sorry, I had to kill that pod to restore the daemon because it had been down for 18 hours by then
j
Ok. Is there anything special about your networking between nodes that could be causing the grpc failures?
i
I'm not sure 🙂
j
Or do you see any restarts in the grpc pods, or any error logs?
Where is your k8s cluster running, out of curiosity? (EKS/self hosted/etc.)
i
it's in Amazon EKS
j
I’m at a bit of a loss. Py-spy could be very helpful the next time it hangs. I wonder if you could remove labels from the hung pod so the deployment would replace it without you having to delete it
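The label trick would look roughly like this (the label key shown is the Deployment-managed one, so adjust to your selector):
# removing a selector label orphans the pod from its ReplicaSet, which then starts a
# replacement while the hung pod stays around for py-spy/gdb
kubectl label pod -n dagster <dagster-daemon-pod> pod-template-hash-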
Never a bad idea to upgrade, though I haven’t picked up on any fixes that would be relevant
The steps to repro would be:
• EKS cluster
• Install dagster helm chart 1.0.12
• Run some schedules and do helm upgrades for user deployments
• wait for a hang?
Any details I’m missing?
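The install side of that would look roughly like this (release name and namespace are placeholders):
helm repo add dagster https://dagster-io.github.io/helm
helm install dagster dagster/dagster --version 1.0.12 -n dagster --create-namespace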
i
yeah, that's basically what I noticed
looks like the hangs are preceded by these errors
though it doesn't always hang when these errors are present
it seems to happen after the daemon pod has been running for some time
like maybe several days to more than a week
we added some monitoring for when that happens so we can kill the pod and recreate it
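the recycle itself doesn't have to be a manual pod delete; a rollout restart of the daemon Deployment works too (deployment name depends on the release):
kubectl rollout restart -n dagster deployment/dagster-daemon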
I can try and gather more information with py-spy next time