# ask-community
j
Hello, we have a sensor that yields RunRequests for a job on every evaluation. The sensor is currently configured to evaluate every hour, and it uses run keys to track runs. When I test the sensor in a fresh `dagster dev` deployment with the "Test Sensor" feature in dagit, I see about 100 RunRequests. When I turn on the sensor completely, it will submit around 25 RunRequests every time the sensor ticks, but I would expect it to launch 100 RunRequests in the first tick. Can someone help me understand the mechanics of this behavior? In my search, I've learned that a sensor evaluation cannot last longer than 60 seconds. Could that be part of the issue? Or is there some other limit on the number of runs that can be submitted in a sensor evaluation?
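For context, here is a minimal sketch of the pattern being described. The job, op, and `fetch_pending_items` names are hypothetical stand-ins; the thread doesn't show the actual sensor code.
```python
from dagster import RunRequest, job, op, sensor


@op
def process_item(context):
    context.log.info("processing one work item")


@job
def my_job():
    process_item()


# Hypothetical stand-in for whatever source yields the ~100 work items.
def fetch_pending_items():
    return [f"item-{i}" for i in range(100)]


@sensor(job=my_job, minimum_interval_seconds=3600)  # evaluate roughly every hour
def my_sensor(context):
    for item in fetch_pending_items():
        # On the real sensor path, the daemon skips any RunRequest whose
        # run_key matches a run it has already launched for this sensor;
        # the "Test Sensor" flow does not apply that deduplication.
        yield RunRequest(run_key=item, run_config={})
```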
a
are there past runs that have been done with the `run_key`? I don’t think the “test sensor” flow does the deduplication that the real path does for run keys
j
@alex I don't believe so. We experimented with this more yesterday and found some interesting patterns. First, everyone was able to reproduce the behavior. Second, if we dial the sensor evaluation interval down to 30 seconds, all 100 RunRequests end up getting submitted after several sensor ticks, but each tick only submits a chunk of the RunRequests at a time, and the size of that chunk varies slightly from tick to tick. We also noticed that when runs were submitted on a sensor tick, the sensor's status showed up as "Failure", even though that chunk of runs executed as expected. The specific error for the sensor was this:
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE
After all the RunRequests have been submitted, the sensor status shows up as "Skipped" in subsequent ticks, which makes sense. It isn't clear to me why we would be getting the gRPC error though.
a
have you reproduced this in a deployed environment, or just locally with `dagster dev` (on multiple machines)?
j
Just locally with dagster dev on multiple machines. We are in the process of setting up a production deployment.
a
hmm, I think this situation may be a function of the default config for `dagster dev` not handling this large sensor throughput well, and failing out in a way that should be improved.

By default for local dev, submitted runs are directly launched as subprocesses. This can be changed via instance config to an async submit-to-queue model: https://docs.dagster.io/deployment/dagster-instance#queuedruncoordinator

Also by default for local dev, sensor evaluation happens in serial; there are threading options you can enable to increase throughput: https://docs.dagster.io/deployment/dagster-instance#sensor-evaluation

My current hypothesis is that the serial submissions doing blocking subprocess creation are the problem. I would speculate that if you looked closely at the logs, you would see the daemon process restarting due to a heartbeat timeout failure.
default “real” deployments have some of these options turned on out of the box, so I would expect the issue not to reproduce there. If you change your local instance settings to turn these things on (see the sketch below), it may fix the issue
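For reference, these settings live in `$DAGSTER_HOME/dagster.yaml` for the local instance. A rough sketch based on the linked docs (the concurrency and worker values are illustrative, not recommendations; check the docs for your Dagster version):
```yaml
# $DAGSTER_HOME/dagster.yaml
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 25   # illustrative cap; tune for your machine

sensors:
  use_threads: true
  num_workers: 8              # illustrative thread count for sensor evaluation
```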
j
@alex Thank you so much for taking the time! This is very helpful. I will try tweaking my local config to see if anything changes.
👍 1
a
let me know if you are seeing anything in the daemon log output upon closer inspection to confirm this line of thinking
j
I haven't found anything that mentions a "heartbeat timeout failure" in those exact words, but the daemon log does say "Sensor daemon caught an error for sensor my_sensor". In that root error, the traceback includes a call to a function that appears to do blocking work (`_end_unary_response_blocking`). This is the exception that is the root cause of the exception I pasted earlier:
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.UNAVAILABLE
	details = "failed to connect to all addresses; last error: UNKNOWN: unix:/var/folders/zk/njdmph812pgg7_tg9w36gd9r0000gp/T/tmprz3kwuv4: No such file or directory"
	debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: unix:/var/folders/zk/njdmph812pgg7_tg9w36gd9r0000gp/T/tmprz3kwuv4: No such file or directory {created_time:"2023-06-22T12:33:12.530887-04:00", grpc_status:14}"
---

File "/Users/myuser/venv/lib/python3.11/site-packages/dagster/_grpc/client.py", line 155, in _query
    return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myuser/venv/lib/python3.11/site-packages/dagster/_grpc/client.py", line 130, in _get_response
    return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myuser/venv/lib/python3.11/site-packages/grpc/_channel.py", line 1030, in __call__
    return _end_unary_response_blocking(state, call, False, None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myuser/venv/lib/python3.11/site-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
a
alright - could be another `dagster dev`-specific behavior: the code servers it manages get auto-updated to try to stay in sync with your local code, and if that slow serial submission loop takes longer than the refresh interval, the server reference used in that loop doesn’t stay up to date
j
Gotcha, thanks again for digging in. It sounds like this shouldn't be an issue for us when we get rolling in a real deployment.