Justin Taylor
06/21/2023, 3:13 PM
dagster dev deployment with the "Test Sensor" feature in dagit, I see about 100 RunRequests. When I turn on the sensor completely, it will submit around 25 RunRequests every time the sensor ticks, but I would expect it to launch 100 RunRequests in the first tick.
Can someone help me understand the mechanics of this behavior? In my search, I've learned that a sensor evaluation cannot last longer than 60 seconds. Could that be part of the issue? Or is there some other limit on the number of runs that can be submitted in a sensor evaluation?
alex
06/22/2023, 3:22 PM
run_key? I don’t think the “test sensor” flow does the deduplication that the real path does for run keys
Justin Taylor
06/22/2023, 3:35 PM
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE
After all the RunRequests have been submitted, the sensor status shows up as "Skipped" in subsequent ticks, which makes sense. It isn't clear to me why we would be getting the gRPC error though.
alex
06/22/2023, 3:39 PM
Justin Taylor
06/22/2023, 3:46 PM
alex
06/22/2023, 3:52 PM
dagster dev not handling this large sensor throughput well, and failing out in a way that should be improved
by default for local dev, submitted runs are launched directly as subprocesses. This can be changed via instance config to an async submit-to-queue model
https://docs.dagster.io/deployment/dagster-instance#queuedruncoordinator
by default for local dev, sensor evaluation happens serially; there are threading options you can enable to increase throughput
https://docs.dagster.io/deployment/dagster-instance#sensor-evaluation
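For reference, the two instance settings linked above could look roughly like this in the instance's dagster.yaml. This is a minimal sketch, not a tested fix for this thread: the worker count of 8 is an illustrative assumption, and the exact keys should be checked against the linked docs for the Dagster version in use.

```yaml
# Queue submitted runs instead of launching each one directly as a subprocess.
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator

# Evaluate sensors in a thread pool instead of serially.
sensors:
  use_threads: true
  num_workers: 8  # illustrative value (assumption), tune for your machine
```

With the queued coordinator, the daemon dequeues and launches runs separately from the sensor tick, so a large tick should no longer block on creating each subprocess in turn.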
My current hypothesis is that the serial submissions, which do blocking subprocess creation, are the problem.
I would speculate that if you looked closely at the logs, you would see the daemon process restarting due to a heartbeat timeout failure
alex
06/22/2023, 3:53 PM
Justin Taylor
06/22/2023, 3:58 PM
alex
06/22/2023, 3:58 PM
Justin Taylor
06/22/2023, 4:44 PM
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
    status = StatusCode.UNAVAILABLE
    details = "failed to connect to all addresses; last error: UNKNOWN: unix:/var/folders/zk/njdmph812pgg7_tg9w36gd9r0000gp/T/tmprz3kwuv4: No such file or directory"
    debug_error_string = "UNKNOWN:failed to connect to all addresses; last error: UNKNOWN: unix:/var/folders/zk/njdmph812pgg7_tg9w36gd9r0000gp/T/tmprz3kwuv4: No such file or directory {created_time:"2023-06-22T12:33:12.530887-04:00", grpc_status:14}"
---
  File "/Users/myuser/venv/lib/python3.11/site-packages/dagster/_grpc/client.py", line 155, in _query
    return self._get_response(method, request=request_type(**kwargs), timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myuser/venv/lib/python3.11/site-packages/dagster/_grpc/client.py", line 130, in _get_response
    return getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myuser/venv/lib/python3.11/site-packages/grpc/_channel.py", line 1030, in __call__
    return _end_unary_response_blocking(state, call, False, None)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/myuser/venv/lib/python3.11/site-packages/grpc/_channel.py", line 910, in _end_unary_response_blocking
    raise _InactiveRpcError(state) # pytype: disable=not-instantiable
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
alex
06/22/2023, 4:53 PM
dagster dev specific feature where the code servers it manages get auto-updated to try to stay in sync with local code. That slow serial loop is taking longer than the interval at which we refresh the code server, and the reference used in that loop isn’t staying up to date
Justin Taylor
06/22/2023, 4:56 PM