#ask-community
c

Charlie Bini

04/13/2023, 7:59 PM
Is this anything? I'm troubleshooting sensor timeouts, and I'm seeing there are a few records still "Starting". Is this a display bug or is a sensor process really hung?
otherwise, all of my asset reconciliation sensors are timing out with this
dagster._core.errors.DagsterUserCodeUnreachableError: dagster._core.errors.DagsterUserCodeUnreachableError: The sensor tick timed out due to taking longer than 60 seconds to execute the sensor function. One way to avoid this error is to break up the sensor work into chunks, using cursors to let subsequent sensor calls pick up where the previous call left off.

Stack Trace:
  File "/dagster-cloud/dagster_cloud/agent/dagster_cloud_agent.py", line 807, in _process_api_request
    api_result = self._handle_api_request(
  File "/dagster-cloud/dagster_cloud/agent/dagster_cloud_agent.py", line 665, in _handle_api_request
    serialized_sensor_data_or_error = client.external_sensor_execution(
  File "/dagster/dagster/_grpc/client.py", line 388, in external_sensor_execution
    chunks = list(
  File "/dagster/dagster/_grpc/client.py", line 184, in _streaming_query
    self._raise_grpc_exception(
  File "/dagster/dagster/_grpc/client.py", line 135, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(

The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1681415679.819259821","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>

Stack Trace:
  File "/dagster/dagster/_grpc/client.py", line 180, in _streaming_query
    yield from self._get_streaming_response(
  File "/dagster/dagster/_grpc/client.py", line 169, in _get_streaming_response
    yield from getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 826, in _next
    raise self


  File "/dagster/dagster/_daemon/sensor.py", line 512, in _process_tick_generator
    yield from _evaluate_sensor(
  File "/dagster/dagster/_daemon/sensor.py", line 575, in _evaluate_sensor
    sensor_runtime_data = code_location.get_external_sensor_execution_data(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 647, in get_external_sensor_execution_data
    result = self.api_call(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 382, in api_call
    return dagster_cloud_api_call(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 131, in dagster_cloud_api_call
    for result in gen_dagster_cloud_api_call(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 280, in gen_dagster_cloud_api_call
    raise DagsterUserCodeUnreachableError(error_infos[0].to_string())
as are my regular sensors, which have all worked fine in the past and haven't changed. they're already using cursors
d

daniel

04/14/2023, 2:40 AM
Hey Charlie - any chance you could share the code of one of the regular sensors that's failing?
Did anything in particular change on your side between the time it was most recently working and the time things started timing out?
If you can link to the sensor that says Started, we can take a look at that
for the asset reconciliation sensor issues - one of the changes we're planning to make in 1.3 is to provide a way for these to be powered using a more built-in daemon process that doesn't use a sensor under the hood - that should give us some more flexibility with timeouts and the ability to tune performance here
d

daniel

04/14/2023, 3:07 PM
Do you have a way to invoke the sensor locally to verify how long it's taking? I wonder if, despite the cursor, the number of assets it's evaluating in each tick might have increased
tagging in @claire because I see that the code is calling add_dynamic_partitions inside the sensor function - I believe we have plans in the works, if they're not live already, to let you include the partition_key as part of the sensor's response instead (and then we add it for you), which would let you move that write out of the 60-second window that's timing out - I don't totally recall if that's live yet
I don't currently have a local test, but if it's possible I can try that out
the only change before it started acting up was switching from run_request_for_partition to a direct RunRequest, but it still ran fine for about a day before problems started
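One way to check locally how long a tick takes is to time the per-asset work outside of Dagster. The sketch below is only an illustration: `check_asset` is a hypothetical stand-in for whatever the sensor does per asset (for example a query over the SSH tunnel), and the 60-second figure is taken from the error message above.

```python
import time
from typing import Callable, Dict, Iterable, List, Tuple

TICK_TIMEOUT_SECONDS = 60  # limit quoted in the sensor timeout error


def time_per_item(
    items: Iterable[str], work: Callable[[str], object]
) -> List[Tuple[str, float]]:
    """Run `work` on each item and record wall-clock seconds per item."""
    timings = []
    for item in items:
        start = time.perf_counter()
        work(item)
        timings.append((item, time.perf_counter() - start))
    return timings


def report(timings: List[Tuple[str, float]]) -> Dict[str, object]:
    """Summarize total time, whether it exceeds the limit, and the slowest items."""
    total = sum(seconds for _, seconds in timings)
    slowest = sorted(timings, key=lambda t: t[1], reverse=True)[:5]
    return {
        "total_seconds": total,
        "over_limit": total > TICK_TIMEOUT_SECONDS,
        "slowest": slowest,
    }


def check_asset(name: str) -> None:
    # Hypothetical placeholder for the real per-asset check.
    time.sleep(0.01)


if __name__ == "__main__":
    names = [f"asset_{i}" for i in range(45)]  # about 45 assets, as in this thread
    print(report(time_per_item(names, check_asset)))
```

If the total is close to the limit, the `slowest` entries show which assets dominate the tick.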
d

daniel

04/14/2023, 3:12 PM
I wonder if applying a limit to asset_defs might help so that there's a hard limit on the number of things it processes on each tick
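A hard per-tick limit could look roughly like the sketch below. It is plain Python, not Dagster's API: `next_batch` is a hypothetical helper, and the cursor is assumed to be a stringified offset. In a real sensor the cursor would be read from `context.cursor` and saved with `context.update_cursor(...)`.

```python
from typing import List, Optional, Sequence, Tuple


def next_batch(
    names: Sequence[str], cursor: Optional[str], limit: int
) -> Tuple[List[str], str]:
    """Return up to `limit` names starting at the cursor offset, plus the next cursor.

    The offset wraps around, so every name is eventually visited
    even though each tick only handles a bounded number of them.
    """
    if not names:
        return [], "0"
    start = int(cursor) % len(names) if cursor else 0
    count = min(limit, len(names))
    batch = [names[(start + i) % len(names)] for i in range(count)]
    return batch, str((start + count) % len(names))


if __name__ == "__main__":
    asset_names = [f"asset_{i}" for i in range(45)]
    cursor = None
    for tick in range(5):
        batch, cursor = next_batch(asset_names, cursor, limit=10)
        print(tick, batch[0], batch[-1], cursor)
```

With 45 assets and a limit of 10, a full pass takes five ticks, and a single tunnel can still be shared within each tick.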
c

Charlie Bini

04/14/2023, 3:12 PM
actually I take that back, that one was running fine until a few days ago
the asset_reconciliation_sensors have been failing for longer
so I just reenabled the powerschool sensor and it's requesting runs as expected
d

daniel

04/14/2023, 3:15 PM
Did you consider having one sensor per asset def instead of one sensor to cover them all? that would give you 60 seconds to work with per sensor
er per asset i mean
but then you'd also have N ssh tunnels running at once which you might not want
c

Charlie Bini

04/14/2023, 3:16 PM
yeah, the ssh tunnel is the reason I need to group them
d

daniel

04/14/2023, 3:16 PM
how many asset_defs are there?
c

Charlie Bini

04/14/2023, 3:17 PM
there's about 45 on that sensor I think
d

daniel

04/14/2023, 3:17 PM
hm ok, I can definitely imagine that hitting the 60 second limit if there's potentially non-trivial work happening per asset
what do you think about the 'cap total amount of work' direction? taking out that partition key write may help too if that's a common thing
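Capping the total amount of work can also be done by time rather than by count. The following is a minimal sketch under stated assumptions: `process_within_budget` is a hypothetical helper, and the budget value is something you would pick to leave headroom under the 60-second limit.

```python
import time
from typing import Callable, Iterable, List, Tuple


def process_within_budget(
    items: Iterable[str],
    work: Callable[[str], object],
    budget_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[List[str], List[str]]:
    """Process items in order until the time budget is spent.

    Returns (done, remaining). The deadline is only checked before each item,
    so one slow item can still overrun the budget.
    """
    deadline = clock() + budget_seconds
    done: List[str] = []
    remaining = list(items)
    while remaining and clock() < deadline:
        item = remaining.pop(0)
        work(item)
        done.append(item)
    return done, remaining


if __name__ == "__main__":
    done, remaining = process_within_budget(
        [f"asset_{i}" for i in range(45)],
        lambda name: time.sleep(0.01),  # hypothetical per-asset work
        budget_seconds=0.2,
    )
    print(len(done), "processed;", len(remaining), "left for the next tick")
```

Whatever is left in `remaining` would be picked up on a later tick, for example by storing the first remaining name in the cursor.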
c

Charlie Bini

04/14/2023, 3:19 PM
if there's a passive way I can run the tunnel on the code location pod, that would be optimal, then I could more easily break up the number of assets
need more info on the partition key write, but not against it
but I'll monitor this, it appears to be working normally again