Is this anything? I'm troubleshooting sensor timeouts and I'm... | dagster #ask-community


Charlie Bini

04/13/2023, 7:59 PM

Is this anything? I'm troubleshooting sensor timeouts, and I'm seeing there are a few records still "Starting". Is this a display bug or is a sensor process really hung?

Charlie Bini

04/13/2023, 7:59 PM

otherwise, all of my asset reconciliation sensors are timing out with this


dagster._core.errors.DagsterUserCodeUnreachableError: dagster._core.errors.DagsterUserCodeUnreachableError: The sensor tick timed out due to taking longer than 60 seconds to execute the sensor function. One way to avoid this error is to break up the sensor work into chunks, using cursors to let subsequent sensor calls pick up where the previous call left off.

Stack Trace:
  File "/dagster-cloud/dagster_cloud/agent/dagster_cloud_agent.py", line 807, in _process_api_request
    api_result = self._handle_api_request(
  File "/dagster-cloud/dagster_cloud/agent/dagster_cloud_agent.py", line 665, in _handle_api_request
    serialized_sensor_data_or_error = client.external_sensor_execution(
  File "/dagster/dagster/_grpc/client.py", line 388, in external_sensor_execution
    chunks = list(
  File "/dagster/dagster/_grpc/client.py", line 184, in _streaming_query
    self._raise_grpc_exception(
  File "/dagster/dagster/_grpc/client.py", line 135, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(

The above exception was caused by the following exception:
grpc._channel._MultiThreadedRendezvous: <_MultiThreadedRendezvous of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "{"created":"@1681415679.819259821","description":"Deadline Exceeded","file":"src/core/ext/filters/deadline/deadline_filter.cc","file_line":81,"grpc_status":4}"
>

Stack Trace:
  File "/dagster/dagster/_grpc/client.py", line 180, in _streaming_query
    yield from self._get_streaming_response(
  File "/dagster/dagster/_grpc/client.py", line 169, in _get_streaming_response
    yield from getattr(stub, method)(request, metadata=self._metadata, timeout=timeout)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 426, in __next__
    return self._next()
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 826, in _next
    raise self


  File "/dagster/dagster/_daemon/sensor.py", line 512, in _process_tick_generator
    yield from _evaluate_sensor(
  File "/dagster/dagster/_daemon/sensor.py", line 575, in _evaluate_sensor
    sensor_runtime_data = code_location.get_external_sensor_execution_data(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 647, in get_external_sensor_execution_data
    result = self.api_call(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 382, in api_call
    return dagster_cloud_api_call(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 131, in dagster_cloud_api_call
    for result in gen_dagster_cloud_api_call(
  File "/dagster-cloud-backend/dagster_cloud_backend/user_code/workspace.py", line 280, in gen_dagster_cloud_api_call
    raise DagsterUserCodeUnreachableError(error_infos[0].to_string())

Charlie Bini

04/13/2023, 8:01 PM

as are my regular sensors, which have all worked fine in the past and haven't changed. they're already using cursors

daniel

04/14/2023, 2:40 AM

Hey Charlie - any chance you could share the code of one of the regular sensors that's failing?

daniel

04/14/2023, 2:41 AM

Did anything in particular change on your side between the time it was most recently working and the time things started timing out?

daniel

04/14/2023, 2:42 AM

If you can link to the sensor that says Started, we can take a look at that

daniel

04/14/2023, 2:43 AM

for the asset reconciliation sensor issues - one of the changes we're planning to make in 1.3 is to provide a way for these to be powered using a more built-in daemon process that doesn't use a sensor under the hood - that should give us some more flexibility with timeouts and the ability to tune performance here

Charlie Bini

04/14/2023, 3:04 PM

@daniel here's the code: https://github.com/TEAMSchools/teamster/blob/main/src/teamster/core/powerschool/sensors.py

daniel

04/14/2023, 3:07 PM

Do you have a way to invoke the sensor locally to verify how long it's taking? I wonder if, despite the cursor, the number of assets it's evaluating in each tick might have increased
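A rough sketch of how a local timing check could look, using only the standard library. `timed` and `fake_sensor_body` are hypothetical names for illustration; the 60 second figure comes from the error message above. With Dagster installed, the stand-in function would presumably be replaced by a direct call to the real sensor with a context built by `build_sensor_context`, plus whatever resources the sensor needs (the ssh tunnel here).

```python
import time
from typing import Any, Callable, Tuple

# Limit quoted in the DagsterUserCodeUnreachableError message above.
SENSOR_TIMEOUT_SECONDS = 60


def timed(fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, float]:
    """Call fn and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_sensor_body(n_assets: int) -> int:
    """Hypothetical stand-in for the sensor function: sleeps a little per asset."""
    for _ in range(n_assets):
        time.sleep(0.001)
    return n_assets


def report(n_assets: int) -> str:
    """Time one evaluation and say whether it fits inside the tick timeout."""
    _, elapsed = timed(fake_sensor_body, n_assets)
    verdict = "ok" if elapsed < SENSOR_TIMEOUT_SECONDS else "would time out"
    return f"{n_assets} assets: {elapsed:.3f}s ({verdict})"
```

Running `report(45)` against the real sensor body a few times would show how close a single tick gets to the limit, and whether the time grows with the number of assets.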

daniel

04/14/2023, 3:08 PM

tagging in @claire because I see that the code is calling add_dynamic_partitions inside the sensor function - I believe we have plans in the works, if they're not live already, to let you include the partition_key as part of the response of the sensor instead (and then we add it for you), which would let you move that out of the 60 second window that's timing out - I don't totally recall if that's live yet

Charlie Bini

04/14/2023, 3:09 PM

here's one of the sensors that has a hung "Started" tick: https://kipptaf.dagster.cloud/prod/locations/kippcamden/sensors/dbt_asset_reconciliatio[…]sor?success=false&failure=false&started=true&skipped=false

Charlie Bini

04/14/2023, 3:09 PM

I don't currently have a local test, but if it's possible I can try that out

Charlie Bini

04/14/2023, 3:11 PM

the only change before it started acting up was switching from run_request_for_partition to a direct RunRequest

but it still ran fine for about a day before problems started

daniel

04/14/2023, 3:12 PM

I wonder if applying a limit to asset_defs might help so that there's a hard limit on the number of things it processes on each tick
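One way the hard limit could be sketched, assuming the sensor keeps an offset into its asset list in the cursor. `next_batch`, `MAX_ASSETS_PER_TICK`, and the `{"offset": ...}` cursor shape are all hypothetical; a sensor that already stores per-asset state in its cursor would need to merge the offset into that existing structure rather than replace it.

```python
import json
from typing import List, Optional, Tuple

# Hypothetical cap; tune so one batch finishes well inside the 60 second tick limit.
MAX_ASSETS_PER_TICK = 10


def next_batch(
    asset_names: List[str],
    cursor: Optional[str],
    limit: int = MAX_ASSETS_PER_TICK,
) -> Tuple[List[str], str]:
    """Return the assets to process on this tick and the cursor for the next tick.

    The cursor is a JSON string holding the offset of the first unprocessed
    asset. When the end of the list is reached, the offset wraps back to 0.
    """
    if not asset_names:
        return [], json.dumps({"offset": 0})

    offset = json.loads(cursor)["offset"] if cursor else 0
    offset %= len(asset_names)  # guard against a stale cursor after the list shrinks

    batch = asset_names[offset : offset + limit]
    next_offset = offset + len(batch)
    if next_offset >= len(asset_names):
        next_offset = 0  # start over on the following tick

    return batch, json.dumps({"offset": next_offset})


# Inside a sensor this would roughly be:
#   batch, new_cursor = next_batch(all_asset_names, context.cursor)
#   ... evaluate only `batch` ...
#   context.update_cursor(new_cursor)
```

With about 45 asset_defs and a cap of 10, every asset would be looked at once every five ticks, still through a single ssh tunnel, instead of all 45 on every tick.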

Charlie Bini

04/14/2023, 3:12 PM

actually I take that back, that one was running fine until a few days ago

Charlie Bini

04/14/2023, 3:12 PM

the asset_reconciliation_sensors have been failing for longer

Charlie Bini

04/14/2023, 3:15 PM

so I just reenabled the powerschool sensor and it's requesting runs as expected

daniel

04/14/2023, 3:15 PM

Did you consider having one sensor per asset def instead of one sensor to cover them all? that would give you 60 seconds to work with per sensor

daniel

04/14/2023, 3:15 PM

er per asset i mean

daniel

04/14/2023, 3:16 PM

but then you'd also have N ssh tunnels running at once which you might not want

Charlie Bini

04/14/2023, 3:16 PM

yeah, the ssh tunnel is the reason I need to group them

daniel

04/14/2023, 3:16 PM

how many asset_defs are there?

Charlie Bini

04/14/2023, 3:17 PM

there's about 45 on that sensor I think

daniel

04/14/2023, 3:17 PM

hm ok, I can definitely imagine that hitting the 60 second limit if there's potentially non-trivial work happening per asset

daniel

04/14/2023, 3:18 PM

what do you think about the 'cap total amount of work' direction? taking out that partition key write may help too if that's a common thing

Charlie Bini

04/14/2023, 3:19 PM

if there's a passive way I can run the tunnel on the code location pod, that would be optimal, then I could more easily break up the number of assets

Charlie Bini

04/14/2023, 3:19 PM

need more info on the partition key write, but not against it

Charlie Bini

04/14/2023, 3:23 PM

but I'll monitor this, it appears to be working normally again
