# ask-community
f
Hello 👋, I'm having an issue with a slow-running sensor that times out after 60 seconds. It iterates over ~70 tables to check if a run is needed and yields a RunRequest for the required tables. I'd like to avoid creating 70 sensors to prevent clutter in Dagit and to simplify monitoring. I attempted using a cursor to filter tables, but it still results in the same DEADLINE_EXCEEDED error. I'd appreciate any suggestions on how to resolve this issue. Thanks in advance!
🤖 1
c
I'm looking at a similar situation and I think "observable source assets" might be a better alternative to sensors here: https://docs.dagster.io/concepts/assets/asset-observations#observable-source-assets So there would be new source assets that represent the upstream tables; those are "observed" on whatever schedule you want, then the downstream tables could use auto-materialize. "DataVersion" here could be a timestamp, a hash of timestamp+row count, or whatever you're checking in the cursor.
❤️ 1
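(For context, a minimal sketch of the observable-source-asset approach described above might look like the following. The raw_orders table, the last_modified_timestamp helper, and the downstream orders_summary asset are illustrative placeholders, not anything from this thread.)
Copy code
from dagster import (
    AutoMaterializePolicy,
    DataVersion,
    asset,
    observable_source_asset,
)


def last_modified_timestamp(table: str) -> str:
    # Placeholder: query the source system for whatever the sensor's cursor
    # was checking (last-modified time, row count, a hash of both, etc.).
    raise NotImplementedError


@observable_source_asset
def raw_orders() -> DataVersion:
    # The observation only reports a version; a change in the returned
    # DataVersion is what marks downstream assets as stale.
    return DataVersion(last_modified_timestamp("raw_orders"))


@asset(
    non_argument_deps={"raw_orders"},
    auto_materialize_policy=AutoMaterializePolicy.eager(),
)
def orders_summary():
    # Placeholder: rebuild the downstream table when the source changes.
    ...
Because the observation only returns a DataVersion, the per-table checks stay cheap, and auto-materialize takes care of kicking off the downstream runs.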
d
Hi Felix - do you have a sense from running it locally how long it's taking after adding the cursor? Is it just over the 60 second limit, or taking much longer?
f
Hey @daniel, I initially thought using a cursor in a sensor would allow Dagster to resume after a timeout. However, it seems that when a sensor fails (even due to timeouts), the yielded RunRequests and cursor positions are ignored. I assume this is intended behavior?
To resolve this, I defined a BATCH_SIZE constant and adjusted the sensor logic to evaluate a limited number of tables at a time, updating the cursor accordingly. Now it works, but the sensor interval no longer reflects the table refresh interval (e.g., if the batch_size represents half of the tables, I need to halve the sensor interval).
For similar use cases, it would be helpful if:
1. Sensors could run for 15 minutes (or at least 5 minutes) instead of 1 minute.
2. Dagster could provide enhanced support for batched sensor evaluation by allowing users to customize the interval between batches vs. the normal interval (minimum_interval_seconds). Ideally, the interval between batches could be set to 1 minute or less, while the normal interval would apply once all batches are complete. This could be achieved by enabling users to modify the next interval during a sensor run, either by skipping ticks for the next hour (assuming the normal interval is set to 1 minute) or by forcing a new tick in 60 seconds (assuming the normal interval is set to 1 hour).
This flexibility would better accommodate use cases like ours. I'd love to hear your thoughts on these ideas. In any case, thank you for your help and your work on Dagster!
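(For reference, a rough sketch of the batching approach described above. The table list, the needs_refresh check, the job name, and the op config key are illustrative placeholders.)
Copy code
from dagster import RunRequest, SensorEvaluationContext, sensor

TABLES = [f"table_{i}" for i in range(70)]  # stand-in for the ~70 tables
BATCH_SIZE = 20  # evaluate only this many tables per tick


def needs_refresh(table: str) -> bool:
    # Placeholder for the real staleness check against the source system.
    raise NotImplementedError


@sensor(job_name="refresh_tables_job", minimum_interval_seconds=60)
def table_refresh_sensor(context: SensorEvaluationContext):
    # The cursor stores the index of the next table to evaluate, so each
    # tick only looks at BATCH_SIZE tables and stays under the timeout.
    start = int(context.cursor) if context.cursor else 0
    batch = TABLES[start : start + BATCH_SIZE]

    for table in batch:
        if needs_refresh(table):
            # In practice you would likely also set a run_key derived from
            # the staleness check to avoid duplicate runs.
            yield RunRequest(
                run_config={"ops": {"refresh_table": {"config": {"table": table}}}},
            )

    # Advance the cursor and wrap around once every table has been seen.
    next_start = start + BATCH_SIZE
    context.update_cursor(str(next_start if next_start < len(TABLES) else 0))
With something like this, each tick stays well under the 60-second limit, at the cost of the full table list only being covered every len(TABLES) / BATCH_SIZE ticks, which is the interval mismatch described above.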
d
Yeah, we're thinking about ways to improve this in the next few months to have fewer restrictions here
👍 1
I also think the observable source asset idea that Chris suggested is a good idea if that's an option for your use case
👍 1
And was also thinking of increasing the default to something like 5 minutes instead of 1 minute on a shorter timeframe (next few weeks)
big dag eyes 1
f
Hey @Chris Comeau, thanks for the suggestion! It looks like the best approach for asset-based orchestration. However, for ops/jobs orchestration use cases like mine, I don't think it would work.
p
Is there a GH issue that we can subscribe to for this problem? I'm having a similar issue and my usual workaround using DAGSTER_GRPC_TIMEOUT_SECONDS doesn't seem to help here.
👀 1
I've also tried the new num_submit_workers setting in 1.3.6, but it resulted in this error:
d
I would expect setting DAGSTER_GRPC_TIMEOUT_SECONDS to affect the 60 second timeout if you set it on the daemon. If you can paste the text of the error that you're seeing after you've triple checked that you've set that env var, we could take a look. We're working on a better solution than setting that env var though.
if you can include the text and full stack trace of the error rather than a screenshot, those are easier for us to work with
https://github.com/dagster-io/dagster/pull/14516 is a fix for the constraint issue you ran into - thanks for reporting that, but I don't think it has any bearing on the timeouts, and the num_submit_workers setting won't affect whether or not this particular gRPC call times out
p
Yeah, sorry about the screenshot, I'll make sure to extract the stack trace text next time. As for DAGSTER_GRPC_TIMEOUT_SECONDS, I've checked my k8s deployment and the environment variable is definitely set:
Copy code
env:
  - name: DAGSTER_HOME
    value: "/dagster-home"
  - name: POSTGRES_HOSTNAME
    value: "postgres"
  - name: DAGSTER_GRPC_TIMEOUT_SECONDS
    value: "300"
And this is what I saw in the logs when the sensor started:
Copy code
2023-05-29T14:34:20.376278480Z INFO:dagster.daemon.SensorDaemon:Checking for new runs for sensor: rtpd_publications_sensor
and when it failed:
Copy code
2023-05-29T14:37:05.672024247Z ERROR:dagster.daemon.SensorDaemon:Sensor daemon caught an error for sensor rtpd_publications_sensor
Full trace:
Copy code
Traceback (most recent call last):
  File "/app/.venv/lib/python3.10/site-packages/dagster/_daemon/sensor.py", line 520, in _process_tick_generator
    yield from _evaluate_sensor(
  File "/app/.venv/lib/python3.10/site-packages/dagster/_daemon/sensor.py", line 583, in _evaluate_sensor
    sensor_runtime_data = code_location.get_external_sensor_execution_data(
  File "/app/.venv/lib/python3.10/site-packages/dagster/_core/host_representation/code_location.py", line 845, in get_external_sensor_execution_data
    return sync_get_external_sensor_execution_data_grpc(
  File "/app/.venv/lib/python3.10/site-packages/dagster/_api/snapshot_sensor.py", line 63, in sync_get_external_sensor_execution_data_grpc
    api_client.external_sensor_execution(
  File "/app/.venv/lib/python3.10/site-packages/dagster/_grpc/client.py", line 388, in external_sensor_execution
    chunks = list(
  File "/app/.venv/lib/python3.10/site-packages/dagster/_grpc/client.py", line 184, in _streaming_query
    self._raise_grpc_exception(
  File "/app/.venv/lib/python3.10/site-packages/dagster/_grpc/client.py", line 140, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE
As you can see, the time between the sensor starting and it failing is more than 60s, but it's also not 300s. I tried bumping that to 600s and I got a similar result (though I don't have the exact timeout on hand at the moment). I'm happy to test this using "Test Sensor" if that should also respect the environment variable setting.
d
That doesn’t look like a timeout to me - that indicates that it can’t connect to the server, maybe it’s crashing or OOMing in the middle of the request?
p
Well, the sensor daemon and the code location are in the same process, so I'm not sure what could have crashed without both going down
But I agree that this doesn’t look like a timeout now that you mention it… that should have been DEADLINE_EXCEEDED. Perhaps some other resource is being exhausted here, but there’s nothing in the logs to go on.
d
We generally recommend running the code server in a separate pod when running at scale in production
p
Sure, but I have a tiny setup and running things in the same container is more cost effective for me at this time.
d
Fair enough - I think it would be a lot easier to understand why the server became unreachable if it were more isolated though. Is trying with that an option to see if that helps get to the bottom of what’s going on?
p
Yeah, I can definitely do that for debugging purposes
Will using “Test Sensor” allow me to reproduce this?
I mean, should “Test Sensor” behave the same way as a normal sensor run?
d
It’s worth a try - probably depends on the nature of the problem since it’s not exactly the same setup
p
right, I believe the sensor would run from the dagit process in that case?
d
I believe that’s right, yeah
đź‘Ť 1
I bet I know what is happening here - the daemon periodically reloads code servers if they aren’t being run separately, and the tick is now long enough that the reload happens in the middle of the tick and the old server shuts down
p
ah, yeah that makes sense; what’s the frequency on the code server reload?
d
I think it's every 60 seconds, but the servers stay around for a bit longer
running the code server separately is probably the easiest workaround with what's in master currently
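(For anyone landing here later, a rough sketch of that workaround, assuming a standalone gRPC code server started with dagster api grpc. The host, port, service name, and module name below are illustrative placeholders.)
Copy code
# workspace.yaml for the webserver/daemon, pointing at a code server that is
# started separately (e.g. in its own pod or container) with something like:
#   dagster api grpc --module-name my_project.definitions --host 0.0.0.0 --port 4266
load_from:
  - grpc_server:
      host: my-code-server   # hypothetical service/host name
      port: 4266
      location_name: my_project
With the code location loaded this way, the daemon no longer spawns and periodically reloads its own subprocess server, so per the explanation above a long tick shouldn't be interrupted by that reload.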
p
Ok, thanks. In terms of tracking a fix for this, should I create a GH issue?
d
that would be great, yeah
👍 1
p