# ask-community
f
Hello 👋, I'm having an issue with a slow-running sensor that times out after 60 seconds. It iterates over ~70 tables to check if a run is needed and yields a RunRequest for the required tables. I'd like to avoid creating 70 sensors to prevent clutter in Dagit and to simplify monitoring. I attempted using a cursor to filter tables, but it still results in the same DEADLINE_EXCEEDED error. I'd appreciate any suggestions on how to resolve this issue. Thanks in advance!
🤖 1
c
I'm looking at a similar situation and I think "observable source assets" might be a better alternative to sensors here: https://docs.dagster.io/concepts/assets/asset-observations#observable-source-assets So there would be new source assets that represent the upstream tables; those are "observed" on whatever schedule you want, then the downstream tables could use auto-materialize. "DataVersion" here could be a timestamp, a hash of timestamp+row count, or whatever you're checking in the cursor.
❤️ 1
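(For context, a minimal sketch of the observable-source-asset approach described above might look like the following. The raw_orders table, the last_modified_timestamp helper, and the downstream orders_summary asset are illustrative placeholders, not anything from this thread.)
Copy code
from dagster import (
    AutoMaterializePolicy,
    DataVersion,
    asset,
    observable_source_asset,
)


def last_modified_timestamp(table: str) -> str:
    # Placeholder: query the source system for whatever the sensor's cursor
    # was checking (last-modified time, row count, a hash of both, etc.).
    raise NotImplementedError


@observable_source_asset
def raw_orders() -> DataVersion:
    # The observation only reports a version; a change in the returned
    # DataVersion is what marks downstream assets as stale.
    return DataVersion(last_modified_timestamp("raw_orders"))


@asset(
    non_argument_deps={"raw_orders"},
    auto_materialize_policy=AutoMaterializePolicy.eager(),
)
def orders_summary():
    # Placeholder: rebuild the downstream table when the source changes.
    ...
Because the observation only returns a DataVersion, the per-table checks stay cheap, and auto-materialize takes care of kicking off the downstream runs.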
d
Hi Felix - do you have a sense from running it locally how long it's taking after adding the cursor? Is it just over the 60 second limit, or taking much longer?
f
Hey @daniel, I initially thought using a cursor in a sensor would allow Dagster to resume after a timeout. However, it seems that when a sensor fails (even due to timeouts), the yielded RunRequests and cursor positions are ignored. I assume this is intended behavior?
To resolve this, I defined a BATCH_SIZE constant and adjusted the sensor logic to evaluate a limited number of tables at a time, updating the cursor accordingly. Now it works, but the sensor interval no longer reflects the table refresh interval (e.g., if the batch_size represents half of the tables, I need to halve the sensor interval).
For similar use cases, it would be helpful if:
1. Sensors could run for 15 minutes (or at least 5 minutes) instead of 1 minute.
2. Dagster could provide enhanced support for batched sensor evaluation by allowing users to customize the interval between batches vs. the normal interval (minimum_interval_seconds). Ideally, the interval between batches could be set to 1 minute or less, while the normal interval would apply once all batches are complete. This could be achieved by enabling users to modify the next interval during a sensor run, either by skipping ticks for the next hour (assuming the normal interval is set to 1 minute) or by forcing a new tick in 60 seconds (assuming the normal interval is set to 1 hour).
This flexibility would better accommodate use cases like ours. I'd love to hear your thoughts on these ideas. In any case, thank you for your help and your work on Dagster!
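(For reference, a rough sketch of the batching approach described above. The table list, the needs_refresh check, the job name, and the op config key are illustrative placeholders.)
Copy code
from dagster import RunRequest, SensorEvaluationContext, sensor

TABLES = [f"table_{i}" for i in range(70)]  # stand-in for the ~70 tables
BATCH_SIZE = 20  # evaluate only this many tables per tick


def needs_refresh(table: str) -> bool:
    # Placeholder for the real staleness check against the source system.
    raise NotImplementedError


@sensor(job_name="refresh_tables_job", minimum_interval_seconds=60)
def table_refresh_sensor(context: SensorEvaluationContext):
    # The cursor stores the index of the next table to evaluate, so each
    # tick only looks at BATCH_SIZE tables and stays under the timeout.
    start = int(context.cursor) if context.cursor else 0
    batch = TABLES[start : start + BATCH_SIZE]

    for table in batch:
        if needs_refresh(table):
            # In practice you would likely also set a run_key derived from
            # the staleness check to avoid duplicate runs.
            yield RunRequest(
                run_config={"ops": {"refresh_table": {"config": {"table": table}}}},
            )

    # Advance the cursor and wrap around once every table has been seen.
    next_start = start + BATCH_SIZE
    context.update_cursor(str(next_start if next_start < len(TABLES) else 0))
With something like this, each tick stays well under the 60-second limit, at the cost of the full table list only being covered every len(TABLES) / BATCH_SIZE ticks, which is the interval mismatch described above.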
d
Yeah, we're thinking about ways to improve this in the next few months to have fewer restrictions here
👍 1
I also think the observable source asset idea that Chris suggested is a good idea if that's an option for your use case
👍 1
And was also thinking of increasing the default to something like 5 minutes instead of 1 minute on a shorter timeframe (next few weeks)
big dag eyes 1
f
Hey @Chris Comeau, thanks for the suggestion! It looks like the best approach for asset-based orchestration. However, for ops/jobs orchestration use cases like mine, I don't think it would work.
p
Is there a GH issue that we can subscribe to for this problem? I'm having a similar issue and my usual workaround using DAGSTER_GRPC_TIMEOUT_SECONDS doesn't seem to help here.
👀 1
I've also tried the new num_submit_workers setting in 1.3.6, but it resulted in this error:
d
I would expect setting DAGSTER_GRPC_TIMEOUT_SECONDS to affect the 60 second timeout if you set it on the daemon. If you can paste the text of the error that you're seeing after you've triple checked that you've set that env var, we could take a look. We're working on a better solution than setting that env var though.
if you can include the text and full stack trace of the error rather than a screenshot, those are easier for us to work with
https://github.com/dagster-io/dagster/pull/14516 is a fix for the constraint issue you ran into - thanks for reporting that, but I don't think it has any bearing on the timeouts, and the num_submit_workers setting won't affect whether or not this particular gRPC call times out
p
Yeah, sorry about the screenshot, I'll make sure to extract the stack trace text next time. As for DAGSTER_GRPC_TIMEOUT_SECONDS, I've checked my k8s deployment and the environment variable is definitely set:
Copy code
env:
  - name: DAGSTER_HOME
    value: "/dagster-home"
  - name: POSTGRES_HOSTNAME
    value: "postgres"
  - name: DAGSTER_GRPC_TIMEOUT_SECONDS
    value: "300"
And this is what I saw in the logs when the sensor started:
Copy code
2023-05-29T14:34:20.376278480Z INFO:dagster.daemon.SensorDaemon:Checking for new runs for sensor: rtpd_publications_sensor
and when it failed:
Copy code
2023-05-29T14:37:05.672024247Z ERROR:dagster.daemon.SensorDaemon:Sensor daemon caught an error for sensor rtpd_publications_sensor
Full trace:
Copy code
Traceback (most recent call last):
  File "/app/.venv/lib/python3.10/site-packages/dagster/_daemon/sensor.py", line 520, in _process_tick_generator
    yield from _evaluate_sensor(
  File "/app/.venv/lib/python3.10/site-packages/dagster/_daemon/sensor.py", line 583, in _evaluate_sensor
    sensor_runtime_data = code_location.get_external_sensor_execution_data(
  File "/app/.venv/lib/python3.10/site-packages/dagster/_core/host_representation/code_location.py", line 845, in get_external_sensor_execution_data
    return sync_get_external_sensor_execution_data_grpc(
  File "/app/.venv/lib/python3.10/site-packages/dagster/_api/snapshot_sensor.py", line 63, in sync_get_external_sensor_execution_data_grpc
    api_client.external_sensor_execution(
  File "/app/.venv/lib/python3.10/site-packages/dagster/_grpc/client.py", line 388, in external_sensor_execution
    chunks = list(
  File "/app/.venv/lib/python3.10/site-packages/dagster/_grpc/client.py", line 184, in _streaming_query
    self._raise_grpc_exception(
  File "/app/.venv/lib/python3.10/site-packages/dagster/_grpc/client.py", line 140, in _raise_grpc_exception
    raise DagsterUserCodeUnreachableError(
dagster._core.errors.DagsterUserCodeUnreachableError: Could not reach user code server. gRPC Error code: UNAVAILABLE
As you can see, the time between the sensor starting and it failing is more than 60s, but it's also not 300s. I tried bumping that to 600s and I got a similar result (though I don't have the exact timeout on hand at the moment). I'm happy to test this using "Test Sensor" if that should also respect the environment variable setting.
d
That doesn’t look like a timeout to me - that indicates that it can’t connect to the server, maybe it’s crashing or OOMing in the middle of the request?
p
Well, the sensor daemon and the code location are in the same process, so I'm not sure what could have crashed without both going down
But I agree that this doesn’t look like a timeout now that you mention it… that should have been DEADLINE_EXCEEDED. Perhaps some other resource is being exhausted here, but there’s nothing in the logs to go on.
d
We generally recommend running the code server in a separate pod when running at scale in production
p
Sure, but I have a tiny setup and running things in the same container is more cost effective for me at this time.
d
Fair enough - I think it would be a lot easier to understand why the server became unreachable if it were more isolated though. Is trying with that an option to see if that helps get to the bottom of what’s going on?
p
Yeah, I can definitely do that for debugging purposes
Will using “Test Sensor” allow me to reproduce this?
I mean, should “Test Sensor” behave the same way as a normal sensor run?
d
It’s worth a try - probably depends on the nature of the problem since it’s not exactly the same setup
p
right, I believe the sensor would run from the dagit process in that case?
d
I believe that’s right, yeah
đź‘Ť 1
I bet I know what is happening here - the daemon periodically reloads code servers if they aren’t being run separately, and the tick is now long enough that the reload happens in the middle of the tick and the old server shuts down
p
ah, yeah that makes sense; what’s the frequency on the code server reload?
d
I think it's every 60 seconds, but the servers stay around for a bit longer
running the code server separately is probably the easiest workaround with what's in master currently
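(For anyone landing here later, a rough sketch of that workaround, assuming a standalone gRPC code server started with dagster api grpc. The host, port, service name, and module name below are illustrative placeholders.)
Copy code
# workspace.yaml for the webserver/daemon, pointing at a code server that is
# started separately (e.g. in its own pod or container) with something like:
#   dagster api grpc --module-name my_project.definitions --host 0.0.0.0 --port 4266
load_from:
  - grpc_server:
      host: my-code-server   # hypothetical service/host name
      port: 4266
      location_name: my_project
With the code location loaded this way, the daemon no longer spawns and periodically reloads its own subprocess server, so per the explanation above a long tick shouldn't be interrupted by that reload.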
p
Ok, thanks. In terms of tracking a fix for this, should I create a GH issue?
d
that would be great, yeah
👍 1
p