# dagster-plus
l
hello - recently we've started encountering frequent sensor timeouts; i think it's because we have a number of sensors that all run around the same time - they are all multi-asset sensors that each monitor an exclusive subset of assets. I've been able to work around it a bit by toggling them off and on to introduce some stagger, but is there a way to write these sensors in a more performant way?
also giving the grpc server (the thing that evaluates sensors) more cpu/mem will help
l
hmm - we're on serverless, is that a config somewhere?
j
ah unfortunately not
have you tried using cursors? if you're still having problems i can file a request to increase the cpu/mem allocated for your serverless deployment
l
ok - can confirm that we're using cursors (we just wait until all the assets in the selection have new materialization events and then call `advance_all_cursors`) - a bump up in resources would be great.
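for reference, the sensors are all roughly this shape (a simplified sketch - the asset keys, job name, and sensor name below are placeholders, not our real ones):
```python
from dagster import AssetKey, RunRequest, define_asset_job, multi_asset_sensor

# Placeholder downstream job - in reality each sensor targets its own job/assets.
refresh_reports_job = define_asset_job("refresh_reports_job", selection="reports*")

@multi_asset_sensor(
    monitored_assets=[AssetKey("orders"), AssetKey("customers")],  # this sensor's exclusive subset
    job=refresh_reports_job,
    minimum_interval_seconds=300,  # each sensor runs every 5 minutes
)
def orders_customers_sensor(context):
    # Latest materialization record per monitored asset since the stored cursor;
    # the value is None for any asset with no new materialization.
    records = context.latest_materialization_records_by_key()
    if all(records.values()):
        # Every asset in the selection has a new materialization: advance the
        # cursors past those events and kick off the downstream job.
        context.advance_all_cursors()
        return RunRequest()
```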
j
ok, I will get that filed - it might not go out until Thursday of next week, is that ok?
l
that is fine! thank you!
hi @Joe - did the change to increase resources go out? We are still regularly encountering sensor timeouts, all of which start around the same time. Redeploying the code location recovers things, but the timeouts start again a few hours later. cc @Henri Blancke
s
Hi Leo, due to a configuration issue it looks like the increase may have been recently reverted. I am fixing it now, and the increase should be applied again in a few minutes.
l
hello @Shalabh Chaturvedi - is there any chance that the increase was reverted again? Starting around noon pacific time yesterday we have been encountering sensor timeouts - and re-deploying the code location does not seem to be helping
Starting around noon pacific time today, we are also encountering runs of dbt assets (via dagster-dbt, so very little user-provided code) that are failing with this error:
```
dagster_cloud_cli.core.errors.GraphQLStorageError: Error in GraphQL response: [{'message': 'Internal Server Error (Trace ID: 278874557432852609)', 'locations': [{'line': 22, 'column': 13}], 'path': ['eventLogs', 'getEventRecords']}]
```
s
Hi Leo - I checked the configuration and it is still running with the increased resources. The max CPU usage is at 100% - do you have some heavy computation in the sensor that regularly pegs the cpu?
l
we have about 15 multi-asset sensors that run every 5 minutes. They do minimal computation of their own - a few hit the Slack API - but they are generally just listening on an asset selection and may trigger other assets or jobs
is there perhaps a way to increase the amount of jitter in these sensor runs? They need not all run at the same time
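one thing we could try on our end, I suppose (not sure if it's the recommended approach), is giving each sensor a slightly different `minimum_interval_seconds` so their evaluations drift apart instead of all landing on the same tick - a rough sketch with made-up sensor names and asset keys:
```python
from dagster import AssetKey, RunRequest, define_asset_job, multi_asset_sensor

refresh_reports_job = define_asset_job("refresh_reports_job", selection="reports*")

# Hypothetical exclusive asset subsets, one per sensor (we have ~15 of these).
SENSOR_GROUPS = {
    "orders_sensor": [AssetKey("orders"), AssetKey("order_items")],
    "customers_sensor": [AssetKey("customers"), AssetKey("customer_addresses")],
}

def make_staggered_sensor(name, asset_keys, offset_seconds):
    # Base cadence of 5 minutes plus a small per-sensor offset, so evaluations
    # gradually drift apart instead of all piling up on the same tick.
    @multi_asset_sensor(
        name=name,
        monitored_assets=asset_keys,
        job=refresh_reports_job,
        minimum_interval_seconds=300 + offset_seconds,
    )
    def _sensor(context):
        records = context.latest_materialization_records_by_key()
        if all(records.values()):
            context.advance_all_cursors()
            return RunRequest()
    return _sensor

staggered_sensors = [
    make_staggered_sensor(name, keys, offset_seconds=15 * i)
    for i, (name, keys) in enumerate(SENSOR_GROUPS.items())
]
```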
d
@Leo Qin this doesn't answer your question about the sensors, but we believe we have implemented a fix for the GraphQLStorageError that you reported - the same asset run that was triggering it before should now be able to complete (let us know if that's different from what you are seeing)
h
@daniel to add more context to the multi-asset sensor issue @Leo Qin mentioned: the only thing the sensors do is fetch the latest materialization event for a set of assets and evaluate whether all assets have a new materialization event by comparing against the cursor. Is it possible the sensors time out because it takes too long to retrieve the latest materialization event for each asset? Thanks for the help here 🙏
d
That's very possible - that exact query (getting the most recent materialization event for a given asset) running slower than expected was actually what was causing the other GraphQLStorageError on this thread - we are seeing it take up to 15 seconds for certain assets which was hitting a timeout. We're going to be working on a better indexing scheme for these queries that should help these sensors run faster as well.
l
hello - is there any update on when these indexing changes are going to happen? Is it soon, or would we get mileage out of trying some workarounds on our end?
p
Hey Leo. I just flipped on the faster queries. It looks like a bunch of your sensors are still failing though. Are you seeing any differences on your end?
l
@prha - still seeing a few failures due to timeouts, yeah. We have a group of 14 sensors that run every 5 minutes, and it seems like some, but not all, of them can run fast enough to succeed within the 60-second timeout
they're all multi asset sensors that do basically the same thing
p
I’m digging into why they’re timing out… can you share a link to one of your multi-asset sensors?
l
PM'd
h
hi dagster team, we're experiencing sensor timeouts again, would you be able to take a look? Thanks
s
Hi Henri - do you know if the sensor evaluation is slower for the failing sensor due to slower external queries? I'll look into increasing the resources for your code server, however this will only help if it is actually CPU or memory pegged. Otherwise the sensor may need to be refactored to do less work. Another feature that we hope to land in the next month or so is the ability to increase sensor timeouts, which should help.
h
@Shalabh Chaturvedi most of our sensors (~90%) don't make external queries and are multi-asset sensors