# dagster-plus
l
hello - recently we've started encountering frequent sensor timeouts; i think it's because we have a number of sensors that all run around the same time - they are all multi-asset sensors that each monitor an exclusive subset of assets. I've been able to work around it a bit by toggling them off and on to introduce some stagger, but is there a way to write these sensors in a more performant way?
also giving the grpc server (the thing that evaluates sensors) more cpu/mem will help
l
hmm - we're on serverless, is that a config somewhere?
j
ah unfortunately not
have you tried using cursors? if you're still having problems i can file a request to increase the cpu/mem allocated for your serverless deployment
l
ok - can confirm that we're using cursors (we just wait until all the assets in the selection have new materialization events and then call `advance_all_cursors`) - a bump up in resources would be great.
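for reference, the sensors are all roughly this shape (a simplified sketch - the asset keys, job name, and sensor name below are placeholders, not our real ones):
```python
from dagster import AssetKey, RunRequest, define_asset_job, multi_asset_sensor

# Placeholder downstream job - in reality each sensor targets its own job/assets.
refresh_reports_job = define_asset_job("refresh_reports_job", selection="reports*")

@multi_asset_sensor(
    monitored_assets=[AssetKey("orders"), AssetKey("customers")],  # this sensor's exclusive subset
    job=refresh_reports_job,
    minimum_interval_seconds=300,  # each sensor runs every 5 minutes
)
def orders_customers_sensor(context):
    # Latest materialization record per monitored asset since the stored cursor;
    # the value is None for any asset with no new materialization.
    records = context.latest_materialization_records_by_key()
    if all(records.values()):
        # Every asset in the selection has a new materialization: advance the
        # cursors past those events and kick off the downstream job.
        context.advance_all_cursors()
        return RunRequest()
```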
j
ok, I will get that filed - it might not go out until Thursday of next week, is that ok?
l
that is fine! thank you!
hi @Joe - did the change to increase resources go out? We are still regularly encountering sensor timeouts, all of which start around the same time. Redeploying the code location recovers things, but the timeouts start again a few hours later. cc @Henri Blancke
s
Hi Leo, due to a configuration issue it looks like the increase may have been recently reverted. I am fixing it now, and the increase should be applied again in a few minutes.
l
hello @Shalabh Chaturvedi - is there any chance that the increase was reverted again? Starting around noon pacific time yesterday we have been encountering sensor timeouts - and re-deploying the code location does not seem to be helping
Starting around noon pacific time today, we are also encountering runs of dbt assets (via dagster-dbt, so very little user-provided code) that are failing with this error:
```
dagster_cloud_cli.core.errors.GraphQLStorageError: Error in GraphQL response: [{'message': 'Internal Server Error (Trace ID: 278874557432852609)', 'locations': [{'line': 22, 'column': 13}], 'path': ['eventLogs', 'getEventRecords']}]
```
s
Hi Leo - I checked the configuration and it is still running with the increased resources. The max CPU usage is at 100% - do you have some heavy computation in the sensor that regularly pegs the cpu?
l
we have about 15 multi-asset sensors that run every 5 minutes. They do minimal computation of their own - a few hit the Slack API - but they are generally just listening on an asset selection and may trigger other assets or jobs
is there perhaps a way to increase the amount of jitter in these sensor runs? They need not all run at the same time
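one thing we could try on our end, I suppose (not sure if it's the recommended approach), is giving each sensor a slightly different `minimum_interval_seconds` so their evaluations drift apart instead of all landing on the same tick - a rough sketch with made-up sensor names and asset keys:
```python
from dagster import AssetKey, RunRequest, define_asset_job, multi_asset_sensor

refresh_reports_job = define_asset_job("refresh_reports_job", selection="reports*")

# Hypothetical exclusive asset subsets, one per sensor (we have ~15 of these).
SENSOR_GROUPS = {
    "orders_sensor": [AssetKey("orders"), AssetKey("order_items")],
    "customers_sensor": [AssetKey("customers"), AssetKey("customer_addresses")],
}

def make_staggered_sensor(name, asset_keys, offset_seconds):
    # Base cadence of 5 minutes plus a small per-sensor offset, so evaluations
    # gradually drift apart instead of all piling up on the same tick.
    @multi_asset_sensor(
        name=name,
        monitored_assets=asset_keys,
        job=refresh_reports_job,
        minimum_interval_seconds=300 + offset_seconds,
    )
    def _sensor(context):
        records = context.latest_materialization_records_by_key()
        if all(records.values()):
            context.advance_all_cursors()
            return RunRequest()
    return _sensor

staggered_sensors = [
    make_staggered_sensor(name, keys, offset_seconds=15 * i)
    for i, (name, keys) in enumerate(SENSOR_GROUPS.items())
]
```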
d
@Leo Qin this doesn't answer your question about the sensors, but we believe we have implemented a fix for the GraphQLStorageError that you reported - the same asset run that was triggering it before should now be able to complete (let us know if that's different from what you are seeing)
h
@daniel to add more context to the multi-asset sensor issue @Leo Qin mentioned: the only thing the sensors do is fetch the latest materialization event for a set of assets and evaluate whether all assets have a new materialization event by comparing against the cursor. Is it possible the sensors time out because it takes too long to retrieve the latest materialization event for each asset? Thanks for the help here 🙏
d
That's very possible - that exact query (getting the most recent materialization event for a given asset) running slower than expected was actually what was causing the other GraphQLStorageError on this thread - we are seeing it take up to 15 seconds for certain assets which was hitting a timeout. We're going to be working on a better indexing scheme for these queries that should help these sensors run faster as well.
l
hello - is there any update on when these indexing changes are going to happen? Is it soon, or would we get mileage out of trying some workarounds on our end?
p
Hey Leo. I just flipped on the faster queries. It looks like a bunch of your sensors are still failing though. Are you seeing any differences on your end?
l
@prha - still seeing a few failures due to timeouts, yeah. We have a group of 14 sensors that run every 5 minutes, and it seems like some, but not all, of them can run fast enough to succeed within the 60-second timeout
they're all multi asset sensors that do basically the same thing
p
I’m digging into why they’re timing out… can you share a link to one of your multi-asset sensors?
l
PM'd
h
hi dagster team, we're experiencing sensor timeouts again, would you be able to take a look? Thanks
s
Hi Henri - do you know if the sensor evaluation is slower for the failing sensor due to slower external queries? I'll look into increasing the resources for your code server, however this will only help if it is actually CPU or memory pegged. Otherwise the sensor may need to be refactored to do less work. Another feature that we hope to land in the next month or so is the ability to increase sensor timeouts, which should help.
h
@Shalabh Chaturvedi most of our sensors (~90%) don't make external queries and are multi-asset sensors