# dagster-plus
Our Dagster Cloud deployment had some kind of issue between around 5am and 8:40am Eastern time. New runs were not created, and the auto-materialize daemon, schedules, and sensors were failing, but the agent seemed to be returning a heartbeat. I triggered a BAU deployment that seems to have resolved the issue - but services being unavailable while the agent was successfully heartbeating is a new one.
Were the sensor ticks failing due to a timeout? It’s odd that a redeployment resolved the issue… Is it possible that there’s something in the sensor evaluations that could leak memory?
It’s also possible that the sensors were timing out due to some degraded query performance on our side, but that wouldn’t get resolved with a redeployment.
Hello - it happened again today, starting around 12am Eastern time. Same signature - agent heartbeating, but no services. Doing a redeployment again seems to have helped. Re memory leaks - most of our sensors are multi-asset sensors that simply kick off other assets. We have a few sensors that list on databases or SFTP servers; those are all context-managed and backed-off, but if there are other signatures that might cause memory leaks, I can take a look. This is somewhat new behavior for us, although we've previously had problems with the agent dying on us.
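A minimal sketch of what a "context-managed and backed-off" listing helper could look like, for anyone comparing against their own sensors. This is illustrative only and not the actual sensor code from this deployment: `FakeConnection`, `managed_connection`, and `list_with_backoff` are hypothetical names, and a real sensor would wrap a database or SFTP client instead of the stand-in class.

```python
import time
from contextlib import contextmanager


class FakeConnection:
    """Hypothetical stand-in for a database or SFTP client."""

    def __init__(self):
        self.closed = False

    def list_items(self):
        return ["a.csv", "b.csv"]

    def close(self):
        self.closed = True


@contextmanager
def managed_connection(factory):
    """Open a connection and always close it, even if listing raises."""
    conn = factory()
    try:
        yield conn
    finally:
        conn.close()


def list_with_backoff(factory, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """List remote items, retrying with exponential backoff on OSError."""
    for attempt in range(max_attempts):
        try:
            with managed_connection(factory) as conn:
                return conn.list_items()
        except OSError:
            # Give up after the last attempt; otherwise wait 0.5s, 1s, 2s, ...
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

Closing the connection in a `finally` block means a failed tick should not leave a client open, which is one of the more common ways a long-lived code server slowly accumulates memory or sockets.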
Another possibly related symptom - I was trying to re-execute several skipped ops (from the run page) as part of a run that had an upstream failure... I got an error along the lines of a timeout waiting for a call, or something like that. I was able to get a run to start successfully by targeting fewer ops at the same time.
the full error:
dagster._core.errors.DagsterUserCodeUnreachableError: Timed out waiting for call to user code GET_SUBSET_EXTERNAL_PIPELINE_RESULT
(I have the identifier at the end separately if you need it) @prha - is it possible there's some kind of resource exhaustion going on with our serverless instance?
Hi Leo. I’ve bumped the memory limits for your code servers… I think that should help.
Hi @prha - the service interruption with a successful heartbeat happened again over the weekend, starting around 11pm Pacific on 2023-08-26. A redeployment resolved the issue, but it is difficult to maintain constant operation while this is happening. Is there any update on the root cause?