#ask-community

Scott Hood

08/08/2022, 2:01 PM
Hey All, our dagster deployment on the 6th at 9PM started giving us the following error:
dagster.daemon - ERROR - Thread for SENSOR did not shut down gracefully
The weird thing is we have seen runs being kicked off by sensors after that date. When tailing the logs, there isn't a specific error or anything, just that things didn't shut down gracefully. Any tips on troubleshooting this issue?

daniel

08/08/2022, 2:04 PM
Hey Scott - this message would come later in response to some previous event telling the daemon to shut down (for example, the cluster/box it is running on sending an interrupt signal). Does that give any clues about where the interrupt might have come from?
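To illustrate the sequence described above, here is a minimal Python sketch. It is not Dagster's actual implementation; the names `worker` and `shut_down` and the timing values are hypothetical. It only shows the general pattern: something external asks the process to stop, the process waits a bounded time for its thread, and a "did not shut down gracefully" message is the consequence of that wait timing out rather than the original cause.

```python
import threading
import time


def worker(stop_event: threading.Event, busy_seconds: float) -> None:
    # Hypothetical stand-in for a sensor thread: it only checks the stop
    # flag between units of work, so a long unit of work delays shutdown.
    while not stop_event.is_set():
        time.sleep(busy_seconds)


def shut_down(busy_seconds: float, join_timeout: float) -> str:
    stop_event = threading.Event()
    thread = threading.Thread(
        target=worker, args=(stop_event, busy_seconds), daemon=True
    )
    thread.start()
    time.sleep(0.05)  # let the worker begin a unit of work

    # This is the step an external interrupt (for example a signal sent
    # by the cluster or the operating system) would trigger.
    stop_event.set()
    thread.join(timeout=join_timeout)

    if thread.is_alive():
        return "Thread did not shut down gracefully"
    return "Thread shut down cleanly"
```

With a short unit of work the thread exits within the timeout; with a long one the join times out and the "not graceful" branch is taken, even though nothing inside the thread raised an error.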

Scott Hood

08/08/2022, 2:17 PM
OK, so the daemon would be told to shut down by x, in which case the error causing the shutdown might exist in one of the other processes?

daniel

08/08/2022, 2:23 PM
Yeah, but x wouldn't be a dagster process - it could be the cluster running your daemon or the operating system. Hard to give more concrete tips without knowing more about your deployment setup

Scott Hood

08/08/2022, 2:26 PM
Deployment is done via the dagster helm chart + k8s.

daniel

08/08/2022, 2:27 PM
Got it - describing the daemon pod to see why the cluster decided to interrupt it might give some clues
Maybe it decided to scale up or down and move it to a new node - the daemon should be able to recover and start back up where it left off when this happens
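One way to act on this suggestion is to dump the daemon pod as JSON (`kubectl get pod <daemon-pod-name> -o json`, with the pod name filled in for your deployment) and look at each container's restart count and last termination. The sketch below assumes that JSON has already been loaded into a dict; the function name `last_terminations` and the sample data are made up for illustration. The fields it reads (`status.containerStatuses[].restartCount` and `lastState.terminated`) are standard Kubernetes pod status fields.

```python
from typing import Any, Dict, List


def last_terminations(pod: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Summarize each container's restart count and last termination, if any."""
    summaries = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState.terminated is only present if the container has
        # been stopped and restarted at least once.
        terminated = status.get("lastState", {}).get("terminated")
        summaries.append(
            {
                "container": status.get("name"),
                "restarts": status.get("restartCount", 0),
                "reason": terminated.get("reason") if terminated else None,
                "exit_code": terminated.get("exitCode") if terminated else None,
                "finished_at": terminated.get("finishedAt") if terminated else None,
            }
        )
    return summaries


# Made-up sample resembling a pod whose container was killed and restarted.
sample_pod = {
    "status": {
        "containerStatuses": [
            {
                "name": "dagster",  # hypothetical container name
                "restartCount": 12,
                "lastState": {
                    "terminated": {
                        "reason": "OOMKilled",
                        "exitCode": 137,
                        "finishedAt": "2022-08-08T14:39:00Z",
                    }
                },
            }
        ]
    }
}

for summary in last_terminations(sample_pod):
    print(summary)
```

A steadily climbing restart count together with a reason such as `OOMKilled` would point at a resource limit; a pod that was instead rescheduled onto another node would typically show up as a new pod with a low restart count.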

Scott Hood

08/08/2022, 2:39 PM
Noticing that one of the sensors attempts to start roughly every 7 minutes. However, the same isn't happening for any other sensor:
Some of the other sensors have this as well, but it's not nearly as aggressive

daniel

08/08/2022, 2:44 PM
Could the daemon be hitting a memory limit or something every 7 minutes that causes it to shut down?
I don't 100% follow what exactly the unexpected thing is in that screenshot - what are you expecting to see there instead?

Scott Hood

08/08/2022, 2:45 PM
Typically for sensors I always noticed they go from Started to Requested or Skipped
but all of these Started and did nothing else

daniel

08/08/2022, 2:45 PM
I see - so it may be shutting down during the execution of that particular sensor
If that happened my hope is that it would be reflected in the daemon logs in some way

Scott Hood

08/08/2022, 2:48 PM
ya at least in the daemon itself the only error I see is:

daniel

08/08/2022, 2:49 PM
Do you see logs earlier that show it trying to execute that particular sensor that looks like it is misbehaving?

Scott Hood

08/08/2022, 3:11 PM
Not seeing anything in the daemon that really shows an error for any specific sensor...

daniel

08/08/2022, 3:12 PM
It should log when it starts and finishes each sensor tick, with timestamps - that can help to get a picture of what's going on. E.g. it sounds from your description that it might be starting a tick but not finishing it
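A rough way to check for "started but never finished" ticks is to pair up start and finish entries per sensor. The sketch below assumes the daemon logs have first been reduced to a simplified, hypothetical three-field format (`<ISO timestamp> START|FINISH <sensor name>`); the real daemon log lines look different, so the parsing would need adapting. The function name `unfinished_ticks` and the sensor names are also made up.

```python
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def unfinished_ticks(lines: Iterable[str]) -> List[Tuple[str, datetime]]:
    """Return (sensor, start_time) for ticks that started but never finished."""
    open_ticks: Dict[str, datetime] = {}
    unfinished: List[Tuple[str, datetime]] = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip lines that are not in the simplified format
        timestamp, event, sensor = parts
        when = datetime.fromisoformat(timestamp)
        if event == "START":
            # A new start while the previous tick is still open means
            # the previous tick never logged a finish.
            if sensor in open_ticks:
                unfinished.append((sensor, open_ticks[sensor]))
            open_ticks[sensor] = when
        elif event == "FINISH":
            open_ticks.pop(sensor, None)
    # Anything still open at the end of the log also never finished.
    unfinished.extend(open_ticks.items())
    return unfinished


# Made-up sample: sensor_b starts twice, about 7 minutes apart, and never finishes.
sample_lines = [
    "2022-08-08T14:00:00 START sensor_a",
    "2022-08-08T14:00:02 FINISH sensor_a",
    "2022-08-08T14:00:05 START sensor_b",
    "2022-08-08T14:07:05 START sensor_b",
    "2022-08-08T14:07:30 START sensor_a",
    "2022-08-08T14:07:31 FINISH sensor_a",
]

for sensor, started in unfinished_ticks(sample_lines):
    print(sensor, started.isoformat())
```

If the same sensor keeps showing up here at a regular interval, that lines up with the daemon being interrupted during that sensor's evaluation; if no start entries appear at all, the sensor loop itself may not be running.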

Scott Hood

08/08/2022, 3:14 PM
Would it be something that continuously repeats or something that started once back when the original errors began and just hung?

daniel

08/08/2022, 3:14 PM
It should log on every tick when everything is running smoothly, so the former

Scott Hood

08/08/2022, 4:15 PM
@daniel so we didn't see any logs... However, after turning the sensors off and then back on, the sensor daemon is now fine.
Ya, very odd, turning everything off and then back on again caused everything to become healthy and run as expected. No idea why.

daniel

08/08/2022, 4:45 PM
Hmmm, would have to see logs from the bad times and the good times to fully evaluate this I think... in the short term, glad things are working as expected again. What version of dagster was this?

Scott Hood

08/08/2022, 4:45 PM
0.15.3