# ask-community
s
Hey All, our dagster deployment on the 6th at 9PM started giving us the following error:
dagster.daemon - ERROR - Thread for SENSOR did not shut down gracefully
The weird thing is that we have seen runs being kicked off by sensors after that date. When tailing the logs, there isn't a specific error, just the message that things didn't shut down gracefully. Any tips on troubleshooting this issue?
d
Hey Scott - this message would come later in response to some previous event telling the daemon to shut down (for example, the cluster/box it is running on sending an interrupt signal). Does that give any clues about where the interrupt might have come from?
s
Ok, so the daemon would be told to shut down by x, in which case the error that is causing the shutdown might exist in one of the other processes?
d
Yeah, but x wouldn't be a dagster process. It could be the cluster running your daemon or the operating system. It's hard to give more concrete tips without knowing more about your deployment setup.
s
Deployment is done via the dagster helm chart + k8s.
d
Got it - describing the daemon pod to see why the cluster decided to interrupt it might give some clues
Maybe it decided to scale up or down and move it to a new node - the daemon should be able to recover and start back up where it left off when this happens
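
A minimal sketch of that pod check, assuming the pod's JSON has been fetched with `kubectl get pod <daemon-pod-name> -o json` and loaded into a dict. The function name `summarize_restarts` is made up for illustration. It only reads the standard `status.containerStatuses` fields (`restartCount`, `lastState.terminated`), so a reason such as `OOMKilled` would show up here if the container was killed for exceeding a memory limit. It will not explain a pod that was evicted and recreated under a new name; `kubectl describe pod` and the cluster events are the place to look for that.

```python
def summarize_restarts(pod: dict) -> list:
    """Return one summary per container: name, restart count, last termination."""
    summaries = []
    for status in pod.get("status", {}).get("containerStatuses", []):
        # lastState.terminated is only present if the container has
        # been terminated and restarted at least once.
        terminated = status.get("lastState", {}).get("terminated") or {}
        summaries.append({
            "container": status.get("name"),
            "restarts": status.get("restartCount", 0),
            "reason": terminated.get("reason"),        # e.g. "OOMKilled" or "Error"
            "exit_code": terminated.get("exitCode"),
            "finished_at": terminated.get("finishedAt"),
        })
    return summaries
```

To use it, load the saved `kubectl` output with `json.load` and pass the resulting dict in. A high restart count together with a repeated termination reason would point at the cluster rather than at Dagster.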
s
Noticing that one of the sensors attempts to start roughly every 7 minutes. However, the same isn't happening for any other sensor:
Some of the other sensors have this as well, but it's not nearly as aggressive
d
Could the daemon be hitting a memory limit or something every 7 minutes that causes it to shut down?
I don't 100% follow what exactly the unexpected thing is in that screenshot - what are you expecting to see there instead?
s
Typically for sensors I always noticed they go from Started to Requested or Skipped
but all of these Started and did nothing else
d
I see - so it may be shutting down during the execution of that particular sensor
If that happened my hope is that it would be reflected in the daemon logs in some way
s
ya at least in the daemon itself the only error I see is:
d
Do you see logs earlier that show it trying to execute that particular sensor that looks like it is misbehaving?
s
Not seeing anything in the daemon that really shows an error for any specific sensor...
d
It should log when it starts and finishes each sensor tick, with timestamps. That can help to get a picture of what's going on, e.g. it sounds from your description like it might be starting a tick but not finishing it
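
One way to check for ticks that start but never finish is sketched below. The exact wording of the daemon's log messages isn't given in this thread, so the sketch assumes the log lines have already been reduced to `(timestamp, sensor_name, kind)` tuples in chronological order, with `kind` being `"start"` or `"finish"`. That tuple format and the name `unfinished_ticks` are assumptions for illustration, not part of Dagster.

```python
def unfinished_ticks(events):
    """Return (sensor, start_time) for every tick start with no matching finish.

    `events` is an iterable of (timestamp, sensor_name, kind) tuples in
    chronological order, where kind is "start" or "finish".
    """
    open_start = {}   # sensor name -> timestamp of the tick currently in progress
    unfinished = []
    for timestamp, sensor, kind in events:
        if kind == "start":
            # A new start while an earlier tick is still open means the
            # earlier tick never logged a finish.
            if sensor in open_start:
                unfinished.append((sensor, open_start[sensor]))
            open_start[sensor] = timestamp
        elif kind == "finish":
            open_start.pop(sensor, None)
    # Anything still open at the end of the log also never finished.
    unfinished.extend(sorted(open_start.items(), key=lambda item: item[1]))
    return unfinished
```

If one sensor dominates the result while the others pair up cleanly, that would support the idea that the daemon is being interrupted during that sensor's execution.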
s
Would it be something that continuously repeats or something that started once back when the original errors began and just hung?
d
It should log on every tick when everything is running smoothly, so the former
s
@daniel so we didn't see any logs. However, after turning the sensors off and then back on, the sensor daemon is now fine.
Ya, very odd, turning everything off and then back on again caused everything to become healthy and run as expected. No idea why.
d
Hmmm, would have to see logs from the bad times and the good times to fully evaluate this I think... in the short term, glad things are working as expected again. What version of dagster was this?
s
0.15.3