# ask-community
b
Hello again! I'm running Dagster with a Databricks step launcher. The jobs are launched and run by the daemon. Another issue I'm having is that sometimes my daemon dies but Dagit seems to think the job is still running. How can I make it detect that the job has terminated and that it has to run it again?
j
Hi Bernardo, how do you have Dagster deployed? What version are you on?
What does it mean for your daemon to die? The box it’s running on goes down or the Dagster Daemon process crashes?
b
Hi! I'm on 0.14.6. It's running on k8s and sometimes it gets OOMKilled, so the whole box dies. Then it is restarted, but Dagit thinks the job is still running.
j
And what run launcher are you using? I’m asking because with the helm chart it will default to the K8sRunLauncher which spins up runs in new K8s Jobs, so an OOM in the daemon shouldn’t impact them
If using the DefaultRunLauncher, the runs will be in subprocesses of the daemon and I can imagine that causing an OOM kill
We do have some new features in this area; note that they're mostly oriented around the K8sRunLauncher: https://docs.dagster.io/deployment/run-monitoring
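For reference, run monitoring is switched on in dagster.yaml; it looks roughly like this (a sketch based on the docs for that era, and the timeout value is just an example):
```yaml
run_monitoring:
  enabled: true
  # Fail runs that never reach a STARTED state within this window
  start_timeout_seconds: 180
```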
b
It's the DefaultRunLauncher
I'm just worried that the daemon may die and the failure won't be detected
j
Gotcha, we don’t currently have run monitoring support on the DefaultRunLauncher. In a lot of cases the Daemon will receive an interrupt and it will fail the runs gracefully, but this will be an issue for an ungraceful shutdown
b
Got it... Anything you suggest that would make this a bit safer?
j
Ultimately we just need to add monitoring for the DefaultRunLauncher. In the meantime, you could craft run timeouts etc. inside a scheduled job, something like the sketch below.
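A rough sketch against the 0.14-era instance APIs (the two-hour cutoff, the op/job/schedule names, and the 15-minute cron are all placeholders; you'd want to verify report_run_failed behaves how you want against your run storage):
```python
import time

from dagster import job, op, repository, schedule
from dagster.core.storage.pipeline_run import PipelineRunStatus, PipelineRunsFilter

# Hypothetical cutoff: treat anything in STARTED for longer than this as stuck.
MAX_RUN_AGE_SECONDS = 2 * 60 * 60


@op
def fail_stuck_runs(context):
    instance = context.instance
    now = time.time()
    # Find all runs the instance still believes are in progress
    for run in instance.get_runs(
        filters=PipelineRunsFilter(statuses=[PipelineRunStatus.STARTED])
    ):
        stats = instance.get_run_stats(run.run_id)
        if stats.start_time and now - stats.start_time > MAX_RUN_AGE_SECONDS:
            context.log.warning(f"Failing stuck run {run.run_id}")
            # Marks the run as FAILURE so it can be re-launched
            instance.report_run_failed(run)


@job
def stuck_run_cleanup():
    fail_stuck_runs()


@schedule(cron_schedule="*/15 * * * *", job=stuck_run_cleanup)
def stuck_run_cleanup_schedule(_context):
    return {}


@repository
def cleanup_repo():
    return [stuck_run_cleanup, stuck_run_cleanup_schedule]
```
The idea is that even if the daemon dies ungracefully mid-run, the next tick of the schedule marks the orphaned run as failed instead of leaving it in STARTED forever.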