# ask-community
b
Hello again! I'm running Dagster with a Databricks step launcher. The jobs are launched and run by the daemon. Another issue I'm having is that sometimes my daemon dies but Dagit seems to think the job is still running. How can I make it detect that the job has terminated and that it has to run it again?
j
Hi Bernardo, how do you have Dagster deployed? What version are you on?
What does it mean for your daemon to die? The box it’s running on goes down or the Dagster Daemon process crashes?
b
Hi! I'm on 0.14.6. It's running on k8s and sometimes it gets OOMKilled, so the whole box dies. Then it is restarted, but Dagit thinks the job is still running.
j
And what run launcher are you using? I’m asking because with the helm chart it will default to the K8sRunLauncher which spins up runs in new K8s Jobs, so an OOM in the daemon shouldn’t impact them
If using the DefaultRunLauncher, the runs will be in subprocesses of the daemon and I can imagine that causing an OOM kill
We do have some new features in this area; note that they're mostly oriented around the K8sRunLauncher: https://docs.dagster.io/deployment/run-monitoring
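For reference, run monitoring is switched on in dagster.yaml; it looks roughly like this (a sketch based on the docs for that era, and the timeout value is just an example):
```yaml
run_monitoring:
  enabled: true
  # Fail runs that never reach a STARTED state within this window
  start_timeout_seconds: 180
```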
b
It's the DefaultRunLauncher
I'm just worried that the daemon may die and the failure won't be detected
j
Gotcha, we don’t currently have run monitoring support on the DefaultRunLauncher. In a lot of cases the Daemon will receive an interrupt and it will fail the runs gracefully, but this will be an issue for an ungraceful shutdown
b
Got it... Anything you suggest that would make this a bit safer?
j
Ultimately we just need to add monitoring for the DefaultRunLauncher. In the meantime, you could craft run timeouts etc. inside a scheduled job, something like the sketch below.
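A rough sketch against the 0.14-era instance APIs (the two-hour cutoff, the op/job/schedule names, and the 15-minute cron are all placeholders; you'd want to verify report_run_failed behaves how you want against your run storage):
```python
import time

from dagster import job, op, repository, schedule
from dagster.core.storage.pipeline_run import PipelineRunStatus, PipelineRunsFilter

# Hypothetical cutoff: treat anything in STARTED for longer than this as stuck.
MAX_RUN_AGE_SECONDS = 2 * 60 * 60


@op
def fail_stuck_runs(context):
    instance = context.instance
    now = time.time()
    # Find all runs the instance still believes are in progress
    for run in instance.get_runs(
        filters=PipelineRunsFilter(statuses=[PipelineRunStatus.STARTED])
    ):
        stats = instance.get_run_stats(run.run_id)
        if stats.start_time and now - stats.start_time > MAX_RUN_AGE_SECONDS:
            context.log.warning(f"Failing stuck run {run.run_id}")
            # Marks the run as FAILURE so it can be re-launched
            instance.report_run_failed(run)


@job
def stuck_run_cleanup():
    fail_stuck_runs()


@schedule(cron_schedule="*/15 * * * *", job=stuck_run_cleanup)
def stuck_run_cleanup_schedule(_context):
    return {}


@repository
def cleanup_repo():
    return [stuck_run_cleanup, stuck_run_cleanup_schedule]
```
The idea is that even if the daemon dies ungracefully mid-run, the next tick of the schedule marks the orphaned run as failed instead of leaving it in STARTED forever.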