# ask-community
h
Hi! We’ve been running Dagster for a few months now, mostly to kick off dbt. Last week we introduced a new job that needs to run every 15 minutes, and I’m seeing a lot of failures - for some reason the job cannot start and is killed by run monitoring after the timeout. There’s really nothing in the logs, and there should be enough capacity on the k8s cluster. What can I do to debug this? Any pointers?
j
kubectl describe job <dagster-run-…>
might show events on why the Job isn’t starting
h
Sorry for taking so long to reply - I had to increase the TTL of the jobs to catch the output, and then wait for a job to fail to start
I can’t really find anything different between a failed job and one that ran successfully, unfortunately
Oh, I found one that is different - the job is simply never created, and neither is the pod.
After a few minutes, once the run has already timed out, Kubernetes decides to create the job, which is then ignored by Dagster
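Since the Job objects are cleaned up by their TTL before they can be inspected, one way to capture the scheduling history is to stream Kubernetes events for the namespace as they happen, rather than describing individual Jobs after the fact. Below is a rough sketch using the official `kubernetes` Python client; the `dagster` namespace and the `dagster-run-` name prefix are assumptions to adjust to the actual deployment.

```python
# Rough sketch: stream Job/Pod events from the namespace where Dagster launches
# its run Jobs, so scheduling delays can be inspected without racing the Job TTL.
# The "dagster" namespace and the "dagster-run-" name prefix are assumptions.
from kubernetes import client, config, watch


def stream_run_events(namespace: str = "dagster") -> None:
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    core = client.CoreV1Api()
    # Events carry reasons like SuccessfulCreate, FailedCreate or FailedScheduling,
    # which is exactly the information lost once the Job's TTL cleans it up.
    for item in watch.Watch().stream(core.list_namespaced_event, namespace=namespace):
        event = item["object"]
        obj = event.involved_object
        if obj.kind in ("Job", "Pod") and obj.name.startswith("dagster-run-"):
            print(f"{event.last_timestamp} {obj.kind}/{obj.name} {event.reason}: {event.message}")


if __name__ == "__main__":
    stream_run_events()
```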
j
Got it. That might be improved by adding more resources to your cluster; otherwise, the timeout is configurable here: https://docs.dagster.io/deployment/run-monitoring#run-monitoring
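For reference, the run monitor linked above acts on runs that stay in the STARTING state past that timeout. To see what it sees at any given moment, a small script against the instance API can list those runs; this is a rough sketch assuming a recent Dagster release (class names differ in older versions) and that it runs with the same DAGSTER_HOME as the deployment.

```python
# Rough sketch: list runs that Dagster has launched but that no run worker has
# picked up yet (status STARTING), the state the 15-minute job gets stuck in.
# Assumes DAGSTER_HOME points at the same instance configuration as the deployment.
from dagster import DagsterInstance, DagsterRunStatus, RunsFilter


def list_stuck_runs() -> None:
    instance = DagsterInstance.get()
    stuck = instance.get_runs(filters=RunsFilter(statuses=[DagsterRunStatus.STARTING]))
    for run in stuck:
        print(run.run_id, run.job_name, run.tags.get("dagster/schedule_name"))


if __name__ == "__main__":
    list_stuck_runs()
```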
h
There should be more than enough resources available (we’re running a node pool with currently only Dagster on it), so it’s probably a side effect of running on AKS
I think I’ll switch to another launcher for this use case, so I can have a container running 24/7 instead of spinning one up every 15 minutes
(especially considering the job itself takes around 3 minutes to run, while scheduling takes anywhere from 1 to 10 minutes…)
j
Yeah we’re thinking about ways to let you alternate between the K8sRunLauncher (new K8s Job) and the DefaultRunLauncher (uses the standing gRPC server)
h
Yeah, that would be a great addition! For now, wrapping the two launchers and switching based on something I pass from the job should work, so that fixes my issue 🙂
Thanks for your time, much appreciated!
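For completeness, the wrapping idea at the end of the thread could look roughly like the sketch below: a launcher that delegates each run to either the per-run K8s Job launcher or the standing gRPC server based on a run tag. The import paths, the `dagster_run.tags` access, and the `launcher` tag name are assumptions against a recent Dagster release, and the ConfigurableClass plumbing needed to wire this into dagster.yaml (plus delegation of the remaining RunLauncher members, such as run-health checks) is omitted.

```python
# Rough sketch of wrapping K8sRunLauncher and DefaultRunLauncher and switching on a
# run tag. Import paths and attribute names are assumptions against a recent Dagster
# release; the ConfigurableClass machinery needed to register this launcher in
# dagster.yaml, and construction of the two wrapped launchers, are omitted.
from dagster._core.launcher import DefaultRunLauncher, LaunchRunContext, RunLauncher
from dagster_k8s import K8sRunLauncher


class TagSwitchingRunLauncher(RunLauncher):
    """Delegate each run to the per-run K8s Job launcher or the standing gRPC server."""

    def __init__(self, k8s_launcher: K8sRunLauncher, default_launcher: DefaultRunLauncher):
        super().__init__()
        self._k8s = k8s_launcher
        self._default = default_launcher

    def _pick(self, context: LaunchRunContext) -> RunLauncher:
        # "launcher" is a hypothetical tag: set it on the fast 15-minute job so that
        # job skips the per-run K8s Job and goes straight to the gRPC server.
        if context.dagster_run.tags.get("launcher") == "grpc":
            return self._default
        return self._k8s

    def launch_run(self, context: LaunchRunContext) -> None:
        self._pick(context).launch_run(context)

    def terminate(self, run_id: str) -> bool:
        # Without recording which launcher started a run, just try both.
        return self._default.terminate(run_id) or self._k8s.terminate(run_id)
```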