# ask-community
h
Hi! We’ve been running Dagster for a few months now, mostly to kick off dbt. Last week we introduced a new job that needs to run every 15 minutes, and I’m seeing a lot of failures - for some reason the job cannot start and is killed by run monitoring after the timeout. There’s really nothing in the logs, and there should be enough capacity on the k8s cluster. What can I do to debug this? Any pointers?
j
kubectl describe job <dagster-run-…>
might show events on why the Job isn’t starting
h
Sorry for taking so long to reply - I had to increase the TTL of the jobs to catch the output, and then wait for a job to fail to start
I can’t really find anything different between a failed job and one that ran successfully, unfortunately
Oh, I found one that is different - the job is simply never created, and neither is the pod.
After a few minutes, once the run has already timed out, Kubernetes decides to create the job, which is then ignored by Dagster
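Since the Job objects are cleaned up by their TTL before they can be inspected, one way to capture the scheduling history is to stream Kubernetes events for the namespace as they happen, rather than describing individual Jobs after the fact. Below is a rough sketch using the official `kubernetes` Python client; the `dagster` namespace and the `dagster-run-` name prefix are assumptions to adjust to the actual deployment.

```python
# Rough sketch: stream Job/Pod events from the namespace where Dagster launches
# its run Jobs, so scheduling delays can be inspected without racing the Job TTL.
# The "dagster" namespace and the "dagster-run-" name prefix are assumptions.
from kubernetes import client, config, watch


def stream_run_events(namespace: str = "dagster") -> None:
    config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
    core = client.CoreV1Api()
    # Events carry reasons like SuccessfulCreate, FailedCreate or FailedScheduling,
    # which is exactly the information lost once the Job's TTL cleans it up.
    for item in watch.Watch().stream(core.list_namespaced_event, namespace=namespace):
        event = item["object"]
        obj = event.involved_object
        if obj.kind in ("Job", "Pod") and obj.name.startswith("dagster-run-"):
            print(f"{event.last_timestamp} {obj.kind}/{obj.name} {event.reason}: {event.message}")


if __name__ == "__main__":
    stream_run_events()
```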
j
Got it. That might be improved by adding more resources to your cluster; otherwise, the timeout is configurable here: https://docs.dagster.io/deployment/run-monitoring#run-monitoring
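For reference, the run monitor linked above acts on runs that stay in the STARTING state past that timeout. To see what it sees at any given moment, a small script against the instance API can list those runs; this is a rough sketch assuming a recent Dagster release (class names differ in older versions) and that it runs with the same DAGSTER_HOME as the deployment.

```python
# Rough sketch: list runs that Dagster has launched but that no run worker has
# picked up yet (status STARTING), the state the 15-minute job gets stuck in.
# Assumes DAGSTER_HOME points at the same instance configuration as the deployment.
from dagster import DagsterInstance, DagsterRunStatus, RunsFilter


def list_stuck_runs() -> None:
    instance = DagsterInstance.get()
    stuck = instance.get_runs(filters=RunsFilter(statuses=[DagsterRunStatus.STARTING]))
    for run in stuck:
        print(run.run_id, run.job_name, run.tags.get("dagster/schedule_name"))


if __name__ == "__main__":
    list_stuck_runs()
```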
h
There should be more than enough resources available (we’re running a node pool with currently only Dagster on it), so it’s probably a side effect of running on AKS
I think I’ll switch to another launcher for this use case, so I can have a container running 24/7 instead of spinning one up every 15 minutes
(especially considering the job itself takes around 3 minutes to run, while scheduling takes anywhere from 1 to 10 minutes…)
j
Yeah we’re thinking about ways to let you alternate between the K8sRunLauncher (new K8s Job) and the DefaultRunLauncher (uses the standing gRPC server)
h
Yeah, that would be a great addition! For now, wrapping the two launchers and switching based on something I pass from the job should work, so that fixes my issue 🙂
Thanks for your time, much appreciated!
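For completeness, the wrapping idea at the end of the thread could look roughly like the sketch below: a launcher that delegates each run to either the per-run K8s Job launcher or the standing gRPC server based on a run tag. The import paths, the `dagster_run.tags` access, and the `launcher` tag name are assumptions against a recent Dagster release, and the ConfigurableClass plumbing needed to wire this into dagster.yaml (plus delegation of the remaining RunLauncher members, such as run-health checks) is omitted.

```python
# Rough sketch of wrapping K8sRunLauncher and DefaultRunLauncher and switching on a
# run tag. Import paths and attribute names are assumptions against a recent Dagster
# release; the ConfigurableClass machinery needed to register this launcher in
# dagster.yaml, and construction of the two wrapped launchers, are omitted.
from dagster._core.launcher import DefaultRunLauncher, LaunchRunContext, RunLauncher
from dagster_k8s import K8sRunLauncher


class TagSwitchingRunLauncher(RunLauncher):
    """Delegate each run to the per-run K8s Job launcher or the standing gRPC server."""

    def __init__(self, k8s_launcher: K8sRunLauncher, default_launcher: DefaultRunLauncher):
        super().__init__()
        self._k8s = k8s_launcher
        self._default = default_launcher

    def _pick(self, context: LaunchRunContext) -> RunLauncher:
        # "launcher" is a hypothetical tag: set it on the fast 15-minute job so that
        # job skips the per-run K8s Job and goes straight to the gRPC server.
        if context.dagster_run.tags.get("launcher") == "grpc":
            return self._default
        return self._k8s

    def launch_run(self, context: LaunchRunContext) -> None:
        self._pick(context).launch_run(context)

    def terminate(self, run_id: str) -> bool:
        # Without recording which launcher started a run, just try both.
        return self._default.terminate(run_id) or self._k8s.terminate(run_id)
```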