Hi team, we have a dagster job stuck in “starting” phase. We see `[K8sRunLauncher] Kubernetes run worker job created` but then nothing else after a day. I couldn’t find the job pod created in our k8s cluster though. Wondering if you know what might have caused this and how to debug please?

Hi Hebo - if the pod isn't there, what about the k8s job? Does `kubectl get jobs | grep <job name>` show anything?

Thanks Daniel! Yes! the job exists with
Pods Statuses:  0 Running / 0 Succeeded / 1 Failed

If you describe the job are there any clues?

Other than that `Pods Statuses: 0 Running / 0 Succeeded / 1 Failed`, it doesn't have much info

let me also check with our compute team..
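
For anyone scripting this check: a minimal sketch that pulls the pod counts out of saved `kubectl describe job` output. The line format is copied from the message above; other kubectl versions may print it differently, and the function names here are made up for illustration. A nonzero Failed count with no pod in the cluster suggests the pod was created, failed, and has since been cleaned up.

```python
import re

# Matches the summary line as quoted in this thread, e.g.
# "Pods Statuses:  0 Running / 0 Succeeded / 1 Failed".
# Assumption: other kubectl versions may format this line differently.
_STATUS_RE = re.compile(
    r"Pods Statuses:\s*(\d+) Running / (\d+) Succeeded / (\d+) Failed"
)


def parse_pod_statuses(describe_output: str) -> dict:
    """Return the running/succeeded/failed pod counts from describe output."""
    match = _STATUS_RE.search(describe_output)
    if match is None:
        raise ValueError("no 'Pods Statuses' line found")
    running, succeeded, failed = (int(group) for group in match.groups())
    return {"running": running, "succeeded": succeeded, "failed": failed}


def pod_failed_without_running(describe_output: str) -> bool:
    """True when at least one pod failed and none is running or succeeded."""
    counts = parse_pod_statuses(describe_output)
    return (
        counts["failed"] > 0
        and counts["running"] == 0
        and counts["succeeded"] == 0
    )
```

Feed it the text saved from `kubectl describe job <job name>`; it raises `ValueError` if the summary line is missing.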

I'm having the same issue.

Recently upgraded to 0.14.19, running on EKS. This problem appeared only after the upgrade, but I'm not sure it's related.

dagster-run.kubectl-describe.20220607T100000Z.txt

Here's the output when describing the K8s job:

dagster-run.kubctl-logs.20220607T100700Z.txt

I launched a new instance of the same run from the Launchpad, and now I see the pod actually gets scheduled, but fails immediately on launch. I don't know what to make of that stack trace.

Now I see this bad import was removed in a yet-unreleased PR (https://github.com/dagster-io/dagster/pull/7886):
https://github.com/dagster-io/dagster/commit/80153e0ecbf58f6e92754921c113a4fc8de556fd#diff-d8e2556c59e95084414a5ec35342b36fc2de5de0c632e1c062b21ae92c13cc1a

~I'll upgrade the dagster dependency in my user-code image and report back.~
I see that my image is actually running dagster 0.13.18, explaining why that happened.

Hey assaf - that looks like a mismatch between your dagster-postgres package and your dagster package, yeah. There should be a pin that keeps them in lockstep but it looks like that might not be getting respected
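
A rough way to spot this kind of mismatch from inside the user-code image, sketched with the standard library only (Python 3.8+). It assumes "lockstep" means `dagster` and `dagster-postgres` report the same version string; the helper names are hypothetical, not part of dagster.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


def find_mismatches(versions, reference="dagster"):
    """Return the packages whose version differs from the reference package.

    Assumption: packages in lockstep share the exact same version string.
    """
    expected = versions.get(reference)
    return {
        name: found
        for name, found in versions.items()
        if name != reference and found != expected
    }
```

Running `find_mismatches(installed_versions(["dagster", "dagster-postgres"]))` in the image should return an empty dict when the two packages agree.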