Some of our dbt jobs within a GKE cluster are fail...
# ask-community
Some of our dbt jobs within a GKE cluster are failing intermittently with the following error:
<job xyz> started a new run worker while the run was already in state DagsterRunStatus.STARTED. This most frequently happens when the run worker unexpectedly stops and is restarted by the cluster. Marking the run as failed.
It’s been difficult to troubleshoot because a) rerunning the same job succeeds and b) the job and pods look normal and have no log output beyond the message above. The only anomaly we could find (attached) is that 2 pods are created in the job. That suggests the 1st pod terminated (kubectl returns no info when we query for it) and the 2nd one might have been created by the daemon during a heartbeat check on the pod.
Further info:
- The Helm run_monitor and run queue coordinator are turned off.
- The multiprocess executor is used by default.
- The job is already provisioned and running when it errors (`Started execution of run for "job xyz"`).
- The pod resources are static, in that the limits are the same as the initial requests, so resource contention is unlikely.
- We do trigger the same job via GraphQL so that multiple copies run in parallel, but each GraphQL request should create a different job and different pods.
@daniel @Paul Swithers @Paz Turson
This channel is definitely busier these days, but any ideas? Could this be a GKE-only thing? We're waiting to decide whether we should contact their support.
d has some discussion of this - it's likely more of a GKE thing than a Dagster thing. I don't think Dagster creates a 2nd copy of a pod for the same job anywhere. The issue I posted there has some discussion of checking the job controller to pull more logs about why the first pod was terminated.
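For anyone following along, here is a rough sketch of one way to look for clues about the first pod, assuming the cluster events have been dumped with `kubectl get events -n <namespace> -o json > events.json`. The helper name `summarize_job_events` and the job name in the usage note are made up for illustration. Only the standard Event fields (`involvedObject`, `reason`, `message`, `lastTimestamp`) are read. Events are only retained for a limited time, so this has to run soon after the failure.

```python
import json


def summarize_job_events(events_json: str, job_name: str) -> list[str]:
    """Return one line per event involving the Job or one of its pods, oldest first.

    Assumes `events_json` is the output of `kubectl get events -o json`.
    """
    events = json.loads(events_json).get("items", [])
    rows = []
    for ev in events:
        obj = ev.get("involvedObject", {})
        name = obj.get("name", "")
        # Pods created by a Job are named "<job-name>-<random suffix>".
        if name != job_name and not name.startswith(job_name + "-"):
            continue
        # lastTimestamp can be null on some events, so fall back to "".
        ts = ev.get("lastTimestamp") or ""
        rows.append(
            (ts, f"{ts} {obj.get('kind')}/{name} {ev.get('reason')}: {ev.get('message')}")
        )
    # ISO 8601 timestamps sort correctly as plain strings.
    rows.sort(key=lambda row: row[0])
    return [line for _, line in rows]
```

Usage would look something like `print("\n".join(summarize_job_events(open("events.json").read(), "dagster-run-<run_id>")))`. Reasons such as `Killing`, `Evicted`, or `Preempted` on the first pod, followed by a `SuccessfulCreate` on the Job, would point to the cluster replacing the pod rather than Dagster.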
But @daniel, why can’t Dagster fail the run only if the job itself fails, and not when its pods are interrupted? Since all the pods running a particular job should have the same configuration, shouldn’t Dagster stay in sync with the job controller, which monitors continuity at the job level?
That sounds like a good improvement for the future, but we don't currently have an integration like that. We do have run-level retries, which you could use to kick off a new job when the pod is interrupted and the run fails.
ok, we do have dagster job-specific retries for now. thanks.