Solaris Wang07/18/2022, 9:33 PM
The error occurs on an irregular pattern. It's been difficult to troubleshoot since a) rerunning the same job succeeds and b) the job/pods look normal and have no log output beyond the above. The only anomaly we could find (attached) is that the job creates 2 pods, which affirms that the 1st pod terminated (with no info when we try to kubectl query for it) and the second one might have been created by the daemon during a heartbeat check on the pod. Further info: helm run_monitor and the run queue coordinator are turned off, the multiprocess executor is used per the default, and the job is already provisioned and running when it errors:
<job xyz> started a new run worker while the run was already in state DagsterRunStatus.STARTED. This most frequently happens when the run worker unexpectedly stops and is restarted by the cluster. Marking the run as failed.
Furthermore, the pod resources are static in that the limits are the same as the initial requests, so it's not likely a resource contention issue. We do trigger the same job via graphql such that multiple copies run in parallel, but each graphql request should be creating different jobs and different pods. @daniel @Paul Swithers @Paz Turson
Started execution of run for "job xyz"
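For anyone comparing deployments: run monitoring is controlled in the Dagster instance config (dagster.yaml). A rough sketch of the relevant section, with field names as documented for Dagster OSS and illustrative values rather than this cluster's actual settings:

```yaml
# Instance-level run monitoring (reported as turned off in this deployment).
# When enabled, the daemon polls run worker pods and can mark a run as
# failed (or attempt a resume) if the worker disappears mid-run.
run_monitoring:
  enabled: false               # off here, per the message above
  start_timeout_seconds: 180   # how long to wait for the worker to start
  poll_interval_seconds: 120   # heartbeat/poll frequency
```

With monitoring off, the "started a new run worker while the run was already in state DagsterRunStatus.STARTED" guard would be tripped by the second worker pod itself at startup, which matches the two-pods-per-job anomaly described above.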
daniel07/20/2022, 10:30 PM
Solaris Wang07/25/2022, 9:23 PM
daniel07/25/2022, 9:24 PM
Solaris Wang07/25/2022, 9:26 PM