Solaris Wang
07/18/2022, 9:33 PM
DagsterExecutionInterruptedError on an irregular pattern:
<job xyz> started a new run worker while the run was already in state DagsterRunStatus.STARTED. This most frequently happens when the run worker unexpectedly stops and is restarted by the cluster. Marking the run as failed.
It’s been difficult to troubleshoot because a) rerunning the same job succeeds, and b) the job/pods look normal and have no log output beyond the above.
The only anomaly we could find is (attached) that the Kubernetes job created 2 pods, which confirms the 1st pod terminated (with no info when we kubectl query for it); the 2nd one might have been created by the daemon during a heartbeat check on the pod.
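(For reference, this is roughly how we went digging into the job's pods; a sketch using the kubernetes Python client, where the namespace and the dagster-run-<run id> job name are placeholders for our real values.)
```
# Sketch: list the pods the run's Kubernetes Job created and dump their
# container termination state. Namespace and job name are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() when run inside the cluster
v1 = client.CoreV1Api()

# The Job controller labels every pod it creates with job-name=<job name>;
# the run launcher names its job dagster-run-<run id>.
pods = v1.list_namespaced_pod(
    namespace="dagster",
    label_selector="job-name=dagster-run-<run id>",
)

for pod in pods.items:
    print(pod.metadata.name, pod.status.phase)
    for cs in pod.status.container_statuses or []:
        # OOMKilled / Error / Completed etc. would show up in terminated.reason
        print("  current state:", cs.state.terminated)
        print("  last state:   ", cs.last_state.terminated)
```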
Further info - helm run_monitor and the run queue coordinator are turned off, and the multiprocess executor is used by default. The job is already provisioned and running when it errors (we see Started execution of run for "job xyz"). The pod resources are static in that the limits equal the initial requests, so it's not likely a resource-contention issue. We do, however, trigger the same job via graphql so that multiple copies run in parallel, but each graphql request should be creating different jobs and different pods.
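(For context, the parallel triggering looks roughly like the sketch below; it uses the dagster_graphql Python client rather than our actual raw graphql call, and the host, repository names, and run_config are placeholders.)
```
# Sketch of firing several copies of the same job in parallel over GraphQL.
# Host, repository names, and run_config are placeholders, not our real values.
from dagster_graphql import DagsterGraphQLClient

client = DagsterGraphQLClient("dagit.internal.example.com", port_number=80)

run_ids = []
for partition in ["a", "b", "c"]:
    run_id = client.submit_job_execution(
        "job_xyz",
        repository_location_name="my_location",
        repository_name="my_repo",
        run_config={"ops": {"some_op": {"config": {"partition": partition}}}},
    )
    # each submission should get its own run id, its own k8s job, and its own pod
    run_ids.append(run_id)

print(run_ids)
```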
@daniel @Paul Swithers @Paz Turson

daniel
07/20/2022, 10:30 PM

Solaris Wang
07/25/2022, 9:23 PM

daniel
07/25/2022, 9:24 PM

Solaris Wang
07/25/2022, 9:26 PM