Harry Tang12/07/2022, 5:05 PM
. Soon I saw
Step worker started for xxx
and caused the job to fail, while the job ran well and finished as I can see proper user code log line after these errors, and see a succeeded k8s job. My guess for the root cause is it “somehow” spawn two pods for one job, and the extra pod died soon after starting which caused the job to fail. Does anyone know whether that makes sense, if so what could be the reason?
Step run_taf_task[20220513_3_8_9_fpool202210_4] failed health check: Discovered failed Kubernetes job dagster-step-2bcdb518810ce5d440cb9124dd47c745 for step run_taf_task[20220513_3_8_9_fpool202210_4].