# ask-community
Hello Dagster, I ran into an issue that I'm not even sure is reproducible, but I'm hoping someone can share some insight. We run everything on k8s. We have a job that spawns N dynamic ops. For one of the dynamic ops I saw two lines of

`Step worker started for xxx`

Soon after, I saw

`Step run_taf_task[20220513_3_8_9_fpool202210_4] failed health check: Discovered failed Kubernetes job dagster-step-2bcdb518810ce5d440cb9124dd47c745 for step run_taf_task[20220513_3_8_9_fpool202210_4].`

which caused the Dagster run to be marked as failed, even though the step itself actually ran fine to completion: I can see the expected user-code log lines after these errors, and the k8s job ended up succeeded. My guess at the root cause is that it "somehow" spawned two pods for one k8s job, and the extra pod died soon after starting, which tripped the health check and failed the run. Does that make sense, and if so, what could be the reason?
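For reference, the shape of the job is roughly this (a minimal sketch: `run_taf_task` is the step name from the logs above, everything else is a placeholder, and I'm assuming the op/job API with the per-step `k8s_job_executor` from `dagster_k8s`):

```python
from dagster import DynamicOut, DynamicOutput, job, op
from dagster_k8s import k8s_job_executor


@op(out=DynamicOut())
def fan_out():
    # Emit N dynamic outputs; each mapping_key becomes the suffix you see
    # in step names like run_taf_task[20220513_3_8_9_fpool202210_4].
    for key in ["20220513_3_8_9_fpool202210_4"]:  # placeholder keys
        yield DynamicOutput(key, mapping_key=key)


@op
def run_taf_task(task_key: str):
    # Placeholder for the real per-task work; with k8s_job_executor each
    # mapped invocation runs in its own Kubernetes job/pod.
    return task_key


@job(executor_def=k8s_job_executor)
def taf_job():
    fan_out().map(run_taf_task)
```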
Also confirmed that two pods were spun up at the same time under that job prefix. One of them died soon after starting, while the other ran for ~1h.
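In case it helps anyone check the same thing, I listed the pods with something like this (a sketch using the official `kubernetes` Python client; the `dagster` namespace is a placeholder, and the job name is the one from the error above):

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

# The Kubernetes Job controller labels its pods with job-name=<job>,
# so every pod spawned for this step's job shows up under one selector.
pods = v1.list_namespaced_pod(
    namespace="dagster",  # placeholder namespace
    label_selector="job-name=dagster-step-2bcdb518810ce5d440cb9124dd47c745",
)
for pod in pods.items:
    print(pod.metadata.name, pod.status.phase, pod.status.start_time)
```

That's how I saw the two pods with the same start time, one of which died almost immediately.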