dagster

Got an error with k8s job creation. Looks like maybe the retry failed because the job actually did get created previously?

Judging from the exception, it looks like it. Weird though… if the previous error was a 500, I would expect the job not to have been created at all. It would be interesting to know which exception triggered the retry.

If there's a way to pull that, I'm glad to.

This is after we added retries to the k8s_job_executor fairly recently right? Do you recall which exception was being hit before we added those retries? Seems like it might still be leaving the job created

I believe it was this one: <https://dagster.slack.com/archives/C014N0PK37E/p1677770137608739?thread_ts=1677770120.520539&cid=C014N0PK37E>

I don't suppose k8s supports an idempotency key or anything like that, that's how i've seen this problem solved before (and how dagster solves it with our own api requests)

without something like that, retrying on 500s may have been a mistake (for this exact reason)

semi-relevant issue <https://github.com/kubernetes/kubernetes/issues/148>

maybe we should just not raise on AlreadyExists exceptions

A 500 error leaves no information about the state of the request, unfortunately. I still think retrying is not a bad idea. Ignoring an AlreadyExists might be a good alternative
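For illustration, here is a rough sketch of the "keep retrying, but don't raise on AlreadyExists" idea. It is not the actual dagster-k8s code: `ApiError` and `create_with_retries` are made-up names, and `create_fn` stands in for whatever call creates the job. With the real Kubernetes Python client you would catch `kubernetes.client.rest.ApiException`, which also exposes a `status` attribute. The sketch assumes the job name is deterministic, so a 409 on a retry most likely means an earlier attempt went through despite the error.

```python
import time


class ApiError(Exception):
    """Hypothetical stand-in for an API client error carrying an HTTP status."""

    def __init__(self, status):
        super().__init__(f"HTTP {status}")
        self.status = status


def create_with_retries(create_fn, max_attempts=3, delay_seconds=0.0):
    """Call create_fn, retrying on 5xx errors.

    A 409 Conflict on a retry is treated as success, on the assumption
    that a previous attempt created the object even though it raised.
    Returns "created" or "already_exists".
    """
    for attempt in range(1, max_attempts + 1):
        try:
            create_fn()
            return "created"
        except ApiError as e:
            if e.status == 409 and attempt > 1:
                # Only swallow the conflict after a failed earlier attempt;
                # a 409 on the first try is a genuine name clash.
                return "already_exists"
            if e.status >= 500 and attempt < max_attempts:
                time.sleep(delay_seconds)
                continue
            raise
```

A 409 on the very first attempt still raises here, since in that case nothing suggests the existing job is ours.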

i think <https://github.com/dagster-io/dagster/pull/14389> would address this

I see this in the release notes, hooray!
> [dagster-k8s] Fixed an issue where the `k8s_job_executor` would sometimes fail with a 409 Conflict error after retrying the creation of a Kubernetes pod for a step, due to the job having already been created during a previous attempt despite raising an error.
Would it require just an agent update to pick this up, or a user code update?