# deployment-kubernetes
m
Got an error with k8s job creation. Looks like maybe the retry failed because the job actually did get created previously?
a
Judging from the exception, it looks like it. Weird though… if the previous error was a 500 I would expect the job not to have been created at all. It would be interesting to know which exception triggered the retry.
m
If there's a way to pull that, I'd be glad to.
d
This is after we added retries to the k8s_job_executor fairly recently, right? Do you recall which exception was being hit before we added those retries? Seems like it might still be leaving the job created.
d
what an unhelpful error
😄 1
I don't suppose k8s supports an idempotency key or anything like that? That's how I've seen this problem solved before (and how Dagster solves it with our own API requests)
🤷🏻 1
without something like that, retrying on 500s may have been a mistake (for this exact reason)
maybe we should just not raise on AlreadyExists exceptions
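As an aside, this is roughly the idempotency-key pattern being referred to above, as offered by some non-Kubernetes APIs; the endpoint, header name, and payload below are purely illustrative, since the Kubernetes API does not support anything like this:

```python
# Generic idempotency-key pattern (illustrative only, not a Kubernetes feature):
# the client generates a key once and reuses it across retries, so a retried
# request that already succeeded server-side is deduplicated instead of
# creating a second resource.
import uuid

import requests

idempotency_key = str(uuid.uuid4())  # generated once, reused for every retry

for attempt in range(3):
    try:
        resp = requests.post(
            "https://api.example.com/v1/jobs",  # hypothetical endpoint
            json={"name": "my-job"},
            headers={"Idempotency-Key": idempotency_key},
            timeout=10,
        )
        resp.raise_for_status()
        break
    except requests.RequestException:
        if attempt == 2:
            raise
```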
a
A 500 error leaves no information about the state of the request, unfortunately. I still think retrying is not a bad idea. Ignoring an AlreadyExists might be a good alternative
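A minimal sketch of what "not raising on AlreadyExists" could look like with the kubernetes Python client; this is not Dagster's actual implementation, and the function name and retry policy here are made up for illustration:

```python
# Sketch: retry job creation on server errors, but treat a 409 Conflict
# (AlreadyExists) as success, since it means a previous attempt actually
# created the job even though the API call raised.
import time

from kubernetes import client
from kubernetes.client.rest import ApiException


def create_job_if_absent(
    batch_api: client.BatchV1Api,
    namespace: str,
    job_body: client.V1Job,
    retries: int = 3,
):
    for attempt in range(retries):
        try:
            return batch_api.create_namespaced_job(namespace=namespace, body=job_body)
        except ApiException as e:
            if e.status == 409:
                # The job already exists, most likely from a prior attempt that
                # failed after the create actually went through; return it.
                return batch_api.read_namespaced_job(
                    name=job_body.metadata.name, namespace=namespace
                )
            if e.status >= 500 and attempt < retries - 1:
                time.sleep(2 ** attempt)  # back off, then retry on server errors
                continue
            raise
```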
m
I see this in the release notes, hooray!
[dagster-k8s] Fixed an issue where the k8s_job_executor would sometimes fail with a 409 Conflict error after retrying the creation of a Kubernetes pod for a step, due to the job having already been created during a previous attempt despite raising an error.
Would it require just an agent update to pick this up, or a user code update?
d
just a user code update
ty thankyou 1