#deployment-kubernetes

Mark Fickett

05/20/2023, 11:01 PM
Got an error with k8s job creation. Looks like maybe the retry failed because the job actually did get created previously?

Andrea Giardini

05/21/2023, 8:33 AM
Judging from the exception, it looks like it. Weird though… if the previous error was a 500, I would expect the job not to be created at all. It would be interesting to know which exception triggered the retry.

Mark Fickett

05/22/2023, 1:13 PM
If there's a way to pull that, I'm glad to.

daniel

05/22/2023, 2:19 PM
This is after we added retries to the k8s_job_executor fairly recently, right? Do you recall which exception was being hit before we added those retries? Seems like the failed request might still be leaving the job created.

daniel

05/22/2023, 2:21 PM
what an unhelpful error
I don't suppose k8s supports an idempotency key or anything like that? That's how I've seen this problem solved before (and how Dagster solves it with our own API requests).
without something like that, retrying on 500s may have been a mistake (for this exact reason)
maybe we should just not raise on AlreadyExists exceptions

Andrea Giardini

05/22/2023, 2:27 PM
A 500 error leaves no information about the state of the request, unfortunately. I still think retrying is not a bad idea. Ignoring an AlreadyExists might be a good alternative
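
A minimal sketch of the "retry, but don't raise on AlreadyExists" idea, for illustration only. It is not necessarily how dagster-k8s implements it: `ApiError` and `create_job_with_retries` are hypothetical names, and `ApiError` is a stand-in for whatever exception the Kubernetes client raises with an HTTP status attached.

```python
import time


class ApiError(Exception):
    """Hypothetical stand-in for a Kubernetes client error carrying an HTTP status."""

    def __init__(self, status):
        super().__init__(f"API error {status}")
        self.status = status


def create_job_with_retries(create_job, max_attempts=3, delay_seconds=0.0):
    """Call create_job(), retrying on 5xx and tolerating a 409 on a retry."""
    for attempt in range(1, max_attempts + 1):
        try:
            create_job()
            return "created"
        except ApiError as err:
            if err.status == 409 and attempt > 1:
                # A previous attempt reported an error but apparently
                # created the job anyway, so treat the conflict as success.
                return "already_exists"
            if err.status >= 500 and attempt < max_attempts:
                # A 5xx says nothing about whether the job was created; retry.
                time.sleep(delay_seconds)
                continue
            raise
```

In this sketch a 409 on the very first attempt is still raised, on the assumption that a conflict with no prior attempt points to a genuine name collision rather than a leftover from a failed request.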

daniel

05/22/2023, 4:43 PM

Mark Fickett

05/25/2023, 7:19 PM
I see this in the release notes, hooray!
[dagster-k8s] Fixed an issue where the k8s_job_executor would sometimes fail with a 409 Conflict error after retrying the creation of a Kubernetes pod for a step, due to the job having already been created during a previous attempt despite raising an error.
Would it require just an agent update to pick this up, or a user code update?

daniel

05/25/2023, 7:19 PM
just a user code update