# deployment-kubernetes
b
Hey smart people, I'm trying to set a retry policy on a Kubernetes op. I'm using `execute_k8s_job` within my op to trigger a Kubernetes job that I've configured and packaged up. The problem with retrying in this case is that the Kubernetes job that failed is not removed before the retry starts, so I get an error about the job already existing:
jobs.batch "43d8df833dbc1fb23ddf91218f914003" already exists
This job is part of this hierarchy of Kubernetes pods/jobs:
dagster-run-12345abcde    # responsible for this Dagster job run instance
└─ dagster-step-9876zywxvu    # responsible for the meta processes for this op
   └─ 43d8df833dbc1fb23ddf91218f914003    # packaged K8s job that is actually executing the op
I assume that, when a "normal" op retries, the `dagster-step-9876zywxvu` job is terminated first or something? Or maybe a new random GUID is generated to prevent name overlap. Whatever the clever workaround is, when executing a `k8s_job`, the workaround is not applied to the child K8s job/pod. So, all retries of a K8s op look like the image below. Anybody have any ideas here on how to avoid this naming conflict? I thought about using `job_metadata` to pass in a manual name for the K8s job, but I am guessing that whatever manual name I use will just get reused too. I can't figure out how to create a name that is unique to a specific try/retry.
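(For context, a rough sketch of the kind of op being described here - the op/job names, image, and command are placeholders rather than the actual code from this thread:)

```python
from dagster import RetryPolicy, job, op
from dagster_k8s import execute_k8s_job


@op(retry_policy=RetryPolicy(max_retries=3))
def run_packaged_java_job(context):
    # Launches a separate Kubernetes job from inside the op. Per the report
    # above, that job keeps the same (hashed) name across retries of this op,
    # which is what produces the "already exists" error.
    execute_k8s_job(
        context,
        image="my-registry/my-java-app:latest",  # placeholder image
        command=["java"],
        args=["-jar", "app.jar"],
    )


@job
def brians_job():
    run_packaged_java_job()
```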
d
Hey Brian - the `job_metadata` thing is a good idea but I think it would just override what you pass in 😕 would adding a `k8s_job_name` param to `execute_k8s_job` help here, and asking you to use a unique suffix or something every time it's called? I don't think Dagster particularly cares what name you use for the k8s job if you're creating it within your own op
b
A `k8s_job_name` would be helpful, yes, but only if I have a function that lets me access the unique suffix. As my code stands, even if I had a `k8s_job_name` parameter, I assume that name would be set in stone when the op is started, i.e. when `dagster-step-98734958` is created, and then each retry would still use that same name. That leaves me with the same problem I have now, except the name would be intelligible instead of a random hash. If both a `k8s_job_name` and a way to access a unique retry suffix were exposed, then I'd be set. Alternatively, a boolean flag like `rename_job_on_retry` would work as well.
d
Believe `context.retry_number` will give you that number
I'm not positive that we are talking about the same thing by 'unique retry suffix' though
b
Oh perfect! Then yes, a `k8s_job_name` would do it for me
d
i.e. i don't think i understand every nuance of what your op does, but we can definitely add a `k8s_job_name` field
b
By unique retry suffix, a simple retry number would do. Just so I could name the job `brians-k8s-job-1` and then, on retry, it becomes `brians-k8s-job-2`
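(A sketch of what that combination could look like - `k8s_job_name` is the parameter being proposed in this thread, not something `execute_k8s_job` accepted at the time, and the image is again a placeholder:)

```python
from dagster import RetryPolicy, op
from dagster_k8s import execute_k8s_job


@op(retry_policy=RetryPolicy(max_retries=3))
def run_packaged_java_job(context):
    # context.retry_number is 0 on the first attempt, 1 on the first retry,
    # and so on, so each attempt launches a Kubernetes job under a fresh name:
    # brians-k8s-job-1, then brians-k8s-job-2, ...
    execute_k8s_job(
        context,
        image="my-registry/my-java-app:latest",  # placeholder image
        k8s_job_name=f"brians-k8s-job-{context.retry_number + 1}",
    )
```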
d
ah ok great - yes that should be fine
b
Do you think the job will actually get renamed on each retry? I could see this as something that gets set only once, when the op is started the first time. We'd have to be sure that the code gets evaluated again on each retry.
d
Dagster doesn't know anything about what happens inside the body of your ops - if you want the job to be renamed, you would need to write code to do so
i might not totally understand the question
but the body of the op is re-executed on each retry
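(A tiny illustration of that point - the whole body, including anything derived from `context.retry_number`, is evaluated again on every attempt:)

```python
from dagster import RetryPolicy, op


@op(retry_policy=RetryPolicy(max_retries=2))
def flaky_op(context):
    # This body runs once per attempt, so retry_number takes a new value each
    # time: 0 on the first attempt, then 1, then 2.
    context.log.info(f"attempt with retry_number={context.retry_number}")
    if context.retry_number < 2:
        raise Exception("failing so the retry policy kicks in")
```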
b
> the body of the op is re-executed on each retry
ok, yeah that makes sense, and i think that means we should be fine. sweet! well i eagerly await the `k8s_job_name` then 🙏
d
er actually maybe we should just make it work this way for everybody (use the retry number in the created job name)
then it would work out of the box for more people
b
love it!
d
side-note: are you sure you need the dagster-step-xxx pod? it doesn't seem like it does very much
or do you have other ops that do other things that aren't launching k8s pods, and you want those to run in k8s pods in general
b
the latter. my job consists mostly of "normal" ops with Python code to do all the work, but one op in my job is just a call to `execute_k8s_job` to trigger some Java code that I've packaged up in a Kubernetes job. for this particular op, i guess `dagster-step` isn't doing anything. but for my other ops `dagster-step` actually does the work
d
makes sense - ideally we'd have a way for you to mix-and-match (every op except for this one runs in a k8s pod, this one can run locally since it just needs to itself spin up a k8s pod)
someday!
ā¤ļø 1
b
that'd be cool! but really the only minor complaint i have about the redundant pods is that the names of the jobs that get spun up - and i guess of all pods spun up by Dagster - don't mention the Dagster job/op names. when you're scrolling through the list of pods and you see this, you have to click each one and look at its metadata to figure out which op/job it is.
it'd be awesome if i had
dagster-run-brians-job-12345abcde
└─ dagster-step-first-op-9876zywxvu
   └─ first-op-k8s-job-43d8df833dbc1fb23ddf91218f914003
d
yeah those names are bad
we should fix that too
(our names i mean)
b
haha well one step at a time. for now i'll wait for the change to update the k8s job name with each retry. let me know when you've got an issue/PR for that!
d
https://github.com/dagster-io/dagster/pull/11753 should squash this, thanks for the report