# deployment-kubernetes
b
Hey smart people, I'm trying to set a retry policy on a Kubernetes op. I'm using `execute_k8s_job` within my op to trigger a Kubernetes job that I've configured and packaged up. The problem with retrying in this case is that the Kubernetes job that failed is not removed before the retry starts, so I get an error about the job already existing:
jobs.batch "43d8df833dbc1fb23ddf91218f914003" already exists
This job is part of this hierarchy of Kubernetes pods/jobs:
dagster-run-12345abcde    # responsible for this Dagster job run instance
└─ dagster-step-9876zywxvu    # responsible for the meta processes for this op
   └─ 43d8df833dbc1fb23ddf91218f914003    # packaged K8s job that is actually executing the op
I assume that, when a "normal" op retries, the `dagster-step-9876zywxvu` job is terminated first or something? Or maybe a new random GUID is generated to prevent name overlap. Whatever the clever workaround is, when executing a `k8s_job`, the workaround is not applied to the child K8s job/pod. So, all retries of a K8s op look like the image below. Anybody have any ideas here on how to avoid this naming conflict? I thought about using `job_metadata` to pass in a manual name for the K8s job, but I am guessing that whatever manual name I use will just get reused too. I can't figure out how to create a name that is unique to a specific try/retry.
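(For context, a rough sketch of the kind of op being described here - the op/job names, image, and command are placeholders rather than the actual code from this thread:)

```python
from dagster import RetryPolicy, job, op
from dagster_k8s import execute_k8s_job


@op(retry_policy=RetryPolicy(max_retries=3))
def run_packaged_java_job(context):
    # Launches a separate Kubernetes job from inside the op. Per the report
    # above, that job keeps the same (hashed) name across retries of this op,
    # which is what produces the "already exists" error.
    execute_k8s_job(
        context,
        image="my-registry/my-java-app:latest",  # placeholder image
        command=["java"],
        args=["-jar", "app.jar"],
    )


@job
def brians_job():
    run_packaged_java_job()
```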
d
Hey Brian - the `job_metadata` thing is a good idea but I think it would just override what you pass in 😕 would adding a `k8s_job_name` param to `execute_k8s_job` help here, and asking you to use a unique suffix or something every time it's called? I don't think Dagster particularly cares what name you use for the k8s job if you're creating it within your own op
b
A `k8s_job_name` would be helpful, yes, but only if I have a function that lets me access the unique suffix. As my code stands, even if I had a `k8s_job_name` parameter, I assume that name would be set in stone when the op is started, i.e. when `dagster-step-98734958` is created, and then each retry would still use that same name. That leaves me with the same problem I have now, except the name would be intelligible instead of a random hash. If both a `k8s_job_name` and a way to access a unique retry suffix were exposed, then I'd be set. Alternatively, a boolean flag like `rename_job_on_retry` would work as well.
d
Believe `context.retry_number` will give you that number
I'm not positive that we are talking about the same thing by 'unique retry suffix' though
b
Oh perfect! Then yes, a `k8s_job_name` would do it for me
d
i.e. i don't think i understand every nuance of what your op does, but we can definitely add a `k8s_job_name` field
b
By unique retry suffix, a simple retry number would do. Just so I could name the job `brians-k8s-job-1` and then, on retry, it becomes `brians-k8s-job-2`
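(A sketch of what that combination could look like - `k8s_job_name` is the parameter being proposed in this thread, not something `execute_k8s_job` accepted at the time, and the image is again a placeholder:)

```python
from dagster import RetryPolicy, op
from dagster_k8s import execute_k8s_job


@op(retry_policy=RetryPolicy(max_retries=3))
def run_packaged_java_job(context):
    # context.retry_number is 0 on the first attempt, 1 on the first retry,
    # and so on, so each attempt launches a Kubernetes job under a fresh name:
    # brians-k8s-job-1, then brians-k8s-job-2, ...
    execute_k8s_job(
        context,
        image="my-registry/my-java-app:latest",  # placeholder image
        k8s_job_name=f"brians-k8s-job-{context.retry_number + 1}",
    )
```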
d
ah ok great - yes that should be fine
b
Do you think the job will actually get renamed on each retry? I could see this as something that gets set only once, when the op is started the first time. We'd have to be sure that the code gets evaluated again on each retry.
d
Dagster doesn't know anything about what happens inside the body of your ops - if you want the job to be renamed, you would need to write code to do so
i might not totally understand the question
but the body of the op is re-executed on each retry
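(A tiny illustration of that point - the whole body, including anything derived from `context.retry_number`, is evaluated again on every attempt:)

```python
from dagster import RetryPolicy, op


@op(retry_policy=RetryPolicy(max_retries=2))
def flaky_op(context):
    # This body runs once per attempt, so retry_number takes a new value each
    # time: 0 on the first attempt, then 1, then 2.
    context.log.info(f"attempt with retry_number={context.retry_number}")
    if context.retry_number < 2:
        raise Exception("failing so the retry policy kicks in")
```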
b
> the body of the op is re-executed on each retry
ok, yeah that makes sense, and i think that means we should be fine. sweet! well i eagerly await the `k8s_job_name` then 🙏
d
er actually maybe we should just make it work this way for everybody (use the retry number in the created job name)
then it would work out of the box for more people
b
love it!
d
side-note: are you sure you need the dagster-step-xxx pod? it doesn't seem like it does very much
or do you have other ops that do other things that aren't launching k8s pods, and you want those to run in k8s pods in general
b
the latter. my job consists mostly of "normal" ops with Python code to do all the work, but one op in my job is just a call to `execute_k8s_job` to trigger some Java code that I've packaged up in a Kubernetes job. for this particular op, i guess `dagster-step` isn't doing anything. but for my other ops `dagster-step` actually does the work
d
makes sense - ideally we'd have a way for you to mix-and-match (every op except for this one runs in a k8s pod, this one can run locally since it just needs to itself spin up a k8s pod)
someday!
ā¤ļø 1
b
that'd be cool! but really the only minor complaint i have about the redundant pods is that the names of the jobs that get spun up - and i guess of all pods spun up by Dagster - don't mention the Dagster job/op names. when you're scrolling through the list of pods and you see this, you have to click each one and look at its metadata to figure out which op/job it is.
it'd be awesome if i had
dagster-run-brians-job-12345abcde
└─ dagster-step-first-op-9876zywxvu
   └─ first-op-k8s-job-43d8df833dbc1fb23ddf91218f914003
d
yeah those names are bad
we should fix that too
(our names i mean)
b
haha well one step at a time. for now i'll wait for the change to update the k8s job name with each retry. let me know when you've got an issue/PR for that!
d
https://github.com/dagster-io/dagster/pull/11753 should squash this, thanks for the report