Brian Pohl
01/17/2023, 7:55 PM
I'm using execute_k8s_job within my op to trigger a Kubernetes job that I've configured and packaged up. The problem with retrying in this case is that the Kubernetes job that failed is not removed before the retry starts, so I get an error about the job already existing:
jobs.batch "43d8df833dbc1fb23ddf91218f914003" already exists
This job is part of this hierarchy of Kubernetes pods/jobs:
dagster-run-12345abcde # responsible for this Dagster job run instance
└─ dagster-step-9876zywxvu # responsible for the meta processes for this op
└─ 43d8df833dbc1fb23ddf91218f914003 # packaged K8s job that is actually executing the op
I assume that, when a "normal" op retries, the dagster-step-9876zywxvu job is terminated first or something? Or maybe a new random GUID is generated to prevent a name overlap. Whatever the clever workaround is, it is not applied to the child K8s job/pod when executing a k8s_job.
So all retries of a K8s op look like the image below.
Anybody have any ideas on how to avoid this naming conflict? I thought about using job_metadata to pass in a manual name for the K8s job, but I am guessing that whatever manual name I use would just get reused too. I can't figure out how to create a name that is unique to a specific try/retry.

daniel
01/17/2023, 8:04 PM
The job_metadata thing is a good idea, but I think it would just override what you pass in 😕 Would adding a k8s_job_name param to execute_k8s_job help here, asking you to use a unique suffix or something every time it's called? I don't think Dagster particularly cares what name you use for the k8s job if you're creating it within your own op.

Brian Pohl
01/17/2023, 8:06 PM
A k8s_job_name param would be helpful, yes, but only if I have a function that lets me access the unique suffix. As my code stands, even if I had a k8s_job_name parameter, I assume that name would be set in stone when the op is started, i.e. when dagster-step-98734958 is created, and then each retry would still use that same name. That leaves me with the same problem I have now, except the name would be intelligible instead of a random hash.
If both a k8s_job_name param and a way to access a unique retry suffix were exposed, then I'd be set. Alternatively, a boolean flag like rename_job_on_retry would work as well.

daniel
01/17/2023, 8:08 PM

Brian Pohl
01/17/2023, 8:09 PM
A k8s_job_name would do it for me.

daniel
01/17/2023, 8:10 PM

Brian Pohl
01/17/2023, 8:10 PM
brians-k8s-job-1, and then, on retry, it becomes brians-k8s-job-2
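That naming scheme could be sketched as a small helper. This is illustrative only: the per-attempt retry counter (modeled on Dagster's context.retry_number, which is 0 on the first attempt) and the helper name are assumptions, not an existing API.

```python
def k8s_job_name_for_attempt(base_name: str, retry_number: int) -> str:
    """Build a per-attempt Kubernetes job name so retries never collide.

    retry_number is assumed to be 0 on the first attempt, 1 on the first
    retry, and so on (mirroring a counter like Dagster's context.retry_number).
    """
    # Kubernetes object names must be DNS-1123 compliant, so keep the base
    # lowercase-with-dashes and just append the 1-based attempt number.
    return f"{base_name}-{retry_number + 1}"

print(k8s_job_name_for_attempt("brians-k8s-job", 0))  # brians-k8s-job-1
print(k8s_job_name_for_attempt("brians-k8s-job", 1))  # brians-k8s-job-2
```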
daniel
01/17/2023, 8:10 PM

Brian Pohl
01/17/2023, 8:11 PM

daniel
01/17/2023, 8:13 PM

Brian Pohl
01/17/2023, 8:15 PM
"the body of the op is re-executed on each retry"
ok, yeah that makes sense, and i think that means we should be fine. sweet! well i eagerly await the k8s_job_name then 🙏

daniel
01/17/2023, 8:16 PM

Brian Pohl
01/17/2023, 8:20 PM

daniel
01/17/2023, 8:25 PM

Brian Pohl
01/17/2023, 8:27 PM
I'm using execute_k8s_job to trigger some Java code that I've packaged up in a Kubernetes job.
for this particular op, i guess dagster-step isn't doing anything. but for my other ops, dagster-step actually does the work

daniel
01/17/2023, 8:28 PM

Brian Pohl
01/17/2023, 8:34 PM
dagster-run-brians-job-12345abcde
└─ dagster-step-first-op-9876zywxvu
└─ first-op-k8s-job-43d8df833dbc1fb23ddf91218f914003
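The collision at the bottom of that hierarchy can be reproduced with a toy model of the jobs API. FakeCluster is purely illustrative (not a real Kubernetes client); it only mimics the "already exists" behavior described above.

```python
class FakeCluster:
    """Toy stand-in for the Kubernetes jobs API (illustration only)."""

    def __init__(self):
        self.jobs = set()

    def create_job(self, name: str) -> None:
        # Mirrors the 'jobs.batch "..." already exists' error from the thread
        if name in self.jobs:
            raise RuntimeError(f'jobs.batch "{name}" already exists')
        self.jobs.add(name)

cluster = FakeCluster()
cluster.create_job("first-op-k8s-job")

try:
    # A retry that reuses the exact same job name fails, as described above
    cluster.create_job("first-op-k8s-job")
except RuntimeError as err:
    print(err)  # jobs.batch "first-op-k8s-job" already exists

# Per-attempt suffixes sidestep the collision entirely
for attempt in range(3):
    cluster.create_job(f"first-op-k8s-job-{attempt}")
```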
daniel
01/17/2023, 8:50 PM

Brian Pohl
01/17/2023, 8:51 PM

daniel
01/17/2023, 9:42 PM
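If something like the requested k8s_job_name parameter were exposed, the unique name could be composed from the op name, run id, and retry number. A hedged sketch follows: all inputs and the function name are hypothetical (in Dagster they would come from the op context), and names are trimmed to 63 characters to respect the DNS-1123 label limit that Kubernetes job names effectively need to satisfy.

```python
import re

def unique_k8s_job_name(op_name: str, run_id: str, retry_number: int) -> str:
    """Compose a unique, DNS-1123-friendly job name per op attempt.

    op_name / run_id / retry_number are illustrative inputs; a real
    implementation would also take care to keep the retry suffix intact
    when truncating very long names.
    """
    raw = f"{op_name}-k8s-job-{run_id}-{retry_number}".lower()
    # Replace anything outside [a-z0-9-], then trim to the 63-char label limit
    name = re.sub(r"[^a-z0-9-]", "-", raw).strip("-")
    return name[:63]

print(unique_k8s_job_name("first_op", "12345abcde", 0))
# first-op-k8s-job-12345abcde-0
```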