# ask-community
r
Hey, we encountered a weird behavior with `RetryRequested` using `K8sRunLauncher` and `K8sExecutor`. We raise a `RetryRequested` during our jobs due to known limitations (AWS quotas, for example). We get a `STEP_UP_FOR_RETRY` event, but Dagit (and our K8s logs) doesn’t show any sign of actually executing a new pod for that task. Usually, we get an `ENGINE_EVENT` of `Executing step <step_name> in Kubernetes job …`, but nothing for the retried steps. The run always fails with
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
with the following message
jobs.batch \"dagster-step-<something>-1\" not found
It worked well when we used the default run launcher with static Celery executors. (By the way, we’re using Dagster 0.14.17.) Thanks.
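For readers following along, the pattern described in this message can be sketched roughly as below. This is a minimal illustration, not the poster's actual code: `QuotaExceededError` and `run_or_request_retry` are hypothetical names, and the guarded import only exists so the sketch also runs where Dagster is not installed (the stand-in mirrors the `max_retries` and `seconds_to_wait` arguments of `dagster.RetryRequested`).

```python
# Sketch only: QuotaExceededError and run_or_request_retry are hypothetical.
try:
    from dagster import RetryRequested
except ImportError:  # fallback so the sketch runs without Dagster installed
    class RetryRequested(Exception):
        """Stand-in mirroring the constructor of dagster.RetryRequested."""

        def __init__(self, max_retries=1, seconds_to_wait=None):
            super().__init__()
            self.max_retries = max_retries
            self.seconds_to_wait = seconds_to_wait


class QuotaExceededError(Exception):
    """Hypothetical error for a known, transient limit such as an AWS quota."""


def run_or_request_retry(action, max_retries=3, seconds_to_wait=60):
    """Run `action`; convert a quota error into a Dagster retry request."""
    try:
        return action()
    except QuotaExceededError as exc:
        # Ask the executor to re-run the step instead of failing the run.
        raise RetryRequested(
            max_retries=max_retries, seconds_to_wait=seconds_to_wait
        ) from exc
```

Inside an op body this would be called as something like `run_or_request_retry(lambda: start_work())`, where `start_work` is whatever call hits the quota. With the K8s executor, each retry would normally be expected to show up as a fresh `Executing step … in Kubernetes job …` event, which is exactly what was missing here.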
j
@Dagster Bot issue Investigate RetryRequests with k8s_job_executor
d
r
Thanks, @johann. Let me know if I can help somehow.
v
@Roei Jacobovich Hello! Have you found a solution to this problem? I have the same.
r
Hi @Vladislav Khokhlov, unfortunately not. We’re using some sort of `backoff` inside the ops instead. It works, but we keep some compute running during the backoff time. @johann it would be great if someone could take a look at the issue and the bug 🙏 thanks
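The in-op backoff workaround mentioned above could look roughly like the following. This is a sketch under assumptions, not the poster's implementation: `call_with_backoff` and all of its parameters are hypothetical, and the delays are plain exponential backoff with a cap.

```python
import time


def call_with_backoff(action, retryable=(Exception,), attempts=5,
                      base_delay=1.0, max_delay=60.0, sleep=time.sleep):
    """Call `action`, sleeping with capped exponential backoff between failures.

    Hypothetical helper: the retry happens inside the running step, so the
    step's pod stays allocated while it sleeps.
    """
    for attempt in range(attempts):
        try:
            return action()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: let the step fail normally
            # Delay doubles each attempt: base, 2*base, 4*base, ... up to max_delay.
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

Because the waiting happens inside the step rather than between step executions, this avoids the executor-level retry path entirely, at the cost of the idle compute noted in the message.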
h
@johann @Reich Canlas Hi, we are facing the same issue. Our teammate did some investigation and commented on the GitHub issue, though we are still not sure how to fix it. Can you take a look at it?
j
Sorry for the delayed response here. Is it possible to upgrade your Dagster version? There was a bug in k8s retries that manifested like this; let me check which version.
h
@johann We are using 1.0.7 (core) / 0.16.7 (libraries) and getting errors.
@johann Are 1.0.7 (core) / 0.16.7 (libraries) affected by the “bug in k8s retries”?
Do we have a link to an issue or a PR?
j
In 1.0.7/0.16.7 this would have to be a different bug. I’ll take a look
h
@johann Thank you for taking a look at the issue 🙇 Is there any update on it?
j
Apologies! This slipped through the cracks, thanks for pinging. I’ll aim to get a fix out in the release next week
Slipped through this week. I’ll let you know when I get a diff up next week
The fix is up, but likely will be in next week’s release. https://github.com/dagster-io/dagster/pull/10458
I spoke too soon: the fix went out in 1.0.17 today. Let me know when you get a chance to verify that it’s working for you.