# deployment-kubernetes
k
Hey all, I've been having some trouble with retries: for whatever reason, jobs are not being found after a retry op is run. A retry is triggered for something like our
dbt_cloud_run_op
and then it ultimately fails when it tries to retry a second time with something like:
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'X', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'X', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'X', 'Date': 'Sun, 03 Apr 2022 14:31:49 GMT', 'Content-Length': '284'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch \"dagster-step-fcb024c52f02ea006fb2b73294153771-2\" not found","reason":"NotFound","details":{"name":"dagster-step-fcb024c52f02ea006fb2b73294153771-2","group":"batch","kind":"jobs"},"code":404}
It always happens if a job fails twice but is set to retry a few times. In this case it had a job for
dagster-step-fcb024c52f02ea006fb2b73294153771
and
dagster-step-fcb024c52f02ea006fb2b73294153771-1
. Any idea why it couldn't retry more than once?
As an add-on: does the original job have to still exist for this to work? What happens if
dagster-step
and
dagster-step-1
are deleted while
dagster-step-2
is triggered to run?
j
Thanks for the report, what dagster version are you on?
And the prior k8s steps should not need to still exist
k
0.14.6
👍 I do have a job that cleans up completed jobs and pods on the 15 min mark every hour, so I was wondering if that contributed
j
Are you using k8s_job_executor or celery_k8s_job_executor?
k
k8s_job_executor
j
Is there any stack trace for the error?
k
  File "/usr/local/lib/python3.7/site-packages/dagster/core/execution/api.py", line 785, in pipeline_execution_iterator
    for event in pipeline_context.executor.execute(pipeline_context, execution_plan):
  File "/usr/local/lib/python3.7/site-packages/dagster/core/executor/step_delegating/step_delegating_executor.py", line 217, in execute
    plan_context, [step], active_execution
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/executor.py", line 230, in check_step_health
    job = self._batch_api.read_namespaced_job(namespace=self._job_namespace, name=job_name)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api/batch_v1_api.py", line 2657, in read_namespaced_job
    return self.read_namespaced_job_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api/batch_v1_api.py", line 2758, in read_namespaced_job_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    headers=headers)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 244, in GET
    query_params=query_params)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 234, in request
    raise ApiException(http_resp=r)
j
Ah, that actually does point to it being related to the job cleanup. I'd be curious whether you still see the issue if you suspend that cleanup or increase its tolerance.
The k8s_job_executor looks up the K8s Job to check that the step is healthy, because it hasn't yet processed the event saying the step already finished. If the cleanup has deleted the Job by then, the lookup returns a 404. We should be able to optimize this to skip the check in that case.
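To illustrate the "increase the tolerance" idea, here is a minimal sketch of a grace-period check that a cleanup job could apply before deleting anything. The function name `is_safe_to_delete`, the `finished_at` argument, and the one-hour default are all hypothetical; how `finished_at` is read off a K8s Job is an assumption noted in the comments, not something taken from the cleanup job discussed here.

```python
from datetime import datetime, timedelta, timezone


def is_safe_to_delete(finished_at, now, grace=timedelta(hours=1)):
    """Return True only if the Job finished at least `grace` ago.

    `finished_at` is a hypothetical input: for a K8s Job it could come from
    status.completion_time (successful Jobs) or from the last transition time
    of a "Failed" condition. That mapping is an assumption of this sketch.
    """
    if finished_at is None:
        # No finish timestamp: the Job may still be running, so keep it.
        return False
    return now - finished_at >= grace


# Example: a Job that finished 10 minutes ago is kept, one from 2 hours ago is not.
now = datetime(2022, 4, 3, 15, 0, tzinfo=timezone.utc)
recent = is_safe_to_delete(now - timedelta(minutes=10), now)
old = is_safe_to_delete(now - timedelta(hours=2), now)
```

A grace period like this only narrows the window in which the executor's health check can miss a Job. It does not remove the race entirely.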
k
Ah interesting - yeah I have a general job to do it on the hour mark. But I was considering doing it as a post job hook, plus I could ignore failed steps too.
Do you know if there's an easy way to get the k8s pods and jobs that were created for a specific dagster run? I'm trying to move away from my pipelines that delete these things by having a sensor run a cleanup task for all the resources of a completed job.
j
As of 0.14.6, we added a
dagster/run-id
label to run workers and step workers https://github.com/dagster-io/dagster/pull/7167
Previously the run id was only present in the K8s Job names
Either way, it should be feasible to put together a k8s API query that would find them all. Alternatively, the K8s Job names are logged in event log metadata. But since the intent is to clean up the k8s API server, I think it makes the most sense to query the API server itself to find the jobs.
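A rough sketch of what such a label-based query might look like with the official Kubernetes Python client. The helper names (`run_label_selector`, `finished_job_names`, `delete_jobs`) are hypothetical, and `batch_api` is assumed to be a `kubernetes.client.BatchV1Api` instance created after loading cluster config. The "finished" test below is a simplification, not how Dagster itself decides a step is done.

```python
RUN_ID_LABEL = "dagster/run-id"  # label added to run and step workers as of 0.14.6


def run_label_selector(run_id):
    """Build a label selector matching every K8s object created for one run."""
    return f"{RUN_ID_LABEL}={run_id}"


def finished_job_names(batch_api, namespace, run_id):
    """List names of Jobs for `run_id` with no active pods and a recorded outcome.

    `batch_api` is assumed to behave like kubernetes.client.BatchV1Api.
    """
    jobs = batch_api.list_namespaced_job(
        namespace, label_selector=run_label_selector(run_id)
    )
    names = []
    for job in jobs.items:
        status = job.status
        # Simplified check: nothing running, and at least one success or failure.
        if not status.active and (status.succeeded or status.failed):
            names.append(job.metadata.name)
    return names


def delete_jobs(batch_api, namespace, names):
    """Delete the given Jobs; background propagation also removes their pods."""
    for name in names:
        batch_api.delete_namespaced_job(
            name, namespace, propagation_policy="Background"
        )
```

In a sensor or cleanup op this would be called with `batch_api = kubernetes.client.BatchV1Api()` after `kubernetes.config.load_incluster_config()`, passing the run's namespace and run id. The same selector should also work with `CoreV1Api.list_namespaced_pod` if the pods need to be inspected directly.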
k
Ah that's perfect! I can definitely use those labels, we just updated to 0.14.6. Thanks!