Hey all I ve been having some trouble with retries and for w dagster #deployment-kubernetes

Kirk Stennett

04/04/2022, 6:07 PM

Hey all, I've been having some trouble with retries: for whatever reason, jobs aren't being found after an op retry runs. A retry is triggered for something like our dbt_cloud_run_op, and then it ultimately fails when it tries to retry a second time with something like:

kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'X', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'X', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'X', 'Date': 'Sun, 03 Apr 2022 14:31:49 GMT', 'Content-Length': '284'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch \"dagster-step-fcb024c52f02ea006fb2b73294153771-2\" not found","reason":"NotFound","details":{"name":"dagster-step-fcb024c52f02ea006fb2b73294153771-2","group":"batch","kind":"jobs"},"code":404}

It always happens if a job fails twice but is set to retry a few times. In this case it had a job for dagster-step-fcb024c52f02ea006fb2b73294153771 and dagster-step-fcb024c52f02ea006fb2b73294153771-1. Any idea why it couldn't retry more than once?

Kirk Stennett

04/04/2022, 6:39 PM

As an add-on - does the original job have to still exist for this to work? What happens if dagster-step and dagster-step-1 are deleted while dagster-step-2 is triggered to run?

johann

04/04/2022, 6:48 PM

Thanks for the report, what dagster version are you on?

johann

04/04/2022, 6:49 PM

And the prior k8s steps should not need to still exist

Kirk Stennett

04/04/2022, 8:05 PM

0.14.6

Kirk Stennett

04/04/2022, 8:05 PM

👍 I do have a job that cleans up completed jobs and pods on the 15 min mark every hour, so I was wondering if that contributed

johann

04/04/2022, 8:13 PM

Are you using k8s_job_executor or celery_k8s_job_executor?

Kirk Stennett

04/04/2022, 8:25 PM

k8s_job_executor

johann

04/04/2022, 8:48 PM

Is there any stack trace for the error?

Kirk Stennett

04/04/2022, 9:02 PM

  File "/usr/local/lib/python3.7/site-packages/dagster/core/execution/api.py", line 785, in pipeline_execution_iterator
    for event in pipeline_context.executor.execute(pipeline_context, execution_plan):
  File "/usr/local/lib/python3.7/site-packages/dagster/core/executor/step_delegating/step_delegating_executor.py", line 217, in execute
    plan_context, [step], active_execution
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/executor.py", line 230, in check_step_health
    job = self._batch_api.read_namespaced_job(namespace=self._job_namespace, name=job_name)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api/batch_v1_api.py", line 2657, in read_namespaced_job
    return self.read_namespaced_job_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api/batch_v1_api.py", line 2758, in read_namespaced_job_with_http_info
    collection_formats=collection_formats)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 353, in call_api
    _preload_content, _request_timeout, _host)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 184, in __call_api
    _request_timeout=_request_timeout)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/api_client.py", line 377, in request
    headers=headers)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 244, in GET
    query_params=query_params)
  File "/usr/local/lib/python3.7/site-packages/kubernetes/client/rest.py", line 234, in request
    raise ApiException(http_resp=r)

johann

04/04/2022, 9:15 PM

Ah, that actually does point to it being related to the job cleanup. I'd be curious whether you still see the issue if you suspend the cleanup or increase its tolerance
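
One way to read "increase the tolerance" is to have the cleanup skip Jobs that finished only recently, so the executor has time to notice a step is done before its Job disappears. The sketch below is a minimal illustration of that idea, not part of Dagster or of the cleanup job described in this thread. The helper names (finished_at, old_enough_to_delete) and the one-hour threshold are assumptions; the job objects are expected to look like the V1Job objects returned by the kubernetes Python client.

```python
from datetime import datetime, timedelta, timezone


def finished_at(job):
    """Return when the Job reached Complete or Failed, or None if it is still active.

    Hypothetical helper: reads the Job's status conditions, as exposed on
    V1Job objects by the kubernetes Python client.
    """
    for cond in (job.status.conditions or []):
        if cond.type in ("Complete", "Failed") and cond.status == "True":
            return cond.last_transition_time
    return None


def old_enough_to_delete(job, min_age=timedelta(hours=1), now=None):
    """True only if the Job has finished and did so at least min_age ago.

    The one-hour default is an arbitrary assumption; pick a value larger
    than the longest gap you expect between retries of a step.
    """
    now = now or datetime.now(timezone.utc)
    done = finished_at(job)
    return done is not None and now - done >= min_age
```

A cleanup script could apply old_enough_to_delete as a filter before deleting anything, which leaves active Jobs and freshly finished Jobs untouched.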

johann

04/04/2022, 9:16 PM

The k8s_job_executor is looking for the job to check that it’s healthy since it hasn’t processed that the step finished already. We should be able to optimize this to not check in that case

Kirk Stennett

04/04/2022, 10:41 PM

Ah interesting - yeah I have a general job to do it on the hour mark. But I was considering doing it as a post job hook, plus I could ignore failed steps too.

Kirk Stennett

04/06/2022, 12:14 AM

Do you know if there's an easy way to get the k8s pods and jobs that were created for a specific dagster run? I'm attempting to circumvent my pipelines that delete these things by having a sensor that runs a cleanup task for all the resources of a completed job

johann

04/06/2022, 1:14 AM

As of 0.14.6, we added a dagster/run-id label to run workers and step workers: https://github.com/dagster-io/dagster/pull/7167

johann

04/06/2022, 1:15 AM

Previously the run id was only present in the K8s Job names

johann

04/06/2022, 1:17 AM

Either way, it should be feasible to put together a k8s API query that would find them all. Alternatively, the K8s Job names are logged in event log metadata. But since the intent is to clean up the k8s API server, I think it makes the most sense to query the API server itself to find the jobs
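
A query like the one described above could be sketched with the kubernetes Python client and a label selector on dagster/run-id. This is only an illustration under assumptions: the function names, the dry_run flag, and the "dagster" namespace in the usage comment are hypothetical, and the caller is assumed to pass in a BatchV1Api instance and to call this only for runs that have already finished.

```python
RUN_ID_LABEL = "dagster/run-id"  # label name taken from the message above


def run_label_selector(run_id):
    """Build an equality-based label selector that matches one Dagster run."""
    return f"{RUN_ID_LABEL}={run_id}"


def delete_jobs_for_run(batch_api, namespace, run_id, dry_run=True):
    """List (and optionally delete) the K8s Jobs labelled with the given run id.

    batch_api is expected to be a kubernetes.client.BatchV1Api instance.
    Returns the names of the matching Jobs. With dry_run=True nothing is deleted.
    """
    jobs = batch_api.list_namespaced_job(
        namespace=namespace,
        label_selector=run_label_selector(run_id),
    )
    names = [job.metadata.name for job in jobs.items]
    if not dry_run:
        for name in names:
            # Background propagation lets Kubernetes garbage-collect the
            # Job's pods after the Job object itself is removed.
            batch_api.delete_namespaced_job(
                name=name,
                namespace=namespace,
                propagation_policy="Background",
            )
    return names


# Usage sketch (assumes in-cluster credentials and a "dagster" namespace):
#   from kubernetes import client, config
#   config.load_incluster_config()
#   delete_jobs_for_run(client.BatchV1Api(), "dagster", "<run id>", dry_run=False)
```

Scoping the deletion to a single finished run avoids the problem from earlier in the thread, where a time-based sweep removed Jobs that the executor was still checking.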

Kirk Stennett

04/06/2022, 3:24 PM

Ah that's perfect! I can definitely use those labels, we just updated to 0.14.6. Thanks!
