# ask-community
Hey there! I am running some jobs and keep seeing this error: Detected run worker status UNKNOWN: 'dagster_k8s.client.DagsterK8sUnrecoverableAPIError: Unexpected error encountered in Kubernetes API Client.
Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/launcher.py", line 379, in check_run_worker_health
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/client.py", line 366, in get_job_status
    return k8s_api_retry(_get_job_status, max_retries=3, timeout=wait_time_between_attempts)
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/client.py", line 124, in k8s_api_retry
    ) from e

The above exception was caused by the following exception:
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'a65413c7-fd00-46ef-94e2-e056f1261a2e', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '802301fe-53ea-4d4d-aeb0-efd3891f18ac', 'X-Kubernetes-Pf-Prioritylevel-Uid': '2952e8a6-fb75-43dc-b244-b39008d24b04', 'Date': 'Mon, 03 Apr 2023 23:23:27 GMT', 'Content-Length': '290'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch \"dagster-run-52fc0109-804d-441f-b1b9-c077f14a4005-1\" not found","reason":"NotFound","details":{"name":"dagster-run-52fc0109-804d-441f-b1b9-c077f14a4005-1","group":"batch","kind":"jobs"},"code":404}
Hi Pablo - does that kubernetes job exist in your cluster if you run 'kubectl get jobs'? That error is saying that it can't find the kubernetes job that was created when the run was launched
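A hedged sketch of how one might check (the job name is taken from the 404 error above; the `dagster` namespace is an assumption - use whatever namespace your runs launch into):

```shell
# List the Dagster run jobs in the namespace where runs are launched
kubectl get jobs -n dagster

# Check for the specific job named in the 404 error
kubectl get job dagster-run-52fc0109-804d-441f-b1b9-c077f14a4005-1 -n dagster

# If it exists, describe it to see its status and recent events
kubectl describe job dagster-run-52fc0109-804d-441f-b1b9-c077f14a4005-1 -n dagster
```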
So the job actually fully runs as far as I can tell and it seems that this error happens at the end of it.
any chance you could share a debug file from a run where you saw this? using this dropdown
is your k8s cluster possibly set up to remove k8s jobs very soon after they finish? If the job ttl is quite low i wonder if there could be a race condition
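One way to check for an aggressive TTL (a sketch; namespace is an assumption): Kubernetes auto-deletes finished Jobs that have `ttlSecondsAfterFinished` set, so a low value here could explain the job disappearing.

```shell
# Print each job's name alongside its TTL-after-finished setting;
# an empty second column means no TTL is configured for that job
kubectl get jobs -n dagster -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.ttlSecondsAfterFinished}{"\n"}{end}'
```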
I will check on the TTL. I sent you a DM with the debug file.
ok, i took a look at the debug file you sent me - i see the error you posted happening in the middle of the run, so it does seem like something in your k8s cluster is removing the k8s jobs mid-run. I also see some indication that the run pod was interrupted partway through by the cluster, which typically happens during things like autoscaling, when the cluster adds or removes a node in the middle of a run.

This isn't an error that I've seen before, so it does seem unique to your cluster configuration in some way. It's hard to provide more useful information without more access to your cluster, but my best guess is that there are significant changes happening to your k8s cluster in the middle of the run - if there's a way to get more insight into the k8s events going on during these runs, it might help show where all this instability is coming from. Dagster can provide observability when something like the k8s job disappearing happens, but it isn't going to be great at finding the root cause of something external like that.
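One hedged way to get that insight from cluster events (a sketch; the namespace and the object name, taken from the error above, are assumptions):

```shell
# Recent events across the cluster, sorted by time - look for node scale-downs,
# evictions, or job deletions around the time the run failed
kubectl get events -A --sort-by=.lastTimestamp | tail -50

# Events scoped to the specific run job named in the error
kubectl get events -n dagster \
  --field-selector involvedObject.name=dagster-run-52fc0109-804d-441f-b1b9-c077f14a4005-1
```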
Ok I will look into what could be causing that. Another weird issue I have been seeing is that when this occurs the job is 'resumed' which causes duplicate runs. I have this in my values.yaml:
    runMonitoring:
      enabled: true
      # Timeout for runs to start (avoids runs hanging in STARTED)
      startTimeoutSeconds: 180
      # How often to check on in-progress runs
      pollIntervalSeconds: 120
      # Max number of times to attempt to resume a run with a new run worker. Defaults to 3 if the
      # run launcher supports resuming runs, otherwise defaults to 0.
      maxResumeRunAttempts: 0
So I would expect there to be no resumes.
What version of the Helm chart are you using? That sounds like a bug that we fixed sometime around the 1.2 release
If you're on an earlier version of the helm chart than that, I would expect setting it to -1 instead of 0 to turn it off
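On those earlier chart versions, the values.yaml change would look something like this (a sketch; key names follow the Dagster Helm chart's run monitoring section):

```yaml
runMonitoring:
  enabled: true
  # On older chart versions, -1 (rather than 0) disables run resumption entirely
  maxResumeRunAttempts: -1
```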
Awesome, will update to fix it - thanks for all the help!
@Pablo Beltran hello! we are having the same issue
have you figured out the issue?
@Arsenii Poriadin It only stopped after I set the value to -1
but setting that value won't help mitigate the error itself, will it?
Oh sorry, that fixed the other issue. The error itself ended up being related to the node pool, but we never got to the root of it. We swapped to a larger node instance and it stopped happening.