Hey there I am running some jobs and keep seeing this error dagster #ask-community

Pablo Beltran

04/03/2023, 11:37 PM

Hey there! I am running some jobs and keep seeing this error: Detected run worker status UNKNOWN: 'dagster_k8s.client.DagsterK8sUnrecoverableAPIError: Unexpected error encountered in Kubernetes API Client.

Stack Trace:
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/launcher.py", line 379, in check_run_worker_health
    job_name=job_name,
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/client.py", line 366, in get_job_status
    return k8s_api_retry(_get_job_status, max_retries=3, timeout=wait_time_between_attempts)
  File "/usr/local/lib/python3.7/site-packages/dagster_k8s/client.py", line 124, in k8s_api_retry
    ) from e

The above exception was caused by the following exception:
kubernetes.client.exceptions.ApiException: (404)
Reason: Not Found
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'a65413c7-fd00-46ef-94e2-e056f1261a2e', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': '802301fe-53ea-4d4d-aeb0-efd3891f18ac', 'X-Kubernetes-Pf-Prioritylevel-Uid': '2952e8a6-fb75-43dc-b244-b39008d24b04', 'Date': 'Mon, 03 Apr 2023 23:23:27 GMT', 'Content-Length': '290'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch \"dagster-run-52fc0109-804d-441f-b1b9-c077f14a4005-1\" not found","reason":"NotFound","details":{"name":"dagster-run-52fc0109-804d-441f-b1b9-c077f14a4005-1","group":"batch","kind":"jobs"},"code":404}

daniel

04/04/2023, 2:09 AM

Hi Pablo - does that kubernetes job exist in your cluster if you run 'kubectl get jobs'? That error is saying that it can't find the kubernetes job that was created when the run was launched
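To see exactly which Kubernetes job the API could not find, the 404 response body from the stack trace can be parsed directly. The following is a minimal sketch using only the Python standard library; the helper name `missing_job_name` is made up for illustration and is not part of Dagster or the Kubernetes client.

```python
import json

# Response body copied verbatim from the stack trace in this thread.
BODY = r'{"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"jobs.batch \"dagster-run-52fc0109-804d-441f-b1b9-c077f14a4005-1\" not found","reason":"NotFound","details":{"name":"dagster-run-52fc0109-804d-441f-b1b9-c077f14a4005-1","group":"batch","kind":"jobs"},"code":404}'


def missing_job_name(response_body):
    """Hypothetical helper: return the name of the job a 404 Status refers to, else None."""
    status = json.loads(response_body)
    # Only handle "job not found" responses; anything else is out of scope here.
    if status.get("code") != 404 or status.get("reason") != "NotFound":
        return None
    details = status.get("details", {})
    if details.get("kind") != "jobs":
        return None
    return details.get("name")


job_name = missing_job_name(BODY)
print(job_name)
```

The printed name is the one to look for in the output of `kubectl get jobs` in the namespace where runs are launched.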

Pablo Beltran

04/04/2023, 2:11 AM

So the job actually fully runs as far as I can tell and it seems that this error happens at the end of it.

daniel

04/04/2023, 2:12 AM

any chance you could share a debug file from a run where you saw this? using this dropdown

daniel

04/04/2023, 2:13 AM

is your k8s cluster possibly set up to remove k8s jobs very soon after they finish? If the job ttl is quite low i wonder if there could be a race condition
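The race condition hypothesized here can be illustrated with a little arithmetic. This is only a sketch under assumptions: the timestamps and intervals are invented, `job_gone_before_check` is a hypothetical helper, and the TTL refers to the Kubernetes Job field `ttlSecondsAfterFinished`. Kubernetes deletes the job at some point after the TTL expires, so the comparison below describes the earliest possible deletion.

```python
from datetime import datetime, timedelta


def job_gone_before_check(finished_at, ttl_seconds, next_check_at):
    """Return True if the job may already be deleted when the next health check runs."""
    earliest_deletion = finished_at + timedelta(seconds=ttl_seconds)
    return earliest_deletion <= next_check_at


# Hypothetical numbers: the job finishes, and the next health check is 90 seconds later.
finished = datetime(2023, 4, 3, 23, 22, 0)
next_check = finished + timedelta(seconds=90)

short_ttl = job_gone_before_check(finished, 30, next_check)
long_ttl = job_gone_before_check(finished, 3600, next_check)
print(short_ttl)  # True: a 30 second TTL can expire before the check
print(long_ttl)   # False: a one hour TTL leaves the job in place
```

If the TTL is shorter than the gap between health checks, a check can land after the job has been cleaned up and receive a 404 like the one above.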

Pablo Beltran

04/04/2023, 6:13 PM

I will check the TTL. I sent you a DM with the debug file.

daniel

04/05/2023, 1:00 AM

ok, i took a look at the debug file you sent me - i see that error you posted happening in the middle of the run, so it does seem like something in your k8s cluster is removing the k8s jobs in the middle of the run. I also see some indication that the run pod was interrupted partway through later on by the cluster, which typically happens during things like autoscaling when the cluster adds or removes a node in the middle of the run. This isn't an error that I've seen before so it does seem unique to your cluster configuration in some way. It's hard to provide more useful information without more access to your cluster, but my best guess would be that there are some significant changes happening to your k8s cluster in the middle of the run - if there's a way to look at the k8s events logged during these runs, it might help show where all this instability is coming from. Dagster can provide observability when something like the k8s job disappearing happens, but isn't going to be great at finding the root cause of something external like that.

Pablo Beltran

04/05/2023, 6:24 AM

Ok I will look into what could be causing that. Another weird issue I have been seeing is that when this occurs the job is 'resumed' which causes duplicate runs. I have this in my values.yaml:

runMonitoring:
    enabled: true
    # Timeout for runs to start (avoids runs hanging in STARTED)
    startTimeoutSeconds: 180
    # How often to check on in progress runs
    pollIntervalSeconds: 120
    # Max number of times to attempt to resume a run with a new run worker. Defaults to 3 if the
    # run launcher supports resuming runs, otherwise defaults to 0.
    maxResumeRunAttempts: 0

So I would expect there to be no resumes.

daniel

04/05/2023, 11:51 AM

What version of the Helm chart are you using? That sounds like a bug that we fixed sometime around the 1.2 release

❤️ 1

daniel

04/05/2023, 11:53 AM

1.1.10, looks like: https://github.com/dagster-io/dagster/commit/6590ed9ea9521c03baded1e0e3ed12cc4c196f27

daniel

04/05/2023, 11:55 AM

If you're on an earlier version of the helm chart than that I would expect setting it to -1 instead of 0 to turn it off
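The thread does not say what the bug actually was. One plausible reading, which is purely an assumption here, is that the older chart treated a value of 0 as "not set" and fell back to the default of 3, while -1 passed through unchanged. The sketch below illustrates that kind of truthiness bug in plain Python; `render_buggy`, `render_fixed`, and `should_resume` are hypothetical names and do not reproduce Dagster's or the Helm chart's real code.

```python
DEFAULT_MAX_RESUME_ATTEMPTS = 3  # default quoted in the values.yaml comment in this thread


def render_buggy(value):
    """Assumed bug: a truthiness check treats 0 the same as 'not set'."""
    return value if value else DEFAULT_MAX_RESUME_ATTEMPTS


def render_fixed(value):
    """Only fall back to the default when the value is really missing."""
    return DEFAULT_MAX_RESUME_ATTEMPTS if value is None else value


def should_resume(attempts_so_far, max_attempts):
    """Hypothetical check: resume only while attempts are below the limit."""
    return attempts_so_far < max_attempts


print(render_buggy(0))                     # 3: the 0 is silently replaced
print(render_buggy(-1))                    # -1: a negative value survives
print(render_fixed(0))                     # 0
print(should_resume(0, render_buggy(0)))   # True: the run would be resumed
print(should_resume(0, render_buggy(-1)))  # False: no resume
```

Under that assumption, -1 works as a workaround on older chart versions because it is never mistaken for a missing value and no attempt count is ever below it.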

Pablo Beltran

04/05/2023, 3:53 PM

Awesome will update to fix thanks for all the help!

Arsenii Poriadin

05/09/2023, 4:44 PM

@Pablo Beltran hello! We are having the same dagster_k8s.client.DagsterK8sUnrecoverableAPIError. Have you figured out the issue?

Pablo Beltran

05/09/2023, 4:44 PM

@Arsenii Poriadin It only stopped after I set the value to -1

Arsenii Poriadin

05/09/2023, 4:59 PM

value of maxResumeRunAttempts

Arsenii Poriadin

05/09/2023, 5:00 PM

but it won't help mitigate the error dagster_k8s.client.DagsterK8sUnrecoverableAPIError itself, will it?

Pablo Beltran

05/09/2023, 5:02 PM

Oh sorry, that fixed the other issue. It ended up being related to the node pool but we never got to the root of it. We swapped to a larger node instance and it stopped happening.
