# deployment-kubernetes
c
Hi! Does anybody know what the recommended limits of parallelization in a single cluster are when using the k8s executor? For instance, if in a single execution I do a map operation with 300 pods lasting a minute each, and without setting a limit for the autoscaler, would that be OK? I tried something like that and the AWS EKS API started to fail with this error:
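For context, the fan-out described above would look roughly like this (a minimal sketch, not the actual job; the op names, the 300-way split, and the per-item work are placeholders):

from dagster import DynamicOut, DynamicOutput, job, op
from dagster_k8s import k8s_job_executor


@op(out=DynamicOut())
def fan_out():
    # Emit ~300 dynamic outputs; the k8s executor runs each mapped step in its own pod.
    for i in range(300):
        yield DynamicOutput(i, mapping_key=str(i))


@op
def process_item(item: int) -> int:
    # Placeholder for the roughly one minute of work each pod performs.
    return item * 2


@job(executor_def=k8s_job_executor)
def mapped_job():
    fan_out().map(process_item)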
d
Hi Carlos, is it possible to post the full stack trace? The inner exception that gets cut off there might have some clues
c
sure
dagster_k8s.client.DagsterK8sUnrecoverableAPIError: Unexpected error encountered in Kubernetes API Client.
  File "/home/mehta/.local/lib/python3.9/site-packages/dagster/_core/executor/step_delegating/step_delegating_executor.py", line 248, in execute
    health_check_result = self._step_handler.check_step_health(
  File "/home/mehta/.local/lib/python3.9/site-packages/dagster_k8s/executor.py", line 264, in check_step_health
    status = self._api_client.get_job_status(
  File "/home/mehta/.local/lib/python3.9/site-packages/dagster_k8s/client.py", line 359, in get_job_status
    return k8s_api_retry(_get_job_status, max_retries=3, timeout=wait_time_between_attempts)
  File "/home/mehta/.local/lib/python3.9/site-packages/dagster_k8s/client.py", line 114, in k8s_api_retry
    raise DagsterK8sUnrecoverableAPIError(
The above exception was caused by the following exception:
kubernetes.client.exceptions.ApiException: (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Audit-Id': '1c7464ab-3d48-4848-a736-47285ac72638', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 24 Jan 2023 13:51:45 GMT', 'Content-Length': '129'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}


  File "/home/mehta/.local/lib/python3.9/site-packages/dagster_k8s/client.py", line 95, in k8s_api_retry
    return fn()
  File "/home/mehta/.local/lib/python3.9/site-packages/dagster_k8s/client.py", line 356, in _get_job_status
    job = self.batch_api.read_namespaced_job_status(job_name, namespace=namespace)
  File "/home/mehta/.local/lib/python3.9/site-packages/kubernetes/client/api/batch_v1_api.py", line 2785, in read_namespaced_job_status
    return self.read_namespaced_job_status_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/mehta/.local/lib/python3.9/site-packages/kubernetes/client/api/batch_v1_api.py", line 2872, in read_namespaced_job_status_with_http_info
    return self.api_client.call_api(
  File "/home/mehta/.local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/mehta/.local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/mehta/.local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/home/mehta/.local/lib/python3.9/site-packages/kubernetes/client/rest.py", line 241, in GET
    return self.request("GET", url,
  File "/home/mehta/.local/lib/python3.9/site-packages/kubernetes/client/rest.py", line 235, in request
    raise ApiException(http_resp=r)
That only happened by stressing it. It works well with a lower limit on the autoscaler, for instance (pods stay in Pending until a node gets ready).
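A Dagster-side way to get a similar cap (a sketch assuming the k8s_job_executor's max_concurrent config option; the limit of 50 and the op/job names are placeholders) is to limit how many step pods a single run keeps in flight at once:

from dagster import job, op
from dagster_k8s import k8s_job_executor

# Cap concurrent step pods per run so the fan-out does not outpace the autoscaler.
throttled_executor = k8s_job_executor.configured({"max_concurrent": 50})


@op
def some_step():
    ...


@job(executor_def=throttled_executor)
def throttled_job():
    some_step()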
d
I'm seeing something here that looks potentially related: https://github.com/aws/containers-roadmap/issues/1810
We're already retrying that API call a few times, but maybe we need to make the number of retries configurable for situations like this.
c
But is the use case correct, or should I be batching in some way? I expect to have dozens of those executions at the same time.
d
I wouldn't expect Dagster to be a bottleneck there if the underlying cluster can handle the number of pods you want to have running at once.
m
In case you haven't seen it: https://kubernetes.io/docs/setup/best-practices/cluster-large/. Stress-testing our cluster, I eventually ran into etcdserver running out of space to track pods, but that was with more like 10k pods queued. (We don't currently have an autoscaler.)
c
thanks for the info 🙂
d
It looks like right now we only retry k8s errors on these error codes:
WHITELISTED_TRANSIENT_K8S_STATUS_CODES = [
    503,  # Service unavailable
    504,  # Gateway timeout
    500,  # Internal server error
]
We could potentially add 401 here as well, as a workaround for that EKS bug (assuming that's what it is), but generally retrying on a 401 wouldn't help, since it typically indicates a permissions issue.
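Roughly how that whitelist gets applied by the retry logic (a simplified sketch, not the actual dagster_k8s implementation; the function name, wait time, and constant name are illustrative):

import time

from kubernetes.client.exceptions import ApiException

# Status codes treated as transient and therefore retried;
# adding 401 would be the proposed workaround for the EKS issue.
TRANSIENT_STATUS_CODES = [500, 503, 504]


def k8s_api_retry_sketch(fn, max_retries=3, wait_seconds=5):
    # Call fn(), retrying only on whitelisted transient API status codes.
    for attempt in range(max_retries):
        try:
            return fn()
        except ApiException as exc:
            if exc.status not in TRANSIENT_STATUS_CODES or attempt == max_retries - 1:
                # Non-transient codes (e.g. a genuine 401/403) surface immediately.
                raise
            time.sleep(wait_seconds)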
c
Yep, I understand that, but as I mentioned, it happens on a running cluster with other maps working on the same transaction. It's weird.
d
Yeah, my suspicion is that the root cause there is an EKS or k8s bug rather than a Dagster issue, but we may be able to work around it.
I can send out a PR that adds 401 to that list of error codes and see what the k8s experts on the team think
c
great! thank you very much! 🙂
k
Hi @Carlos Pega, we are seeing a similar issue while stress-testing some of our infra. Were you able to find a workaround for this?
c
Hi @Keith Gross, let me ask a colleague of mine and I'll come back to you.
It seems like @daniel's PR went through: https://github.com/dagster-io/dagster/blob/master/CHANGES.md#bugfixes-3
• [dagster-k8s] Fixed an issue where pods launched by the k8s_job_executor would sometimes unexpectedly fail due to transient 401 errors in certain kubernetes clusters.
k
I know this was a while ago, but thanks for responding. Our issue turned out to be related to something else.