# deployment-kubernetes
Hi! Does anybody know the recommended limits of parallelization in a single cluster using the k8s executor? For instance, if in a single execution I do a map operation with 300 pods lasting a minute each, without setting a limit for the autoscaler, would that be OK? I tried something like that and the AWS EKS API started to fail with this error:
Hi Carlos, is it possible to post the full stack trace? The inner exception that gets cut off there might have some clues
dagster_k8s.client.DagsterK8sUnrecoverableAPIError: Unexpected error encountered in Kubernetes API Client.
  File "/home/mehta/.local/lib/python3.9/site-packages/dagster/_core/executor/step_delegating/step_delegating_executor.py", line 248, in execute
    health_check_result = self._step_handler.check_step_health(
  File "/home/mehta/.local/lib/python3.9/site-packages/dagster_k8s/executor.py", line 264, in check_step_health
    status = self._api_client.get_job_status(
  File "/home/mehta/.local/lib/python3.9/site-packages/dagster_k8s/client.py", line 359, in get_job_status
    return k8s_api_retry(_get_job_status, max_retries=3, timeout=wait_time_between_attempts)
  File "/home/mehta/.local/lib/python3.9/site-packages/dagster_k8s/client.py", line 114, in k8s_api_retry
    raise DagsterK8sUnrecoverableAPIError(
The above exception was caused by the following exception:
kubernetes.client.exceptions.ApiException: (401)
Reason: Unauthorized
HTTP response headers: HTTPHeaderDict({'Audit-Id': '1c7464ab-3d48-4848-a736-47285ac72638', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'Date': 'Tue, 24 Jan 2023 13:51:45 GMT', 'Content-Length': '129'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Unauthorized","reason":"Unauthorized","code":401}

  File "/home/mehta/.local/lib/python3.9/site-packages/dagster_k8s/client.py", line 95, in k8s_api_retry
    return fn()
  File "/home/mehta/.local/lib/python3.9/site-packages/dagster_k8s/client.py", line 356, in _get_job_status
    job = self.batch_api.read_namespaced_job_status(job_name, namespace=namespace)
  File "/home/mehta/.local/lib/python3.9/site-packages/kubernetes/client/api/batch_v1_api.py", line 2785, in read_namespaced_job_status
    return self.read_namespaced_job_status_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/home/mehta/.local/lib/python3.9/site-packages/kubernetes/client/api/batch_v1_api.py", line 2872, in read_namespaced_job_status_with_http_info
    return self.api_client.call_api(
  File "/home/mehta/.local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/home/mehta/.local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/home/mehta/.local/lib/python3.9/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/home/mehta/.local/lib/python3.9/site-packages/kubernetes/client/rest.py", line 241, in GET
    return self.request("GET", url,
  File "/home/mehta/.local/lib/python3.9/site-packages/kubernetes/client/rest.py", line 235, in request
    raise ApiException(http_resp=r)
That only happened when stressing it. It works well with a lower limit in the autoscaler, for instance
(by having pods in pending until a node gets ready)
I’m seeing something here that looks potentially related https://github.com/aws/containers-roadmap/issues/1810
We’re already retrying that API call a few times, but maybe we need to make the number of retries configurable for situations like this
But is the use case correct, or should I be batching in some way? I expect to have dozens of those executions at the same time
I wouldn’t expect dagster to be a bottleneck there if the underlying cluster can handle the number of pods you want to have at once
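For anyone who does want to cap parallelism from the Dagster side instead of relying on the autoscaler, here is a minimal sketch of run config. It assumes the job uses `k8s_job_executor` and that the installed dagster-k8s version supports the executor's `max_concurrent` option; the value 50 is an arbitrary placeholder, not a recommendation.

```python
# Run config sketch; assumes the job's executor is dagster_k8s.k8s_job_executor.
# 50 is an arbitrary illustrative cap, not a recommended value.
run_config = {
    "execution": {
        "config": {
            # At most 50 step pods running at once within a single run.
            "max_concurrent": 50,
        }
    }
}
```

Note that this cap applies per run, so with dozens of runs in flight the total number of pods in the cluster can still be much higher.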
In case you haven't seen it: https://kubernetes.io/docs/setup/best-practices/cluster-large/ . Stress testing our cluster I eventually ran into etcdserver running out of space to track pods, but that was with more like 10k pods queued. (We don't currently have an autoscaler.)
thanks for the info 🙂
It looks like right now we only retry k8s errors on these error codes:
    503,  # Service unavailable
    504,  # Gateway timeout
    500,  # Internal server error
we could potentially add a 401 here as well as a workaround for that EKS bug (assuming that's what it is) - but generally retrying on a 401 wouldn't help since that typically indicates a permissions issue
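To illustrate the idea of retrying only on a fixed set of status codes, with 401 added as the proposed workaround, here is a rough sketch. It is not the dagster-k8s implementation: `ApiError`, `call_with_retry`, and `RETRYABLE_CODES` are hypothetical names, and `ApiError` is only a stand-in for the Kubernetes client's `ApiException`.

```python
import time


class ApiError(Exception):
    """Hypothetical stand-in for kubernetes.client.exceptions.ApiException."""

    def __init__(self, status):
        super().__init__(f"API error ({status})")
        self.status = status


# The three codes listed in the thread, plus 401 as the proposed workaround.
RETRYABLE_CODES = {500, 503, 504, 401}


def call_with_retry(fn, max_retries=3, wait_seconds=0.0, retryable_codes=RETRYABLE_CODES):
    """Call fn(), retrying up to max_retries times on retryable status codes."""
    attempts_left = max_retries
    while True:
        try:
            return fn()
        except ApiError as e:
            # Non-retryable codes and exhausted retries propagate to the caller.
            if e.status not in retryable_codes or attempts_left <= 0:
                raise
            attempts_left -= 1
            time.sleep(wait_seconds)
```

With this shape, a transient 401 is retried a bounded number of times, while a persistent 401 (a real permissions problem) still surfaces once the retries run out.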
Yep, I understand that, but as I mentioned, it happens on a running cluster with other maps working on the same transaction. It's weird
yeah my suspicion is the root cause there is an EKS or k8s bug rather than a dagster issue - but we may be able to work around it
I can send out a PR that adds 401 to that list of error codes and see what the k8s experts on the team think
great! thank you very much! 🙂
Hi @Carlos Pega we are seeing a similar issue while stress testing some of our infra, were you able to find a workaround for this issue?
Hi @Keith Gross, let me ask a colleague of mine and I'll get back to you
it seems like @daniel’s pr went through: https://github.com/dagster-io/dagster/blob/master/CHANGES.md#bugfixes-3
• [dagster-k8s] Fixed an issue where pods launched by the
would sometimes unexpectedly fail due to transient 401 errors in certain kubernetes clusters.
I know this was a while ago, but thanks for responding. Our issue turned out to be related to something else.