# deployment-kubernetes
m
Is there some maximum time a pod is allowed to be pending before it's treated as failed? In my case with 5k dynamic outs, eventually some of the pods which were still Status:Pending according to kubectl showed up as failed in Dagit (stack in thread), which caused my job to fail. I'm trying to figure out if it's a reasonable theory that the agent just got tired of waiting for the pods to schedule (even though k8s thought they would eventually get scheduled) and decided they must have had an error, or if there was some error status or communication error on/in the pod I overlooked.
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='172.20.0.1', port=443): Max retries exceeded with url: /apis/batch/v1/namespaces/dagster-cloud/jobs/dagster-step-2262cd711596e75e5477cb052aa81db2 (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f4f7b86d780>: Failed to establish a new connection: [Errno 111] Connection refused'))
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/executor/step_delegating/step_delegating_executor.py", line 243, in execute
    health_check_result = self._step_handler.check_step_health(
  File "/usr/local/lib/python3.10/site-packages/dagster_k8s/executor.py", line 248, in check_step_health
    job = self._batch_api.read_namespaced_job(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 2657, in read_namespaced_job
    return self.read_namespaced_job_with_http_info(name, namespace, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 2744, in read_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 373, in request
    return self.rest_client.GET(url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 241, in GET
    return self.request("GET", url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 214, in request
    r = self.pool_manager.request(method, url,
  File "/usr/local/lib/python3.10/site-packages/urllib3/request.py", line 75, in request
    return self.request_encode_url(
  File "/usr/local/lib/python3.10/site-packages/urllib3/request.py", line 97, in request_encode_url
    return self.urlopen(method, url, **extra_kw)
  File "/usr/local/lib/python3.10/site-packages/urllib3/poolmanager.py", line 336, in urlopen
    response = conn.urlopen(method, u.request_uri, **kw)
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 754, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 754, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 754, in urlopen
    return self.urlopen(
  File "/usr/local/lib/python3.10/site-packages/urllib3/connectionpool.py", line 726, in urlopen
    retries = retries.increment(
  File "/usr/local/lib/python3.10/site-packages/urllib3/util/retry.py", line 439, in increment
    raise MaxRetryError(_pool, url, error or ResponseError(cause))
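(Aside, not from the thread: one way to check the scheduling theory is to pull the scheduler events for the still-Pending pods and see whether Kubernetes was still reporting something like "0/N nodes are available", or some other failure. A rough sketch with the official kubernetes Python client; the namespace is an assumption taken from the job URL in the stack trace, and the selectors should be adjusted to whatever your step jobs actually use.)
```
# Sketch: list Pending pods in the step-job namespace and print the events
# attached to each one, to see what the scheduler said about them.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
core = client.CoreV1Api()

namespace = "dagster-cloud"  # assumption, taken from the job URL above
pending = core.list_namespaced_pod(namespace, field_selector="status.phase=Pending")

for pod in pending.items:
    print(pod.metadata.name, pod.status.phase)
    events = core.list_namespaced_event(
        namespace, field_selector=f"involvedObject.name={pod.metadata.name}"
    )
    for ev in events.items:
        print(f"  [{ev.reason}] {ev.message}")
```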
d
Hey Mark - we don't have a timeout, no - the k8s executor checks up on steps to see if they're still running, and it looks like that request failed with a communication error within the k8s cluster. But there isn't a timeout at play here
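(For context on what that health check does: the executor periodically reads each step's Job object from the Kubernetes API, and the `Connection refused` in the stack trace is one of those reads failing. The sketch below shows roughly what retrying that transient failure looks like - it is an illustrative approximation, not the actual dagster-k8s code, and the retry counts and exception handling are assumptions.)
```
# Illustrative only -- not the dagster-k8s implementation. Shows the kind of
# retry that makes a transient in-cluster "Connection refused" non-fatal when
# polling a step's Job for health.
import time
import urllib3
from kubernetes import client, config
from kubernetes.client.rest import ApiException

config.load_incluster_config()
batch_api = client.BatchV1Api()

def read_job_with_retries(name, namespace, attempts=3, backoff=2.0):
    for attempt in range(attempts):
        try:
            return batch_api.read_namespaced_job(name=name, namespace=namespace)
        except (urllib3.exceptions.MaxRetryError, ApiException):
            # Retry transient failures; re-raise on the last attempt so a
            # genuinely broken step still surfaces as an error.
            if attempt == attempts - 1:
                raise
            time.sleep(backoff * (attempt + 1))

job = read_job_with_retries(
    "dagster-step-2262cd711596e75e5477cb052aa81db2", "dagster-cloud"
)
print(job.status)
```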
a
version 1.1.6 includes a change that implements better retries for the specific exception you hit here
a
@alex Can you link to it here?
m
Thanks!
a