Mark Fickett
03/02/2023, 3:15 PM
We're seeing etcdserver: request timed out errors (as of agent version 1.1.10). Looking at our cluster resources, I don't see anything obvious that would be putting strain on etcdserver. Any suggestions? Full error in thread. I'll try updating our agent to 1.1.20, but wanted to surface the error too, since this has only come up about once a week.
kubernetes.client.exceptions.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'c27783d1-149c-4c14-aa13-39b704e351c3', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'be963a72-9c7c-4ba6-8b96-08dd1b67b88b', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'd762e8f5-544e-4b13-8031-f7f6909643ff', 'Date': 'Thu, 02 Mar 2023 03:34:23 GMT', 'Content-Length': '122'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"etcdserver: request timed out","code":500}
File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/api.py", line 991, in pipeline_execution_iterator
for event in pipeline_context.executor.execute(pipeline_context, execution_plan):
File "/usr/local/lib/python3.10/site-packages/dagster/_core/executor/step_delegating/step_delegating_executor.py", line 305, in execute
list(
File "/usr/local/lib/python3.10/site-packages/dagster_k8s/executor.py", line 260, in launch_step
self._api_client.batch_api.create_namespaced_job(
File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
return self.create_namespaced_job_with_http_info(namespace, body, **kwargs) # noqa: E501
File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
return self.api_client.call_api(
File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
return self.__call_api(resource_path, method,
File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
response_data = self.request(
File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 391, in request
return self.rest_client.POST(url,
File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 276, in POST
return self.request("POST", url,
File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 235, in request
raise ApiException(http_resp=r)
dagster_k8s/executor.py", line 260, in launch_step
would make sense. I see some other recent improvements to the k8s executor in the changelog (so I'm eager to catch our version up!) but I don't see something that would obviously address this.daniel
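Roughly the shape of retry I have in mind, just as a sketch (the helper, the backoff numbers, and the etcdserver check are mine for illustration, not what dagster-k8s does today):
```
import time

from kubernetes.client.exceptions import ApiException


def create_job_with_retry(batch_api, namespace, body, max_attempts=3, base_delay=5):
    """Retry create_namespaced_job when etcd behind the apiserver times out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return batch_api.create_namespaced_job(namespace=namespace, body=body)
        except ApiException as exc:
            transient = exc.status == 500 and "etcdserver" in (exc.body or "")
            if not transient or attempt == max_attempts:
                raise
            # Defrag-related timeouts are short-lived, so back off and retry.
            # A real version would also treat a 409 Conflict on the retry as
            # success, in case the first create actually went through server-side.
            time.sleep(base_delay * attempt)
```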
daniel
03/02/2023, 4:26 PM

Mark Fickett
03/02/2023, 4:30 PM

Andrea Giardini
03/02/2023, 5:43 PM

Mark Fickett
03/02/2023, 5:54 PM

Andrea Giardini
03/03/2023, 10:04 AM

Mark Fickett
03/03/2023, 2:01 PM

Andrea Giardini
03/03/2023, 2:04 PM

Mark Fickett
03/03/2023, 2:05 PM

daniel
03/03/2023, 2:36 PM

Mark Fickett
03/15/2023, 1:02 PM
```
The increase in error rates and latency observed in your cluster aligns with the time when defragmentation was performed on the etcd cluster backing your EKS cluster. EKS performs periodic defragmentation, one etcd node at a time, as a standard process on etcd to prevent etcd from running out of disk space. It is expected that defragmentation will result in timeouts and latency for requests that happen to be connected to the node being defragmented. Kubernetes is designed to tolerate short-lived timeouts to a subset of requests, and minimal disruption is expected to workloads in the cluster. Customers can reduce the number of objects and the size of each object to minimize the impact of defragmentation. EKS is working with the upstream etcd community to further optimize defragmentation and reduce actual impact to requests.
In detail:
etcd stores data in a multiversion persistent key-value store. The persistent key-value store preserves the previous version of a key-value pair when its value is superseded with new data. The key-value store is effectively immutable; its operations do not update the structure in-place, but instead always generate a new updated structure. All past versions of keys are still accessible and watchable after modification. To prevent the data store from growing indefinitely over time and from maintaining old versions, the store may be compacted to shed the oldest versions of superseded data.
https://etcd.io/docs/v3.5/learning/data_model/
Compacting the keyspace history drops all information about keys superseded prior to a given keyspace revision. The space used by these keys then becomes available for additional writes to the keyspace.
https://etcd.io/docs/v3.5/op-guide/maintenance/#history-compaction-v3-api-key-value-database
After compacting the keyspace, the backend database may exhibit internal fragmentation. Any internal fragmentation is space that is free to use by the backend but still consumes storage space. Compacting old revisions internally fragments etcd by leaving gaps in backend database. Fragmented space is available for use by etcd but unavailable to the host filesystem. In other words, deleting application data does not reclaim the space on disk.
https://etcd.io/docs/v3.5/op-guide/maintenance/#defragmentation
```
So, it sounds like they're saying it's WAI (working as intended). I will push back and ask what options they have to reduce/avoid the timeouts, but for now it sounds like adding a retry would be very helpful. (Also, I'm curious whether any other customers using EKS are running into this.)
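On the "reduce the number of objects" suggestion: one idea is to set a TTL on the completed step Jobs so that finished Job/Pod objects get garbage-collected instead of accumulating in etcd. A minimal sketch with the raw Kubernetes client just to show the field I mean (the names, namespace, image, and one-hour TTL are placeholders, and in our setup this would really need to be wired through the executor's job config rather than created by hand):
```
from kubernetes import client

# Job spec with a TTL so the control plane garbage-collects finished Jobs,
# keeping the count of Job/Pod objects stored in etcd down.
job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="dagster-step-example", namespace="dagster"),
    spec=client.V1JobSpec(
        ttl_seconds_after_finished=3600,  # delete the Job ~1h after it finishes
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(
                restart_policy="Never",
                containers=[client.V1Container(name="step", image="example/image:tag")],
            )
        ),
    ),
)

# Then: client.BatchV1Api().create_namespaced_job(namespace="dagster", body=job)
```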