# deployment-kubernetes
m
We're seeing occasional `etcdserver request timed out` errors (as of agent version 1.1.10). Looking at our cluster resources, I don't see anything obvious that would be putting strain on `etcdserver`. Any suggestions? Full error in thread. I'll try updating our agent to 1.1.20, but wanted to surface the error too, since this has only come up about once a week.
```
kubernetes.client.exceptions.ApiException: (500)
Reason: Internal Server Error
HTTP response headers: HTTPHeaderDict({'Audit-Id': 'c27783d1-149c-4c14-aa13-39b704e351c3', 'Cache-Control': 'no-cache, private', 'Content-Type': 'application/json', 'X-Kubernetes-Pf-Flowschema-Uid': 'be963a72-9c7c-4ba6-8b96-08dd1b67b88b', 'X-Kubernetes-Pf-Prioritylevel-Uid': 'd762e8f5-544e-4b13-8031-f7f6909643ff', 'Date': 'Thu, 02 Mar 2023 03:34:23 GMT', 'Content-Length': '122'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"etcdserver: request timed out","code":500}
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/execution/api.py", line 991, in pipeline_execution_iterator
    for event in pipeline_context.executor.execute(pipeline_context, execution_plan):
  File "/usr/local/lib/python3.10/site-packages/dagster/_core/executor/step_delegating/step_delegating_executor.py", line 305, in execute
    list(
  File "/usr/local/lib/python3.10/site-packages/dagster_k8s/executor.py", line 260, in launch_step
    self._api_client.batch_api.create_namespaced_job(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 210, in create_namespaced_job
    return self.create_namespaced_job_with_http_info(namespace, body, **kwargs)  # noqa: E501
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api/batch_v1_api.py", line 309, in create_namespaced_job_with_http_info
    return self.api_client.call_api(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 348, in call_api
    return self.__call_api(resource_path, method,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 180, in __call_api
    response_data = self.request(
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/api_client.py", line 391, in request
    return self.rest_client.POST(url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 276, in POST
    return self.request("POST", url,
  File "/usr/local/lib/python3.10/site-packages/kubernetes/client/rest.py", line 235, in request
    raise ApiException(http_resp=r)
```
I'm wondering if a retry around `dagster_k8s/executor.py", line 260, in launch_step` would make sense. I see some other recent improvements to the k8s executor in the changelog (so I'm eager to catch our version up!), but I don't see anything that would obviously address this.
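To sketch what I mean (purely illustrative; the helper name and backoff numbers are made up, not anything that exists in dagster_k8s today), something like this wrapped around the `create_namespaced_job` call:
```
import time

from kubernetes.client.exceptions import ApiException


def create_job_with_retries(batch_api, namespace, body, max_attempts=5, base_delay=1.0):
    """Hypothetical helper (not part of dagster_k8s): retry transient etcd timeouts."""
    for attempt in range(1, max_attempts + 1):
        try:
            return batch_api.create_namespaced_job(namespace=namespace, body=body)
        except ApiException as e:
            # Only retry 500s whose body mentions the etcd timeout; re-raise everything else.
            transient = e.status == 500 and "etcdserver: request timed out" in str(e.body or "")
            if not transient or attempt == max_attempts:
                raise
            # Back off 1s, 2s, 4s, 8s... before trying again.
            time.sleep(base_delay * 2 ** (attempt - 1))
```
Could easily be the wrong layer to hook this in, though.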
d
Hmm, I don't think I've seen this one before; I'll have to ask the k8s experts. A retry on this particular call might be tricky because it would run the risk of creating multiple jobs if we're not careful.
m
OK, thanks for taking a look. If said experts have suggestions for what might cause a timeout, I would definitely be glad to hear those too.
a
Where is the cluster deployed? This looks like a server error rather than a client error. Does it happen only with 1.1.10?
m
The cluster is in our AWS account in us-east-1. I agree it looks like an error response from the server. But it happens rarely (it hasn't come up in our staging Dagster deployment, which uses the same environment, and has only come up 2x in the 2 weeks since we moved prod from local multithreading to EKS), so it will probably be hard to track down, and if the client can be robust to it, that would be awesome. I've only seen it with 1.1.10, but like I said, we just moved prod; I'm going to try upgrading the agent and see if that affects it.
a
To me, this looks like a problem on the EKS side, not Dagster. I guess it doesn't happen on staging maybe because there are fewer pipelines running? I think this call should be OK to retry, because jobs have unique names, so it's not possible to create two jobs with the same name.
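Just to illustrate (only a sketch, the helper name is made up, and it assumes the job body is a plain dict manifest): a 409 on the retry just means the Job already exists from the attempt that timed out client-side but landed server-side, so it can be treated as success:
```
from kubernetes.client.exceptions import ApiException


def create_job_idempotent(batch_api, namespace, body):
    """Sketch only: treat 409 Conflict as "already created" when retrying."""
    try:
        return batch_api.create_namespaced_job(namespace=namespace, body=body)
    except ApiException as e:
        if e.status == 409:
            # The Job was created by an earlier (timed-out) attempt; read it back.
            # Assumes `body` is a dict manifest; with a V1Job object it would be
            # body.metadata.name instead.
            return batch_api.read_namespaced_job(
                name=body["metadata"]["name"], namespace=namespace
            )
        raise
```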
m
Thanks. I sent a support case to AWS, I'll see what they say. Reading more EKS docs I see that the control plane runs on AWS-owned nodes, not our nodes, so it seems unlikely there's interaction between (for example) our disk usage and etcdserver disk writes. So I agree it seems hard for it to be caused by something Dagster / application side.
a
Do you know if you have multiple control-plane nodes? I'm not sure about AWS, but on Google Cloud you can have either a single control plane or multiple HA control planes (for an extra cost).
m
I haven't seen a way to figure that out or adjust it.
d
Let us know what AWS ends up saying, Mark - not opposed to us making some changes on our side if they recommend just retrying when this happens.
m
AWS says:
```The increase in error rates and latency observed in your cluster align with the time when defragmentation was performed on the etcd cluster backing your EKS cluster. EKS performs periodic defragmentation, one etcd node at-a-time, as a standard process on etcd to prevent etcd from running out of disk space. It is expected that defragmentation will result in timeouts and latency to requests that happen to be connected to the node being defragmented. Kubernetes is designed to tolerate short-lived timeouts to a subset of requests and minimal disruption is expected to workloads in the cluster. Customers can reduce the number of objects and size of each object to minimize the impact of defragmentation. EKS is working with the upstream etcd community to further optimize defragmentation and reduce actual impact to requests.
In details:
etcd stores data in a multiversion persistent key-value store. The persistent key-value store preserves the previous version of a key-value pair when its value is superseded with new data. The key-value store is effectively immutable; its operations do not update the structure in-place, but instead always generate a new updated structure. All past versions of keys are still accessible and watchable after modification. To prevent the data store from growing indefinitely over time and from maintaining old versions, the store may be compacted to shed the oldest versions of superseded data.
https://etcd.io/docs/v3.5/learning/data_model/
Compacting the keyspace history drops all information about keys superseded prior to a given keyspace revision. The space used by these keys then becomes available for additional writes to the keyspace.
https://etcd.io/docs/v3.5/op-guide/maintenance/#history-compaction-v3-api-key-value-database
After compacting the keyspace, the backend database may exhibit internal fragmentation. Any internal fragmentation is space that is free to use by the backend but still consumes storage space. Compacting old revisions internally fragments etcd by leaving gaps in backend database. Fragmented space is available for use by etcd but unavailable to the host filesystem. In other words, deleting application data does not reclaim the space on disk.
https://etcd.io/docs/v3.5/op-guide/maintenance/#defragmentation```
So it sounds like they're saying it's working as intended. I'll push back and ask what options they have to reduce or avoid the timeouts, but for now it sounds like adding a retry would be very helpful. (Also, I'm curious whether any other customers using EKS are running into this.)
Bumping this -- I filed https://github.com/dagster-io/dagster/issues/13059 to collect the details, but we're still seeing timeouts a few times a week, and AWS support hasn't made any changes that resolve the timeouts (which are caused by etcdserver defragmentation). They have let us know the defrags take about 20s, so a typical retry-with-backoff should get around them. Since our jobs fail when the k8s executor hits these errors, it's important for us to resolve this, and unfortunately it doesn't seem like we can get around it without framework help from y'all.
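For what it's worth, even a short exponential backoff would comfortably cover a ~20s defrag window; rough numbers (purely illustrative, not anything dagster_k8s uses today):
```
# Base 1s exponential backoff over 5 retries waits 1 + 2 + 4 + 8 + 16 = 31 seconds
# in total, longer than the ~20s defrag window AWS quoted.
delays = [2 ** i for i in range(5)]
print(delays, sum(delays))  # [1, 2, 4, 8, 16] 31
```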