# deployment-kubernetes
a
Hi folks, I’m running 128 parallel tasks on 0.15.0 with the k8s_executor and am seeing sporadic groups of failing ops with this k8s event on the failing pods:
Error: failed to reserve container name "dagster_dagster-step-7ac915e19de074261268d861f51d1504-lh5gh_x-dagster_931e866f-92f0-4af8-b087-875b78dd1128_0": name "dagster_dagster-step-7ac915e19de074261268d861f51d1504-lh5gh_x-dagster_931e866f-92f0-4af8-b087-875b78dd1128_0" is reserved for "f4795a5d5e4e9c42a46bf59b3a98d1401fc871a03226a71479c8a65c4c15a21c"
Seems like this could be retry-related, but I’d be keen to hear from anyone else who has seen this.
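For context, here is a minimal sketch (not from the thread) of the kind of fan-out setup being described, assuming the `k8s_job_executor` from `dagster_k8s` so each op runs in its own pod. The op and job names and the fan-out size are illustrative only.

```python
# Sketch: a Dagster job that fans out many parallel ops, each executed in its
# own Kubernetes pod via the k8s executor. Names here are hypothetical.
from dagster import DynamicOut, DynamicOutput, job, op
from dagster_k8s import k8s_job_executor

N_TASKS = 128  # parallel fan-out comparable to the one described above


@op(out=DynamicOut())
def fan_out():
    # Emit one dynamic output per shard so each maps to its own step/pod.
    for i in range(N_TASKS):
        yield DynamicOutput(i, mapping_key=str(i))


@op
def process_shard(shard: int) -> int:
    return shard * 2  # placeholder for the real per-shard work


@job(executor_def=k8s_job_executor)
def many_pods_job():
    fan_out().map(process_shard)
```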
j
Hmm I haven’t seen this, are you on GKE?
a
Thanks, I had a glance at this, but I’m using EKS. I’ve generally found that running over 100 ops as K8s jobs via Dagster has a long tail of errors like this. Most likely I’ll try to re-frame our use case.
j
Hmm, sorry you’re running into stuff like this
Seems like the common thread on this error at least is clusters being under heavy load https://github.com/elastic/cloud-on-k8s/issues/2632
I’m a bit curious what container
f4795a5d5e4e9c42a46bf59b3a98d1401fc871a03226a71479c8a65c4c15a21c
is in your example. Are two containers trying to be created for the same pod?
a
f4795a5d5e4e9c42a46bf59b3a98d1401fc871a03226a71479c8a65c4c15a21c
I believe that’s the dagster step container. It’s a standard setup that works fine 99% of the time, so it’s evidently some race condition triggered by transient errors during a scale-up.
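If the failures really are transient scale-up races, one mitigation (my own assumption, not something confirmed in the thread) is an op-level retry policy so a step whose pod hits this kind of containerd error is retried instead of failing the run. The parameter values below are illustrative.

```python
# Hedged sketch: retry a flaky step a few times with exponential backoff,
# rather than letting one transient pod error fail the whole run.
from dagster import Backoff, RetryPolicy, op


@op(
    retry_policy=RetryPolicy(
        max_retries=3,              # give up after three attempts
        delay=30,                   # seconds before the first retry
        backoff=Backoff.EXPONENTIAL,
    )
)
def flaky_k8s_step():
    ...  # the real step body goes here
```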
r
@Alex Remedios we’re using EKS as well and it happens to us mostly during peak load. Did you solve it somehow? Thanks.
a
hi Roei, I’ve resolved to use the dagster multiprocess executor for orchestration and then submit tasks to a Ray cluster, which may be better suited for high-dimensional homogeneous compute.
👍 1
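A rough sketch of that workaround, under my own assumptions about the setup: Dagster’s multiprocess executor handles orchestration on a single node, and the heavy fan-out is submitted to an already-running Ray cluster. The Ray address and the shard/compute names are hypothetical.

```python
# Sketch: orchestrate with Dagster's multiprocess executor, push the actual
# parallel work to a Ray cluster instead of one pod per op.
import ray
from dagster import job, multiprocess_executor, op


@ray.remote
def compute_shard(i: int) -> int:
    return i * 2  # placeholder for the real per-shard work


@op
def run_on_ray() -> list:
    # Connect to an existing Ray cluster (assumed reachable from the op).
    ray.init(address="auto", ignore_reinit_error=True)
    futures = [compute_shard.remote(i) for i in range(128)]
    return ray.get(futures)


@job(executor_def=multiprocess_executor)
def ray_backed_job():
    run_on_ray()
```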