# deployment-kubernetes
m
Is anyone running a pipeline with thousands of steps in step-per-pod mode? We're doing that on EKS but getting etcdserver timeouts, which AWS support says happen because etcd is defragmenting its database (a couple of times per week). I'm wondering if we're just generating way more short-lived pods than k8s (or at least EKS) is meant to handle, or whether roughly this setup works fine for someone else, maybe hosted differently.
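For context, something like this is what I mean by step-per-pod: a minimal sketch using dagster_k8s's k8s_job_executor, which launches every step as its own Kubernetes Job/pod (job and op names here are made up):
```python
# Sketch of a step-per-pod setup: with k8s_job_executor, each step runs
# in its own Kubernetes Job/pod, so a run with thousands of steps creates
# thousands of short-lived pods. Names are hypothetical.
from dagster import job, op
from dagster_k8s import k8s_job_executor

@op
def process_partition():
    ...

@job(executor_def=k8s_job_executor)
def big_fanout_job():
    process_partition()
```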
j
Offhand I know that we do have users on EKS with at least 1k pods per run. Of course they might not be running it as frequently. Have you heard from AWS support whether they recommend retries here?
m
Thanks Johann, good to have corroboration. The latest from EKS (last night) is that they're pushing for retries: "From EKS side, we have reduced the defrag frequency - earlier it was once in 6 hours, now it is once in 24 hours. This should reduce the impact but I would request you to please check if the workload can retry upon a timeout to make it more resilient." I'm not actually sure reducing defrags from 6h to 24h intervals will help: we're seeing interruptions every 1-2 days anyway, and presumably a less frequent defrag will take longer each time. So adding retries would be great!
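For anyone else hitting this, here's a minimal sketch of step-level retries via Dagster's RetryPolicy, which re-executes a failed step (and so re-launches its pod) with backoff; op name and retry parameters are illustrative, not a recommendation:
```python
# Sketch: retry a failed step with exponential backoff and jitter,
# one way to ride out a transient etcd defrag window.
# The op name and parameter values are illustrative.
from dagster import op, RetryPolicy, Backoff, Jitter

@op(
    retry_policy=RetryPolicy(
        max_retries=3,
        delay=30,            # seconds before the first retry
        backoff=Backoff.EXPONENTIAL,
        jitter=Jitter.FULL,  # randomize delays so retries don't stampede
    )
)
def flaky_step():
    ...
```
Caveat: if the timeout hits before the step pod even starts, op-level retries may not cover it; run-level retries (the dagster/max_retries run tag, with run retries enabled in the deployment) might be the better layer. I'm not yet sure which layer the etcd timeouts bite at.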
I filed https://github.com/dagster-io/dagster/issues/13059 with details for this (we're still having jobs fail because of the issue). Cross-posted to the older Slack thread.