Does anyone have experience with running 1,000–10,000 ops concurrently?
We are working on a Dagster architecture where every 15 minutes we need to execute a large volume of ops (1,000–10,000). We have Dagster deployed on Kubernetes (GCP) with the Helm chart, and we're using the K8sRunLauncher so that all ops run concurrently, each in a dedicated Pod. However, with the current set-up this is not performant at all. We already see several issues when executing just 500 ops concurrently:
• Scaling the workers (i.e. Pods) takes a long time: the quickest Pods are available within 15 seconds, while the slowest take more than 10 minutes.
• We're running into DB connection issues, likely due to too many open connections (we use an external PostgreSQL database hosted on GCP).
• The time it takes to complete an op (the workload of each op is practically identical) grows from under 15 seconds to over 15 minutes.
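To illustrate the connection issue, here is a back-of-envelope sketch (the per-pod connection count and overhead are assumptions, not measured values): each concurrently running op pod holds at least one Postgres connection for Dagster's event log and run storage, so the total scales linearly with concurrency.

```python
# Hypothetical connection math for 500 concurrent op pods.
# All numbers except concurrent_ops are assumptions for illustration.
concurrent_ops = 500       # the concurrency level where we see failures
conns_per_op_pod = 2       # assumed: event-log + run-storage connections
overhead_conns = 20        # assumed: webserver, daemon, run coordinator

total = concurrent_ops * conns_per_op_pod + overhead_conns
print(total)  # 1020
```

Since managed PostgreSQL instances typically default to a `max_connections` in the low hundreds on smaller tiers, this alone could explain the connection errors, which is why we're considering a pooler (e.g. PgBouncer) in front of the database.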
Looking for advice on how to best engineer this.
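For reference, this is roughly how the run launcher is configured in our Helm values (a minimal sketch; the field names follow the Dagster Helm chart, but the namespace value is a placeholder):

```yaml
# Fragment of values.yaml for the Dagster Helm chart (illustrative).
runLauncher:
  type: K8sRunLauncher
  config:
    k8sRunLauncher:
      jobNamespace: dagster   # placeholder namespace
```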
Posted in #dagster-support