Does anyone have experience with running 1,000–10,000 ops concurrently?
We are working on a Dagster architecture where every 15 minutes we need to execute a large volume of ops (1,000–10,000). We have Dagster deployed on Kubernetes (GCP) with the Helm chart, and we're using the K8sRunLauncher so that all ops run concurrently, each in a dedicated Pod. However, with the current set-up this is not performant at all. We already see several issues when executing just 500 ops concurrently:
• Scaling the workers (i.e. Pods) takes a long time: the quickest Pods are available within 15 seconds, while the slowest take more than 10 minutes.
• We're running into DB connection issues, likely due to too many open connections (we use an external PostgreSQL database hosted on GCP).
• The time it takes to complete an op (the workload of each op is practically identical) grows from under 15 seconds to over 15 minutes.
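To illustrate the connection issue, here is a back-of-envelope sketch (the per-pod connection count and overhead are assumptions, not measured values): each concurrently running op pod holds at least one Postgres connection for Dagster's event log and run storage, so the total scales linearly with concurrency.

```python
# Hypothetical connection math for 500 concurrent op pods.
# All numbers except concurrent_ops are assumptions for illustration.
concurrent_ops = 500       # the concurrency level where we see failures
conns_per_op_pod = 2       # assumed: event-log + run-storage connections
overhead_conns = 20        # assumed: webserver, daemon, run coordinator

total = concurrent_ops * conns_per_op_pod + overhead_conns
print(total)  # 1020
```

Since managed PostgreSQL instances typically default to a `max_connections` in the low hundreds on smaller tiers, this alone could explain the connection errors, which is why we're considering a pooler (e.g. PgBouncer) in front of the database.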
Looking for advice on how to best engineer this.
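For reference, this is roughly how the run launcher is configured in our Helm values (a minimal sketch; the field names follow the Dagster Helm chart, but the namespace value is a placeholder):

```yaml
# Fragment of values.yaml for the Dagster Helm chart (illustrative).
runLauncher:
  type: K8sRunLauncher
  config:
    k8sRunLauncher:
      jobNamespace: dagster   # placeholder namespace
```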
Posted in #dagster-support