Does anyone have experience with running >1,000...
# ask-community
b
Does anyone have experience with running 1,000-10,000 ops concurrently? We are working on a Dagster architecture where every 15 minutes we need to execute a large volume of ops (i.e. 1,000-10,000). We have Dagster deployed on Kubernetes (GCP) with the Helm chart and we're using the K8sRunLauncher so that all ops are handled concurrently, with each op executed in a dedicated Pod in Kubernetes. However, with the current setup this is not performant at all. We already notice various issues when executing 500 ops concurrently:
• The scaling of the workers (i.e. Pods) takes a long time: the quickest Pods are available within 15 seconds, while the slowest take more than 10 minutes.
• We run into DB connection issues, likely due to too many open connections (we use an external PostgreSQL database hosted on GCP).
• The time it takes to complete an op (the workload of each op is practically identical) grows from <15s to >15min.
Looking for advice on how to best engineer this.
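A minimal sketch of the fan-out pattern described above, assuming the k8s_job_executor is what gives each op its own step pod; the op names and the range(1000) are placeholders, not the real workload:
```python
from dagster import DynamicOut, DynamicOutput, ScheduleDefinition, job, op
from dagster_k8s import k8s_job_executor


@op(out=DynamicOut())
def emit_items():
    # Fan out one dynamic output per unit of work; at 1,000-10,000 items this
    # means 1,000-10,000 step pods when every op runs in its own pod.
    for i in range(1000):
        yield DynamicOutput(i, mapping_key=str(i))


@op
def fetch_item(item):
    # Placeholder for the API call / data fetch each op performs.
    return item


@job(executor_def=k8s_job_executor)
def fan_out_job():
    emit_items().map(fetch_item)


# Trigger the fan-out every 15 minutes.
every_15_minutes = ScheduleDefinition(job=fan_out_job, cron_schedule="*/15 * * * *")
```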
a
Hey Bernard 👋 Interesting use case. Regarding the Pod scaling, that looks like a k8s-specific issue, not so much related to Dagster. I guess most of the time is spent spinning up new nodes to process the workload, right? How big is the Docker image? Regarding the DB connections, it should be possible in GCP to increase the max_connections parameter of your Cloud SQL server. Just make sure to size the DB appropriately. How CPU/mem intensive is each op?
b
Fair point, my first bullet is indeed more of a k8s issue. Most of the time is indeed spent spinning up nodes. As far as the Docker image size goes, it's about 440 MB. Yeah, I will definitely need to look into increasing max_connections. I think the database we're using right now is undersized, which also restricts the number of connections you're allowed to have. Not entirely sure, but it should be quite minimal. Basically each op just needs to make an API call and fetch some data. Currently we are just mocking this step, so it's just grabbing data from a local file. I had arbitrarily experimented with some resource limits on the Pods (250m CPU and 500Mi memory) and those initial ops/workloads finish in <15 seconds. Have you worked on anything similar before?
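A hedged sketch of how those per-pod limits could be pinned directly on an op via the dagster-k8s/config tag; the values mirror the 250m/500Mi experiment and fetch_data is a placeholder name:
```python
from dagster import op


@op(
    tags={
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    # Requests/limits applied to this op's step pod.
                    "requests": {"cpu": "250m", "memory": "500Mi"},
                    "limits": {"cpu": "250m", "memory": "500Mi"},
                }
            }
        }
    }
)
def fetch_data():
    # Placeholder for the API call / local file read each op currently performs.
    ...
```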
a
Your combination of thousands of ops in a very short amount of time is quite unique. I have worked more with ops that are very CPU/mem intensive rather than thousands of short-lived ops like in your case. A couple of things come to mind:
• You could pre-scale your cluster in advance (e.g. a couple of minutes before the pipelines are triggered).
• I generally tend to avoid very short-lived ops. Usually, there is more overhead in scheduling the k8s pods, pulling the image, and loading the Dagster code than anything else. Since what you are doing seems quite straightforward, I would let every op do multiple API calls. I think it's better to have 100 ops that perform 10 requests each rather than 1,000 pods that make one request each (in terms of k8s/Dagster overhead); see the sketch below.
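A sketch of that batching suggestion: fan out chunks of work instead of individual items, so each step pod amortizes the pod-scheduling and image-pull overhead over many API calls. ITEMS, CHUNK_SIZE, and call_api are assumptions for illustration, not part of the original thread:
```python
from dagster import DynamicOut, DynamicOutput, job, op
from dagster_k8s import k8s_job_executor

ITEMS = list(range(1000))  # stand-in for the real list of work items
CHUNK_SIZE = 10            # 1,000 items -> 100 step pods instead of 1,000


def call_api(item):
    # Placeholder for the real API call.
    return item


@op(out=DynamicOut())
def emit_chunks():
    # One dynamic output (and therefore one pod) per chunk of items.
    for i in range(0, len(ITEMS), CHUNK_SIZE):
        yield DynamicOutput(ITEMS[i : i + CHUNK_SIZE], mapping_key=str(i))


@op
def process_chunk(chunk):
    # Each pod now performs CHUNK_SIZE API calls instead of a single one.
    return [call_api(item) for item in chunk]


@job(executor_def=k8s_job_executor)
def batched_job():
    emit_chunks().map(process_chunk)
```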
s
I have the same problem; we are working at about this level of scale.
a
Have you tried the suggestions above? ☝️
s
Yes. And with other orchestrators, we’ve managed to hit the limits of a GKE cluster.
a
What’s the bottleneck?
s
I’m attempting to move in the other direction, as few pods as possible, and I’m running into issues with Dagster’s run coordinator.
Re: bottleneck - Not using Dagster, we basically maxed out the resources that GKE can support for their master API in order for it to keep up with scheduling pods. At some point, GKE’s master API runs out of memory and becomes unhealthy, which basically causes the cluster to become unusable.
a
Does that also happen with regional clusters? From what I’ve seen, GKE scales the control plane up and down based on load
j
These were the bottlenecks we ran into and fixed:
1. The Postgres database: each log gets written to the database, so you need a large number of available database connections. Your database also needs to be powerful enough to handle the requests.
2. The "run" pod: the pod that spawns all the "step" pods did not have enough compute. This made the process of spawning new pods slow, and thus the overall run really slow. We increased the specs of this pod (see the sketch below).
3. Too many ops: the large number of pods didn't seem to be a problem for GKE, but Dagster didn't like it. Everything became really slow. To solve this we batched multiple operations into single ops. Although not ideal (losing the single-pane-of-glass idea of Dagster, sensible retries, etc.).
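On point 2, a hedged sketch of one way to size the run worker, assuming the dagster-k8s/config tag set at the job level is applied to the run pod launched by the K8sRunLauncher; the resource numbers are illustrative, not the values used in this thread:
```python
from dagster import job, op
from dagster_k8s import k8s_job_executor


@op
def do_work():
    # Placeholder; the real fan-out ops would go here.
    ...


@job(
    executor_def=k8s_job_executor,
    tags={
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    # Give the run worker enough headroom to launch thousands
                    # of step pods without becoming the bottleneck itself.
                    "requests": {"cpu": "2", "memory": "4Gi"},
                    "limits": {"cpu": "2", "memory": "4Gi"},
                }
            }
        }
    },
)
def large_fan_out_job():
    do_work()
```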
s
Of course that goes into scaling k8s to support multiple masters, and while useful, we were well aware of the limitations of over-provisioning pods, including the per-node and per-pod overhead and even other garbage collection issues that come up at high scale.
a
@Jules Huisman (Quantile) Aren’t you using the GCS log manager? This will store your logs in GCS
s
Re: regional, no, it was zonal. The limits we saw were on the order of 20k pods, but there were legacy configurations on this cluster which make it particularly non-trivial to recreate as regional.
a
Ah well that’s the reason then… zonal clusters’ control planes don’t scale well. Regional is the way. It’s already a miracle that you managed to run 20k pods on a single control plane! 😄
j
@Andrea Giardini Didn't know that existed, we will definitely switch to that
s
I’m trying to avoid the issue with Dagster by striking a balance between processes and pods, but now I’m running into intra-process scheduling issues.