# deployment-kubernetes
m
Does anyone have experience running a high number of concurrent jobs with Dagster Hybrid deployed to Kubernetes? We scaled up to 40 recently and noticed some issues with our agent going into a restart loop. I wanted to seek advice on what we might need to scale up, other than op execution hosts, to reach a volume of around 100 concurrently executing jobs.
d
Hi Matthew - were there any logs or information from the description of the agent pod that might give more clues about why it was restarting? I wouldn't expect 100 concurrent jobs to destabilize the agent
So I'm sure we can sort this out, maybe with some tweaks to your setup.
m
Thanks for the response, @daniel. It looks like you already helped us troubleshoot the actual agent shutdown, and we have the following action items so far:
• adding explicit k8s resource limits for our hosts
• changing run queue settings to target a specific set of nodes
• running the code servers on a different node than the run servers and agent
Are there any other scaling concerns we might want to consider addressing as we ramp up to more concurrent jobs? Is there a tested upper bound on the number of concurrent jobs we should be able to run?
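For illustration, the first two action items might look something like this on a run pod, using Dagster's `dagster-k8s/config` tag. This is a minimal sketch: the resource values, node label, and job/op names are placeholders rather than the actual setup.

```python
from dagster import job, op


@op
def my_op():
    ...


@job(
    tags={
        "dagster-k8s/config": {
            "container_config": {
                # Explicit requests/limits for the run container (placeholder values).
                "resources": {
                    "requests": {"cpu": "500m", "memory": "1Gi"},
                    "limits": {"cpu": "1", "memory": "2Gi"},
                },
            },
            "pod_spec_config": {
                # Hypothetical node label: pins run pods to a dedicated node pool,
                # away from the agent and code servers.
                "node_selector": {"dagster/node-pool": "run-workers"},
            },
        }
    }
)
def my_job():
    my_op()
```

The agent and code server deployments would get similar requests/limits through however they're deployed (e.g. their Helm values).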
d
On the Dagster side, the main variable I think is less going to be the number of concurrent jobs (100 is no problem) than the size and complexity of those jobs and the number of events/logs they're emitting at once. I would expect the resource constraints of your cluster to be the limit long before you run into any scaling limits on the Dagster side.
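Concretely, most of that event/log volume is under the ops' control. A hedged sketch of keeping it down (the `do_work` helper and names are hypothetical) is to log summaries rather than one line per item:

```python
from dagster import get_dagster_logger, job, op


def do_work(item):
    return len(str(item))  # stand-in for real per-item work


@op
def process_items():
    log = get_dagster_logger()
    items = list(range(10_000))
    total = 0
    for item in items:
        total += do_work(item)
        # No per-item log.info() here: thousands of log lines per op, multiplied
        # by ~100 concurrent runs, all become structured events that have to be
        # captured and stored upstream.
    log.info("processed %s items, total=%s", len(items), total)


@job
def summary_logging_job():
    process_items()
```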
m
If that's the case, my understanding is that we would need to scale the code execution pods either horizontally or vertically, depending on the constraint. Is that correct? Or would a large number of logs or intense CPU/memory use cause issues elsewhere as well?
d
I think making use of k8s autoscaling to make sure your cluster can keep up with the usage is the most likely thing you'll need to do, yeah
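One pattern that pairs well with cluster autoscaling is running heavy ops in their own pods with explicit requests, since the autoscaler adds nodes based on the resource requests of pending pods. A sketch, assuming `dagster-k8s` is installed; the op name and resource values are hypothetical:

```python
from dagster import job, op
from dagster_k8s import k8s_job_executor


@op(
    tags={
        "dagster-k8s/config": {
            "container_config": {
                # Placeholder values; requests are what the cluster autoscaler sees.
                "resources": {
                    "requests": {"cpu": "2", "memory": "4Gi"},
                    "limits": {"cpu": "2", "memory": "4Gi"},
                },
            },
        }
    }
)
def heavy_transform():
    ...


@job(executor_def=k8s_job_executor)
def heavy_job():
    heavy_transform()
```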
👍 1
m
Sounds good! That was the plan, but we got tripped up by the code execution servers and the agent being on the same hosts, and likely didn't allow enough resources.
d
Yeah, once the pods have limits in place, errors should present themselves as the pods that exceed their limits getting killed by the cluster, rather than slowing down everything else around them.
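With limits set, memory overruns show up as containers terminated with reason `OOMKilled`. A minimal sketch for spotting them with the official `kubernetes` Python client (the namespace name is an assumption):

```python
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="dagster-cloud").items:
    for cs in pod.status.container_statuses or []:
        terminated = cs.last_state.terminated if cs.last_state else None
        if terminated and terminated.reason == "OOMKilled":
            print(
                f"{pod.metadata.name}/{cs.name}: OOMKilled "
                f"(exit code {terminated.exit_code}, restarts: {cs.restart_count})"
            )
```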