# deployment-kubernetes
m
Does anyone have experience running a high number of concurrent jobs with Dagster Hybrid deployed to Kubernetes? We scaled up to 40 recently and noticed some issues with our agent going into a restart loop. I wanted to seek advice on what we might need to scale up, other than op execution hosts, to reach a volume of around 100 concurrently executing jobs.
d
Hi Matthew - were there any logs or information from the description of the agent pod that might give more clues about why it was restarting? I wouldn't expect 100 concurrent jobs to destabilize the agent
So I'm sure we can sort this out, maybe with some tweaks to your setup.
m
Thanks for the response, @daniel. It looks like you already helped us troubleshoot the actual agent shutdown, and we have the following action items so far:
• adding explicit k8s resource limits for our hosts
• changing run queue settings to target a specific set of nodes
• running the code servers on a different node than the run servers and agent
Are there any other scaling concerns we might want to consider addressing as we ramp up to more concurrent jobs? Is there a tested upper bound on the number of concurrent jobs we should be able to run?
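For illustration, the first two action items might look something like this on a run pod, using Dagster's `dagster-k8s/config` tag. This is a minimal sketch: the resource values, node label, and job/op names are placeholders rather than the actual setup.

```python
from dagster import job, op


@op
def my_op():
    ...


@job(
    tags={
        "dagster-k8s/config": {
            "container_config": {
                # Explicit requests/limits for the run container (placeholder values).
                "resources": {
                    "requests": {"cpu": "500m", "memory": "1Gi"},
                    "limits": {"cpu": "1", "memory": "2Gi"},
                },
            },
            "pod_spec_config": {
                # Hypothetical node label: pins run pods to a dedicated node pool,
                # away from the agent and code servers.
                "node_selector": {"dagster/node-pool": "run-workers"},
            },
        }
    }
)
def my_job():
    my_op()
```

The agent and code server deployments would get similar requests/limits through however they're deployed (e.g. their Helm values).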
d
On the Dagster side, the main variable I think is less going to be the number of concurrent jobs (100 is no problem) than the size and complexity of those jobs and the number of events/logs they're emitting at once. I would expect the resource constraints of your cluster to be the limit long before you run into any scaling limits on the Dagster side.
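Concretely, most of that event/log volume is under the ops' control. A hedged sketch of keeping it down (the `do_work` helper and names are hypothetical) is to log summaries rather than one line per item:

```python
from dagster import get_dagster_logger, job, op


def do_work(item):
    return len(str(item))  # stand-in for real per-item work


@op
def process_items():
    log = get_dagster_logger()
    items = list(range(10_000))
    total = 0
    for item in items:
        total += do_work(item)
        # No per-item log.info() here: thousands of log lines per op, multiplied
        # by ~100 concurrent runs, all become structured events that have to be
        # captured and stored upstream.
    log.info("processed %s items, total=%s", len(items), total)


@job
def summary_logging_job():
    process_items()
```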
m
If that's the case, my understanding is that we would need to scale the code execution pods either horizontally or vertically, depending on the constraint. Is that correct? Or would a large number of logs or intense CPU/memory use cause issues elsewhere as well?
d
I think making use of k8s autoscaling to make sure your cluster can keep up with the usage is the most likely thing you'll need to do, yeah
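One pattern that pairs well with cluster autoscaling is running heavy ops in their own pods with explicit requests, since the autoscaler adds nodes based on the resource requests of pending pods. A sketch, assuming `dagster-k8s` is installed; the op name and resource values are hypothetical:

```python
from dagster import job, op
from dagster_k8s import k8s_job_executor


@op(
    tags={
        "dagster-k8s/config": {
            "container_config": {
                # Placeholder values; requests are what the cluster autoscaler sees.
                "resources": {
                    "requests": {"cpu": "2", "memory": "4Gi"},
                    "limits": {"cpu": "2", "memory": "4Gi"},
                },
            },
        }
    }
)
def heavy_transform():
    ...


@job(executor_def=k8s_job_executor)
def heavy_job():
    heavy_transform()
```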
👍 1
m
Sounds good! That was the plan, but we got tripped up by the code execution servers and the agent being on the same hosts, and likely didn't allow enough resources.
d
Yeah, once the pods have limits in place, errors should present themselves as the pods that exceed their limits getting killed by the cluster, rather than slowing down everything else around them.
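With limits set, memory overruns show up as containers terminated with reason `OOMKilled`. A minimal sketch for spotting them with the official `kubernetes` Python client (the namespace name is an assumption):

```python
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

for pod in v1.list_namespaced_pod(namespace="dagster-cloud").items:
    for cs in pod.status.container_statuses or []:
        terminated = cs.last_state.terminated if cs.last_state else None
        if terminated and terminated.reason == "OOMKilled":
            print(
                f"{pod.metadata.name}/{cs.name}: OOMKilled "
                f"(exit code {terminated.exit_code}, restarts: {cs.restart_count})"
            )
```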