Arnoud van Dommelen

11/21/2022, 3:59 PM
Hello everyone! We are facing an issue where we are unable to launch more than ~200 tasks simultaneously, while the desired number of simultaneous tasks is >1500. Currently we use AWS Fargate as the container service, Dagster as the orchestrator, and Aurora PostgreSQL as the database we write to and read from. The CPU of the Aurora PostgreSQL database (2–100 ACUs) is sufficient to handle the simultaneous tasks, with average CPU usage around 40 percent. On top of that, the maximum available Fargate On-Demand vCPU quota (4000 vCPU) lies well above current usage (~200 tasks with 2 vCPU each). Within Dagster we specified a limit of 1000 concurrent runs. Do you have an idea where the bottleneck is coming from, or whether we are missing something significant in our current solution? Thank you in advance.
:dagster-bot-resolve: 1
Also, the run queue was > 1000 at some point. It looks like the Daemon might not be able to handle the large number of requests?
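(For reference, a run-concurrency limit like the one described is set on the queued run coordinator in `dagster.yaml`. A sketch of that config fragment — the 1000 value matches the thread; `dequeue_interval_seconds` is shown because the daemon's dequeue cadence can itself become a bottleneck:)

```yaml
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 1000
    # How often the daemon polls the queue to launch runs; if this is
    # large, a deep queue (>1000 runs) drains slowly even when capacity exists.
    dequeue_interval_seconds: 5
```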

Mike Atlas

11/21/2022, 9:36 PM
not sure what's going on, but my approach would be to create a sandbox test and make the concurrent job count configurable so you can determine where the true limit is actually hit. like maybe a job that sleeps for 5 min and does nothing else, then launch it with 100, 200, 300, 400 in parallel
❤️ 1
keep an eye on the CloudWatch logs for any errors or rate-limit alerts
e.g. it could be something insidious like the ECR image pull rate limit (or Docker Hub's, if you're using that?)
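One way to sketch that ramp test in plain Python — `FakeLauncher` is a stand-in for whatever actually starts a run (a Dagster GraphQL call, `boto3` `run_task`, etc.), so the names and the 250 limit here are purely illustrative:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

def find_parallel_limit(launch, levels=(100, 200, 300, 400)):
    """Launch `level` tasks in parallel at each step and return the first
    level where any launch fails, or None if no limit was hit."""
    for level in levels:
        with ThreadPoolExecutor(max_workers=level) as pool:
            results = list(pool.map(lambda _: launch(), range(level)))
        failures = sum(1 for ok in results if not ok)
        print(f"level={level}: {failures} failures")
        if failures:
            return level  # first level where the real limit was hit
    return None

class FakeLauncher:
    """Toy launcher that rejects a launch while more than `limit` launches
    are in flight, to demonstrate how the ramp finds the break point."""
    def __init__(self, limit=250):
        self.limit = limit
        self.in_flight = 0
        self.lock = threading.Lock()

    def __call__(self):
        with self.lock:
            self.in_flight += 1
            ok = self.in_flight <= self.limit
        time.sleep(0.1)  # stand-in for the 5 min sleep job
        with self.lock:
            self.in_flight -= 1
        return ok
```

In the real test you would swap `FakeLauncher` for the actual launch call and watch CloudWatch for the errors that appear at the failing level.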
@Arnoud van Dommelen there's a release note in the latest dagster that might fix your issue...
• [dagster-aws] Fixed a bug in the `EcsRunLauncher` when launching many runs in parallel. Previously, each run risked hitting a `ClientException` in AWS for registering too many concurrent changes to the same task definition family. Now, the `EcsRunLauncher` recovers gracefully from this error by retrying it with backoff.
❤️ 1
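The fix described in that release note — retrying the throttled call with backoff — is a generic pattern. A minimal sketch (the exception name, delays, and retry count here are illustrative, not Dagster's actual implementation):

```python
import random
import time

class ThrottlingError(Exception):
    """Stand-in for an AWS ClientException such as 'Too many concurrent
    attempts to create a new revision of the specified family'."""

def with_backoff(call, retries=5, base_delay=0.1, sleep=time.sleep):
    """Retry `call` on ThrottlingError with exponential backoff plus jitter.
    `sleep` is injectable so tests don't have to wait for real delays."""
    for attempt in range(retries):
        try:
            return call()
        except ThrottlingError:
            if attempt == retries - 1:
                raise  # out of retries; surface the error to the caller
            # exponential backoff: 0.1s, 0.2s, 0.4s, ... plus random jitter
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
```

The jitter matters when many runs launch at once: without it, all the throttled callers retry on the same schedule and collide again.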