Mark Fickett

07/28/2022, 6:48 PM
What's the best Dagster Cloud execution backend for a batch job with 10k tasks that each run for a couple of minutes but can almost all run in parallel? I'm trying to figure out which executor will work best for our pipeline. Ideally it would:
• Have relatively low per-step overhead, since our tasks don't run for that long (locally, Python multiprocessing with forkserver is working well). This may not be possible with lots of horizontal scaling / starting more instances.
• Have minimal cost while our pipeline isn't running (we would have tasks executing maybe 10 out of 24 hours). I think both EKS and ECS can use Fargate, which will allocate instances/pods to match load.
• Scale horizontally to match an increasing number of items to process. I think the ECS executor only runs 1 instance per job, not an instance per task, which probably rules it out. EKS starts an additional pod per step, I believe, though that's a bit heavyweight for our needs.
A mix of execution models could be nice, though I'm not sure if that's at all practical. For example, run the lightweight setup steps with forkserver, and then send 1000 of the heavier tasks each to a separate dynamically allocated instance to be processed there in parallel with forkserver. I'm not familiar with Dask, and I don't see Dagster Cloud docs about running an agent for Dask, so I'm not sure if I should be considering it.
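(For reference, the local baseline described above might look roughly like this: many short tasks fanned out over a forkserver-based process pool. This is a minimal sketch with hypothetical names, `process_item` standing in for one of the couple-minute tasks, not the actual pipeline code.)

```python
import multiprocessing as mp


def process_item(item: int) -> int:
    # Hypothetical stand-in for a task that runs for a couple of minutes.
    return item * item


def run_batch(items, workers: int = 8):
    # forkserver keeps per-task startup overhead low once the server is warm,
    # which is what makes lots of short tasks cheap locally.
    ctx = mp.get_context("forkserver")
    with ctx.Pool(processes=workers) as pool:
        return pool.map(process_item, items, chunksize=64)


if __name__ == "__main__":
    print(run_batch(range(10)))
```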


07/29/2022, 3:53 PM
ya we’ve talked for years about building a system to bundle steps together, but it’s a big undertaking we haven’t started yet. One path you could take, if it’s feasible, is to break up the volume of work into separate job runs - so each job run gets its own instance and takes on the amount of work deemed appropriate for a single machine. The difficulty may then be doing something once all the runs have completed, though sensors can help with that. Additionally, assets and partitioning may be a good way to model something like this. Another option would be to exit Dagster orchestration and delegate to some other computation system from your op. You do lose out on Dagster understanding what’s happening as deeply, but systems like Dask that focus deeply on scaling compute may handle your workload more effectively.
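(The separate-job-runs idea above could be sketched like this: partition the 10k items into chunks sized for one machine, then launch one run per chunk, e.g. a sensor yielding one RunRequest per chunk with that chunk's bounds in the run config. The helper below is plain Python with hypothetical names, not Dagster API.)

```python
def chunk_items(items, chunk_size: int):
    """Split a flat batch of work items into per-run chunks.

    Each chunk is sized for a single machine; each chunk would become
    one job run (e.g. the run_config of one RunRequest from a sensor).
    """
    items = list(items)
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


# e.g. 10,000 tasks at ~500 per machine -> 20 separate job runs
chunks = chunk_items(range(10_000), 500)
```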