Hi Cat - we’ve just started looking into DAG managers (including Dagster). In our case, we’re looking to use ECS indirectly, by automating jobs using AWS Batch. We haven’t found any first-class support yet (except in Netflix’s Metaflow), so we’re still examining our options…
05/29/2020, 8:39 PM
gotcha. it’s interesting that metaflow’s compute layer integrates with aws batch instead of ecs directly, probably worth considering for dagster too. i was wondering how you made the decision that aws batch was a better fit than using aws ecs directly — is it mostly because the lifecycle is fully managed?
05/29/2020, 9:51 PM
Yeah, generally ease of use and lifecycle management, though I suppose some of that could be provided by Dagster?
We want to be able to run (Docker-based) tasks of varying sizes, up to the larger instance types (r4.8xl, etc.)
… so our assumption is any pull-based system (i.e. with workers awaiting tasks) wouldn’t work for us. We only want to keep instances alive when necessary.
05/29/2020, 10:09 PM
So with respect to managing lifecycle, dagster provides a run master that kicks off and monitors jobs and also handles user-configured retries
A common strategy on the K8s side (which I think should be similar here) is to create an ephemeral k8s pod per step (ie per node in the dag) that shuts down once the step completes
in this case, only the dagit instance (and potentially celery / flower / broker) needs to be kept up all the time, but the actual compute pods don't
definitely see where you're coming from -- the architecture of our system is that run launchers and step executors are spun up at the start of each pipeline run, so that resources aren't wasted
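to make the pod-per-step idea concrete, here's a toy sketch of the lifecycle in plain python -- the "worker" dict just stands in for a k8s pod (or a Batch container), and the names are illustrative, not dagster's actual API:

```python
# Toy sketch of the pod-per-step model: each node in the DAG gets an
# ephemeral worker that exists only for the duration of its step, so
# no compute sits idle between steps.

def run_dag(steps):
    """steps: list of (name, fn) pairs in topological order."""
    results = {}
    for name, fn in steps:
        worker = {"name": name, "alive": True}   # stand-in for spinning up a pod
        results[name] = fn()                     # the step runs inside that pod
        worker["alive"] = False                  # pod shuts down with the step
    return results

out = run_dag([("extract", lambda: 1), ("transform", lambda: 2)])
print(out)  # {'extract': 1, 'transform': 2}
```

only the long-lived pieces (dagit, broker) sit outside this loop; everything inside it is ephemeral.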
05/29/2020, 10:43 PM
Ah right - not very familiar with k8s, but I think I follow. So there’s no reason we wouldn’t be able to build a DAG of tasks where the resources allocated (CPU/memory) to each task is determined on the fly? Thanks for your help on this, btw!
05/30/2020, 12:18 AM
hey Rob - by “on the fly” do you mean you’d like resource limits (and node sizes, etc.) to be defined per task at DAG definition time, or do you truly want to set these dynamically at execution time?
If the former, it should be straightforward to define resource limits on solids (using “tags”), and then flow these through to AWS Batch using our “run launcher” abstraction
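To sketch what that flow could look like: a hypothetical AWS Batch run launcher could read per-solid tags and translate them into the `containerOverrides` that `batch.submit_job` accepts. The tag keys (`batch/vcpus`, `batch/memory_mib`) are made up for illustration since Dagster doesn't ship a Batch launcher today:

```python
# Sketch of a tags -> AWS Batch resource mapping. Tag keys here are
# hypothetical, not a real Dagster convention.

DEFAULTS = {"vcpus": 1, "memory_mib": 2048}

def batch_overrides_from_tags(tags):
    """Build a containerOverrides dict for AWS Batch's submit_job
    from the tags attached to a solid at DAG definition time."""
    return {
        "vcpus": int(tags.get("batch/vcpus", DEFAULTS["vcpus"])),
        "memory": int(tags.get("batch/memory_mib", DEFAULTS["memory_mib"])),
    }

# e.g. a large training step tagged with r4.8xlarge-class resources:
overrides = batch_overrides_from_tags(
    {"batch/vcpus": "32", "batch/memory_mib": "249856"}
)
print(overrides)  # {'vcpus': 32, 'memory': 249856}
```

The launcher would then pass `containerOverrides=overrides` when submitting each step's job, so per-task CPU/memory is fixed at DAG definition time but applied per step at launch.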