# ask-community
r
General Dagster use-case question. We're processing large amounts (multiple TB) of data across many machines using k8s/EC2 and tools like Dask. We'd love to be able to model this in Dagster for a) visibility and b) asset provenance/tracking. How would one model this? It seems like Dagster is geared towards doing all the computation within the Python/Dagster process.
y
We have a Dask integration: https://docs.dagster.io/deployment/guides/dask. You can run computation on either a local Dask cluster or a distributed one, and the data is passed between computations (aka ops) via IO managers.
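For reference, the executor route looks roughly like this. A minimal sketch assuming the dagster-dask package; the op bodies are placeholders:

```python
# A minimal sketch of running a job on Dask with dagster-dask's dask_executor,
# per https://docs.dagster.io/deployment/guides/dask. Op bodies are placeholders.
from dagster import job, op
from dagster_dask import dask_executor


@op
def load_chunk():
    # Placeholder: stand-in for real data loading.
    return list(range(10))


@op
def summarize(chunk):
    return sum(chunk)


@job(executor_def=dask_executor)
def dask_job():
    summarize(load_chunk())
```

Whether this targets a local cluster or an existing distributed one is chosen in run config; at the time of writing the executor accepts something like `execution: config: cluster: existing: address: ...` to attach to a running scheduler.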
r
Does "data is passed between computation" mean that data is actually moved around, because in my case, i need data to stay in the dask cluster and only a reference be passed
n
I have been using the dask_resource directly rather than using it as an executor. I bet creating a Dask IOManager, similar to the mem_io_manager, might be a way to pass around the futures however you want? 🤷 Still thinking about this myself.
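A sketch of that idea: an IO manager that, like mem_io_manager, keeps outputs in process memory, except the "outputs" are Dask futures, so only references move between ops while the data stays on the cluster. `DaskFutureIOManager` is a hypothetical name, not a built-in, and like mem_io_manager it only works with in-process execution, since every op must share the one instance holding the futures:

```python
# Hypothetical IO manager that passes Dask futures by reference between ops.
# Like Dagster's mem_io_manager, it requires in-process execution so every op
# sees the same instance (and therefore the same dict of futures).
from dagster import IOManager, io_manager


class DaskFutureIOManager(IOManager):
    def __init__(self):
        # Maps (step_key, output_name) -> dask.distributed.Future.
        self._futures = {}

    def handle_output(self, context, obj):
        # obj is expected to be a Future (or other lazy Dask handle);
        # store the reference, never materialize the underlying data.
        self._futures[(context.step_key, context.name)] = obj

    def load_input(self, context):
        # Hand the downstream op the upstream op's future, untouched.
        up = context.upstream_output
        return self._futures[(up.step_key, up.name)]


@io_manager
def dask_future_io_manager(_init_context):
    return DaskFutureIOManager()
```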
y
Yes, the data itself can be moved around, but you can also pass it by reference.
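Tying the thread together, a hedged sketch of pass-by-reference end to end, assuming the hypothetical dask_future_io_manager above and dagster-dask's dask_resource (which, as I understand it, exposes the cluster connection as a `.client` attribute). Each op returns or receives a future; the multi-TB data itself never leaves the Dask cluster:

```python
# Sketch: ops exchange only dask.distributed Futures; data stays on the cluster.
# Assumes the DaskFutureIOManager defined in the previous sketch and
# dagster-dask's dask_resource.
from dagster import job, op
from dagster_dask import dask_resource


@op(required_resource_keys={"dask"})
def build_dataset(context):
    client = context.resources.dask.client
    # Work is submitted to the cluster; only a Future comes back.
    return client.submit(lambda: list(range(1_000_000)))


@op(required_resource_keys={"dask"})
def reduce_dataset(context, dataset_future):
    client = context.resources.dask.client
    # Chain another on-cluster computation off the upstream Future.
    return client.submit(sum, dataset_future)


@job(resource_defs={"dask": dask_resource, "io_manager": dask_future_io_manager})
def on_cluster_job():
    reduce_dataset(build_dataset())
```

This keeps Dagster in charge of orchestration, visibility, and lineage while the heavy data movement stays entirely inside Dask.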