# ask-community
r
General Dagster use-case question. We're processing large amounts (multiple TB) of data across many machines using k8s/EC2 and tools like Dask. We'd love to be able to model this in Dagster for a) visibility and b) asset provenance/tracking. How would one model this? It seems like Dagster is geared towards doing all the computation within the Python/Dagster process.
y
We have a Dask integration: https://docs.dagster.io/deployment/guides/dask. You can run computation on either a local Dask cluster or a distributed one, and the data is passed between computations (aka ops) via IO managers.
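For reference, the executor route looks roughly like this. A minimal sketch assuming the dagster-dask package; the op bodies are placeholders:

```python
# A minimal sketch of running a job on Dask with dagster-dask's dask_executor,
# per https://docs.dagster.io/deployment/guides/dask. Op bodies are placeholders.
from dagster import job, op
from dagster_dask import dask_executor


@op
def load_chunk():
    # Placeholder: stand-in for real data loading.
    return list(range(10))


@op
def summarize(chunk):
    return sum(chunk)


@job(executor_def=dask_executor)
def dask_job():
    summarize(load_chunk())
```

Whether this targets a local cluster or an existing distributed one is chosen in run config; at the time of writing the executor accepts something like `execution: config: cluster: existing: address: ...` to attach to a running scheduler.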
r
Does "data is passed between computation" mean that data is actually moved around, because in my case, i need data to stay in the dask cluster and only a reference be passed
n
I have been using the dask_resource directly rather than using it as an executor. I bet creating a Dask IOManager, similar to the mem_io_manager, might be a way to pass around the futures however you want? 🤷 Still thinking about this myself.
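A sketch of that idea: an IO manager that, like mem_io_manager, keeps outputs in process memory, except the "outputs" are Dask futures, so only references move between ops while the data stays on the cluster. `DaskFutureIOManager` is a hypothetical name, not a built-in, and like mem_io_manager it only works with in-process execution, since every op must share the one instance holding the futures:

```python
# Hypothetical IO manager that passes Dask futures by reference between ops.
# Like Dagster's mem_io_manager, it requires in-process execution so every op
# sees the same instance (and therefore the same dict of futures).
from dagster import IOManager, io_manager


class DaskFutureIOManager(IOManager):
    def __init__(self):
        # Maps (step_key, output_name) -> dask.distributed.Future.
        self._futures = {}

    def handle_output(self, context, obj):
        # obj is expected to be a Future (or other lazy Dask handle);
        # store the reference, never materialize the underlying data.
        self._futures[(context.step_key, context.name)] = obj

    def load_input(self, context):
        # Hand the downstream op the upstream op's future, untouched.
        up = context.upstream_output
        return self._futures[(up.step_key, up.name)]


@io_manager
def dask_future_io_manager(_init_context):
    return DaskFutureIOManager()
```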
y
Yes, the data itself can be moved around, but you can also pass it by reference.
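Tying the thread together, a hedged sketch of pass-by-reference end to end, assuming the hypothetical dask_future_io_manager above and dagster-dask's dask_resource (which, as I understand it, exposes the cluster connection as a `.client` attribute). Each op returns or receives a future; the multi-TB data itself never leaves the Dask cluster:

```python
# Sketch: ops exchange only dask.distributed Futures; data stays on the cluster.
# Assumes the DaskFutureIOManager defined in the previous sketch and
# dagster-dask's dask_resource.
from dagster import job, op
from dagster_dask import dask_resource


@op(required_resource_keys={"dask"})
def build_dataset(context):
    client = context.resources.dask.client
    # Work is submitted to the cluster; only a Future comes back.
    return client.submit(lambda: list(range(1_000_000)))


@op(required_resource_keys={"dask"})
def reduce_dataset(context, dataset_future):
    client = context.resources.dask.client
    # Chain another on-cluster computation off the upstream Future.
    return client.submit(sum, dataset_future)


@job(resource_defs={"dask": dask_resource, "io_manager": dask_future_io_manager})
def on_cluster_job():
    reduce_dataset(build_dataset())
```

This keeps Dagster in charge of orchestration, visibility, and lineage while the heavy data movement stays entirely inside Dask.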