Hi Dagster team!
I'd like to ask a question about performance.
Describing a data transformation pipeline as a number of steps, each performing an atomic transformation, is good practice. But from a high-level view, executing such an N-step process with data transfer between the steps takes more time than executing all N steps within a single process.
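To make the concern concrete, here is a minimal sketch (illustrative names only, not my real pipeline) using Dagster's op/job API (solid/pipeline in older releases): three small steps chained together, where each intermediate DataFrame is serialized and handed over between steps once they run in separate processes or Dask workers.

```python
import pandas as pd
from dagster import job, op


@op
def load_data() -> pd.DataFrame:
    # hypothetical source, just for illustration
    return pd.DataFrame({"x": range(1_000_000)})


@op
def clean(df: pd.DataFrame) -> pd.DataFrame:
    # atomic transformation #1
    return df.dropna()


@op
def aggregate(df: pd.DataFrame) -> pd.DataFrame:
    # atomic transformation #2
    return df.describe()


@job
def fine_grained_job():
    # with a multiprocess or dagster-dask executor, each op may land on a
    # different process/worker, so `df` crosses a process boundary twice
    aggregate(clean(load_data()))
```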
If we take Dask (as an execution environment) into consideration, its scheduler tries to distribute tasks among processes in a way that provides some degree of data locality (i.e. minimizes data transfer over the network). Still, this is not trivial, and I suppose a data analyst knows better which data is used by which task and can therefore make better judgments about data locality.
Is there a way in Dagster to 1) describe a pipeline as a sequence of steps, and 2) provide some metadata that could enable better performance? In essence, some compromise has to be struck between a granular description of the process and an execution with a minimal number of steps/hops/etc., as in the fused sketch below.
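For contrast, here is a sketch of the other end of that compromise: the same (hypothetical) logic fused into one op, so intermediate data never leaves the process, but the granular step description is lost.

```python
import pandas as pd
from dagster import job, op


@op
def fused_transform() -> pd.DataFrame:
    # load, clean and aggregate in one step: no inter-step data transfer,
    # but also no per-step visibility or re-execution in Dagster
    df = pd.DataFrame({"x": range(1_000_000)})
    df = df.dropna()
    return df.describe()


@job
def coarse_grained_job():
    fused_transform()
```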