Hi Dagster team!
I would like to ask a question regarding performance.
Describing a data transformation pipeline as a series of steps, each performing an atomic transformation, is a natural approach. But from a high-level view, executing such an N-step process with data transfer between steps takes more time than executing those N steps within a single process.
If we take Dask (as an execution environment) into consideration, its scheduler tries to distribute tasks among processes in a way that provides some degree of data locality (i.e., minimizes data transfer over the network). Still, this is non-trivial, and I suppose a data analyst knows better which data is used by which task and can judge data locality better.
Is there a way in Dagster to 1/ describe a pipeline as a series of steps, and 2/ provide some metadata that could enable better performance? In fact, it is necessary to strike a compromise between a granular process description and execution with minimal steps/hops/etc.
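To make the overhead concrete, here is a minimal sketch (plain Python, not Dagster's or Dask's API; all function names are hypothetical) of the same three-step pipeline run two ways: granularly, with each step's output crossing a serialization boundary as it would between processes, and fused into a single in-process call chain:

```python
import pickle

# Three atomic transformations, as an analyst might describe them.
def extract():
    return list(range(1_000))

def transform(rows):
    return [r * 2 for r in rows]

def load(rows):
    return sum(rows)

# Granular execution: each step's output crosses a process boundary,
# simulated here with pickle round-trips (the cost Dask tries to minimize).
def run_granular():
    data = pickle.loads(pickle.dumps(extract()))
    data = pickle.loads(pickle.dumps(transform(data)))
    return load(data)

# Fused execution: all steps run in one process, no data transfer at all.
def run_fused():
    return load(transform(extract()))

assert run_granular() == run_fused() == 999_000
```

Both variants compute the same result; only the amount of data movement between steps differs.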
07/24/2019, 2:50 PM
Hey @Alexei! If I understand your question correctly, I would position Dagster as a much more coarse-grained task orchestrator than something like Dask or PySpark. Computational engines like those are built from the ground up with performance optimizations (e.g., data locality) in mind, whereas Dagster is not really designed to directly process large-scale data.
Dagster is much more similar to Airflow in that regard, and is intended to sit a layer above computational engines like Dask and Spark, orchestrating the execution of these coarse-grained tasks.
That said, I would be interested to hear more about your use case!
07/24/2019, 2:56 PM
Hi, @nate! In fact, my question is about how we can determine the optimal granularity of atomic tasks.
1/ Having atomic tasks as the analyst sees them is very convenient, but usually far from performant. The flow can be described clearly, is easy to understand, and is straightforward to implement. But it is slow.
2/ The next step is usually optimization: some tasks are merged into a single task. At this point we lose the benefits of the granularity described in point 1, but we gain performance.
So I expect that an option to provide meta information on how a pipeline of granular atomic tasks should be executed would be a great benefit for any tool. Otherwise there is only an illusory connection between the flow as the analyst sees it and its implementation.
Currently I have a team that has built its own implementation of a DAG and its execution. It has granular tasks, but its performance is very far from optimal. So I was wondering if there is a way to gain a performance boost while keeping the current granularity (i.e., the already-developed code) by providing some extra information to an execution engine.
I also have other examples of already-developed DAGs for which rearranging tasks, merging their contexts, etc. would give a great performance boost. But we can only do that in code, and it is rather challenging.
Given a flow of tasks A->B->C, we have several options for reorganizing execution: AB->C, A->BC, or ABC. Each option must be coded manually and tested for performance. If we could do this using meta information, we could automate it and more quickly find the most "optimal" flow.
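A sketch of what such metadata-driven regrouping might look like (hypothetical helper names; this is not an existing Dagster or Dask feature): the flow stays declared at the analyst's granularity as A->B->C, and a grouping spec passed as data decides which adjacent tasks get fused into one execution unit:

```python
from functools import reduce

# The flow stays declared at analyst granularity.
def a(x):
    return x + 1

def b(x):
    return x * 10

def c(x):
    return x - 3

FLOW = {"A": a, "B": b, "C": c}

def fuse(flow, groups):
    """Hypothetical helper: compose each group of adjacent tasks into a
    single callable, so the grouping is metadata rather than hand-written code."""
    fused = []
    for group in groups:  # e.g. "AB" means run A and B in one unit
        funcs = tuple(flow[name] for name in group)
        fused.append(lambda x, fs=funcs: reduce(lambda v, f: f(v), fs, x))
    return fused

def run(stages, x):
    # Each stage would map to one process/task; here we just call in sequence.
    for stage in stages:
        x = stage(x)
    return x

# The same flow executed under each grouping option from the message above;
# every variant yields the same result, only the unit boundaries change.
for groups in (["A", "B", "C"], ["AB", "C"], ["A", "BC"], ["ABC"]):
    assert run(fuse(FLOW, groups), 5) == 57
```

With the grouping expressed as plain data, trying all the variants becomes a loop rather than four hand-coded pipelines, which is exactly what would make automated performance comparison feasible.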
07/24/2019, 4:50 PM
Let me DM you, I have a few more questions about your use case 🙂