About the feature "Pass partition ranges to single run" | dagster #dagster-feedback

Nicolas Parot Alvarez

04/21/2023, 9:35 AM

About the feature "Pass partition ranges to single run", it would be really neat if it didn't require rewriting my partitioned assets to explicitly tell Dagster how to loop over a range of partitions. The basic logic seems unambiguous. Would there be unexpected side effects in simply letting Dagster execute the DAG for each partition in a single run by itself, without having to code it? I think that already provides benefits by reducing the time spent loading code. I think this could be the default behavior of the feature, and if people want to customize how the looping happens, they can specify it by calling the relevant context arguments.

geoHeil

04/21/2023, 9:48 AM

This depends on what you choose as your storage engine

geoHeil

04/21/2023, 9:49 AM

E.g. in Spark, partitions might have a different level of parallelism compared to for-looping inside Spark

Nicolas Parot Alvarez

04/21/2023, 12:29 PM

The default behavior would not consider whether the resources have parallelism or not. It would just execute the DAGs independently for each partition, one after the other, potentially in parallel according to the concurrency configuration.

Nicolas Parot Alvarez

04/21/2023, 1:30 PM

Awakened my Gimp-fu. Here's a visual representation of 3 partition DAGs running under a single run instead of 3 runs, while respecting a concurrency of 2.
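
To make the proposal concrete, here is a minimal sketch in plain Python. It does not use Dagster's API: `load_definitions`, `run_partition_dag`, `single_run`, and the step names are all hypothetical stand-ins. It only illustrates the idea of preparing a run once and then executing the same DAG for 3 partitions with a concurrency of 2.

```python
from concurrent.futures import ThreadPoolExecutor


def load_definitions():
    # Hypothetical stand-in for the one-time setup done when a run starts.
    return {"steps": ["extract", "transform", "load"]}


def run_partition_dag(definitions, partition_key):
    # Execute every step of the DAG, in order, for one partition.
    return [f"{step}:{partition_key}" for step in definitions["steps"]]


def single_run(partition_keys, max_concurrency=2):
    # Definitions are loaded once for the whole run, not once per partition.
    definitions = load_definitions()
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # At most `max_concurrency` partition DAGs execute at the same time.
        results = pool.map(
            lambda key: run_partition_dag(definitions, key), partition_keys
        )
        return dict(zip(partition_keys, results))


results = single_run(["2023-04-19", "2023-04-20", "2023-04-21"])
```

Each partition still gets its own independent pass through the DAG; only the setup is shared.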

geoHeil

04/21/2023, 3:12 PM

But isn't this already pretty much the same thing for internal resources, where no external resource like a Databricks cluster (or EMR), which might take a while to initialize, is spun up?

geoHeil

04/21/2023, 3:12 PM

As far as I understand this, the main point of this feature is to limit the instantiation of such resources in case of backfills and focus on the actual operations, which can then be more efficient

Nicolas Parot Alvarez

04/21/2023, 4:11 PM

My hypothesis is that having a single run could allow some of the loading that Dagster does to prepare a run to happen only once. For example, parsing the definitions and loading them into memory, and maybe other things.

owen

04/21/2023, 6:28 PM

hi @Nicolas Parot Alvarez! this is definitely an interesting idea, although @geoHeil is correct that this feature is intended for cases where you can group the execution of multiple partitions into a single operation, rather than multiple operations in the same run. at the moment, there's no way for dagster to know if an asset supports this behavior or not, so that button in the UI is the sort of implicit "all of my assets can be executed like this" button. but in a future world where there's some way (in code) to specify if execution can be grouped or not, I can imagine that something like what you're describing could be possible.
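
The distinction between one grouped operation and multiple operations in the same run can be sketched in plain Python. This is an illustration only, not Dagster's API: the function names, the `events` table, and the `day` column are hypothetical.

```python
def process_one(partition_key):
    # One operation per partition: each call handles a single day.
    return f"SELECT * FROM events WHERE day = '{partition_key}'"


def process_looped(partition_keys):
    # Multiple operations in the same run: loop over the partitions.
    return [process_one(key) for key in partition_keys]


def process_grouped(partition_keys):
    # A single operation covering the whole range. This only works if the
    # asset's logic can handle a range, which is what the asset author
    # has to guarantee.
    start, end = partition_keys[0], partition_keys[-1]
    return [f"SELECT * FROM events WHERE day BETWEEN '{start}' AND '{end}'"]


keys = ["2023-04-19", "2023-04-20", "2023-04-21"]
looped = process_looped(keys)
grouped = process_grouped(keys)
```

The looped version issues one query per partition, while the grouped version issues a single query for the range, which is the case the feature is described as targeting.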

Nicolas Parot Alvarez

04/24/2023, 10:28 AM

Thanks for your answer @owen. I think it would be most beneficial if it reduced Dagster's loading time, which I guess is what the light blue bars that take most of the time in my DAG executions above represent. Would it be possible to compress those times if Dagster knew that it has to run the exact same code for each partition in a single run?
