# ask-community
m
I've got a question about splitting work and whether Dagster is a good fit for it. Essentially, I'm wondering about split-apply-combine strategies. Imagine I have a large table and want to run an op on each split, or I have many files in S3 and want to run an op on each file. Consider that I have a million splits or files. Conceptually, what I want is to define the table or the files as an asset (maybe with dynamic partitions) and then have one (unpartitioned) downstream asset that is the outcome of combining those one million op results. Would Dagster be a good fit for this with (dynamic) partitions, or would it be better to use Dagster only to coordinate the work and manage the assets? By that I mean a single op that takes the entire upstream asset as input, lets, for example, Dask or Spark do the split-and-combine work on that asset, and finally creates the combined output as a single asset again. Thank you in advance for your insights.
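(For concreteness, a minimal sketch of the dynamic-partitions shape described above. It assumes Dagster's `DynamicPartitionsDefinition` and relies on the built-in IO manager loading all upstream partitions into an unpartitioned downstream asset as a dict; the names `splits`, `split_result`, and `combined` are illustrative, not from the thread.)

```python
from dagster import DynamicPartitionsDefinition, asset

# One partition per split/file; keys would be registered at runtime,
# e.g. via instance.add_dynamic_partitions("splits", [...]) in a sensor.
splits = DynamicPartitionsDefinition(name="splits")

@asset(partitions_def=splits)
def split_result(context) -> int:
    # Per-partition work, keyed by the partition (e.g. an S3 key).
    return len(context.partition_key)  # placeholder computation

@asset
def combined(split_result: dict) -> int:
    # Unpartitioned fan-in: the default IO manager loads the upstream
    # input as a dict of partition_key -> value.
    return sum(split_result.values())
```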
a
Not sure about the ultimate answer, but you might want to take a look at dynamic graphs: https://docs.dagster.io/concepts/ops-jobs-graphs/dynamic-graphs
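(A minimal fan-out/fan-in sketch with that API; the op names and the toy file list are illustrative stand-ins for a real S3 listing.)

```python
from dagster import DynamicOut, DynamicOutput, job, op

@op(out=DynamicOut())
def list_files():
    # Fan-out: yield one DynamicOutput per split/file.
    for key in ["a.csv", "b.csv", "c.csv"]:  # stand-in for an S3 listing
        yield DynamicOutput(key, mapping_key=key.replace(".", "_"))

@op
def process_file(key: str) -> int:
    return len(key)  # placeholder per-split work

@op
def combine(results: list) -> int:
    # Fan-in: collect() gathers every mapped result into one list.
    return sum(results)

@job
def split_apply_combine():
    combine(list_files().map(process_file).collect())
```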
m
One of my questions is also whether Dagster can handle this kind of workload in terms of scheduling and job management. Do you have experience with that?
a
Not at the scale I think you need, but yes, dynamic graphs worked for me in some cases, e.g. when I wanted to fan out many API calls into their own ops and collect the results downstream. It was a very simple deployment though, nothing distributed.
m
Cool, thanks for the info.
s
For millions of partitions, we'd generally recommend a distributed computation engine like Spark or Dask instead of Dagster.
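(That matches the single-op alternative from the original question: one Dagster asset that hands the million-way split to Dask. A minimal sketch, assuming `dask.bag` plus s3fs for the placeholder `s3://` path; the bucket, prefix, and computation are illustrative.)

```python
import dask.bag as db
from dagster import asset

@asset
def combined_result() -> int:
    # Dagster manages just this one asset; Dask handles the
    # million-way split, the per-piece work, and the combine.
    lines = db.read_text("s3://my-bucket/splits/*")  # placeholder path
    return lines.map(len).sum().compute()  # placeholder: total characters
```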