# ask-community
m
I've got a question about splitting work and whether Dagster is a good fit for it. Essentially, I'm wondering about split-apply-combine strategies. Imagine I have a large table and want to run an op on each split, or I have many files in S3 and want to run an op on each file. Consider that I have a million splits or files. Conceptually, what I want is to define the table or the files as an asset (maybe with dynamic partitions) and then have one (unpartitioned) downstream asset that is the outcome of combining those one million op results. Would Dagster be a good fit for this with (dynamic) partitions, or would it be better to use Dagster only to coordinate the work and manage the assets? By that I mean a single op that takes the entire upstream asset as input, lets, for example, Dask or Spark do the split-and-combine work on that asset, and finally creates the combined output as a single asset again. Thank you in advance for your insights.
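(For concreteness, a minimal sketch of the dynamic-partitions shape described above. It assumes Dagster's `DynamicPartitionsDefinition` and relies on the built-in IO manager loading all upstream partitions into an unpartitioned downstream asset as a dict; the names `splits`, `split_result`, and `combined` are illustrative, not from the thread.)

```python
from dagster import DynamicPartitionsDefinition, asset

# One partition per split/file; keys would be registered at runtime,
# e.g. via instance.add_dynamic_partitions("splits", [...]) in a sensor.
splits = DynamicPartitionsDefinition(name="splits")

@asset(partitions_def=splits)
def split_result(context) -> int:
    # Per-partition work, keyed by the partition (e.g. an S3 key).
    return len(context.partition_key)  # placeholder computation

@asset
def combined(split_result: dict) -> int:
    # Unpartitioned fan-in: the default IO manager loads the upstream
    # input as a dict of partition_key -> value.
    return sum(split_result.values())
```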
a
Not sure about the ultimate answer, but you might want to take a look at dynamic graphs: https://docs.dagster.io/concepts/ops-jobs-graphs/dynamic-graphs
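(A minimal fan-out/fan-in sketch with that API; the op names and the toy file list are illustrative stand-ins for a real S3 listing.)

```python
from dagster import DynamicOut, DynamicOutput, job, op

@op(out=DynamicOut())
def list_files():
    # Fan-out: yield one DynamicOutput per split/file.
    for key in ["a.csv", "b.csv", "c.csv"]:  # stand-in for an S3 listing
        yield DynamicOutput(key, mapping_key=key.replace(".", "_"))

@op
def process_file(key: str) -> int:
    return len(key)  # placeholder per-split work

@op
def combine(results: list) -> int:
    # Fan-in: collect() gathers every mapped result into one list.
    return sum(results)

@job
def split_apply_combine():
    combine(list_files().map(process_file).collect())
```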
m
One of my questions is also whether Dagster can handle this kind of workload in terms of scheduling and job management. Do you have experience with that?
a
Not at the scale I think you need, but yes, dynamic graphs worked for me in some cases, e.g. when I wanted to fan out many API calls into their own ops and collect the results downstream. It was a very simple deployment though, nothing distributed.
m
Cool, thanks for the info.
s
For millions of partitions, we'd generally recommend a distributed computation engine like Spark or Dask instead of Dagster.
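(That matches the single-op alternative from the original question: one Dagster asset that hands the million-way split to Dask. A minimal sketch, assuming `dask.bag` plus s3fs for the placeholder `s3://` path; the bucket, prefix, and computation are illustrative.)

```python
import dask.bag as db
from dagster import asset

@asset
def combined_result() -> int:
    # Dagster manages just this one asset; Dask handles the
    # million-way split, the per-piece work, and the combine.
    lines = db.read_text("s3://my-bucket/splits/*")  # placeholder path
    return lines.map(len).sum().compute()  # placeholder: total characters
```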