Hello there community! Thanks for the great work on the framework.
I have a question regarding scaling Dagster:
• As an example, I want to process a dataframe/table and produce a separate output for each row (i.e., one file per row)
• I have an op that finds the universe of rows to be calculated and yields one DynamicOutput for each row
• Each row requires a fairly complex series of operations, takes quite some time, can fail at different steps, etc. This is why I'd like to keep each row as a separate work unit, giving better visibility and fault isolation between rows
• Ideally I'd like to process millions of isolated rows in runs that can take a long time to complete (>24h)
When I implemented a prototype to test this (roughly the sketch just below), just yielding and collecting all the dynamic outputs takes a very long time, which blocks execution of the run for that whole period. The UI struggles a bit as well, but that is not a problem at all.
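For context, here is roughly what the prototype looks like (simplified; `load_universe` and `process_single_row` are placeholders for my real loading and per-row logic):

```python
from dagster import DynamicOut, DynamicOutput, job, op


def load_universe():
    # placeholder: in the real job this queries the table of rows to process
    return [{"id": i} for i in range(1_000_000)]


def process_single_row(row):
    # placeholder for the fairly complex, slow per-row pipeline
    ...


@op(out=DynamicOut())
def fan_out_rows():
    # one DynamicOutput per row, so each row becomes its own op execution
    for idx, row in enumerate(load_universe()):
        yield DynamicOutput(row, mapping_key=str(idx))


@op
def process_row(row):
    return process_single_row(row)


@job
def per_row_job():
    fan_out_rows().map(process_row)
```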
One obvious answer here could be to manually partition the rows and process one chunk per op (roughly the second sketch below), at the expense of some visibility and isolation / fault tolerance, but I first want to understand whether there are other ways to look at the issue.
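The chunked alternative I have in mind would look roughly like this (reusing the placeholder helpers from the sketch above; `CHUNK_SIZE` is just an illustrative number):

```python
from dagster import DynamicOut, DynamicOutput, job, op

CHUNK_SIZE = 10_000  # illustrative: tune so the fan-out stays at a manageable number of ops


@op(out=DynamicOut())
def fan_out_chunks():
    # load_universe is the same placeholder as in the previous sketch
    rows = load_universe()
    for start in range(0, len(rows), CHUNK_SIZE):
        yield DynamicOutput(rows[start:start + CHUNK_SIZE], mapping_key=str(start // CHUNK_SIZE))


@op
def process_chunk(context, chunk):
    # per-row failures are now only visible in this op's logs,
    # not as individually failed/retryable ops in the UI
    for row in chunk:
        try:
            process_single_row(row)
        except Exception as exc:
            context.log.error(f"row {row} failed: {exc}")


@job
def per_chunk_job():
    fan_out_chunks().map(process_chunk)
```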
In essence:
• Is there a working way to implement dynamic graphs that fan out into millions of ops using Dagster? Is this type of workload simply not a good fit for Dagster? Is there a suggested way to do something like this?
I've seen some light discussion around this in GitHub issues, but no obvious conclusions.
Thanks a lot in advance for the help!