
Edward Smith

09/16/2021, 3:51 PM
Hi All! First-time Dagster user here. I'm porting an existing process to Dagster and using DynamicOutput to parallelize a bunch of sorting operations. Is there a way to have millions of DynamicOutputs without breaking the UI?

johann

09/16/2021, 4:09 PM
Hi @Edward Smith, I think this would fall outside the standard use cases for Dagster. Dagster is great for structuring your pipelines, getting observability, etc., but it’s not intended to compete with tools like Dask or Spark for high-performance compute. I don’t know much about your use case, but I think you could consider calling out to one of those tools from your solids. They’ll do the heavy lifting that they’re good at, while you’ll still get the Dagster interface on top of them.

Edward Smith

09/16/2021, 4:14 PM
Thanks, @johann. This process currently just uses the Python multiprocessing library in a single script, so it seems like this should 'fit' within dagster:
import gzip
import json
import multiprocessing

# save_recommendations, broadcaster_names, SAVE_RECOMMENDATIONS, and logger are defined elsewhere in the script
p = multiprocessing.Pool(30)
logger.info("Starting to push")
with gzip.open("recommendations_parallel.json.gz", "wt") as fout:
    for num, result in enumerate(p.imap_unordered(save_recommendations, list(range(len(broadcaster_names))))):
        if num % 10000 == 0:
            logger.info(f"Sent {num}")
            logger.debug(f"Sample Result: {json.dumps(result)}")
        if SAVE_RECOMMENDATIONS:
            fout.write(json.dumps(result))
            fout.write("\n")
I could just have a single solid that uses this same code, I guess.
In the code above, save_recommendations is getting called for each element of broadcaster_names
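(For reference, a rough DynamicOutput version of that loop, as a minimal sketch: fan_out_names, save_one, and recommendations_pipeline are placeholder names, and it assumes save_recommendations and broadcaster_names are importable from the existing script.)

from dagster import DynamicOutput, DynamicOutputDefinition, pipeline, solid

@solid(output_defs=[DynamicOutputDefinition()])
def fan_out_names(_):
    # one DynamicOutput per broadcaster; this is the part that would produce millions of outputs
    for idx in range(len(broadcaster_names)):
        yield DynamicOutput(value=idx, mapping_key=str(idx))

@solid
def save_one(_, idx):
    return save_recommendations(idx)

@pipeline
def recommendations_pipeline():
    fan_out_names().map(save_one)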

johann

09/16/2021, 4:20 PM
Yeah I don’t have a strong opinion, but grouping the operations into one solid could be reasonable
Dagit tries to make it easy to inspect the lifetime of each solid, but that definitely breaks down beyond maybe 1,000 solids

Edward Smith

09/16/2021, 4:48 PM
That makes sense... it does seem to me that the UI is the only limiting factor here, is that right?
Is it possible to combine Dynamic Solids into a single Composite Solid in the UI?

johann

09/16/2021, 4:49 PM
We also write a number of events to your database for each solid. You might run into issues with the number of concurrent connections

Edward Smith

09/16/2021, 4:50 PM
Well, the number of concurrent connections is limited by the number of threads
So on a 16-core box, it's just 16 connections

johann

09/16/2021, 4:51 PM
What executor would you be using?

Edward Smith

09/16/2021, 4:51 PM
I'm also thinking that maybe I could set a number of partitions in the solid config and then, instead of yielding each individual item, yield N lists of items, so that each list becomes a solid and there's a fixed number of lists.
I was thinking multiprocess_executor since that is how the existing code is working
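(Side note: the multiprocess executor's concurrency can be capped in run config; a minimal sketch, assuming the default executor definitions, with 16 just matching the 16-core box from above:)

run_config = {
    "execution": {
        "multiprocess": {
            "config": {"max_concurrent": 16}  # cap concurrent solid processes
        }
    }
}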

johann

09/16/2021, 4:52 PM
You might find that the UI is the biggest limiting factor, but you would just be in uncharted territory and could certainly run into something

Edward Smith

09/16/2021, 4:53 PM
Gotcha... I'll have my DynamicOutput be a list of X tuples instead of each tuple and I think that will resolve this issue.
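(Roughly what I have in mind; the chunk size and solid names are placeholders, same assumptions about save_recommendations and broadcaster_names as before:)

from dagster import DynamicOutput, DynamicOutputDefinition, solid

CHUNK_SIZE = 1000  # placeholder; pick X so the total number of dynamic outputs stays small

@solid(output_defs=[DynamicOutputDefinition()])
def fan_out_chunks(_):
    indices = list(range(len(broadcaster_names)))
    for start in range(0, len(indices), CHUNK_SIZE):
        # each DynamicOutput is now a whole chunk, not a single item
        yield DynamicOutput(value=indices[start:start + CHUNK_SIZE], mapping_key=f"chunk_{start}")

@solid
def save_chunk(_, chunk):
    return [save_recommendations(i) for i in chunk]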

johann

09/16/2021, 4:53 PM
With a million solids you’ll also pay the process spinup overhead for each solid

Edward Smith

09/16/2021, 4:53 PM
I can estimate the total number of tuples so that I wind up with a reasonable number of dynamic solids

johann

09/16/2021, 4:54 PM
Yeah I think finding the right way to group work is the way to go here

Edward Smith

09/16/2021, 4:54 PM
cool, thanks for the help!

johann

09/16/2021, 4:54 PM
You’re welcome!