Dan Coates
Hey everyone, looking at using Dagster to execute some workflows. One particular use case we have involves taking a list of about 100,000 URLs, getting the page <title> for each URL, then doing some further (fairly simple) processing for each one. The 100,000 is the initial backlog; once that is complete, about 1,000 more would need to be processed each day. I'm not sure what the best way to structure this is. From reading the docs it seems like there are a few options:
1. Use partitions and have one partition for each URL (not sure if partitions can scale out this much).
2. Have one pipeline that takes a single domain and does the processing required for it, then a second pipeline that takes the list of URLs and executes the first pipeline for each URL.
3. Do it all within one pipeline and use a dynamic graph to fan out once the list of URLs is fetched (not sure if dynamic graphs can handle fanning out to 100,000; I'm guessing not?).
4. Do it all within one pipeline and handle the processing of all URLs inside a single solid. It would be a pity to do this, as you lose a lot of the great observability that Dagster seems to offer.
Any help or thoughts much appreciated :)
sandy
Hi @Dan Coates - great question. Here are the two ways I would think about implementing this:
• Use a system like Spark, Dask, or Ray to handle the highly parallel parts. This is especially recommended if you envision yourself needing to do parallel joins, group-bys, etc.
• Similar to your option 3, but handle multiple URLs per mapped task, i.e. construct chunks of maybe 10-100 URLs and package them together in a dynamic output.
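(For anyone finding this thread later, here is a minimal sketch of what that chunked fan-out could look like. It uses the current op/job API rather than the solid/pipeline vocabulary used above, and the op names, the chunk size of 100, and the placeholder example.com URL list are all illustrative assumptions, not anything from this thread.)

```python
# Minimal sketch of the "chunked dynamic output" suggestion above.
# Assumes the current Dagster op/job API; all names here are made up for illustration.
from dagster import DynamicOut, DynamicOutput, job, op


@op(out=DynamicOut())
def chunk_urls(context):
    # Stand-in for however the real list of ~100,000 URLs gets fetched.
    urls = [f"https://example.com/page/{i}" for i in range(100_000)]

    chunk_size = 100  # 10-100 URLs per mapped task, as suggested above
    for start in range(0, len(urls), chunk_size):
        # Each chunk becomes one mapped step with its own mapping_key,
        # so the UI shows ~1,000 steps instead of 100,000.
        yield DynamicOutput(
            value=urls[start : start + chunk_size],
            mapping_key=str(start),
        )


@op
def process_chunk(context, urls):
    # Placeholder for fetching each page <title> and the simple follow-up processing.
    results = [{"url": url, "title": None} for url in urls]
    context.log.info(f"processed {len(urls)} urls")
    return results


@op
def summarize(context, chunk_results):
    # .collect() below delivers a list with one entry per mapped chunk.
    total = sum(len(chunk) for chunk in chunk_results)
    context.log.info(f"processed {total} urls in total")


@job
def title_backfill_job():
    summarize(chunk_urls().map(process_chunk).collect())
```

One mapped step per chunk, rather than per URL, keeps the step count and per-step overhead manageable while still giving per-chunk logs, observability, and retries.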
Dan Coates
Hey @sandy, thanks for the quick reply. I'm not too worried about making this highly parallel; it doesn't need to complete all that quickly, and the processing time for each URL should only be a second or so.
> Similar to your option 3, but handle multiple URLs per mapped task, i.e. construct chunks of maybe 10-100 URLs and package them together in a dynamic output.
This sounds like the best approach. I was thinking that it would be nice to have a single pipeline run for each URL to make it easier to inspect results, but that's not a must-have.
g
Hi, we have an almost identical use case. Are there any good examples of using this fan-out approach for concurrent operations like these, please?
sandy
> This sounds like the best approach. I was thinking that it would be nice to have a single pipeline run for each URL to make it easier to inspect results, but that's not a must-have.
It's certainly not a terrible idea. However, if it only takes a second to process each URL, then having a run per URL will massively blow up the total processing time, because it can take many seconds to launch a run, depending on what run launcher you're using.
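(Rough numbers to illustrate that point, assuming launching a run takes about 10 seconds, which will vary a lot by run launcher: 100,000 single-URL runs would spend roughly 100,000 × 10 s ≈ 11.5 days on launch overhead alone, on top of the ~28 hours of actual processing at one second per URL. Chunking into ~1,000 mapped steps inside a single run avoids nearly all of that launch overhead.)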
@alex is there an example of dynamic orchestration in the docs that we can point to?