Dan Coates03/18/2021, 1:11 AM
for each url then doing some further (fairly simple) processing for each one. The 100,000 number is the initial backlog, and once that is complete about 1,000 more would need to be processed each day. I'm not sure what the best way to structure this is. From reading the docs it seems like there are a few options:
1. Use partitions and have 1 partition for each url (not sure if partitions can scale out this much).
2. Have one pipeline that takes a single domain and does the processing required for it, then a second pipeline that takes the list of urls and executes the first pipeline for each url.
3. Do it all within one pipeline and use a dynamic graph to fan out once the list of urls is fetched (not sure if dynamic graphs can handle fan-out to 100,000; I'm guessing not?).
4. Do it all within one pipeline and handle the processing of all urls inside a single solid. It would be a pity to do this, as you lose a lot of the great observability that dagster seems to offer.
Any help or thoughts much appreciated :)
sandy03/18/2021, 1:23 AM
Similar to your option 3, but handle multiple URLs per mapped task. I.e. construct chunks of maybe 10-100 URLs and package them together in a dynamic output.
Dan Coates03/18/2021, 1:40 AM
> Similar to your option 3, but handle multiple URLs per mapped task. I.e. construct chunks of maybe 10-100 URLs and package them together in a dynamic output.

This sounds like the best approach. I was thinking that it would be nice to have a single pipeline run for each url to make it easier to inspect results, but that's not a must-have.
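A rough sketch of the chunking idea: rather than one dynamic output per URL, group URLs into chunks so the dynamic fan-out stays in the hundreds or thousands. In Dagster's early-2021 API, each chunk would be yielded as a `DynamicOutput` from a solid and fanned out with `.map(process_chunk)`; only the chunking itself is shown here, and the names, URL list, and chunk size are all illustrative.

```python
# Sketch: chunk 100,000 URLs into groups of 100 so the dynamic
# fan-out is 1,000 mapped tasks instead of 100,000.
# (In Dagster, each chunk would be wrapped in a DynamicOutput and
# mapped over a processing solid; names here are made up.)

def chunk(items, size):
    """Yield successive fixed-size chunks from a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

urls = [f"https://example.com/page/{n}" for n in range(100_000)]

chunks = list(chunk(urls, 100))
print(len(chunks))     # number of mapped tasks
print(len(chunks[0]))  # URLs carried by each task
```

The chunk size is a tuning knob: bigger chunks mean less per-step overhead, smaller chunks mean finer-grained observability and retries.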
Gerhard Van Deventer03/18/2021, 6:47 AM
sandy03/18/2021, 3:30 PM
> This sounds like the best approach. I was thinking that it would be nice to have a single pipeline run for each url to make it easier to inspect results, but that's not a must-have.

It's certainly not a terrible idea. However, if it only takes a second to process each URL, then having a run per URL will massively blow up the total processing time, because it can take many seconds to launch a run, depending on which run launcher you're using.
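A back-of-envelope calculation makes the point concrete. The numbers below are assumptions for illustration (1 s of processing per URL, 5 s to launch a run), not measurements of any particular run launcher:

```python
# Rough comparison: one run per URL vs one run fanned out over chunks.
# All timing numbers are illustrative assumptions.

URLS = 100_000
PROCESS_S = 1      # assumed processing time per URL
RUN_LAUNCH_S = 5   # assumed overhead to launch one run

# One run per URL: every URL pays the launch overhead.
per_url_runs_s = URLS * (RUN_LAUNCH_S + PROCESS_S)

# One run with chunked dynamic steps: launch overhead paid once.
chunked_run_s = RUN_LAUNCH_S + URLS * PROCESS_S

print(per_url_runs_s / 3600)  # total hours, run-per-URL
print(chunked_run_s / 3600)   # total hours, single chunked run
```

Under these assumptions the run-per-URL layout spends several times as long on launch overhead as on the actual work, which is the blow-up sandy is describing.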