Ken Geis

04/21/2021, 11:13 PM
Hi all. New to Dagster, seeing if it is suited to my task. I'd love some feedback on whether I'm approaching this right. Thanks in advance!

I want to create a pipeline to incrementally process a corpus of documents.

1. run a database query that gets a list of PDFs from the source document management system
2. for each PDF in the list, fetch (HTTP get) it to a local directory if it's not already there; delete any local PDFs not in the list
3. for each PDF in the local directory, run a transform to output a JSON file; delete any previous outputs that do not match a PDF
4. do something (reduce?) with the JSON files

The transition from 1 to 2 above suggests a dynamic workflow. Should dynamic workflow be used for something where I could have tens of thousands of files, or should it be a static workflow where each solid transforms many documents?

I can imagine the solids creating Results related to the PDF and JSON assets. How can I make my pipeline clean up (on subsequent runs) PDF and JSON that have disappeared from the source system?
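
For concreteness, the cleanup in steps 2 and 3 could be sketched roughly as below. This is plain Python, not Dagster API, and only one possible approach: the helper names `plan_sync` and `prune_orphans` are hypothetical, and it assumes each PDF and its JSON output share a file stem (e.g. `doc1.pdf` and `doc1.json`).

```python
from pathlib import Path
from typing import Iterable, List, Set, Tuple


def plan_sync(wanted: Iterable[str], present: Iterable[str]) -> Tuple[Set[str], Set[str]]:
    """Return (to_fetch, to_delete) so the local set ends up equal to `wanted`.

    Hypothetical helper: `wanted` is the list from the database query,
    `present` is what is already in the local directory.
    """
    wanted_set, present_set = set(wanted), set(present)
    return wanted_set - present_set, present_set - wanted_set


def prune_orphans(directory: Path, suffix: str, keep_stems: Iterable[str]) -> List[str]:
    """Delete files ending in `suffix` whose stem is not in `keep_stems`.

    Returns the names of the files that were removed, in sorted order.
    """
    keep = set(keep_stems)
    removed = []
    for path in sorted(directory.glob(f"*{suffix}")):
        if path.stem not in keep:
            path.unlink()
            removed.append(path.name)
    return removed
```

Under those assumptions, step 2 would fetch everything in `to_fetch` and delete everything in `to_delete`, and step 3 would call `prune_orphans(json_dir, ".json", pdf_stems)` after the transforms, so anything that has disappeared from the source system gets cleaned up on the next run.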

Andy H

04/21/2021, 11:57 PM
Just my 2 cents here: I think dynamic orchestration uses exactly your case, a variable list of files as input, as its example use case. So it's probably a good fit here. I don't have immediate answers to the rest of your approach, but a dynamic workflow does sound good here to me.