Hi all. New to Dagster, seeing if it is suited to my task. I'd love some feedback on whether I'm approaching this right. Thanks in advance!
I want to create a pipeline to incrementally process a corpus of documents.
1. run a database query that gets a list of PDFs from the source document management system
2. for each PDF in the list, fetch (HTTP GET) it to a local directory if it's not already there; delete any local PDFs not in the list
3. for each PDF in the local directory, run a transform to output a JSON file; delete any previous outputs that do not match a PDF
4. do something (reduce?) with the JSON files
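To make the "fetch what's missing, delete what's gone" part of steps 2 and 3 concrete, here is a minimal sketch in plain Python (no Dagster). It assumes each document has an ID that doubles as the file stem, e.g. `doc1.pdf` and `doc1.json`; the names `plan_sync` and `prune` are made up for illustration.

```python
from pathlib import Path
from typing import Iterable, Set, Tuple


def plan_sync(wanted: Iterable[str], local_dir: Path, suffix: str) -> Tuple[Set[str], Set[str]]:
    """Compare wanted document IDs with the files on disk.

    Returns (missing, stale): IDs that still need a file with `suffix`,
    and IDs whose file exists locally but is no longer wanted.
    """
    wanted_set = set(wanted)
    # Assumption: the file stem is the document ID (doc1.pdf -> "doc1").
    present = {p.stem for p in local_dir.glob(f"*{suffix}")}
    return wanted_set - present, present - wanted_set


def prune(local_dir: Path, stale: Iterable[str], suffix: str) -> None:
    """Delete the local files for IDs that are no longer wanted."""
    for doc_id in stale:
        (local_dir / f"{doc_id}{suffix}").unlink()
```

The same pair would work for both steps: call it with the query result and `".pdf"` for step 2, then with the IDs of the local PDFs and `".json"` for step 3. The `missing` set is what would need fetching or transforming.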
The transition from step 1 to step 2 above suggests a dynamic workflow. Should a dynamic workflow be used when there could be tens of thousands of files, or should it be a static workflow where each solid transforms many documents?
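One possible middle ground between the two, offered as an assumption and not as anything Dagster prescribes, would be to split the list into fixed-size batches so that each unit of work handles many documents. A rough sketch of the batching itself, in plain Python with a made-up helper name `batches`:

```python
from typing import Iterator, List, Sequence


def batches(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

With a batch size of 500, for example, 20,000 documents would become 40 units of work instead of 20,000.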
I can imagine the solids creating Results related to the PDF and JSON assets. How can I make my pipeline clean up (on subsequent runs) PDFs and JSON files that have disappeared from the source system?
04/21/2021, 11:57 PM
Just my 2 cents here: I think dynamic orchestration uses exactly your case, a variable list of files as input, as its example use case, so it's probably a good fit here. I don’t have immediate answers to the rest of your approach, but a dynamic workflow does sound good to me.