Hi everyone! I'd like your thoughts on how to implement Dagster and run jobs on several local machines (I'm a week-old Dagster newbie).
My use case:
I have an existing data pipeline, written and run only manually (step by step), that I want to move to Dagster, because it's currently a huge pain to operate and adapt to other use cases without a proper orchestrator. Since the pipeline uses two external paid APIs, I'm quite cautious about task execution and avoiding duplicate calls.
The pipeline is seeded with fairly extensive configuration (semantics, URLs, numeric thresholds, etc.). It begins by composing queries for the first API, runs them in batches, processes the results (with a JSON schema that changes along the way), then evaluates the data and composes the targets for the second API, and so on, up to producing a custom report. It's currently built on top of a local MongoDB (which was convenient for prototyping) and runs locally.
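To make that concrete, here's a very stripped-down sketch of how I picture the steps mapping to Dagster ops and a job (all op names, config keys, and values are invented placeholders, not my real code):

```python
# Very rough sketch — all names and config keys are placeholders, not my real pipeline.
from dagster import OpExecutionContext, job, op


@op(config_schema={"base_url": str, "threshold": float})
def compose_first_api_queries(context: OpExecutionContext) -> list:
    # Build the query payloads for the first paid API from the seeded config.
    base_url = context.op_config["base_url"]
    return [{"url": base_url, "page": i} for i in range(3)]


@op
def run_queries_in_batches(context: OpExecutionContext, queries: list) -> list:
    # Call the first API in batches; this is the step where I worry most
    # about duplicate executions.
    context.log.info(f"Running {len(queries)} queries")
    return [{"raw": q} for q in queries]


@op
def compose_second_api_targets(results: list) -> list:
    # Evaluate the JSON results (the schema changes along the way) and
    # decide what to send to the second API.
    return [r["raw"] for r in results]


@op
def build_report(targets: list) -> None:
    # Final step: assemble the custom report (currently backed by a local MongoDB).
    pass


@job
def my_pipeline():
    build_report(
        compose_second_api_targets(
            run_queries_in_batches(compose_first_api_queries())
        )
    )
```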
Current goal:
I need to prototype quite a few things to improve and extend the existing pipeline, and to have it run time-consuming tasks (Selenium automation). For those tasks I want to use some local computers I own (a Raspberry Pi 4, an old 32-bit Linux PC, etc.).
My question: what would be the Dagster way of using those machines to run jobs while still monitoring them from the Dagit instance on my main PC?
(I'm not great with Docker and a complete noob at K8s.)
My ideas:
• Running Dagit instances on the other PCs and having specific ops make GraphQL calls between them, maybe with a FastAPI "hub": my main PC would submit job runs and receive AssetMaterialization events, for example (rough sketch after this list).
• Maybe running Celery workers on those machines and configuring Dagster to dispatch specific jobs to them (but I'm not sure I understand how that's set up; see the second sketch after this list).
• Something else entirely?
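For idea 1, here is roughly what I have in mind, assuming I'd use the Python client from dagster-graphql (the hostname, port, job name, and run config are all made-up placeholders):

```python
# Rough sketch of idea 1: my main PC submits a job run to a Dagit instance
# running on one of the other machines. Hostname, port, job name, and run
# config below are placeholders.
from dagster_graphql import DagsterGraphQLClient, DagsterGraphQLClientError

client = DagsterGraphQLClient("raspberrypi.local", port_number=3000)

try:
    run_id = client.submit_job_execution(
        "selenium_scraping_job",
        run_config={"ops": {"scrape": {"config": {"target_url": "https://example.com"}}}},
    )
    print(f"Submitted run {run_id} on the remote machine")
except DagsterGraphQLClientError as exc:
    print(f"Run submission failed: {exc}")
```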
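For idea 2, this is the kind of setup I imagine with dagster-celery, if I understand it correctly (the queue name is my own assumption, and each machine would run a worker process started with `dagster-celery worker start`):

```python
# Rough sketch of idea 2: the job uses the Celery executor and the heavy op
# is routed to a named queue that a worker on the Raspberry Pi would consume.
# The queue name is a placeholder; broker/backend settings would go in the
# job's run config under "execution".
from dagster import job, op
from dagster_celery import celery_executor


@op(tags={"dagster-celery/queue": "selenium"})
def long_selenium_task():
    # The time-consuming Selenium automation would live here.
    ...


@job(executor_def=celery_executor)
def distributed_job():
    long_selenium_task()
```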
Sometimes when you're learning new concepts, the problem or the solution you focus on turns out to be the wrong one, so feel free to correct me at whatever level you see fit 🙂 I may also have missed a doc or example covering this kind of use case.
Thanks a lot!