Hi Dagster community, I have a typical ELT job where I extract and load data into Google Cloud Storage. From there I use dbt (orchestrated by Dagster) to run SQL queries that materialize a bunch of tables in BigQuery. Right now I am likely overloading one of my ops with too much responsibility, but I am unsure how to break it apart. Take the example where I am retrieving a million records from an API endpoint. Right now I have a single op that is doing the extract and load. This is because I am using a python generator in my resource to yield back the pages of the API as they are retrieved. That lets me upload data to GCS without having to collect all million records in memory. At the end of the op I am returning a “task complete” string and not the million records. This reduces the memory footprint, but also doesn’t respect some Dagster best practices such as relying on the io manager to store the retrieved data in GCS. Does anyone have thoughts on how they would approach this if they needed to create something similar?
04/14/2022, 9:01 PM
Hey @marcos - the two main reasons for breaking apart ops are:
• Re-execution - wanting to be able to pick up in the middle if something fails
• Parallelism - allowing multiple sub-chunks of work to run at the same time
Are either of those relevant to you in this case? If not, I wouldn't think that you're doing anything wrong with your current approach.
04/14/2022, 9:09 PM
Those are great questions to ask oneself when deciding how to break up work into various ops. Thank you for sharing that. In this case those two are not relevant so I will not try and force this op into two. Thank you for the response!
04/19/2022, 9:47 AM
If I may jump in here, we have both above mentioned requirement such as re-execution and parallelism. That's one reason we have separate jobs that are connected together with sensors. Another is because each job has different granularity. Think of one Job collect a massive zip and the outputs are 1 to x files that are spawned n following jobs.
It's works kind of well as we can run lot's of jobs (1000s of them) in parallel. But we lose pretty much all overview and re-execution get's harder too, as we can only re-execute one job, but not the whole pipeline (which are all jobs together) again. If you understand what I mean, is there a way to have a "master" DAG for the overview but still be able to run different granularities?
I have a feeling separation of jobs and SW defined assets could help, but I guess also dynamic DAGs would be heavily needed, which makes it more harder if one one file would error out, that all other jobs would haven a status error as well. Were in the current approach each file has it's own job. Not sure if that makes sense. But I loved how you @marcos presented in the community meeting one DAG with dynamic ops.