
David VanBolt

11/12/2022, 4:54 PM
A design question I've been wrestling with for a couple of days: I am putting together a pretty classic SQL Server -> SQL Server ELT pipeline, and I'm loving the potential of Dagster after putting together a little MVP project. The trouble I'm having is that there are hundreds of tables, and I don't know whether I should set them all up as their own jobs. Right now I am using graphs and not assets, because I wasn't entirely sure how incremental extractions/loads work with IO managers and I'd be writing my own since it's SQL Server. That said, I'd love to move to asset materializations so I can see lineage once I start on the transformation layer (probably dagster-dbt).

Here are the two approaches I've taken:
1. Have a single job and a single schedule. That schedule loops through a library of stored configs (take these tables, these columns, primary keys, etc.) and yields hundreds of run requests for that job, each with the appropriate config for ops and resources (the resource defs themselves don't need to change).
2. Change the main job to a graph, then loop through the stored configs and use the .to_job() method to emit hundreds of jobs (and schedules) with the corresponding config.

In general, is there a "right" way to do this when you're effectively just replicating subsections (by column, not partitioned) of RDBMS tables? Both methods work fine in Dagit, but how do I use the completion of all these schedules (or run requests from a schedule) as a dependency that must finish before Dagster starts any dbt materializations? Would it be enough to emit AssetMaterializations from the final op in the graph/job? Any high-level best practices for this situation, or specific projects on GitHub that are good examples, would be really appreciated.
❤️ 1
:rainbow-daggy: 1
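A minimal sketch of approach 1 as described above: a single job plus a schedule that yields one RunRequest per stored table config. Here `replicate_table_job`, `load_table_configs`, and the `extract_table` op name are hypothetical placeholders for the stored-config library and extraction job mentioned in the question.

```python
# Minimal sketch of approach 1, assuming hypothetical helpers:
# - load_table_configs(): returns the stored per-table configs
# - replicate_table_job: the single job, containing an "extract_table" op
from dagster import RunRequest, schedule

from my_project.config_store import load_table_configs  # hypothetical
from my_project.jobs import replicate_table_job          # hypothetical


@schedule(cron_schedule="0 2 * * *", job=replicate_table_job)
def replicate_all_tables_schedule(context):
    # Yield one RunRequest per stored table config, all against the same job.
    for table_cfg in load_table_configs():
        yield RunRequest(
            # Include the tick time so each table gets a unique run_key per tick;
            # identical run_keys are de-duplicated by the scheduler.
            run_key=f"{context.scheduled_execution_time}-{table_cfg['table_name']}",
            run_config={
                "ops": {"extract_table": {"config": table_cfg}},
            },
        )
```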

Stephen Bailey

11/13/2022, 6:28 PM
I think this is a design decision with tradeoffs either way. My recommendation is that you should have an asset for every table you create, rather than a single job with many different configurations. You can use an asset generator to do this cleanly. It doesn't make things any less efficient from a coding perspective -- the main tradeoff is that it will generate a large number of nodes in your asset graph and clutter up your Dagit UI / run history with hundreds of different runs/assets. However, when you need the extra control (for example, running a backfill on a single table), you have it naturally. Additionally, Dagster's "repo" functionality allows you to exclude certain repos in the UI, which helps with the clutter problem.

You could also do the "single job" approach and yield AssetMaterialization events at the end of the run to populate your asset catalog with metadata. This is actually what we do currently (but wouldn't do again). You lose some of the nice catalog UI functionality and don't really gain anything, IMO, from an end-user perspective.
❤️ 1
:rainbow-daggy: 1
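A minimal sketch of the asset-factory pattern recommended above: one software-defined asset per table config. `load_table_configs` and `extract_and_load` are hypothetical stand-ins for the stored-config library and the SQL Server extract/load logic; the group name is also just an example.

```python
# Minimal sketch of an asset factory: one software-defined asset per table.
# load_table_configs() and extract_and_load() are hypothetical placeholders.
from dagster import asset

from my_project.config_store import load_table_configs  # hypothetical
from my_project.replication import extract_and_load     # hypothetical


def build_table_asset(table_cfg):
    """Return a software-defined asset for one replicated table."""

    @asset(name=table_cfg["table_name"], group_name="raw_sqlserver")
    def _table_asset(context):
        # Perform the incremental extract/load directly (rather than via an
        # IO manager) and attach metadata so it shows up in the asset catalog.
        rows_loaded = extract_and_load(table_cfg)
        context.add_output_metadata({"rows_loaded": rows_loaded})

    return _table_asset


# Build one asset per stored config and register the list in your repository;
# downstream dbt assets can then declare dependencies on these tables.
table_assets = [build_table_asset(cfg) for cfg in load_table_configs()]
```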

David VanBolt

11/14/2022, 8:53 PM
Outstanding, Stephen. I'll give it a shot. Thanks so much!
:rainbow-daggy: 1

Joris Ganne

12/29/2022, 1:21 PM
@David VanBolt have you worked this out further? I'm facing design questions in a similar situation and would like to hear about your approach. Thank you in advance.