Szymon Piskorz
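03/28/2023, 12:06 PM
import json

from dagster import AssetKey
from dagster_dbt import load_assets_from_dbt_manifest

dbt_assets = load_assets_from_dbt_manifest(
    json.load(open("./manifest.json")),
    # Build a multi-part key: "dbt" followed by the dot-separated parts of unique_id.
    node_info_to_asset_key=lambda node_info: AssetKey(
        ["dbt"] + node_info["unique_id"].split(".")
    ),
    use_build_command=True,
)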
Dagster loads all assets as intended, but when I try to run the graph it launches a single pod in which all models are run, because the entrypoint command is:
dbt --no-use-color --log-format json build --project-dir /data_products/common/dbt --profiles-dir /data_products/common/dbt --select *
The selection has defaulted to an asterisk, as mentioned in the documentation. This means that the resulting graph will run all models in every single node, so I will end up with n^2 model runs.
My question: how do I get each asset to run in a separate step, so that I get one pod per model plus its tests (because of the build command)?
Normally I would just run
dbt build --project-dir /data_products/common/dbt --profiles-dir /data_products/common/dbt --select <MODEL_NAME>
but when all models are loaded at once from the manifest, I cannot extract the names in some sort of loop.
Zachary Bluhm
03/28/2023, 1:09 PM
AssetsDefinition so that you create 1:1 for each dbt model you're using
sean
03/28/2023, 8:30 PM
dagster-dbt people.
Szymon Piskorz
03/29/2023, 7:56 AM
Szymon Piskorz
03/29/2023, 7:56 AM
Szymon Piskorz
03/29/2023, 11:27 AM
sean
03/29/2023, 2:16 PM
owen
03/29/2023, 4:35 PM
"This means that the resulting graph will run all models in every single node so I will have n^2 of model runs." bit -- calling load_assets_from_dbt_manifest should result in just a single node in the graph (which will execute all models), not n nodes.
Szymon Piskorz
03/31/2023, 10:49 AM
owen
03/31/2023, 9:55 PM
load_assets_from_dbt_manifest multiple times, with different select parameters (so, for example, call it once for each sub project). In this way, you can get each sub project to run in its own isolated step.
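As a rough illustration of the "one call per sub project" idea, here is a minimal sketch. It assumes that a sub project corresponds to the first folder under models/ (read from each node's fqn in the manifest), which may not match how a given project is organised. The helper names selects_by_folder and load_assets_per_folder are made up for this example; only load_assets_from_dbt_manifest and its select and use_build_command parameters come from the thread.

```python
from collections import defaultdict
from typing import Any, Dict, List


def selects_by_folder(manifest: Dict[str, Any]) -> Dict[str, str]:
    """Map each top-level model folder to a dbt selection string."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for node in manifest["nodes"].values():
        if node["resource_type"] != "model":
            continue  # skip tests, seeds, snapshots, etc.
        fqn = node["fqn"]  # [package_name, *folders, model_name]
        # Assumption: the first folder under models/ identifies the sub project.
        folder = fqn[1] if len(fqn) > 2 else "root"
        groups[folder].append(node["name"])
    # dbt treats space-separated selectors as a union.
    return {folder: " ".join(sorted(names)) for folder, names in groups.items()}


def load_assets_per_folder(manifest: Dict[str, Any]) -> list:
    """Call load_assets_from_dbt_manifest once per folder (hypothetical helper)."""
    # Imported lazily so the grouping helper above works without dagster installed.
    from dagster_dbt import load_assets_from_dbt_manifest

    assets = []
    for select in selects_by_folder(manifest).values():
        assets.extend(
            load_assets_from_dbt_manifest(
                manifest, select=select, use_build_command=True
            )
        )
    return assets
```

Selecting by bare model name is the simplest option here but can be ambiguous if a folder and a model share a name; a path: or tag: selector per sub project may be a better fit in practice.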
You also do get some level of in-progress visibility (provided you're on dbt-core>=1.4), as materialization events are generated as soon as each model completes executing.
Retryability is another bit that we want to improve on here, but I don't think splitting each model out into separate steps is the right knob to turn in the general case (although it certainly would work). Instead, we're planning to expand the re-execution APIs to support retrying just the failed assets for a step, but this is a ways out.
But yeah, in short, I think your feedback makes sense, and the best short-term solution would be to have multiple invocations of load_assets_from_dbt_manifest. In the long term, do you think you would still need the one model <> one step mapping, or would sufficiently ergonomic dbt selection <> one step do the trick?
My personal bias here, but I guess in my mind I shudder a bit at the thought of 32k processes being kicked off, each just doing a tiny bit of work, but I definitely understand the appeal of a middle ground where each step is responsible for a few dozen models or something.
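For the middle ground mentioned above, a sketch of how model names could be pulled from the manifest and split into fixed-size batches, each batch becoming one dbt selection string. The function name batched_selects and the alphabetical batching are assumptions made for illustration, not something dagster-dbt provides; a batch_size of 1 would give the one model per step mapping asked about at the start of the thread.

```python
from typing import Any, Dict, List


def batched_selects(manifest: Dict[str, Any], batch_size: int) -> List[str]:
    """Return dbt selection strings, each covering up to `batch_size` models."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    # Collect model names only; tests, seeds and snapshots are skipped.
    names = sorted(
        node["name"]
        for node in manifest["nodes"].values()
        if node["resource_type"] == "model"
    )
    # Space-separated names form a union selection in dbt.
    return [
        " ".join(names[i : i + batch_size])
        for i in range(0, len(names), batch_size)
    ]
```

Each returned string could then be passed as the select argument of a separate load_assets_from_dbt_manifest call. Note that alphabetical batching ignores the dbt DAG, so models that depend on each other may land in different steps; grouping by folder or tag would probably keep related models together.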