# integration-dbt
s
Hey, I want to run nodes separately that I load from manifest using this snippet of code:
import json

from dagster import AssetKey
from dagster_dbt import load_assets_from_dbt_manifest

dbt_assets = load_assets_from_dbt_manifest(
    json.load(open("./manifest.json")),
    # AssetKey accepts a list of path components, so prefix the dotted
    # unique_id components with "dbt" (no need to stringify a nested list)
    node_info_to_asset_key=lambda node_info: AssetKey(
        ["dbt"] + node_info["unique_id"].split(".")
    ),
    use_build_command=True,
)
Dagster loads all assets as intended, but when I try to run the graph it launches a single pod where all models are run, because the entrypoint command is:
dbt --no-use-color --log-format json build --project-dir /data_products/common/dbt --profiles-dir /data_products/common/dbt --select *
meaning the selection has defaulted to the asterisk, as mentioned in the documentation. This means the resulting graph will run all models in every single node, so I will end up with n^2 model runs. My question: how do I achieve behaviour where each asset is run in a separate step, so I get one pod per model plus its tests (because of the build command)? Normally I would just do
dbt build --project-dir /data_products/common/dbt --profiles-dir /data_products/common/dbt --select <MODEL_NAME>
but when all models are loaded at once from the manifest, I cannot extract the names in some sort of loop.
🤖 3
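(For what it's worth, the model names can in fact be pulled out of the manifest in a loop, by filtering `manifest["nodes"]` on `resource_type`. A minimal pure-Python sketch; the manifest dict here is a hypothetical stand-in for `json.load(open("./manifest.json"))`:)

```python
# Hypothetical stand-in for the real ./manifest.json contents.
manifest = {
    "nodes": {
        "model.common.orders": {"resource_type": "model", "name": "orders"},
        "model.common.customers": {"resource_type": "model", "name": "customers"},
        "test.common.not_null_orders_id": {"resource_type": "test", "name": "not_null_orders_id"},
    }
}

# Keep only models, skipping tests, seeds, snapshots, etc.
model_names = sorted(
    node["name"]
    for node in manifest["nodes"].values()
    if node["resource_type"] == "model"
)

# One build command per model (tests run alongside, thanks to `build`).
commands = [
    f"dbt build --project-dir /data_products/common/dbt "
    f"--profiles-dir /data_products/common/dbt --select {name}"
    for name in model_names
]
```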
z
We wanted this functionality as well. Unfortunately, your only option is to fork the dagster-dbt implementation and modify the method that creates the AssetsDefinition, so that you create one AssetsDefinition per dbt model you're using.
s
Looks like this might be a common need, I'll pass it on to our dagster-dbt people.
❤️ 2
s
Yeah, we are evaluating Dagster as an orchestrator, and having all models run within the same pod is a deal-breaker for us. If that could be implemented it would be really appreciated.
cc @Bartosz Kopytek
@sean Could you please leave a link to the issue here so we can track the progress of the work and the ETA?
🌈 1
s
cc @owen — I lack the dbt knowledge/context to know whether this is already in our issues or the best way to express the problem
👍 1
o
hi @Szymon Piskorz! do you mind elaborating on the issue with having all dbt models execute in the same pod? is this a resource-constraint thing, an observability issue, something else? In general, we've preferred the solution of keeping all dbt computation within a single node, as it avoids per-process overhead and tracks more closely with "normal" dbt execution (where commands generally target lots of models at once), but I'm definitely interested in alternative use cases. I'm also not fully understanding the
This means that the resulting graph will run all models in every single node so I will have n^2 of model runs.
bit -- calling load_assets_from_dbt_manifest should result in just a single node in the graph (which will execute all models), not n nodes.
s
@owen We wanted to split it between pods/nodes to get the benefit of visibility while the run is ongoing, and also re-runnability, if that is even a word. It just seems sad to see 32000 models run in one pod. Even if not separated model by model, it would be cool to have an option to pick a criterion by which we could load our dbt project and run separate "sub-projects" in separate nodes, as they are completely separate entities but still part of our "graph" as a whole. So the short answer is visibility, plus the ability to pick and choose how it splits: on a separate-by-model, separate-by-tag, or separate-by-sub-project basis.
o
Makes sense! One option that gets you part of the way there is to call load_assets_from_dbt_manifest multiple times, with different select parameters (so, for example, call it once for each sub-project). That way, each sub-project runs in its own isolated step. You also get some level of in-progress visibility (provided you're on dbt-core>=1.4), as materialization events are generated as soon as each model completes executing. Retryability is another bit we want to improve on here, but I don't think splitting each model out into a separate step is the right knob to turn in the general case (although it certainly would work). Instead, we're planning to expand the re-execution APIs to support retrying just the failed assets for a step, but this is a ways out. In short, I think your feedback makes sense, and the best short-term solution would be multiple invocations of load_assets_from_dbt_manifest. In the long term, do you think you would still need the one model <> one step mapping, or would a sufficiently ergonomic dbt selection <> one step do the trick? My personal bias here, but I shudder a bit at the thought of 32k processes being kicked off, each doing just a tiny bit of work, though I definitely understand the appeal of a middle ground where each step is responsible for a few dozen models or something.
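(A rough sketch of what the multiple-invocation idea could look like: derive one dbt selection string per sub-project from the manifest, then pass each string as the `select` argument of a separate load_assets_from_dbt_manifest call. The sub-project names, the fqn folder layout, and the `"<folder>.*"` selector form are all assumptions for illustration, not from this thread:)

```python
# Hypothetical stand-in manifest; real "fqn" values come from dbt and reflect
# the project's folder structure (["<project>", "<subfolder>", ..., "<model>"]).
manifest = {
    "nodes": {
        "model.common.mk_orders": {"resource_type": "model", "fqn": ["common", "marketing", "mk_orders"]},
        "model.common.mk_leads": {"resource_type": "model", "fqn": ["common", "marketing", "mk_leads"]},
        "model.common.fin_ledger": {"resource_type": "model", "fqn": ["common", "finance", "fin_ledger"]},
    }
}

# Group models by the folder directly under the project root (fqn[1]) --
# an assumed layout where each such folder is a self-contained sub-project.
subprojects = {
    node["fqn"][1]
    for node in manifest["nodes"].values()
    if node["resource_type"] == "model"
}

# One fqn-style selector per sub-project; each would then be used as
# load_assets_from_dbt_manifest(manifest, select=selects[sub], ...)
# so each sub-project materializes in its own step.
selects = {sub: f"{sub}.*" for sub in sorted(subprojects)}
```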