# integration-dbt
l
Hi everyone! I have noticed a really weird behaviour in my dagster + dbt project. 🤨🌈 Every time I launch a materialization, dagster spends a lot of time just "listing" all the dbt models, and only after listing all of them does it start the real job and materialize the models I am interested in. I don't really know what it is doing because everything happens in the background, but if I check the background processes I can see that it is running something like
"dbt ls --output json ... --select model: xyz "
and it does this for each job, starting over again every time. It seems to be checking which models of my project have already been materialized and which ones have never been materialized. How can I avoid this time-consuming behaviour? Thanks in advance! 👀
j
Hey @Lorenzo that’s an interesting behavior. How are you specifying how dagster executes dbt? Are the dbt models being imported as dagster assets?
dbt_assets = load_assets_from_dbt_project(
    project_dir=DBT_PROJECT_PATH, profiles_dir=DBT_PROFILES, key_prefix=["jaffle_shop"]
)
l
Hi @Jonathan Neo I am importing each dbt model individually because I want to be able to restart every DAG from the point of failure without re-materializing all the assets of the DAG.
TMP_UM_CS_S_CHANNEL_P_AU_NEW_RECORDS_asset = load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="TMP_UM_CS_S_CHANNEL_P_AU_NEW_RECORDS")
TMP_UM_PO_B_PO_HEADER_NEW_RECORDS_asset = load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="TMP_UM_PO_B_PO_HEADER_NEW_RECORDS")
TMP_UM_PO_B_PURCHASE_ORDER_DETAIL_NEW_RECORDS_asset = load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="TMP_UM_PO_B_PURCHASE_ORDER_DETAIL_NEW_RECORDS")
TMP_UM_QU_B_QUESTIONNAIRE_DETAIL_P_AU_NEW_RECORDS_asset = load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="TMP_UM_QU_B_QUESTIONNAIRE_DETAIL_P_AU_NEW_RECORDS")
TMP_UM_QU_B_QUESTIONNAIRE_HEADER_P_AU_NEW_RECORDS_asset = load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="TMP_UM_QU_B_QUESTIONNAIRE_HEADER_P_AU_NEW_RECORDS")
TMP_UM_QU_S_QUESTIONNAIRE_ANSWER_P_AU_NEW_RECORDS_asset = load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="TMP_UM_QU_S_QUESTIONNAIRE_ANSWER_P_AU_NEW_RECORDS")
TMP_UM_SH_B_SHOP_GOLIVES_P_AU_NEW_RECORDS_asset = load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="TMP_UM_SH_B_SHOP_GOLIVES_P_AU_NEW_RECORDS")
j
I’m not sure I understand why you would want to import each dbt model on its own. I would just do a single
load_assets_from_dbt_project
to load all my dbt assets. If you want to specify certain dbt models only, you could do:
load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="model_1 model_2 model_3 model_4")
Having multiple
load_assets_from_dbt_project
would trigger multiple
dbt ls
commands, and therefore take a long time to execute.
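A minimal sketch of collapsing the per-model loader calls into one, assuming the `load_assets_from_dbt_project(project_dir=..., select=...)` signature already used in this thread. The names `build_select`, `MODELS` and `load_all_dbt_assets` are hypothetical, and the `dagster_dbt` import is deferred so the helper can be tried on its own:

```python
from typing import Iterable, List


def build_select(model_names: Iterable[str]) -> str:
    """Join dbt model names into one space-separated selection string."""
    seen: List[str] = []
    for name in model_names:
        name = name.strip()
        # Skip blanks and duplicates so each model is listed once.
        if name and name not in seen:
            seen.append(name)
    return " ".join(seen)


# Hypothetical list; a few of the model names posted earlier in the thread.
MODELS = [
    "TMP_UM_CS_S_CHANNEL_P_AU_NEW_RECORDS",
    "TMP_UM_PO_B_PO_HEADER_NEW_RECORDS",
    "TMP_UM_PO_B_PURCHASE_ORDER_DETAIL_NEW_RECORDS",
]


def load_all_dbt_assets(project_dir: str):
    # Deferred import: dagster_dbt is only needed when assets are loaded.
    from dagster_dbt import load_assets_from_dbt_project

    # One loader call for all models instead of one call per model.
    return load_assets_from_dbt_project(
        project_dir=project_dir, select=build_select(MODELS)
    )
```

In dbt's selection syntax, space-separated names are a union, so the single call covers the same models as the separate calls did.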
l
I tried to import all the models at once and it was indeed quick. The problem with that is: say you have 4 models in a group, and each one references the previous one (first -> second -> third -> fourth). The third one fails and the fourth one is skipped. At this point, I cannot correct the third model and then materialize only the third and fourth. I can correct the third model, but then the materialization restarts from the beginning, and dagster re-materializes first -> second -> third -> fourth again, even though first and second don't need to be run again.
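To make the desired restart behaviour concrete, here is a small standard-library sketch of "the failed model plus everything downstream of it". The function name `failed_and_downstream` and the `chain` dictionary are hypothetical and only mirror the four-model example above; this is not how dagster computes the set internally:

```python
from collections import deque
from typing import Dict, List


def failed_and_downstream(children: Dict[str, List[str]], failed: str) -> List[str]:
    """Return the failed model followed by all of its descendants (BFS order)."""
    order: List[str] = []
    seen = {failed}
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        order.append(node)
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order


# Each model maps to the models that reference it.
chain = {"first": ["second"], "second": ["third"], "third": ["fourth"], "fourth": []}

to_rerun = failed_and_downstream(chain, "third")
print(to_rerun)  # ['third', 'fourth']
```

dbt's own selection syntax expresses the same set with the `+` graph operator: `third+` selects `third` and all of its descendants.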
j
Ah I see. I would look at using a
build_asset_reconciliation_sensor
. What the reconciliation sensor does is that it materializes only model_4 when model_3 is fixed. I have an example here in a toy project: https://github.com/jonathanneo/my-dbt-dagster/blob/578ff10b9c1a4478f5e5462e7aa5d3ff2a4e07e7/stargazer/assets_modern_data_stack/my_asset.py#L97-L99
l
Wow, thank you so much @Jonathan Neo!! I am looking into it right now. Hopefully it is going to fix my problems.
a
The other solution here is to switch to the
load_assets_from_dbt_manifest
loader instead of the one you’re currently using: https://docs.dagster.io/_apidocs/libraries/dagster-dbt#dagster_dbt.load_assets_from_dbt_manifest This requires you to run
dbt ls
yourself (i.e. during your user code deployment container build) and then reuses the output for every dbt asset.
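A rough sketch of the manifest-based approach, assuming dbt has already written `manifest.json` into the project's `target/` directory at build time. The helpers `read_manifest`, `model_names` and `load_dbt_assets` are hypothetical names; the manifest is passed positionally because the keyword name of that parameter is an assumption that may differ between `dagster_dbt` versions:

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def read_manifest(path: str) -> Dict[str, Any]:
    """Parse a dbt manifest.json that was generated ahead of time."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def model_names(manifest: Dict[str, Any]) -> List[str]:
    """List the model names in the manifest, as a quick sanity check."""
    return sorted(
        node["name"]
        for node in manifest.get("nodes", {}).values()
        if node.get("resource_type") == "model"
    )


def load_dbt_assets(manifest_path: str):
    # Deferred import: dagster_dbt is only needed when assets are loaded.
    from dagster_dbt import load_assets_from_dbt_manifest

    # Reuses the pre-built manifest, so no `dbt ls` subprocess is started here.
    return load_assets_from_dbt_manifest(read_manifest(manifest_path))
```

Because the manifest is read from disk, loading the code location only costs a JSON parse rather than a dbt invocation.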
l
@owen Hi owen, I've tagged you here because I'd like to know why dagster does the
/usr/bin/python3 /home/lorenzo/.local/bin/dbt --no-use-color --log-format json ls --project-dir /home/lorenzo/Documents/GitHub/dagster-dbt-test/dbt_python_assets/dbt_python_assets/../UM_FOX_AU-dbt/dbt --profiles-dir /home/lorenzo/Documents/GitHub/dagster-dbt-test/dbt_python_assets/dbt_python_assets/../UM_FOX_AU-dbt/dbt/config --select TMP_UM_SH_B_SHOP_HIERARCHY_P_AU_UPDATE --output json
for each and every asset during my run. Keep in mind that I imported every model as a separate asset to be able to restart the DAGs with maximum granularity. It looks a bit strange, because it runs this command for each asset when the code is imported, and then repeats the same thing for each asset (again) when I run a DAG. Thank you!
q
+1 to this. My dagster project runs the
dbt ls
command for any asset that I materialize, even if it's not a dbt asset.
a
You should probably be using
load_assets_from_dbt_manifest
rather than
load_assets_from_dbt_project
- see my comment above
q
@Adam Bloom I get that. But I'm just wondering if it's necessary to run
dbt ls
on each asset materialization.
a
it is triggered whenever
load_assets_from_dbt_project
is invoked. you won't see it happening on each startup with
load_assets_from_dbt_manifest
q
It's okay for it to be triggered on each startup. But to be triggered before each asset (even non-dbt assets) materializes is just too much, I think.
Unless it is behaving like dbt which parses the models to determine the dependencies and lineages before each run.
o
@Lorenzo just loading your entire project with a single call to
load_assets_from_dbt*
will allow you to execute any subset of dbt models, so loading each model as a separate call is not recommended and doesn't have a real benefit.
@Qwame this happens because dagster needs to load your repository in order to execute any step of your job -- when you use
load_assets_from_dbt_project
, that means that in order to load your repository code, dagster will need to run
dbt ls
(there's no way to load just the subset of the repository that is unrelated to dbt). I'd definitely endorse @Adam Bloom’s suggestion of using
load_assets_from_dbt_manifest
for this case.
l
Thank you for the explanations guys, your interest is really appreciated! Thanks to @owen and @Adam Bloom in particular. To be fair, I did try the import from project and from manifest before, but for some reason, I could not restart only the failed jobs. Probably I experimented with so many different elements that I got confused 🤕.