Hi everyone I have noticed a really weird behaviour in my da dagster #integration-dbt

Lorenzo

01/17/2023, 10:37 AM

Hi everyone! I have noticed a really weird behaviour in my dagster + dbt project. 🤨🌈 Every time I launch a materialization, dagster spends a lot of time just "listing" all the dbt models, and only after listing all of them does it start the real job and materialize the models I am interested in. I don't really know what it is doing because everything happens in the background, but if I check the background processes I can see that it is running something like

"dbt ls --output json ... --select model: xyz "

and it does this for each job, starting over again every time. It seems to be checking which models in my project have already been materialized and which ones have never been materialized. How can I avoid this time-consuming behaviour? Thanks in advance! 👀

Jonathan Neo

01/17/2023, 11:48 AM

Hey @Lorenzo that’s an interesting behavior. How are you specifying how dagster executes dbt? Are the dbt models being imported as dagster assets?

dbt_assets = load_assets_from_dbt_project(
    project_dir=DBT_PROJECT_PATH, profiles_dir=DBT_PROFILES, key_prefix=["jaffle_shop"]
)

Lorenzo

01/17/2023, 12:05 PM

Hi @Jonathan Neo I am importing each dbt model individually because I want to be able to restart every DAG from the point of failure without re-materializing all the assets of the DAG.

TMP_UM_CS_S_CHANNEL_P_AU_NEW_RECORDS_asset = load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="TMP_UM_CS_S_CHANNEL_P_AU_NEW_RECORDS")
TMP_UM_PO_B_PO_HEADER_NEW_RECORDS_asset = load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="TMP_UM_PO_B_PO_HEADER_NEW_RECORDS")
TMP_UM_PO_B_PURCHASE_ORDER_DETAIL_NEW_RECORDS_asset = load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="TMP_UM_PO_B_PURCHASE_ORDER_DETAIL_NEW_RECORDS")
TMP_UM_QU_B_QUESTIONNAIRE_DETAIL_P_AU_NEW_RECORDS_asset = load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="TMP_UM_QU_B_QUESTIONNAIRE_DETAIL_P_AU_NEW_RECORDS")
TMP_UM_QU_B_QUESTIONNAIRE_HEADER_P_AU_NEW_RECORDS_asset = load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="TMP_UM_QU_B_QUESTIONNAIRE_HEADER_P_AU_NEW_RECORDS")
TMP_UM_QU_S_QUESTIONNAIRE_ANSWER_P_AU_NEW_RECORDS_asset = load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="TMP_UM_QU_S_QUESTIONNAIRE_ANSWER_P_AU_NEW_RECORDS")
TMP_UM_SH_B_SHOP_GOLIVES_P_AU_NEW_RECORDS_asset = load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="TMP_UM_SH_B_SHOP_GOLIVES_P_AU_NEW_RECORDS")

Jonathan Neo

01/17/2023, 12:13 PM

I’m not sure I understand why you would want to import each dbt model on its own. I would just do a single load_assets_from_dbt_project to load all my dbt assets. If you want to specify certain dbt models only, you could do:

load_assets_from_dbt_project(project_dir=DBT_PROJECT_DIR, select="model_1 model_2 model_3 model_4")

Having multiple load_assets_from_dbt_project calls would trigger multiple dbt ls commands, and therefore take a long time to execute.
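As a rough sketch of the single-call approach, the seven per-model calls from earlier in the thread could collapse into one call with a space-separated selection. The helper names `build_select` and `load_dbt_assets` are hypothetical, and the sketch assumes a dagster-dbt version that still ships `load_assets_from_dbt_project`, as was the case at the time of this thread.

```python
from typing import Iterable

# Model names copied from the per-model calls earlier in the thread.
MODEL_NAMES = [
    "TMP_UM_CS_S_CHANNEL_P_AU_NEW_RECORDS",
    "TMP_UM_PO_B_PO_HEADER_NEW_RECORDS",
    "TMP_UM_PO_B_PURCHASE_ORDER_DETAIL_NEW_RECORDS",
    "TMP_UM_QU_B_QUESTIONNAIRE_DETAIL_P_AU_NEW_RECORDS",
    "TMP_UM_QU_B_QUESTIONNAIRE_HEADER_P_AU_NEW_RECORDS",
    "TMP_UM_QU_S_QUESTIONNAIRE_ANSWER_P_AU_NEW_RECORDS",
    "TMP_UM_SH_B_SHOP_GOLIVES_P_AU_NEW_RECORDS",
]


def build_select(model_names: Iterable[str]) -> str:
    """Join model names into one space-separated dbt selection string."""
    return " ".join(model_names)


def load_dbt_assets(project_dir: str):
    """Load all listed models with a single loader call (hypothetical helper)."""
    # Imported inside the function so build_select can be used
    # even where dagster-dbt is not installed.
    from dagster_dbt import load_assets_from_dbt_project

    # One loader call means one `dbt ls` per code load instead of one per model.
    return load_assets_from_dbt_project(
        project_dir=project_dir,
        select=build_select(MODEL_NAMES),
    )
```

Calling `load_dbt_assets(DBT_PROJECT_DIR)` once would then replace the seven separate assignments.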

Lorenzo

01/17/2023, 12:18 PM

I tried to import all the models at once and it was indeed quick. The problem with that is: say you have 4 models in a group, each referencing the previous one (first -> second -> third -> fourth). The third one fails and the fourth one is skipped. At this point, I cannot correct the third model and then materialize only the third and fourth. I can correct the third model, but then the materialization restarts from the beginning, and dagster re-materializes first -> second -> third -> fourth, even though first and second don't need to be run again.
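The subset described here (the failed model plus everything downstream of it) can be sketched in plain Python. This only illustrates the selection logic and is not how dagster or dbt implement it; the lineage dictionary uses the placeholder names from the message, and `models_to_rerun` is a hypothetical helper.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical lineage matching the message: first -> second -> third -> fourth.
DOWNSTREAM: Dict[str, List[str]] = {
    "first": ["second"],
    "second": ["third"],
    "third": ["fourth"],
    "fourth": [],
}


def models_to_rerun(failed: str, downstream: Dict[str, List[str]]) -> List[str]:
    """Return the failed model plus all of its descendants, breadth-first."""
    seen: Set[str] = {failed}
    order: List[str] = [failed]
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for child in downstream.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


print(models_to_rerun("third", DOWNSTREAM))  # ['third', 'fourth']
```

In dbt's own selection syntax the same subset is written `third+`, which selects a model together with all of its descendants.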

Jonathan Neo

01/17/2023, 12:55 PM

Ah I see. I would look at using a build_asset_reconciliation_sensor. What the reconciliation sensor does is that it materializes only model_4 when model_3 is fixed. I have an example here in a toy project: https://github.com/jonathanneo/my-dbt-dagster/blob/578ff10b9c1a4478f5e5462e7aa5d3ff2a4e07e7/stargazer/assets_modern_data_stack/my_asset.py#L97-L99

Jonathan Neo

01/17/2023, 12:57 PM

Usage docs here: https://docs.dagster.io/_apidocs/schedules-sensors#dagster.build_asset_reconciliation_sensor

Lorenzo

01/17/2023, 1:37 PM

Wow, thank you so much @Jonathan Neo!! I am looking into it right now. Hopefully it is going to fix my problems.

Adam Bloom

01/17/2023, 3:16 PM

The other solution here is to switch to the load_assets_from_dbt_manifest loader instead of the one you’re currently using (https://docs.dagster.io/_apidocs/libraries/dagster-dbt#dagster_dbt.load_assets_from_dbt_manifest). This requires you to run dbt ls yourself (i.e. during your user code deployment container build) and then reuses the output for every dbt asset.
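A minimal sketch of the manifest-based approach might look like the following. `read_manifest` and `load_dbt_assets` are hypothetical helper names, and the sketch assumes a dagster-dbt version that still provides `load_assets_from_dbt_manifest` and accepts the parsed manifest as its first argument, as at the time of this thread.

```python
import json
from pathlib import Path
from typing import Any, Dict


def read_manifest(manifest_path: str) -> Dict[str, Any]:
    """Read a manifest.json that dbt wrote ahead of time (e.g. at image build)."""
    return json.loads(Path(manifest_path).read_text(encoding="utf-8"))


def load_dbt_assets(manifest_path: str):
    """Build dbt assets from a pre-generated manifest (hypothetical helper)."""
    # Imported inside the function so read_manifest can be used
    # even where dagster-dbt is not installed.
    from dagster_dbt import load_assets_from_dbt_manifest

    # No dbt subprocess is started here: the loader only reads the parsed JSON.
    return load_assets_from_dbt_manifest(read_manifest(manifest_path))
```

The manifest would be produced beforehand, for example by the `dbt ls` run during the container build mentioned above. dbt writes it to `target/manifest.json` inside the project by default, but the actual path depends on the project configuration.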

❤️ 1

Lorenzo

01/18/2023, 1:17 PM

@owen Hi owen, I've tagged you here because I'd like to know why dagster runs

/usr/bin/python3 /home/lorenzo/.local/bin/dbt --no-use-color --log-format json ls --project-dir /home/lorenzo/Documents/GitHub/dagster-dbt-test/dbt_python_assets/dbt_python_assets/../UM_FOX_AU-dbt/dbt --profiles-dir /home/lorenzo/Documents/GitHub/dagster-dbt-test/dbt_python_assets/dbt_python_assets/../UM_FOX_AU-dbt/dbt/config --select TMP_UM_SH_B_SHOP_HIERARCHY_P_AU_UPDATE --output json

for each and every asset during my run. Keep in mind that I imported every model as a separate asset to be able to restart the DAGs with maximum granularity. It looks a bit strange, because it runs this command for each asset when the code is imported, and then it repeats the same thing for each asset (again) when I run a DAG. Thank you!

Qwame

01/18/2023, 5:52 PM

+1 to this. My dagster project runs the dbt ls command for any asset that I materialize, even if it's not a dbt asset.

Adam Bloom

01/18/2023, 5:57 PM

You should probably be using load_assets_from_dbt_manifest rather than load_assets_from_dbt_project; see my comment above.

❤️ 1

Qwame

01/18/2023, 5:59 PM

@Adam Bloom I get that. But I'm just wondering if it's necessary to run dbt ls on each asset materialization.

❤️ 1

Adam Bloom

01/18/2023, 6:00 PM

It is triggered whenever load_assets_from_dbt_project is invoked. You won't see it happening on each startup with load_assets_from_dbt_manifest.

Qwame

01/18/2023, 6:02 PM

It's okay for it to be triggered on each startup. But to be triggered before each asset (even non-dbt assets) materializes is just too much, I think.

Qwame

01/18/2023, 6:05 PM

Unless it is behaving like dbt which parses the models to determine the dependencies and lineages before each run.

owen

01/18/2023, 6:05 PM

@Lorenzo just loading your entire project with a single call to load_assets_from_dbt* will allow you to execute any subset of dbt models, so loading each model as a separate call is not recommended and doesn't have a real benefit.

owen

01/18/2023, 6:07 PM

@Qwame this happens because dagster needs to load your repository in order to execute any step of your job. When you use load_assets_from_dbt_project, that means that in order to load your repository code, dagster will need to run dbt ls (there's no way to load just the subset of the repository that is unrelated to dbt). I'd definitely endorse @Adam Bloom’s suggestion of using load_assets_from_dbt_manifest for this case.
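A back-of-the-envelope count shows how this adds up. It assumes that every code load re-runs each load_assets_from_dbt_project call, and that the code is loaded once at startup and once more per step. The second assumption is a simplification, since the real number of loads depends on the executor and deployment. `dbt_ls_invocations` is a hypothetical helper.

```python
def dbt_ls_invocations(load_calls: int, code_loads: int) -> int:
    """Each code load re-runs every loader call, and each call runs `dbt ls` once."""
    return load_calls * code_loads


# Assumed: one load at startup plus one per step of a four-step run.
CODE_LOADS = 1 + 4

# Seven per-model loader calls, as in the snippet earlier in the thread.
print(dbt_ls_invocations(load_calls=7, code_loads=CODE_LOADS))  # 35

# A single loader call covering the whole project.
print(dbt_ls_invocations(load_calls=1, code_loads=CODE_LOADS))  # 5

# A pre-generated manifest: no loader call shells out to dbt at load time.
print(dbt_ls_invocations(load_calls=0, code_loads=CODE_LOADS))  # 0
```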

🌈 1

Lorenzo

01/19/2023, 8:50 AM

Thank you for the explanations guys, your interest is really appreciated! Thanks to @owen and @Adam Bloom in particular. To be fair, I did try the import from project and from manifest before, but for some reason, I could not restart only the failed jobs. Probably I experimented with so many different elements that I got confused 🤕.
