# ask-community
b
So I am posting in the common #dagster-support channel, but my concern touches #dagster-dbt and #dagster-kubernetes too. The attached image shows our repository code organization, with the following features:
1. Multiple modules/packages, each with its own workflow (Dagster repository)
2. There are actually many more packages under `workflows` than the image shows; I have trimmed it for size
3. A large DBT project shared between all the workflows
4. Each package/repository loads its own set of DBT assets
5. All repositories are imported by the `workflows` top-level package within its `__init__.py` file (sketched below)
6. All code is packaged in a single Docker image
7. Deployed in Kubernetes via just one `deployments` entry and started with `--module-name src.workflows`
Status: Currently it starts correctly, all project assets are displayed correctly, and it works.
Problem: Whenever an asset materialization or a job is started, a Kubernetes job is created and the Kubernetes pod loads all of my assets and pre-emptively executes a `dbt run` command for each asset of every repository in every package. It takes almost 10 minutes before it actually executes the asset materialization that was desired in the first place. Any recommendations or pointers on how I can improve this and what I am doing wrong?
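For reference, the top-level `src/workflows/__init__.py` is essentially just a flat import of every repository, along these lines (trimmed sketch; sub-package and repository names other than the ones mentioned in this thread are made up):
Copy code
# src/workflows/__init__.py (trimmed; the real file imports all 15 repositories)
# Dagster picks up the @repository definitions it finds at module scope,
# so importing them here is what lets --module-name src.workflows load them all.
from .participant_summary_data import participant_summary_data_repository
from .wear import wear_repository  # illustrative name
# ... 13 more repository imports ...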
t
Hi! This is an interesting situation. Aside from the `dbt run` initiated during the asset materialization, are you triggering a `dbt run` anywhere else? I'm currently trying to wrap my head around this. Is the 10 minutes because that's how long the `dbt run` takes to finish? And what do you mean by the run being pre-emptive?
b
As in, if the job is created to execute this asset materialization:
Copy code
@asset()
def my_asset():
    ...  # asset logic goes here
Shouldn't it just execute this method? Why does it have to pre-load all repositories and load all the dbt assets of the other packages?
So if I have an asset `inventory` in the package/module `wear`: when I materialize the `inventory` asset, the Kubernetes job that is kicked off to materialize `inventory` first loads all the other repositories, like below
Copy code
@repository
def participant_summary_data_repository():
    print("starting the repository: participant_summary_data")
    return [
        *with_resources(
            dbt_assets,
            resource_defs={"dbt": dbt_cli_resource.configured(warehouse.value)},
        ),
        transform_job,
        participant_summary_data_schedule,
        sensor_job,
    ]
I have 15 such repositories since it's a mono-repo codebase, so I get 15 messages in the console:
Copy code
starting the repository: participant_summary_data
starting the repository: participant_something_other
starting the repository: another_repository_data
before the actual code of `inventory` is even executed
t
Hmm, while I'm still trying to grasp this, can I verify something? Is it doing a `dbt run` or a `dbt compile` when it spins up? When loading definitions/repositories, `load_assets_from_dbt_project` has to compile the dbt project to understand what the models are. However, if that's the bottleneck, you can use `load_assets_from_dbt_manifest` and point it at a pre-compiled dbt project's `manifest.json` to skip the compile. This might solve your issue.
b
Checking, I could be mistaken.
Copy code
dbt_assets = load_assets_from_dbt_project(
    project_dir=DBT_PROJECT_DIR,
    profiles_dir=DBT_PROFILES_DIR,
    select=dbt_asset_selects,
    node_info_to_asset_key=lambda node_info: get_node_asset_key(node_info),
    node_info_to_group_fn=lambda node_info: get_node_group_name(node_info),
    use_build_command=True
)
That's my dbt asset loading for each of the 15 repositories
🙇🏽 1
I am truly ashamed if the bow is a sarcastic one 🙂
t
Ahahaha, not sarcastic, I just appreciate you checking!
But yeah, let me know if it's kicking off a full dbt run or if it's just a bunch of really slow dbt compiles. If it's the second one, then using `load_assets_from_dbt_manifest` will speed that up significantly for you.
b
I checked. It's a build/compile, not a run; the run only happens when I materialize the dbt asset. The Python code below is in `src/workflows/wear/assets/inventories.py`. When I materialize `raw_inventory`, the Kubernetes job starts up and first loads all 15 of my repository modules, even though they are not in any way associated with `inventories.py`. Why should it load all 15 repositories and not just the one module, `wear`?
t
Hmm, that gets into the nitty-gritty of how repositories and code locations work, which I'm not that familiar with. That being said, would you be interested in trying out the manifest route? The easy way to try it out is to run a `dbt compile` locally, put the resulting `manifest.json` files into your Dagster project, and swap out the `load_assets_from_*` calls to point at those files instead.
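If it helps, the swap would look roughly like this (untested sketch reusing the names from your earlier snippet; the manifest path is just an example, and it's worth double-checking the arguments against the dagster-dbt version you're on):
Copy code
import json
import os

from dagster_dbt import load_assets_from_dbt_manifest

# Pre-compiled manifest shipped with the image / checked into the project.
# The location here is just an example.
MANIFEST_PATH = os.path.join(DBT_PROJECT_DIR, "target", "manifest.json")

with open(MANIFEST_PATH) as f:
    manifest_json = json.load(f)

dbt_assets = load_assets_from_dbt_manifest(
    manifest_json,
    select=dbt_asset_selects,
    node_info_to_asset_key=lambda node_info: get_node_asset_key(node_info),
    node_info_to_group_fn=lambda node_info: get_node_group_name(node_info),
    use_build_command=True,
)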
b
Below is my actual log
The job logic itself takes just 1.27 seconds, but the total execution time goes well beyond 5-7 minutes.
And the repository-started logs are somehow appearing twice for this run.
And to answer your question: if using manifests cuts the time down considerably, I would very much like to go the manifest route.
So I added code to generate the manifest file at startup and to load assets only via manifests. Is this hacky? Is there a better way to achieve this? It's in the module's `__init__.py` file.
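Roughly the idea, heavily simplified (the compile step and paths here are placeholders, not my exact code):
Copy code
# In the module's __init__.py: compile the shared dbt project once if no
# manifest has been cached yet, then reuse the parsed manifest everywhere.
import json
import os
import subprocess

MANIFEST_PATH = os.path.join(DBT_PROJECT_DIR, "target", "manifest.json")

def _ensure_manifest() -> dict:
    # Only compile when a manifest isn't already present (the caching part).
    if not os.path.exists(MANIFEST_PATH):
        subprocess.run(
            ["dbt", "compile", "--project-dir", DBT_PROJECT_DIR, "--profiles-dir", DBT_PROFILES_DIR],
            check=True,
        )
    with open(MANIFEST_PATH) as f:
        return json.load(f)

# Each repository module then passes this to load_assets_from_dbt_manifest
# instead of calling load_assets_from_dbt_project.
manifest_json = _ensure_manifest()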
t
Hi! I wouldn't call this hacky, it's really quite robust. But rather than building the manifest on every instantiation, would you be able to build it separately whenever a change to dbt happens? Then build the `manifest.json`, dump it into some type of file storage, and pull it in at Dagster's build time?
I do like the caching logic that you have set up!! It's quite slick!