# ask-community
b
So I am posting in the common #dagster-support channel, but my concern touches #dagster-dbt and #dagster-kubernetes too. The attached image shows our repository code organization, with the following features:
1. Multiple modules/packages, each with its own workflow (Dagster repository)
2. There are actually many more packages under `workflows` than the image shows; I have trimmed it for size
3. A large DBT project shared between all the workflows
4. Each package/repository loads its own set of DBT assets
5. All repositories are imported by the `workflows` top-level package within its `__init__.py` file (sketched below)
6. All code is packaged in a single Docker image
7. Deployed in Kubernetes via just one `deployments` entry and started with `--module-name src.workflows`
Status: Currently it starts correctly, all project assets are displayed correctly, and it works.
Problem: Whenever an asset materialization or a job is started, a Kubernetes job is created and the Kubernetes pod loads all of my assets and pre-emptively executes a `dbt run` command for each asset of every repository in every package. It takes almost 10 minutes before it actually executes the asset materialization that was desired in the first place. Any recommendations or pointers on how I can improve this and what I am doing wrong?
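For reference, the top-level `src/workflows/__init__.py` is essentially just a flat import of every repository, along these lines (trimmed sketch; sub-package and repository names other than the ones mentioned in this thread are made up):
Copy code
# src/workflows/__init__.py (trimmed; the real file imports all 15 repositories)
# Dagster picks up the @repository definitions it finds at module scope,
# so importing them here is what lets --module-name src.workflows load them all.
from .participant_summary_data import participant_summary_data_repository
from .wear import wear_repository  # illustrative name
# ... 13 more repository imports ...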
t
Hi! This is an interesting situation. Aside from the `dbt run` initiated during the asset materialization, are you triggering a `dbt run` anywhere else? I'm currently trying to wrap my head around this. Is the 10 minutes because that's how long the `dbt run` takes to finish? And what do you mean by the run being pre-emptive?
b
As in, if the job is created to execute this asset materialization:
Copy code
@asset()
def my_asset():
    ...  # asset logic goes here
Shouldn't it just execute this method? Why does it have to pre-load all repositories and load all the dbt assets of the other packages?
So if I have an asset `inventory` in the package/module `wear`: when I materialize the `inventory` asset, the Kubernetes job that is kicked off to materialize `inventory` first loads all the other repositories, like below
Copy code
@repository
def participant_summary_data_repository():
    print("starting the repository: participant_summary_data")
    return [
        *with_resources(
            dbt_assets,
            resource_defs={"dbt": dbt_cli_resource.configured(warehouse.value)},
        ),
        transform_job,
        participant_summary_data_schedule,
        sensor_job,
    ]
I have 15 such repositories since it's a mono-repo codebase, so I get 15 messages in the console:
Copy code
starting the repository: participant_summary_data
starting the repository: participant_something_other
starting the repository: another_repository_data
before the actual code of `inventory` is even executed
t
Hmm, while I'm still trying to grasp this, can I verify something? Is it doing a `dbt run` or a `dbt compile` when it spins up? When loading definitions/repositories, `load_assets_from_dbt_project` has to compile the dbt project to understand what the models are. However, if that's the bottleneck, you can use `load_assets_from_dbt_manifest` and point it at a pre-compiled dbt project's `manifest.json` to skip the compile. This might solve your issue.
b
Checking, I could be mistaken.
Copy code
dbt_assets = load_assets_from_dbt_project(
    project_dir=DBT_PROJECT_DIR,
    profiles_dir=DBT_PROFILES_DIR,
    select=dbt_asset_selects,
    node_info_to_asset_key=lambda node_info: get_node_asset_key(node_info),
    node_info_to_group_fn=lambda node_info: get_node_group_name(node_info),
    use_build_command=True
)
That's my dbt asset loading for each of the 15 repositories
🙇🏽 1
I am truly ashamed if the bow is a sarcastic one 🙂
t
Ahahaha, not sarcastic, I just appreciate you checking!
But yeah, let me know if it's kicking off a full dbt run or if it's just a bunch of really slow dbt compiles. If it's the second one, then using `load_assets_from_dbt_manifest` will speed that up significantly for you.
b
I checked. It's a build/compile, not a run; the run only happens when I materialize the dbt asset. The Python code below is in `src/workflows/wear/assets/inventories.py`. When I materialize `raw_inventory`, the Kubernetes job starts up and first loads all 15 of my repository modules, even though they are not in any way associated with `inventories.py`. Why should it load all 15 repositories and not just the one module, `wear`?
t
Hmm, that gets into the nitty-gritty of how repositories and code locations work, which I'm not that familiar with. That being said, would you be interested in trying out the manifest route? The easy way to try it out is to run a `dbt compile` locally, put the resulting `manifest.json` files into your Dagster project, and swap out the `load_assets_from_*` calls to point at those files instead.
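If it helps, the swap would look roughly like this (untested sketch reusing the names from your earlier snippet; the manifest path is just an example, and it's worth double-checking the arguments against the dagster-dbt version you're on):
Copy code
import json
import os

from dagster_dbt import load_assets_from_dbt_manifest

# Pre-compiled manifest shipped with the image / checked into the project.
# The location here is just an example.
MANIFEST_PATH = os.path.join(DBT_PROJECT_DIR, "target", "manifest.json")

with open(MANIFEST_PATH) as f:
    manifest_json = json.load(f)

dbt_assets = load_assets_from_dbt_manifest(
    manifest_json,
    select=dbt_asset_selects,
    node_info_to_asset_key=lambda node_info: get_node_asset_key(node_info),
    node_info_to_group_fn=lambda node_info: get_node_group_name(node_info),
    use_build_command=True,
)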
b
Below is my actual log
The job logic itself takes just 1.27 seconds, but the total execution time goes well beyond 5-7 minutes.
And the repository-started logs are somehow appearing twice for this run.
And to answer your question: if using manifests cuts the time down considerably, I would very much like to go the manifest route.
So I added code to generate the manifest file at startup and to load assets only via manifests. Is this hacky? Is there a better way to achieve this? It's in the module's `__init__.py` file.
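Roughly the idea, heavily simplified (the compile step and paths here are placeholders, not my exact code):
Copy code
# In the module's __init__.py: compile the shared dbt project once if no
# manifest has been cached yet, then reuse the parsed manifest everywhere.
import json
import os
import subprocess

MANIFEST_PATH = os.path.join(DBT_PROJECT_DIR, "target", "manifest.json")

def _ensure_manifest() -> dict:
    # Only compile when a manifest isn't already present (the caching part).
    if not os.path.exists(MANIFEST_PATH):
        subprocess.run(
            ["dbt", "compile", "--project-dir", DBT_PROJECT_DIR, "--profiles-dir", DBT_PROFILES_DIR],
            check=True,
        )
    with open(MANIFEST_PATH) as f:
        return json.load(f)

# Each repository module then passes this to load_assets_from_dbt_manifest
# instead of calling load_assets_from_dbt_project.
manifest_json = _ensure_manifest()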
t
Hi! I wouldn't call this hacky, it's really quite robust. But rather than building the manifest on every instantiation, would you be able to build it separately whenever a change to dbt happens? Then build the `manifest.json`, dump it into some type of file storage, and pull it in at Dagster's build time?
I do like the caching logic that you have set up!! It's quite slick!