Binoy Shah
04/20/2023, 7:03 PM
My top-level package is `workflows` (there is more than what the image shows; I have trimmed it for size):
3. Shared large dbt project between all the workflows
4. Each package's repository loads its own set of dbt assets
5. All repositories are imported by the `workflows` top-level package within its `__init__.py` file
6. All code packaged in a single Docker image
7. Deployed in Kubernetes via just one Deployment entry, and started with `--module-name src.workflows`
Status: currently it starts correctly, all project assets are displayed correctly, and it works.

Problem: whenever an asset materialization or a job is started, a Kubernetes Job is created, and its pod loads all of my assets and pre-emptively executes the `dbt run` command for every asset of every repository in every package. It takes almost 10 minutes before it actually executes the asset materialization that was requested in the first place.

Any recommendations or pointers on how I can improve this, and what am I doing wrong?

Tim Castillo
04/20/2023, 7:21 PM
Is the `dbt run` initialized during the asset materialization, or are you triggering a `dbt run` elsewhere? I'm currently trying to wrap my head around this. Is the 10 minutes because that's how long it takes for the `dbt run` to finish? And what do you mean by the run being pre-emptive?

Binoy Shah
04/20/2023, 7:58 PM

    @asset()
    def my_asset():
        ...  # do logic

Shouldn't it just execute this method? Why does it have to pre-load all repositories and load all the dbt assets of the other packages?

Binoy Shah
04/20/2023, 8:02 PM
I have an asset `inventory` in package/module `wear`. When I materialize the `inventory` asset, the Kubernetes Job that is kicked off to materialize `inventory` first loads all the other repositories, like below:
    @repository
    def participant_summary_data_repository():
        print("starting the repository: participant_summary_data")
        return [
            with_resources(
                dbt_assets,
                resource_defs={
                    "dbt": dbt_cli_resource.configured(warehouse.value),
                },
            )
        ] + [transform_job] + [participant_summary_data_schedule] + [sensor_job]

I have 15 such repositories, since it's a mono-repo codebase.

Binoy Shah
04/20/2023, 8:03 PM

    starting the repository: participant_summary_data
    starting the repository: participant_something_other
    starting the repository: another_repository_data
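[Editor's sketch] The startup lines above are load-time evaluation at work. The following is a minimal, stdlib-only stand-in (not Dagster's actual implementation) that models why every repository body runs at process start, before any single asset is materialized:

```python
REGISTRY = {}

def repository(fn):
    # Simplified stand-in for a @repository-style decorator: the body of
    # every decorated function is evaluated when the module is imported,
    # i.e. on every process start -- including a pod launched just to
    # materialize one asset from one repository.
    REGISTRY[fn.__name__] = fn()
    return fn

@repository
def participant_summary_data_repository():
    print("starting the repository: participant_summary_data")
    return ["dbt_assets"]

@repository
def another_repository_data():
    print("starting the repository: another_repository_data")
    return ["dbt_assets"]

# Both messages print at import time. If each body also compiles a dbt
# project, that cost is paid once per repository on every startup.
```

With 15 such repositories, anything expensive inside a repository body multiplies by 15 on every pod launch.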
Binoy Shah
04/20/2023, 8:03 PM
All of this happens before `inventory` is even executed.

Tim Castillo
04/20/2023, 8:04 PM
Is it a `dbt run` or a `dbt compile` when it spins up? When loading the definitions/repository, `load_assets_from_dbt_project` has to compile the dbt project to understand what the models are. However, if that's the bottleneck, then you can use `load_assets_from_dbt_manifest` and point it at a pre-compiled dbt project's `manifest.json` to skip needing to compile it. This might solve your issue.

Binoy Shah
04/20/2023, 8:05 PM

    dbt_assets = load_assets_from_dbt_project(
        project_dir=DBT_PROJECT_DIR,
        profiles_dir=DBT_PROFILES_DIR,
        select=dbt_asset_selects,
        node_info_to_asset_key=lambda node_info: get_node_asset_key(node_info),
        node_info_to_group_fn=lambda node_info: get_node_group_name(node_info),
        use_build_command=True,
    )

That's my dbt asset definition for each of the 15 repositories.

Binoy Shah
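[Editor's sketch] Tim's `load_assets_from_dbt_manifest` suggestion works because a pre-compiled `manifest.json` already lists every model, so enumerating assets becomes pure JSON parsing instead of launching a dbt process per repository. A hedged, stdlib-only illustration (the manifest below is a hypothetical, heavily trimmed version of the real file):

```python
import json

# Hypothetical, heavily trimmed dbt manifest.json -- the real artifact
# carries far more metadata per node.
manifest_text = json.dumps({
    "nodes": {
        "model.analytics.inventory": {"resource_type": "model", "name": "inventory"},
        "model.analytics.orders": {"resource_type": "model", "name": "orders"},
        "seed.analytics.country_codes": {"resource_type": "seed", "name": "country_codes"},
    }
})

def model_names(manifest: dict) -> list:
    # Enumerating models from a pre-compiled manifest is pure JSON work:
    # no dbt process is spawned, which is the saving versus compiling the
    # project inside every repository at load time.
    return sorted(
        node["name"]
        for node in manifest["nodes"].values()
        if node["resource_type"] == "model"
    )

print(model_names(json.loads(manifest_text)))  # → ['inventory', 'orders']
```

In the real project, the parsed manifest dict would be handed to dagster-dbt's `load_assets_from_dbt_manifest` in place of the `load_assets_from_dbt_project` call shown above.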
04/20/2023, 8:13 PM
The asset lives in `src/workflows/wear/assets/inventories.py`. When I materialize `raw_inventory`, the Kubernetes job starts up and first loads all 15 of my repository modules, even though they are not in any way associated with `inventories.py`.
Binoy Shah
04/20/2023, 8:13 PM
`wear`

Tim Castillo
04/20/2023, 8:19 PM
Have you tried it the `manifest` way? The easy way to try it out is to run a `dbt compile` locally, put those `manifest.json`s into your Dagster project, and swap out the `load_assets_from_*` calls to point at those files instead.

Binoy Shah
04/21/2023, 1:38 PM
[…] the `__init__.py` file

Tim Castillo
04/21/2023, 6:01 PM
Could you generate the `manifest.json`, dump it into some type of file storage, then pull it at Dagster's build time?
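[Editor's sketch] Tim's file-storage idea can be sketched as follows. The URL, paths, and helper name are hypothetical; the point is that each pod downloads one pre-compiled manifest instead of running `dbt compile` in all 15 repositories:

```python
import json
import urllib.request
from pathlib import Path

# Hypothetical location of the manifest that an upstream CI job uploaded
# after running `dbt compile`; swap in your real object store / artifact URL.
MANIFEST_URL = "https://artifacts.example.com/dbt/manifest.json"
LOCAL_CACHE = Path("/tmp/dbt_manifest.json")

def fetch_manifest() -> dict:
    # Download once per process start; every repository then parses the
    # same cached file instead of invoking `dbt compile` itself.
    if not LOCAL_CACHE.exists():
        urllib.request.urlretrieve(MANIFEST_URL, LOCAL_CACHE)
    return json.loads(LOCAL_CACHE.read_text())
```

The returned dict could then be passed to dagster-dbt's `load_assets_from_dbt_manifest` in each repository definition.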