
Robert Wade

04/05/2023, 5:14 PM
After using Dagster for a long period of time I am still trying to understand the preferred approach for creating asset-based jobs, specifically with how configuration is supposed to occur. Here is a situation: Imagine we have a daily-partitioned asset-based job:
@asset(config_schema={"first_foo": str}, …)
def first_asset(…):
    …

@asset(config_schema={"second_foo": str}, …)
def second_asset(first_asset):
    …

my_job = define_asset_job("my_job", selection=["*second_asset"], config=<config for first and second asset>, …)

my_sched = build_schedule_from_partitioned_job(my_job, …)
All of this works great. Now let’s imagine that things in the data engineering world change, specifically my first_asset now has a dependency on an asset from a different code location (perhaps built by a different team).
other_asset = SourceAsset(key=AssetKey("another_teams_asset"))

@asset(config_schema={"first_foo": str}, …)
def first_asset(another_teams_asset):
    …
This now requires me to update my job and include config for another_teams_asset. Now let’s imagine that the other team goes through a variety of iterations and another_teams_asset suddenly has a variety of assets that it depends on. Am I expected to monitor all of these iterations/changes and continue to update the config for my job?

sandy

04/05/2023, 7:06 PM
Hi Robert - quick suggestion: you can use three backticks to create codeblocks that span multiple lines, like this:
line 1
line 2

Robert Wade

04/05/2023, 7:07 PM
Sure. I was just using the code block icon on the slack UI. Shall I edit my original post?

sandy

04/05/2023, 7:07 PM
No worries, just for the future
On to your actual question: Would you want this configuration to be used any time that the asset is materialized? I.e. what if you went to the asset graph and clicked the Materialize button, outside of your job?

Robert Wade

04/05/2023, 7:09 PM
If we want to materialize in the UI then we materialize at the job level -- which would then cause the config (yml) to be loaded
But my point is this: if you have a schedule that runs a job that materializes a set of assets, and your first asset (or any, really) depends on an asset that itself may depend on any number of other assets, is it suddenly your responsibility to load all of that configuration into your job? That could easily become unmanageable.
In order to further understand this scenario (and to possibly help someone who finds this thread) I created a sample app. I have two code locations, each with 2 assets, 1 job, and 1 daily-partitioned schedule. At first these two code locations had no relationship to each other. As expected, each job can be executed for any of the partitioned dates.
Next I created a dependency: the 2nd code location's first asset takes as input the last asset of the 1st code location. When I run the 2nd code location's job for a particular partition, the job fails since the asset within the 1st code location has not been materialized. That makes sense.
Here is where things don't work. I went into the 1st code location and ran its job for a specific date. I then ran the 2nd code location's job for the same date. It still failed. Clearly this is bad.
So in summary, if an asset relies on an asset from another code location then running the downstream job won't automatically materialize that upstream asset. Makes sense. However, if you manually materialize the upstream asset, the downstream job still fails.

sandy

04/06/2023, 5:23 PM
Your job should only need to include configuration for the assets that are materialized inside it, not the assets that are upstream of it. If that's not the behavior you're experiencing, would you mind sharing a code snippet that reproduces it? It could be a bug.

Robert Wade

04/06/2023, 5:46 PM
Yes I can share all this code. I think the main thing that I have demonstrated and now understand better is that if an asset references an upstream asset from another code location then it won't automatically attempt to materialize the upstream asset. It will itself fail since the upstream dependency does not exist. However, what I have also uncovered is that if the upstream dependency DOES exist, it still fails. What's the best way to share the code? I have 2 tiny little directories (one per code location).

sandy

04/06/2023, 5:51 PM
> However, what I have also uncovered is that if the upstream dependency DOES exist, it still fails.
What error are you seeing in this case?

Robert Wade

04/06/2023, 5:54 PM
thumbnail_image001-4.png

sandy

04/06/2023, 5:54 PM
is the asset in the upstream code location partitioned?

Robert Wade

04/06/2023, 5:54 PM
yes
as are the ones in the downstream

sandy

04/06/2023, 5:55 PM
something that could be causing that issue is if the downstream code location isn't aware that the upstream asset is partitioned. is there a `partitions_def` on the `SourceAsset` in the downstream repo? if not, adding one might fix the problem.

Robert Wade

04/06/2023, 5:55 PM
ok let me try that
That solved the problem. Thank you very much. I am attaching a zip with the code in case someone may find it useful. In terms of my original problem (how to execute some downstream assets that rely on the upstream assets), I think the solution is going to be to build an asset reconciliation sensor that watches the upstream assets and, if/when they are materialized, does the same for the downstream assets. (I hope this works with partitions.)

sandy

04/06/2023, 8:46 PM
Awesome - note that, at the moment, asset reconciliation sensors don't work well across code locations, but @johann is working on changes that will address this in 1-2 weeks