# ask-community
r
I've installed the `assets_dbt_python` example to play around with the Freshness and Auto-materialize policies, but can't find my way past a weird issue:
• the asset `predicted_orders` (I've added -> Freshness policy: none, Auto-materialize policy: eager) depends on both:
◦ `daily_order_summary` (I've added -> Freshness policy: none, Auto-materialize policy: eager)
◦ `order_forecast_model` (I've added -> Freshness policy: by 06:00 PM UTC, this asset should incorporate all data up to 24 hours before that time; Auto-materialize policy: lazy)
▪︎ `order_forecast_model` in this example DAG also depends on `daily_order_summary`
When upstream assets materialize, `daily_order_summary` auto-materializes as expected, but `predicted_orders` does not, although its eager Auto-materialize policy should force it to. It looks like the Auto-materialize sensor doesn't see past `order_forecast_model`. My intuition is that the lazy Auto-materialize policy on this asset breaks the chain for all other assets downstream of its parent asset `daily_order_summary`. Is this indeed unexpected, or am I missing something less obvious?
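For reference, the graph and policies described above look roughly like this as plain Python assets (a sketch only -- the real example derives these assets from dbt models and ops, and their own upstream dependencies are omitted here):
```python
from dagster import AutoMaterializePolicy, FreshnessPolicy, asset

@asset(auto_materialize_policy=AutoMaterializePolicy.eager())
def daily_order_summary():
    ...

@asset(
    auto_materialize_policy=AutoMaterializePolicy.lazy(),
    # renders in the UI as: "By 06:00 PM UTC, this asset should
    # incorporate all data up to 24 hours before that time"
    freshness_policy=FreshnessPolicy(
        maximum_lag_minutes=24 * 60,
        cron_schedule="0 18 * * *",
    ),
)
def order_forecast_model(daily_order_summary):
    ...

@asset(auto_materialize_policy=AutoMaterializePolicy.eager())
def predicted_orders(daily_order_summary, order_forecast_model):
    ...
```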
c
Hi R. Amaral! Curious -- are these assets partitioned, and if so, how?
r
Hi @claire! No, none of these are partitioned. This is straight out of the example repo; I just added the eager Auto-materialize policy everywhere except for the `order_forecast_model` asset.
c
Hm... is `order_forecast_model` materializing when expected, as dictated by its freshness policy? My understanding is that `predicted_orders` should update when all of its parents are up to date, meaning that `order_forecast_model` must be materialized before `predicted_orders` can be materialized. cc @owen to confirm
o
hi! claire is correct here, and this is something that the description in the sidebar should clarify (cc @johann). in short, even with an `eager` auto-materialize policy, an asset will not be materialized until all of its parents are "up to date". in this case, that means assets with the "upstream data" indicator will not be auto-materialized. the reason it chooses not to materialize in this situation is that doing so would result in two different "versions" of `daily_order_summary` being consumed by `predicted_orders`: the first consumed directly, and the second consumed transitively through `order_forecast_model`. for this specific example, that would mean "predicting orders" based on today's order summary but yesterday's model.
in this week's release, a new UI will be available on the Asset Details page that helps explain why certain assets were skipped, which should help a lot in demystifying these decisions (which are quite opaque at the moment)
r
Thanks Claire and Owen! Looking forward to seeing the new release and getting more clarity on what is happening 👍🏻
My mental model for this was that an `eager` policy would be like an open valve, always letting the data flow through the pipeline steps, while a `lazy` policy would be a closed valve that can only be opened by a downstream process asking for up-to-date data. So it was weird not seeing the materializations cascade all through the "open valves".
Regarding the transitive consumption of models: if the same data `a` feeds an ML model training step `b` (that is run once a day) and should also flow (incrementally) through the prediction step `c` (which also depends on `b`) multiple times a day, how could this work without generating deadlocks? Would asset partitions be the way to solve this?
o
definitely an interesting analogy re: valves, and I think there are likely valid cases where you'd want that sort of behavior. however, as a default case, immediately firing when any of your parents are updated would generally lead to a lot of redundant work in diamond-shaped graphs (i.e. the top is updated, causing the right and left sides to both be updated -- assuming the right and left don't finish at the exact same time, we'll end up having to update the bottom twice in quick succession)
re: deadlocks -- just to confirm: in this example, it sounds like `a` is being executed multiple times a day (otherwise it wouldn't make sense to execute `c` on the same data). in that case, asset partitions sound like a reasonable solution -- `b` could be daily-partitioned, and `c` could have a `LastPartitionMapping` on `b` (i.e. `c` depends on the most recent day's partition of `b`). another option would be to forgo auto-materialization of `c` and just have an asset sensor which materializes `c` whenever `b` is updated. I do think there's merit in having an AutoMaterializePolicy that just doesn't care if the upstreams are up to date or not, and that could be a potential future third option
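A minimal sketch of both options, keeping the `a`/`b`/`c` names from the discussion (the job/sensor names and partition start date are illustrative, not from the example repo):
```python
from dagster import (
    AssetIn,
    AssetKey,
    DailyPartitionsDefinition,
    LastPartitionMapping,
    RunRequest,
    asset,
    asset_sensor,
    define_asset_job,
)

@asset
def a():
    ...

# Option 1: daily-partition b, and have c always read b's latest partition.
@asset(partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"))
def b(a):
    ...

@asset(ins={"b": AssetIn(partition_mapping=LastPartitionMapping())})
def c(b):
    ...

# Option 2: forgo auto-materializing c and instead trigger it from an
# asset sensor that fires whenever b is materialized.
materialize_c = define_asset_job("materialize_c", selection="c")

@asset_sensor(asset_key=AssetKey("b"), job=materialize_c)
def b_updated_sensor(context, asset_event):
    yield RunRequest(run_key=context.cursor)
```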
r
I imagine this could be handled declaratively, e.g. by having the AutoMaterializePolicy accept an argument for `any`, `all`, or a subset of upstream assets to select/exclude in its evaluation:
• `any` - always trigger whenever any of the immediate upstream assets are refreshed
• `all` - wait for all of the immediate upstream assets to have a more recent materialization than the asset being evaluated (I know this one gets a bit into the territory of the FreshnessPolicy, but it could be complementary to it)
• `subset` - select or exclude a list of immediate upstream assets to define the subset the sensor will look at in its evaluation
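Purely as a way to picture that proposal (the `on_parents` argument is hypothetical and does not exist in Dagster):
```python
# Hypothetical API -- `on_parents` is not a real Dagster parameter;
# it only illustrates the proposal above.
AutoMaterializePolicy.eager(on_parents="any")                    # fire on any parent refresh
AutoMaterializePolicy.eager(on_parents="all")                    # wait for every parent
AutoMaterializePolicy.eager(on_parents=["daily_order_summary"])  # watch only a subset
```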