R. Amaral Vieira
05/29/2023, 9:21 PM
predicted_orders (I've added -> Freshness policy: none, Auto-materialize policy: eager) depends on both:
◦ daily_order_summary (I've added -> Freshness policy: none, Auto-materialize policy: eager)
◦ order_forecast_model (I've added -> Freshness policy: By 06:00 PM UTC, this asset should incorporate all data up to 24 hours before that time, Auto-materialize policy: lazy)
▪︎ In this example DAG, order_forecast_model also depends on daily_order_summary.
When upstream assets materialize, daily_order_summary auto-materializes as expected, but predicted_orders does not, although its eager Auto-materialize policy should force it to.
It looks like the Auto-materialize sensor doesn't see past order_forecast_model. My intuition is that the lazy Auto-materialize policy for this asset breaks the chain for all other assets downstream of its parent asset, daily_order_summary. Is this indeed unexpected, or am I missing anything less obvious?
claire
05/30/2023, 8:47 PM
R. Amaral Vieira
05/31/2023, 6:31 PM
predicted_orders asset.
claire
05/31/2023, 8:29 PM
Is order_forecast_model materializing when expected, as dictated by its freshness policy? My understanding is that predicted_orders should update when all of its parents are up to date, meaning that order_forecast_model must be materialized before predicted_orders can be materialized.
cc @owen to confirm
owen
05/31/2023, 8:52 PM
With the eager auto-materialize policy, an asset will not be materialized until all of its parents are "up to date". In this case, that'd mean that assets with the "upstream data" indicator will not be auto-materialized.
The reason it chooses not to materialize in this situation is that it would result in two different "versions" of daily_order_summary being consumed by predicted_orders: the first directly, and the second transitively through order_forecast_model. For this specific example, it would result in us "predicting orders" based on today's order summary but yesterday's model.
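To make the skip rule concrete, here is a toy sketch in plain Python. It is not Dagster's implementation: the PARENTS map, the integer "versions" (which day's summary an asset has incorporated), and both helper functions are assumptions invented for illustration.

```python
# Toy model of the "parents must be up to date" rule described above.
# Hypothetical helpers for illustration only; this is not Dagster code.

PARENTS = {
    "daily_order_summary": [],
    "order_forecast_model": ["daily_order_summary"],
    "predicted_orders": ["daily_order_summary", "order_forecast_model"],
}


def is_up_to_date(asset, versions):
    """An asset is up to date if every parent is up to date and the asset
    has incorporated at least each parent's version."""
    return all(
        is_up_to_date(parent, versions) and versions[asset] >= versions[parent]
        for parent in PARENTS[asset]
    )


def can_eagerly_materialize(asset, versions):
    """Eager rule in this toy model: some parent has newer data,
    and no parent is itself waiting on upstream data."""
    parents = PARENTS[asset]
    parents_ok = all(is_up_to_date(parent, versions) for parent in parents)
    has_new_data = any(versions[parent] > versions[asset] for parent in parents)
    return parents_ok and has_new_data


# Day 2's summary has landed, but the lazy model is still on day 1.
versions = {
    "daily_order_summary": 2,
    "order_forecast_model": 1,
    "predicted_orders": 1,
}
skipped = not can_eagerly_materialize("predicted_orders", versions)

# Once the model refreshes on day 2 data, predicted_orders becomes eligible.
versions["order_forecast_model"] = 2
eligible = can_eagerly_materialize("predicted_orders", versions)
```

In this model, predicted_orders is skipped while order_forecast_model lags, because materializing it would mix the day 2 summary with a day 1 model.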
In this week's release, a new UI will be available on the Asset Details page that explains why certain assets were skipped, which should help a lot in demystifying the decisions (which are quite opaque at the moment).
R. Amaral Vieira
05/31/2023, 10:11 PM
eager policy would be like an open valve, always letting the data flow through the pipeline steps. A lazy policy would be a closed valve that can only be opened by a downstream process searching for up-to-date data. So it was weird not seeing the materializations cascade all the way through the "open valves".
Regarding the transitive consumption of models: if the same data a feeds an ML model training step b (that is run once a day) and it should also flow (incrementally) through the prediction step c (which also depends on b) multiple times a day, how could this work without generating deadlocks? Would asset partitions be the way to solve this?
owen
06/01/2023, 9:30 PM
owen
06/01/2023, 9:36 PM
a is being executed multiple times a day (otherwise it wouldn't make sense to execute c on the same data).
In that case, asset partitions sound like a reasonable solution: b could be daily-partitioned, and c could have a LastPartitionMapping on b (i.e. c depends on the most recent day's partition of b).
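As a rough illustration of the "most recent day's partition" idea, here is a plain-Python toy. It does not use Dagster's API: the b_partitions dict, the string payloads, and the run_c helper are all made up for the example.

```python
from datetime import date

# Hypothetical stand-in for b: one trained "model" per daily partition.
b_partitions = {
    date(2023, 5, 30): "model trained on data up to 05/30",
    date(2023, 5, 31): "model trained on data up to 05/31",
}


def latest_partition(partitions):
    """Return the value stored under the most recent partition key."""
    return partitions[max(partitions)]


def run_c(a_batch, partitions):
    """c consumes the current batch of a plus whatever the latest
    existing partition of b is; it never waits for b to catch up."""
    model = latest_partition(partitions)
    return f"predictions for {a_batch} using {model}"


morning = run_c("a@09:00", b_partitions)

# b's next daily partition lands later; c picks it up on its next run.
b_partitions[date(2023, 6, 1)] = "model trained on data up to 06/01"
evening = run_c("a@21:00", b_partitions)
```

The point of the sketch is that c only ever asks for the newest partition of b that exists, so a can refresh many times a day without c blocking on b.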
Another option would be to forgo auto-materialization of c and just have an asset sensor that materializes c whenever b is updated.
I do think there's merit in having an AutoMaterializePolicy that just doesn't care if the upstreams are up to date or not, and that could be a potential future third option.
R. Amaral Vieira
06/02/2023, 9:58 AM
any, all, or a subset of upstream assets to select/exclude in its evaluation:
any - triggers whenever any of the immediate upstream assets is refreshed
all - waits for all of the immediate upstream assets to have a more recent materialization than the asset being evaluated (I know this one gets a bit into the territory of the FreshnessPolicy, but it could be complementary to it)
subset - selects or excludes a list of immediate upstream assets to define the subset of assets the sensor will look at in its evaluation
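A minimal sketch of how these three options might be evaluated, in plain Python. This is a hypothetical helper, not an existing Dagster API. It assumes "refreshed" means "materialized more recently than the asset being evaluated", and it treats the subset option as include/exclude filters that combine with any/all.

```python
def should_trigger(asset_ts, upstream_ts, mode="any", include=None, exclude=()):
    """Decide whether an asset should re-materialize (hypothetical helper).

    asset_ts:    last materialization time of the asset being evaluated
    upstream_ts: dict of immediate upstream asset name -> last materialization time
    mode:        "any" or "all"
    include:     optional collection of upstream names to look at (None = all)
    exclude:     upstream names to ignore
    """
    # "subset": narrow down which immediate upstreams are considered.
    watched = {
        name: ts
        for name, ts in upstream_ts.items()
        if (include is None or name in include) and name not in exclude
    }
    newer = [ts > asset_ts for ts in watched.values()]
    if mode == "any":
        return any(newer)
    if mode == "all":
        # An empty watch list never triggers.
        return bool(newer) and all(newer)
    raise ValueError(f"unknown mode: {mode!r}")


# c was last materialized at t=10; a refreshed at t=12, b is older (t=8).
upstreams = {"a": 12, "b": 8}
any_result = should_trigger(10, upstreams, mode="any")
all_result = should_trigger(10, upstreams, mode="all")
all_without_b = should_trigger(10, upstreams, mode="all", exclude={"b"})
```

With these timestamps, "any" fires because a is newer than c, "all" waits because b is not, and excluding b makes "all" fire as well.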