# ask-community
r
I've installed the `assets_dbt_python` example to play around with the Freshness and Auto-materialize policies, but can't find my way past a weird issue:
• the asset `predicted_orders` (I've added -> Freshness policy: none, Auto-materialize policy: eager) depends on both:
◦ `daily_order_summary` (I've added -> Freshness policy: none, Auto-materialize policy: eager)
◦ `order_forecast_model` (I've added -> Freshness policy: by 06:00 PM UTC, this asset should incorporate all data up to 24 hours before that time; Auto-materialize policy: lazy)
▪︎ `order_forecast_model` in this example DAG also depends on `daily_order_summary`
When upstream assets materialize, `daily_order_summary` auto-materializes as expected, but `predicted_orders` does not, although its eager Auto-materialize policy should force it to. It looks like the Auto-materialize sensor doesn't see past `order_forecast_model`. My intuition is that the lazy Auto-materialize policy on this asset breaks the chain for all other assets downstream of its parent asset `daily_order_summary`. Is this indeed unexpected, or am I missing something less obvious?
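For reference, the graph and policies described above look roughly like this as plain Python assets (a sketch only -- the real example derives these assets from dbt models and ops, and their own upstream dependencies are omitted here):
```python
from dagster import AutoMaterializePolicy, FreshnessPolicy, asset

@asset(auto_materialize_policy=AutoMaterializePolicy.eager())
def daily_order_summary():
    ...

@asset(
    auto_materialize_policy=AutoMaterializePolicy.lazy(),
    # renders in the UI as: "By 06:00 PM UTC, this asset should
    # incorporate all data up to 24 hours before that time"
    freshness_policy=FreshnessPolicy(
        maximum_lag_minutes=24 * 60,
        cron_schedule="0 18 * * *",
    ),
)
def order_forecast_model(daily_order_summary):
    ...

@asset(auto_materialize_policy=AutoMaterializePolicy.eager())
def predicted_orders(daily_order_summary, order_forecast_model):
    ...
```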
c
Hi R. Amaral! Curious -- are these assets partitioned, and if so, how?
r
Hi @claire! No, none of these are partitioned. This is straight out of the example repo; I just added the eager Auto-materialize policy everywhere except for the `order_forecast_model` asset.
c
Hm... is `order_forecast_model` materializing when expected, as dictated by its freshness policy? My understanding is that `predicted_orders` should update when all of its parents are up to date, meaning that `order_forecast_model` must be materialized before `predicted_orders` can be materialized. cc @owen to confirm
o
hi! claire is correct here, and this is something that the description in the sidebar should clarify (cc @johann). in short, even with an `eager` auto-materialize policy, an asset will not be materialized until all of its parents are "up to date". in this case, that means assets with the "upstream data" indicator will not be auto-materialized. the reason it chooses not to materialize in this situation is that doing so would result in two different "versions" of `daily_order_summary` being consumed by `predicted_orders`: the first consumed directly, and the second consumed transitively through `order_forecast_model`. for this specific example, that would mean "predicting orders" based on today's order summary but yesterday's model.
in this week's release, a new UI will be available on the Asset Details page that helps explain why certain assets were skipped, which should help a lot in demystifying these decisions (which are quite opaque at the moment)
r
Thanks Claire and Owen! Looking forward to seeing the new release and getting more clarity on what is happening 👍🏻
My mental model for this was that an `eager` policy would be like an open valve, always letting the data flow through the pipeline steps, while a `lazy` policy would be a closed valve that can only be opened by a downstream process asking for up-to-date data. So it was weird not seeing the materializations cascade all through the "open valves".
Regarding the transitive consumption of models: if the same data `a` feeds an ML model training step `b` (that is run once a day) and should also flow (incrementally) through the prediction step `c` (which also depends on `b`) multiple times a day, how could this work without generating deadlocks? Would asset partitions be the way to solve this?
o
definitely an interesting analogy re: valves, and I think there are likely valid cases where you'd want that sort of behavior. however, as a default case, immediately firing when any of your parents are updated would generally lead to a lot of redundant work in diamond-shaped graphs (i.e. the top is updated, causing the right and left sides to both be updated -- assuming the right and left don't finish at the exact same time, we'll end up having to update the bottom twice in quick succession)
re: deadlocks -- just to confirm: in this example, it sounds like `a` is being executed multiple times a day (otherwise it wouldn't make sense to execute `c` on the same data). in that case, asset partitions sound like a reasonable solution -- `b` could be daily-partitioned, and `c` could have a `LastPartitionMapping` on `b` (i.e. `c` depends on the most recent day's partition of `b`). another option would be to forgo auto-materialization of `c` and just have an asset sensor which materializes `c` whenever `b` is updated. I do think there's merit in having an AutoMaterializePolicy that just doesn't care if the upstreams are up to date or not, and that could be a potential future third option
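A minimal sketch of both options, keeping the `a`/`b`/`c` names from the discussion (the job/sensor names and partition start date are illustrative, not from the example repo):
```python
from dagster import (
    AssetIn,
    AssetKey,
    DailyPartitionsDefinition,
    LastPartitionMapping,
    RunRequest,
    asset,
    asset_sensor,
    define_asset_job,
)

@asset
def a():
    ...

# Option 1: daily-partition b, and have c always read b's latest partition.
@asset(partitions_def=DailyPartitionsDefinition(start_date="2023-01-01"))
def b(a):
    ...

@asset(ins={"b": AssetIn(partition_mapping=LastPartitionMapping())})
def c(b):
    ...

# Option 2: forgo auto-materializing c and instead trigger it from an
# asset sensor that fires whenever b is materialized.
materialize_c = define_asset_job("materialize_c", selection="c")

@asset_sensor(asset_key=AssetKey("b"), job=materialize_c)
def b_updated_sensor(context, asset_event):
    yield RunRequest(run_key=context.cursor)
```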
r
I imagine this could be handled declaratively, e.g. by having the AutoMaterializePolicy accept an argument for `any`, `all`, or a subset of upstream assets to select/exclude in its evaluation:
• `any` - always trigger whenever any of the immediate upstream assets are refreshed
• `all` - wait for all of the immediate upstream assets to have a more recent materialization than the asset being evaluated (I know this one gets a bit into the territory of the FreshnessPolicy, but it could be complementary to it)
• `subset` - select or exclude a list of immediate upstream assets to define the subset the sensor will look at in its evaluation
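Purely as a way to picture that proposal (the `on_parents` argument is hypothetical and does not exist in Dagster):
```python
# Hypothetical API -- `on_parents` is not a real Dagster parameter;
# it only illustrates the proposal above.
AutoMaterializePolicy.eager(on_parents="any")                    # fire on any parent refresh
AutoMaterializePolicy.eager(on_parents="all")                    # wait for every parent
AutoMaterializePolicy.eager(on_parents=["daily_order_summary"])  # watch only a subset
```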