Stephen Bailey
07/26/2023, 11:21 AM
FreshnessPolicy(maximum_lag_minutes=60) -- does this amount to different things depending on the dependency structure?
• If no upstream dependencies, it is basically an hourly schedule
• If one upstream dependency (but just one layer), it is dependent on upstream assets being fresher, and will be updated w/i one hour
• If two or more upstream dependencies (but just one layer), it is dependent on all upstream assets being fresher, and will be updated w/i one hour
• If two or more layers of dependencies, it is dependent on all upstream assets being fresher, and will be updated w/i one hour
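One way to reason about the four cases above is to track, for each asset, the age of the oldest root data it currently reflects. The sketch below is a toy model in plain Python, not Dagster code: the graph, `materialize`, `lag_minutes` and `is_fresh` are hypothetical names, and the assumption that lag is measured against root data (rather than against the asset's own last run) is a simplified reading of the class docstring.

```python
from datetime import datetime, timedelta

# Hypothetical toy graph, not Dagster objects: asset name -> direct parents.
PARENTS = {"raw": [], "staged": ["raw"], "report": ["staged"]}

# data_time[asset] = timestamp of the oldest root data baked into the asset.
data_time = {}

def materialize(asset, now):
    """Record a materialization and inherit upstream data times."""
    parents = PARENTS[asset]
    # A root asset pulls fresh data; a child inherits its stalest parent.
    data_time[asset] = now if not parents else min(data_time[p] for p in parents)

def lag_minutes(asset, now):
    """Minutes between now and the oldest upstream data the asset reflects."""
    return (now - data_time[asset]).total_seconds() / 60

def is_fresh(asset, now, maximum_lag_minutes=60):
    return lag_minutes(asset, now) <= maximum_lag_minutes

t0 = datetime(2023, 7, 26, 11, 0)
materialize("raw", t0)
materialize("staged", t0 + timedelta(minutes=10))
materialize("report", t0 + timedelta(minutes=50))

check = t0 + timedelta(minutes=70)
print(lag_minutes("report", check))  # 70.0, although report ran 20 minutes ago
print(is_fresh("report", check))     # False

# Re-running only report does not help; the root has to refresh first.
materialize("report", check)
print(is_fresh("report", check))     # False
for name in ("raw", "staged", "report"):
    materialize(name, check)
print(is_fresh("report", check))     # True
```

Under this model an asset with no parents is fresh exactly when it ran within the last hour, which matches the "basically an hourly schedule" case, and the number of layers only changes how many upstream runs are needed before the asset can become fresh again.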
Do I understand that right? I'm not sure which docs page is best for understanding this, but I have learned the most from the class docstring. Feels like a table of some sort in the docs would be useful.
Remi Gabillet
07/26/2023, 11:57 AM
Remi Gabillet
07/26/2023, 11:57 AM
Malo PARIS
07/26/2023, 12:11 PM
Daniel Gafni
07/26/2023, 3:05 PM
Daniel Gafni
07/26/2023, 3:06 PM
Nicolas Parot Alvarez
07/26/2023, 3:19 PM
Daniel Gafni
07/26/2023, 3:35 PM
Nicolas Parot Alvarez
07/26/2023, 3:39 PM
Nikolaj Galak
07/26/2023, 6:03 PM
Nicolas Parot Alvarez
07/26/2023, 6:17 PM
Stephen Bailey
07/26/2023, 6:31 PM
freshness_dependency_depth parameter, where you want the scheduler to only consider the immediate parents (or parents of parents) when calculating freshness? Being able to specify depth=1 (refresh after parent) or depth=0 (refresh on schedule) would mitigate some of the concerns about not knowing how long upstream things would take, and also simplify the reasoning for why something has or hasn't kicked off.
Remi Gabillet
07/26/2023, 7:00 PM
Stephen Bailey
07/26/2023, 7:06 PM
Skip Conditions error. Dagster doesn't auto-materialize these upstream assets (they are controlled on a schedule), so despite its parent emitting events, the asset itself does not run.
Remi Gabillet
07/26/2023, 7:10 PM
Remi Gabillet
07/26/2023, 7:12 PM
Stephen Bailey
07/26/2023, 7:57 PM
owen
07/28/2023, 11:02 PM
A -> B -> ... -> Z. An update to A would take ~26 hours to propagate all the way down to the bottom, as each asset in the chain would feel comfortable waiting ~60 minutes to propagate the change from their specific parent.
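The arithmetic behind that estimate can be sketched in a few lines. This assumes a strictly linear chain in which every asset considers only its direct parent and may wait the full allowed lag before running; the function name and the optional `run_minutes` parameter are illustrative, not part of Dagster.

```python
import string

def worst_case_propagation_minutes(num_assets, maximum_lag_minutes, run_minutes=0):
    """Upper bound on how long a change at the head of a linear chain
    takes to reach the tail when each asset looks only at its parent."""
    hops = num_assets - 1  # one parent-to-child hop per downstream asset
    return hops * (maximum_lag_minutes + run_minutes)

chain = list(string.ascii_uppercase)  # A, B, ..., Z

print(worst_case_propagation_minutes(len(chain), 60) / 60)  # 25.0 hours
print(worst_case_propagation_minutes(len(chain), 0))        # 0 minutes
```

The 25 hops from A to Z give about 25 hours, in the same ballpark as the figure quoted above, and the bound grows linearly with chain depth. With a lag of zero only the run durations remain.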
My suspicion is that there's some other way of describing the desired behavior in a lot of these cases which would vastly simplify the mental model. If people have specific ways of stating what they'd like (independent from any current implementation / freshness policies), I'd love to talk through what that might be. Some possible starting points:
• "This asset should execute at around 8AM every day, after its parents have been updated"
• "If this asset hasn't been materialized in the last 60 minutes, materialize it as soon as any of its parents have been updated"
• "Every 60 minutes, this asset should be materialized, as long as one of its parents has been updated since the last time it ran"
Stephen Bailey
07/29/2023, 12:19 AM
maximum_lag_minutes=0 -- i.e. I want as little lag between any asset materializations as possible. So in the A->Z case, I would set 0 lag between assets, and then everything would refresh as soon as its parent refreshed, which is what an event-driven system would do. (i.e. a standard asset_sensor today).
For the case where there are no parents, a more active way of expressing maximum_lag_minutes would be to call it minimum_time_between_materializations, or */60 in cron terminology. In this case, I personally think just using cron_schedules is how this should be done -- FreshnessPolicy(maximum_lag_minutes=0, cron_schedule="5 * * * *").
Daniel Gafni
07/29/2023, 5:35 AM
Nicolas Parot Alvarez
07/31/2023, 10:16 AM
FreshnessPolicy(minimum_lag_minutes=0, cron_schedule="5 * * * *")
minimum_lag_minutes (float) - If the asset's materialization is triggered by its parent(s), and respects its optional CRON schedule, it awaits minimum_lag_minutes before starting its own materialization. For example, if the current asset's parent(s) is updated every 1 min, but you only need the current asset to be updated every 1 h, because less than that would be wasted computation, then you can set minimum_lag_minutes = 60 to enforce a lag of 1 h that will be waited before the current asset starts materializing.
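To make the proposal concrete, here is a minimal sketch of such a gate in plain Python. `minimum_lag_minutes` is only a suggested parameter in this thread, not an existing Dagster argument, and `should_materialize` is a hypothetical helper. The sketch assumes the lag is counted from the asset's own last materialization, which is one possible reading of the description above.

```python
from datetime import datetime, timedelta

def should_materialize(now, last_materialized, parent_updated, minimum_lag_minutes):
    """Hypothetical gate: run only if a parent has new data and at least
    minimum_lag_minutes have passed since this asset last ran."""
    if last_materialized is None:
        return True  # never ran before
    if parent_updated <= last_materialized:
        return False  # nothing new upstream
    return now - last_materialized >= timedelta(minutes=minimum_lag_minutes)

def count_runs(minimum_lag_minutes, total_minutes=180):
    """Simulate a parent that updates every minute and count child runs."""
    start = datetime(2023, 7, 31, 10, 0)
    last, runs = None, 0
    for minute in range(total_minutes):
        now = start + timedelta(minutes=minute)
        if should_materialize(now, last, now, minimum_lag_minutes):
            last, runs = now, runs + 1
    return runs

print(count_runs(60))  # 3 runs in three hours
print(count_runs(0))   # 180 runs, one per parent update
```

With a parent that updates every minute, a 60-minute minimum lag cuts three hours of work from 180 runs to 3, which is the wasted computation the proposal aims to avoid.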
I think this prevents wondering about how it works.
A downside is that it may go against the initial intention of a smarter feature.
On our side, currently to prevent unnecessary materialization, we've been playing with our sensor definitions.
We've set reasonable frequencies, and in the sensor evaluation we check that certain jobs are not already running before triggering new jobs.
Example: https://gist.github.com/NicolasPA/854392e22dc1410977cc7ddb8b8605a4
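For readers who want the gist of the idea without opening the link, the check described above might look roughly like the sketch below. It is not the code from the linked gist: `RunRecord`, the status labels and `evaluate_sensor` are hypothetical stand-ins in plain Python, and in a real sensor the list of runs would come from the orchestrator's run storage.

```python
from dataclasses import dataclass

# Hypothetical status labels and run record; stand-ins, not Dagster classes.
IN_PROGRESS = {"QUEUED", "STARTING", "STARTED"}

@dataclass
class RunRecord:
    job_name: str
    status: str

def evaluate_sensor(target_job, blocking_jobs, runs):
    """Decide whether to request a run of target_job on this sensor tick."""
    busy = sorted({r.job_name for r in runs
                   if r.job_name in blocking_jobs and r.status in IN_PROGRESS})
    if busy:
        # Skip this tick; the next evaluation will check again.
        return ("skip", "in progress: " + ", ".join(busy))
    return ("run", target_job)

runs = [RunRecord("ingest", "STARTED"), RunRecord("report", "SUCCESS")]
print(evaluate_sensor("report", {"ingest", "report"}, runs))
# ('skip', 'in progress: ingest')
print(evaluate_sensor("report", {"report"}, runs))
# ('run', 'report')
```

Skipping rather than queueing keeps the sensor stateless: each tick re-derives the decision from the current runs, so a long upstream job simply delays the next trigger.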