Just reading up on the <AutoMaterializePolicy> fea...
# dagster-feedback
s
Just reading up on the AutoMaterializePolicy features, very cool, and natural extension. Are there plans to extend it to other triggers, like schedule- and event-based? I can imagine the combination of
AutomaterializePolicy.cron_schedule("0 0 * * *")
and
AutomaterializePolicy.eager()
could solve a ton of use cases without ever having to learn about sensors and schedules classes.
j
Interesting. So far we’ve been thinking of it as a separate thing from sensors and schedules (though they’re all clearly in some instigator category). We have seen confusion arising from combing AutoMaterializePolicies with schedules though
s
Is the idea that the assets that you'd want to put
AutoMaterializePolicy.cron_schedule("0 0 * * *")
on are at the root of the asset graph? What determines cadence you end up wanting to refresh an asset like that on? E.g. is source data that the asset is derived from that gets refreshed daily? Is there source data that gets updated continuously, but the asset doesn't need to incorporate it immediately?
s
I'd say I'm basically trying to remove the overhead of creating jobs, schedules, or sensors in order for a developer to deploy a new asset. Currently, they have to think about not only the asset itself, but also a
define_asset_job
,
AssetSelection
(or other relative python imports),
schedules
, etc. So deploying a new, simple asset brings a lot Dagster framework overhead.
AutomaterializePolicy
could make it so that any new asset can be deployed by only configuring it at the asset level in sufficiently simple circumstances
s
I'd say I'm basically trying to remove the overhead of creating jobs, schedules, or sensors in order for a developer to deploy a new asset.
That makes total sense. I share this goal. What I'm curious about here is how annoying it would be for the developer to express the schedule for the root asset in a "declarative" way, i.e. in terms of either: 1. When source data is available 2. When derived data is required to be up-to-date Example of (1): "refresh the events table whenever the raw_events table is modified"
Copy code
@observable_source_asset(auto_observe_interval_minutes=30)
def raw_events():
    return get_last_modified_timestamp("raw_events_table")

@asset(non_argument_deps={"raw_events"}, auto_materialize_policy=AutoMaterializePolicy.eager())
def events():
    ...
Example of (2): "the events table should never be more than 24 hours out of date"
Copy code
@asset(
    freshness_policy=FreshnessPolicy(maximum_lag_minutes=24 * 60),
    auto_materialize_policy=AutoMaterializePolicy.lazy(),
)
def events():
    ...
Both of these are more code than what you're suggesting and don't map to it 100%, so I'm not convinced they're better. Just trying to understand how far apart your mental model is from the mental model of the current declarative scheduling system.
s
I guess my mental model sees both "refreshed-on-a-cron-schedule" and "refreshed-with-respect-to-some-dataset-property" as both instances of declarative scheduling? I get that it deviates from the freshness policy mental model, but I'll confess I have trouble mapping freshness policies -- in almost all cases we are continuously receiving / ingesting data, so there's not really a difference between giving it a cron schedule and saying, "check for new data every 30 minutes". The pain I'm seeing is that schedules are being set in places other than where the asset itself is declared, which makes it harder to build and debug.
s
Got it - that makes sense
👍 1
I filed an issue to track this request: https://github.com/dagster-io/dagster/issues/14328
❤️ 2
v
Piggybacking to say that this would be an amazing feature. As Stephen said, I’d love to say goodbye to jobs entirely and shift the mental model and the entire data platform to an asset-only one. I currently only have asset jobs for the root assets and everything else gets auto-materialized, but removing this initial layer would be even better!