# ask-community
t
We have some time-window partitioned assets. We have switched to auto-materializing some of our assets downstream of the partitioned ones. It seems that the partitioned assets need a completed backfill before "Upstream data has changed since latest materialization" kicks in - otherwise the partitioned assets get registered as "Never materialized", and the downstream assets don't get activated by an auto-materialization policy. Is this a correct understanding? Sometimes backfills fail partially for some windows, or we don't want to run a complete backfill immediately - are there ways to make auto-materialization work when, say, data appears from a scheduled run for the most recent partition?
Backfilling the upstream assets (previously they only had a partial backfill) did indeed make the downstream asset start to detect the arrival of new data in the upstream assets. A use case: I would like partitions to remain "failed" if a file corresponding to a given timestamp is not present on a server, meaning the asset would remain "Overdue" and the downstream asset would not be able to auto-materialize. Should this be handled by a custom `AutoMaterializePolicy`?
o
hi @Tarje Bargheer! would it be correct to assume that the downstream assets are unpartitioned? By default, an unpartitioned asset is assumed to depend on all partitions of an upstream partitioned asset, so you're correct that everything upstream of it currently needs to be filled in before the downstream can be kicked off. Two possibilities here:
• the first is along the lines of this issue: https://github.com/dagster-io/dagster/issues/14628, where we may add an option to allow an asset to materialize even if upstream partitions are missing
• another option would be to update your partition mapping to something like `LastPartitionMapping`, which says that the downstream asset only cares about the last partition being filled in (and so won't wait for the other ones).
Sounds like the first option would be better for your setup, is that right? For the second bit, can you describe the exact behavior you're looking for in a bit more detail? Is it that a missing upstream partition should be treated differently from a failed upstream partition?
t
Hi Owen. Exactly right, and thanks for your precision on this! The downstream assets are unpartitioned, and the github issue matches our situation, so an option that specifies what to do in this transition between partitioned and unpartitioned assets would resolve issues in our current setup (which has some technical debt - we are in the process of changing our setup to keep partitioned assets partitioned). As far as I can tell, changing the partition mapping would work as well. The specific behaviour is that we expect a file to arrive every hour on a server and kick off a downstream unpartitioned asset. If the file is not present, it is helpful to have the backfill history keep a record of a failed partition - if the file is missing we send an alert, it will potentially get fixed in time, and we can run backfills once these issues are fixed. But the flow should progress, and only the existence of a file for the latest partition should determine whether the downstream asset should be started. As far as I can tell, `LastPartitionMapping` specifies exactly this? I find the documentation on this a bit hard to read. Is it sufficient to simply add this `LastPartitionMapping` to the downstream unpartitioned asset, or does it go on the partitioned asset? This is what I could find: https://docs.dagster.io/_apidocs/partitions#dagster.PartitionMapping Again, thanks - this was a really helpful clarification for me.
o
ah, it sounds like `LastPartitionMapping` is exactly what you need in that case. To be more precise about what a partition mapping is in general: it says, for a given upstream asset of another asset, which partitions of the upstream asset the downstream depends on. By default, if the downstream asset is unpartitioned, we assume "all of the partitions of the upstream", but you can override that assumption by setting the partition mapping explicitly on the `AssetIn` of the downstream asset, i.e.
```python
@asset(partitions_def=...)
def upstream():
    ...

@asset(ins={"upstream": AssetIn(partition_mapping=LastPartitionMapping())})
def downstream(upstream):
    ...
```
c
@owen Yo! I work with the magnificent @Tarje Bargheer - your solutions are great (especially number 1, which would be preferable), but the second solution is fine in theory. HOWEVER, our assets use `non_argument_deps`, and there you cannot use a partition mapping, as far as I can see - so it sadly doesn't solve it. Long story short, the downstream asset is a wrapper around a script that runs a specified notebook in databricks and uses the upstream asset name as an argument (and fetches the data in the notebook). It would (presumably?) require `non_argument_deps` to take a list of `AssetIn`s rather than just AssetKeys/strings, which requires an update to dagster - right? Or am I missing a temporary workaround? The asset is created via a homemade AssetFactory, so I cannot simply set the ins as arguments, as the factory defines the deps procedurally. I could probably write some sick metaprogramming as a workaround, which might be a fun code challenge; less so nice code to maintain.
o
ah, `non_argument_deps` is actually just shorthand for
```python
@asset(non_argument_deps={"foo", "bar"})
def downstream():
    ...

# is short for ...

@asset(ins={"foo": AssetIn(dagster_type=Nothing), "bar": AssetIn(dagster_type=Nothing)})
def downstream():
    ...
```
so you can use the second form to add partition mappings to those dependencies
c
Awesome thanks
I am not entirely sure it's a perfect fit; if it is `LastPartitionMapping`, wouldn't that mean that if I refresh a very old partition, the downstream won't rematerialize?
o
ah, sorry for losing track of this - that's indeed the case here: setting the `LastPartitionMapping` would mean that the downstream unpartitioned asset would only ever respond to changes to the most recent partition of the upstream
It sounds like you're interested in a behavior that materializes the downstream unpartitioned asset in response to any materialization of the parent (except maybe while you're running a large backfill, or in general when lots of parent partitions are getting materialized?) - would that be accurate?
c
Yes exactly
@owen are you on vacation or did you acknowledge my need while concluding that this is not supported? 😄
I guess basically, we request that an auto-materialization policy can be configured to ignore:
```
and none of the following are true:
any of its parent assets / partitions are missing
```
o
ha sorry, been a busy few weeks -- the basic idea is that this is not currently supported, but something we're working on supporting (this is the next milestone in the auto-materialize space, and is actively in development)