Vitaly Markov

05/25/2023, 8:39 AM
I've noticed a fundamental assumption in the code that "upstream" partitions should always exist when working with `TimeWindowPartitionMapping`: https://github.com/dagster-io/dagster/blob/d3532401b1b5823f43c390b3420b33e7dcefa42[…]ster/dagster/_core/definitions/time_window_partition_mapping.py If at least one upstream partition does not exist, the code throws an exception. But a lack of "downstream" partitions is fine: https://github.com/dagster-io/dagster/blob/d3532401b1b5823f43c390b3420b33e7dcefa42[…]ster/dagster/_core/definitions/time_window_partition_mapping.py Is there any way to get the list of upstream partitions without the existence check? Alternatively, maybe it would be possible to add an option to set `raise=False`? It's OK to keep the current behaviour as the default. It would be very useful for maintenance scripts and custom scheduling implementations, where I check every asset and scan the asset graph "upwards" instead of catching materialization events and scanning it "downwards". Thank you!
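To make the request concrete, here is a minimal sketch in plain Python of the behaviour being asked for. It does not use any Dagster API: `daily_keys`, `upstream_keys_for` and the `raise_on_missing` flag are hypothetical names invented for illustration (`raise` itself is a reserved word in Python, so it could not be a parameter name), and the one-to-one daily mapping is an assumption.

```python
from datetime import date, timedelta


def daily_keys(start: date, end: date) -> list[str]:
    """Daily partition keys from start (inclusive) to end (exclusive)."""
    days = max((end - start).days, 0)
    return [(start + timedelta(days=i)).isoformat() for i in range(days)]


def upstream_keys_for(
    downstream_key: str,
    upstream_existing: set[str],
    raise_on_missing: bool = True,
) -> list[str]:
    """Hypothetical one-to-one daily mapping from a downstream key to upstream keys.

    With raise_on_missing=True (the default), a required upstream partition
    that does not exist is an error. With raise_on_missing=False, nonexistent
    partitions are silently dropped from the result.
    """
    required = [downstream_key]  # assumption: same-day, one-to-one mapping
    missing = [key for key in required if key not in upstream_existing]
    if missing and raise_on_missing:
        raise ValueError(f"upstream partitions do not exist: {missing}")
    return [key for key in required if key in upstream_existing]


# The upstream asset starts two days later than the downstream asset.
upstream = set(daily_keys(date(2023, 5, 3), date(2023, 5, 6)))
existing_day = upstream_keys_for("2023-05-04", upstream)
early_day = upstream_keys_for("2023-05-01", upstream, raise_on_missing=False)
```

Here `existing_day` contains the matching upstream key, while `early_day` is an empty list rather than an exception.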

claire

05/25/2023, 5:16 PM
Hi Vitaly, I have a PR out to address this: https://github.com/dagster-io/dagster/pull/14449 Wondering what your use case is for having nonexistent upstream partitions?

Vitaly Markov

05/25/2023, 8:29 PM
@claire, fantastic! I think it would fix my problem entirely. 🔥 I am working on declarative job configs and an alternative scheduling implementation. It is going quite well so far. Instead of having thousands of multi-asset sensors (one per job), I create one super-sensor for many jobs at once. This sensor pre-loads the last materialization for all assets and the counters for all asset partitions. After that, the sensor iterates over the assets and checks the status of their upstream dependencies. For partitioned assets, it checks every non-materialized downstream partition and compares it with the status of the corresponding upstream partitions. If all upstream partitions are materialized in all parents, but the downstream partition is not yet materialized -> yield `RunRequest`. The main goal is to scale well and manage dependencies for a very large number of assets and jobs. I also add some custom scheduling logic and conditions which go beyond the current AutoMaterializePolicy / FreshnessPolicy.
Since I do the checks in reverse (from "downstream" to "upstream"), it is possible to encounter a situation where the corresponding upstream partitions do not exist (e.g. due to `start_date` being a bit later for one of the parent assets).
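The downstream-to-upstream check described above can be sketched roughly as follows. This is plain Python over pre-loaded sets, not Dagster code: `ready_partitions`, `existing` and `materialized` are hypothetical names, the mapping between parent and child partitions is assumed to be one-to-one on the key, and treating a nonexistent upstream partition as non-blocking is an assumption of this sketch (the thread does not say how such partitions should be handled).

```python
def ready_partitions(
    asset: str,
    parents: list[str],
    existing: dict[str, set[str]],
    materialized: dict[str, set[str]],
) -> list[str]:
    """Return partitions of `asset` for which a run could be requested.

    A partition is ready when it is not yet materialized and every parent
    has materialized the corresponding partition. Parent partitions that
    do not exist at all are skipped (assumption, see lead-in).
    """
    pending = existing[asset] - materialized.get(asset, set())
    ready = []
    for key in sorted(pending):
        blocked = False
        for parent in parents:
            if key not in existing[parent]:
                continue  # parent starts later: nothing to wait for (assumption)
            if key not in materialized.get(parent, set()):
                blocked = True  # parent partition exists but is not materialized
                break
        if not blocked:
            ready.append(key)
    return ready


all_days = {"2023-05-01", "2023-05-02", "2023-05-03", "2023-05-04"}
existing = {
    "child": set(all_days),
    "parent_a": set(all_days),
    "parent_b": all_days - {"2023-05-01"},  # later start_date
}
materialized = {
    "child": {"2023-05-03"},
    "parent_a": {"2023-05-01", "2023-05-02"},
    "parent_b": {"2023-05-02"},
}
ready = ready_partitions("child", ["parent_a", "parent_b"], existing, materialized)
```

In a real sensor, each key in `ready` would correspond to one yielded `RunRequest`; because everything is computed from sets loaded once, the cost per asset stays a handful of set lookups.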

claire

05/25/2023, 8:40 PM
Oh, I see, you want to be able to handle the case where a parent asset starts later than a downstream asset. That makes sense. I think this PR would be what you need, then.