How to deal with circular/self dependencies withou...
# dagster-feedback
d
How to deal with circular/self dependencies without using time partitions?

1. I want to process a bunch of files and produce an asset A pointing to these files.
2. I want to keep track of the files which have been processed, so I create an asset B with a list (table) of all previously processed files.
3. I combine them into a `@multi_asset`, which makes sure the processed files are always appended to B whenever A is materialized.
4. Now, when I receive new files to process, I would like to filter them by the already-processed files in B. But this creates a circular/self dependency, because I now need asset B in order to update asset B (by appending the filtered files).

Of course, as a workaround I could just manually read the asset without letting Dagster know about it. However, I would lose some of the benefits of using assets/IOManagers that way. A possible solution would be to introduce dynamic time partitions: the asset could then depend on a previous partition of itself, while the partitions would appear dynamically.
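To make the circularity in step 4 concrete, here is a minimal Dagster-free sketch of that logic (function and file names are hypothetical, not from any Dagster API): B must be read to filter the incoming files, and written to record them, within the same update.

```python
# Hypothetical sketch of step 4 in plain Python (no Dagster): asset B is
# both read (to filter incoming files) and written (to append the new
# ones), which is exactly the self-dependency described above.

def update_processed_files(
    processed: list[str], incoming: list[str]
) -> tuple[list[str], list[str]]:
    """Filter `incoming` by what is already in B, then append to B."""
    already_done = set(processed)                      # read asset B
    new_files = [f for f in incoming if f not in already_done]
    updated = processed + new_files                    # write asset B
    return new_files, updated

b = ["a.csv", "b.csv"]                                 # current contents of B
to_process, b = update_processed_files(b, ["b.csv", "c.csv"])
# only "c.csv" is new, and B now lists all three files
```

The sketch shows why a naive asset graph rejects this: the same node appears upstream and downstream of itself.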
o
Yeah, definitely an interesting use case (and right now we do indeed recommend "just manually read the asset without letting Dagster know about it"). I would say that you can often get a lot of the benefit out of IOManagers even if you do it this way, as you can use `load_asset_value` to invoke the IOManager for you: https://github.com/dagster-io/dagster/discussions/14432

We have in the past considered some sort of "ContinuousTimeWindowPartitionsDefinition", where any run could target a time partition with an arbitrary start/end point. This could in theory go along with an "AllPreviousPartitionMapping", where a given partition of an asset corresponds to all prior partitions of another asset. That could also maybe be used in a self-partition mapping, so that you could explicitly define that an asset depends on all previous partitions of itself. Alternatively, we could cut out all that partition-mapping machinery and just let a non-partitioned asset depend on itself, although that could be trickier to implement than it sounds.
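The hypothetical "depends on all previous partitions of itself" idea can be modeled outside Dagster to see why it resolves the cycle (a toy sketch under that assumption; nothing here is a real Dagster API): partition *n* of the asset is computed from partitions 0..n-1 of the same asset, so no single partition depends on itself.

```python
# Toy model of the speculative self-partition mapping: each partition of
# the "processed files" asset is derived from every earlier partition of
# itself plus that batch's incoming files.

def materialize_partition(history: list[set[str]], incoming: set[str]) -> set[str]:
    """Partition n depends on partitions 0..n-1 of the same asset."""
    seen = set().union(*history) if history else set()
    return seen | (incoming - seen)

partitions: list[set[str]] = []
for batch in [{"a", "b"}, {"b", "c"}]:
    partitions.append(materialize_partition(partitions, batch))
# the latest partition holds the full running set {"a", "b", "c"}
```

The cycle disappears because the dependency edge always points strictly backward in partition order.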
d
`Definitions.load_asset_value` is really nice; I've been using it outside of Dagster runs already. However, I'm more concerned about writing the asset (as my special job does this). Would it be possible to have a `Definitions.materialize_asset` too? Alternatively: the `job` decorator currently has an `input_values` argument, which can be used to load assets (by the way, it didn't work correctly when I tried it; I'll take another look), so it could also have an `output_assets` argument. Sorry for the phone formatting.
@owen any thoughts?
o
Can you say a bit more about what your special job is doing? How does it differ from the regular flow of materializing your asset? Is it mostly just that it has some pre-work steps that happen before your asset is materialized? If so, what sort of pre-work is happening there? Could it be encapsulated by an asset in some way?
d
The asset maintains a collection of processed files: every time new files are processed, their names are appended to this asset. It's a Delta Lake table. Sometimes I want to reset this table or sync it with files that have been added manually. The job does exactly this: it looks through the available files and writes their names into the table (overwriting it instead of appending). I realize the asset is not idempotent, and that this is the root of the problem, but it seems like Dagster allows such workflows, judging by the existence of `AssetMaterialization` events.
o
Could this be done via config on the asset (default config = "append only", fancy config = "overwrite")? Then you could create a special job that has the fancy config encoded on it, which should do something similar to your existing job.
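The config suggestion can be sketched without Dagster (all names hypothetical; in a real setup the mode would come from run config on the asset rather than a plain dict): one body for the asset, with the write mode selecting between the everyday append path and the reset/sync overwrite path.

```python
# Plain-Python sketch of the config idea: a single "asset body" whose
# write mode is chosen by config, instead of a separate special job.

def write_processed_files(
    table: list[str], names: list[str], config: dict
) -> list[str]:
    mode = config.get("write_mode", "append")  # default config = append-only
    if mode == "append":
        return table + [n for n in names if n not in set(table)]
    if mode == "overwrite":                    # "fancy" config for reset/sync
        return list(names)
    raise ValueError(f"unknown write_mode: {mode}")

t = ["a.csv"]
t = write_processed_files(t, ["a.csv", "b.csv"], {})                  # append
t = write_processed_files(t, ["x.csv"], {"write_mode": "overwrite"})  # reset
```

A job pinned to `{"write_mode": "overwrite"}` would then play the role of the existing special job, while the default schedule keeps appending.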
d
This should be possible, but that approach would lead to a lot of boilerplate and dirty code; the asset would be just one giant hack. I would really like to have `Definitions.materialize_asset` instead, as it would be more versatile.