# ask-community
Edvard Lindelöf:
Hi all - happy to find out there is such a passionate community around Dagster! I am looking for advice on how to keep track of historical asset materializations in a fairly dynamically updated graph. The case involves a 'large' number (1000s) of assets. Roughly every 5 minutes an input file arrives that triggers materialization of a small number of input and downstream assets. This can be achieved by dynamically partitioning over input files and using conditional materialization; so far, so good. But something crucial to the application is being able to retrieve materializations as they were at a certain historical date (or, optionally, as they were after ingestion of a certain historical input file). How can I go about this? Can Dagster help me with it? Should I rethink what I consider to be an asset?
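For concreteness, here's a rough sketch of the setup I mean - one dynamic partition per arriving input file, added by a sensor. All the names and the `discover_new_files` / `parse` helpers are made up, and it assumes fairly recent Dagster APIs (`SensorResult`, `build_add_request`):
```python
from dagster import (
    AssetSelection,
    DynamicPartitionsDefinition,
    RunRequest,
    SensorResult,
    asset,
    sensor,
)

# one dynamic partition per arriving input file
input_files = DynamicPartitionsDefinition(name="input_files")

@asset(partitions_def=input_files)
def parsed_file(context):
    # the partition key is the file name; parse() is a hypothetical helper
    return parse(context.partition_key)

@sensor(asset_selection=AssetSelection.assets(parsed_file))
def new_file_sensor(context):
    new_files = discover_new_files()  # hypothetical helper
    return SensorResult(
        run_requests=[RunRequest(partition_key=name) for name in new_files],
        dynamic_partitions_requests=[input_files.build_add_request(new_files)],
    )
```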
One option I've investigated is to use LastPartitionMapping, or something similar, to have downstream asset partitions map to the latest materialized partition of the upstream asset. It is unclear to me whether this is possible, and I suspect that if it is, it may require one DynamicPartitionsDefinition instance per asset, which I'm not sure is a good idea.
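Concretely, the idea would look something like this (just a sketch - whether the mapping behaves this way with dynamic partitions is exactly what's unclear to me):
```python
from dagster import AssetIn, DynamicPartitionsDefinition, LastPartitionMapping, asset

files = DynamicPartitionsDefinition(name="files")

@asset(partitions_def=files)
def upstream(context):
    ...

@asset(ins={"upstream": AssetIn(partition_mapping=LastPartitionMapping())})
def downstream(upstream):
    # `upstream` would load from the latest materialized upstream partition
    ...
```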
It would be incredibly neat to be able to achieve it without introducing side-effects into asset code
Owen:
hi @Edvard Lindelöf! definitely interesting -- the partition mapping approach you mentioned seems reasonable conceptually, to be honest, but another option would be to handle this logic in a custom IOManager. Assuming you're storing files in a filesystem, I could imagine a storage scheme that would basically be:
```python
def handle_output(self, context, obj):
    directory = f"foo/{context.asset_key}"
    # write each materialization to a new date-stamped file
    store_obj_in(obj, directory + f"/{current_date}")

def load_input(self, context):
    directory = f"foo/{context.asset_key}"
    # pick up whatever was stored most recently
    filename = get_newest_file_in_directory(directory)
    return load_obj_from(directory + "/" + filename)
```
this ends up being logically fairly similar to the LastPartitionMapping scheme, but keeps the historical information separate from the asset's logic
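Fleshed out a bit, that scheme could look like the sketch below. This is not a built-in Dagster IOManager - the class, the file layout, and the `as_of` knob (which pins loads to a historical cutoff, addressing the "as of a certain date" requirement) are all made up:
```python
import os
import pickle
from datetime import datetime, timezone
from typing import Optional

from dagster import ConfigurableIOManager, InputContext, OutputContext

class TimestampedIOManager(ConfigurableIOManager):
    base_dir: str = "foo"
    # cutoff in the same format as the stored file names; None means "latest"
    as_of: Optional[str] = None

    def _dir(self, context) -> str:
        return os.path.join(self.base_dir, *context.asset_key.path)

    def handle_output(self, context: OutputContext, obj) -> None:
        directory = self._dir(context)
        os.makedirs(directory, exist_ok=True)
        # fixed-width timestamp, so lexicographic sort == chronological sort
        stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H-%M-%S-%f")
        with open(os.path.join(directory, stamp), "wb") as f:
            pickle.dump(obj, f)

    def load_input(self, context: InputContext):
        directory = self._dir(context)
        names = sorted(os.listdir(directory))
        if self.as_of is not None:
            names = [n for n in names if n <= self.as_of]
        if not names:
            raise FileNotFoundError(
                f"no stored version of {directory} at or before {self.as_of}"
            )
        with open(os.path.join(directory, names[-1]), "rb") as f:
            return pickle.load(f)
```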
Edvard Lindelöf:
Thanks Owen, that's a nice option, especially since it doesn't require much extra code! Will try it out.
On a more conceptual level, I've contemplated whether full re-materialization of the state requires loading all historical files - I think that, _in principle (😁),_ it should be possible to compute exactly which input files are required to materialize a partition of an asset, since each file is an input node in a bigger DAG... In practice, I might look at having all the files available in the same asset so I can do some of that optimization myself
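To make that "in principle" concrete: the computation I have in mind is just an ancestor walk over the dependency graph. A toy sketch using a plain dict rather than Dagster's API:
```python
from collections import deque

def required_source_files(deps: dict[str, list[str]], target: str) -> set[str]:
    """deps maps each node to its upstream nodes; nodes with no upstreams
    are the input files. Returns the files needed to rebuild `target`."""
    needed, seen, queue = set(), {target}, deque([target])
    while queue:
        node = queue.popleft()
        upstreams = deps.get(node, [])
        if not upstreams:
            needed.add(node)  # a source node, i.e. an input file
        for up in upstreams:
            if up not in seen:
                seen.add(up)
                queue.append(up)
    return needed

# e.g. required_source_files({"c": ["a", "b"], "a": [], "b": []}, "c") == {"a", "b"}
```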