Bruno Grande
08/05/2022, 5:42 PM
I could yield an AssetMaterialization in a dynamically partitioned non-asset job, but I lose the useful data lineage of SDAs.
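(For reference, a minimal sketch of the non-asset approach being weighed here, i.e. yielding an AssetMaterialization from an op inside a plain job; the op name, config schema, and asset key are illustrative assumptions:)

from dagster import AssetMaterialization, job, op

@op(config_schema={"filename": str})
def process_file(context):
    filename = context.op_config["filename"]
    # ... hypothetical processing work for one manifest entry ...
    # Manually record lineage; unlike an SDA, this materialization
    # is not part of a software-defined asset graph.
    context.log_event(AssetMaterialization(asset_key=["processed", filename]))
    return filename

@job
def process_manifest_job():
    process_file()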

owen
08/05/2022, 6:38 PM
You could create a StaticPartitionsDefinition that actually reads the set of partition keys from a database, i.e.
def get_partitions_def():
    all_filenames = call_to_database()
    return StaticPartitionsDefinition(all_filenames)

my_partitions = get_partitions_def()
You could have a separate job that updates the contents of the database so that it stays somewhat up to date. This isn't really a recommended pattern, because it generally means that every time you import this code, you'll need to make a call to a database, so you'd want to make sure that this was a pretty fast call (and probably cache the result).
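(One way to cache that result, as a sketch; functools.lru_cache and the stub call_to_database are assumptions, not part of the suggestion above:)

from functools import lru_cache

from dagster import StaticPartitionsDefinition

def call_to_database():
    # Hypothetical stand-in for the real query, e.g.
    # SELECT filename FROM manifest_files
    return ["file_a.csv", "file_b.csv"]

@lru_cache(maxsize=1)
def get_partitions_def():
    # The database is hit once per process; later imports/reloads of the
    # code reuse the cached StaticPartitionsDefinition.
    return StaticPartitionsDefinition(call_to_database())

my_partitions = get_partitions_def()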

Bruno Grande
08/05/2022, 7:07 PM

owen
08/05/2022, 8:43 PM

Bruno Grande
08/05/2022, 9:08 PM
• An asset for the manifest file
◦ This would be backed by an op that downloads the file from the data repository
• An asset for the set of processed outputs from all manifest chunks
◦ This would be backed by a dynamic graph, which would handle the splitting of the manifest and the submission of remote processing jobs (see the sketch below)
◦ This asset would depend on the first one
I think this would help achieve what I’m looking for because if the manifest is updated, then I would want to re-materialize the second asset.
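(A minimal sketch of that two-asset layout, assuming a graph-backed asset built with DynamicOut; the manifest contents, the chunking scheme, and process_chunk as a stand-in for remote job submission are all illustrative assumptions:)

from dagster import AssetsDefinition, DynamicOut, DynamicOutput, asset, graph, op

@asset
def manifest_file():
    # Hypothetical download from the data repository; just a list of
    # records here for illustration.
    return ["sample_1.fastq", "sample_2.fastq", "sample_3.fastq"]

@op(out=DynamicOut())
def split_manifest(manifest_file):
    # One dynamic output per manifest chunk (chunk size 1 for brevity)
    for i, record in enumerate(manifest_file):
        yield DynamicOutput([record], mapping_key=str(i))

@op
def process_chunk(chunk):
    # Hypothetical stand-in for submitting a remote processing job
    return [f"processed::{record}" for record in chunk]

@op
def collect_outputs(results):
    # Flatten the per-chunk results into the single output set
    return [item for chunk in results for item in chunk]

@graph
def processed_outputs(manifest_file):
    return collect_outputs(split_manifest(manifest_file).map(process_chunk).collect())

# The graph input name matches the asset above, so this second asset
# depends on the manifest and can be re-materialized when it changes.
processed_outputs_asset = AssetsDefinition.from_graph(processed_outputs)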
Do you know if there’s an easy way to use the file checksum (e.g. MD5) to determine whether it’s “out-of-date”? Or does Dagster currently only determine “out-of-dateness” based on whether upstream assets have been re-materialized or not?
I wonder if I could use the asset definition’s metadata for this. 🤔

owen
08/05/2022, 9:50 PM
You can record the hash with context.add_output_metadata({"hash": my_hash}). This method will add it to a particular materialization's metadata. You could then query this value in the next run of the asset with context.instance.get_asset_events(asset_key=my_asset_key, limit=1). This event should have that metadata on it somewhere.
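(A rough sketch of that hash-comparison flow; download_manifest and extract_hash_metadata are hypothetical helpers, and the exact event/metadata accessors have moved between Dagster versions:)

import hashlib

from dagster import AssetKey, asset

def download_manifest():
    # Hypothetical stand-in for fetching the manifest bytes
    return b"sample_1.fastq\nsample_2.fastq\n"

def extract_hash_metadata(event):
    # Hypothetical: walk the event's materialization metadata entries and
    # return the value labeled "hash" (exact attributes vary by version).
    ...

@asset
def manifest_file(context):
    raw = download_manifest()
    new_hash = hashlib.md5(raw).hexdigest()

    # Fetch the most recent materialization event for this asset,
    # using the call suggested above.
    events = context.instance.get_asset_events(
        asset_key=AssetKey("manifest_file"), limit=1
    )
    if events:
        old_hash = extract_hash_metadata(events[0])
        context.log.info(f"Manifest changed: {old_hash != new_hash}")

    # Record the new hash on this run's materialization metadata
    context.add_output_metadata({"hash": new_hash})
    return raw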

Ben Gatewood
08/06/2022, 5:33 AM