How is everyone else handling assets that aren’t partitioned but should be historized? E.g. in budget planning, department heads will update spreadsheets, often intraday, and the lake should contain all historical versions of them. Though I know S3 can keep old versions of files, working with this versioning system seems a little cumbersome. My current approach is to append a load timestamp to the filename, but this sometimes causes problems loading the asset downstream (could just be me not fully battle-testing my IO Manager code though).
02/09/2023, 2:10 PM
I just asked this question yesterday. The suggestion we are trying now is to use asset partitions. The safety measure in place is then to not produce an asset if the current day (or whatever frequency you are using) doesn't match up with the partition being generated. See brief thread here: https://dagster.slack.com/archives/C01U954MEER/p1675888677787539
So every new day, the new "partition" is fetched, which is really the entire resource snapshotted at that point in time. The asset function then checks whether the partition window matches the current day and, if it doesn't, throws an error (or skips, or whatever you prefer). That prevents old partitions from getting re-materialized, which would not be correct time snapshots.
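A rough sketch of that check, in case it helps. It is plain Python with no Dagster imports, and the names `ensure_current_partition` and `StalePartitionError` are made up for illustration. It assumes daily partition keys formatted as `YYYY-MM-DD` and compares against the current UTC date; adjust both if your partitions definition uses a different format or timezone.

```python
from datetime import date, datetime, timezone
from typing import Optional


class StalePartitionError(Exception):
    """Raised when a snapshot is requested for a partition other than today's."""


def ensure_current_partition(partition_key: str, today: Optional[date] = None) -> date:
    """Return the partition date if it equals today's date, otherwise raise.

    Assumes keys look like "2023-02-09". `today` can be passed in for testing;
    by default the current UTC date is used.
    """
    today = today or datetime.now(timezone.utc).date()
    partition_date = datetime.strptime(partition_key, "%Y-%m-%d").date()
    if partition_date != today:
        # Re-materializing an old partition would store today's data under
        # yesterday's label, so refuse instead of writing a wrong snapshot.
        raise StalePartitionError(
            f"partition {partition_key} does not match current date {today.isoformat()}"
        )
    return partition_date
```

Inside a partitioned asset you would call this with `context.partition_key` before fetching the source, so a backfill of an old partition fails loudly instead of overwriting history.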
02/09/2023, 2:26 PM
Hm, that’s an interesting approach, though I can see my OCD kicking in when looking at the backfill list being mostly empty since the job doesn’t need to run daily. I’d also maybe like to have multiple versions of the asset that pertain to the same day
02/09/2023, 2:29 PM
yeah, if your partition definition is complicated this might be a pain. And if a run fails for whatever reason you will have "forever missing" partitions, though that shouldn't be a problem in most downstream cases. The built-in partitions definitions include daily, weekly, and monthly, I believe.
Otherwise I would probably do it at the IOManager level, like you were suggesting, which was one of the approaches we were considering.
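For the IOManager route, here is a minimal sketch of the naming logic only (no Dagster or S3 calls). The helper names `versioned_key` and `latest_key` and the `asset_name/timestamp.ext` layout are assumptions for illustration, not anything Dagster prescribes. The idea is to write each load under a fixed-width UTC timestamp, so several versions per day can coexist and the newest one is simply the lexicographically largest key.

```python
from datetime import datetime, timezone
from typing import Iterable, Optional

# Fixed-width UTC timestamp, so string order equals chronological order.
TIMESTAMP_FORMAT = "%Y%m%dT%H%M%S%fZ"


def versioned_key(asset_name: str, loaded_at: datetime, extension: str = "csv") -> str:
    """Build an object key such as "budget/20230209T141000000000Z.csv".

    `loaded_at` should be timezone-aware; it is converted to UTC first.
    """
    stamp = loaded_at.astimezone(timezone.utc).strftime(TIMESTAMP_FORMAT)
    return f"{asset_name}/{stamp}.{extension}"


def latest_key(keys: Iterable[str], asset_name: str) -> Optional[str]:
    """Pick the newest version of an asset from a listing of object keys."""
    prefix = f"{asset_name}/"
    candidates = [k for k in keys if k.startswith(prefix)]
    # Timestamps are fixed-width, so max() on the strings gives the newest.
    return max(candidates) if candidates else None
```

The output side of an IO manager would write to `versioned_key(...)`, and the input side would list the keys under the asset's prefix and load `latest_key(...)`, which keeps downstream loading independent of the exact timestamp in the filename.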
Another option altogether is to not bother with the
stuff and just schedule an op that pushes to a bucket where you tell it to go.