# ask-community
j
hey all, so i’m new to dagster, and i have a reasonably straightforward task, but I can’t for the life of me figure out the proper “dagstery” way of organizing it, esp. with the partitions:
1. download a series of monthly zip files
2. unpack them into daily CSVs
3. load each CSV into the same postgres table
Conceptually, should each zip file be represented by an asset that’s partitioned monthly? Should the daily CSVs be their own daily-partitioned asset that has the monthly-partitioned zip as an upstream asset? How do I pass the state of the files back and forth, e.g. whether or not the file has been downloaded, where the file is being stored, whether or not the zip file has been unpacked, etc.? I could hardcode all of the filepaths based off of the date but that feels wrong somehow
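On the filepath question: one possible approach is to derive every path from the partition key instead of hardcoding it, and to treat the filesystem itself as the record of what has been downloaded or unpacked. The sketch below uses only the standard library. The `data` root, the directory layout, and all function names are hypothetical. It assumes month keys look like `2023-03-01`, which is the default key format of Dagster's monthly partitions.

```python
from datetime import date
from pathlib import Path

# Hypothetical storage root; point this wherever the files should live.
DATA_ROOT = Path("data")


def month_key_for_day(day_key: str) -> str:
    """Map a daily key such as '2023-03-15' to its month key '2023-03-01'."""
    return date.fromisoformat(day_key).replace(day=1).isoformat()


def zip_path(month_key: str) -> Path:
    """Location of the monthly archive for a given month key."""
    d = date.fromisoformat(month_key)
    return DATA_ROOT / "zips" / f"{d:%Y-%m}.zip"


def csv_path(day_key: str) -> Path:
    """Location of the unpacked CSV for a given daily key."""
    d = date.fromisoformat(day_key)
    return DATA_ROOT / "csvs" / f"{d:%Y-%m}" / f"{d.isoformat()}.csv"


def is_downloaded(month_key: str) -> bool:
    """The archive's existence on disk doubles as the 'downloaded' flag."""
    return zip_path(month_key).exists()


def is_unpacked(day_key: str) -> bool:
    """Likewise, a CSV on disk means that day has been unpacked."""
    return csv_path(day_key).exists()
```

With helpers like these, an asset only needs its own partition key to find its inputs and outputs, so nothing has to be passed between steps.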
s
I am in the same situation. If you find a good solution, ping me :)
s
hey @Jack Yin - how do you envision running this? i.e. do you want to run it for all the months at once? do you want to be able to run it for individual months? for individual days?
j
@Saul Burgos i ended up just cramming all the logic into the asset definition like an uncultured savage
@sandy I intend on being able to backfill it and then subsequently refreshing it daily, which will require re-downloading the zip and doing the whole ETL thing
so i guess i have a MonthlyPartitionedAsset upstream of a DailyPartitionedAsset, and as long as the Daily one can force the Monthly to reset then i’m good
ok yeah here’s an example of where i’m running into trouble - i’m storing the zips and CSVs in predetermined locations in the filesystem
but when i try to put the zip upstream of the CSV, it expects some output to be there
i guess i should just return some dummy value?
^jk i’m wrong about that, i just missed the end_offset parameter
s
cramming all the logic into a single asset definition is a totally reasonable approach
if it's important to be able to re-execute from the middle, but it doesn't make sense to model with multiple assets, you can use graph-backed assets: https://docs.dagster.io/concepts/assets/software-defined-assets#graph-backed-assets
also fwiw you can put a monthly-partitioned asset on a daily schedule
j
ah i see i was just about to ask about that, but that makes sense