# ask-community

Emmanuel Ogunwede

01/12/2023, 1:44 AM
Hey folks, I'm trying to build a simple POC ETL pipeline that runs on a monthly schedule: fetching data from an API, writing it to a data lake (S3), and finally loading the data from S3 into Snowflake. I'm struggling with certain decisions because I'm not sure if they're anti-patterns.

My setup:
1. I'm using `s3_pickle_io_manager` as my IO manager.
2. I've written a custom resource that fetches data from the API.
3. I've defined an asset that uses the custom resource to fetch the data.
4. I have another asset that uses `s3_resource` to write the API data to an S3 bucket (not sure if I should return the file name/S3 URL or return nothing).
5. I have a custom resource that can load the data from S3 as a dataframe in memory. My intention is to use it within an asset function to write the dataframe into a Snowflake table.

My confusion:
1. Are all the steps in my pipeline good candidates for software-defined assets in Dagster?
2. I'd want to use a monthly partitions definition on my assets, but I only really need it in the first asset, to dynamically generate the date part of a URL. Would it be an anti-pattern to define a partitioned asset job, and eventually a date-partitioned schedule, that materialises only the asset that fetches data from the API, while using asset reconciliation sensors to ensure that the asset writing to the data lake and the asset writing to Snowflake are never stale?
3. Is this a scenario where I really need just two assets (one to fetch API data and another to write to the data lake) and then a separate op responsible for moving data from S3 into Snowflake? Is that even a valid pattern?
4. Is it an anti-pattern to pass the monthly partitions definition to all of my assets (even if some assets can do without it) and just have one job and one schedule?
5. What's a recommended/optimal solution for my use case?

Thanks in advance!

sandy

01/12/2023, 9:54 PM
> Are all the steps in my pipeline good candidates for software-defined assets in Dagster?

Any step that produces an output in persistent storage is a good candidate for a software-defined asset.

> Is it an anti-pattern to pass the monthly partitions definition to all of my assets (even if some assets can do without it) and just have one job and one schedule?

Are the objects that these downstream assets create partitioned by month? That is, would you like the ability to overwrite just the data for a particular month without affecting the data for other months? If yes, then it's a good fit for partitions. If not, one option is to have a non-partitioned asset downstream of the partitioned asset. Partitioned and non-partitioned assets can go inside the same scheduled job.

Emmanuel Ogunwede

01/13/2023, 6:31 PM
Cool, I'll try this out. Thanks @sandy!