I have some existing code that generates a parquet...
# ask-community
p
I have some existing code that generates a parquet file from a Postgres source, which I then need to upload to S3 and then write to Snowflake. Is it better to think of this as an op or an asset?
t
Hey stranger! I'd think of it as two assets: one for how the parquet file gets made, and another for how it's written to Snowflake. One way to look at it is that these types of assets represent how the data is handed off between locations, in this case S3 and Snowflake.
(Feel free to ignore everything below here; possible risk of over-complicating the mental model.)
That's one way to view it, but there's some nuance. If you want more robustness over the compute engine -> S3 step, then two separate assets, one for the local parquet file and another for the file in S3, would work. It really depends on your needs, e.g. if writing to S3 fails sometimes, then splitting it into two and adding a `RetryPolicy` to the S3 asset would let you retry the upload without needing to redo the extract, too. tl;dr: typically just the S3 and Snowflake asset case would work.
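To make that two-asset picture concrete, here's a minimal sketch assuming Dagster's `@asset` decorator with a `retry_policy`, and pandas/boto3 for the extract and upload. The table, bucket, and asset names are made up for illustration:

```python
import boto3
import pandas as pd
from dagster import RetryPolicy, asset


@asset(retry_policy=RetryPolicy(max_retries=3, delay=30))
def orders_parquet_in_s3() -> None:
    """Extract from Postgres, write a local parquet file, and upload it to S3."""
    df = pd.read_sql("SELECT * FROM orders", "postgresql://user:pass@host/db")
    df.to_parquet("/tmp/orders.parquet")
    boto3.client("s3").upload_file("/tmp/orders.parquet", "my-bucket", "raw/orders.parquet")


@asset(deps=[orders_parquet_in_s3])
def orders_in_snowflake() -> None:
    """Load the S3 file into Snowflake, e.g. COPY INTO via a Snowflake resource."""
    # COPY INTO orders FROM @my_stage/raw/orders.parquet FILE_FORMAT = (TYPE = PARQUET)
    ...
```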
p
hi old friend. ooh that’s very helpful. first asset creates a parquet file, to some tmp folder? is the output from that asset the path to the tmp file, or does it need to load the parquet and pickle it using an IO manager? then the second asset would presumably either accept the file path or the pickled file and handle writing that to S3
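A rough sketch of the path-passing version being described here, assuming both assets run on the same machine so the tmp path is visible to both; the names and paths are illustrative:

```python
import boto3
import pandas as pd
from dagster import asset


@asset
def orders_parquet_path() -> str:
    """Write the extract to a temp parquet file and return its path."""
    path = "/tmp/orders.parquet"
    pd.read_sql("SELECT * FROM orders", "postgresql://user:pass@host/db").to_parquet(path)
    return path  # the default IO manager just pickles this small string, not the file


@asset
def orders_parquet_in_s3(orders_parquet_path: str) -> None:
    """Upload the local parquet file to S3 without loading it into memory."""
    # Only works if this asset runs where /tmp/orders.parquet actually exists.
    boto3.client("s3").upload_file(orders_parquet_path, "my-bucket", "raw/orders.parquet")
```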
wow there’s two pedrams
t
One account per pet. As it must be. It's typically an anti-pattern to pass metadata about the data (e.g. a file path) through as the asset's output, but if you want that control over the upload step, it's the easiest way to do it. Optimally, you wouldn't load your parquet file into memory in your S3 asset def. Your run config/resources would have enough context that the S3 upload step could resolve the file path programmatically, and your Snowflake asset could resolve that S3 path the same way.
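One way that resource-driven approach could look, as a hedged sketch: a made-up `LakePaths` `ConfigurableResource` owns the naming convention, so each asset resolves its own paths instead of receiving them from upstream:

```python
import boto3
import pandas as pd
from dagster import ConfigurableResource, Definitions, asset


class LakePaths(ConfigurableResource):
    """Shared naming conventions so each asset can resolve paths itself."""

    bucket: str
    prefix: str

    def s3_key(self, table: str) -> str:
        return f"{self.prefix}/{table}.parquet"


@asset
def orders_parquet_in_s3(lake: LakePaths) -> None:
    # Extract, write locally, and upload; the S3 key comes from the resource,
    # not from an upstream asset's return value.
    df = pd.read_sql("SELECT * FROM orders", "postgresql://user:pass@host/db")
    df.to_parquet("/tmp/orders.parquet")
    boto3.client("s3").upload_file("/tmp/orders.parquet", lake.bucket, lake.s3_key("orders"))


@asset(deps=[orders_parquet_in_s3])
def orders_in_snowflake(lake: LakePaths) -> None:
    # The Snowflake load resolves the same key from the same resource, e.g.:
    # COPY INTO orders FROM @my_stage/<lake.s3_key("orders")> FILE_FORMAT = (TYPE = PARQUET)
    ...


defs = Definitions(
    assets=[orders_parquet_in_s3, orders_in_snowflake],
    resources={"lake": LakePaths(bucket="my-bucket", prefix="raw")},
)
```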