question about syncing a partitioned asset from postgres to snowflake
# ask-community
r
question about syncing a partitioned asset from postgres to snowflake:
1. i have a `postgres.events` asset with a daily partition. it's fully materialized
2. i have a `snowflake.events` asset with a daily partition. it depends on `postgres.events`, and its partitions need to get materialized in date asc order
3. i was thinking of writing a job that iterates through the `postgres.events` partitions and enqueues the corresponding `snowflake.events` partition, but it has to be done sequentially, and it needs to abort upon failure (because i want to insert the data in order)

any tips on how to accomplish #3? or am i thinking about it wrong? thank you!
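for concreteness, this is roughly the loop i had in mind for #3 (just a sketch assuming dagster's `materialize` API; the asset name, partition start date, and asset body are made up):

```python
# rough sketch: materialize the snowflake asset partition-by-partition, oldest first,
# and abort the whole backfill at the first failure so rows land in date-asc order.
from dagster import DailyPartitionsDefinition, asset, materialize

daily = DailyPartitionsDefinition(start_date="2023-01-01")  # illustrative start date

@asset(partitions_def=daily)
def snowflake_events(context) -> None:
    # hypothetical body: load this partition's postgres data into snowflake
    ...

def backfill_in_order() -> None:
    # partition keys are ISO dates, so sorting gives date-ascending order
    for key in sorted(daily.get_partition_keys()):
        result = materialize([snowflake_events], partition_key=key, raise_on_error=False)
        if not result.success:
            raise RuntimeError(f"aborting backfill: partition {key} failed")
```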
t
Is there a particular reason why you'd like the inserts to be done sequentially? If you're clustering your Snowflake table by event time, Snowflake should automatically re-cluster on its own with the insert.

Depending on the scale of your data, I'd recommend not reading the postgres data into memory in order to write it to Snowflake. If it's too big to fit into memory, you might want to consider using Dagster to orchestrate:
• dumping the data out of Postgres into an S3 bucket (or something similar)
• having Snowflake recognize that bucket as an external stage
• then running a query to move the data from the stage directly into Snowflake.

And assuming you use the date/partition key as how you segment and store the data, the ingests should stay in step together when moving out of postgres and into snowflake.
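Roughly what the Snowflake side of that could look like (very much a sketch; the stage name, bucket URL, storage integration, table name, and date-prefixed layout are all placeholders):

```python
# sketch of the external stage + COPY INTO approach, assuming the parquet files land
# in S3 under one prefix per partition date. all identifiers below are placeholders.
import snowflake.connector

CREATE_STAGE = """
CREATE STAGE IF NOT EXISTS events_stage
  URL = 's3://my-bucket/events/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (TYPE = PARQUET)
"""

def copy_partition(conn, partition_date: str) -> None:
    # COPY INTO reads the partition's files straight from the stage into the table,
    # so the data never passes through the orchestrator's memory.
    conn.cursor().execute(
        f"""
        COPY INTO events
        FROM @events_stage/{partition_date}/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        """
    )

conn = snowflake.connector.connect(account="...", user="...", password="...")
try:
    conn.cursor().execute(CREATE_STAGE)
    copy_partition(conn, "2023-01-01")
finally:
    conn.close()
```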
r
hey Tim, thanks for the quick reply!
> Is there a particular reason why you'd like the inserts to be done sequentially?
yea i was wondering about this too. i found this answer on the Snowflake forum from a Snowflake employee saying order does matter 🤔 but i guess if i decide to specify the `cluster by` myself, then i'm back in control
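(by "specify the `cluster by` myself" i mean something like this, just as a sketch; the table and column names are placeholders:)

```python
# hypothetical: take over clustering instead of relying on insert order;
# snowflake then re-clusters `events` by event_date in the background.
import snowflake.connector

conn = snowflake.connector.connect(account="...", user="...", password="...")
conn.cursor().execute("ALTER TABLE events CLUSTER BY (event_date)")
conn.close()
```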
> Depending on the scale of your data, I'd recommend not reading the postgres data into memory
yea sorry i omitted that in my original post, but the `postgres.events` asset is actually postgres data hosted on s3 in parquet format. i guess i should rename the asset to `s3.events`? 🤔 that's also a point of confusion for me. should these be three separate assets? they're all the same data, but hosted in different places (postgres -> s3 parquet -> snowflake)
t
Oh yeah, they are right, but if the data is at the scale where you have to worry about your ingest, I think it's a fair time to start hijacking their partitioning strategy and managing the clustering on your own.

I like framing this as 3 assets:
• `postgres_events` - is "optional", but it's a source asset, since I assume your product is what populates this, not orchestrated by Dagster.
  ◦ I'm gonna ack that you can likely skip this asset if you don't want this level of observability or to touch declarative scheduling this early on.
• `s3_events` - the exporting and dumping of postgres data into S3
• `snowflake_events` - the query from the stage

The reason why I'd split this into multiple assets is so if any of those steps fail, you can re-run from there rather than having to do it all over again. ex. You likely won't get any errors when dumping from Postgres to S3, but I'd expect the copy into Snowflake step to fail a bit because the schema for `events` is likely prone to change. By splitting the assets, you can let the fail happen, patch up your query for copying into snowflake, and then re-run just that asset materialization, rather than both assets.
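In Dagster terms, the shape could be something like this (rough sketch only; the asset names, partition start date, and asset bodies are placeholders you'd fill in):

```python
# sketch of the three-asset framing with daily partitions; everything here is illustrative.
from dagster import AssetKey, DailyPartitionsDefinition, SourceAsset, asset

daily = DailyPartitionsDefinition(start_date="2023-01-01")

# postgres_events is populated by the product, not orchestrated by Dagster,
# so it's modeled as a source asset purely for lineage/observability.
postgres_events = SourceAsset(key=AssetKey("postgres_events"), partitions_def=daily)

@asset(partitions_def=daily, deps=["postgres_events"])
def s3_events(context) -> None:
    # hypothetical: export this partition's rows from postgres to S3 as parquet
    ...

@asset(partitions_def=daily, deps=[s3_events])
def snowflake_events(context) -> None:
    # hypothetical: COPY INTO snowflake from the external stage for this partition
    ...
```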
r
awesome, thank you so much for the great support! 🙏