question about syncing a partitioned asset from postgres to snowflake
# ask-community
r
question about syncing a partitioned asset from postgres to snowflake:
1. i have a `postgres.events` asset with a daily partition. it's fully materialized
2. i have a `snowflake.events` asset with a daily partition. it depends on `postgres.events`, and its partitions need to get materialized in date asc order
3. i was thinking of writing a job that iterates through the `postgres.events` partitions and enqueues the corresponding `snowflake.events` partition, but it has to be done sequentially, and it needs to abort upon failure (because i want to insert the data in order)

any tips on how to accomplish #3? or am i thinking about it wrong? thank you!
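for concreteness, this is roughly the loop i had in mind for #3 (just a sketch assuming dagster's `materialize` API; the asset name, partition start date, and asset body are made up):

```python
# rough sketch: materialize the snowflake asset partition-by-partition, oldest first,
# and abort the whole backfill at the first failure so rows land in date-asc order.
from dagster import DailyPartitionsDefinition, asset, materialize

daily = DailyPartitionsDefinition(start_date="2023-01-01")  # illustrative start date

@asset(partitions_def=daily)
def snowflake_events(context) -> None:
    # hypothetical body: load this partition's postgres data into snowflake
    ...

def backfill_in_order() -> None:
    # partition keys are ISO dates, so sorting gives date-ascending order
    for key in sorted(daily.get_partition_keys()):
        result = materialize([snowflake_events], partition_key=key, raise_on_error=False)
        if not result.success:
            raise RuntimeError(f"aborting backfill: partition {key} failed")
```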
t
Is there a particular reason why you'd like the inserts to be done sequentially? If you're clustering your Snowflake table by event time, Snowflake should automatically re-cluster on its own with the insert.

Depending on the scale of your data, I'd recommend not reading the postgres data into memory in order to write it to Snowflake. If it's too big to fit into memory, you might want to consider using Dagster to orchestrate:
• dumping the data out of Postgres into an S3 bucket (or something similar)
• having Snowflake recognize that bucket as an external stage
• then running a query to move the data from the stage directly into Snowflake.

And assuming you use the date/partition key as how you segment and store the data, the ingests should stay in step together when moving out of postgres and into snowflake.
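Roughly what the Snowflake side of that could look like (very much a sketch; the stage name, bucket URL, storage integration, table name, and date-prefixed layout are all placeholders):

```python
# sketch of the external stage + COPY INTO approach, assuming the parquet files land
# in S3 under one prefix per partition date. all identifiers below are placeholders.
import snowflake.connector

CREATE_STAGE = """
CREATE STAGE IF NOT EXISTS events_stage
  URL = 's3://my-bucket/events/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (TYPE = PARQUET)
"""

def copy_partition(conn, partition_date: str) -> None:
    # COPY INTO reads the partition's files straight from the stage into the table,
    # so the data never passes through the orchestrator's memory.
    conn.cursor().execute(
        f"""
        COPY INTO events
        FROM @events_stage/{partition_date}/
        FILE_FORMAT = (TYPE = PARQUET)
        MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
        """
    )

conn = snowflake.connector.connect(account="...", user="...", password="...")
try:
    conn.cursor().execute(CREATE_STAGE)
    copy_partition(conn, "2023-01-01")
finally:
    conn.close()
```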
r
hey Tim, thanks for the quick reply!
> Is there a particular reason why you'd like the inserts to be done sequentially?
yea i was wondering about this too. i found this answer on the Snowflake forum from a Snowflake employee saying order does matter 🤔 but i guess if i decide to specify the `cluster by` myself, then i'm back in control
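(by "specify the `cluster by` myself" i mean something like this, just as a sketch; the table and column names are placeholders:)

```python
# hypothetical: take over clustering instead of relying on insert order;
# snowflake then re-clusters `events` by event_date in the background.
import snowflake.connector

conn = snowflake.connector.connect(account="...", user="...", password="...")
conn.cursor().execute("ALTER TABLE events CLUSTER BY (event_date)")
conn.close()
```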
> Depending on the scale of your data, I'd recommend not reading the postgres data into memory
yea sorry i omitted that in my original post, but the `postgres.events` asset is actually postgres data hosted on s3 in parquet format. i guess i should rename the asset to `s3.events`? 🤔 that's also a point of confusion for me. should these be three separate assets? they're all the same data, but hosted in different places (postgres -> s3 parquet -> snowflake)
t
Oh yeah, they are right, but if the data is at the scale where you have to worry about your ingest, I think it's a fair time to start hijacking their partitioning strategy and managing the clustering on your own.

I like framing this as 3 assets:
• `postgres_events` - is "optional", but it's a source asset, since I assume your product is what populates this, not orchestrated by Dagster.
  ◦ I'm gonna ack that you can likely skip this asset if you don't want this level of observability or to touch declarative scheduling this early on.
• `s3_events` - the exporting and dumping of postgres data into S3
• `snowflake_events` - the query from the stage

The reason why I'd split this into multiple assets is so if any of those steps fail, you can re-run from there rather than having to do it all over again. ex. You likely won't get any errors when dumping from Postgres to S3, but I'd expect the copy into Snowflake step to fail a bit because the schema for `events` is likely prone to change. By splitting the assets, you can let the fail happen, patch up your query for copying into snowflake, and then re-run just that asset materialization, rather than both assets.
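In Dagster terms, the shape could be something like this (rough sketch only; the asset names, partition start date, and asset bodies are placeholders you'd fill in):

```python
# sketch of the three-asset framing with daily partitions; everything here is illustrative.
from dagster import AssetKey, DailyPartitionsDefinition, SourceAsset, asset

daily = DailyPartitionsDefinition(start_date="2023-01-01")

# postgres_events is populated by the product, not orchestrated by Dagster,
# so it's modeled as a source asset purely for lineage/observability.
postgres_events = SourceAsset(key=AssetKey("postgres_events"), partitions_def=daily)

@asset(partitions_def=daily, deps=["postgres_events"])
def s3_events(context) -> None:
    # hypothetical: export this partition's rows from postgres to S3 as parquet
    ...

@asset(partitions_def=daily, deps=[s3_events])
def snowflake_events(context) -> None:
    # hypothetical: COPY INTO snowflake from the external stage for this partition
    ...
```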
r
awesome, thank you so much for the great support! 🙏