Rahil Sondhi  06/29/2023, 8:36 PM
1. i have a postgres.events asset with a daily partition. it's fully materialized
2. i have a snowflake.events asset with a daily partition. these depend on postgres.events, and they need to get materialized, but in date asc order
3. i was thinking of writing a job that iterates through the postgres.events partitions and enqueues the corresponding snowflake.events partition, but it has to be done sequentially, and it needs to abort upon failure (because i want to insert the data in order)
any tips on how to accomplish #3? or am i thinking about it wrong? thank you!
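A minimal sketch of the loop described in point 3, in plain Python rather than Dagster's API. The `materialize` callback stands in for whatever actually runs one snowflake.events partition; all names here are hypothetical, chosen for illustration:

```python
from datetime import date, timedelta


def daily_partitions(start: date, end: date):
    """Yield daily partition keys in ascending date order."""
    d = start
    while d <= end:
        yield d.isoformat()
        d += timedelta(days=1)


def backfill_in_order(start: date, end: date, materialize) -> list:
    """Materialize each partition sequentially, oldest first.

    Aborts on the first failure so no later date is ever inserted
    before an earlier one; returns the keys that succeeded.
    """
    done = []
    for key in daily_partitions(start, end):
        try:
            materialize(key)  # e.g. run the snowflake.events load for this date
        except Exception as exc:
            print(f"aborting backfill at {key}: {exc}")
            break  # abort: partitions after the failure stay unmaterialized
        done.append(key)
    return done
```

Because the loop breaks rather than continuing past a failure, re-running it later picks up in order from the first unfinished date.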
Tim Castillo  06/29/2023, 8:43 PM
Is there a particular reason why you'd like the inserts to be done sequentially? Depending on the scale of your data, I'd recommend not reading the postgres data into memory.

Rahil Sondhi  06/29/2023, 8:48 PM
> Is there a particular reason why you'd like the inserts to be done sequentially?
yea i was wondering about this too. i found this answer on the Snowflake forum from a Snowflake employee saying order does matter 🤔 but i guess if i decide to specify the cluster by myself, then i'm back in control
> Depending on the scale of your data, I'd recommend not reading the postgres data into memory
yea sorry i omitted that in my original post, but the postgres.events asset is actually postgres data hosted on s3 in parquet format. i guess i should rename the asset to s3.events? 🤔
that's also a point of confusion for me. should these be three separate assets? they're all the same data, but hosted in different places (postgres -> s3 parquet -> snowflake)
06/29/2023, 8:58 PMpostgres_events
- is "optional", but it's a source asset, since I assume your product is what populates this, not orchestrated by Dagster.
◦ I'm gonna ack that this asset you can likely skip if you don't want this level of observability or touch declarative scheduling this early on.
• s3_events
- the exporting and dumping of postgres data into S3
• snowflake_events
- the query from the stage
The reason why I'd split this into multiple assets is so if any of those steps fail, you can re-run from there rather than having to do it all over again.
ex. You likely won't get any errors when dumping from Postgres to S3, but I'd expect the copy into Snowflake step to fail a bit because the schema for events
is likely prone to change. By splitting the assets, you can let the fail happen, patch up your query for copying into snowflake, and then re-run just that asset materialization, rather than both assets.Rahil Sondhi
Rahil Sondhi  06/30/2023, 1:22 AM