Fede Isas

03/14/2023, 7:22 PM
Hi! I have an RDS DB behind a multi-tenant, single-schema app (every table has a `tenant_id`). <100 million rows total count (across all tables and tenants). I'd like to periodically (hourly?) generate S3 parquet files per table per tenant: `/lake/{tenant_id}/{table}.parquet`. Transformations are basically plucking and renaming columns. My research led me to consider Airbyte + Dagster to accomplish this. Am I on the right track?
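(A minimal sketch of the per-(tenant, table) export described above, assuming pandas with s3fs and a Postgres RDS instance; the table name, column mapping, bucket, and connection string are all placeholders:)

```python
import pandas as pd
from sqlalchemy import create_engine, text

# Hypothetical column mapping for one table; real names will differ.
COLUMNS = {"id": "order_id", "total_cents": "total", "created_at": "created_at"}

def export_tenant_table(engine, tenant_id: str, table: str) -> None:
    # Pull only this tenant's rows and only the columns we care about.
    cols = ", ".join(COLUMNS)
    df = pd.read_sql(
        text(f"SELECT {cols} FROM {table} WHERE tenant_id = :tenant_id"),
        engine,
        params={"tenant_id": tenant_id},
    )
    # "Plucking and renaming" is just a rename after the SELECT.
    df = df.rename(columns=COLUMNS)
    # Writing to an s3:// URL requires s3fs; the bucket is a placeholder.
    df.to_parquet(f"s3://my-lake/lake/{tenant_id}/{table}.parquet", index=False)

engine = create_engine("postgresql+psycopg2://user:pass@rds-host/db")
export_tenant_table(engine, "tenant_42", "orders")
```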
Adam Bloom

03/14/2023, 7:25 PM
Are you planning to replicate your source database to a warehouse and generate your parquet files from there? Or are those files going to be generated from the source database?
Fede Isas

03/14/2023, 7:28 PM
Hi! Not sure, but ideally I'd like to avoid crushing the source DB, so I think my best bet is to extract RDS -> S3 first and then run a job per tenant per table to generate each parquet file?
Adam Bloom

03/14/2023, 7:56 PM
So, CDC sync for DB -> S3? If that isn't CDC, then I'm not sure that step does much to avoid load. However, if it is CDC, you'll end up with many files (an increasing number over time) to use in your parquet generation job
Fede Isas

03/14/2023, 8:38 PM
Yes, I think CDC would be ideal. Looking into AWS DMS right now
Adam Bloom

03/14/2023, 8:42 PM
If you’re considering AWS DMS, then you likely don’t need Airbyte.
Fede Isas

03/14/2023, 8:43 PM
So Dagster would use a sensor on S3 regardless of how I get the data there? How should I think about materializing n partitions per tenant for each source in Dagster?
Adam Bloom

03/14/2023, 9:49 PM
You could just have a sensor for each (tenant, partition) pair, depending on how frequently those change. You can define sensors and assets with factory patterns, so that doesn't require repeating any code in Dagster
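(A minimal sketch of that factory pattern, assuming the CDC files land under an `s3://my-bucket/cdc/{tenant}/{table}/` prefix; the bucket, prefix, and tenant/table lists are placeholders:)

```python
import boto3
from dagster import (
    AssetSelection, Definitions, RunRequest, SkipReason,
    asset, define_asset_job, sensor,
)

TENANTS = ["tenant_1", "tenant_2"]  # placeholder
TABLES = ["orders", "customers"]    # placeholder

def build_assets_and_sensor(tenant_id: str, table: str):
    # One asset per (tenant, table): reads the landed CDC files and
    # writes /lake/{tenant_id}/{table}.parquet.
    @asset(name=f"{tenant_id}__{table}__parquet")
    def _parquet_asset():
        ...  # pluck/rename columns and write the parquet file

    job = define_asset_job(
        f"{tenant_id}__{table}__job",
        selection=AssetSelection.assets(_parquet_asset),
    )

    # One sensor per (tenant, table): fires only when new files land in S3.
    @sensor(name=f"{tenant_id}__{table}__sensor", job=job)
    def _s3_sensor(context):
        s3 = boto3.client("s3")
        resp = s3.list_objects_v2(
            Bucket="my-bucket", Prefix=f"cdc/{tenant_id}/{table}/"
        )
        keys = sorted(obj["Key"] for obj in resp.get("Contents", []))
        new_keys = [k for k in keys if k > (context.cursor or "")]
        if not new_keys:
            return SkipReason("no new files")
        context.update_cursor(new_keys[-1])  # remember the latest key seen
        return RunRequest(run_key=new_keys[-1])

    return _parquet_asset, job, _s3_sensor

built = [build_assets_and_sensor(t, tbl) for t in TENANTS for tbl in TABLES]
defs = Definitions(
    assets=[a for a, _, _ in built],
    jobs=[j for _, j, _ in built],
    sensors=[s for _, _, s in built],
)
```

(Each sensor keeps the last-seen S3 key in its cursor, so a run is only requested when new files appear for that tenant/table.)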
Fede Isas

03/14/2023, 9:53 PM
@Adam Bloom I'll look into those. Still getting familiar with all the Dagster concepts. Thanks a lot!