# ask-community
f
Hi! I have an RDS DB with a multi-tenant, single-schema app (every table has a `tenant_id`). Fewer than 100 million rows total, across all tables and tenants. I'd like to periodically (hourly?) generate S3 parquet files per table per tenant: `/lake/{tenant_id}/{table}.parquet`. Transformations are basically plucking and renaming columns. My research led me to consider Airbyte + Dagster to accomplish this. Am I on the right track?
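(For illustration, a minimal sketch of the per-tenant, per-table transform described above, assuming pandas with s3fs installed for direct S3 writes; the column mapping and names are placeholders, not from the thread.)

```python
import pandas as pd

# Hypothetical column mapping for one table; the real app would keep
# one mapping per table ("plucking and renaming columns").
COLUMN_MAP = {"created_at": "created", "full_name": "name"}

def export_tenant_table(df: pd.DataFrame, tenant_id: str, table: str) -> None:
    """Write one tenant's slice of one table to the per-tenant lake path."""
    out = (
        df[df["tenant_id"] == tenant_id]   # keep one tenant's rows
        .loc[:, list(COLUMN_MAP)]          # pluck only the mapped columns
        .rename(columns=COLUMN_MAP)        # rename to the target schema
    )
    # pandas can write straight to an s3:// URI when s3fs is installed
    out.to_parquet(f"s3://lake/{tenant_id}/{table}.parquet", index=False)
```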
a
Are you planning to replicate your source database to a warehouse and generate your parquet files from there? Or are those files going to be generated from the source database?
f
Hi! Not sure, but ideally I'd like to avoid crushing the source DB, so I think my best bet is to extract RDS -> S3 first and then run a job per tenant per table to generate each parquet file?
a
So, CDC sync for DB -> S3? If that isn't CDC, then I'm not sure the staging step does much to avoid load on the source. However, if it is CDC, you'll end up with many files (an increasing number over time) to use in your parquet generation job
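(A sketch of how those accumulating CDC files might be collapsed into current-state rows before the parquet step. The operation and timestamp column names follow typical AWS DMS output but are assumptions here, as is the primary-key column.)

```python
import pandas as pd

def compact_cdc_batch(paths: list[str], key: str = "id") -> pd.DataFrame:
    """Collapse a batch of CDC change files into current-state rows.

    Assumes each file carries an operation column ("Op": I/U/D) and a
    commit timestamp column ("commit_ts"); both names are placeholders.
    """
    df = pd.concat((pd.read_parquet(p) for p in paths), ignore_index=True)
    # keep only the latest change per primary key
    df = df.sort_values("commit_ts").drop_duplicates(subset=key, keep="last")
    # drop rows whose latest change is a delete, plus the CDC metadata
    return df[df["Op"] != "D"].drop(columns=["Op", "commit_ts"])
```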
f
Yes, I think CDC would be ideal. Looking into AWS DMS right now
a
If you’re considering AWS DMS, then you likely don’t need Airbyte.
f
So Dagster would run a sensor against S3 regardless of how I get the data there? How should I think about materializing n partitions per tenant for each source in Dagster?
a
You could just have a sensor for each (tenant, partition) pair, depending on how frequently those change. You can define sensors and assets with factory patterns, so that doesn't require repeating any code in Dagster
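(A rough sketch of that factory pattern, assuming a recent Dagster 1.x API; the tenant and table lists are placeholders.)

```python
from dagster import AssetsDefinition, Definitions, asset

TENANTS = ["acme", "globex"]   # placeholder tenant ids
TABLES = ["orders", "users"]   # placeholder table names

def build_lake_asset(tenant: str, table: str) -> AssetsDefinition:
    """Factory: one asset per (tenant, table) pair, with no repeated code."""

    @asset(name=f"lake_{tenant}_{table}")
    def _asset() -> None:
        # read the staged CDC files for this pair from S3, compact them,
        # pluck/rename columns, and write /lake/{tenant}/{table}.parquet
        ...

    return _asset

defs = Definitions(
    assets=[build_lake_asset(t, tbl) for t in TENANTS for tbl in TABLES],
)
```

Sensors can be stamped out by a similar factory, each watching its pair's S3 prefix and requesting a run of the matching asset when new files land.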
f
@Adam Bloom I'll look into those. Still getting familiar with all the Dagster concepts. Thanks a lot!