# ask-community
Jérémie Bouhet
Hello! We are creating our first end-to-end pipeline that generates Gold data by executing successive dbt commands. The process is typically:

Step 1:
• Input: Snowflake raw table
• Job: dbt run to create Bronze data
• Output: Bronze data table

Step 2:
• Input: Bronze data table
• Job: dbt run to create Silver data
• Output: Silver data table

Step 3:
• Input: Silver data table
• Job: dbt run to create Gold data
• Output: Gold data table

Today we run all these steps one by one, directly in Production. The problem is that when we have dirty or wrong data, we populate it straight into production tables, and our data observability tooling only tells us afterwards that we have quality issues.

What we want: catch these quality issues ahead of time by running tests on the data before publishing it to Production. Do you have any guidelines for doing this with dbt/Dagster pipelines?

The idea I would like to implement is to execute every step in an isolated staging area/environment that reproduces the Production one (via a Snowflake clone, for example), apply all my quality checks there, and then switch the staging tables over to Production once they pass the quality gate. For example, the workflow for Step 1 would be:

Step 1:
• Input: Snowflake raw table
• Job:
  ◦ dbt run to create Bronze data in the Staging environment (which replicates the Production one)
  ◦ Apply quality checks
  ◦ Switch the staging tables to the Production environment
• Output: Bronze data table

What do you think? Do you have any best practices/guidelines to help us build this kind of pipeline, integrated with a Snowflake/dbt/Dagster stack? Thanks a lot!
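(A minimal sketch of the clone step described above, using the snowflake-connector-python client; the `analytics.prod` / `analytics.staging` names and the environment variables are illustrative placeholders, not anything from the thread.)

```python
# Minimal sketch: rebuild the staging schema as a zero-copy clone of prod
# before the dbt runs. Database/schema names and env vars are assumptions.
import os

import snowflake.connector

conn = snowflake.connector.connect(
    account=os.environ["SNOWFLAKE_ACCOUNT"],
    user=os.environ["SNOWFLAKE_USER"],
    password=os.environ["SNOWFLAKE_PASSWORD"],
)
try:
    cur = conn.cursor()
    # Zero-copy clone: staging starts as an exact, cheap copy of production.
    cur.execute("CREATE OR REPLACE SCHEMA analytics.staging CLONE analytics.prod")
    cur.close()
finally:
    conn.close()
```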
Hi @Jérémie Bouhet! Sounds like you're describing a blue/green deployment. The way I would envision setting this up in Dagster would be with a series of ops:
bronze dbt run -> silver dbt run -> gold dbt run -> data quality check on staging tables -> swap staging tables to prod
The bronze dbt project would read from the production data source, but output to a staging schema (and the silver/gold projects would read from and write to staging schemas as well). Only once the data quality check has passed will these tables get promoted to production. We have libraries for interacting with both dbt and Snowflake (dagster-dbt and dagster-snowflake) which should make those bits pretty quick to set up.
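(Below is a minimal sketch of that op chain, not a definitive implementation: it shells out to the plain dbt CLI via subprocess and hand-rolls the swap; the model selections, the `staging` dbt target, and the table names are assumptions, and in practice the dagster-dbt and dagster-snowflake resources could replace the hand-rolled pieces.)

```python
import os
import subprocess

import snowflake.connector
from dagster import In, Nothing, job, op


def dbt_run(select: str) -> None:
    # Build the selected models against the staging target; check=True
    # surfaces any dbt failure as an op failure.
    subprocess.run(
        ["dbt", "run", "--select", select, "--target", "staging"], check=True
    )


@op
def bronze_dbt_run():
    dbt_run("bronze")


@op(ins={"start": In(Nothing)})
def silver_dbt_run():
    dbt_run("silver")


@op(ins={"start": In(Nothing)})
def gold_dbt_run():
    dbt_run("gold")


@op(ins={"start": In(Nothing)})
def quality_checks():
    # dbt test runs the project's schema/data tests against the staging
    # tables; a non-zero exit fails this op and blocks the swap downstream.
    subprocess.run(["dbt", "test", "--target", "staging"], check=True)


@op(ins={"start": In(Nothing)})
def swap_staging_to_prod():
    # Snowflake's SWAP WITH atomically exchanges staging and prod tables.
    # Table and schema names here are hypothetical.
    conn = snowflake.connector.connect(
        account=os.environ["SNOWFLAKE_ACCOUNT"],
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
    )
    try:
        cur = conn.cursor()
        for table in ("bronze", "silver", "gold"):
            cur.execute(
                f"ALTER TABLE analytics.prod.{table} "
                f"SWAP WITH analytics.staging.{table}"
            )
        cur.close()
    finally:
        conn.close()


@job
def blue_green_deploy():
    # Nothing-typed inputs enforce ordering without passing data between ops.
    swap_staging_to_prod(
        start=quality_checks(
            start=gold_dbt_run(start=silver_dbt_run(start=bronze_dbt_run()))
        )
    )
```

Keeping the swap as its own op at the end of the chain means the quality gate sits between building the staging tables and promoting them, so nothing reaches production unless every upstream op succeeds.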