# data-platform-design
c
How do you source your data in pre-production environments? For faster development and increased security we want to have a fully separate pre-production cloud environment. Ideally, it would hold an exact copy of the data present in our production warehouse or lake. The problem is that this poses a security risk and isn't cost-effective. How are you dealing with this issue? Do you have an anonymization pipeline from prod into staging? Do you generate fakes? It seems complicated for us since our dbt project is very large and some raw sources can't easily be reproduced, especially SaaS sources, which rarely have a non-prod equivalent, unlike an operational database, which lives in pre-production by default.
y
one way we recommend (and use internally) for Snowflake users is to leverage Snowflake's zero-copy clone: https://docs.dagster.io/guides/dagster/branch_deployments — on PR/branch creation, you can kick off a Snowflake schema clone, which gets you up-to-date source data without polluting the prod data. here's the open source version of it: https://docs.dagster.io/guides/dagster/transitioning-data-pipelines-from-development-to-production
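As a rough illustration of the pattern above (not the actual Dagster branch-deployment code — the database, schema, and function names here are made up), the automation boils down to running one `CREATE SCHEMA ... CLONE` statement per PR. Snowflake clones are zero-copy, so the clone shares storage with the source until rows are modified:

```python
# Hedged sketch: build the DDL a branch deployment might run when a PR opens.
# Database/schema names and the branch-naming convention are assumptions.
def clone_schema_sql(database: str, source_schema: str, branch: str) -> str:
    """Return Snowflake DDL that clones a schema for a PR branch.

    Snowflake's CLONE is zero-copy: the clone initially shares the
    source's micro-partitions, so creating it is cheap and fast.
    """
    # Derive a target schema name from the branch, e.g. feat-123 -> PROD_PR_FEAT_123.
    target = f"{source_schema}_PR_{branch.upper().replace('-', '_')}"
    return (
        f"CREATE SCHEMA IF NOT EXISTS {database}.{target} "
        f"CLONE {database}.{source_schema};"
    )

print(clone_schema_sql("ANALYTICS", "PROD", "feat-123"))
# → CREATE SCHEMA IF NOT EXISTS ANALYTICS.PROD_PR_FEAT_123 CLONE ANALYTICS.PROD;
```

A matching `DROP SCHEMA` on PR close keeps the account tidy; the clone only starts costing storage once the branch's dbt run rewrites data in it.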
c
Thanks for sharing, I like the idea. It wouldn't work if I'm running my code in a different account with no access to production data, but it's definitely a nice trick if that's not the case.
c
+1 for snowflake clones. I've also used Faker with factory_boy to populate source tables in automated tests to allow for testing transformation logic and pipeline integration.
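To make the fake-data idea concrete: the comment above uses Faker with factory_boy, but the same pattern can be sketched with only the standard library (everything below — field names, the `fake_customers` helper, the seed — is illustrative, not the commenter's actual code). The key property for tests is determinism, so a seeded RNG stands in for Faker here:

```python
# Hedged stdlib-only sketch of seeding a source table with fake rows for
# automated tests; in practice Faker/factory_boy generate richer values.
import random
import string


def fake_email(rng: random.Random) -> str:
    # Random local part plus a reserved example domain, so no real
    # address can ever leak into test fixtures.
    user = "".join(rng.choices(string.ascii_lowercase, k=8))
    return f"{user}@example.com"


def fake_customers(n: int, seed: int = 42) -> list[dict]:
    """Generate deterministic fake rows to load into a test source table."""
    rng = random.Random(seed)  # fixed seed -> reproducible fixtures
    return [
        {
            "id": i,
            "email": fake_email(rng),
            "lifetime_value": round(rng.uniform(0, 500), 2),
        }
        for i in range(n)
    ]


rows = fake_customers(3)
```

Because the rows are reproducible, transformation tests can assert on exact outputs instead of just shapes, and no production PII ever enters the test environment.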