# data-platform-design
c
How do you source your data in pre-production environments? For faster development and increased security we want to have a fully separate pre-production cloud environment. Ideally, it would hold an exact copy of the data present in our production warehouse or lake. The problem is that this poses a security risk and isn't cost-effective. How are you dealing with this issue? Do you have an anonymization pipeline from prod into staging? Do you generate fakes? It seems complicated for us since our dbt project is very large and some raw sources can't easily be reproduced, especially SaaS sources, which rarely have a non-prod equivalent, unlike an operational database, which lives in pre-production by default.
y
one way we recommend (and use internally) for Snowflake users is to leverage Snowflake's zero-copy clone: https://docs.dagster.io/guides/dagster/branch_deployments — on PR/branch creation, you can kick off a Snowflake schema clone, which gets you up-to-date source data without polluting the prod data. here's the open source version of it: https://docs.dagster.io/guides/dagster/transitioning-data-pipelines-from-development-to-production
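As a rough illustration of the pattern above (not the actual Dagster branch-deployment code — the database, schema, and function names here are made up), the automation boils down to running one `CREATE SCHEMA ... CLONE` statement per PR. Snowflake clones are zero-copy, so the clone shares storage with the source until rows are modified:

```python
# Hedged sketch: build the DDL a branch deployment might run when a PR opens.
# Database/schema names and the branch-naming convention are assumptions.
def clone_schema_sql(database: str, source_schema: str, branch: str) -> str:
    """Return Snowflake DDL that clones a schema for a PR branch.

    Snowflake's CLONE is zero-copy: the clone initially shares the
    source's micro-partitions, so creating it is cheap and fast.
    """
    # Derive a target schema name from the branch, e.g. feat-123 -> PROD_PR_FEAT_123.
    target = f"{source_schema}_PR_{branch.upper().replace('-', '_')}"
    return (
        f"CREATE SCHEMA IF NOT EXISTS {database}.{target} "
        f"CLONE {database}.{source_schema};"
    )

print(clone_schema_sql("ANALYTICS", "PROD", "feat-123"))
# → CREATE SCHEMA IF NOT EXISTS ANALYTICS.PROD_PR_FEAT_123 CLONE ANALYTICS.PROD;
```

A matching `DROP SCHEMA` on PR close keeps the account tidy; the clone only starts costing storage once the branch's dbt run rewrites data in it.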
c
Thanks for sharing, I like the idea. It wouldn't work if I'm running my code in a different account with no access to production data, but it's definitely a nice trick if that's not the case.
c
+1 for snowflake clones. I've also used Faker with factory_boy to populate source tables in automated tests to allow for testing transformation logic and pipeline integration.
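To make the fake-data idea concrete: the comment above uses Faker with factory_boy, but the same pattern can be sketched with only the standard library (everything below — field names, the `fake_customers` helper, the seed — is illustrative, not the commenter's actual code). The key property for tests is determinism, so a seeded RNG stands in for Faker here:

```python
# Hedged stdlib-only sketch of seeding a source table with fake rows for
# automated tests; in practice Faker/factory_boy generate richer values.
import random
import string


def fake_email(rng: random.Random) -> str:
    # Random local part plus a reserved example domain, so no real
    # address can ever leak into test fixtures.
    user = "".join(rng.choices(string.ascii_lowercase, k=8))
    return f"{user}@example.com"


def fake_customers(n: int, seed: int = 42) -> list[dict]:
    """Generate deterministic fake rows to load into a test source table."""
    rng = random.Random(seed)  # fixed seed -> reproducible fixtures
    return [
        {
            "id": i,
            "email": fake_email(rng),
            "lifetime_value": round(rng.uniform(0, 500), 2),
        }
        for i in range(n)
    ]


rows = fake_customers(3)
```

Because the rows are reproducible, transformation tests can assert on exact outputs instead of just shapes, and no production PII ever enters the test environment.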