Has anyone been running into data quality issues with cross-db data replication e.g. Postgres > Snowflake? If so, how did you go about validating the data consistency/correctness and fixing the quality issues?
Currently working on an open-source package to validate the correctness of db replication and would love more input. If anyone has thoughts about this and would be open to sharing, that’d be super helpful!
I used to work on a team that focused on streaming replication from on prem Postgres to S3 (which was loaded into other databases and directly queryable via Redshift Spectrum and Athena).
We used a combination of sampling and checksumming to assert validity
Roughly, our flow was:
Postgres WAL > Kafka > microbatches in S3 > file concatenation via Spark > final representation in S3
We’d regularly sample entire records in Postgres and assert the same records existed in S3.
We’d also regularly run simple checksums (usually record counts between date or transaction ranges) on both Postgres and the final representation in S3.
With intermediate checks along the way (like counting records in and records out during the file concatenation step, etc.)