https://dagster.io/ logo
#ask-community
Title
# ask-community
s

Sofiane Lounici

03/08/2024, 2:04 PM
Hi, I have an architecture question: we have zipped CSV files on a S3 Bucket that we want to ingest daily into a Postgres database (1 file = 1 table). It's possible that these files won't change every day (some yes, some no), so we want an efficient process (with a cache system ?) to avoid delete rows and re-ingest everything everyday. For instance, if we have two files (agents, equipement):
Copy code
# agents
agent_id,first_name,last_name
John - Smith - STORAGE, John, Smith
Mary - Johnson - STORAGE, Mary, Johnson
Susan - Williams - WAREHOUSE, Susan, Williams
James - Brown - WAREHOUSE, James, Brown
Copy code
# equipement
equipement_id,equipement_name, agent_id
1 - STORAGE - Camera, Camera, John - Smith - STORAGE
1 - STORAGE - Light, Light, Mary - Johnson - STORAGE
2 - WAREHOUSE - Light, Light, Susan - Williams - WAREHOUSE
2 - WAREHOUSE - Camera, Camera, James - Brown - WAREHOUSE
I want to insert these files into agent et equipement, as well as building the foreign keys. Which tools I can use with Dagster to perform this ? Is Airbyte (for instance) adapted to this task ? Do you have any recommendation on building efficient ingestion system ?
e

Ella Rohm-Ensing

03/08/2024, 7:31 PM
Hi Soufiane - chiming in from Airbyte here. Airbyte explicitly moves data and is not aimed toward database replication (which I know is not exactly what you're looking to do, but I think is relevant here, as the FK question pertains more to metadata) - it won't build the foreign keys in the Postgres destination. Otherwise, though, your question about
It's possible that these files won't change every day (some yes, some no), so we want an efficient process (with a cache system ?) to avoid delete rows and re-ingest everything everyday.
refers to incremental syncing (keep track of cursors, only sync what's changed), which is supported on the S3 source.
s

Sofiane Lounici

03/20/2024, 3:08 PM
Thank you !
5 Views