Nicolas Guary
06/12/2023, 1:23 PM
• Set the io_manager_key on my dbt assets to a BigQueryPandasIOManager, to tell Dagster that any asset created using dbt should be handled by a BigQuery IO manager.
• Create an asset with a custom IO manager for PostgreSQL that simply takes the dbt model name as an argument and returns it; this asset will use the BigQueryPandasIOManager to read the input data and the custom PostgreSQL IO manager to write it.
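For what it's worth, here's a minimal sketch of what I have in mind for the custom PostgreSQL IO manager. It's duck-typed to Dagster's IOManager interface (handle_output / load_input) so it runs on its own; SQLite stands in for PostgreSQL, and the table-naming scheme is just an assumption:

```python
# Sketch of a custom pandas IO manager following Dagster's IOManager
# interface (handle_output / load_input), duck-typed so it runs without
# Dagster installed. SQLite stands in for PostgreSQL; a real version
# would subclass dagster.IOManager and build a SQLAlchemy engine to
# Postgres from resource config.
import sqlite3

import pandas as pd


class PostgresPandasIOManager:
    """Stores each asset's pandas DataFrame in a table named after the asset."""

    def __init__(self, conn):
        # conn: DB-API connection (real version: SQLAlchemy engine to Postgres)
        self.conn = conn

    def _table_name(self, context) -> str:
        # Dagster would expose context.asset_key.path[-1]; accept a plain
        # string here so the sketch stays runnable on its own.
        return context if isinstance(context, str) else context.asset_key.path[-1]

    def handle_output(self, context, obj: pd.DataFrame) -> None:
        # Write the asset's DataFrame to a table named after the asset.
        obj.to_sql(self._table_name(context), self.conn, if_exists="replace", index=False)

    def load_input(self, context) -> pd.DataFrame:
        # Read the upstream asset's table back into a DataFrame.
        return pd.read_sql(f"SELECT * FROM {self._table_name(context)}", self.conn)
```

The idea would then be to register it under a resource key (e.g. "postgres_io_manager") and keep the BigQuery manager on the dbt assets for reads.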
Is this the right approach? If so, I believe Dagster will load the whole table in memory; is there any way to mitigate this?
Thanks in advance for your input!

owen
06/12/2023, 6:13 PM
Using the BigQueryPandasIOManager would result in the entire table being loaded in memory, but I believe working around this is a more general problem rather than a Dagster-specific one. In most cases, for cross-database replication, you'll need some process to read all of your data from the source and write all of it to the destination.
Depending on the scale of your data, this could be done as you describe, but once the table no longer fits in the memory of a single process, you'll need a more elaborate approach. There are purpose-built tools for this use case, such as Fivetran, Airbyte, or Stitch, or you could build your own scheme for handling it.
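As a rough sketch of that general read-then-write process, a chunked copy never holds more than one chunk in memory, so the copy is bounded by chunk size rather than table size. SQLite stands in for both warehouses here; in practice the source would be BigQuery and the destination PostgreSQL:

```python
# Chunked cross-database copy: stream the source table in fixed-size
# chunks and append each chunk to the destination. Memory use is
# bounded by the chunk size, not the table size. SQLite stands in for
# both BigQuery (source) and PostgreSQL (destination).
import sqlite3

import pandas as pd


def copy_table(src_conn, dst_conn, table: str, chunksize: int = 10_000) -> int:
    """Copy `table` from src to dst in chunks; returns the number of rows copied."""
    copied = 0
    # pd.read_sql with chunksize yields an iterator of DataFrames.
    for chunk in pd.read_sql(f"SELECT * FROM {table}", src_conn, chunksize=chunksize):
        chunk.to_sql(table, dst_conn, if_exists="append", index=False)
        copied += len(chunk)
    return copied
```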
Another option (depending on how your tables are structured) might be to use partitions to limit the amount of data handled at any given time.

Nicolas Guary
06/12/2023, 7:56 PM
BigQueryPandasIOManager + a custom IO manager for PostgreSQL (by the way, are there any plans to release an official IO manager for PostgreSQL?), and if the problem scales a lot I'll see how I can use an external tool for this.
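Your partition suggestion might look roughly like this: each run copies only one partition's slice (here, one day), so memory is bounded by the largest partition. SQLite stands in for both databases again, and the event_date column and table name are assumptions:

```python
# Per-partition copy: instead of replicating the whole table at once,
# each run moves only one partition's slice (here, one day). In Dagster
# this would correspond to an asset with a DailyPartitionsDefinition,
# where context.partition_key selects the slice. The `event_date`
# column is an assumed schema; SQLite stands in for both databases.
import sqlite3

import pandas as pd


def copy_partition(src_conn, dst_conn, table: str, partition_key: str) -> int:
    """Copy only the rows whose event_date equals the partition key."""
    slice_df = pd.read_sql(
        f"SELECT * FROM {table} WHERE event_date = ?",
        src_conn,
        params=(partition_key,),
    )
    slice_df.to_sql(table, dst_conn, if_exists="append", index=False)
    return len(slice_df)
```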
I'll look into Partitions too, this could be useful!

owen
06/12/2023, 8:22 PM

Nicolas Guary
06/13/2023, 6:58 AM