Nicolas Guary
06/12/2023, 1:23 PM
• Set the io_manager_key on my dbt assets to a BigQueryPandasIOManager, to tell Dagster that any asset created using dbt should be handled by a BigQuery IO manager.
• Create an asset with a custom IO manager for PostgreSQL that simply takes the dbt model name as an argument and returns it; this asset will use the BigQueryPandasIOManager to read the input data and the custom PostgreSQL IO manager to write it.
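For what it's worth, here's a minimal sketch of what I have in mind for the custom PostgreSQL IO manager. It's duck-typed to Dagster's IOManager interface (handle_output / load_input) so it runs on its own; SQLite stands in for PostgreSQL, and the table-naming scheme is just an assumption:

```python
# Sketch of a custom pandas IO manager following Dagster's IOManager
# interface (handle_output / load_input), duck-typed so it runs without
# Dagster installed. SQLite stands in for PostgreSQL; a real version
# would subclass dagster.IOManager and build a SQLAlchemy engine to
# Postgres from resource config.
import sqlite3

import pandas as pd


class PostgresPandasIOManager:
    """Stores each asset's pandas DataFrame in a table named after the asset."""

    def __init__(self, conn):
        # conn: DB-API connection (real version: SQLAlchemy engine to Postgres)
        self.conn = conn

    def _table_name(self, context) -> str:
        # Dagster would expose context.asset_key.path[-1]; accept a plain
        # string here so the sketch stays runnable on its own.
        return context if isinstance(context, str) else context.asset_key.path[-1]

    def handle_output(self, context, obj: pd.DataFrame) -> None:
        # Write the asset's DataFrame to a table named after the asset.
        obj.to_sql(self._table_name(context), self.conn, if_exists="replace", index=False)

    def load_input(self, context) -> pd.DataFrame:
        # Read the upstream asset's table back into a DataFrame.
        return pd.read_sql(f"SELECT * FROM {self._table_name(context)}", self.conn)
```

The idea would then be to register it under a resource key (e.g. "postgres_io_manager") and keep the BigQuery manager on the dbt assets for reads.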
Is this the right approach? If so, I believe Dagster will load the whole table in memory; is there any way to mitigate this?
Thanks in advance for your input!

owen
06/12/2023, 6:13 PM
Using the BigQueryPandasIOManager would result in the entire table being loaded in memory, but I believe working around this is a more general problem rather than a Dagster-specific one. In most cases, for cross-database replication, you'll need some process to read all of your data from the source and write all of it to the destination.
Depending on the scale of your data, this could be done as you describe, but once the table no longer fits in the memory of a single process, you'll need a more elaborate approach. There are purpose-built tools for this use case, such as Fivetran, Airbyte, or Stitch, or you could build your own scheme for handling it.
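As a rough sketch of that general read-then-write process, a chunked copy never holds more than one chunk in memory, so the copy is bounded by chunk size rather than table size. SQLite stands in for both warehouses here; in practice the source would be BigQuery and the destination PostgreSQL:

```python
# Chunked cross-database copy: stream the source table in fixed-size
# chunks and append each chunk to the destination. Memory use is
# bounded by the chunk size, not the table size. SQLite stands in for
# both BigQuery (source) and PostgreSQL (destination).
import sqlite3

import pandas as pd


def copy_table(src_conn, dst_conn, table: str, chunksize: int = 10_000) -> int:
    """Copy `table` from src to dst in chunks; returns the number of rows copied."""
    copied = 0
    # pd.read_sql with chunksize yields an iterator of DataFrames.
    for chunk in pd.read_sql(f"SELECT * FROM {table}", src_conn, chunksize=chunksize):
        chunk.to_sql(table, dst_conn, if_exists="append", index=False)
        copied += len(chunk)
    return copied
```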
Another option (depending on how your tables are structured) might be to use partitions to limit the amount of data handled at any given time.

Nicolas Guary
06/12/2023, 7:56 PM
BigQueryPandasIOManager + a custom IO manager for PostgreSQL (by the way, are there any plans to release an official IO manager for PostgreSQL?), and if the problem scales a lot I'll see how I can use an external tool for this.
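Your partition suggestion might look roughly like this: each run copies only one partition's slice (here, one day), so memory is bounded by the largest partition. SQLite stands in for both databases again, and the event_date column and table name are assumptions:

```python
# Per-partition copy: instead of replicating the whole table at once,
# each run moves only one partition's slice (here, one day). In Dagster
# this would correspond to an asset with a DailyPartitionsDefinition,
# where context.partition_key selects the slice. The `event_date`
# column is an assumed schema; SQLite stands in for both databases.
import sqlite3

import pandas as pd


def copy_partition(src_conn, dst_conn, table: str, partition_key: str) -> int:
    """Copy only the rows whose event_date equals the partition key."""
    slice_df = pd.read_sql(
        f"SELECT * FROM {table} WHERE event_date = ?",
        src_conn,
        params=(partition_key,),
    )
    slice_df.to_sql(table, dst_conn, if_exists="append", index=False)
    return len(slice_df)
```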
I'll look into Partitions too, this could be useful!

owen
06/12/2023, 8:22 PM

Nicolas Guary
06/13/2023, 6:58 AM