# ask-community
r
What is the proper way to do the following? (Basically, moving data from my databases to BigQuery with extra transformations.) I have several assets A, B, and C, and each gets data from different tables in a database and applies some transformations. The result of each of these assets is a dataframe that is passed to another asset X, which is responsible for sending the data to BigQuery. I don't think this is the proper way to do it: I keep replicating the asset X because the table names are different. Is it possible to have a common asset that receives data from multiple assets plus extra parameters (the table name, for example)? Also, should X be an asset at all?
c
Hi Rafael. If I'm understanding correctly, what you're doing is creating an asset X that writes the input dataframe to BigQuery, and duplicating X for each table you'd like to output to BigQuery. Generally, any defined asset should be a singular object that is persisted in storage. So in this case, A, B, and C are assets. X is just an operation that writes an arbitrary table into BigQuery, so I'd recommend representing X as an IO manager instead. You can define a BigQuery IO manager and add it to A, B, and C. When an asset outputs its dataframe, the BigQuery IO manager will write it to BigQuery, so you no longer need a downstream X operation.
❤️ 1
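A minimal sketch of the IO-manager pattern described above, assuming pandas DataFrames and the google-cloud-bigquery client; the project, dataset, and asset names are illustrative, not from this thread:

```python
import pandas as pd
from dagster import IOManager, io_manager, asset, Definitions
from google.cloud import bigquery


class BigQueryDataFrameIOManager(IOManager):
    def __init__(self, project: str, dataset: str):
        self._client = bigquery.Client(project=project)
        self._dataset = dataset

    def handle_output(self, context, obj: pd.DataFrame):
        # Derive the destination table from the asset's name, e.g. asset "a" -> my_dataset.a
        table_id = f"{self._dataset}.{context.asset_key.path[-1]}"
        self._client.load_table_from_dataframe(obj, table_id).result()

    def load_input(self, context) -> pd.DataFrame:
        # Read the upstream asset's table back when a downstream asset needs it
        table_id = f"{self._dataset}.{context.asset_key.path[-1]}"
        return self._client.query(f"SELECT * FROM `{table_id}`").to_dataframe()


@io_manager
def bigquery_io_manager(_):
    # Hypothetical project/dataset values, for illustration only
    return BigQueryDataFrameIOManager(project="my-gcp-project", dataset="my_dataset")


@asset(io_manager_key="bq_io_manager")
def a() -> pd.DataFrame:
    # ...read from the source database and transform...
    return pd.DataFrame({"col": [1, 2, 3]})


defs = Definitions(
    assets=[a],  # likewise b and c
    resources={"bq_io_manager": bigquery_io_manager},
)
```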
r
Thanks 🙂
a
Hey @claire, reviving this thread since my question is along these lines. Say I have several external tables in BigQuery (each pointing to a GCP path as its data source) and I want to create assets for them. When a file is dropped into one of those locations, the respective external table is populated automatically. At that point the table has new data, but the asset does not get materialised again. When a file is dropped, we want the assets to be materialised so that downstream processes are triggered, but this is not happening. What should we do?
@Rafael Gomes tagging you in case you could help me 🙂
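One possible approach to triggering materialisations when a file lands (a sketch only, not what was suggested downthread) is a Dagster sensor that polls the GCS location and requests runs for the corresponding asset; the bucket name, prefix, and asset key below are hypothetical:

```python
from dagster import AssetSelection, RunRequest, define_asset_job, sensor
from google.cloud import storage

# Job that materialises the asset(s) backed by the external table
external_tables_job = define_asset_job(
    "external_tables_job", selection=AssetSelection.keys("external_table_a")
)


@sensor(job=external_tables_job)
def gcs_file_sensor(context):
    # Poll the landing bucket and request a run for each blob newer than the cursor
    client = storage.Client()
    cursor = context.cursor or ""
    blobs = sorted(
        client.list_blobs("my-landing-bucket", prefix="landing/"),
        key=lambda b: b.updated,
    )
    new_blobs = [b for b in blobs if b.updated.isoformat() > cursor]
    for blob in new_blobs:
        yield RunRequest(run_key=f"{blob.name}:{blob.updated.isoformat()}")
    if new_blobs:
        context.update_cursor(new_blobs[-1].updated.isoformat())
```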
r
I think you'll need to create a custom BigQuery IO Manager. There is also an open issue to implement that: https://github.com/dagster-io/dagster/issues/10411 (you can find some draft code there as well).
a
Ah okay. Without it, you think this can't be done?