# ask-community
r
What is the proper way to do the following? (Basically, moving data from my databases to BigQuery with extra transformations.) I have several assets A, B, and C, and each gets data from different tables in a database and applies some transformations. The result of each of these assets is a dataframe that is passed to another asset X, which is responsible for sending the data to BigQuery. I don't think this is the proper way to do it: I keep replicating the asset X because the table names are different. Is it possible to have a common asset that receives data from multiple assets plus extra parameters (the table name, for example)? Also, should X be an asset at all?
c
Hi Rafael. If I'm understanding correctly, what you're doing is creating an asset X that writes the input dataframe to BigQuery, and duplicating X for each table you'd like to output to BigQuery. Generally, any defined asset should be a singular object that is persisted in storage. So in this case, A, B, and C are assets. X is just an operation that writes an arbitrary table into BigQuery, so I'd recommend representing X as an IO manager instead. You can define a BigQuery IO manager and add it to A, B, and C. When an asset outputs its dataframe, the BigQuery IO manager will write it to BigQuery, so you no longer need a downstream X operation.
❤️ 1
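A minimal sketch of the IO-manager pattern described above, assuming pandas DataFrames and the google-cloud-bigquery client; the project, dataset, and asset names are illustrative, not from this thread:

```python
import pandas as pd
from dagster import IOManager, io_manager, asset, Definitions
from google.cloud import bigquery


class BigQueryDataFrameIOManager(IOManager):
    def __init__(self, project: str, dataset: str):
        self._client = bigquery.Client(project=project)
        self._dataset = dataset

    def handle_output(self, context, obj: pd.DataFrame):
        # Derive the destination table from the asset's name, e.g. asset "a" -> my_dataset.a
        table_id = f"{self._dataset}.{context.asset_key.path[-1]}"
        self._client.load_table_from_dataframe(obj, table_id).result()

    def load_input(self, context) -> pd.DataFrame:
        # Read the upstream asset's table back when a downstream asset needs it
        table_id = f"{self._dataset}.{context.asset_key.path[-1]}"
        return self._client.query(f"SELECT * FROM `{table_id}`").to_dataframe()


@io_manager
def bigquery_io_manager(_):
    # Hypothetical project/dataset values, for illustration only
    return BigQueryDataFrameIOManager(project="my-gcp-project", dataset="my_dataset")


@asset(io_manager_key="bq_io_manager")
def a() -> pd.DataFrame:
    # ...read from the source database and transform...
    return pd.DataFrame({"col": [1, 2, 3]})


defs = Definitions(
    assets=[a],  # likewise b and c
    resources={"bq_io_manager": bigquery_io_manager},
)
```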
r
Thanks 🙂
a
Hey @claire, reviving this thread since my question is along these lines. Say I have several external tables in BigQuery (each pointing to a GCP path as its data source) and I want to create assets for them. When a file is dropped into one of those locations, the respective external table is populated automatically. At that point the table has new data, but the asset does not get materialised again. When a file is dropped, we want the assets to be materialised so that downstream processes are triggered, but this is not happening. What should we do?
@Rafael Gomes tagging you in case you could help me 🙂
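One possible approach to triggering materialisations when a file lands (a sketch only, not what was suggested downthread) is a Dagster sensor that polls the GCS location and requests runs for the corresponding asset; the bucket name, prefix, and asset key below are hypothetical:

```python
from dagster import AssetSelection, RunRequest, define_asset_job, sensor
from google.cloud import storage

# Job that materialises the asset(s) backed by the external table
external_tables_job = define_asset_job(
    "external_tables_job", selection=AssetSelection.keys("external_table_a")
)


@sensor(job=external_tables_job)
def gcs_file_sensor(context):
    # Poll the landing bucket and request a run for each blob newer than the cursor
    client = storage.Client()
    cursor = context.cursor or ""
    blobs = sorted(
        client.list_blobs("my-landing-bucket", prefix="landing/"),
        key=lambda b: b.updated,
    )
    new_blobs = [b for b in blobs if b.updated.isoformat() > cursor]
    for blob in new_blobs:
        yield RunRequest(run_key=f"{blob.name}:{blob.updated.isoformat()}")
    if new_blobs:
        context.update_cursor(new_blobs[-1].updated.isoformat())
```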
r
I think you'll need to create a custom BigQuery IO Manager. There is also an open issue to implement that: https://github.com/dagster-io/dagster/issues/10411 (you can find some draft code there as well).
a
Ah okay. Without it, you think this can't be done?