# integration-bigquery
d
having a bit of an issue with a Dagster + dbt + BigQuery implementation I'm trying to run. I have data loaded to BigQuery as a raw_table that has date and id. I'm trying to create an incremental dbt asset that is partitioned on date, but when I try the insert_overwrite strategy, it tells me I've reached my quota for insert/append operations on the table. When I try the merge strategy with partitioning on date and id as the unique_key, I get an error saying I'm creating a table with over 4,000 partitions, even when I try to materialize the asset for only a year's worth of data. When I try removing the partitioning, I get an error from Dagster saying there is no partition_expr metadata value. Is there a better way to handle this? My source data table in BigQuery has data ranging from 1990-2023, although not every day actually has data
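(For reference, the kind of incremental model being described would be configured roughly like this in dbt-bigquery; the model body, source, and column names are illustrative, not the poster's actual code:)

{{
  config(
    materialized='incremental',
    incremental_strategy='insert_overwrite',
    partition_by={'field': 'date', 'data_type': 'date', 'granularity': 'month'}
  )
}}

select id, date
from {{ source('raw', 'raw_table') }}
{% if is_incremental() %}
where date >= date('{{ var("min_date") }}')
  and date < date('{{ var("max_date") }}')
{% endif %}

(Worth noting: BigQuery caps a table at 4,000 partitions, and daily partitions over 1990-2023 is roughly 12,000 days, so 'day' granularity blows the cap no matter how small an individual run is; a coarser granularity such as 'month' stays well under it.)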
j
hey @Danny Steffy do you have some code snippets you can share? Also, just to confirm, are you using the BigQuery IO manager or the BigQuery resource? cc @rex for more dbt knowledge
d
IO Manager
👍 1
import json
from pathlib import Path

from dagster import OpExecutionContext
from dagster_dbt import DbtCliResource, dbt_assets


@dbt_assets(
    manifest=Path(DBT_PROJECT_DIR, "target", "manifest.json"),
    partitions_def=nhl_daily_partition,
    io_manager_key="warehouse_io_manager",
    dagster_dbt_translator=NHLDagsterDbtTranslator(),
)
def nhl_dbt_assets(context: OpExecutionContext, dbt: DbtCliResource):
    # Map this run's partition window onto dbt vars so the model can
    # filter its incremental load to the window.
    time_window = context.asset_partitions_time_window_for_output(
        list(context.selected_output_names)[0]
    )

    dbt_vars = {
        "min_date": time_window.start.strftime("%Y-%m-%d"),
        "max_date": time_window.end.strftime("%Y-%m-%d"),
    }
    dbt_build_args = ["build", "--vars", json.dumps(dbt_vars)]

    yield from dbt.cli(dbt_build_args, context=context).stream()
that's my dbt_asset code
j
some additional context for you:
• the BigQuery IO manager shouldn’t be getting invoked by the @dbt_assets function. Storing the data from the model is handled by dbt, and the IO manager is only used to load the data in downstream assets
• we also don’t use BigQuery native partitions in the BigQuery IO manager right now, which leads me to believe that the issue is coming from how dbt is trying to load the data into BigQuery
d
I see, so even without the BigQuery IO manager on the dbt_assets, the data should still get loaded to BigQuery?
j
yes, but if you’re trying to use the data in downstream assets, you’ll need to tell dagster how to fetch the data from BigQuery, which would require either:
• adding the BigQuery IO manager, like you have now
• manually fetching the data in the downstream asset yourself
but to start debugging, you could remove the BigQuery IO manager and see if the error still occurs. That would help begin to narrow down the cause
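(To illustrate the two options; the asset, key, and table names here are hypothetical:)

import pandas as pd
from dagster import AssetIn, asset
from google.cloud import bigquery

# Option 1: declare the dbt model as an input; the BigQuery IO manager
# loads it and passes it in as a DataFrame.
@asset(ins={"stg_games": AssetIn(key="stg_games")})
def game_summary(stg_games: pd.DataFrame) -> pd.DataFrame:
    return stg_games.groupby("date").size().reset_index(name="n_rows")

# Option 2: declare only the dependency and fetch the rows yourself.
@asset(deps=["stg_games"])
def game_summary_manual() -> pd.DataFrame:
    client = bigquery.Client()
    return client.query("select * from nhl.stg_games").to_dataframe()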
d
it looks like removing the IO manager fixed the issue
actually, it has a new issue
j
interesting. that’s not what i would have expected. @rex does the new @dbt_assets use the IO manager now?
d
it looks like it materialized the asset, but it didn't use it as incremental for some reason. It transformed the entire model
my source asset: [screenshot]
my result asset: [screenshot]
I'm flattening a table that has 2 rows for every id, so I would expect the 40k rows for the entire asset... Also, the Number of partitions looks right for the run I had, which was a year's worth of data, but the row count is way higher
and the max date is way more recent
any ideas on what I'm doing wrong?
Here's the dbt model
j
cc @owen i didn’t realize rex was on pto, do you mind taking a look at this?
o
hm, the IOManager definitely should not be invoked for a dbt asset -- the output type is Nothing, which should circumvent any IOManager behavior, so it's definitely odd that adding / removing it would have any impact. Some things I'd look into:
1. see the exact command that dagster is running on your behalf (this will show up in the logs for the run)
2. see if that looks as you expect (dates formatted as expected and all that)
I'm not familiar with bigquery sql syntax in particular, but I know many sql dialects can be quite picky about how dates are passed in (often, you'll want an explicit date parsing function called, rather than just passing in a bare string), so that might be another place to look
e
Try wrapping min_date and max_date in DATE() in the dbt model?
DATE('{{ var('max_date') }}')
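(i.e., in the model's incremental filter, something along these lines; the column name is illustrative:)

{% if is_incremental() %}
where date >= DATE('{{ var("min_date") }}')
  and date < DATE('{{ var("max_date") }}')
{% endif %}

(Dagster partition time windows are end-exclusive, which is why max_date gets a strict < rather than <=.)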