# integration-bigquery
b
Hi, how can I configure the dataset and location to read from? It is an asset produced by dbt. I read https://docs.dagster.io/_apidocs/libraries/dagster-gcp-pandas#dagster_gcp_pandas.bigquery_pandas_io_manager. Currently it is looking in `marts`, but the correct dataset is prefixed by dbt: `abc_marts`. It is also looking in the US location, which is not where the dataset is located. I tried setting the default location to `"location": "europe-west3"` in my definitions, which seems to have no effect. Is it possible to specify the dataset location and dataset name per asset or asset group as well?
```python
from dagster import asset, AssetIn
import pandas as pd

@asset(
    ins={"dim_ga4__users": AssetIn(
        key_prefix=["dbt_models", "marts"]
    )},
    group_name="marts"
)
def show_users_head(dim_ga4__users) -> pd.DataFrame:
    print(dim_ga4__users.head())
    return dim_ga4__users
```
j
hey @Benedikt Buchert if the dbt dataset is in the `dataset.table` `abc_marts.dim_ga4__users`, then you will want your AssetIn to be
```python
ins={"dim_ga4__users": AssetIn(
    key_prefix=["dbt_models", "abc_marts"]
)},
```
I believe that should correspond to the full asset key of the dbt dataset as loaded by Dagster, but if that's not the case, let me know and I can take a closer look. Checking on the location stuff now.
Yeah, that's an oversight on my part: the location didn't get fully propagated to the query. Putting up a PR now, and it should get into this week's release.
b
Thank you @jamie for creating that pull request. The key prefix that is automatically pulled from dbt is `marts`, even though the dataset is `abc_marts`. This is because dbt concatenates the custom schema to the target schema: https://docs.getdbt.com/docs/build/custom-schemas#why-does-dbt-concatenate-the-custom-schema-to-the-target-schema. I guess I can change that on the dbt side, but it would be nice to have the ability to adjust this; probably it is more of a dbt integration issue, though. Currently, if I use `key_prefix=["dbt_models", "abc_marts"]`, it does not match anymore.
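For context, the dbt behaviour referenced above can be sketched in plain Python. This mirrors dbt's default `generate_schema_name` macro as described in the linked docs; it is an illustrative stand-in, not dbt code:

```python
def generate_schema_name(target_schema, custom_schema=None):
    """Mimic dbt's default schema-naming rule (per the linked dbt docs):
    with no custom schema, use the target schema as-is; otherwise the
    custom schema is concatenated onto the target schema."""
    if custom_schema is None:
        return target_schema
    return f"{target_schema}_{custom_schema}"

# target schema "abc" from profiles.yml, custom schema "marts" from dbt_project.yml
print(generate_schema_name("abc", "marts"))  # -> abc_marts
print(generate_schema_name("abc"))           # -> abc
```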
j
I see.
I'll be the first to admit my dbt knowledge isn't very strong, so correct me if I'm wrong about any of this. Based on my understanding, you have a schema specified in your profiles.yml. What schema is that: `abc_marts` or just `marts`? It would also be helpful if you could share the code snippet of how you're loading the dbt models as assets. Feel free to DM that to me if you don't want to share publicly.
b
In profiles.yml it is defined as `abc`. Then in dbt_project.yml it is set to `marts` for all models living in the marts folder. This leads to the dataset being named `abc_marts`. But the default behaviour is to take `AssetKey([model_name])`. So I guess what I need to do is use the `node_info_to_asset_key` function to adjust the behaviour and prefix everything with `abc`, or whatever I have defined in my profiles.yml.
```python
dbt_assets = load_assets_from_dbt_project(
    project_dir=DBT_PROJECT_PATH,
    profiles_dir=DBT_PROFILES,
    key_prefix=["dbt_models"],
    source_key_prefix=["dbt_source"]
)
```
https://docs.dagster.io/_apidocs/libraries/dagster-dbt#assets-dbt-core Right? At least the last asset key prefix, `marts`, needs to be adjusted.
j
Yeah, basically the last key prefix before the asset name needs to match the dataset name. I think this is also a really good argument for having the `dataset` config on the io manager override the key prefix; that would allow you to set the dataset on the io manager itself, and then it would ignore key prefixes.
b
If I did it in the io manager, would I still be able to adjust that dynamically per model, so that it knows the correct dataset per model?
j
Basically, the two approaches you can take right now are:
1. Set `dataset` on the io manager. Then every asset using this io manager will be stored in and loaded from that specified dataset.
2. Set the dataset for each asset via the `key_prefix`, and the io manager will store and load each asset from the dataset specified via the key prefix.
Right now these are mutually exclusive (i.e. if you have key prefixes AND set `dataset` config on the io manager, we throw an error), but we likely could/should relax that a bit. The issue is determining which approach to prefer if a user specifies both.
b
```python
from typing import Any, Mapping

from dagster import AssetKey

def node_info_to_asset_key(node_info: Mapping[str, Any]) -> AssetKey:
    # Key each model by its dbt schema (e.g. "abc_marts") plus its name
    asset_array = [
        node_info["schema"],
        node_info["name"]
    ]
    return AssetKey(asset_array)
```
This fixes the issue and also simplifies the mapping for Fivetran imports for BigQuery.
```python
dbt_assets = load_assets_from_dbt_project(
    project_dir=DBT_PROJECT_PATH,
    profiles_dir=DBT_PROFILES,
    key_prefix=["dbt_models"],
    source_key_prefix=["dbt_source"],
    node_info_to_asset_key=node_info_to_asset_key
)
```
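To sanity-check the override, the key construction can be traced with plain lists as a stand-in for `AssetKey` (the sample node dict is an illustrative subset of a dbt manifest entry, and this assumes `key_prefix` is still prepended to the keys the override produces):

```python
def node_info_to_key_path(node_info):
    # Same logic as node_info_to_asset_key above, returning a plain list
    return [node_info["schema"], node_info["name"]]

# Illustrative subset of what dbt's manifest carries for one model
node_info = {"schema": "abc_marts", "name": "dim_ga4__users"}

# load_assets_from_dbt_project then prepends key_prefix, giving the full
# key path the AssetIn in the earlier snippet needs to match
full_key_path = ["dbt_models"] + node_info_to_key_path(node_info)
print(full_key_path)  # -> ['dbt_models', 'abc_marts', 'dim_ga4__users']
```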