Hey all, is there a way with dagster_dbt to have it materialize assets that match a dbt tag? | dagster #integration-dbt

Kirk Stennett

12/06/2022, 4:17 PM

Hey all, is there a way with dagster_dbt to have it materialize assets that match a dbt tag? From what I've seen you have to use the pre-defined asset name that's given (with some customization there, like a prefix). Or, if I want to do something like this, am I better off building something custom that runs the job from the CLI resource and determines which assets were materialized? The goal is to be able to do arbitrary dbt runs and have Dagster show the corresponding assets as materialized.

rex

12/06/2022, 4:21 PM

Are you using core or cloud? For core, the asset command takes in a `select` argument where you can put in your tag selector. For cloud, you can just add the `--select` argument in your dbt Cloud job. Then, when we create software defined assets from your dbt Cloud job, we will respect this selector.

Kirk Stennett

12/06/2022, 4:23 PM

This is for core, and I'm not sure I follow. Is this something done through `AssetSelection`? I'm on 1.0.6 / 0.16.6 if that changes anything

rex

12/06/2022, 4:23 PM

Ah, I am talking about the interface that generates software defined assets from dbt in `dagster-dbt`: `load_assets_from_dbt_project` and `load_assets_from_dbt_manifest`

Kirk Stennett

12/06/2022, 4:25 PM

Ah so then the flow would be: Create an op that takes an arbitrary selector, loads those assets and then creates an asset job around those? Before I was just loading all assets from the dbt project by default, and then creating asset jobs that ran some subset using `AssetSelection`. And given the latter, is it problematic if I redeclare assets given my solution?

rex

12/06/2022, 4:29 PM

Oh I see what you’re trying to do. Is this the workflow you are expecting?
1. Load all the models from a dbt project as assets into Dagster.
   a. I should only have to load this once.
   b. I have tagged my models accordingly using dbt tags.
2. Given the models from (1), I want to be able to select arbitrary subsets of them to materialize.
   a. Ideally, using the tags that I’ve created.

Kirk Stennett

12/06/2022, 4:30 PM

Yep! That's exactly what I'm trying to do

Kirk Stennett

12/06/2022, 4:31 PM

I'm not tied to that of course, it's just how I thought you had to load everything. So if the solution is to load them via jobs, then that's fine with me

rex

12/06/2022, 4:34 PM

• 1a: accomplished with either `load_assets_from_dbt_project` or `load_assets_from_dbt_manifest`
• 1b: accomplished with the `node_info_to_group_fn` argument: you can use the node information (that contains the dbt tags!) and map the tag name to a group name
• 2a: when using `define_asset_job`, use the `selection` argument that can take in an `AssetSelection`, specifically `AssetSelection.groups`. Then you can use the group name from 1b here

rex

12/06/2022, 4:34 PM

cc @owen if I missed anything

Kirk Stennett

12/06/2022, 4:39 PM

Is there a way to add multiple groups? For instance if I had a tag that looked like: `tags=['tables', 'daily']` could I have those be separate groups? From what I can tell given a node it only returns a single str
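
A Dagster asset belongs to exactly one group, so a function passed as `node_info_to_group_fn` has to collapse a model's tag list to a single name. The following is a minimal sketch of one way to do that, assuming the node info is the dbt manifest node dictionary with a `tags` list. `TAG_PRIORITY` and `group_for_node` are hypothetical names chosen for illustration, not part of `dagster-dbt`, and returning `None` is assumed to leave the asset in the default group.

```python
from typing import Any, Mapping, Optional, Sequence

# Hypothetical priority order: the first tag listed here that a model carries
# becomes its group, because an asset can belong to only one group.
TAG_PRIORITY: Sequence[str] = ("daily", "tables")


def group_for_node(node_info: Mapping[str, Any]) -> Optional[str]:
    """Collapse a dbt node's tag list to a single group name (or None)."""
    tags = set(node_info.get("tags") or [])
    for tag in TAG_PRIORITY:
        if tag in tags:
            return tag
    return None  # no prioritized tag found: no explicit group is assigned


print(group_for_node({"name": "orders", "tags": ["tables", "daily"]}))  # daily
```

Such a function could be passed as `node_info_to_group_fn=group_for_node` when loading the dbt assets. Note the trade-off: a model tagged both `tables` and `daily` lands only in the `daily` group, so a job selecting the `tables` group would not include it.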

Kirk Stennett

12/06/2022, 4:55 PM

Or do you see any problem with this? From what I can tell it has the behavior I'm looking for:

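import os
from typing import Sequence

from dagster import (
    AssetSelection,
    AssetsDefinition,
    ScheduleDefinition,
    define_asset_job,
    with_resources,
)
from dagster_dbt import dbt_cli_resource, load_assets_from_dbt_project


def create_arbitrary_dbt_run_job(dbt_models="tag:daily"):
    # Load only the dbt models matching the selector, with the dbt CLI resource bound.
    assets: Sequence[AssetsDefinition] = with_resources(
        load_assets_from_dbt_project(
            project_dir="project",
            profiles_dir=os.getenv("DBT_PROFILES_DIR"),
            select=dbt_models
        ),
        {
            "dbt": dbt_cli_resource.configured(
                {
                    "project_dir": "project",
                    "profiles_dir": os.getenv("DBT_PROFILES_DIR"),
                }
            )
        },
    )
    # assets[0] is assumed to be the single multi-asset holding every selected model.
    # AssetSelection.keys is the public equivalent of KeysAssetSelection.
    job = define_asset_job(name="arbitrary_dbt_test", selection=AssetSelection.keys(*assets[0].asset_keys))
    # Despite the function name, what is returned is a daily schedule wrapping the job.
    return ScheduleDefinition(
        job=job,
        cron_schedule="@daily"
    )

view_job = create_arbitrary_dbt_run_job()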

rex

12/06/2022, 4:59 PM

I think the problem that I see here is that the models could potentially be loaded multiple times right? Say if `tag:tables` and `tag:daily` have overlapping models, yet you want to materialize them in separate runs?
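
Whether two tag selectors overlap can be checked against the compiled manifest before loading them separately. The following is a minimal sketch, assuming the `nodes` mapping of dbt's `manifest.json` (keyed by unique ID, each node carrying `resource_type` and `tags`). The toy data and the helper name `models_with_tag` are hypothetical.

```python
from typing import Any, Mapping, Set


def models_with_tag(nodes: Mapping[str, Mapping[str, Any]], tag: str) -> Set[str]:
    """Return the unique IDs of dbt models that carry the given tag."""
    return {
        unique_id
        for unique_id, node in nodes.items()
        if node.get("resource_type") == "model" and tag in (node.get("tags") or [])
    }


# Toy stand-in for manifest["nodes"]; a real manifest has many more fields.
nodes = {
    "model.project.orders": {"resource_type": "model", "tags": ["tables", "daily"]},
    "model.project.customers": {"resource_type": "model", "tags": ["tables"]},
    "model.project.events": {"resource_type": "model", "tags": ["daily"]},
}

# Models selected by both tags would be declared twice if each tag is loaded separately.
overlap = models_with_tag(nodes, "tables") & models_with_tag(nodes, "daily")
print(sorted(overlap))  # ['model.project.orders']
```

An empty intersection means the two selections are disjoint; a non-empty one lists the models that would end up in both loads.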

Kirk Stennett

12/06/2022, 5:07 PM

Ah yeah that's true. For the time being I'm less worried about the clashing. I'm trying to provide an interface and abstraction so that analysts can run jobs via dagster somewhat similarly to dbt. From a dagster standpoint do you see a problem with the sample? Or is it mostly just problematic from the DBT side?

Kirk Stennett

12/06/2022, 5:07 PM

I guess the more I think about it the more it makes sense to have pre-defined paths for builds

Kirk Stennett

12/06/2022, 5:10 PM

Either way, thanks for the help getting this going. I appreciate it!

owen

12/06/2022, 9:22 PM

hi @Kirk Stennett! that's a really clever setup actually, and I think it would work perfectly fine. In essence, I view what you're doing as creating a custom type of `AssetSelection`, which is resolved by shelling out to `dbt`. In fact, I might just model it that way explicitly (as in a `get_asset_selection_for_dbt_selection()` function, which takes in a dbt string and returns an `AssetSelection.keys()`). main issue here is actually performance, as `load_assets_from_dbt_project` requires compiling the project (which can be quite slow, and would need to be done in every subprocess that's executing dagster code, which can add up if you're calling this multiple times). You could use `load_assets_from_dbt_manifest` instead, which should be way faster
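
The helper described above could be approximated without shelling out to `dbt` by reading an already compiled `manifest.json`. This is a swapped-in technique, not what the thread proposes verbatim. The sketch below is a rough illustration under several assumptions: it handles only `tag:<name>` selectors (real dbt selection syntax is far richer), it returns model names rather than an `AssetSelection`, and `model_names_for_dbt_selection` is a hypothetical name.

```python
import json
import tempfile
from pathlib import Path
from typing import List


def model_names_for_dbt_selection(manifest_path: str, dbt_select: str) -> List[str]:
    """Resolve a `tag:<name>` selector to model names using a compiled manifest."""
    method, _, value = dbt_select.partition(":")
    if method != "tag" or not value:
        # Anything beyond plain tag selectors is out of scope for this sketch.
        raise ValueError(f"unsupported selector: {dbt_select!r}")
    manifest = json.loads(Path(manifest_path).read_text())
    return sorted(
        node["name"]
        for node in manifest["nodes"].values()
        if node.get("resource_type") == "model" and value in (node.get("tags") or [])
    )


# Demo with a toy manifest written to a temporary directory.
toy_manifest = {
    "nodes": {
        "model.project.orders": {"name": "orders", "resource_type": "model", "tags": ["tables", "daily"]},
        "model.project.events": {"name": "events", "resource_type": "model", "tags": ["daily"]},
        "model.project.customers": {"name": "customers", "resource_type": "model", "tags": ["tables"]},
        "seed.project.calendar": {"name": "calendar", "resource_type": "seed", "tags": ["daily"]},
    }
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "manifest.json"
    path.write_text(json.dumps(toy_manifest))
    demo_names = model_names_for_dbt_selection(str(path), "tag:daily")

print(demo_names)  # ['events', 'orders']
```

The returned names could then be wrapped in something like `AssetSelection.keys(*names)`. That step assumes each asset key equals the model name, which depends on the `dagster-dbt` version and on any `node_info_to_asset_key` override.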
