Hi perhaps a naive question but is there any way to allow fo dagster #integration-dbt

Brendan Jackson

05/31/2023, 12:10 PM

Hi - perhaps a naive question, but is there any way to allow for a software defined asset to depend on a dbt asset that is materialised as a view only? The motivation here is to remove much of the complexity of the IO manager into simply select * from {table}, as in one of the documentation examples. I imagine this is not done, because it would require the loading of the view to happen within a dbt run (where the view still exists)?

Guy McCombe

05/31/2023, 12:12 PM

Yeah I don’t see why not

Guy McCombe

05/31/2023, 12:13 PM

https://docs.dagster.io/integrations/dbt/using-dbt-with-dagster/part-four#using-dbt-with-dagster-part-four-add-a-downstream-asset Here’s the docs on how to set up a dependency to a dbt asset if you need them.

Brendan Jackson

05/31/2023, 12:14 PM

Adding the dependency is fine. But would an IO manager work to load data from the dbt view?

Brendan Jackson

05/31/2023, 12:14 PM

https://docs.dagster.io/integrations/dbt/reference#defining-an-io-manager

Brendan Jackson

05/31/2023, 12:15 PM

For instance, this example -

def load_input(self, context) -> pd.DataFrame:
    """Load the contents of a table as a pandas DataFrame."""
    table_name = context.asset_key.path[-1]
    return pd.read_sql(f"SELECT * FROM {table_name}", con=self.connection_str)

But {table_name} will not exist.

Guy McCombe

05/31/2023, 12:16 PM

Yeah that should work no problem. As long as you can run SQL against the view, you can define an io_manager like that

Guy McCombe

05/31/2023, 12:16 PM

Is the table_name not just the name of your view?

Brendan Jackson

05/31/2023, 12:16 PM

Well that's from the documentation example, so I assume so.

Brendan Jackson

05/31/2023, 12:17 PM

Ah, perhaps you are right - the view is persisted.
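
The point about the view being persisted can be checked with a small sketch. It uses the standard-library sqlite3 module as a stand-in for the warehouse, and the names a, b and load_rows are hypothetical, chosen only for illustration. The CREATE VIEW statement approximates what a dbt model materialised as a view leaves behind after the run.

```python
import sqlite3

# Hypothetical stand-in for the warehouse; sqlite3 is used only so the sketch runs anywhere.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO a VALUES (?, ?)",
    [(1, "2023-05-30"), (2, "2023-05-31")],
)

# Approximation of a dbt view materialisation: the view is created once
# and stays in the database after the statement that created it has finished.
conn.execute("CREATE VIEW b AS SELECT id, date FROM a")
conn.commit()


def load_rows(connection, table_name):
    """Load every row of a table or view; the query is the same for both."""
    # table_name is interpolated as in the documentation example, so it must
    # come from a trusted source such as the asset key, never from user input.
    return connection.execute(f"SELECT * FROM {table_name}").fetchall()


rows = load_rows(conn, "b")
```

Because the loader only issues a SELECT, it does not need to know whether b is a table or a view.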

Brendan Jackson

05/31/2023, 12:17 PM

How does this function if the asset is partitioned?

Guy McCombe

05/31/2023, 12:18 PM

You can define an io manager that handles partitions if that’s what you’re after

Guy McCombe

05/31/2023, 12:18 PM

If you give me a bit of context on the partitions and how you want to partition your view I can see if I can give you a kick start

Brendan Jackson

05/31/2023, 12:20 PM

Thanks! I think I see what you mean. I have a dbt asset A partitioned into days, produced via incremental loading. I would like to create a downstream dbt asset B that does some joins/transformations, also partitioned in this way. Then I have a software defined asset C that depends on B. I would like B to be a view, for the sake of storage space mainly.

👍 1

Brendan Jackson

05/31/2023, 12:20 PM

Currently I am creating B as a dbt view, using the partition fn in dagster:

Brendan Jackson

05/31/2023, 12:21 PM

where date > '{{ var("start_date") }}' and date <= '{{ var("end_date") }}'

Brendan Jackson

05/31/2023, 12:21 PM

Where those parameters are set by the partition in dagster, via partition_key_to_vars_fn.
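
For reference, a function of the kind passed as partition_key_to_vars_fn might look like the sketch below. It is not taken from the thread. It assumes daily partition keys in ISO format (for example "2023-05-31") and assumes that each partition covers exactly one day under the where clause above, which excludes start_date and includes end_date. The function name is hypothetical.

```python
from datetime import date, timedelta


def partition_key_to_vars(partition_key):
    """Map a daily partition key such as "2023-05-31" to dbt vars."""
    day = date.fromisoformat(partition_key)
    # Assumption: the model filters with
    #   date > start_date and date <= end_date
    # so start_date is the day before the partition and end_date is the
    # partition day itself.
    return {
        "start_date": (day - timedelta(days=1)).isoformat(),
        "end_date": day.isoformat(),
    }


dbt_vars = partition_key_to_vars("2023-05-31")
```

If the date column holds timestamps rather than dates, the boundaries would need adjusting.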

Brendan Jackson

05/31/2023, 12:23 PM

Were B just a regular table, I would adopt incremental materialisation. I can't do that here. I could specify the view without the where clause, and use the IO manager to do that part?

Guy McCombe

05/31/2023, 12:23 PM

Yeah I’d say do that ^

👍 1

Brendan Jackson

05/31/2023, 12:23 PM

Thanks!

Brendan Jackson

05/31/2023, 12:24 PM

I think I was assuming I would need to make the dbt view specific to the partition, but that isn't needed at all as it's just a view.

🎯 1

Guy McCombe

05/31/2023, 12:27 PM

You could probably do something similar to this for your IO manager:

def load_input(self, context) -> pd.DataFrame:
    """Load the contents of a table or view as a pandas DataFrame."""
    table_name = context.asset_key.path[-1]
    if context.has_partition_key:
        # Quote the partition key so SQL reads it as a literal, not an expression.
        return pd.read_sql(
            f"SELECT * FROM {table_name} WHERE date = '{context.asset_partition_key}'",
            con=self.connection_str,
        )
    return pd.read_sql(f"SELECT * FROM {table_name}", con=self.connection_str)

👍 1
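
The approach agreed on in this thread (an unfiltered view, with the partition filter applied at load time) can be sketched without Dagster. This is an illustration only, not the poster's code: sqlite3 stands in for the warehouse, and a, b and load_rows are hypothetical names. It passes the partition key as a bound parameter instead of formatting it into the SQL string. The ? placeholder is specific to sqlite3, and other drivers use other placeholder styles.

```python
import sqlite3


def load_rows(connection, table_name, partition_key=None):
    """Load a whole table or view, or only the rows for one partition."""
    if partition_key is None:
        return connection.execute(f"SELECT * FROM {table_name}").fetchall()
    # The partition key is bound as a parameter, so the driver handles quoting.
    return connection.execute(
        f"SELECT * FROM {table_name} WHERE date = ?",
        (partition_key,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO a VALUES (?, ?)",
    [(1, "2023-05-30"), (2, "2023-05-31")],
)
# The view carries no WHERE clause; filtering happens when the data is loaded.
conn.execute("CREATE VIEW b AS SELECT id, date FROM a")

one_day = load_rows(conn, "b", "2023-05-31")
everything = load_rows(conn, "b")
```

Keeping the view unfiltered means a single view definition serves every partition, and only the loader needs to know which partition is being requested.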
