# integration-dbt
Alex Martin
Hi y'all, new to dbt and Dagster here. Curious how you prefer to handle storing and executing `CREATE TABLE` definitions for your raw (Dagster-defined) layer in a Dagster + dbt pipeline?

Context:
• I'm extracting data from an API and writing it primarily to a 'raw' schema DuckDB table (and secondarily writing it to CSV).
• Based on the day of the extract, the response schema can vary (sometimes 8 fields, sometimes 15, sometimes 0), so I can't rely on a `df.to_sql()` statement for the initial table creation.
• For the sake of resiliency, I've included both `CREATE SCHEMA IF NOT EXISTS` and `CREATE TABLE IF NOT EXISTS` executions in the asset that writes that initial raw data (see the sketch after this list).

My question is: what is your preferred implementation for those raw-layer `CREATE TABLE` statements when working with dbt and Dagster?
• Are you hard-coding them in the software-defined assets?
• Do you have a 'raw' subdirectory nested in your dbt models for those statements?
• Do you have a SQL-specific directory in your Dagster project that you reference in your Dagster assets?
• Something else?

Thanks y'all.
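Roughly what that raw asset looks like today (a simplified sketch with hypothetical names like `fetch_from_api`, `raw_api_extract`, and the column list, not my exact code):

```python
import duckdb
import pandas as pd
from dagster import asset

RAW_DB = "raw.duckdb"          # hypothetical database path
RAW_TABLE = "raw.api_extract"  # hypothetical schema.table name


def fetch_from_api() -> pd.DataFrame:
    # stand-in for the real API call; the real response schema varies by day
    return pd.DataFrame({"extracted_at": [pd.Timestamp.now()], "field_1": ["x"]})


@asset
def raw_api_extract() -> None:
    df = fetch_from_api()

    with duckdb.connect(RAW_DB) as conn:
        # resiliency: make sure the schema and table exist before writing
        conn.execute("CREATE SCHEMA IF NOT EXISTS raw")
        conn.execute(
            f"""
            CREATE TABLE IF NOT EXISTS {RAW_TABLE} (
                extracted_at TIMESTAMP,
                field_1 VARCHAR,
                field_2 VARCHAR
                -- ...the full superset of columns the API can return...
            )
            """
        )
        # insert by column name so a narrower daily extract still loads
        if len(df) > 0:
            cols = ", ".join(df.columns)
            conn.execute(f"INSERT INTO {RAW_TABLE} ({cols}) SELECT {cols} FROM df")

    # secondary copy to CSV
    df.to_csv("raw_api_extract.csv", index=False)
```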
rex
> Are you hard-coding them in the software-defined assets?
Putting the logic that creates the table before the logic that populates it, inside the body of the software-defined asset, sounds fine to me. We do that in the tutorial, FWIW: https://docs.dagster.io/integrations/dbt/using-dbt-with-dagster/upstream-assets. Separating that logic out from the population of the table seems too granular.
Alex Martin
Thanks @rex. Yep, I started out doing it just like that, but once I realized the extract schema varies by day, I recognized that the create table statement would be a bit longer than a `select *`. (And some sources have >100 columns.) For much wider extracts like that, would you still recommend keeping the DDL in the body of the SDA, or do you think something else would be better? Much appreciated.
Brendan Jackson
@Alex Martin I have a very similar use case: I extract data from various APIs (or even other dbt assets), stuff the data into a Polars DataFrame, and then the IOManager handles putting it into the database. There are a number of advantages to hard-coding the table creation (index creation, foreign key references, null checks, correctly specifying types...), including and especially the one you noted.

I did not include table creation in the Python asset definitions. I didn't want to clutter that code, and assets shouldn't really care about their IO details. I also didn't want to have many IO managers, just a single one.

Here is what I have done. I don't know if it is 'right', but it works for me and is pretty simple. I have all the table creation statements enumerated in my `resources.py`. In there is also an `SqlResource` that holds things like SQL credentials and is used by the IO manager. On run startup, that resource runs all the `CREATE TABLE IF NOT EXISTS` statements (just looping over all the enumeration values), implemented by overriding `setup_for_execution` on the resource. A sketch of the idea is below.
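Something along these lines (a simplified sketch of the pattern as I understand it, with hypothetical table DDL and a made-up `database_path` field; the real resource also carries credentials and is shared with the IO manager):

```python
from enum import Enum

import duckdb
from dagster import ConfigurableResource, InitResourceContext


class RawTable(Enum):
    # all raw-layer DDL enumerated in one place (hypothetical examples)
    ORDERS = """
        CREATE TABLE IF NOT EXISTS raw.orders (
            order_id BIGINT PRIMARY KEY,
            customer_id BIGINT NOT NULL,
            ordered_at TIMESTAMP
        )
    """
    CUSTOMERS = """
        CREATE TABLE IF NOT EXISTS raw.customers (
            customer_id BIGINT PRIMARY KEY,
            name VARCHAR
        )
    """


class SqlResource(ConfigurableResource):
    # in the real resource this would hold connection details / credentials
    database_path: str = "warehouse.duckdb"

    def setup_for_execution(self, context: InitResourceContext) -> None:
        # run every CREATE TABLE IF NOT EXISTS statement at run startup
        with duckdb.connect(self.database_path) as conn:
            conn.execute("CREATE SCHEMA IF NOT EXISTS raw")
            for table in RawTable:
                conn.execute(table.value)
```

The resource (and the IO manager that uses it) then gets wired into `Definitions` as usual, so the DDL runs once per run before any asset tries to write.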
Alex Martin
Thanks @Brendan Jackson!