# integration-dbt
Alex Martin
Hi y'all, new to dbt and Dagster here. Curious how you prefer to handle storing and executing `CREATE TABLE` definitions for your raw (Dagster-defined) layer in a Dagster + dbt pipeline?

Context:
• I'm extracting data from an API and writing it primarily to a 'raw' schema DuckDB table (and secondarily writing it to CSV).
• Based on the day of the extract, the response schema can vary (sometimes 8 fields, sometimes 15, sometimes 0), so I can't rely on a `df.to_sql()` statement for the initial table creation.
• For the sake of resiliency, I've included both `CREATE SCHEMA IF NOT EXISTS` and `CREATE TABLE IF NOT EXISTS` executions in the asset that writes that initial raw data (see the sketch after this list).

My question is: what is your preferred implementation for those raw-layer `CREATE TABLE` statements when working with dbt and Dagster?
• Are you hard-coding them in the software-defined assets?
• Do you have a 'raw' subdirectory nested in your dbt models for those statements?
• Do you have a SQL-specific directory in your Dagster project that you reference in your Dagster assets?
• Something else?

Thanks y'all.
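Roughly what that raw asset looks like today (a simplified sketch with hypothetical names like `fetch_from_api`, `raw_api_extract`, and the column list, not my exact code):

```python
import duckdb
import pandas as pd
from dagster import asset

RAW_DB = "raw.duckdb"          # hypothetical database path
RAW_TABLE = "raw.api_extract"  # hypothetical schema.table name


def fetch_from_api() -> pd.DataFrame:
    # stand-in for the real API call; the real response schema varies by day
    return pd.DataFrame({"extracted_at": [pd.Timestamp.now()], "field_1": ["x"]})


@asset
def raw_api_extract() -> None:
    df = fetch_from_api()

    with duckdb.connect(RAW_DB) as conn:
        # resiliency: make sure the schema and table exist before writing
        conn.execute("CREATE SCHEMA IF NOT EXISTS raw")
        conn.execute(
            f"""
            CREATE TABLE IF NOT EXISTS {RAW_TABLE} (
                extracted_at TIMESTAMP,
                field_1 VARCHAR,
                field_2 VARCHAR
                -- ...the full superset of columns the API can return...
            )
            """
        )
        # insert by column name so a narrower daily extract still loads
        if len(df) > 0:
            cols = ", ".join(df.columns)
            conn.execute(f"INSERT INTO {RAW_TABLE} ({cols}) SELECT {cols} FROM df")

    # secondary copy to CSV
    df.to_csv("raw_api_extract.csv", index=False)
```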
rex
> Are you hard-coding them in the software-defined assets?
Putting the logic that creates the table before the logic that populates it, inside the body of the software-defined asset, sounds fine to me. We do that in the tutorial, FWIW: https://docs.dagster.io/integrations/dbt/using-dbt-with-dagster/upstream-assets. Separating that logic out from the population of the table seems too granular.
Alex Martin
Thanks @rex. Yep, I started out doing it just like that, but once I realized the extract schema varies by day, I recognized that the create table statement would be a bit longer than a `select *`. (And some sources have >100 columns.) For much wider extracts like that, would you still recommend keeping the DDL in the body of the SDA, or do you think something else would be better? Much appreciated.
Brendan Jackson
@Alex Martin I have a very similar use case: I extract data from various APIs (or even other dbt assets), stuff the data into a Polars DataFrame, and then the IOManager handles putting it into the database. There are a number of advantages to hard-coding the table creation (index creation, foreign key references, null checks, correctly specifying types...), including and especially the one you noted.

I did not include table creation in the Python asset definitions. I didn't want to clutter that code, and assets shouldn't really care about their IO details. I also didn't want to have many IO managers, just a single one.

Here is what I have done. I don't know if it is 'right', but it works for me and is pretty simple. I have all the table creation statements enumerated in my `resources.py`. In there is also an `SqlResource` that holds things like SQL credentials and is used by the IO manager. On run startup, that resource runs all the `CREATE TABLE IF NOT EXISTS` statements (just looping over all the enumeration values), implemented by overriding `setup_for_execution` on the resource. A sketch of the idea is below.
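Something along these lines (a simplified sketch of the pattern as I understand it, with hypothetical table DDL and a made-up `database_path` field; the real resource also carries credentials and is shared with the IO manager):

```python
from enum import Enum

import duckdb
from dagster import ConfigurableResource, InitResourceContext


class RawTable(Enum):
    # all raw-layer DDL enumerated in one place (hypothetical examples)
    ORDERS = """
        CREATE TABLE IF NOT EXISTS raw.orders (
            order_id BIGINT PRIMARY KEY,
            customer_id BIGINT NOT NULL,
            ordered_at TIMESTAMP
        )
    """
    CUSTOMERS = """
        CREATE TABLE IF NOT EXISTS raw.customers (
            customer_id BIGINT PRIMARY KEY,
            name VARCHAR
        )
    """


class SqlResource(ConfigurableResource):
    # in the real resource this would hold connection details / credentials
    database_path: str = "warehouse.duckdb"

    def setup_for_execution(self, context: InitResourceContext) -> None:
        # run every CREATE TABLE IF NOT EXISTS statement at run startup
        with duckdb.connect(self.database_path) as conn:
            conn.execute("CREATE SCHEMA IF NOT EXISTS raw")
            for table in RawTable:
                conn.execute(table.value)
```

The resource (and the IO manager that uses it) then gets wired into `Definitions` as usual, so the DDL runs once per run before any asset tries to write.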
Alex Martin
Thanks @Brendan Jackson!