# ask-community
Jonny Wray
Hi, new to Dagster and trying to form a good mental model of the concepts, and I have a couple of questions. They're not really specific issues, more gaps in my understanding of the concepts, so any input would be fantastic. The project I'm working on is remarkably similar to the modern data stack example, which is great. I have an external API that adds new data on a daily basis. I'm using Airbyte to sync that data, and I'm planning to use dbt to then transform it. There will be downstream calculations on these transforms (e.g. aggregations over time) in Python, or maybe a mixture of Python and dbt (a rough sketch of what I mean is below). So, my current questions regarding concepts:

1. What IO manager is needed to bridge the Airbyte and dbt steps? All work is done outside Dagster so I'd assume one isn't needed, but am I correct here? And if so, how is "no IO manager" specified?
2. The daily updates of the API data mean I have a natural partitioning of my full data set: daily. All three technologies have their own concepts for dealing with this updating data: I'm using incremental syncs in Airbyte, dbt has incremental materializations, and Dagster has partitions. What is unclear to me is how these three concepts relate to each other and interact. More concretely, how would I approach building a Dagster solution that deals with daily updates using these three technologies?

Thanks a lot
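For concreteness, here's a rough sketch of the kind of downstream Python step I have in mind. Everything here is hypothetical: the `daily_metrics` model name, the query, and the connection string are all placeholders for whatever my dbt step actually produces.

```python
import pandas as pd
from dagster import DailyPartitionsDefinition, asset

daily_partitions = DailyPartitionsDefinition(start_date="2023-01-01")

# "daily_metrics" is a placeholder for one of the dbt models; the Python
# asset depends on it without receiving a value through an IO manager.
@asset(non_argument_deps={"daily_metrics"}, partitions_def=daily_partitions)
def rolling_aggregates(context) -> pd.DataFrame:
    day = context.partition_key  # e.g. "2023-01-01"
    # Read the slice that dbt materialized for this day straight from the
    # warehouse (connection string is a placeholder).
    df = pd.read_sql(
        f"SELECT * FROM daily_metrics WHERE run_date = '{day}'",
        con="postgresql://user:password@host/db",
    )
    return df.groupby("metric_name", as_index=False)["value"].sum()
```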
owen
Hi @Jonny Wray! For question 1: you're correct that no IO manager is necessary in this case. Technically, all assets have an io_manager, but Airbyte assets and dbt assets have outputs with a dagster-type of `Nothing`, which indicates to Dagster that the io_manager does not need to be invoked when handling the outputs/inputs. See https://dagster.slack.com/archives/C01U954MEER/p1676310924565129?thread_ts=1676257567.623189&cid=C01U954MEER for a bit more discussion there.

For question 2: I'm most familiar with dbt/Dagster here. The `load_assets_from_dbt_project`/`load_assets_from_dbt_manifest` functions support a `partition_key_to_vars` function, which allows you to define a translation from a Dagster partition key (e.g. `2023-01-01`) to a dictionary of dbt vars (e.g. `{"run_date": "2023-01-01"}`). So you can define your dbt assets to be daily partitioned, as well as a function to turn those partitions into variables in the dbt runtime. The dagster-airbyte integration does not currently support partitions as far as I'm aware, but it seems like something similar might make sense here, if the incremental updates to the Airbyte stream are regular (i.e. in 1-day chunks). If they're not, I think it'd be fine to just model the Airbyte sync as an unpartitioned asset.
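To make that concrete, here's a minimal sketch. It assumes a dagster-dbt version that exposes `partitions_def` and `partition_key_to_vars_fn` arguments on `load_assets_from_dbt_project`; the paths, start date, and var name are placeholders you'd adapt to your project.

```python
from dagster import DailyPartitionsDefinition
from dagster_dbt import load_assets_from_dbt_project

daily_partitions = DailyPartitionsDefinition(start_date="2023-01-01")

def partition_key_to_dbt_vars(partition_key: str) -> dict:
    # Turn a Dagster partition key like "2023-01-01" into dbt vars,
    # so models can filter with {{ var("run_date") }}.
    return {"run_date": partition_key}

dbt_assets = load_assets_from_dbt_project(
    project_dir="path/to/dbt_project",
    profiles_dir="path/to/dbt_profiles",
    partitions_def=daily_partitions,
    partition_key_to_vars_fn=partition_key_to_dbt_vars,
)
```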
Jonny Wray
Thanks, that’s very useful. Q1 makes total sense, and I’ll dig into the docs more on Q2, but it looks like you're pointing me down the right path. Thanks again.
🌈 1
b
Hi @owen I managed to get the `partition_key_to_vars` function working, but I don't understand why only the partition key is exposed. In my view, a massive part of the value of partitions comes from the fact that the user is able to backfill or rerun past partitions. Without the partition end, I basically have to rewrite my own partition logic inside `partition_key_to_vars` (see the sketch below).

More generally, I have a hard time interfacing Dagster partitions with "dbt partitions". I use dbt vars to filter our dbt models, but support for that is really minimal in Dagster. Am I missing something? Is there a more standard way of doing this?
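For example, with daily partitions I end up re-deriving the window from the key alone, something like this (the var names are just what my models happen to expect):

```python
from datetime import datetime, timedelta

def partition_key_to_dbt_vars(partition_key: str) -> dict:
    # Only the key is available, so the partition window has to be
    # reconstructed by hand: for a daily partition the window end is
    # the key plus one day.
    start = datetime.strptime(partition_key, "%Y-%m-%d")
    end = start + timedelta(days=1)
    return {
        "partition_start": start.strftime("%Y-%m-%d"),
        "partition_end": end.strftime("%Y-%m-%d"),
    }
```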