# integration-dbt
n
Hello! I have a conceptual question. The docs say that Dagster treats the whole dbt run as a single op (see point 3 of step 1 of part 2 of the dagster-dbt tutorial). If that's the case, my understanding is that each dbt model can't necessarily be fired off as soon as its dependencies are met; instead, the single dbt op can only run once all dbt models have their dependencies met. Likewise, assets downstream of any dbt model can't run as soon as their dependencies are met; they have to wait until the whole dbt run has completed. Is this correct? Similarly, if the dbt run is treated as a single op, it would seem impossible to have a Dagster asset that is downstream of one dbt model and upstream of another. Is that right? Thanks for the help.
o
Hey Noam, that's not the behaviour I'm seeing; dbt models can be run as independent assets. The attached screenshot is a subset of the assets I have on a project. Some are dbt models, and others are assets which have dbt models as prerequisites:
• My assets (dbt and others) have different freshness policies, so when they are stale, Dagster will automatically run the proper sequence of assets to make them "fresh".
• That means that, referring to my example, if `events_fct` becomes stale, it will run only `int_events`, `events_fct`, and whatever other upstream assets need to be refreshed.
• Whereas if my `hex_main_dashboard_refresh` becomes stale, it will check whether `movements_dim`, `events_fct`, and `observations_fct` are fresh. If not, Dagster will bundle them into the next job run.
• That also applies to manually running assets. I could manually select `int_events`, `events_fct`, and `semantic_definitions` to be run as a single job.
I think I see where in the tutorial you might have gotten the impression that the whole dbt project needs to be run as a block. I'm not sure I understand what's meant by "These assets share the same underlying op", but regardless, dbt models are loaded just like any other assets, with their list of dependencies. Was that even the question you were asking? 🙂
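As a rough sketch of how those freshness policies get attached (a sketch only; the asset names are borrowed from the screenshot and the lag value is made up):

```python
from dagster import FreshnessPolicy, asset

# A non-dbt asset sitting downstream of three dbt models, declared as
# non-argument dependencies. When it drifts past its freshness policy,
# Dagster plans a run covering only the stale part of the upstream chain.
@asset(
    non_argument_deps={"movements_dim", "events_fct", "observations_fct"},
    freshness_policy=FreshnessPolicy(maximum_lag_minutes=30),
)
def hex_main_dashboard_refresh():
    ...
```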
n
Hi Olivier, thanks for the response! I appreciate your insight 🙂 I understand that Dagster loads each dbt model as a separate asset, which is what's shown in the screenshot. But my understanding of Dagster is that the dependency graph for a job is composed of ops, and ops are not necessarily one-to-one with assets: a single asset can correspond to multiple ops, and vice versa. So the screenshot shows the asset dependency graph, but unless I'm mistaken, that's not necessarily the same as the dependency graph used to run the job. With that in mind, can I ask whether you've seen Dagster rematerialize dbt assets separately (apart from your ability to manually rematerialize them separately)? Have you ever seen a dbt asset run before the dependencies were met for all dbt assets to be run? Thanks again
After reading your comment, I went to take a look at the dagster-dbt code. The docstring also says:

> Loads a set of dbt models from a dbt project into Dagster assets. Creates one Dagster asset for each dbt model. All assets will be re-materialized using a single `dbt run` or `dbt build` command.

And... this piece of code for loading dbt nodes into Dagster assets also makes it seem they're all associated with the same op in the Dagster job graph.
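For context, a minimal sketch of how that loader is invoked (the paths are placeholders):

```python
from dagster_dbt import load_assets_from_dbt_project

# Each dbt model becomes its own asset in the graph, but all of them are
# backed by one op that shells out to `dbt run` / `dbt build`.
dbt_assets = load_assets_from_dbt_project(
    project_dir="path/to/dbt_project",  # placeholder
    profiles_dir="path/to/profiles",    # placeholder
)
```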
o
Hey Noam, I think (hopefully 🙂) I understand your question. Attached is an example of a job that ran on my project. As you mentioned, the dbt assets ran within a single "op", but within that op, only a subset of the dbt models were materialized, not all of them. As for your other question, "Have you ever seen a dbt asset run before the dependencies were met for all dbt assets to be run?", I'm not sure what scenario you have in mind. Could you clarify what behaviour you're looking for?
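To illustrate that subsetting, a sketch of a job targeting only part of the dbt graph (asset names taken from the earlier example):

```python
from dagster import AssetSelection, define_asset_job

# At run time, Dagster hands only the selected models to dbt, so the
# single dbt op materializes just this subset, not the whole project.
refresh_events_job = define_asset_job(
    name="refresh_events",
    selection=AssetSelection.keys("int_events", "events_fct"),
)
```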
n
Hi Olivier, thanks for following up! This is helpful to see. Good to know that Dagster can selectively materialize dbt tables. Since I originally wrote my question, I noticed that this behavior is also mentioned in this blog post by @owen (see attached screenshot). For the other question, I'm thinking about the following. Suppose we have four assets, with dependencies represented as:

```
slow_asset -> dbt_asset1
fast_asset -> dbt_asset2
```

My worry is that because all the dbt assets share a single op, `dbt_asset2` can't be kicked off until `slow_asset` is done, even though all it really needs is for `fast_asset` to be done. Do you see what I mean?
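To make that structure concrete: the dbt loader behaves roughly like a multi-output op, which a sketch like this imitates (plain Python assets standing in for dbt models; all names hypothetical):

```python
from dagster import AssetKey, AssetOut, Output, asset, multi_asset

@asset
def slow_asset():
    return 1

@asset
def fast_asset():
    return 2

# One op producing both "dbt" assets, mirroring how dagster-dbt backs
# every model with a single op. Even though dbt_asset2 only depends on
# fast_asset, the op body cannot start until BOTH inputs exist --
# exactly the blocking behaviour being asked about.
@multi_asset(
    outs={"dbt_asset1": AssetOut(), "dbt_asset2": AssetOut()},
    internal_asset_deps={
        "dbt_asset1": {AssetKey("slow_asset")},
        "dbt_asset2": {AssetKey("fast_asset")},
    },
)
def dbt_like_op(slow_asset, fast_asset):
    yield Output(slow_asset, output_name="dbt_asset1")
    yield Output(fast_asset, output_name="dbt_asset2")
```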
o
Oh, I get it now. Yeah, I don't know the answer to that 🤔 Hopefully, someone more knowledgeable can answer this; I'm curious as well.
o
hi @Noam Finkelstein! There are a few things at play here. The first is that, as you note, if an op has multiple outputs, a downstream op consuming one of those outputs will not be kicked off until all upstream outputs have been completed. This is likely not a hard requirement of the system, and could be updated in the future, but it is not currently on the roadmap.

However, the specific thing you're talking about is actually kinda the reverse of this: you're interested in a setup where the computation of the single dbt op starts before all of its upstreams are available. I think at that point, trying to conceptualize this dbt asset as a single operation that is part of a broader job breaks down. There are times when you want to slice through and execute just some subset of the dbt assets (i.e. everything downstream of `fast_asset`) and other times when you want to execute the whole thing.

In my mind, this maps fairly well onto something like freshness-based scheduling. This lets you express the scheduling requirements for just the assets you care about, and then allows Dagster to calculate which subsets to run at which times in order to meet those requirements. This is one of the benefits of the asset model: you don't need to be locked into a static pipeline definition, and runs can be a bit more ad hoc.
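A sketch of what that freshness-based setup looked like in the Dagster of this era (the asset body and sensor name are placeholders, and this API has since evolved):

```python
from dagster import (
    AssetSelection,
    FreshnessPolicy,
    asset,
    build_asset_reconciliation_sensor,
)

# Express the requirement only on the asset you care about; Dagster works
# backwards to decide which upstream subsets (dbt models included) to run.
@asset(
    non_argument_deps={"dbt_asset2"},
    freshness_policy=FreshnessPolicy(maximum_lag_minutes=60),
)
def downstream_report():
    ...

# The reconciliation sensor watches staleness and launches the minimal
# runs needed to keep assets within their freshness policies.
freshness_sensor = build_asset_reconciliation_sensor(
    asset_selection=AssetSelection.all(),
    name="freshness_sensor",
)
```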
n
Thanks for the response @owen! Very helpful. I see what you're saying about avoiding some of the dependency issues with different scheduling tactics. Because we don't have an existing dbt project, at this point it seems to make sense for us to create our own SQL-based assets using the `SourceAsset` class and the `non_argument_deps` parameter, and lean into giving Dagster full flexibility in scheduling execution. We'll have a lot of data flowing in and out of the database, so dealing with a blocking dbt call would be inefficient. It's good to know what the options are!
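For anyone reading later, a minimal sketch of that pattern (table and asset names are hypothetical, and the database helper is a stand-in):

```python
from dagster import SourceAsset, asset

def run_sql(query: str) -> None:
    # Stand-in for a real database client call (hypothetical helper).
    print(f"executing: {query}")

# A table populated outside Dagster, declared so it shows up in the graph.
raw_events = SourceAsset(key="raw_events")

# A SQL-backed asset: `non_argument_deps` wires up the dependency without
# passing data through Dagster; the actual work happens in the database.
@asset(non_argument_deps={"raw_events"})
def int_events() -> None:
    run_sql("CREATE TABLE int_events AS SELECT * FROM raw_events")
```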