# integration-dbt
b
👋 Hey all, I'm upgrading from Dagster `1.5.9` to `1.6.5` and getting some breaking behavior that I want to verify. Before running my dbt models, I have an asset that generates numerous raw tables in Snowflake. For my use case, it doesn't make sense to have these be different assets. They're dynamic - one day there could be 30, the next there could be 40. Nor do I care about the majority of them within the context of Dagster. I most definitely want them written as raw tables to Snowflake, but aside from that, only a select few of them are of use in Dagster or dbt: only a subset of those tables serve as source tables for some of my dbt staging models. In `1.5.9` I simply assigned all of the source tables to the same asset key and everything worked as expected. My dependency graph was one asset that splits out into ~20 dbt staging model assets that carry on from there. I don't need the source tables to be their own assets within Dagster. That'd be inefficient and create unnecessary clutter. Is there still a way to accomplish this? Or is Dagster enforcing granularity here?
I believe this to be a bug and created an issue, if anyone else is interested.
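(For illustration, a minimal sketch of the pre-`1.6` pattern being described here, assuming a custom `DagsterDbtTranslator`; the `SnapshotTranslator` name and the `snapshot` asset key are hypothetical:)

```python
from typing import Any, Mapping

from dagster import AssetKey
from dagster_dbt import DagsterDbtTranslator


class SnapshotTranslator(DagsterDbtTranslator):
    def get_asset_key(self, dbt_resource_props: Mapping[str, Any]) -> AssetKey:
        # Collapse every dbt source table into the single upstream snapshot
        # asset; dbt models keep their default asset keys.
        if dbt_resource_props["resource_type"] == "source":
            return AssetKey("snapshot")
        return super().get_asset_key(dbt_resource_props)
```

(Wired in via `@dbt_assets(..., dagster_dbt_translator=SnapshotTranslator())`, this is roughly the shape of setup that `1.6.x` started rejecting with a duplicate asset key error.)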
r
A related thread: https://dagster.slack.com/archives/C04CW71AGBW/p1707481933328809 We enforce uniqueness of asset keys in Dagster. Although this may have been working before, this was not intended.
> They're dynamic - one day there could be 30, the next there could be 40.
If you use `get_asset_keys_by_output_name_for_source` (link), you can generate the asset keys associated with your dbt source in a dynamic way. Does that help?
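(A rough sketch of that suggestion; `my_dbt_assets` is assumed to be an existing `@dbt_assets` definition and `raw_snapshot` an assumed dbt source name:)

```python
from dagster import AssetOut, Output, multi_asset
from dagster_dbt import get_asset_keys_by_output_name_for_source

# One output per table in the dbt source, derived from the manifest at
# definition time, so the set of asset keys tracks the dbt project.
@multi_asset(
    outs={
        output_name: AssetOut(key=asset_key)
        for output_name, asset_key in get_asset_keys_by_output_name_for_source(
            [my_dbt_assets], source_name="raw_snapshot"
        ).items()
    }
)
def snapshot_tables(context):
    for output_name in context.selected_output_names:
        # ... write the corresponding raw table to Snowflake ...
        yield Output(value=None, output_name=output_name)
```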
b
:sadpanda: Thanks @rex, but not really. I don't want/need 40 more assets when they can be adequately defined as a single asset. I'm struggling to understand why dbt source models need to map one-to-one onto Dagster assets. Relaxing that doesn't seem to present any issues, and enforcing it doesn't seem to provide any benefits. I'd love to hear the rationale.
s
I'm curious about this -- why do you think it's cluttered to represent the "real" space of assets in your asset graph? Feels like you're trying to operate at two different levels of granularity here -- your sources at more of a "group" level, and your dbt models at a "table" level. I do think it would be a nice behavior, from the graph standpoint, to be able to collapse the graph by "group" so that it doesn't explode out. I think this behavior does happen in some contexts, and it's nice. But it seems like I'd want to track the actual table/object-level lineage in most cases, and "opt in" to the grouping at the visualization level.
b
Thanks for engaging, Stephen! I find it an interesting use case too. I'll readily admit that it's not the typical use case - most of the time one-table-per-asset is the way to go and makes complete sense. But not all the time - multi-assets are a nod to this, but I don't think they're the full answer.

Take a database snapshot. A snapshot itself is an asset. Snapshots don't require knowing or even caring about their constituent pieces. They have value in and of themselves and can be defined, computed, persisted, etc., without ever mentioning their constituent parts. So I am operating at the same level of granularity - the asset level. It's already accepted that assets come in many different shapes and sizes. Which makes sense; it's an abstraction.

Yes, there are tables from the snapshot that I then use within dbt to create further assets, but why must they be defined as their own assets? They don't require further definition. The dbt source models certainly don't define them; they simply reference database tables - database tables that are a logical consequence of my snapshot asset. I can even check for their existence, capture their metadata, etc., all within the construct and scale of my snapshot asset if I have any concerns. The lineage of assets is still very clear - the snapshot asset is the parent of many downstream assets.

Creating more assets, putting another layer between my snapshot asset and the assets defined by my dbt staging models, doesn't buy me anything or prevent any issues. It's "completeness" for the sake of completeness. It just adds boilerplate - clutter - and leaks implementation details about how I'm constructing and defining my downstream assets - Ope! there are source model assets, must be using dbt. I could use Python instead of dbt to define the exact same downstream assets without any issue. Why must the dbt integration force me to do otherwise? The perk of Dagster SDAs is that they can be defined in their own right and not as a piece of a larger task workflow like in Airflow. My snapshot asset should not care or even know about the dbt-derived assets.
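(To make the "one asset, many tables" idea concrete, a minimal sketch of a snapshot-style asset that materializes every raw table but stays a single node in the graph; the `load_snapshot_tables` helper is hypothetical:)

```python
from dagster import MaterializeResult, asset


def load_snapshot_tables() -> list[str]:
    # Hypothetical helper: restore the snapshot and write every raw table
    # to Snowflake, returning the names of the tables it created.
    ...
    return []


@asset
def snapshot() -> MaterializeResult:
    tables = load_snapshot_tables()
    # Per-table details live in this one asset's metadata instead of as
    # separate nodes in the asset graph.
    return MaterializeResult(
        metadata={"table_count": len(tables), "tables": tables}
    )
```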
s
Ah, I see. We just set up the same flow, and we're breaking the snapshot into basically three sections:
1. Create a snapshot from RDS --> S3 [one asset, but each snapshot is just a new "materialization" of the same asset]
2. Create external tables for each table that will be queried downstream [N assets]
3. Reference the external tables that are used in dbt [M assets]
For us, that seems to capture the logical steps that actually occur, but I can see what you mean -- if, e.g., the dbt process includes the sub-referencing within a single asset, then it can get boilerplate-y. At some level, though, adopting the asset framework really is committing yourself to making these pedantic decisions about "what is a single asset?" Probably 15% of the time, I feel that this is a low-value decision to be made.
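(Under assumed names, that three-step layout might look roughly like this, with the dbt wiring elided:)

```python
from dagster import asset


# 1. One asset whose runs are successive snapshots (RDS --> S3).
@asset
def rds_snapshot() -> None: ...


# 2. One asset per external table built from the snapshot [N of these].
@asset(deps=[rds_snapshot])
def orders_external_table() -> None: ...


@asset(deps=[rds_snapshot])
def customers_external_table() -> None: ...


# 3. The dbt models then declare these external tables as sources, so each
#    source table maps one-to-one onto an asset defined above [M assets].
```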
b
That certainly makes sense, especially if there's a step to create external tables. Your external-table assets are actually analogous to my real "snapshot" asset that's in the picture as the parent of many assets downstream. I'm doing something similar; just the granularity of the intermediary asset is different due to my requirements:
1. Take snapshot [one asset]
2. Generically materialize all tables within the snapshot [one asset]
3. dbt [many assets]
For my use case it's important that all tables are materialized - in super raw, generic schemas - but only a small subset need any further individual recognition, and only in the context that they are source models within dbt. Definitely agree it's typically a low-value decision, and that's a big reason why I want the ability to decide for myself how to define an asset. Having to break apart an upstream asset just to appease an overfit integration interface feels like it violates the flexibility Dagster and SDAs are supposed to offer. To make my intermediary asset a multi-asset that only generates assets for the source models needed in dbt, while ignoring the other materialized tables, feels wonky and leaky. But creating individual assets for all materialized tables feels bloated and unnecessary. To me, keeping it as one asset feels like a non-issue and I should just be able to go on my merry way 🤷
s
Are they defined as multiple source models in your dbt project? I.e., that's the breaking point, right?
Our dagster-dbt project and most of our Dagster code are defined in separate code locations, so there's a loose coupling there for us.
b
That's certainly the point of contention. Whether or not it's "breaking" is where I differ. Intended or not, when it worked in `v1.5.9` nothing was awry or awkward: everything behaved as expected, I leveraged all the features, there were no conflicts, etc. I could certainly be wrong, but I see no possible way it could truly break anything. It's just an opinionated constraint and not required to interface Dagster and dbt. I'm curious if @rex could set me straight there. I could certainly create a single table that has all the data I need for all the downstream models - the intermediary asset creates all tables with the same generic schema, so they align - and use it as a singleton dbt source model to align with my upstream Dagster asset, but why? Everyone would think that's silly. Why is it not silly the other way around - contorting my upstream Dagster asset to align with how dbt references a source?
🤷 1
s
Yeah, I guess there are tradeoffs. From my perspective, this decision --
> I simply assigned all of the source tables to the same asset key
-- is unexpected / misaligned with the expected topology of an asset graph (i.e. one "thing" meaning many "things"). But I get that there can be good reasons for it and that it's frustrating for the behavior to break on you.
b
I appreciate the back-and-forth and your willingness to talk it through! :blob-fistbumpl: :blob-fistbumpr:
🎉 1
If anyone is interested, I submitted a PR that resolves the issue above - all features and functionality are preserved while allowing the user greater flexibility. TL;DR - dbt sources are not assets themselves; they reference assets. This provides the user more control over those references.
r
Wrapping this thread up (a fun case of Hyrum's Law at work!): I agree that we shouldn't be policing the implementation of your dbt source computations. But the unique asset key scheme is an existing behavior that folks want, and I want to continue to explicitly error on that in the out-of-the-box case - especially if folks want the history of their assets to be separated in our asset catalog UI, continued granularity of their asset definitions, etc. However, for folks that want to customize this out-of-the-box behavior, we'll be allowing this duplicate asset key behavior under an explicit flag in `DagsterDbtTranslatorSettings` called `enable_duplicate_source_asset_keys` that relaxes the duplicate asset key constraint for dbt sources. This will enable the behavior that you had before. Thanks for the feedback + collaborating on this @Brandon Freeman!
:thank-you-box: 1
:dagster-yay: 2