# ask-community
s
Hi folks, this is more of a conceptual question. I've been playing with Dagster and found it to be quite clever: you write what looks like a normal Python program, but in the background the decorators persist results to the storage of your choice through your IO manager configuration. I've been considering using DBT in an upcoming project, the idea being that I get my raw(ish) data into a data warehouse and apply DBT transforms there. DBT is great, though it has the drawback of being based on SQL rather than a general-purpose programming language.

Having played with Dagster, it appears it can perform a similar function: you write your Python code, and the function materialises the result 'somewhere' (depending on your IO manager). Given this, I'm considering leaning into Dagster more for my transforms, so I can write them in Python rather than SQL, with the benefits that provides. It would also give me traceability similar to DBT's, thanks to those materialisations in intermediate tables/files. Dagster also has a DBT integration, though I haven't toyed with it yet.

My thought is that either approach has merit, and it comes down to whichever language (Python/SQL) your users are most comfortable with. Obviously this may have some trade-offs, but the idea popped into my mind over the weekend and I'm wondering if it holds up. Would love to hear your thoughts.
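A minimal sketch of the pattern being described, assuming recent Dagster and pandas; the asset and resource names here are illustrative, not from the thread. Each function returns a plain Python value, and the configured IO manager decides where it lands, so every intermediate is materialised and traceable:

```python
import pandas as pd
from dagster import Definitions, FilesystemIOManager, asset


@asset
def raw_events() -> pd.DataFrame:
    # Stand-in for pulling raw(ish) data from an API or file drop.
    return pd.DataFrame({"user": ["a", "b", "a"], "value": [1, 2, 3]})


@asset
def user_totals(raw_events: pd.DataFrame) -> pd.DataFrame:
    # A transform written in ordinary Python/pandas rather than SQL.
    # The IO manager persists the return value, so this intermediate
    # is inspectable just like a dbt model's output table.
    return raw_events.groupby("user", as_index=False)["value"].sum()


defs = Definitions(
    assets=[raw_events, user_totals],
    # Swap this for a warehouse IO manager to land assets as tables.
    resources={"io_manager": FilesystemIOManager(base_dir="/tmp/dagster_storage")},
)
```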
v
I think your thoughts are going in the right direction. If your IO manager supports loading data into a DB, then you can definitely hook dbt up anywhere in the asset graph, as long as it knows where to find the table (usually by setting up the source assets as dbt sources). The only thing to think about is that, by default, the IO managers will load all the data (or the partition) into memory before processing it, so you have to make sure your process has enough memory/CPU to handle that. I'm not intimately familiar with how it works when you run your processing in something like Spark, though.
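A minimal sketch of that handoff, assuming BigQuery via the `dagster-gcp-pandas` package; the project and dataset names are hypothetical. The IO manager writes the returned DataFrame to a table named after the asset, and on the dbt side you would declare that table (here `raw.raw_orders`) as a source in your `sources.yml` so downstream models can reference it with `{{ source('raw', 'raw_orders') }}`:

```python
import pandas as pd
from dagster import Definitions, asset
from dagster_gcp_pandas import BigQueryPandasIOManager


@asset
def raw_orders() -> pd.DataFrame:
    # The IO manager persists this DataFrame as a table named after the asset.
    return pd.DataFrame({"order_id": [1, 2], "amount": [9.99, 24.50]})


defs = Definitions(
    assets=[raw_orders],
    resources={
        "io_manager": BigQueryPandasIOManager(
            project="my-gcp-project",  # hypothetical GCP project id
            dataset="raw",             # table lands as raw.raw_orders
        )
    },
)
```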
s
Good to hear. That's a good point regarding the loading into memory, I'll keep that in mind thank you
a
If I recall correctly there's a tutorial somewhere (probably a GitHub discussion) about how to return a generator from an IO manager or resource to avoid loading all the data into memory. Also, you can run SQL transformations straight from Python code within an asset and connect it to the asset graph via `non_argument_deps`, which also lets you avoid loading data into memory, with the trade-off of maintaining SQL queries embedded in Python code (but with the added benefit of not adding yet another tool like dbt to the stack if the advantages aren't clear for your use case; I'm in this boat myself). See the sketch below.
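A minimal sketch of that pattern, assuming a `bigquery` resource that exposes a google-cloud-bigquery client, plus the hypothetical table names from earlier. Because the asset returns nothing and the upstream dependency is declared via `non_argument_deps`, no data passes through an IO manager or Python memory; the transform runs entirely inside the warehouse:

```python
from dagster import asset


@asset(
    non_argument_deps={"raw_orders"},      # upstream dependency; data is not loaded
    required_resource_keys={"bigquery"},   # assumed resource wrapping a BQ client
)
def orders_summary(context) -> None:
    # SQL embedded in Python: the query executes in BigQuery and writes
    # its own output table, so nothing enters this process's memory.
    context.resources.bigquery.query(
        """
        CREATE OR REPLACE TABLE marts.orders_summary AS
        SELECT order_id, SUM(amount) AS total_amount
        FROM raw.raw_orders
        GROUP BY order_id
        """
    ).result()  # block until the BigQuery job completes
```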
s
In a previous role the benefits of adding DBT were clear: SQL was used extremely heavily, with lots of manual deploys, copy-pasting of queries, etc. DBT helped bring that under control. SQL was used and understood by the majority of the users, and I didn't want to introduce a massive change for them. My current project is in more of a software engineering shop, with less use of SQL. The advantage of using DBT would be less obvious, hence my thought about leaning into Dagster more, with its Python-based transforms, and utilising folks' existing skills. I'm quite happy building my own IO managers if need be, though I haven't got as far as that yet.
q
Have you looked at Python models in dbt if your warehouse supports it?
s
It's not an approach I'd considered, though it's a possibility. The data warehouse is BigQuery, which I believe DBT Python models support. I've never experimented with DBT Python models, which are fairly new. Do you have any opinion on them?
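For reference, a minimal sketch of what a dbt Python model looks like; this assumes dbt-bigquery's Dataproc-backed runtime, where `dbt.ref()` hands you a PySpark DataFrame, and the model and column names are illustrative:

```python
# models/user_totals.py in the dbt project.
# The model(dbt, session) signature is dbt's convention for Python models.
def model(dbt, session):
    # Configure like a SQL model; "serverless" submits via Dataproc Serverless.
    dbt.config(materialized="table", submission_method="serverless")
    # ref() returns the upstream model/source as a DataFrame
    # (a PySpark DataFrame on BigQuery's runtime).
    events = dbt.ref("raw_events")
    # The returned DataFrame is what dbt materialises as the model's table.
    return events.groupBy("user").sum("value")
```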