Noah Sanor

06/21/2021, 7:42 PM
Hello! My team has a use case where we would like to use the dagster_dbt package to execute dbt operations against a data set. I noticed that there is a pattern for configuring the `dbt_rpc` resource that can work locally or in higher environments. However, dbt currently does not support running the rpc server on Windows (which unfortunately my team uses). The other approach is to run the cli commands locally and rpc commands on dev/prod/etc. Is there a defined pattern for doing this now? Please let me know if this question should be asked elsewhere.
owen

06/21/2021, 9:43 PM
hi @Noah Sanor -- you're in the right place 🙂. Unfortunately, at the moment, there is not a built-in solution for doing this, but there is a pattern that should work for you. Conceptually, what you want is to use a "dbt" resource throughout your pipeline, which in dev/prod uses the `dbt_rpc` resource, and in your local mode uses a `dbt_cli` resource. Unfortunately, a `dbt_cli` resource doesn't exist at the moment (it's something we're looking into adding), so you would have to write your own. This can get slightly tricky because rpc and cli commands have slightly different interfaces and potentially different return types, but depending on how much you rely on the data from the RPC responses, this may not end up being too annoying to deal with.
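[The resource-swap pattern described above can be sketched in plain Python. This is not dagster API -- the class names, the `DAGSTER_ENV` variable, and the `.run()` interface are all illustrative stand-ins for whatever resource you'd write yourself:]

```python
import os

class DbtRpcClient:
    """Illustrative stand-in: would send JSON-RPC requests to a dbt RPC server."""
    def __init__(self, host, port):
        self.host, self.port = host, port

    def run(self, models=None):
        # A real implementation would POST to http://{host}:{port}/jsonrpc.
        return {"backend": "rpc", "models": models or []}

class DbtCliClient:
    """Illustrative stand-in: would shell out to the local dbt CLI."""
    def __init__(self, project_dir):
        self.project_dir = project_dir

    def run(self, models=None):
        # A real implementation would invoke something like
        # subprocess.run(["dbt", "run", "--models", *models], cwd=self.project_dir).
        return {"backend": "cli", "models": models or []}

def dbt_resource():
    """Pick the client by environment -- both expose the same .run() interface,
    so ops written against it don't care which backend they got."""
    if os.getenv("DAGSTER_ENV", "local") == "local":
        return DbtCliClient(project_dir=".")
    return DbtRpcClient(host="localhost", port=8580)
```

[Since both clients share one interface, the only environment-specific code is the factory that picks between them.]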
Noah Sanor

06/21/2021, 10:16 PM
Thanks for the response Owen! For the first phase of this project, we will just be using the cli in each environment and will look to decouple that as a fast follow.
Todd Hendricks

12/22/2021, 7:16 PM
hi @owen thanks for this outline. following up to see if this was still on the roadmap. @Noah Sanor are you able to share what your team ended up implementing? i think i'm trying to solve the same problem.
owen

12/22/2021, 7:25 PM
hi @Todd Hendricks! This is actually all done at this point 🙂 there is a `dbt_cli_resource` that implements the same interface as the `dbt_rpc_sync_resource`, so they can be used essentially interchangeably
ā¤ļø 1
all of the ops that dagster-dbt exports work with either of these resources
Todd Hendricks

12/22/2021, 7:29 PM
thank you @owen! i am new to rpc technology, so i'm trying to wrap my head around how the pieces fit together. use case: i want to get the data produced from an executed dbt model and ingest it for operations in my dagster job. i'm going to describe the process as i currently understand it.
• Define the dbt rpc server as a resource
• Have an op launch dbt model
• Have an op "listen"/poll the server to see when the model is done
• Once done, there is a mechanism in dagster that triggers the operation to go get the data from whatever endpoint (BigQuery, Snowflake)
is that the general idea?
i guess another way to express it is that the rpc server exists to facilitate communication. it does not touch data.
owen

12/22/2021, 7:50 PM
@Todd Hendricks RPC = remote procedure call, and defines a way for one machine to run a command on a different machine. In this case, the general model is that you would have a dbt RPC server running somewhere already (unrelated to dagster), and the `dbt_rpc_resource` allows you to connect to that dbt RPC server and run commands on it (so the resource is the client). All the dbt ops are responsible for is making sure a dbt command gets executed (so for instance `dbt run` or `dbt test`), but they don't actually retrieve the data from wherever dbt is transforming it. In the case of the `dbt_cli_resource`, it ensures the dbt command gets executed by issuing a CLI command on the local machine, and in the case of the `dbt_rpc_resource`, it ensures that the dbt command gets executed by sending an HTTP request to the remote dbt RPC server.
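[To make the CLI-vs-RPC transport difference concrete, here is a hedged sketch of roughly what each path constructs: the CLI path builds a local shell command, while the RPC path builds a JSON-RPC request body to POST to the server. The exact flags and payload fields vary by dbt version; these are illustrative:]

```python
import json

def cli_invocation(command, models=None):
    """CLI path: build the local shell command the resource would execute."""
    argv = ["dbt", command]
    if models:
        argv += ["--models", *models]
    return argv

def rpc_payload(command, models=None):
    """RPC path: build a JSON-RPC 2.0 body to POST to the dbt RPC server.
    Field names here follow the general JSON-RPC shape; dbt's exact
    method/param names depend on the server version."""
    params = {"models": " ".join(models)} if models else {}
    return json.dumps({
        "jsonrpc": "2.0",
        "method": command,   # e.g. "run" or "test"
        "id": "request-1",
        "params": params,
    })
```

[Either way, the end result is the same: dbt runs the command, and the actual data stays in the warehouse.]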
the RPC server itself doesn't actually touch data simply because dbt doesn't really touch data (it just issues SQL statements to the databases)
to get the data that's produced by dbt back into a dagster context, how you do this depends on what database you're using. If you're using something like snowflake, we have a snowflake resource that allows you to issue queries against snowflake. So you could have an op after your `dbt_run_op` that does a `SELECT * FROM ...`-type query
we have a similar integration for BQ
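[The "query the warehouse after dbt runs" step might look like the sketch below. It uses stdlib sqlite3 as a stand-in for Snowflake/BigQuery so it is self-contained; the table name `orders_summary` and the helper are purely illustrative:]

```python
import sqlite3

def fetch_dbt_output(conn, table):
    """After dbt has materialized `table`, pull its rows back into Python."""
    return conn.execute(f"SELECT * FROM {table}").fetchall()

# Stand-in warehouse: pretend dbt just built `orders_summary`.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders_summary (region TEXT, total INTEGER)")
conn.executemany(
    "INSERT INTO orders_summary VALUES (?, ?)",
    [("east", 120), ("west", 95)],
)
rows = fetch_dbt_output(conn, "orders_summary")
```

[With the real integrations, the connection would come from the snowflake or BigQuery resource instead of sqlite3, but the op's shape -- run a SELECT against the table dbt just built -- is the same.]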
Todd Hendricks

12/22/2021, 7:54 PM
Got it! Thank you for taking the time to clarify that, Owen. I very much appreciate it. šŸ‘šŸ¾
owen

12/22/2021, 7:54 PM
the dbt ops also produce a `DbtOutput` object that has a "result" field with parsed information about what happened during dbt execution; you can poke into that to find the names of all the updated tables
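[Extracting updated model names from that result might look like the sketch below. It assumes a run_results.json-style payload where each entry carries a dbt unique_id like `model.my_project.orders_summary`; the exact shape of the "result" field varies across dbt versions, so treat the field names as assumptions:]

```python
def updated_models(dbt_result):
    """Pull model names out of a dbt run-result-style payload.

    Assumes each entry in "results" has a unique_id of the form
    "model.<project>.<model_name>"; skips non-model nodes (tests, seeds).
    """
    names = []
    for entry in dbt_result.get("results", []):
        unique_id = entry.get("unique_id", "")
        if unique_id.startswith("model."):
            names.append(unique_id.split(".")[-1])
    return names

# Hypothetical payload shaped like dbt's run results:
sample = {"results": [
    {"unique_id": "model.my_project.orders_summary", "status": "success"},
    {"unique_id": "model.my_project.customers", "status": "success"},
]}
```

[You could feed the names this returns into the SELECT step above to fetch only the tables dbt actually touched.]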
no problem šŸ™‚