# integration-dbt
n
Hi there! I'm trying to use Dagster to run full dbt DW builds using dagster-dbt library, but I can't get a full build to run in production... Details in šŸ§µ
The dev deployment is Docker compose on my local machine. The prod deployment is Docker compose on a GCP VM.
ā€¢ šŸŸ¢ Full DW builds work fine in dev. (See first screenshot.)
ā€¢ šŸŸ¢ Small dbt jobs work fine in prod. (See second screenshot.)
ā€¢ šŸ”“ Full DW build does not work in prod. The dbt step just hangs at `Started execution of step "run_dbt_1236b"` for hours before I terminate the job. It never gets to the step that should be `Executing command: dbt --no-use-color --log-format json run --project-dir /opt/dagster/app/dagster_integrations/resources/../.. --profiles-dir /opt/dagster/app/dagster_integrations/resources/../../profile --select *`. (See third screenshot.)
I'm guessing it has something to do with the size of our dbt project (533 models, 1319 tests, 0 snapshots, 14 analyses, 859 macros, 0 operations, 18 seed files, 269 sources, 6 exposures, 0 metrics), but I can't get any observability into why things just stall.
ā€¢ prod VM: 4 vCPUs w/ 16 GB memory
ā€¢ local dev env Docker resources: 6 vCPUs w/ 9 GB memory
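For context, the thread doesn't show the Dagster code behind the full-build job, but a minimal sketch of how a full dbt DW build is commonly wired up with dagster-dbt (paths and names below are placeholders, not the actual project's values) might look like this:

```python
from dagster import AssetSelection, Definitions, define_asset_job
from dagster_dbt import dbt_cli_resource, load_assets_from_dbt_project

# Placeholder paths; the real project resolves these relative to the repo.
DBT_PROJECT_DIR = "/opt/dagster/app"
DBT_PROFILES_DIR = "/opt/dagster/app/profile"

# Turn every model in the dbt project into a Dagster asset.
dbt_assets = load_assets_from_dbt_project(
    project_dir=DBT_PROJECT_DIR,
    profiles_dir=DBT_PROFILES_DIR,
)

# One job that materializes everything, i.e. the full DW build.
full_dw_build = define_asset_job("full_dw_build", selection=AssetSelection.all())

defs = Definitions(
    assets=dbt_assets,
    jobs=[full_dw_build],
    resources={
        "dbt": dbt_cli_resource.configured(
            {"project_dir": DBT_PROJECT_DIR, "profiles_dir": DBT_PROFILES_DIR}
        )
    },
)
```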
t
Hmm, 533 models should be well within bounds for Dagster to run. If you go into the raw compute logs (screenshot below), do you see anything in stderr or stdout?
n
No... I just see `No log file available`
Thanks for replying btw
I should add... The image is running Dagster 1.2.7 and dbt 1.1.1
t
Odd that it doesn't have compute logs, which may indicate that the process isn't starting at all.
n
Here's a job that ran okay but doesn't have compute logs
t
Sorry, sent that message without reading your most recent one: I think streaming compute logs are a more recent addition to the integration, so it might not work with dbt 1.1.1. Is there any particular reason preventing you from upgrading to dbt 1.4+? All good if there is; just trying to figure out the right ways to debug this.
n
The AEs who do a bulk of the dbt work aren't ready for an upgrade just yet... That's my impression anyway
I'm just reflecting the env they use in development and what's currently running in Airflow... Trying to migrate the full DW build away from Airflow but running into this
Would a dbt upgrade fix this?
t
It's a possibility. Let me reach out to more of the team to help out with this.
n
tysm!
My confusion is that it works fine in local dev env
o
hi @Nicolas May! definitely odd, and I'm struggling to come up with a super plausible explanation. how is dagster being deployed? I'm specifically curious about what database you're using for event log storage. another question would be how many of the sources in your dbt project connect to upstream dagster assets
n
> how is dagster being deployed?
Dagster is deployed using Docker compose on a GCP VM.
> I'm specifically curious about what database you're using for event log storage.
The backend DB is GCP CloudSQL Postgres 14.
> how many of the sources in your dbt project connect to upstream dagster assets
There are quite a few Fivetran assets that are upstream of the dbt assets... and not all of them are included in this problematic job... It's hard to capture in a pic because it's big
You may be able to see from the screenshots that we've got three Fivetran syncs' asset groups upstream of the full dbt step... But we've excluded many that are upstream... Should I try to include them?
Would some of the excluded Fivetran assets being stale or never materialized be blocking this dbt step? ... even though Dagster isn't really materializing the Fivetran assets, but instead is just giving observability into the Fivetran syncs?
I'll add that the working prod dbt job does include all of its upstream Fivetran assets
Also... thanks so much for following up on this! It's def a head-scratcher for me
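For illustration only (the group names here are made up, not taken from the project), the difference between a job selection that stops at the dbt assets and one that also pulls in every upstream Fivetran asset might look like this:

```python
from dagster import AssetSelection, define_asset_job

# Hypothetical group holding the assets loaded from the dbt project.
dbt_sel = AssetSelection.groups("dbt_warehouse")

# Roughly what the problematic job does: dbt assets plus only some Fivetran groups.
partial_build = define_asset_job(
    "full_dbt_build",
    selection=dbt_sel | AssetSelection.groups("fivetran_crm"),
)

# Alternative: walk every upstream dependency of the dbt assets, which would
# pull in all of the connected Fivetran source assets as well.
full_lineage_build = define_asset_job(
    "full_dw_build_with_sources",
    selection=dbt_sel.upstream(),
)
```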
o
thanks for that info! my best guess at the moment (although it's a shaky one) is that this specific setup is causing the framework to make a bunch of queries to your event log database, which are causing the step to hang.
trying to confirm if that's plausible right now
n
thank you box
That's interesting... The local dev env just uses a Postgres service defined in the docker-compose.yaml... It doesn't point to any persistent DB
o
I'm feeling a bit more confident on this hypothesis -- if you have easy observability into your backend postgres db, it'd be great to take a look at that to confirm if this is happening. I'd expect to see a large volume of queries right when the step starts. each one should be fairly quick but I could imagine this filling up a queue. also looking into what we can do on our side to make this not fire off as many queries
n
Okay... Thanks much owen! I'll see what I can do to track down queries around the time this step tries to start
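If it helps, a rough way to watch for that query pile-up from the outside is to poll `pg_stat_activity` on the event-log database around the time the run_dbt step starts (connection details below are placeholders):

```python
import time

import psycopg2  # assumes psycopg2 is installed and the CloudSQL instance is reachable

conn = psycopg2.connect(
    host="CLOUDSQL_HOST", dbname="dagster", user="dagster", password="..."
)
conn.autocommit = True

ACTIVE_QUERIES = """
    SELECT pid, state, wait_event_type,
           now() - query_start AS running_for,
           left(query, 80) AS query_snippet
    FROM pg_stat_activity
    WHERE datname = current_database()
      AND state <> 'idle'
    ORDER BY query_start;
"""

# Sample the active queries every few seconds while the step is starting up.
with conn.cursor() as cur:
    for _ in range(60):
        cur.execute(ACTIVE_QUERIES)
        rows = cur.fetchall()
        print(f"--- {len(rows)} active queries ---")
        for row in rows:
            print(row)
        time.sleep(5)
```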
b
@Nicolas May that's highly likely gonna be the db
i have had similar issues; it really comes down to the amount of events your runs generate, and at some point the tables get too large for the workload that you have
you're gonna see those 60 second timeouts popping on the db side. Can you prune asset events etc.?
n
Thanks Bojan... I guess, how can I improve the situation?
b
or is that something that you under no circumstances want to do
what's the size of your pg instance ?
n
db-g1-small
Scale up?
b
ya for sure
n
Any size recommendations?
b
is that gcp or aws
n
GCP... CloudSQL Postgres 14
b
also how often do you run the whole workload, and back to my original question of can you prune assets and events
ya pg on gcp does not use t shirt sizes but it has its own way of picking sizes
n
Full DW build is once daily when nothing else is running... We can try to prune assets, but I'll need to talk with my AEs to see what's minimally possible
b
ah no need then
šŸ‘ 1
i build mine every fivetran sync
which happens every 10 minutes hence the question
ya you need to up the db
go for 4 cpus, 8 gig
n
Okay... thanks for the rec
b
btw gcp pg should give some nice monitoring out of the box. You should be able to figure out io load there if this size doesn't pan out.
n
Just wanna follow up on this thread... We turned on GCP CloudSQL [Query Insights](https://cloud.google.com/sql/docs/postgres/using-query-insights) to get some observability into the query load, and we definitely see spikes right before the `dbt run` commands.
We upgraded the GCP CloudSQL DB instance from a shared-CPU, 1.7 GB memory `db-g1-small` machine type to a 2 vCPU, 8 GB memory `db-custom-2-7680`, and that did the trick... Now the jobs that include dbt are starting super quick and/or not failing.
Thanks again for your help @Tim Castillo @owen @Bojan
b
sweet, note that you might want to start pruning the old jobs/events/assets
n
Good call out... How have you gone about that?
I've set the sensor/schedule tick retention per this, but not sure how to prune old events
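For anyone landing here later, one way to prune old run history is a small script against the instance API; this is a sketch, not an official recipe, and the `created_before` filter is an assumption that may not exist on older Dagster versions. Deleting a run also removes that run's event log records, which is where most of the table growth comes from.

```python
from datetime import datetime, timedelta, timezone

from dagster import DagsterInstance, RunsFilter

# Keep the last 30 days of run history (adjust to taste).
CUTOFF = datetime.now(timezone.utc) - timedelta(days=30)

instance = DagsterInstance.get()  # resolves storage from DAGSTER_HOME / dagster.yaml

# NOTE: created_before filtering is assumed here; on older versions you may need
# to fetch runs in batches and compare timestamps yourself.
old_runs = instance.get_runs(filters=RunsFilter(created_before=CUTOFF), limit=500)

for run in old_runs:
    # delete_run removes the run record and its associated event log entries.
    instance.delete_run(run.run_id)
    print(f"deleted {run.run_id}")
```

Stale asset materialization history is a separate thing; if needed, it can be wiped per asset key from the UI (or, I believe, via `DagsterInstance.wipe_assets`), at the cost of losing that asset's history.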
t
n
Thanks much!