# integration-dbt
n
Hi there! I'm trying to use Dagster to run full dbt DW builds using dagster-dbt library, but I can't get a full build to run in production... Details in šŸ§µ
The dev deployment is Docker compose on my local machine. The prod deployment is Docker compose on a GCP VM.
ā€¢ šŸŸ¢ Full DW builds work fine in dev. (See first screenshot.)
ā€¢ šŸŸ¢ Small dbt jobs work fine in prod. (See second screenshot.)
ā€¢ šŸ”“ Full DW build does not work in prod. The dbt step just hangs at `Started execution of step "run_dbt_1236b"` for hours before I terminate the job. It never gets to the step that should be `Executing command: dbt --no-use-color --log-format json run --project-dir /opt/dagster/app/dagster_integrations/resources/../.. --profiles-dir /opt/dagster/app/dagster_integrations/resources/../../profile --select *`. (See third screenshot.)
I'm guessing it has something to do with the size of our dbt project (533 models, 1319 tests, 0 snapshots, 14 analyses, 859 macros, 0 operations, 18 seed files, 269 sources, 6 exposures, 0 metrics), but I can't get any observability into why things just stall.
ā€¢ prod VM: 4 vCPUs w/ 16 GB memory
ā€¢ local dev env Docker resources: 6 vCPUs w/ 9 GB memory
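For context, the thread doesn't show the Dagster code behind the full-build job, but a minimal sketch of how a full dbt DW build is commonly wired up with dagster-dbt (paths and names below are placeholders, not the actual project's values) might look like this:

```python
from dagster import AssetSelection, Definitions, define_asset_job
from dagster_dbt import dbt_cli_resource, load_assets_from_dbt_project

# Placeholder paths; the real project resolves these relative to the repo.
DBT_PROJECT_DIR = "/opt/dagster/app"
DBT_PROFILES_DIR = "/opt/dagster/app/profile"

# Turn every model in the dbt project into a Dagster asset.
dbt_assets = load_assets_from_dbt_project(
    project_dir=DBT_PROJECT_DIR,
    profiles_dir=DBT_PROFILES_DIR,
)

# One job that materializes everything, i.e. the full DW build.
full_dw_build = define_asset_job("full_dw_build", selection=AssetSelection.all())

defs = Definitions(
    assets=dbt_assets,
    jobs=[full_dw_build],
    resources={
        "dbt": dbt_cli_resource.configured(
            {"project_dir": DBT_PROJECT_DIR, "profiles_dir": DBT_PROFILES_DIR}
        )
    },
)
```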
t
Hmm, 533 models should be well within bounds for Dagster to run. If you go into the raw compute logs (screenshot below), do you see anything in stderr or stdout?
n
No... I just see `No log file available`
Thanks for replying btw
I should add... The image is running Dagster 1.2.7 and dbt 1.1.1
t
Odd that it doesn't have compute logs, which may indicate that the process isn't starting at all.
n
Here's a job that ran okay but doesn't have compute logs
t
Sorry, sent that message without reading your most recent one: I think streaming compute logs are a more recent addition to the integration, so it might not work with dbt 1.1.1. Is there any particular reason preventing you from upgrading to dbt 1.4+? All good if there is; just trying to figure out the right ways to debug this.
n
The AEs who do a bulk of the dbt work aren't ready for an upgrade just yet... That's my impression anyway
I'm just reflecting the env they use in development and what's currently running in Airflow... Trying to migrate the full DW build away from Airflow but running into this
Would a dbt upgrade fix this?
t
It's a possibility. Let me reach out to more of the team to help out with this.
n
tysm!
My confusion is that it works fine in local dev env
o
hi @Nicolas May! definitely odd, and I'm struggling to come up with a super plausible explanation. how is dagster being deployed? I'm specifically curious about what database you're using for event log storage. another question would be how many of the sources in your dbt project connect to upstream dagster assets
n
> how is dagster being deployed?
Dagster is deployed using Docker compose on a GCP VM.
> I'm specifically curious about what database you're using for event log storage.
The backend DB is GCP CloudSQL Postgres 14.
> how many of the sources in your dbt project connect to upstream dagster assets
There are quite a few Fivetran assets that are upstream of the dbt assets... and not all of them are included in this problematic job... It's hard to capture in a pic because it's big
You may be able to see from the screenshots that we've got three Fivetran syncs' asset groups upstream of the full dbt step... But we've excluded many that are upstream... Should I try to include them?
Would some of the excluded Fivetran assets being stale or never materialized be blocking this dbt step? ... even though Dagster isn't really materializing the Fivetran assets, but instead is just giving observability into the Fivetran syncs?
I'll add that the working prod dbt job does include all of its upstream Fivetran assets
Also... thanks so much for following up on this! It's def a head-scratcher for me
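For illustration only (the group names here are made up, not taken from the project), the difference between a job selection that stops at the dbt assets and one that also pulls in every upstream Fivetran asset might look like this:

```python
from dagster import AssetSelection, define_asset_job

# Hypothetical group holding the assets loaded from the dbt project.
dbt_sel = AssetSelection.groups("dbt_warehouse")

# Roughly what the problematic job does: dbt assets plus only some Fivetran groups.
partial_build = define_asset_job(
    "full_dbt_build",
    selection=dbt_sel | AssetSelection.groups("fivetran_crm"),
)

# Alternative: walk every upstream dependency of the dbt assets, which would
# pull in all of the connected Fivetran source assets as well.
full_lineage_build = define_asset_job(
    "full_dw_build_with_sources",
    selection=dbt_sel.upstream(),
)
```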
o
thanks for that info! my best guess at the moment (although it's a shaky one) is that this specific setup is causing the framework to make a bunch of queries to your event log database, which are causing the step to hang.
trying to confirm if that's plausible right now
n
thank you box
That's interesting... The local dev env just uses a Postgres service defined in the docker-compose.yaml... It doesn't point to any persistent DB
o
I'm feeling a bit more confident on this hypothesis -- if you have easy observability into your backend postgres db, it'd be great to take a look at that to confirm if this is happening. I'd expect to see a large volume of queries right when the step starts. each one should be fairly quick but I could imagine this filling up a queue. also looking into what we can do on our side to make this not fire off as many queries
n
Okay... Thanks much owen! I'll see what I can do to track down queries around the time this step tries to start
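If it helps, a rough way to watch for that query pile-up from the outside is to poll `pg_stat_activity` on the event-log database around the time the run_dbt step starts (connection details below are placeholders):

```python
import time

import psycopg2  # assumes psycopg2 is installed and the CloudSQL instance is reachable

conn = psycopg2.connect(
    host="CLOUDSQL_HOST", dbname="dagster", user="dagster", password="..."
)
conn.autocommit = True

ACTIVE_QUERIES = """
    SELECT pid, state, wait_event_type,
           now() - query_start AS running_for,
           left(query, 80) AS query_snippet
    FROM pg_stat_activity
    WHERE datname = current_database()
      AND state <> 'idle'
    ORDER BY query_start;
"""

# Sample the active queries every few seconds while the step is starting up.
with conn.cursor() as cur:
    for _ in range(60):
        cur.execute(ACTIVE_QUERIES)
        rows = cur.fetchall()
        print(f"--- {len(rows)} active queries ---")
        for row in rows:
            print(row)
        time.sleep(5)
```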
b
@Nicolas May that's highly likely gonna be the db
i have had similar issues; it really comes down to the amount of events your runs generate, and at some point the tables get too large for the workload that you have
you're gonna see those 60 second timeouts popping on the db side. Can you prune asset events etc.?
n
Thanks Bojan... I guess, how can I improve the situation?
b
or is that something that you under no circumstances want to do
what's the size of your pg instance ?
n
db-g1-small
Scale up?
b
ya for sure
n
Any size recommendations?
b
is that gcp or aws
n
GCP... CloudSQL Postgres 14
b
also how often do you run the whole workload, and back to my original question of can you prune assets and events
ya pg on gcp does not use t shirt sizes but it has its own way of picking sizes
n
Full DW build is once daily when nothing else is running... We can try to prune assets, but I'll need to talk with my AEs to see what's minimally possible
b
ah no need then
šŸ‘ 1
i build mine every fivetran sync
which happens every 10 minutes hence the question
ya you need to up the db
go for 4 cpus, 8 gig
n
Okay... thanks for the rec
b
btw gcp pg should give some nice monitoring out of the box. You should be able to figure out io load there if this size doesn't pan out.
n
Just wanna follow up on this thread... We turned on GCP CloudSQL [Query Insights](https://cloud.google.com/sql/docs/postgres/using-query-insights) to get some observability into the query load, and we definitely see spikes right before the `dbt run` commands.
We upgraded the GCP CloudSQL DB instance from a shared-CPU, 1.7 GB memory `db-g1-small` machine type to a 2 vCPU, 8 GB memory `db-custom-2-7680`, and that did the trick... Now the jobs that include dbt are starting super quick and/or not failing.
Thanks again for your help @Tim Castillo @owen @Bojan
b
sweet, note that you might want to start pruning the old jobs/events/assets
n
Good call out... How have you gone about that?
I've set the sensor/schedule tick retention per this, but not sure how to prune old events
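For anyone landing here later, one way to prune old run history is a small script against the instance API; this is a sketch, not an official recipe, and the `created_before` filter is an assumption that may not exist on older Dagster versions. Deleting a run also removes that run's event log records, which is where most of the table growth comes from.

```python
from datetime import datetime, timedelta, timezone

from dagster import DagsterInstance, RunsFilter

# Keep the last 30 days of run history (adjust to taste).
CUTOFF = datetime.now(timezone.utc) - timedelta(days=30)

instance = DagsterInstance.get()  # resolves storage from DAGSTER_HOME / dagster.yaml

# NOTE: created_before filtering is assumed here; on older versions you may need
# to fetch runs in batches and compare timestamps yourself.
old_runs = instance.get_runs(filters=RunsFilter(created_before=CUTOFF), limit=500)

for run in old_runs:
    # delete_run removes the run record and its associated event log entries.
    instance.delete_run(run.run_id)
    print(f"deleted {run.run_id}")
```

Stale asset materialization history is a separate thing; if needed, it can be wiped per asset key from the UI (or, I believe, via `DagsterInstance.wipe_assets`), at the cost of losing that asset's history.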
t
n
Thanks much!