# deployment-kubernetes
s
I'm wondering what the practical boundaries of operating Dagit are before you see serious performance degradation, i.e. x ops per job? y jobs concurrently? The Dagit instance we've launched is really struggling (based closely on the helm template). For instance, it can take several minutes to load a run, and it fails 50%+ of the time. Similarly, loading the list of runs may take 20 seconds (even with only <5 runs total). I've closely monitored Postgres, but it's using less than 1% of its capacity. The Dagit in our deployment uses ~16 GB RAM/instance * 4 instances, and both memory and CPU are underutilized. Getting a debug file (>1 GB) was challenging, as Dagit struggled to generate the whole thing (multiple failed attempts). One thing I noticed is that once I start to pull a big debug file, the readinessProbe fails and the dagit instance becomes unresponsive for a few minutes. In order to actually obtain a debug file, I had to increase the number of instances and raise the tolerance for healthcheck failures. Can I get some guidance on 1. how to tune Dagit to address these performance issues and 2. what Dagster's suggested boundaries for usable performance are? EDIT: my estimate is that this run is using about 20-25k ops
b
What dagit/dagster version are you using? I heard there were considerable performance improvements in dagit loading in the last 2-3 releases
you might also need to increase your timeouts and loosen your readiness/liveness probe thresholds (see the values sketch below)
check your network latencies and network policies, if any
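For reference, loosening those probes via the chart's values can look roughly like this. A minimal sketch, assuming your chart version exposes probe overrides under the dagit key and that /dagit_info is the health endpoint your deployment already probes; the timing fields themselves are standard Kubernetes probe settings.

```yaml
# Hedged sketch of Helm values for loosening Dagit's probes.
# Assumption: probe overrides live under the `dagit` key in your chart version,
# and /dagit_info is the health endpoint already being probed.
dagit:
  readinessProbe:
    httpGet:
      path: /dagit_info
      port: 80            # match the container port your chart uses for dagit
    initialDelaySeconds: 30
    periodSeconds: 20
    timeoutSeconds: 10    # allow slow responses while a large run is loading
    failureThreshold: 6   # tolerate several slow checks before marking the pod unready
  livenessProbe:
    httpGet:
      path: /dagit_info
      port: 80
    periodSeconds: 30
    timeoutSeconds: 10
    failureThreshold: 6
```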
s
This is on v1.2.3. I checked network latencies (50-100 Mbit/s), and I also port-forwarded to remove any potential interfering load-balancer config.
Yes, increasing timeouts and failure thresholds is something I've already done, just to prevent the 5xx bad-gateway errors.
b
Can you try reducing your dagit instance count to just 1 and see if it matters? I feel the multiple high-availability Dagit instances might be causing some race conditions; from your description it seems you're currently running 4 dagit instances
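For a quick test, that's a one-line override in the Helm values. A minimal sketch, assuming the webserver is still configured under the dagit key in your chart version:

```yaml
# Sketch: run a single Dagit replica to rule out multi-instance effects.
dagit:
  replicaCount: 1
```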
s
On 1 instance, the request always failed. I moved to 4 to make it actually succeed.
and @Binoy Shah - thank you for the input. any other ideas?
d
This doesn't sound expected - particularly the part about loading the list of runs taking 20 seconds on a beefy dagit box. Can you share more about how your postgres DB is set up (a cloud-provided DB like RDS? Local in its own pod as part of the Helm chart)? What kind of CPU/memory does it have available?
I'm also curious about what's taking up the space in that 1GB run debug info - are you possibly doing a lot of Python logging within your ops and resources? Or is it just a lot of system events from many ops happening within the job?
s
Should be relatively little logging, mostly lots of ops.
d
Can you quantify the number of ops? Dunno if you'd be able to share that debug file privately, but we could dig in and see if anything jumps out from the run
🙏🏻 1
But from your description the first thing i'd check is the CPU/memory of the DB
s
Using a Cloud SQL instance - 2 vCPU, 4 GB RAM, 10 GB SSD. During peak usage, only 23% CPU utilization and <50% memory usage on the DB.
d
Got it - but loading the runs list consistently takes >20 seconds even with only 5 runs?
s
I monitor slow queries and am happy to share those; only a couple possibly stand out.
d
that would be helpful, yeah
s
On the order of 1-10k ops. I don't know specifically at this time.
d
A really large run loading slowly sounds plausible to me depending on the details - the runs list is a pretty simple SQL query though, so if that's slow it means something's misbehaving quite a bit
s
messaging you details
b
10k ops... are these all active ops?
s
what do you mean by active?
b
i meant concurrent Ops ?
s
no, they are staggered
@Binoy Shah it's about 20-25k ops.
d
A few different performance improvements went out in 1.2.7 to help address some of the perf issues described here (a slow runs page for runs with large job snapshots, and large event log payloads for asset runs that were materializing large numbers of ops)
I'd be curious if things look better in that version
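If you try the newer release, pinning the image tag in the Helm values is usually enough; the default tag typically tracks the chart version, so upgrading the chart itself also works. A sketch assuming the 1.2.x key layout:

```yaml
# Sketch: pin the Dagit image to 1.2.7 (keep whatever repository you already use).
# Key layout is an assumption based on the 1.2.x chart; adjust to your values file.
dagit:
  image:
    tag: "1.2.7"
```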
b
Going to upgrade after checking the change log for breaking changes
s
BIG thank you @daniel