# deployment-kubernetes
s
I'm wondering what the practical boundaries of operating Dagit are before you see serious performance degradation, i.e. x ops per job? y jobs concurrently? The Dagit instance we've launched is really struggling (based closely on the helm template). For instance, it can take several minutes to load a run, and it fails 50%+ of the time. Similarly, loading the list of runs may take 20 seconds (even with only <5 runs total). I've closely monitored Postgres, but it's using less than 1% of its capacity. The Dagit in our deployment uses ~16 GB RAM/instance * 4 instances, and both memory and CPU are underutilized. Getting a debug file (>1 GB) was challenging, as Dagit struggled to generate the whole thing (multiple failed attempts). One thing I noticed is that once I start to pull a big debug file, the readinessProbe fails and the dagit instance becomes unresponsive for a few minutes. In order to actually obtain a debug file, I had to increase the number of instances and raise the tolerance for healthcheck failures. Can I get some guidance on 1. how to tune Dagit to address these performance issues and 2. what Dagster's suggested boundaries for usable performance are? EDIT: my estimate is that this run is using about 20-25k ops
b
What dagit/dagster version are you using? I heard there were considerable performance improvements in dagit loading in the last 2-3 releases
you might also need to increase your timeouts and loosen your readiness/liveness probe thresholds (see the values sketch below)
check your network latencies and network policies, if any
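For reference, loosening those probes via the chart's values can look roughly like this. A minimal sketch, assuming your chart version exposes probe overrides under the dagit key and that /dagit_info is the health endpoint your deployment already probes; the timing fields themselves are standard Kubernetes probe settings.

```yaml
# Hedged sketch of Helm values for loosening Dagit's probes.
# Assumption: probe overrides live under the `dagit` key in your chart version,
# and /dagit_info is the health endpoint already being probed.
dagit:
  readinessProbe:
    httpGet:
      path: /dagit_info
      port: 80            # match the container port your chart uses for dagit
    initialDelaySeconds: 30
    periodSeconds: 20
    timeoutSeconds: 10    # allow slow responses while a large run is loading
    failureThreshold: 6   # tolerate several slow checks before marking the pod unready
  livenessProbe:
    httpGet:
      path: /dagit_info
      port: 80
    periodSeconds: 30
    timeoutSeconds: 10
    failureThreshold: 6
```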
s
This is on v1.2.3. I checked network latencies (50-100 Mbit/s), and I also port-forwarded to remove any potential interfering load-balancer config.
Yes, increasing timeouts and failure thresholds is something I've already done, just to prevent the 5xx bad-gateway errors.
b
Can you try reducing your dagit instance count to just 1 and see if it matters? I feel the multiple high-availability Dagit instances might be causing some race conditions; from your description it seems you're currently running 4 dagit instances
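For a quick test, that's a one-line override in the Helm values. A minimal sketch, assuming the webserver is still configured under the dagit key in your chart version:

```yaml
# Sketch: run a single Dagit replica to rule out multi-instance effects.
dagit:
  replicaCount: 1
```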
s
On 1 instance, the request always failed. I moved to 4 to make it actually succeed.
and @Binoy Shah - thank you for the input. any other ideas?
d
This doesn't sound expected - particularly the part about loading the list of runs taking 20 seconds on a beefy dagit box. Can you share more about how your postgres DB is set up (a cloud-provided DB like RDS? Local in its own pod as part of the Helm chart)? What kind of CPU/memory does it have available?
I'm also curious about what's taking up the space in that 1GB run debug info - are you possibly doing a lot of Python logging within your ops and resources? Or is it just a lot of system events from many ops happening within the job?
s
Should be relatively little logging, mostly lots of ops.
d
Can you quantify the number of ops? Dunno if you'd be able to share that debug file privately, but we could dig in and see if anything jumps out from the run
🙏🏻 1
But from your description the first thing i'd check is the CPU/memory of the DB
s
Using a Cloud SQL instance - 2 vCPU, 4 GB RAM, 10 GB SSD. During peak usage, only 23% CPU utilization and <50% memory usage on the DB.
d
Got it - but loading the runs list consistently takes >20 seconds even with only 5 runs?
s
I monitor slow queries and am happy to share those; only a couple possibly stand out.
d
that would be helpful, yeah
s
On the order of 1-10k ops. I don't know specifically at this time.
d
A really large run loading slowly sounds plausible to me depending on the details - the runs list is a pretty simple SQL query though, so if that's slow it means something's misbehaving quite a bit
s
messaging you details
b
10k ops... are these all active ops?
s
what do you mean by active?
b
i meant concurrent Ops ?
s
no, they are staggered
@Binoy Shah it's about 20-25k ops.
d
A few different performance improvements went out in 1.2.7 to help address some of the perf issues described here (a slow runs page for runs with large job snapshots, and large event log payloads for asset runs that were materializing large numbers of ops)
I'd be curious if things look better in that version
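If you try the newer release, pinning the image tag in the Helm values is usually enough; the default tag typically tracks the chart version, so upgrading the chart itself also works. A sketch assuming the 1.2.x key layout:

```yaml
# Sketch: pin the Dagit image to 1.2.7 (keep whatever repository you already use).
# Key layout is an assumption based on the 1.2.x chart; adjust to your values file.
dagit:
  image:
    tag: "1.2.7"
```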
b
Going to upgrade after checking the change log for breaking changes
s
BIG thank you @daniel