# deployment-kubernetes
j
anyone know how i can debug why my job runs in 30 seconds on my local machine but 5 minutes on the k8s launcher??
d
Hey Jacob - is the time to spin up the job pod the main reason it's slower, or is the overall job runtime slower? If the latter, the two things I'd want to check are whether your job pods are getting enough resources (i.e. your cluster has enough CPU and memory available that your jobs aren't being slowed down by resource limits), and whether latency reading and writing to your postgres DB might be a factor. Combined with your previous post about the daemon liveness checks failing, it does seem like there could be some underlying resource issue causing overall slowdown in your cluster
j
It's actually just overall job runtime i'm seeing as pretty slow. I'm running on GKE Autopilot, so there's some understandable wait times for pods, but once the pod is up and happy (usually within 10 to 15 seconds) it takes an extra 5+ minutes to finish
i was running a sensor and this accidentally DOS'ed my database through a stampeding herd of connections because the first job didn't finish in time
d
Is there a way you can check if database latency might be contributing to the slowdown? If so that could point to wanting to use a larger database instance
We have some features to limit the number of runs that can happen at once, if you want to make sure the database overload issue can't happen due to too many concurrent runs: https://docs.dagster.io/deployment/run-coordinator
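One rough way to check whether database latency is a factor is to time a trivial round trip from inside a job pod and again from the local machine, then compare. The sketch below is a minimal, library-agnostic timer; `measure_latency` and its `query` argument are hypothetical names, not part of Dagster, and you would pass in whatever callable issues a cheap query (for example a `SELECT 1`) through your own Postgres client.

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(query: Callable[[], object], samples: int = 20) -> Dict[str, float]:
    """Call `query` repeatedly and summarize round-trip times in milliseconds.

    `query` is a hypothetical zero-argument callable that performs one cheap
    round trip to the database, e.g. executing `SELECT 1` on an open cursor.
    """
    if samples < 1:
        raise ValueError("samples must be at least 1")
    timings_ms = []
    for _ in range(samples):
        start = time.perf_counter()
        query()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    return {
        "min_ms": min(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }


if __name__ == "__main__":
    # Stand-in for a real database call; replace the lambda with your own query.
    print(measure_latency(lambda: time.sleep(0.002), samples=5))
```

If the median from the pod is much higher than the local one, that would point toward network or database latency rather than CPU.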
j
yeah some amount of run concurrency would be helpful, thank you! It was strange, we ran a local version of the same job while the k8s launcher was running and the local one finished in 1/8th the time
d
I see - let me know if any of the tips above about investigating cluster resources or DB latency help with identifying the problem
j
@daniel have some more info for you: upping the CPU given to each job to 1 cpu each massively improved performance. The greatest time contributor RN though is the multiprocess executor. Between each op in the pipeline is around an extra 10-20 seconds of wait time for the subprocess to begin. Any tips on that?
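For reference, one way to give each run's pod a specific CPU and memory allocation with dagster-k8s is the `dagster-k8s/config` tag on the job. This is a sketch only: the `"1"` CPU value mirrors the message above, while the memory value and the names `K8S_JOB_TAGS` and `my_job` are assumptions made for illustration.

```python
# Sketch of a per-job Kubernetes resource override for dagster-k8s.
# The cpu value follows the thread; the memory value is an assumed placeholder.
K8S_JOB_TAGS = {
    "dagster-k8s/config": {
        "container_config": {
            "resources": {
                "requests": {"cpu": "1", "memory": "1Gi"},
                "limits": {"cpu": "1", "memory": "1Gi"},
            }
        }
    }
}

# Usage sketch (requires dagster to be installed; `my_job` is hypothetical):
#
#     from dagster import job
#
#     @job(tags=K8S_JOB_TAGS)
#     def my_job():
#         ...
```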
d
are you saying that spinning up a subprocess in k8s is taking much longer than spinning up a subprocess when running locally? i.e. both are using the multiprocess executor, but k8s is slower than local?
j
yep, that's exactly right
I think it could be because dagster locally is using more CPUs, which is allowing the GIL to more effectively fork processes
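To compare subprocess startup cost between the pod and a local machine independently of Dagster, a small timing script can help. The sketch below times how long it takes to start a fresh Python interpreter and run one statement. It is only a rough proxy for what the multiprocess executor does per op, and `time_process_start` is a hypothetical helper name.

```python
import subprocess
import sys
import time
from typing import List


def time_process_start(runs: int = 5, statement: str = "pass") -> List[float]:
    """Return wall-clock seconds taken to start a fresh interpreter `runs` times.

    Passing statement="import dagster" (assuming dagster is installed) also
    captures import cost, which is CPU-bound and may be throttled under a
    low CPU limit.
    """
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([sys.executable, "-c", statement], check=True)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    for seconds in time_process_start():
        print(f"{seconds:.3f}s")
```

Running this both locally and inside a job pod should show whether process startup itself is slow in the cluster.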
j
cc @alex
a
ya if you actually want process level parallelism from the multiprocess executor you will want more than 1 vCPU allocated
you will also want to make sure you have enough memory allocated for the multiple processes to exist simultaneously - you could be eating a slowdown due to paging to disk. edit: I'm less sure on this - I typically see processes killed by the oom_reaper if there is memory pressure
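To see how much CPU a job pod is actually allowed to use, it can help to compare `os.cpu_count()` (which typically reports the node's CPUs) with the container's cgroup quota. The sketch below assumes cgroup v2 and the conventional `/sys/fs/cgroup/cpu.max` path; on cgroup v1 nodes the file will not exist and the function returns `None`. The helper names are hypothetical.

```python
import os
from pathlib import Path
from typing import Optional


def parse_cpu_max(text: str) -> Optional[float]:
    """Parse cgroup v2 `cpu.max` content: '<quota> <period>' or 'max <period>'.

    Returns the CPU limit in cores, or None when no quota is set.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def container_cpu_limit(path: str = "/sys/fs/cgroup/cpu.max") -> Optional[float]:
    """Read the CPU limit for the current cgroup (assumes cgroup v2)."""
    cpu_max = Path(path)
    if not cpu_max.exists():
        return None  # not cgroup v2, or not running inside a container
    return parse_cpu_max(cpu_max.read_text())


if __name__ == "__main__":
    print("os.cpu_count():", os.cpu_count())
    print("cgroup v2 CPU limit (cores):", container_cpu_limit())
```

A limit of 1.0 or lower alongside a much larger `os.cpu_count()` would be consistent with the point above about needing more than 1 vCPU for real process-level parallelism.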