anyone know how i can debug why my job runs in 30 seconds on dagster #deployment-kubernetes


# deployment-kubernetes

Jacob Aronoff

01/12/2022, 3:38 AM

anyone know how i can debug why my job runs in 30 seconds on my local machine but 5 minutes on the k8s launcher??

daniel

01/12/2022, 3:50 AM

Hey Jacob - is the time to spin up the job pod the main reason it's slower, or is the overall job runtime slower? If the latter, the two things I'd want to check are whether your job pods are getting enough resources (i.e. your cluster has enough CPU and memory available that your jobs aren't being slowed down by resource limits), and whether latency reading from and writing to your postgres DB could be a factor. Combined with your previous post about the daemon liveness checks failing, it does seem like there could be some underlying resource issue causing overall slowdown in your cluster

Jacob Aronoff

01/12/2022, 3:58 AM

It's actually just overall job runtime i'm seeing as pretty slow. I'm running on GKE Autopilot, so there's some understandable wait times for pods, but once the pod is up and happy (usually within 10 to 15 seconds) it takes an extra 5+ minutes to finish

Jacob Aronoff

01/12/2022, 3:58 AM

i was running a sensor and this accidentally DOS'ed my database through a stampeding herd of connections because the first job didn't finish in time

daniel

01/12/2022, 4:02 AM

Is there a way you can check if database latency might be contributing to the slowdown? If so that could point to wanting to use a larger database instance
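One rough way to check the database latency mentioned above is to time a trivial query from inside the job pod and compare it with the same measurement taken locally. The sketch below is only an illustration: `measure_latency` is a hypothetical helper, not part of Dagster, and the `query` callable is whatever cheap round trip you can issue (for example a `SELECT 1` through your own Postgres client).

```python
import statistics
import time
from typing import Callable, Dict


def measure_latency(query: Callable[[], object], samples: int = 20) -> Dict[str, float]:
    """Time repeated calls to `query` and summarize the results in milliseconds.

    `query` is a hypothetical stand-in for one cheap database round trip,
    e.g. a function that runs `SELECT 1` with your own Postgres client.
    """
    if samples < 1:
        raise ValueError("samples must be at least 1")

    timings_ms = []
    for _ in range(samples):
        start = time.perf_counter()
        query()
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "min_ms": min(timings_ms),
        "median_ms": statistics.median(timings_ms),
        "max_ms": max(timings_ms),
    }


if __name__ == "__main__":
    # Placeholder workload; swap in a real database round trip.
    print(measure_latency(lambda: None, samples=5))
```

If the median from the pod is much higher than the local one, that would point toward the database or the network path to it rather than the job code.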

daniel

01/12/2022, 4:03 AM

We have some features to limit the number of runs that can happen at once if you want to make sure that the database overload issue can't happen due to too many concurrent runs https://docs.dagster.io/deployment/run-coordinator
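For reference, the run limit described on that docs page is configured in the instance's `dagster.yaml`. The sketch below writes the same structure as a Python dict so the nesting is easy to see. The limit of 5 is an arbitrary example value, and the module path is the one the docs used around the time of this thread, so check it against your Dagster version.

```python
import json

# Mirrors the `run_coordinator` section of dagster.yaml.
# Assumption: 5 is an arbitrary example limit, not a recommendation.
run_coordinator_config = {
    "run_coordinator": {
        "module": "dagster.core.run_coordinator",
        "class": "QueuedRunCoordinator",
        "config": {
            # Runs beyond this number wait in the queue instead of launching.
            "max_concurrent_runs": 5,
        },
    }
}

if __name__ == "__main__":
    print(json.dumps(run_coordinator_config, indent=2))
```

The queued run coordinator relies on the Dagster daemon to dequeue runs, so the daemon needs to be healthy for the limit to take effect.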

Jacob Aronoff

01/12/2022, 2:48 PM

yeah some amount of run concurrency would be helpful, thank you! It was strange, we ran a local version of the same job while the k8s launcher was running and the local one finished in 1/8th the time

daniel

01/12/2022, 3:52 PM

I see - let me know if any of the tips above about investigating cluster resources or DB latency help with identifying the problem

👍 1

Jacob Aronoff

01/13/2022, 4:22 AM

@daniel have some more info for you: upping the CPU given to each job to 1 cpu each massively improved performance. The greatest time contributor RN though is the multiprocess executor. Between each op in the pipeline is around an extra 10-20 seconds of wait time for the subprocess to begin. Any tips on that?

daniel

01/13/2022, 4:42 AM

are you saying that spinning up a subprocess in k8s is taking much longer than spinning up a subprocess when running locally? i.e. both are using the multiprocess executor, but k8s is slower than local?

Jacob Aronoff

01/13/2022, 3:26 PM

yep, that's exactly right
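A quick way to compare the two environments is to time how long a fresh Python process takes to start and import a module, once locally and once inside the job pod. This sketch assumes that a good part of the per-op delay is interpreter startup and import time; it does not reproduce how the multiprocess executor launches its subprocesses. `time_interpreter_startup` is a hypothetical helper, and `json` is a placeholder for the heavier modules your ops actually import.

```python
import subprocess
import sys
import time
from typing import List


def time_interpreter_startup(module: str = "json", runs: int = 3) -> List[float]:
    """Return wall-clock seconds to start a fresh Python process that imports `module`."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises if the child process fails, e.g. on a missing module.
        subprocess.run([sys.executable, "-c", f"import {module}"], check=True)
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    for seconds in time_interpreter_startup():
        print(f"{seconds:.2f}s")
```

If the numbers inside the pod are several times the local ones, CPU throttling on the container is a likely suspect.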

Jacob Aronoff

01/13/2022, 3:27 PM

I think it could be because dagster locally is using more CPUs, which is allowing the GIL to more effectively fork processes

johann

01/13/2022, 4:11 PM

cc @alex

alex

01/13/2022, 4:28 PM

ya if you actually want process level parallelism from the multiprocess executor you will want more than 1 vCPU allocated

alex

01/13/2022, 4:31 PM

you will also want to make sure you have enough memory allocated for the multiple processes to exist simultaneously - you could be eating a slowdown due to paging to disk. edit: im less sure on this - i typically see processes killed by the oom_reaper if there is memory pressure
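One way to request more CPU and memory for a job's pod is the `dagster-k8s/config` tag, which passes a `container_config` through to the Kubernetes container spec. The values below (2 CPUs, 2Gi) are illustrative assumptions, not numbers from this thread, and `my_job` in the usage comment is a hypothetical name.

```python
# Assumption: example resource values only; size them for your own workload.
K8S_RESOURCE_TAGS = {
    "dagster-k8s/config": {
        "container_config": {
            "resources": {
                # More than 1 CPU so the multiprocess executor can run
                # op subprocesses in parallel.
                "requests": {"cpu": "2", "memory": "2Gi"},
                "limits": {"cpu": "2", "memory": "2Gi"},
            }
        }
    }
}

# Usage (requires dagster to be installed):
#
#   from dagster import job
#
#   @job(tags=K8S_RESOURCE_TAGS)
#   def my_job():
#       ...
```

On GKE Autopilot the requested resources also determine how the pod is scheduled and billed, so it is worth raising them gradually while watching run times.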
