# dagster-plus
Abhishek
Hello, in Dagster Cloud Hybrid, the first time a worker job is started, it takes almost 3 minutes. Is there any way to cut it short?
This is what I have in my deployment config - Can I edit the limits here to cut short the above time?
daniel
Hi Abhishek - the main thing happening during this period is Dagster importing your code. Does it take a similarly long time when you load your code locally, or are there any side effects that might be slow during the import? If not, increasing the resource limits is what I would recommend.
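For example, the relevant piece is usually a `resources` block on the run worker container, something along these lines (the values below are placeholders, and the exact place this lives in your deployment config may differ):

```yaml
# Hedged sketch of a Kubernetes container resources block for the run worker.
# The CPU/memory values are placeholders; tune them to your workload.
resources:
  requests:
    cpu: "500m"    # what the scheduler reserves for the pod
    memory: "1Gi"
  limits:
    cpu: "1"       # hard ceiling the container cannot exceed
    memory: "2Gi"
```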
Abhishek
It's just plain Python code and GCP libraries, that's all. I will see if I can get the Docker image to be lighter, but there isn't much in it.
Do let me know if you have any other ideas in mind. Can I play around with `image_pull_policy`?
daniel
The size of the image could matter, yeah - I just realized I read this wrong earlier: the 3 minutes is before the process starts, so the size of your imports wouldn't matter. You could check with `kubectl logs` on the pod to see what it is doing during those 3 minutes.
Abhishek
Ah okay. So I should start the pipeline and then scoot over to logs? I will give this a go and report back here.
Hey Daniel, got this from the logs. When we see "[CloudK8sRunLauncher] Kubernetes run worker job created" in the UI, in k8s I see this: "0/4 nodes are available: 4 Insufficient cpu, 4 Insufficient memory. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod." See the top one in the screenshot. Also, there are a bunch of errors, but they don't really look like errors. What are those about?
I am using GKE Autopilot for this, and 3 minutes of waiting is just too much to accept for our scenario.
daniel
What you're seeing there might be specific to GKE Autopilot - it sounds like it's taking a while to provision pods for you. I don't think we have the expertise to give specific advice for GKE Autopilot, but I believe we have other users using it in the community, and the community members in #dagster-kubernetes are generally pretty helpful - you could try making a post there and see if others have seen this before and whether they were able to do anything to improve the performance. There may be other GKE-specific support communities that can help as well.
Abhishek
Thanks @daniel! I hope someone helps me.. if I were to have a standard GKE cluster, would you know what to do? I am a total noob when it comes to k8s..
daniel
In a standard GKE cluster, that might indicate that you should allocate more nodes to the cluster so that there are enough resources to launch your pods.
Abhishek
Hmm okay, I will give it a go..
Andrea Giardini
Hi Abhishek, from the logs you posted it looks like your cluster does not have enough resources to spin up the Dagster job. For this reason it needs to provision a new node before starting the run (you can see it in the `pod triggered scale-up` event). Pulling the image does not take too long, around 40s. You can consider overprovisioning your cluster to avoid this problem, but you will incur some extra cost.
Abhishek
Thanks @Andrea Giardini. I think extra cost is fine if I'm able to successfully bring down the wait time. I'm using GKE Autopilot. Do you know how to provision extra resources? I couldn't find it in the config settings..
Andrea Giardini
https://medium.com/scout24-engineering/cluster-overprovisiong-in-kubernetes-79433cb3ed0e this article shows pretty well how to do it. The gist of it is that you run an extra pod with the lowest priority that keeps a spare node occupied. As soon as a new pod with a higher priority comes up (e.g. your Dagster pod), the extra pod is evicted and rescheduled on a new node, and the Dagster pod takes its place.
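In rough strokes, the two pieces look like this (the names, replica count, and resource sizes are placeholders you would tune to roughly match your Dagster run worker pods):

```yaml
# Hedged sketch of cluster overprovisioning with low-priority "balloon" pods.
# All names and numbers below are placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: balloon-priority
value: -10                # lower than the default (0), so these pods are evicted first
preemptionPolicy: Never   # balloon pods never preempt other workloads themselves
globalDefault: false
description: "Placeholder pods that keep spare capacity warm."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: balloon
spec:
  replicas: 2             # how much spare capacity to keep around
  selector:
    matchLabels:
      app: balloon
  template:
    metadata:
      labels:
        app: balloon
    spec:
      priorityClassName: balloon-priority
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # does nothing; only reserves resources
          resources:
            requests:
              cpu: "500m"     # size each balloon roughly like a run worker pod
              memory: "1Gi"
```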
Abhishek
Hmm that's smart. I will give it a go and see if it helps.
Hey @Andrea Giardini, thanks so much! I followed your advice, found this blog for GKE Autopilot and implemented it. It doesn't take too long to get the process running. 🙂 1. What advice would you give to reduce the image pull time? 2. What would be the cost implication of these balloon pods? I imagine it won't be too high..?
Andrea Giardini
Great to hear it worked. Regarding the image pull time, you can either make the image smaller or pre-pull the image onto the nodes using something like a DaemonSet. From the cost point of view, the only cost you will incur is the extra cluster size (since you now have a couple of extra nodes sitting as spares).
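A pre-pull DaemonSet is roughly this shape - one pod per node pulls the image so the run worker doesn't pay the pull cost at launch (the image name below is a placeholder for your Dagster code image):

```yaml
# Hedged sketch of pre-pulling an image with a DaemonSet; one pod runs on every node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-dagster-image
spec:
  selector:
    matchLabels:
      app: prepull-dagster-image
  template:
    metadata:
      labels:
        app: prepull-dagster-image
    spec:
      containers:
        - name: prepull
          image: "<your-registry>/your-dagster-code-image:latest"  # placeholder image
          command: ["sleep", "infinity"]   # keep the pod alive so the image stays cached on the node
          resources:
            requests:
              cpu: "10m"       # minimal footprint; the pod exists only to hold the image
              memory: "16Mi"
```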
Abhishek
Thanks for the DaemonSet suggestion. I was able to create it. Had a follow-up question - this DaemonSet is creating 7 more pods, and I already have those 10 balloon pods.. I think I missed something.. do I really need balloon pods if I have the DaemonSet? Also, I didn't configure the number 7 anywhere for the DaemonSet deployment. Am I doing something wrong?
Also, would you know how to mark these as info logs rather than errors? They're coming in as errors by default and I don't know why.