# dagster-plus
Abhishek
Hello, in Dagster Cloud Hybrid, the first time a worker job is started, it takes almost 3 minutes. Is there any way to cut it short?
This is what I have in my deployment config - Can I edit the limits here to cut short the above time?
daniel
Hi Abhishek - the main thing happening during this period is Dagster importing your code. Does it take a similarly long time when you load your code locally, or are there any side effects that might be slow during the import? If not, increasing the resource limits is what I would recommend.
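For example, the relevant piece is usually a `resources` block on the run worker container, something along these lines (the values below are placeholders, and the exact place this lives in your deployment config may differ):

```yaml
# Hedged sketch of a Kubernetes container resources block for the run worker.
# The CPU/memory values are placeholders; tune them to your workload.
resources:
  requests:
    cpu: "500m"    # what the scheduler reserves for the pod
    memory: "1Gi"
  limits:
    cpu: "1"       # hard ceiling the container cannot exceed
    memory: "2Gi"
```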
Abhishek
It's just plain Python code and GCP libraries, that's all. I will see if I can get the Docker image to be lighter, but there isn't much in it.
Do let me know if you have any other ideas in mind. Can I play around with `image_pull_policy`?
daniel
The size of the image could matter, yeah - I just realized I read this wrong earlier: the 3 minutes is before the process starts, so the size of your imports wouldn't matter. You could check with `kubectl logs` on the pod to see what it is doing during those 3 minutes.
Abhishek
Ah okay. So I should start the pipeline and then scoot over to logs? I will give this a go and report back here.
Hey Daniel, got this from the logs. When we see "[CloudK8sRunLauncher] Kubernetes run worker job created" in the UI, in k8s I see this: "0/4 nodes are available: 4 Insufficient cpu, 4 Insufficient memory. preemption: 0/4 nodes are available: 4 No preemption victims found for incoming pod." See the top one in the screenshot. Also, there are a bunch of errors, but they don't really look like errors. What are those about?
I am using GKE Autopilot for this, and 3 minutes of waiting is just too much to accept for our scenario.
daniel
What you're seeing there might be specific to GKE Autopilot - it sounds like it's taking a while to provision pods for you. I don't think we have the expertise to give specific advice for GKE Autopilot, but I believe we have other users using it in the community, and the community members in #dagster-kubernetes are generally pretty helpful - you could try making a post there and see if others have seen this before and whether they were able to do anything to improve the performance. There may be other GKE-specific support communities that can help as well.
Abhishek
Thanks @daniel! I hope someone helps me.. if I were to have a standard GKE cluster, would you know what to do? I am a total noob when it comes to k8s..
daniel
In a standard GKE cluster, that might indicate that you should allocate more nodes to the cluster so that there are enough resources to launch your pods.
Abhishek
Hmm okay, I will give it a go..
Andrea Giardini
Hi Abhishek, from the logs you posted it looks like your cluster does not have enough resources to spin up the Dagster job. For this reason it needs to provision a new node before starting the run (you can see it in the `pod triggered scale-up` event). Pulling the image does not take too long, around 40s. You can consider overprovisioning your cluster to avoid this problem, but you will incur some extra cost.
Abhishek
Thanks @Andrea Giardini. I think extra cost is fine if I'm able to successfully bring down the wait time. I'm using GKE Autopilot. Do you know how to provision extra resources? I couldn't find it in the config settings..
Andrea Giardini
https://medium.com/scout24-engineering/cluster-overprovisiong-in-kubernetes-79433cb3ed0e this article shows pretty well how to do it. The gist of it is that you run an extra pod with the lowest priority that keeps a spare node occupied. As soon as a new pod with a higher priority comes up (e.g. your Dagster pod), the extra pod is evicted and rescheduled on a new node, and the Dagster pod takes its place.
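In rough strokes, the two pieces look like this (the names, replica count, and resource sizes are placeholders you would tune to roughly match your Dagster run worker pods):

```yaml
# Hedged sketch of cluster overprovisioning with low-priority "balloon" pods.
# All names and numbers below are placeholders.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: balloon-priority
value: -10                # lower than the default (0), so these pods are evicted first
preemptionPolicy: Never   # balloon pods never preempt other workloads themselves
globalDefault: false
description: "Placeholder pods that keep spare capacity warm."
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: balloon
spec:
  replicas: 2             # how much spare capacity to keep around
  selector:
    matchLabels:
      app: balloon
  template:
    metadata:
      labels:
        app: balloon
    spec:
      priorityClassName: balloon-priority
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9   # does nothing; only reserves resources
          resources:
            requests:
              cpu: "500m"     # size each balloon roughly like a run worker pod
              memory: "1Gi"
```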
Abhishek
Hmm that's smart. I will give it a go and see if it helps.
Hey @Andrea Giardini, thanks so much! I followed your advice, found this blog for GKE Autopilot and implemented it. It doesn't take too long to get the process running. 🙂 1. What advice would you give to reduce the image pull time? 2. What would be the cost implication of these balloon pods? I imagine it won't be too high..?
Andrea Giardini
Great to hear it worked. Regarding the image pull time, you can either make the image smaller or pre-pull the image onto the nodes using something like a DaemonSet. From the cost point of view, the only cost you will incur is the extra cluster size (since you now have a couple of extra nodes sitting as spares).
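A pre-pull DaemonSet is roughly this shape - one pod per node pulls the image so the run worker doesn't pay the pull cost at launch (the image name below is a placeholder for your Dagster code image):

```yaml
# Hedged sketch of pre-pulling an image with a DaemonSet; one pod runs on every node.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: prepull-dagster-image
spec:
  selector:
    matchLabels:
      app: prepull-dagster-image
  template:
    metadata:
      labels:
        app: prepull-dagster-image
    spec:
      containers:
        - name: prepull
          image: "<your-registry>/your-dagster-code-image:latest"  # placeholder image
          command: ["sleep", "infinity"]   # keep the pod alive so the image stays cached on the node
          resources:
            requests:
              cpu: "10m"       # minimal footprint; the pod exists only to hold the image
              memory: "16Mi"
```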
Abhishek
Thanks for the DaemonSet suggestion. I was able to create it. Had a follow-up question - this DaemonSet is creating 7 more pods, and I already have those 10 balloon pods.. I think I missed something.. do I really need balloon pods if I have the DaemonSet? Also, I didn't configure the number 7 anywhere for the DaemonSet deployment. Am I doing something wrong?
Also, would you know how to mark these as info logs rather than errors? They're coming in as errors by default and I don't know why.