# deployment-kubernetes
I've deployed to Google Kubernetes Engine and want to allow my runs to use spot pods. According to Google's docs, I need to add a nodeSelector to the pod spec. However, I'm unclear on where to do so in the values.yaml, as there doesn't appear to be an available yaml config for the pipeline pod. Does anyone have a working example or docs? Thanks in advance.
https://docs.dagster.io/deployment/guides/kubernetes/customizing-your-deployment There is a special tag you can set on your jobs, or a config section for the run launcher (I haven't used that one myself yet), to add a pod spec with a node selector. The docs have an example.
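For the per-job route, the `dagster-k8s/config` tag takes Kubernetes overrides; the same structure is passed as a Python dict via `@job(tags={"dagster-k8s/config": {...}})`. A sketch of the tag value for spot pods, shown as YAML (the exact snippet is an illustration, not taken from this thread):

```yaml
# Sketch of a dagster-k8s/config tag value. pod_spec_config keys are
# snake_case versions of the Kubernetes pod spec fields.
pod_spec_config:
  # Pin the run pod to GKE spot nodes; on Autopilot, GKE adds the
  # matching toleration for the spot taint automatically.
  node_selector:
    cloud.google.com/gke-spot: "true"
```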
Thanks, @Adam Bloom. For others, here is what I ended up using on GKE (an Autopilot cluster with autoscaling enabled):
```yaml
runLauncher:
  # Type can be one of [K8sRunLauncher, CeleryK8sRunLauncher, CustomRunLauncher]
  type: K8sRunLauncher

  config:
    # This configuration will only be used if the K8sRunLauncher is selected
    k8sRunLauncher:
      runK8sConfig:
        podSpecConfig:
          terminationGracePeriodSeconds: 25
          affinity:
            nodeAffinity:
              preferredDuringSchedulingIgnoredDuringExecution:
                - weight: 1
                  preference:
                    matchExpressions:
                      - key: cloud.google.com/gke-spot
                        operator: In
                        values:
                          - "true"

      # Change with caution! If you're using a fixed tag for pipeline run images, changing the
      # image pull policy to anything other than "Always" will use a cached/stale image, which is
      # almost certainly not what you want.
      imagePullPolicy: "Always"
```
I can’t remember at the moment how this works exactly, but do you also need to set taints for these nodes to prevent other pods (and system pods) from running on these spot nodes?
I'm running on a GKE Autopilot cluster with autoscaling. The node taints get set automatically, but in general I think you'd need to set node taints and the appropriate labels to select spot pods.
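For reference, GKE taints spot nodes with `cloud.google.com/gke-spot=true:NoSchedule`, so on a Standard (non-Autopilot) cluster the run pods would also need a matching toleration alongside the node selector or affinity. A sketch of what that would look like under `podSpecConfig`:

```yaml
# Toleration matching the GKE spot-node taint; only needed where GKE
# does not inject it automatically (i.e. outside Autopilot).
tolerations:
  - key: cloud.google.com/gke-spot
    operator: Equal
    value: "true"
    effect: NoSchedule
```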
We are using GKE with node auto-provisioning (Terraform had issues setting the SA for the Autopilot nodes back when we tested it a while back, and I think that was fixed a few weeks ago), so it's probably doing the same, but I didn't notice 🙂 I'll take a look. If I can use affinity instead of taints + tolerations, this should simplify the config a bit. How's your experience with Autopilot, by the way? We are planning some big changes to our K8s cluster soon, and I'm thinking of giving it another try.
In my setup, I'm using a terraformed SA with workload identity (on the k8s default service account) on this Autopilot cluster. I haven't found it to be a huge win in terms of setup, etc., but I'm not running a lot of complicated workloads and don't have any multitenancy or high-security needs. One of the main reasons I wanted to give it a try is that the cost of running the control plane is covered by the free tier.
Good to know 🙂 We couldn't use the default SA for security reasons (the bug is described here: https://github.com/GoogleCloudPlatform/magic-modules/pull/6733). What's interesting to me about Autopilot is that it's supposed to charge you only for the resources you requested, as opposed to the node resources that got provisioned. For example, if you ask for 4 CPU and 4 GB of memory, GKE will spin up a node with 8 CPU and 8 GB of memory (or something like that), so unless other pods get scheduled onto that node, you're paying 2x. Sure, you can lower the CPU and memory requests so the pod fits on a smaller machine, but that requires all devs to be more aware of this behavior.
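To make those requests explicit for Dagster run pods, the Helm chart's run launcher section accepts a `resources` block; a sketch (the values here are placeholders, not from the thread, and Autopilot enforces its own minimums):

```yaml
# Explicit requests/limits for run pods so they fit a smaller node class.
runLauncher:
  config:
    k8sRunLauncher:
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: 500m
          memory: 1Gi
```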