# deployment-kubernetes
q
Hello, we are running a hybrid setup in GKE and we tried to point our runs to a new node pool by using this configuration:
dagster-k8s/config: {
    "container_config": {
        "resources": {
            "limits": {"memory": "7Gi"},
            "requests": {"cpu": "1000m", "memory": "5Gi"},
        }
    },
    "pod_spec_config": {"node_selector": {"app": "new_node_pool_label"}},
}
However, after adding this to our jobs, the runs fail to start. In the Dagster logs we see
Run timed out due to taking longer than 1200 seconds to start.
and in the cluster we see the error in the attached image. Is there anything we're missing?
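For reference, this tag normally goes on the job definition itself; a minimal sketch of how it would be wired up (the job and op names here are placeholders):

from dagster import job, op

@op
def my_op():
    ...

@job(
    tags={
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    "limits": {"memory": "7Gi"},
                    "requests": {"cpu": "1000m", "memory": "5Gi"},
                }
            },
            "pod_spec_config": {"node_selector": {"app": "new_node_pool_label"}},
        }
    }
)
def my_job():
    my_op()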
o
Debugging scaling issues in GKE can be a bit challenging (at least based on the tooling I could find so far). You can find more information on scaling issues in your cluster info (Console -> Kubernetes Engine -> Clusters, then click on your cluster).
If you click the warning buttons there, you'll get additional information on why pods didn't get scheduled and (if relevant) why your node pool didn't scale up.
What's the size of the machines you use in the pool? Remember that k8s system pods take some CPU and memory. For example, if you use a machine with 8GB of memory and try to run a pod that requests 7Gi, it probably won't get scheduled, because less than 7Gi is actually allocatable on the node.
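One way to check this is to compare each node's allocatable resources (capacity minus system reservations) against your requests; a quick sketch using the official Kubernetes Python client, assuming a local kubeconfig:

from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
for node in client.CoreV1Api().list_node().items:
    alloc = node.status.allocatable  # what is actually schedulable, after system reservations
    print(node.metadata.name, alloc["cpu"], alloc["memory"])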
q
Thanks @Oren Lederman! Do you think there is something wrong with the way I am targeting the node pool?
o
I can't tell if your node selector labels are correct (it depends on your node pools), but I'd look into:
1. Try giving your pod less memory and CPU (in both requests and limits) and see if it gets assigned a node. If the node pool autoscales, it can take a few minutes (and you'll see the unschedulable warning for a while).
2. Is there anything special about your node pool? Are these spot instances? If so, you might need to add tolerations (see the sketch below).
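For example, if the pool runs GKE Spot VMs, the run pods would need a toleration for GKE's default spot taint (cloud.google.com/gke-spot=true:NoSchedule); a sketch of what that could look like inside the same tag:

"pod_spec_config": {
    "node_selector": {"app": "new_node_pool_label"},
    "tolerations": [
        {
            "key": "cloud.google.com/gke-spot",
            "operator": "Equal",
            "value": "true",
            "effect": "NoSchedule",
        }
    ],
}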
q
@Oren Lederman I don't think there is anything special about the node pool. I managed to get it to work using this config instead:
"pod_spec_config": {
                "affinity": {
                    "node_affinity": {
                        "required_during_scheduling_ignored_during_execution": {
                            "node_selector_terms": [
                                {
                                    "match_expressions": [
                                        {
                                            "key": "app",
                                            "operator": "In",
                                            "values": ["node-pool-label"],
                                        }
                                    ]
                                }
                            ]
                        }
                    }
                },
            },
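To confirm the run pods actually land on the new pool, you can check which node each run pod was scheduled on; a sketch with the Kubernetes Python client (this assumes a "dagster" namespace, and that run pods carry the dagster/run-id label the run launcher applies):

from kubernetes import client, config

config.load_kube_config()
# select only Dagster run pods via their run-id label
pods = client.CoreV1Api().list_namespaced_pod("dagster", label_selector="dagster/run-id")
for pod in pods.items:
    print(pod.metadata.name, "->", pod.spec.node_name)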