# deployment-kubernetes
q
Hello, we are running a hybrid setup in GKE and we tried to point our runs to a new node pool by using this configuration:
dagster-k8s/config: {
    "container_config": {
        "resources": {
            "limits": {"memory": "7Gi"},
            "requests": {"cpu": "1000m", "memory": "5Gi"},
        }
    },
    "pod_spec_config": {"node_selector": {"app": "new_node_pool_label"}},
}
However, after adding this to our jobs, the runs fail to start. In the Dagster logs we see
Run timed out due to taking longer than 1200 seconds to start.
and in the cluster we see the error in the attached image. Is there anything we're missing?
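For reference, this tag normally goes on the job definition itself; a minimal sketch of how it would be wired up (the job and op names here are placeholders):

from dagster import job, op

@op
def my_op():
    ...

@job(
    tags={
        "dagster-k8s/config": {
            "container_config": {
                "resources": {
                    "limits": {"memory": "7Gi"},
                    "requests": {"cpu": "1000m", "memory": "5Gi"},
                }
            },
            "pod_spec_config": {"node_selector": {"app": "new_node_pool_label"}},
        }
    }
)
def my_job():
    my_op()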
o
Debugging scaling issues in GKE can be a bit challenging (at least based on the tooling I could find so far). You can find more information on scaling issues in your cluster info (Console -> Kubernetes Engine -> Clusters, then click on your cluster).
If you click the warning buttons there, you'll get additional information on why pods didn't get scheduled and (if relevant) why your node pool didn't scale up.
What's the size of the machines you use in the pool? Remember that k8s system pods take some CPU and memory. For example, if you use a machine with 8GB of memory and try to run a pod that requests 7Gi, it probably won't get scheduled, because less than 7Gi is actually allocatable on the node.
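One way to check this is to compare each node's allocatable resources (capacity minus system reservations) against your requests; a quick sketch using the official Kubernetes Python client, assuming a local kubeconfig:

from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running inside the cluster
for node in client.CoreV1Api().list_node().items:
    alloc = node.status.allocatable  # what is actually schedulable, after system reservations
    print(node.metadata.name, alloc["cpu"], alloc["memory"])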
q
Thanks @Oren Lederman! Do you think there is something wrong with the way I am targeting the node pool?
o
I can't tell if your node selector labels are correct (it depends on your node pools), but I'd look into:
1. Try giving your pod less memory and CPU (in both requests and limits) and see if it gets assigned a node. If the node pool autoscales, it can take a few minutes (and you'll see the unschedulable warning for a while).
2. Is there anything special about your node pool? Are these spot instances? If so, you might need to add tolerations (see the sketch below).
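For example, if the pool runs GKE Spot VMs, the run pods would need a toleration for GKE's default spot taint (cloud.google.com/gke-spot=true:NoSchedule); a sketch of what that could look like inside the same tag:

"pod_spec_config": {
    "node_selector": {"app": "new_node_pool_label"},
    "tolerations": [
        {
            "key": "cloud.google.com/gke-spot",
            "operator": "Equal",
            "value": "true",
            "effect": "NoSchedule",
        }
    ],
}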
q
@Oren Lederman I don't think there is anything special about the node pool. I managed to get it to work using this config instead:
"pod_spec_config": {
                "affinity": {
                    "node_affinity": {
                        "required_during_scheduling_ignored_during_execution": {
                            "node_selector_terms": [
                                {
                                    "match_expressions": [
                                        {
                                            "key": "app",
                                            "operator": "In",
                                            "values": ["node-pool-label"],
                                        }
                                    ]
                                }
                            ]
                        }
                    }
                },
            },
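To confirm the run pods actually land on the new pool, you can check which node each run pod was scheduled on; a sketch with the Kubernetes Python client (this assumes a "dagster" namespace, and that run pods carry the dagster/run-id label the run launcher applies):

from kubernetes import client, config

config.load_kube_config()
# select only Dagster run pods via their run-id label
pods = client.CoreV1Api().list_namespaced_pod("dagster", label_selector="dagster/run-id")
for pod in pods.items:
    print(pod.metadata.name, "->", pod.spec.node_name)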