I have a kubernetes deployment with resource request for all dagster #ask-ai

Marie Hénaff

02/28/2024, 10:28 PM

I have a Kubernetes deployment with resource requests set for all the pods. When I run a backfill, only 2 jobs run in parallel, instead of more jobs starting and triggering a node group scale-up.

Scout

02/28/2024, 10:28 PM

Based on the documentation, the behavior you're observing with only 2 jobs running in parallel during a backfill could be influenced by several factors related to your Kubernetes deployment and Dagster's configuration.

1. Kubernetes Resource Requests: If you have set resource requests for your pods, Kubernetes will only schedule as many pods as can be accommodated by the available resources in your cluster. If the cluster does not have enough resources to schedule additional pods, it will not schedule more until resources become available, either through the completion of running jobs or through scaling up the node group.
2. Dagster's `max_concurrent` Configuration: In the Kubernetes executor configuration for Dagster, there is a `max_concurrent` setting that limits the number of pods that will execute concurrently for one run. If this is set, it could limit the parallelism of your jobs. However, the documentation states that by default, there is no limit, and it will parallelize as allowed by the DAG.
3. Kubernetes Cluster Auto-Scaling: If your Kubernetes cluster is configured to auto-scale, it should automatically provision new nodes when existing ones do not have enough resources to schedule new pods. However, auto-scaling is not instantaneous and can take some time to detect the need for more resources and to provision new nodes.
4. Dagster's Queued Run Coordinator: If you are using the `QueuedRunCoordinator`, it manages the queue of runs to be executed. The configuration for this component, such as `max_concurrent_runs`, could also affect how many jobs are run in parallel.
5. Backfill Submission: There is a known issue where submitting large backfills to the `QueuedRunCoordinator` can be slow, and jobs from earlier backfills must be enqueued before later backfills can be enqueued. This could potentially impact the parallelism of your jobs if multiple backfills are being submitted.

To troubleshoot this issue, you may want to:

- Check the `max_concurrent` setting in your Dagster Kubernetes executor configuration to ensure it's not limiting your parallelism.
- Review the configuration of your `QueuedRunCoordinator` if you're using one, particularly the `max_concurrent_runs` setting.
- Monitor the auto-scaling behavior of your Kubernetes cluster to see if it's scaling up as expected when resource requests exceed the current capacity.
- Look for any resource quotas or limits set on the namespace where your jobs are running, as these could also restrict the number of pods that can be scheduled.

If you continue to experience issues, you might want to consider consulting the Kubernetes cluster logs and Dagster's logs to gain more insight into why additional jobs are not being scheduled and to verify if the node group scaling

Scout

02/28/2024, 10:29 PM

is being triggered as expected.
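
How a run-level cap interacts with pod scheduling and autoscaling can be sketched as a toy model. This is only an illustration, not Dagster or Kubernetes code: the names `BackfillState` and `simulate_backfill` and all parameters are hypothetical, and the model assumes each launched run needs exactly one pod-sized slot on a node. It shows that a cap such as `max_concurrent_runs` is applied before any pod exists, so a low cap leaves no Pending pods for the cluster autoscaler to react to.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BackfillState:
    launched_runs: int  # runs dequeued and launched (a pod is created for each)
    running_pods: int   # pods that fit on the existing nodes
    pending_pods: int   # pods created but unschedulable; these prompt a scale-up
    queued_runs: int    # runs still held in the queue (no pod exists yet)


def simulate_backfill(
    requested_runs: int,
    max_concurrent_runs: Optional[int],
    node_capacity: int,
) -> BackfillState:
    """Toy model: apply the run-level cap first, then cluster capacity."""
    if max_concurrent_runs is None:
        launched = requested_runs  # no run-level cap in this model
    else:
        launched = min(requested_runs, max_concurrent_runs)
    running = min(launched, node_capacity)
    return BackfillState(
        launched_runs=launched,
        running_pods=running,
        pending_pods=launched - running,
        queued_runs=requested_runs - launched,
    )


# With a cap of 2, nothing is ever Pending, so no scale-up is requested.
capped = simulate_backfill(requested_runs=20, max_concurrent_runs=2, node_capacity=4)
# Without a cap, 16 pods are Pending and the autoscaler has something to act on.
uncapped = simulate_backfill(requested_runs=20, max_concurrent_runs=None, node_capacity=4)
print(capped)
print(uncapped)
```

In this model, the symptom "only 2 jobs run and nothing scales up" corresponds to the capped case: the remaining 18 runs stay queued on the Dagster side rather than becoming unschedulable pods.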

Marie Hénaff

02/28/2024, 10:34 PM

My node group has autoscaling enabled, but the pods don't get scheduled, so they never show up as Pending and don't trigger the node group scale-up.
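
One way to confirm this observation is to count pods by phase in the namespace where runs are launched. The sketch below assumes the output of `kubectl get pods -o json` has been saved as a string and parses it with the standard library only; the helper name `count_pod_phases` and the sample pod names are made up for illustration. If no pods are in the `Pending` phase while runs are still waiting in Dagster, the limit is likely being applied before Kubernetes is involved.

```python
import json
from collections import Counter


def count_pod_phases(kubectl_json: str) -> Counter:
    """Count pods by status.phase in `kubectl get pods -o json` output."""
    pod_list = json.loads(kubectl_json)
    return Counter(
        item.get("status", {}).get("phase", "Unknown")
        for item in pod_list.get("items", [])
    )


# Illustrative sample only, shaped like `kubectl get pods -o json` output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "example-run-a"}, "status": {"phase": "Running"}},
        {"metadata": {"name": "example-run-b"}, "status": {"phase": "Running"}},
    ]
})

phases = count_pod_phases(sample)
# A Counter returns 0 for phases that never appear, such as Pending here.
print(phases["Running"], phases["Pending"])
```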
