# ask-ai
m
I have a Kubernetes deployment with resource requests for all the pods. When I run a backfill, only 2 jobs run in parallel instead of running more and triggering a node group scale-up.
s
Based on the documentation, the behavior you're observing, with only 2 jobs running in parallel during a backfill, could be influenced by several factors related to your Kubernetes deployment and Dagster's configuration.

1. Kubernetes Resource Requests: If you have set resource requests for your pods, Kubernetes will only schedule as many pods as can be accommodated by the available resources in your cluster. If the cluster does not have enough resources to schedule additional pods, it will not schedule more until resources become available, either through the completion of running jobs or through a node group scale-up.
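For reference, resource requests on a run pod's container look like this (a minimal sketch; the names, image, and values are placeholders, and for Dagster run pods these are typically set through Helm values or the `dagster-k8s/config` tag):

```yaml
# Illustrative pod spec with resource requests (placeholder values).
# The scheduler reserves the requested amounts per pod, so large
# requests reduce how many run pods fit on the existing nodes.
apiVersion: v1
kind: Pod
metadata:
  name: dagster-run-example      # hypothetical name
spec:
  containers:
    - name: dagster-run          # hypothetical container name
      image: my-user-code:latest # placeholder image
      resources:
        requests:
          cpu: "500m"
          memory: 1Gi
```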
2. Dagster's `max_concurrent` Configuration: In the Kubernetes executor configuration for Dagster, there is a `max_concurrent` setting that limits the number of pods that will execute concurrently for one run. If this is set, it could limit the parallelism of your jobs. However, the documentation states that by default there is no limit, and execution will parallelize as allowed by the DAG.
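If you are using `k8s_job_executor`, this limit is set through the run config (a minimal sketch, assuming the 1.x job API; the value 4 is only an example):

```yaml
# Run config for a job that uses dagster-k8s's k8s_job_executor.
# max_concurrent caps the number of step pods launched at once
# within a single run; omitting it leaves per-run parallelism
# unbounded.
execution:
  config:
    max_concurrent: 4
```

Note that this governs step-level parallelism within one run; how many backfill runs are in flight at once is controlled by the run coordinator (see point 4).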
3. Kubernetes Cluster Auto-Scaling: If your Kubernetes cluster is configured to auto-scale, it should automatically provision new nodes when existing ones do not have enough resources to schedule new pods. However, auto-scaling is not instantaneous and can take some time to detect the need for more resources and to provision new nodes.
4. Dagster's Queued Run Coordinator: If you are using the `QueuedRunCoordinator`, it manages the queue of runs to be executed. Its configuration, such as `max_concurrent_runs`, could also affect how many jobs are run in parallel.
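The run coordinator is configured in your instance's `dagster.yaml` (or through the equivalent values if you deploy with the Dagster Helm chart). A minimal sketch, with an example limit:

```yaml
# dagster.yaml -- instance-level run queue settings.
# max_concurrent_runs caps how many runs (including backfill runs)
# are dequeued and launched at once; queued runs never create pods,
# so they cannot cause the cluster autoscaler to scale up.
run_coordinator:
  module: dagster.core.run_coordinator
  class: QueuedRunCoordinator
  config:
    max_concurrent_runs: 10
```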
5. Backfill Submission: There is a known issue where submitting large backfills to the `QueuedRunCoordinator` can be slow, and jobs from earlier backfills must be enqueued before later backfills can be enqueued. This could impact the parallelism of your jobs if multiple backfills are being submitted.

To troubleshoot this issue, you may want to:
- Check the `max_concurrent` setting in your Dagster Kubernetes executor configuration to ensure it's not limiting your parallelism.
- Review the configuration of your `QueuedRunCoordinator`, if you're using one, particularly the `max_concurrent_runs` setting.
- Monitor the auto-scaling behavior of your Kubernetes cluster to see whether it scales up as expected when resource requests exceed the current capacity.
- Look for any resource quotas or limits set on the namespace where your jobs run, as these can also restrict the number of pods that can be scheduled (a sketch follows below).
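For context, a namespace-level `ResourceQuota` such as the following (name, namespace, and numbers are placeholders) caps pod count and aggregate requests regardless of how far the node group could scale:

```yaml
# Hypothetical quota: at most 10 pods and 8 CPUs of requests in the
# namespace. Pods that exceed the quota are rejected at creation time
# rather than left Pending, so they never trigger a node group scale-up.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: run-pod-quota   # placeholder name
  namespace: dagster    # placeholder namespace
spec:
  hard:
    pods: "10"
    requests.cpu: "8"
    requests.memory: 16Gi
```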
If you continue to experience issues, you might want to consult the Kubernetes cluster logs and Dagster's logs to gain more insight into why additional jobs are not being scheduled and to verify whether the node group scaling is being triggered as expected.
m
My node group has autoscaling, but the pods don't get scheduled; they never show up as Pending, so they don't trigger the node group scale-up.