Marie Hénaff
02/28/2024, 10:28 PM
Scout
02/28/2024, 10:28 PM
max_concurrent Configuration: In the Kubernetes executor configuration for Dagster, the max_concurrent setting limits the number of pods that will execute concurrently for one run. If it is set, it could limit the parallelism of your jobs. However, the documentation states that by default there is no limit, and execution parallelizes as far as the DAG allows.
3. Kubernetes Cluster Auto-Scaling: If your Kubernetes cluster is configured to auto-scale, it should automatically provision new nodes when existing ones do not have enough resources to schedule new pods. However, auto-scaling is not instantaneous and can take some time to detect the need for more resources and to provision new nodes.
4. Dagster's Queued Run Coordinator: If you are using the QueuedRunCoordinator, it manages the queue of runs to be executed. The configuration for this component, such as max_concurrent_runs, could also affect how many jobs run in parallel.
5. Backfill Submission: There is a known issue where submitting large backfills to the QueuedRunCoordinator can be slow, and jobs from earlier backfills must be enqueued before later backfills can be enqueued. This could reduce the parallelism of your jobs if multiple backfills are being submitted.
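For reference, here is a rough sketch of where the two concurrency settings mentioned above usually live, written as Python dicts. It assumes the k8s_job_executor from dagster-k8s (whose run config accepts max_concurrent under execution.config) and a QueuedRunCoordinator declared in dagster.yaml. The helper name and the numeric values are made up for illustration, and the module path shown for the coordinator may differ between Dagster versions.

```python
from typing import Any, Dict, Optional


def k8s_executor_run_config(max_concurrent: Optional[int] = None) -> Dict[str, Any]:
    """Hypothetical helper: build run config for a job that uses k8s_job_executor.

    Leaving max_concurrent unset keeps the default (no per-run limit on step pods).
    """
    executor_config: Dict[str, Any] = {}
    if max_concurrent is not None:
        # Per-run cap on concurrently executing step pods.
        executor_config["max_concurrent"] = max_concurrent
    return {"execution": {"config": executor_config}}


# Python mirror of the run_coordinator section of dagster.yaml.
# The value 25 is an example, not a recommendation.
RUN_COORDINATOR_CONFIG: Dict[str, Any] = {
    "run_coordinator": {
        "module": "dagster.core.run_coordinator",
        "class": "QueuedRunCoordinator",
        "config": {"max_concurrent_runs": 25},
    }
}

print(k8s_executor_run_config(max_concurrent=10))
```

Note the difference in scope: max_concurrent caps step pods within a single run, while max_concurrent_runs caps how many runs the coordinator dequeues at once across the instance.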
To troubleshoot this issue, you may want to:
- Check the max_concurrent setting in your Dagster Kubernetes executor configuration to ensure it's not limiting your parallelism.
- Review the configuration of your QueuedRunCoordinator if you're using one, particularly the max_concurrent_runs setting.
- Monitor the auto-scaling behavior of your Kubernetes cluster to see if it's scaling up as expected when resource requests exceed the current capacity.
- Look for any resource quotas or limits set on the namespace where your jobs are running, as these could also restrict the number of pods that can be scheduled.
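One way to reason about the checks above is that the number of step pods running at once is bounded by the tightest of these limits. The sketch below is plain arithmetic, not a Dagster or Kubernetes API: the function name, its parameters, and the example numbers are all hypothetical, and it only weighs CPU and memory requests against a namespace quota.

```python
from typing import Optional


def max_parallel_pods(
    quota_cpu_millicores: int,
    quota_memory_mib: int,
    pod_cpu_millicores: int,
    pod_memory_mib: int,
    max_concurrent: Optional[int] = None,
) -> int:
    """Upper bound on concurrently running step pods for one run."""
    # How many pods fit under each quota dimension, given per-pod requests.
    by_cpu = quota_cpu_millicores // pod_cpu_millicores
    by_memory = quota_memory_mib // pod_memory_mib
    bound = min(by_cpu, by_memory)
    # An executor-level max_concurrent, if set, can tighten the bound further.
    if max_concurrent is not None:
        bound = min(bound, max_concurrent)
    return bound


# Made-up example: a 16 CPU / 64 GiB quota, pods requesting 500m CPU and 4 GiB.
print(max_parallel_pods(16_000, 65_536, 500, 4_096))
```

With these example numbers memory is the bottleneck: 32 pods fit by CPU but only 16 by memory, so raising the CPU quota alone would not increase parallelism.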
If you continue to experience issues, consider consulting the Kubernetes cluster logs and Dagster's logs to gain more insight into why additional jobs are not being scheduled and to verify if the node group scaling
Scout
02/28/2024, 10:29 PM
Marie Hénaff
02/28/2024, 10:34 PM