# ask-ai
Is there a k8s configuration that would produce this error?
```
Detected run worker status FAILED: 'Run has not completed but K8s job has no active pods'. Marking run 173d0710-e78b-4361-833f-a9272b8b85cb as failed, because it has surpassed the configured maximum attempts to resume the run: 0.
```
This error means that the Kubernetes job associated with a Dagster run failed because the job has no active pods, and the run has exceeded its configured maximum number of resume attempts (0 in this case). That can happen because of a Kubernetes or Dagster configuration that limits retries, or because of problems with the Kubernetes cluster itself. In the context of Dagster's Kubernetes integration, several configurations could lead to this error:

1. **Run retries configuration:** If the `dagster/max_retries` tag is set on a run, the run will be retried when it fails. If the maximum number of retries is 0, the run will not be retried, and you can see this error when the Kubernetes job fails. The documentation also mentions that you can configure a global number of retries in the Helm chart via `run_retries.max_retries`.
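As a minimal sketch, the equivalent instance-level setting in `dagster.yaml` might look like this (the `max_retries` value shown is an assumed example, not a recommendation):

```yaml
# dagster.yaml -- instance-level run retries
# (the Helm chart renders an equivalent block from its run retries values)
run_retries:
  enabled: true
  # Global retry count for failed runs; a run-level
  # dagster/max_retries tag overrides this value.
  max_retries: 2
```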
2. **Kubernetes job configuration:** A Kubernetes Job can be configured with a `backoffLimit`, which specifies how many times the Job is retried before it is considered failed. Once that limit is reached, no new pods are created and the Job is marked as failed, leaving the run with no active pods.
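For reference, here is a minimal Job manifest showing where `backoffLimit` lives; the name, image, and command are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job        # placeholder name
spec:
  backoffLimit: 4          # pod retries before the Job is marked failed
  template:
    spec:
      restartPolicy: Never # failed pods are replaced, up to backoffLimit
      containers:
        - name: main
          image: busybox   # placeholder image
          command: ["echo", "hello"]
```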
3. **Run monitoring:** Dagster's run monitoring detects hanging runs and moves them into a FAILURE state, or starts a retry if the run is configured to allow retries. If a run stays in a STARTING state for longer than a configured timeout (5 minutes by default), it is assumed to be hanging and is terminated. See the sketch after this list for the related settings.

4. **Kubernetes cluster issues:** Problems with the cluster itself, such as insufficient resources, node failures, or eviction policies, can cause pods to fail to start or to be terminated prematurely.
5. **Dagster configuration:** Specific Dagster settings, such as `dagsterDaemon.runMonitoring.maxResumeAttempts` in the Helm chart, control how runs are monitored and resumed. The error above reports a maximum of 0 resume attempts, so the run is failed outright as soon as the job is detected to have no active pods.
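A minimal sketch of the corresponding instance settings in `dagster.yaml` (the Helm chart's run monitoring values render to an equivalent block; the numbers are assumed examples):

```yaml
# dagster.yaml -- run monitoring for hanging or pod-less runs
run_monitoring:
  enabled: true
  # How long a run may sit in STARTING before it is considered hung
  start_timeout_seconds: 300
  # How many times the daemon tries to resume a run whose K8s job has
  # no active pods; 0 fails the run immediately, as in the error above
  max_resume_run_attempts: 3
```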
To troubleshoot this error, check the Kubernetes Job and pod status with `kubectl` to understand why there are no active pods, and review the Dagster run configuration and the Kubernetes Job specification to make sure the retry policies and resource requests are set appropriately. If you are still encountering the error, please provide additional context about your Dagster and Kubernetes configurations, and we can look into the documentation for more detailed assistance.
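A few commands that usually narrow this down; the namespace, pod name, and job name are placeholders (the K8sRunLauncher typically names jobs `dagster-run-<run-id>`):

```shell
# Job and pod status in the namespace where Dagster launches runs
kubectl get jobs,pods -n <namespace>

# Why does this run's Job have no active pods? (events, backoff, limits)
kubectl describe job dagster-run-<run-id> -n <namespace>

# Logs from the run pod, including a previously crashed container
kubectl logs <pod-name> -n <namespace> --previous

# Cluster-level clues: evictions, OOM kills, scheduling failures
kubectl get events -n <namespace> --sort-by=.lastTimestamp
```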