# ask-ai
Is there a k8s configuration that would produce this error?
```
Detected run worker status FAILED: 'Run has not completed but K8s job has no active pods'. Marking run 173d0710-e78b-4361-833f-a9272b8b85cb as failed, because it has surpassed the configured maximum attempts to resume the run: 0.
```
This error means that the Kubernetes job associated with a Dagster run failed because the job has no active pods, and the run has exceeded its configured maximum number of resume attempts (0 in this case). That can happen because of a Kubernetes or Dagster configuration that limits retries, or because of problems with the Kubernetes cluster itself. In the context of Dagster's Kubernetes integration, several configurations could lead to this error:

1. **Run retries configuration:** If the `dagster/max_retries` tag is set on a run, the run will be retried when it fails. If the maximum number of retries is 0, the run will not be retried, and you can see this error when the Kubernetes job fails. The documentation also mentions that you can configure a global number of retries in the Helm chart via `run_retries.max_retries`.
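As a minimal sketch, the equivalent instance-level setting in `dagster.yaml` might look like this (the `max_retries` value shown is an assumed example, not a recommendation):

```yaml
# dagster.yaml -- instance-level run retries
# (the Helm chart renders an equivalent block from its run retries values)
run_retries:
  enabled: true
  # Global retry count for failed runs; a run-level
  # dagster/max_retries tag overrides this value.
  max_retries: 2
```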
2. **Kubernetes job configuration:** A Kubernetes Job can be configured with a `backoffLimit`, which specifies how many times the Job is retried before it is considered failed. Once that limit is reached, no new pods are created and the Job is marked as failed, leaving the run with no active pods.
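For reference, here is a minimal Job manifest showing where `backoffLimit` lives; the name, image, and command are placeholders:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-job        # placeholder name
spec:
  backoffLimit: 4          # pod retries before the Job is marked failed
  template:
    spec:
      restartPolicy: Never # failed pods are replaced, up to backoffLimit
      containers:
        - name: main
          image: busybox   # placeholder image
          command: ["echo", "hello"]
```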
3. **Run monitoring:** Dagster's run monitoring detects hanging runs and moves them into a FAILURE state, or starts a retry if the run is configured to allow retries. If a run stays in a STARTING state for longer than a configured timeout (5 minutes by default), it is assumed to be hanging and is terminated. See the sketch after this list for the related settings.

4. **Kubernetes cluster issues:** Problems with the cluster itself, such as insufficient resources, node failures, or eviction policies, can cause pods to fail to start or to be terminated prematurely.
5. **Dagster configuration:** Specific Dagster settings, such as `dagsterDaemon.runMonitoring.maxResumeAttempts` in the Helm chart, control how runs are monitored and resumed. The error above reports a maximum of 0 resume attempts, so the run is failed outright as soon as the job is detected to have no active pods.
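A minimal sketch of the corresponding instance settings in `dagster.yaml` (the Helm chart's run monitoring values render to an equivalent block; the numbers are assumed examples):

```yaml
# dagster.yaml -- run monitoring for hanging or pod-less runs
run_monitoring:
  enabled: true
  # How long a run may sit in STARTING before it is considered hung
  start_timeout_seconds: 300
  # How many times the daemon tries to resume a run whose K8s job has
  # no active pods; 0 fails the run immediately, as in the error above
  max_resume_run_attempts: 3
```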
To troubleshoot this error, check the Kubernetes Job and pod status with `kubectl` to understand why there are no active pods, and review the Dagster run configuration and the Kubernetes Job specification to make sure the retry policies and resource requests are set appropriately. If you are still encountering the error, please provide additional context about your Dagster and Kubernetes configurations, and we can look into the documentation for more detailed assistance.
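A few commands that usually narrow this down; the namespace, pod name, and job name are placeholders (the K8sRunLauncher typically names jobs `dagster-run-<run-id>`):

```shell
# Job and pod status in the namespace where Dagster launches runs
kubectl get jobs,pods -n <namespace>

# Why does this run's Job have no active pods? (events, backoff, limits)
kubectl describe job dagster-run-<run-id> -n <namespace>

# Logs from the run pod, including a previously crashed container
kubectl logs <pod-name> -n <namespace> --previous

# Cluster-level clues: evictions, OOM kills, scheduling failures
kubectl get events -n <namespace> --sort-by=.lastTimestamp
```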