< Scout> if a dagster op fails to schedule to a pod due to n dagster #ask-ai


Ian Macleod

02/14/2024, 5:43 PM

@Scout if a dagster op fails to schedule to a pod due to nodeaffinity restrictions, can I set a number of retries? Currently, the job fails when this happens, but I'd like the op to continue retrying.

Scout

02/14/2024, 5:43 PM

In Dagster, you can configure retries for ops that fail for various reasons, including transient issues that might cause a failure to schedule to a pod. To set up retries for an op, you can attach a RetryPolicy to the op definition. This policy allows you to specify how many times to retry, how long to wait between each retry, and other parameters such as backoff strategy and jitter. Here's an example of how to attach a RetryPolicy to an op:

python
from dagster import op, RetryPolicy

@op(retry_policy=RetryPolicy(max_retries=3, delay=0.2))
def my_op():
    # Your op logic here
    pass

In this example, my_op will be retried up to 3 times with a delay of 0.2 seconds between each retry.

However, if the failure to schedule to a pod is due to node affinity restrictions, it might be a non-transient, non-retryable issue. Retries are meant for transient issues that are expected to resolve themselves after some time, such as temporary network outages or intermittent service availability. If the node affinity restrictions stem from cluster configuration or resource constraints that are not expected to change, simply retrying the op may not resolve the issue, and you may need to address the underlying cluster configuration or resource allocation so that the pod can be scheduled.

If you are using Kubernetes, you might also want to look into Kubernetes-specific settings that influence pod scheduling, such as tolerations and affinity rules, to ensure that your pods can be scheduled on the appropriate nodes. The documentation does not explicitly mention handling node affinity restrictions through retries, so if the issue is specific to Kubernetes scheduling rather than a transient error within the op's execution, you may need to explore Kubernetes-specific solutions or adjust your cluster's scheduling configuration.

Ian Macleod

02/14/2024, 9:30 PM

@Scout Can I put a retry policy on a job?

Scout

02/14/2024, 9:30 PM

Yes, you can configure a retry policy for a job in Dagster. While the RetryPolicy can be directly attached to individual ops, you can also apply a default retry policy to all ops within a job. This is done by setting the op_retry_policy argument on the @job decorator. Here's an example of how to apply a default retry policy to a job:


python
from dagster import job, op, RetryPolicy

@op
def my_op():
    # Your op logic here
    pass

@job(op_retry_policy=RetryPolicy(max_retries=3, delay=0.2))
def my_job():
    my_op()

In this example, my_job has a default retry policy that applies to all ops within the job, which means my_op will be retried up to 3 times with a delay of 0.2 seconds between each retry. Additionally, you can override the default retry policy for specific ops within the job by using the with_retry_policy method:


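python
# Reuses my_op and the imports from the previous example.
@job(op_retry_policy=RetryPolicy(max_retries=1))
def my_job():
    # The policy set here takes precedence over the job default for this op.
    my_op.with_retry_policy(RetryPolicy(max_retries=10))()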

In this example, my_job has a default retry policy of 1 retry, but my_op has been given a specific retry policy that allows up to 10 retries. Furthermore, you can also configure retries using tags, either on job definitions or in the Dagster UI Launchpad. For example:


python
from dagster import job

@job(tags={"dagster/max_retries": 3})
def sample_job():
    pass

This job has a tag that specifies a maximum of 3 retries. The dagster/retry_strategy tag can also be used to control which ops the retry will run, with options like FROM_FAILURE to retry from the failed op or ALL_STEPS to retry all ops. Note that these tags configure run retries, which relaunch the run as a whole and are separate from an op-level RetryPolicy; run retries typically need to be enabled on the Dagster instance before the tags take effect.

Remember that the retry policy is intended for transient errors. If a job fails due to more systemic issues, such as misconfiguration or persistent resource constraints, retries may not be effective, and the underlying issues should be addressed.
