Shane Keller
02/27/2024, 11:19 PM
Scout
02/27/2024, 11:19 PM
`RetryPolicy` to control the behavior of op retries in the event of failures. Here are some key points from the documentation regarding op execution and retries:

1. RetryPolicy: You can attach a `RetryPolicy` to an op to specify that it should retry upon encountering an exception. This policy can be configured with parameters such as `max_retries`, `delay`, `backoff`, and `jitter` to control the retry behavior.
2. Manual Retry with RetryRequested: In more nuanced situations, you can manually request a retry by raising a `RetryRequested` exception within the op's code.
3. Reexecution Options: When reexecuting a job, you can use `ReexecutionOptions` to specify which ops to reexecute. For example, you can reexecute from a failure, reexecute all ops, or reexecute selected ops.
4. Run-Level Retries: For scenarios where an op or asset fails due to a process crash (e.g., out of memory), run-level retries can be set up to retry the entire run if it fails for any reason. This is separate from the `RetryPolicy` on individual ops.
5. Run Monitoring: In Dagster Cloud, Run Monitoring is enabled by default and can help ensure that runs are retried even if the run worker crashes.

The guarantees around op execution are thus dependent on the configuration of retry policies and the use of reexecution options. Without any retry policies or reexecution options, an op is executed once per run. With retry policies, an op may be executed multiple times until it succeeds or the retry policy conditions are no longer met. With reexecution options, you can manually trigger additional executions of specific ops or the entire job.

For more detailed information on configuring retries and reexecution, you can refer to the documentation on [op retries](https://docs.dagster.io/concepts/ops-jobs-graphs/op-retries) and [re-execution](https://docs.dagster.io/guides/dagster/re-execution).