dagster

Hello! I'm exploring best practices for retrying/re-executing failed ops and/or jobs. As I understand it, a rock-solid solution should include both:
• <https://docs.dagster.io/concepts/ops-jobs-graphs/op-retries|op retries within the same job run> and
• <https://docs.dagster.io/deployment/run-retries#run-retries|run retries for the runs that failed nevertheless>
Then, if we still don't succeed after that, we want to use the <https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors#run-failure-sensor|run failure sensor> to notify us and potentially start another <https://docs.dagster.io/guides/dagster/re-execution#re-execution-using-python-apis|re-execution via Python API>.
Does that sound sensible, or are there better approaches, considering that we can afford neither leaving any piece failed nor accidentally re-executing an op that has already been re-executed successfully?

Hi Arsenii, that sounds like a sensible approach to me. You can configure run retries to re-execute from the point of failure, which ensures that ops that already succeeded are not re-executed.

Ok, thank you! I will probably have more questions during the implementation :slightly_smiling_face: