VxD
04/11/2023, 8:10 AM
We have been using the dagster_celery executor and it has been working great for us, but when the connection to the broker is lost, even for a few seconds, the whole run will fail with a DagsterExecutionInterruptedError exception.
I have been looking at Run Retries, which may well help here, but we already have logic to handle failures in a failure_hook that we consistently attach to all our ops.
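For reference, a minimal sketch of that kind of setup, using Dagster's @failure_hook decorator and the job-level hooks argument; the hook and op names here are hypothetical:

```python
from dagster import HookContext, failure_hook, job, op


@failure_hook
def handle_op_failure(context: HookContext):
    # Existing failure-handling logic (alerting, cleanup, ...) would live here.
    context.log.error(f"Op {context.op.name} failed in run {context.run_id}")


@op
def my_op():
    ...


# Attaching the hook at the job level applies it to every op in the job.
@job(hooks={handle_op_failure})
def my_job():
    my_op()
```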
Obviously, we do not want the Run Retry logic to retry jobs where the failure hook has already had the chance to run. Is there a way, from inside the hook, to mark a run so that Dagster will not retry it?
Alternatively, I have been looking into implementing my own @run_failure_sensor that would simply re-dispatch the same job, but I do not think the retry count is accessible from the RunFailureSensorContext, and I do not want to retry failed jobs in an infinite loop.
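For reference, a run failure sensor receives roughly the following (a sketch, not the author's actual sensor; the sensor name is made up):

```python
from dagster import RunFailureSensorContext, run_failure_sensor


@run_failure_sensor
def redispatch_failed_runs(context: RunFailureSensorContext):
    failed_run = context.dagster_run               # the run that failed
    error_message = context.failure_event.message  # the terminal failure event
    job_name = failed_run.job_name
    tags = failed_run.tags
    # The failed run's job name and tags are available here, but the context
    # carries no built-in retry counter, so a sensor that blindly re-submits
    # the same job risks an infinite retry loop.
```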
How can I set up Dagster so it will retry jobs that have failed in such a way that my failure_hook did not run, and only those?
Thanks in advance!

VxD
I dug into the auto_run_reexecution logic and how it keeps track of the retry count by using a tag on the job. Clever!
claire
04/11/2023, 11:30 PM
You could have the failure hook tag the run with context.instance.add_run_tags(...) and, in the run failure sensor, check whether that tag exists.
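A minimal sketch of claire's suggestion, assuming the hook context exposes the instance and that add_run_tags(run_id, tags) keeps its current non-public signature; the tag name and the re-dispatch step are placeholders:

```python
from dagster import (
    HookContext,
    RunFailureSensorContext,
    failure_hook,
    run_failure_sensor,
)

# Hypothetical tag name; any stable string works.
HOOK_RAN_TAG = "my_company/failure_hook_ran"


@failure_hook
def handle_op_failure(context: HookContext):
    # ... existing failure-handling logic ...
    # Mark the run so the retry sensor knows the hook already handled it.
    # Note: add_run_tags is not currently part of Dagster's public API.
    context.instance.add_run_tags(context.run_id, {HOOK_RAN_TAG: "true"})


@run_failure_sensor
def retry_interrupted_runs(context: RunFailureSensorContext):
    failed_run = context.dagster_run
    if failed_run.tags.get(HOOK_RAN_TAG) == "true":
        # An op actually failed and the hook handled it: do not retry.
        return
    # The run died before any hook could fire (e.g. broker connection lost);
    # re-dispatch it here, for example via the GraphQL client's
    # submit_job_execution (omitted for brevity).
```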
VxD
04/11/2023, 11:33 PM
claire
04/11/2023, 11:35 PM
add_run_tags isn't a public API currently, which means that it is meant for internal use and can unexpectedly change. You could file a feature request to petition to add it as part of the public API.
VxD
04/11/2023, 11:36 PM