# ask-community
VxD:
Hi Dagster Team! I am currently trying to make my Dagster deployment more resilient to transient network issues. We use the `dagster_celery` executor and it has been working great for us, but when the connection to the broker is lost even for a few seconds, the whole run fails with a `DagsterExecutionInterruptedError` exception. I have been looking at Run Retries, which could definitely help here, but we already have logic to handle failures in a `failure_hook` that we consistently attach to all our ops. Obviously, we do not want the Run Retry logic to retry jobs where the failure hook has had a chance to run. Is there a way to mark a job from inside the hook so that Dagster will not retry it?
Alternatively, I have been looking into implementing my own `@run_failure_sensor` that would simply re-dispatch the same job, but I do not think the retry count is accessible from the `RunFailureSensorContext`, and I do not want to retry failed jobs in an infinite loop. How can I set up Dagster so that it retries only those jobs that failed without my `failure_hook` running?
Thanks in advance!
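For context, here is a minimal sketch of the kind of setup being described, assuming the standard `dagster_celery` executor and a job-level failure hook. The op, hook, and job names are illustrative, not from the actual deployment:

```python
from dagster import HookContext, failure_hook, job, op
from dagster_celery import celery_executor

# Illustrative failure hook; the real deployment's handling logic goes here.
@failure_hook
def my_failure_hook(context: HookContext):
    context.log.error(f"Op {context.op.name} failed in run {context.run_id}")
    # ... alerting / bookkeeping ...

@op
def my_op():
    ...

# hooks={...} attaches the hook to every op in the job;
# celery_executor dispatches op executions to Celery workers.
@job(executor_def=celery_executor, hooks={my_failure_hook})
def my_job():
    my_op()
```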
Oh, I just had a look at the `auto_run_reexecution` logic and how it keeps track of the retry count by using a tag on the run. Clever!
Unless there is a native solution, I guess I will implement my own run failure sensor and do something similar.
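A minimal sketch of that approach, assuming `run_failure_sensor` accepts a `request_job` argument (as `run_status_sensor` does) so the sensor can launch new runs; `my_job` refers to the job from the sketch above, and the retry-count tag name is made up for illustration:

```python
from dagster import RunFailureSensorContext, RunRequest, run_failure_sensor

MAX_RETRIES = 3
RETRY_COUNT_TAG = "my_team/retry_count"  # made-up tag name, not a built-in Dagster tag

# Assumes run_failure_sensor accepts request_job so the sensor may launch new
# runs of the failed job; my_job is the job from the sketch above.
@run_failure_sensor(request_job=my_job)
def retry_on_failure(context: RunFailureSensorContext):
    run = context.dagster_run
    retry_count = int(run.tags.get(RETRY_COUNT_TAG, "0"))
    if retry_count >= MAX_RETRIES:
        return  # bail out instead of retrying in an infinite loop
    yield RunRequest(
        run_key=f"{run.run_id}-retry-{retry_count + 1}",  # dedupes repeated evaluations
        run_config=run.run_config,
        # Carrying over the rest of the original run's tags is omitted for brevity.
        tags={RETRY_COUNT_TAG: str(retry_count + 1)},
    )
```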
Anyone please? 🥲
Claire:
Hi VxD. Yep, there currently isn't a way to customize run retries to not run in certain situations.
I think your run failure sensor solution sounds reasonable. Within your hook, you could add a tag via `context.instance.add_run_tags(...)`, and in the run failure sensor, check whether that tag exists.
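A minimal sketch of that suggestion, with a made-up marker tag; note the caveat about `add_run_tags` later in the thread:

```python
from dagster import (
    HookContext,
    RunFailureSensorContext,
    failure_hook,
    run_failure_sensor,
)

HOOK_RAN_TAG = "my_team/failure_hook_ran"  # made-up marker tag

@failure_hook
def my_failure_hook(context: HookContext):
    # ... existing failure-handling logic ...
    # Mark the run so the retry sensor knows this failure was already handled.
    # Caveat: add_run_tags is not a public API (see the note below).
    context.instance.add_run_tags(context.run_id, {HOOK_RAN_TAG: "true"})

@run_failure_sensor
def retry_unhandled_failures(context: RunFailureSensorContext):
    if context.dagster_run.tags.get(HOOK_RAN_TAG) == "true":
        return  # the hook ran, so do not retry this run
    # ... otherwise re-dispatch the run, e.g. as in the earlier sketch ...
```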
VxD:
OK, will do! Thanks for the help, Claire! Much appreciated. :dagster-angel:
Claire:
Though `add_run_tags` isn't a public API currently, which means it is meant for internal use and may change unexpectedly. You could file a feature request to petition for adding it to the public API.
VxD:
Sure! In the meantime it's OK; we have our own DB where we keep the status of jobs separately.